{ "cells": [ { "cell_type": "markdown", "id": "22c47298", "metadata": {}, "source": [ "# Passage Reranking Multilingual BERT 🔃 🌍\n" ] }, { "cell_type": "markdown", "id": "0bb73e0f", "metadata": {}, "source": [ "## Model description\n", "**Input:** Supports over 100 Languages. See [List of supported languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) for all available.\n" ] }, { "cell_type": "markdown", "id": "fedf5cb8", "metadata": {}, "source": [ "**Purpose:** This module takes a search query [1] and a passage [2] and calculates if the passage matches the query.\n", "It can be used as an improvement for Elasticsearch Results and boosts the relevancy by up to 100%.\n" ] }, { "cell_type": "markdown", "id": "146e3be4", "metadata": {}, "source": [ "**Architecture:** On top of BERT there is a Densly Connected NN which takes the 768 Dimensional [CLS] Token as input and provides the output ([Arxiv](https://arxiv.org/abs/1901.04085)).\n" ] }, { "cell_type": "markdown", "id": "772c5c82", "metadata": {}, "source": [ "**Output:** Just a single value between between -10 and 10. Better matching query,passage pairs tend to have a higher a score.\n" ] }, { "cell_type": "markdown", "id": "e5974e46", "metadata": {}, "source": [ "## Intended uses & limitations\n", "Both query[1] and passage[2] have to fit in 512 Tokens.\n", "As you normally want to rerank the first dozens of search results keep in mind the inference time of approximately 300 ms/query.\n" ] }, { "cell_type": "markdown", "id": "7d878609", "metadata": {}, "source": [ "## How to use\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e0941f1f", "metadata": {}, "outputs": [], "source": [ "!pip install --upgrade paddlenlp" ] }, { "cell_type": "code", "execution_count": null, "id": "3bc201bf", "metadata": {}, "outputs": [], "source": [ "import paddle\n", "from paddlenlp.transformers import AutoModel\n", "\n", "model = AutoModel.from_pretrained(\"amberoad/bert-multilingual-passage-reranking-msmarco\")\n", "input_ids = paddle.randint(100, 200, shape=[1, 20])\n", "print(model(input_ids))" ] }, { "cell_type": "markdown", "id": "674ccc3a", "metadata": {}, "source": [ "## Training data\n" ] }, { "cell_type": "markdown", "id": "4404adda", "metadata": {}, "source": [ "This model is trained using the [**Microsoft MS Marco Dataset**](https://microsoft.github.io/msmarco/ \"Microsoft MS Marco\"). This training dataset contains approximately 400M tuples of a query, relevant and non-relevant passages. All datasets used for training and evaluating are listed in this [table](https://github.com/microsoft/MSMARCO-Passage-Ranking#data-information-and-formating). The used dataset for training is called *Train Triples Large*, while the evaluation was made on *Top 1000 Dev*. 
{ "cell_type": "markdown", "id": "674ccc3a", "metadata": {}, "source": [ "## Training data\n" ] }, { "cell_type": "markdown", "id": "4404adda", "metadata": {}, "source": [ "This model was trained on the [**Microsoft MS MARCO dataset**](https://microsoft.github.io/msmarco/ \"Microsoft MS Marco\"). This training dataset contains approximately 400M tuples of a query together with relevant and non-relevant passages. All datasets used for training and evaluation are listed in this [table](https://github.com/microsoft/MSMARCO-Passage-Ranking#data-information-and-formating). The dataset used for training is called *Train Triples Large*, while the evaluation was done on *Top 1000 Dev*. There are 6,900 queries in total in the development dataset, where each query is mapped to the top 1,000 passages retrieved from the MS MARCO corpus using BM25.\n" ] }, { "cell_type": "markdown", "id": "79af5e42", "metadata": {}, "source": [ "> The model introduction and model weights originate from [https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco](https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco) and were converted to PaddlePaddle format for ease of use in PaddleNLP.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 5 }