{ "cells": [ { "cell_type": "markdown", "id": "22c47298", "metadata": {}, "source": [ "# Passage Reranking Multilingual BERT 🔃 🌍\n" ] }, { "cell_type": "markdown", "id": "0bb73e0f", "metadata": {}, "source": [ "## Model description\n", "**Input:** Supports over 100 Languages. See [List of supported languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) for all available.\n" ] }, { "cell_type": "markdown", "id": "fedf5cb8", "metadata": {}, "source": [ "**Purpose:** This module takes a search query [1] and a passage [2] and calculates if the passage matches the query.\n", "It can be used as an improvement for Elasticsearch Results and boosts the relevancy by up to 100%.\n" ] }, { "cell_type": "markdown", "id": "146e3be4", "metadata": {}, "source": [ "**Architecture:** On top of BERT there is a Densly Connected NN which takes the 768 Dimensional [CLS] Token as input and provides the output ([Arxiv](https://arxiv.org/abs/1901.04085)).\n" ] }, { "cell_type": "markdown", "id": "772c5c82", "metadata": {}, "source": [ "**Output:** Just a single value between between -10 and 10. Better matching query,passage pairs tend to have a higher a score.\n" ] }, { "cell_type": "markdown", "id": "e5974e46", "metadata": {}, "source": [ "## Intended uses & limitations\n", "Both query[1] and passage[2] have to fit in 512 Tokens.\n", "As you normally want to rerank the first dozens of search results keep in mind the inference time of approximately 300 ms/query.\n" ] }, { "cell_type": "markdown", "id": "7d878609", "metadata": {}, "source": [ "## How to use\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e0941f1f", "metadata": {}, "outputs": [], "source": [ "!pip install --upgrade paddlenlp" ] }, { "cell_type": "code", "execution_count": null, "id": "3bc201bf", "metadata": {}, "outputs": [], "source": [ "import paddle\n", "from paddlenlp.transformers import AutoModel\n", "\n", "model = AutoModel.from_pretrained(\"amberoad/bert-multilingual-passage-reranking-msmarco\")\n", "input_ids = paddle.randint(100, 200, shape=[1, 20])\n", "print(model(input_ids))" ] }, { "cell_type": "markdown", "id": "674ccc3a", "metadata": {}, "source": [ "## Training data\n" ] }, { "cell_type": "markdown", "id": "4404adda", "metadata": {}, "source": [ "This model is trained using the [**Microsoft MS Marco Dataset**](https://microsoft.github.io/msmarco/ \"Microsoft MS Marco\"). This training dataset contains approximately 400M tuples of a query, relevant and non-relevant passages. All datasets used for training and evaluating are listed in this [table](https://github.com/microsoft/MSMARCO-Passage-Ranking#data-information-and-formating). The used dataset for training is called *Train Triples Large*, while the evaluation was made on *Top 1000 Dev*. 
{ "cell_type": "markdown", "id": "674ccc3a", "metadata": {}, "source": [ "## Training data\n" ] }, { "cell_type": "markdown", "id": "4404adda", "metadata": {}, "source": [ "This model was trained on the [**Microsoft MS MARCO dataset**](https://microsoft.github.io/msmarco/ \"Microsoft MS Marco\"). This training dataset contains approximately 400M tuples of a query together with relevant and non-relevant passages. All datasets used for training and evaluation are listed in this [table](https://github.com/microsoft/MSMARCO-Passage-Ranking#data-information-and-formating). The dataset used for training is called *Train Triples Large*, while the evaluation was done on *Top 1000 Dev*. There are 6,900 queries in total in the development dataset, where each query is mapped to the top 1,000 passages retrieved from the MS MARCO corpus using BM25.\n" ] }, { "cell_type": "markdown", "id": "79af5e42", "metadata": {}, "source": [ "> The model introduction and model weights originate from [https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco](https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco) and were converted to PaddlePaddle format for ease of use in PaddleNLP.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 5 }