introduction_cn.ipynb 3.9 KB
Notebook
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "83244d63",
   "metadata": {},
   "source": [
    "# Passage Reranking Multilingual BERT 🔃 🌍\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c8c922a",
   "metadata": {},
   "source": [
    "## Model description\n",
    "**Input:** Supports over 100 Languages. See [List of supported languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) for all available.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b40d5de",
   "metadata": {},
   "source": [
    "**Purpose:** This module takes a search query [1] and a passage [2] and calculates if the passage matches the query.\n",
    "It can be used as an improvement for Elasticsearch Results and boosts the relevancy by up to 100%.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9d89366",
   "metadata": {},
   "source": [
    "**Architecture:** On top of BERT there is a Densly Connected NN which takes the 768 Dimensional [CLS] Token as input and provides the output ([Arxiv](https://arxiv.org/abs/1901.04085)).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29745195",
   "metadata": {},
   "source": [
    "**Output:** Just a single value between between -10 and 10. Better matching query,passage pairs tend to have a higher a score.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "010a4d92",
   "metadata": {},
   "source": [
    "## Intended uses & limitations\n",
    "Both query[1] and passage[2] have to fit in 512 Tokens.\n",
    "As you normally want to rerank the first dozens of search results keep in mind the inference time of approximately 300 ms/query.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9f2dea7",
   "metadata": {},
   "source": [
    "## How to use\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d023555",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --upgrade paddlenlp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c83eef3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import paddle\n",
    "from paddlenlp.transformers import AutoModel\n",
    "\n",
    "model = AutoModel.from_pretrained(\"amberoad/bert-multilingual-passage-reranking-msmarco\")\n",
    "input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
    "print(model(input_ids))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2611b122",
   "metadata": {},
   "source": [
    "## Training data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba62fbe0",
   "metadata": {},
   "source": [
    "This model is trained using the [**Microsoft MS Marco Dataset**](https://microsoft.github.io/msmarco/ \"Microsoft MS Marco\"). This training dataset contains approximately 400M tuples of a query, relevant and non-relevant passages. All datasets used for training and evaluating are listed in this [table](https://github.com/microsoft/MSMARCO-Passage-Ranking#data-information-and-formating). The used dataset for training is called *Train Triples Large*, while the evaluation was made on *Top 1000 Dev*. There are 6,900 queries in total in the development dataset, where each query is mapped to top 1,000 passage retrieved using BM25 from MS MARCO corpus.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "afc188f2",
   "metadata": {},
   "source": [
    "> 此模型介绍及权重来源于[https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco](https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco),并转换为飞桨模型格式。\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}