{ "cells": [ { "cell_type": "markdown", "id": "83244d63", "metadata": {}, "source": [ "# Passage Reranking Multilingual BERT 🔃 🌍\n" ] }, { "cell_type": "markdown", "id": "4c8c922a", "metadata": {}, "source": [ "## Model description\n", "**Input:** Supports over 100 languages. See the [list of supported languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) for the full set.\n" ] }, { "cell_type": "markdown", "id": "8b40d5de", "metadata": {}, "source": [ "**Purpose:** This module takes a search query [1] and a passage [2] and calculates whether the passage matches the query.\n", "It can be used to improve Elasticsearch results, boosting relevancy by up to 100%.\n" ] }, { "cell_type": "markdown", "id": "c9d89366", "metadata": {}, "source": [ "**Architecture:** On top of BERT there is a densely connected neural network that takes the 768-dimensional [CLS] token as input and produces the output ([arXiv](https://arxiv.org/abs/1901.04085)).\n" ] }, { "cell_type": "markdown", "id": "29745195", "metadata": {}, "source": [ "**Output:** A single value between -10 and 10. 
Better-matching query-passage pairs tend to have a higher score.\n" ] }, { "cell_type": "markdown", "id": "010a4d92", "metadata": {}, "source": [ "## Intended uses & limitations\n", "The query [1] and the passage [2] together have to fit within 512 tokens.\n", "Because you typically rerank the first few dozen search results, keep in mind the inference time of approximately 300 ms/query.\n" ] }, { "cell_type": "markdown", "id": "a9f2dea7", "metadata": {}, "source": [ "## How to use\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8d023555", "metadata": {}, "outputs": [], "source": [ "!pip install --upgrade paddlenlp" ] }, { "cell_type": "code", "execution_count": null, "id": "4c83eef3", "metadata": {}, "outputs": [], "source": [ "import paddle\n", "from paddlenlp.transformers import AutoModel\n", "\n", "# Load the pretrained weights by name.\n", "model = AutoModel.from_pretrained(\"amberoad/bert-multilingual-passage-reranking-msmarco\")\n", "# Random token IDs are a placeholder input of shape [batch=1, seq_len=20].\n", "input_ids = paddle.randint(100, 200, shape=[1, 20])\n", "print(model(input_ids))" ] }, { "cell_type": "markdown", "id": "2611b122", "metadata": {}, "source": [ "## Training data\n" ] }, { "cell_type": "markdown", "id": "ba62fbe0", "metadata": {}, "source": [ "This model was trained on the [**Microsoft MS Marco Dataset**](https://microsoft.github.io/msmarco/ \"Microsoft MS Marco\"). The training dataset contains approximately 400M tuples, each consisting of a query, a relevant passage, and a non-relevant passage. All datasets used for training and evaluation are listed in this [table](https://github.com/microsoft/MSMARCO-Passage-Ranking#data-information-and-formating). The training dataset is called *Train Triples Large*, and evaluation was performed on *Top 1000 Dev*. 
The development dataset contains 6,900 queries in total, each mapped to the top 1,000 passages retrieved with BM25 from the MS MARCO corpus.\n" ] }, { "cell_type": "markdown", "id": "afc188f2", "metadata": {}, "source": [ "> This model description and the weights come from [https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco](https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco) and have been converted to the PaddlePaddle model format.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 5 }