Unverified commit 1352e3d3, authored by 骑马小猫, committed by GitHub

add paddlenlp community models (#5660)

* update project

* update icon and keyword
Parent 747a474a
# Model List
## CLTL/MedRoBERTa.nl
| Model Name | Description | Model Size | Download |
| --- | --- | --- | --- |
|CLTL/MedRoBERTa.nl| | 633.14MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/CLTL/MedRoBERTa.nl/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/CLTL/MedRoBERTa.nl/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/CLTL/MedRoBERTa.nl/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/CLTL/MedRoBERTa.nl/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/CLTL/MedRoBERTa.nl/vocab.json)<br>[vocab.txt](https://bj.bcebos.com/paddlenlp/models/community/CLTL/MedRoBERTa.nl/vocab.txt) |
You can also download the corresponding model weights with the `paddlenlp` CLI tool, as follows:
* Install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* Download via the command line
```shell
paddlenlp download --cache-dir ./pretrained_models CLTL/MedRoBERTa.nl
```
If you have any download problems, you can open an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP).
# Model List
## CLTL/MedRoBERTa.nl
| model | description | model_size | download |
| --- | --- | --- | --- |
|CLTL/MedRoBERTa.nl| | 633.14MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/CLTL/MedRoBERTa.nl/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/CLTL/MedRoBERTa.nl/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/CLTL/MedRoBERTa.nl/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/CLTL/MedRoBERTa.nl/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/CLTL/MedRoBERTa.nl/vocab.json)<br>[vocab.txt](https://bj.bcebos.com/paddlenlp/models/community/CLTL/MedRoBERTa.nl/vocab.txt) |
Alternatively, you can download all of the model files with the following steps:
* Install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* Download the model with the CLI tool
```shell
paddlenlp download --cache-dir ./pretrained_models CLTL/MedRoBERTa.nl
```
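* Load the model in Python (a minimal sketch, assuming PaddleNLP's generic `AutoModel`/`AutoTokenizer` classes accept this checkpoint; the local path under `--cache-dir` is illustrative)
```python
import paddle
from paddlenlp.transformers import AutoModel, AutoTokenizer

# Load by community model name (weights are fetched automatically) ...
tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModel.from_pretrained("CLTL/MedRoBERTa.nl")
# ... or point at the directory produced by `paddlenlp download --cache-dir ./pretrained_models`
# (the exact sub-path is an assumption; check where the files actually landed):
# model = AutoModel.from_pretrained("./pretrained_models/CLTL/MedRoBERTa.nl")

input_ids = paddle.to_tensor([tokenizer("Dit is een proefzin.")["input_ids"]])
print(model(input_ids))
```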
If you have any problems, you can open an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "CLTL/MedRoBERTa.nl"
description: "MedRoBERTa.nl"
description_en: "MedRoBERTa.nl"
icon: ""
from_repo: "https://huggingface.co/CLTL/MedRoBERTa.nl"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Fill-Mask"
sub_tag: "槽位填充"
Example:
Datasets: ""
Publisher: "CLTL"
License: "mit"
Language: "Dutch"
Paper:
IfTraining: 0
IfOnlineDemo: 0
Datasets: conll2003
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: English
License: mit
Model_Info:
  description: 'roberta-large-ner-english: model fine-tuned from roberta-large for
    NER task'
  description_en: 'roberta-large-ner-english: model fine-tuned from roberta-large
    for NER task'
  from_repo: https://huggingface.co/Jean-Baptiste/roberta-large-ner-english
  icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
  name: Jean-Baptiste/roberta-large-ner-english
Paper: null
Publisher: Jean-Baptiste
Task:
- sub_tag: Token分类
  sub_tag_en: Token Classification
  tag: 自然语言处理
  tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "743f4950",
"metadata": {},
"source": [
"# roberta-large-ner-english: model fine-tuned from roberta-large for NER task\n"
]
},
{
"cell_type": "markdown",
"id": "0d517a6d",
"metadata": {},
"source": [
"## Introduction\n"
]
},
{
"cell_type": "markdown",
"id": "bbb5e934",
"metadata": {},
"source": [
"roberta-large-ner-english is an english NER model that was fine-tuned from roberta-large on conll2003 dataset.\n",
"Model was validated on emails/chat data and outperformed other models on this type of data specifically.\n",
"In particular the model seems to work better on entity that don't start with an upper case.\n"
]
},
{
"cell_type": "markdown",
"id": "a13117c3",
"metadata": {},
"source": [
"## How to use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9e58955",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db077413",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"Jean-Baptiste/roberta-large-ner-english\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "86ae5e96",
"metadata": {},
"source": [
"For those who could be interested, here is a short article on how I used the results of this model to train a LSTM model for signature detection in emails:\n",
"https://medium.com/@jean-baptiste.polle/lstm-model-for-email-signature-detection-8e990384fefa\n",
"\n",
"\n",
"> 此模型介绍及权重来源于[https://huggingface.co/Jean-Baptiste/roberta-large-ner-english](https://huggingface.co/Jean-Baptiste/roberta-large-ner-english),并转换为飞桨模型格式。\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "b0541e6a",
"metadata": {},
"source": [
"# roberta-large-ner-english: model fine-tuned from roberta-large for NER task\n"
]
},
{
"cell_type": "markdown",
"id": "c85540d7",
"metadata": {},
"source": [
"## Introduction\n"
]
},
{
"cell_type": "markdown",
"id": "c2e2ebde",
"metadata": {},
"source": [
"roberta-large-ner-english is an english NER model that was fine-tuned from roberta-large on conll2003 dataset.\n",
"Model was validated on emails/chat data and outperformed other models on this type of data specifically.\n",
"In particular the model seems to work better on entity that don't start with an upper case.\n"
]
},
{
"cell_type": "markdown",
"id": "4f6d5dbe",
"metadata": {},
"source": [
"## How to use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a159cf92",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "daa60299",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"Jean-Baptiste/roberta-large-ner-english\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
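{
 "cell_type": "markdown",
 "id": "7c3e91aa",
 "metadata": {},
 "source": [
  "The cell above only runs the bare encoder on random ids. The next cell is a minimal, hedged sketch of token-level NER scoring; it assumes this checkpoint also loads through PaddleNLP's generic `AutoTokenizer` and `AutoModelForTokenClassification` classes, which the original model card does not prescribe.\n"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "1f2d3c4b",
 "metadata": {},
 "outputs": [],
 "source": [
  "import paddle\n",
  "from paddlenlp.transformers import AutoTokenizer, AutoModelForTokenClassification\n",
  "\n",
  "# Assumption: the token-classification head is reachable through the Auto class.\n",
  "tokenizer = AutoTokenizer.from_pretrained(\"Jean-Baptiste/roberta-large-ner-english\")\n",
  "ner_model = AutoModelForTokenClassification.from_pretrained(\"Jean-Baptiste/roberta-large-ner-english\")\n",
  "\n",
  "text = \"Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne\"\n",
  "encoded = tokenizer(text)\n",
  "input_ids = paddle.to_tensor([encoded[\"input_ids\"]])\n",
  "\n",
  "logits = ner_model(input_ids)  # shape: [1, seq_len, num_labels]\n",
  "print(paddle.argmax(logits, axis=-1))  # per-token label ids"
 ]
},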
{
"cell_type": "markdown",
"id": "2a66154e",
"metadata": {},
"source": [
"For those who could be interested, here is a short article on how I used the results of this model to train a LSTM model for signature detection in emails:\n",
"https://medium.com/@jean-baptiste.polle/lstm-model-for-email-signature-detection-8e990384fefa\n",
"\n",
"> The model introduction and model weights originate from https://huggingface.co/Jean-Baptiste/roberta-large-ner-english and were converted to PaddlePaddle format for ease of use in PaddleNLP."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Datasets: ''
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: Chinese
License: apache-2.0
Model_Info:
  description: Mengzi-BERT base fin model (Chinese)
  description_en: Mengzi-BERT base fin model (Chinese)
  from_repo: https://huggingface.co/Langboat/mengzi-bert-base-fin
  icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
  name: Langboat/mengzi-bert-base-fin
Paper:
- title: 'Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese'
  url: http://arxiv.org/abs/2110.06696v2
Publisher: Langboat
Task:
- sub_tag: 槽位填充
  sub_tag_en: Fill-Mask
  tag: 自然语言处理
  tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "18d5c43e",
"metadata": {},
"source": [
"# Mengzi-BERT base fin model (Chinese)\n",
"Continue trained mengzi-bert-base with 20G financial news and research reports. Masked language modeling(MLM), part-of-speech(POS) tagging and sentence order prediction(SOP) are used as training task.\n"
]
},
{
"cell_type": "markdown",
"id": "9aa78f76",
"metadata": {},
"source": [
"[Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese](https://arxiv.org/abs/2110.06696)\n"
]
},
{
"cell_type": "markdown",
"id": "12bbac99",
"metadata": {},
"source": [
"## Usage\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b18fe48",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1bb0e345",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"Langboat/mengzi-bert-base-fin\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "a8d785f4",
"metadata": {},
"source": [
"```\n",
"@misc{zhang2021mengzi,\n",
"title={Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese},\n",
"author={Zhuosheng Zhang and Hanqing Zhang and Keming Chen and Yuhang Guo and Jingyun Hua and Yulong Wang and Ming Zhou},\n",
"year={2021},\n",
"eprint={2110.06696},\n",
"archivePrefix={arXiv},\n",
"primaryClass={cs.CL}\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "ceb1547c",
"metadata": {},
"source": [
"> 此模型介绍及权重来源于[https://huggingface.co/Langboat/mengzi-bert-base-fin](https://huggingface.co/Langboat/mengzi-bert-base-fin),并转换为飞桨模型格式。\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "752656a4",
"metadata": {},
"source": [
"# Mengzi-BERT base fin model (Chinese)\n",
"Continue trained mengzi-bert-base with 20G financial news and research reports. Masked language modeling(MLM), part-of-speech(POS) tagging and sentence order prediction(SOP) are used as training task.\n"
]
},
{
"cell_type": "markdown",
"id": "26c65092",
"metadata": {},
"source": [
"[Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese](https://arxiv.org/abs/2110.06696)\n"
]
},
{
"cell_type": "markdown",
"id": "ea5404c7",
"metadata": {},
"source": [
"## Usage\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ebeb5daa",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2c66056",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"Langboat/mengzi-bert-base-fin\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
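{
 "cell_type": "markdown",
 "id": "4a7d21c9",
 "metadata": {},
 "source": [
  "As a small, hedged variation on the cell above, the sketch below feeds real tokenized text instead of random ids; the generic `AutoTokenizer` class is an assumption here, not something the original card mandates.\n"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "9be4c5d0",
 "metadata": {},
 "outputs": [],
 "source": [
  "import paddle\n",
  "from paddlenlp.transformers import AutoModel, AutoTokenizer\n",
  "\n",
  "tokenizer = AutoTokenizer.from_pretrained(\"Langboat/mengzi-bert-base-fin\")\n",
  "model = AutoModel.from_pretrained(\"Langboat/mengzi-bert-base-fin\")\n",
  "\n",
  "# A short piece of financial-domain Chinese text, matching the pretraining domain.\n",
  "encoded = tokenizer(\"央行今日开展了100亿元逆回购操作。\")\n",
  "input_ids = paddle.to_tensor([encoded[\"input_ids\"]])\n",
  "print(model(input_ids))"
 ]
},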
{
"cell_type": "markdown",
"id": "a39809dc",
"metadata": {},
"source": [
"```\n",
"@misc{zhang2021mengzi,\n",
"title={Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese},\n",
"author={Zhuosheng Zhang and Hanqing Zhang and Keming Chen and Yuhang Guo and Jingyun Hua and Yulong Wang and Ming Zhou},\n",
"year={2021},\n",
"eprint={2110.06696},\n",
"archivePrefix={arXiv},\n",
"primaryClass={cs.CL}\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "f25bda96",
"metadata": {},
"source": [
"> The model introduction and model weights originate from https://huggingface.co/Langboat/mengzi-bert-base-fin and were converted to PaddlePaddle format for ease of use in PaddleNLP."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
# Model List
## PlanTL-GOB-ES/roberta-base-biomedical-clinical-es
| Model Name | Description | Model Size | Download |
| --- | --- | --- | --- |
|PlanTL-GOB-ES/roberta-base-biomedical-clinical-es| | 633.14MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es/vocab.json)<br>[vocab.txt](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es/vocab.txt) |
You can also download the corresponding model weights with the `paddlenlp` CLI tool, as follows:
* Install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* Download via the command line
```shell
paddlenlp download --cache-dir ./pretrained_models PlanTL-GOB-ES/roberta-base-biomedical-clinical-es
```
If you have any download problems, you can open an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP).
# Model List
## PlanTL-GOB-ES/roberta-base-biomedical-clinical-es
| model | description | model_size | download |
| --- | --- | --- | --- |
|PlanTL-GOB-ES/roberta-base-biomedical-clinical-es| | 633.14MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es/vocab.json)<br>[vocab.txt](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es/vocab.txt) |
Alternatively, you can download all of the model files with the following steps:
* Install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* Download the model with the CLI tool
```shell
paddlenlp download --cache-dir ./pretrained_models PlanTL-GOB-ES/roberta-base-biomedical-clinical-es
```
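* Load the model in Python (a minimal sketch, assuming PaddleNLP's generic `AutoModel`/`AutoTokenizer` classes accept this checkpoint; the local path under `--cache-dir` is illustrative)
```python
import paddle
from paddlenlp.transformers import AutoModel, AutoTokenizer

# Load by community model name (weights are fetched automatically) ...
tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-clinical-es")
model = AutoModel.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-clinical-es")
# ... or point at the directory produced by `paddlenlp download --cache-dir ./pretrained_models`
# (the exact sub-path is an assumption; check where the files actually landed):
# model = AutoModel.from_pretrained("./pretrained_models/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es")

input_ids = paddle.to_tensor([tokenizer("El paciente presenta fiebre y dolor abdominal.")["input_ids"]])
print(model(input_ids))
```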
If you have any problems, you can open an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"
description: "Biomedical-clinical language model for Spanish"
description_en: "Biomedical-clinical language model for Spanish"
icon: ""
from_repo: "https://huggingface.co/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Fill-Mask"
sub_tag: "槽位填充"
Example:
Datasets: ""
Publisher: "PlanTL-GOB-ES"
License: "apache-2.0"
Language: "Spanish"
Paper:
- title: 'Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario'
url: 'http://arxiv.org/abs/2109.03570v2'
- title: 'Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models'
url: 'http://arxiv.org/abs/2109.07765v1'
IfTraining: 0
IfOnlineDemo: 0
# Model List
## PlanTL-GOB-ES/roberta-base-biomedical-es
| Model Name | Description | Model Size | Download |
| --- | --- | --- | --- |
|PlanTL-GOB-ES/roberta-base-biomedical-es| | 633.14MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-es/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-es/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-es/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-es/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-es/vocab.json) |
You can also download the corresponding model weights with the `paddlenlp` CLI tool, as follows:
* Install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* Download via the command line
```shell
paddlenlp download --cache-dir ./pretrained_models PlanTL-GOB-ES/roberta-base-biomedical-es
```
If you have any download problems, you can open an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP).
# Model List
## PlanTL-GOB-ES/roberta-base-biomedical-es
| model | description | model_size | download |
| --- | --- | --- | --- |
|PlanTL-GOB-ES/roberta-base-biomedical-es| | 633.14MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-es/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-es/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-es/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-es/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-biomedical-es/vocab.json) |
Alternatively, you can download all of the model files with the following steps:
* Install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* Download the model with the CLI tool
```shell
paddlenlp download --cache-dir ./pretrained_models PlanTL-GOB-ES/roberta-base-biomedical-es
```
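* Load the model in Python (a minimal sketch, assuming PaddleNLP's generic `AutoModel`/`AutoTokenizer` classes accept this checkpoint; the local path under `--cache-dir` is illustrative)
```python
import paddle
from paddlenlp.transformers import AutoModel, AutoTokenizer

# Load by community model name (weights are fetched automatically) ...
tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-es")
model = AutoModel.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-es")
# ... or point at the directory produced by `paddlenlp download --cache-dir ./pretrained_models`
# (the exact sub-path is an assumption; check where the files actually landed):
# model = AutoModel.from_pretrained("./pretrained_models/PlanTL-GOB-ES/roberta-base-biomedical-es")

input_ids = paddle.to_tensor([tokenizer("La proteína se expresa en el tejido hepático.")["input_ids"]])
print(model(input_ids))
```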
If you have any problems, you can open an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "PlanTL-GOB-ES/roberta-base-biomedical-es"
description: "Biomedical language model for Spanish"
description_en: "Biomedical language model for Spanish"
icon: ""
from_repo: "https://huggingface.co/PlanTL-GOB-ES/roberta-base-biomedical-es"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Fill-Mask"
sub_tag: "槽位填充"
Example:
Datasets: ""
Publisher: "PlanTL-GOB-ES"
License: "apache-2.0"
Language: "Spanish"
Paper:
- title: 'Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario'
url: 'http://arxiv.org/abs/2109.03570v2'
- title: 'Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models'
url: 'http://arxiv.org/abs/2109.07765v1'
IfTraining: 0
IfOnlineDemo: 0
# Model List
## PlanTL-GOB-ES/roberta-base-ca
| Model Name | Description | Model Size | Download |
| --- | --- | --- | --- |
|PlanTL-GOB-ES/roberta-base-ca| | 633.14MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-ca/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-ca/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-ca/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-ca/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-ca/vocab.json) |
You can also download the corresponding model weights with the `paddlenlp` CLI tool, as follows:
* Install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* Download via the command line
```shell
paddlenlp download --cache-dir ./pretrained_models PlanTL-GOB-ES/roberta-base-ca
```
If you have any download problems, you can open an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP).
# Model List
## PlanTL-GOB-ES/roberta-base-ca
| model | description | model_size | download |
| --- | --- | --- | --- |
|PlanTL-GOB-ES/roberta-base-ca| | 633.14MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-ca/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-ca/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-ca/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-ca/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/PlanTL-GOB-ES/roberta-base-ca/vocab.json) |
Alternatively, you can download all of the model files with the following steps:
* Install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* Download the model with the CLI tool
```shell
paddlenlp download --cache-dir ./pretrained_models PlanTL-GOB-ES/roberta-base-ca
```
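* Load the model in Python (a minimal sketch, assuming PaddleNLP's generic `AutoModel`/`AutoTokenizer` classes accept this checkpoint; the local path under `--cache-dir` is illustrative)
```python
import paddle
from paddlenlp.transformers import AutoModel, AutoTokenizer

# Load by community model name (weights are fetched automatically) ...
tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-ca")
model = AutoModel.from_pretrained("PlanTL-GOB-ES/roberta-base-ca")
# ... or point at the directory produced by `paddlenlp download --cache-dir ./pretrained_models`
# (the exact sub-path is an assumption; check where the files actually landed):
# model = AutoModel.from_pretrained("./pretrained_models/PlanTL-GOB-ES/roberta-base-ca")

input_ids = paddle.to_tensor([tokenizer("Barcelona és la capital de Catalunya.")["input_ids"]])
print(model(input_ids))
```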
If you have any problems, you can open an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "PlanTL-GOB-ES/roberta-base-ca"
description: "BERTa: RoBERTa-based Catalan language model"
description_en: "BERTa: RoBERTa-based Catalan language model"
icon: ""
from_repo: "https://huggingface.co/PlanTL-GOB-ES/roberta-base-ca"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Fill-Mask"
sub_tag: "槽位填充"
Example:
Datasets: ""
Publisher: "PlanTL-GOB-ES"
License: "apache-2.0"
Language: "Catalan"
Paper:
IfTraining: 0
IfOnlineDemo: 0
Datasets: xnli
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: Spanish
License: mit
Model_Info:
  description: bert-base-spanish-wwm-cased-xnli
  description_en: bert-base-spanish-wwm-cased-xnli
  from_repo: https://huggingface.co/Recognai/bert-base-spanish-wwm-cased-xnli
  icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
  name: Recognai/bert-base-spanish-wwm-cased-xnli
Paper: null
Publisher: Recognai
Task:
- sub_tag: 零样本分类
  sub_tag_en: Zero-Shot Classification
  tag: 自然语言处理
  tag_en: Natural Language Processing
- sub_tag: 文本分类
  sub_tag_en: Text Classification
  tag: 自然语言处理
  tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "0b1e9532",
"metadata": {},
"source": [
"# bert-base-spanish-wwm-cased-xnli\n"
]
},
{
"cell_type": "markdown",
"id": "2b09a9af",
"metadata": {},
"source": [
"## Model description\n"
]
},
{
"cell_type": "markdown",
"id": "e348457b",
"metadata": {},
"source": [
"This model is a fine-tuned version of the [spanish BERT model](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) with the Spanish portion of the XNLI dataset. \n"
]
},
{
"cell_type": "markdown",
"id": "6643a3b7",
"metadata": {},
"source": [
"### How to use\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8475d429",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ced3e559",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"Recognai/bert-base-spanish-wwm-cased-xnli\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "47419faf",
"metadata": {},
"source": [
"## Eval results\n"
]
},
{
"cell_type": "markdown",
"id": "9b87e64b",
"metadata": {},
"source": [
"Accuracy for the test set:\n"
]
},
{
"cell_type": "markdown",
"id": "7be74f6f",
"metadata": {},
"source": [
"| | XNLI-es |\n",
"|-----------------------------|---------|\n",
"|bert-base-spanish-wwm-cased-xnli | 79.9% |\n",
"> 此模型介绍及权重来源于[https://huggingface.co/Recognai/bert-base-spanish-wwm-cased-xnli](https://huggingface.co/Recognai/bert-base-spanish-wwm-cased-xnli),并转换为飞桨模型格式。\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "7a8a1587",
"metadata": {},
"source": [
"# bert-base-spanish-wwm-cased-xnli\n"
]
},
{
"cell_type": "markdown",
"id": "210c8e3a",
"metadata": {},
"source": [
"## Model description\n"
]
},
{
"cell_type": "markdown",
"id": "fe16ef03",
"metadata": {},
"source": [
"This model is a fine-tuned version of the spanish BERT model with the Spanish portion of the XNLI dataset.\n"
]
},
{
"cell_type": "markdown",
"id": "b23d27b0",
"metadata": {},
"source": [
"## How to use\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "37e5b840",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "117b1e15",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"Recognai/bert-base-spanish-wwm-cased-xnli\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
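{
 "cell_type": "markdown",
 "id": "8d4a2f6e",
 "metadata": {},
 "source": [
  "The cell above only exercises the bare encoder. Below is a minimal, hedged sketch of NLI-style scoring of a premise/hypothesis pair (the building block of zero-shot classification); it assumes the XNLI head is reachable through PaddleNLP's generic `AutoModelForSequenceClassification`, which the original card does not prescribe.\n"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "6b9e0d17",
 "metadata": {},
 "outputs": [],
 "source": [
  "import paddle\n",
  "import paddle.nn.functional as F\n",
  "from paddlenlp.transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
  "\n",
  "tokenizer = AutoTokenizer.from_pretrained(\"Recognai/bert-base-spanish-wwm-cased-xnli\")\n",
  "nli_model = AutoModelForSequenceClassification.from_pretrained(\"Recognai/bert-base-spanish-wwm-cased-xnli\")\n",
  "\n",
  "premise = \"Algún día iré a ver el mundo\"\n",
  "hypothesis = \"Este ejemplo es sobre viajes\"\n",
  "encoded = tokenizer(premise, hypothesis)\n",
  "input_ids = paddle.to_tensor([encoded[\"input_ids\"]])\n",
  "token_type_ids = paddle.to_tensor([encoded[\"token_type_ids\"]])\n",
  "\n",
  "logits = nli_model(input_ids, token_type_ids=token_type_ids)\n",
  "print(F.softmax(logits, axis=-1))  # probabilities over the XNLI label set"
 ]
},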
{
"cell_type": "markdown",
"id": "65669489",
"metadata": {},
"source": [
"## Eval results\n",
"\n",
"Accuracy for the test set:\n",
"\n",
"| | XNLI-es |\n",
"|-----------------------------|---------|\n",
"|bert-base-spanish-wwm-cased-xnli | 79.9% |\n",
"\n",
"> The model introduction and model weights originate from https://huggingface.co/Recognai/bert-base-spanish-wwm-cased-xnli and were converted to PaddlePaddle format for ease of use in PaddleNLP."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
# Model List
## allenai/macaw-3b
| Model Name | Description | Model Size | Download |
| --- | --- | --- | --- |
|allenai/macaw-3b| | 10.99G | [model_config.json](https://bj.bcebos.com/paddlenlp/models/community/allenai/macaw-3b/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/allenai/macaw-3b/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/allenai/macaw-3b/tokenizer_config.json) |
You can also download the corresponding model weights with the `paddlenlp` CLI tool, as follows:
* Install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* Download via the command line
```shell
paddlenlp download --cache-dir ./pretrained_models allenai/macaw-3b
```
If you have any download problems, you can open an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP).
# Model List
## allenai/macaw-3b
| model | description | model_size | download |
| --- | --- | --- | --- |
|allenai/macaw-3b| | 10.99G | [model_config.json](https://bj.bcebos.com/paddlenlp/models/community/allenai/macaw-3b/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/allenai/macaw-3b/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/allenai/macaw-3b/tokenizer_config.json) |
Alternatively, you can download all of the model files with the following steps:
* Install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* Download the model with the CLI tool
```shell
paddlenlp download --cache-dir ./pretrained_models allenai/macaw-3b
```
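* Load the model in Python (a minimal sketch, assuming PaddleNLP's generic `AutoModel` class accepts this checkpoint; the local path under `--cache-dir` is illustrative)
```python
from paddlenlp.transformers import AutoModel

# Load by community model name (weights are fetched automatically) ...
model = AutoModel.from_pretrained("allenai/macaw-3b")
# ... or from the directory produced by `paddlenlp download --cache-dir ./pretrained_models`
# (the exact sub-path is an assumption; check where the files actually landed):
# model = AutoModel.from_pretrained("./pretrained_models/allenai/macaw-3b")
print(type(model))
```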
If you have any problems, you can open an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "allenai/macaw-3b"
description: "macaw-3b"
description_en: "macaw-3b"
icon: ""
from_repo: "https://huggingface.co/allenai/macaw-3b"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Text2Text Generation"
sub_tag: "文本生成"
Example:
Datasets: ""
Publisher: "allenai"
License: "apache-2.0"
Language: "English"
Paper:
IfTraining: 0
IfOnlineDemo: 0
Datasets: ''
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: English
License: apache-2.0
Model_Info:
  description: macaw-large
  description_en: macaw-large
  from_repo: https://huggingface.co/allenai/macaw-large
  icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
  name: allenai/macaw-large
Paper: null
Publisher: allenai
Task:
- sub_tag: 文本生成
  sub_tag_en: Text2Text Generation
  tag: 自然语言处理
  tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "d50965ae",
"metadata": {},
"source": [
"# macaw-large\n",
"\n",
"## Model description\n",
"\n",
"Macaw (<b>M</b>ulti-<b>a</b>ngle <b>c</b>(q)uestion <b>a</b>ns<b>w</b>ering) is a ready-to-use model capable of\n",
"general question answering,\n",
"showing robustness outside the domains it was trained on. It has been trained in \"multi-angle\" fashion,\n",
"which means it can handle a flexible set of input and output \"slots\"\n",
"(question, answer, multiple-choice options, context, and explanation) .\n",
"\n",
"Macaw was built on top of [T5](https://github.com/google-research/text-to-text-transfer-transformer) and comes in\n",
"three sizes: macaw-11b, macaw-3b,\n",
"and macaw-large, as well as an answer-focused version featured on\n",
"various leaderboards macaw-answer-11b.\n",
"\n",
"See https://github.com/allenai/macaw for more details."
]
},
{
"cell_type": "markdown",
"id": "1c0bce56",
"metadata": {},
"source": [
"## How to Use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb7a2c88",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b0fd69ae",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"allenai/macaw-large\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "955d0705",
"metadata": {},
"source": [
"## Reference\n",
"\n",
"> 此模型介绍及权重来源于[https://huggingface.co/allenai/macaw-large](https://huggingface.co/allenai/macaw-large),并转换为飞桨模型格式。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "f5a296e3",
"metadata": {},
"source": [
"# macaw-large\n",
"\n",
"## Model description\n",
"\n",
"Macaw (<b>M</b>ulti-<b>a</b>ngle <b>c</b>(q)uestion <b>a</b>ns<b>w</b>ering) is a ready-to-use model capable of\n",
"general question answering,\n",
"showing robustness outside the domains it was trained on. It has been trained in \"multi-angle\" fashion,\n",
"which means it can handle a flexible set of input and output \"slots\"\n",
"(question, answer, multiple-choice options, context, and explanation) .\n",
"\n",
"Macaw was built on top of [T5](https://github.com/google-research/text-to-text-transfer-transformer) and comes in\n",
"three sizes: macaw-11b, macaw-3b,\n",
"and macaw-large, as well as an answer-focused version featured on\n",
"various leaderboards macaw-answer-11b.\n",
"\n",
"See https://github.com/allenai/macaw for more details."
]
},
{
"cell_type": "markdown",
"id": "27cf8ebc",
"metadata": {},
"source": [
"## How to Use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "027c735c",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f52c07a",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"allenai/macaw-large\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
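{
 "cell_type": "markdown",
 "id": "5e7fb30c",
 "metadata": {},
 "source": [
  "The cell above only runs random ids through the model. The next cell is a hedged sketch of Macaw's slot-based question answering; it assumes this checkpoint works with PaddleNLP's `T5ForConditionalGeneration` and its `generate()` API, neither of which the original card prescribes.\n"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "0c1de482",
 "metadata": {},
 "outputs": [],
 "source": [
  "import paddle\n",
  "from paddlenlp.transformers import AutoTokenizer, T5ForConditionalGeneration\n",
  "\n",
  "tokenizer = AutoTokenizer.from_pretrained(\"allenai/macaw-large\")\n",
  "qa_model = T5ForConditionalGeneration.from_pretrained(\"allenai/macaw-large\")\n",
  "\n",
  "# Macaw uses \"slot\"-formatted inputs: here we ask for an answer given a question.\n",
  "prompt = \"$answer$ ; $question$ = What is the color of a cloudy sky?\"\n",
  "encoded = tokenizer(prompt)\n",
  "input_ids = paddle.to_tensor([encoded[\"input_ids\"]])\n",
  "\n",
  "# In PaddleNLP, generate() typically returns a (token_ids, scores) pair.\n",
  "outputs = qa_model.generate(input_ids=input_ids, max_length=32)\n",
  "print(outputs)"
 ]
},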
{
"cell_type": "markdown",
"id": "ce759903",
"metadata": {},
"source": [
"## Reference\n",
"\n",
"> The model introduction and model weights originate from [https://huggingface.co/allenai/macaw-large](https://huggingface.co/allenai/macaw-large) and were converted to PaddlePaddle format for ease of use in PaddleNLP."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Datasets: ''
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: English
License: apache-2.0
Model_Info:
  description: SPECTER
  description_en: SPECTER
  from_repo: https://huggingface.co/allenai/specter
  icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
  name: allenai/specter
Paper:
- title: 'SPECTER: Document-level Representation Learning using Citation-informed
    Transformers'
  url: http://arxiv.org/abs/2004.07180v4
Publisher: allenai
Task:
- sub_tag: 特征抽取
  sub_tag_en: Feature Extraction
  tag: 自然语言处理
  tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "a5b54f39",
"metadata": {},
"source": [
"## SPECTER\n",
"\n",
"SPECTER is a pre-trained language model to generate document-level embedding of documents. It is pre-trained on a a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning.\n",
"\n",
"Paper: [SPECTER: Document-level Representation Learning using Citation-informed Transformers](https://arxiv.org/pdf/2004.07180.pdf)\n",
"\n",
"Original Repo: [Github](https://github.com/allenai/specter)\n",
"\n",
"Evaluation Benchmark: [SciDocs](https://github.com/allenai/scidocs)\n",
"\n",
"Authors: *Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, Daniel S. Weld*"
]
},
{
"cell_type": "markdown",
"id": "e279b43d",
"metadata": {},
"source": [
"## How to Use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3dcf4e0b",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7348a84e",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"allenai/specter\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "89c70552",
"metadata": {},
"source": [
"## Reference\n",
"\n",
"> 此模型介绍及权重来源于[https://huggingface.co/allenai/specter](https://huggingface.co/allenai/specter),并转换为飞桨模型格式。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "a09f5723",
"metadata": {},
"source": [
"## SPECTER\n",
"\n",
"SPECTER is a pre-trained language model to generate document-level embedding of documents. It is pre-trained on a a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning.\n",
"\n",
"Paper: [SPECTER: Document-level Representation Learning using Citation-informed Transformers](https://arxiv.org/pdf/2004.07180.pdf)\n",
"\n",
"Original Repo: [Github](https://github.com/allenai/specter)\n",
"\n",
"Evaluation Benchmark: [SciDocs](https://github.com/allenai/scidocs)\n",
"\n",
"Authors: *Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, Daniel S. Weld*"
]
},
{
"cell_type": "markdown",
"id": "b62bbb59",
"metadata": {},
"source": [
"## How to Use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2dff923a",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e60739cc",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"allenai/specter\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
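{
 "cell_type": "markdown",
 "id": "2ab94c61",
 "metadata": {},
 "source": [
  "Below is a minimal, hedged sketch of the intended use: embedding a paper from its title and abstract and taking the [CLS] vector as the document embedding. Concatenating with the tokenizer's separator token and unpacking BERT-style `(sequence_output, pooled_output)` outputs are assumptions about how this checkpoint behaves in PaddleNLP, not something the original card spells out.\n"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "9d07e5b3",
 "metadata": {},
 "outputs": [],
 "source": [
  "import paddle\n",
  "from paddlenlp.transformers import AutoModel, AutoTokenizer\n",
  "\n",
  "tokenizer = AutoTokenizer.from_pretrained(\"allenai/specter\")\n",
  "model = AutoModel.from_pretrained(\"allenai/specter\")\n",
  "\n",
  "title = \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\"\n",
  "abstract = \"We introduce a new language representation model called BERT.\"\n",
  "\n",
  "# SPECTER concatenates title and abstract with the separator token.\n",
  "encoded = tokenizer(title + tokenizer.sep_token + abstract)\n",
  "input_ids = paddle.to_tensor([encoded[\"input_ids\"]])\n",
  "\n",
  "sequence_output, pooled_output = model(input_ids)  # BERT-style outputs (assumption)\n",
  "doc_embedding = sequence_output[:, 0, :]  # [CLS] vector as the document embedding\n",
  "print(doc_embedding.shape)"
 ]
},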
{
"cell_type": "markdown",
"id": "cd668864",
"metadata": {},
"source": [
"## Reference\n",
"\n",
"> The model introduction and model weights originate from [https://huggingface.co/allenai/specter](https://huggingface.co/allenai/specter) and were converted to PaddlePaddle format for ease of use in PaddleNLP."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Datasets: ''
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: English
License: apache-2.0
Model_Info:
  description: ''
  description_en: ''
  from_repo: https://huggingface.co/alvaroalon2/biobert_chemical_ner
  icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
  name: alvaroalon2/biobert_chemical_ner
Paper: null
Publisher: alvaroalon2
Task:
- sub_tag: Token分类
  sub_tag_en: Token Classification
  tag: 自然语言处理
  tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "0b8f2339",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"BioBERT model fine-tuned in NER task with BC5CDR-chemicals and BC4CHEMD corpus.\n",
"\n",
"This was fine-tuned in order to use it in a BioNER/BioNEN system which is available at: https://github.com/librairy/bio-ner"
]
},
{
"cell_type": "markdown",
"id": "934c3f34",
"metadata": {},
"source": [
"## How to Use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a8516341",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70114f31",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"alvaroalon2/biobert_chemical_ner\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "fb7b2eb8",
"metadata": {},
"source": [
"## Reference\n",
"\n",
"> 此模型介绍及权重来源于:[https://huggingface.co/alvaroalon2/biobert_chemical_ner](https://huggingface.co/alvaroalon2/biobert_chemical_ner),并转换为飞桨模型格式。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "f769316b",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"BioBERT model fine-tuned in NER task with BC5CDR-chemicals and BC4CHEMD corpus.\n",
"\n",
"This was fine-tuned in order to use it in a BioNER/BioNEN system which is available at: https://github.com/librairy/bio-ner"
]
},
{
"cell_type": "markdown",
"id": "3a77ed26",
"metadata": {},
"source": [
"## How to Use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "202a3ef9",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc11d032",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"alvaroalon2/biobert_chemical_ner\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "762dee96",
"metadata": {},
"source": [
"## Reference\n",
"\n",
"> The model introduction and model weights originate from [https://huggingface.co/alvaroalon2/biobert_chemical_ner](https://huggingface.co/alvaroalon2/biobert_chemical_ner) and were converted to PaddlePaddle format for ease of use in PaddleNLP."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Datasets: ncbi_disease
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: English
License: apache-2.0
Model_Info:
  description: ''
  description_en: ''
  from_repo: https://huggingface.co/alvaroalon2/biobert_diseases_ner
  icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
  name: alvaroalon2/biobert_diseases_ner
Paper: null
Publisher: alvaroalon2
Task:
- sub_tag: Token分类
  sub_tag_en: Token Classification
  tag: 自然语言处理
  tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "578bdb21",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"BioBERT model fine-tuned in NER task with BC5CDR-diseases and NCBI-diseases corpus\n",
"\n",
"This was fine-tuned in order to use it in a BioNER/BioNEN system which is available at: https://github.com/librairy/bio-ner"
]
},
{
"cell_type": "markdown",
"id": "d18b8736",
"metadata": {},
"source": [
"## How to Use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b304ea9",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49b790e5",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"alvaroalon2/biobert_diseases_ner\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "ab48464f",
"metadata": {},
"source": [
"## Reference\n",
"\n",
"> 此模型介绍及权重来源于[https://huggingface.co/alvaroalon2/biobert_diseases_ner](https://huggingface.co/alvaroalon2/biobert_diseases_ner),并转换为飞桨模型格式。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "98591560",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"BioBERT model fine-tuned in NER task with BC5CDR-diseases and NCBI-diseases corpus\n",
"\n",
"This was fine-tuned in order to use it in a BioNER/BioNEN system which is available at: https://github.com/librairy/bio-ner"
]
},
{
"cell_type": "markdown",
"id": "da577da0",
"metadata": {},
"source": [
"## How to Use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0ee7d4df",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6dfd3c0",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"alvaroalon2/biobert_diseases_ner\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "7a58f3ef",
"metadata": {},
"source": [
"## Reference\n",
"\n",
"> The model introduction and model weights originate from [https://huggingface.co/alvaroalon2/biobert_diseases_ner](https://huggingface.co/alvaroalon2/biobert_diseases_ner) and were converted to PaddlePaddle format for ease of use in PaddleNLP."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Datasets: ''
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: English
License: apache-2.0
Model_Info:
  description: ''
  description_en: ''
  from_repo: https://huggingface.co/alvaroalon2/biobert_genetic_ner
  icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
  name: alvaroalon2/biobert_genetic_ner
Paper: null
Publisher: alvaroalon2
Task:
- sub_tag: Token分类
  sub_tag_en: Token Classification
  tag: 自然语言处理
  tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "795618b9",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"BioBERT model fine-tuned in NER task with JNLPBA and BC2GM corpus for genetic class entities.\n",
"\n",
"This was fine-tuned in order to use it in a BioNER/BioNEN system which is available at: https://github.com/librairy/bio-ner"
]
},
{
"cell_type": "markdown",
"id": "bf1bde1a",
"metadata": {},
"source": [
"## How to Use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "90bf4208",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3f9ddc9",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"alvaroalon2/biobert_genetic_ner\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "45bef570",
"metadata": {},
"source": [
"## Reference\n",
"\n",
"> 此模型介绍及权重来源于[https://huggingface.co/alvaroalon2/biobert_genetic_ner](https://huggingface.co/alvaroalon2/biobert_genetic_ner),并转换为飞桨模型格式。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "eeb5731b",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"BioBERT model fine-tuned in NER task with JNLPBA and BC2GM corpus for genetic class entities.\n",
"\n",
"This was fine-tuned in order to use it in a BioNER/BioNEN system which is available at: https://github.com/librairy/bio-ner"
]
},
{
"cell_type": "markdown",
"id": "3501c0f5",
"metadata": {},
"source": [
"## How to Use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da1caa55",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a8a173da",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"alvaroalon2/biobert_genetic_ner\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "0c74ebfe",
"metadata": {},
"source": [
"## Reference\n",
"\n",
"> The model introduction and model weights originate from [https://huggingface.co/alvaroalon2/biobert_genetic_ner](https://huggingface.co/alvaroalon2/biobert_genetic_ner) and were converted to PaddlePaddle format for ease of use in PaddleNLP."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Datasets: ''
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: ''
License: apache-2.0
Model_Info:
  description: Passage Reranking Multilingual BERT 🔃 🌍
  description_en: Passage Reranking Multilingual BERT 🔃 🌍
  from_repo: https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco
  icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
  name: amberoad/bert-multilingual-passage-reranking-msmarco
Paper:
- title: Passage Re-ranking with BERT
  url: http://arxiv.org/abs/1901.04085v5
Publisher: amberoad
Task:
- sub_tag: 文本分类
  sub_tag_en: Text Classification
  tag: 自然语言处理
  tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "83244d63",
"metadata": {},
"source": [
"# Passage Reranking Multilingual BERT 🔃 🌍\n"
]
},
{
"cell_type": "markdown",
"id": "4c8c922a",
"metadata": {},
"source": [
"## Model description\n",
"**Input:** Supports over 100 Languages. See [List of supported languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) for all available.\n"
]
},
{
"cell_type": "markdown",
"id": "8b40d5de",
"metadata": {},
"source": [
"**Purpose:** This module takes a search query [1] and a passage [2] and calculates if the passage matches the query.\n",
"It can be used as an improvement for Elasticsearch Results and boosts the relevancy by up to 100%.\n"
]
},
{
"cell_type": "markdown",
"id": "c9d89366",
"metadata": {},
"source": [
"**Architecture:** On top of BERT there is a Densly Connected NN which takes the 768 Dimensional [CLS] Token as input and provides the output ([Arxiv](https://arxiv.org/abs/1901.04085)).\n"
]
},
{
"cell_type": "markdown",
"id": "29745195",
"metadata": {},
"source": [
"**Output:** Just a single value between between -10 and 10. Better matching query,passage pairs tend to have a higher a score.\n"
]
},
{
"cell_type": "markdown",
"id": "010a4d92",
"metadata": {},
"source": [
"## Intended uses & limitations\n",
"Both query[1] and passage[2] have to fit in 512 Tokens.\n",
"As you normally want to rerank the first dozens of search results keep in mind the inference time of approximately 300 ms/query.\n"
]
},
{
"cell_type": "markdown",
"id": "a9f2dea7",
"metadata": {},
"source": [
"## How to use\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d023555",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c83eef3",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"amberoad/bert-multilingual-passage-reranking-msmarco\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "2611b122",
"metadata": {},
"source": [
"## Training data\n"
]
},
{
"cell_type": "markdown",
"id": "ba62fbe0",
"metadata": {},
"source": [
"This model is trained using the [**Microsoft MS Marco Dataset**](https://microsoft.github.io/msmarco/ \"Microsoft MS Marco\"). This training dataset contains approximately 400M tuples of a query, relevant and non-relevant passages. All datasets used for training and evaluating are listed in this [table](https://github.com/microsoft/MSMARCO-Passage-Ranking#data-information-and-formating). The used dataset for training is called *Train Triples Large*, while the evaluation was made on *Top 1000 Dev*. There are 6,900 queries in total in the development dataset, where each query is mapped to top 1,000 passage retrieved using BM25 from MS MARCO corpus.\n"
]
},
{
"cell_type": "markdown",
"id": "afc188f2",
"metadata": {},
"source": [
"> 此模型介绍及权重来源于[https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco](https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco),并转换为飞桨模型格式。\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "22c47298",
"metadata": {},
"source": [
"# Passage Reranking Multilingual BERT 🔃 🌍\n"
]
},
{
"cell_type": "markdown",
"id": "0bb73e0f",
"metadata": {},
"source": [
"## Model description\n",
"**Input:** Supports over 100 Languages. See [List of supported languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) for all available.\n"
]
},
{
"cell_type": "markdown",
"id": "fedf5cb8",
"metadata": {},
"source": [
"**Purpose:** This module takes a search query [1] and a passage [2] and calculates if the passage matches the query.\n",
"It can be used as an improvement for Elasticsearch Results and boosts the relevancy by up to 100%.\n"
]
},
{
"cell_type": "markdown",
"id": "146e3be4",
"metadata": {},
"source": [
"**Architecture:** On top of BERT there is a Densly Connected NN which takes the 768 Dimensional [CLS] Token as input and provides the output ([Arxiv](https://arxiv.org/abs/1901.04085)).\n"
]
},
{
"cell_type": "markdown",
"id": "772c5c82",
"metadata": {},
"source": [
"**Output:** Just a single value between between -10 and 10. Better matching query,passage pairs tend to have a higher a score.\n"
]
},
{
"cell_type": "markdown",
"id": "e5974e46",
"metadata": {},
"source": [
"## Intended uses & limitations\n",
"Both query[1] and passage[2] have to fit in 512 Tokens.\n",
"As you normally want to rerank the first dozens of search results keep in mind the inference time of approximately 300 ms/query.\n"
]
},
{
"cell_type": "markdown",
"id": "7d878609",
"metadata": {},
"source": [
"## How to use\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0941f1f",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3bc201bf",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"amberoad/bert-multilingual-passage-reranking-msmarco\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
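{
"cell_type": "markdown",
"id": "rerank-pair-demo-md",
"metadata": {},
"source": [
"The generic `AutoModel` call above only returns hidden states. As a rough, hypothetical sketch of the actual reranking use case (not part of the original card), the cell below assumes that `AutoTokenizer` and `AutoModelForSequenceClassification` can load this community checkpoint together with its pair-classification head, and it uses a made-up query and passage. If the classification head is not included in the converted weights, stick to the `AutoModel` usage above.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "rerank-pair-demo-code",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
"\n",
"# Hedged sketch: score one query-passage pair (assumes the pair-classification head is in the converted weights).\n",
"model_name = \"amberoad/bert-multilingual-passage-reranking-msmarco\"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"model = AutoModelForSequenceClassification.from_pretrained(model_name)\n",
"model.eval()\n",
"\n",
"query = \"How many people live in Berlin?\"  # made-up example inputs\n",
"passage = \"Berlin has a population of roughly 3.6 million registered inhabitants.\"\n",
"\n",
"# Encode query and passage as a single pair; the [CLS] vector feeds the classification head.\n",
"encoded = tokenizer(query, text_pair=passage)\n",
"input_ids = paddle.to_tensor([encoded[\"input_ids\"]])\n",
"token_type_ids = paddle.to_tensor([encoded[\"token_type_ids\"]])\n",
"\n",
"with paddle.no_grad():\n",
"    logits = model(input_ids, token_type_ids=token_type_ids)\n",
"print(logits)  # a higher relevance logit means a better query-passage match"
]
},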
{
"cell_type": "markdown",
"id": "674ccc3a",
"metadata": {},
"source": [
"## Training data\n"
]
},
{
"cell_type": "markdown",
"id": "4404adda",
"metadata": {},
"source": [
"This model is trained using the [**Microsoft MS Marco Dataset**](https://microsoft.github.io/msmarco/ \"Microsoft MS Marco\"). This training dataset contains approximately 400M tuples of a query, relevant and non-relevant passages. All datasets used for training and evaluating are listed in this [table](https://github.com/microsoft/MSMARCO-Passage-Ranking#data-information-and-formating). The used dataset for training is called *Train Triples Large*, while the evaluation was made on *Top 1000 Dev*. There are 6,900 queries in total in the development dataset, where each query is mapped to top 1,000 passage retrieved using BM25 from MS MARCO corpus.\n"
]
},
{
"cell_type": "markdown",
"id": "79af5e42",
"metadata": {},
"source": [
"> The model introduction and model weights originate from [https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco](https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco) and were converted to PaddlePaddle format for ease of use in PaddleNLP.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
# 模型列表
## asi/gpt-fr-cased-base
| 模型名称 | 模型介绍 | 模型大小 | 模型下载 |
| --- | --- | --- | --- |
|asi/gpt-fr-cased-base| | 4.12G | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-base/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-base/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-base/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-base/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-base/vocab.json) |
也可以通过`paddlenlp` cli 工具来下载对应的模型权重,使用步骤如下所示:
* 安装paddlenlp
```shell
pip install --upgrade paddlenlp
```
* 下载命令行
```shell
paddlenlp download --cache-dir ./pretrained_models asi/gpt-fr-cased-base
```
有任何下载的问题都可以到[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)中发Issue提问。
\ No newline at end of file
# model list
## asi/gpt-fr-cased-base
| model | description | model_size | download |
| --- | --- | --- | --- |
|asi/gpt-fr-cased-base| | 4.12G | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-base/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-base/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-base/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-base/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-base/vocab.json) |
Or you can download all of the model files with the following steps:
* install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* download model with cli tool
```shell
paddlenlp download --cache-dir ./pretrained_models asi/gpt-fr-cased-base
```
If you have any problems with it, you can post an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "asi/gpt-fr-cased-base"
description: "Model description"
description_en: "Model description"
icon: ""
from_repo: "https://huggingface.co/asi/gpt-fr-cased-base"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Text Generation"
sub_tag: "文本生成"
Example:
Datasets: ""
Publisher: "asi"
License: "apache-2.0"
Language: "French"
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
# 模型列表
## asi/gpt-fr-cased-small
| 模型名称 | 模型介绍 | 模型大小 | 模型下载 |
| --- | --- | --- | --- |
|asi/gpt-fr-cased-small| | 620.45MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-small/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-small/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-small/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-small/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-small/vocab.json) |
也可以通过`paddlenlp` cli 工具来下载对应的模型权重,使用步骤如下所示:
* 安装paddlenlp
```shell
pip install --upgrade paddlenlp
```
* 下载命令行
```shell
paddlenlp download --cache-dir ./pretrained_models asi/gpt-fr-cased-small
```
有任何下载的问题都可以到[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)中发Issue提问。
\ No newline at end of file
# model list
## asi/gpt-fr-cased-small
| model | description | model_size | download |
| --- | --- | --- | --- |
|asi/gpt-fr-cased-small| | 620.45MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-small/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-small/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-small/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-small/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/asi/gpt-fr-cased-small/vocab.json) |
Or you can download all of the model files with the following steps:
* install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* download model with cli tool
```shell
paddlenlp download --cache-dir ./pretrained_models asi/gpt-fr-cased-small
```
If you have any problems with it, you can post an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "asi/gpt-fr-cased-small"
description: "Model description"
description_en: "Model description"
icon: ""
from_repo: "https://huggingface.co/asi/gpt-fr-cased-small"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Text Generation"
sub_tag: "文本生成"
Example:
Datasets: ""
Publisher: "asi"
License: "apache-2.0"
Language: "French"
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
Datasets: ''
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: German
License: mit
Model_Info:
name: "benjamin/gerpt2-large"
description: "GerPT2"
description_en: "GerPT2"
icon: ""
from_repo: "https://huggingface.co/benjamin/gerpt2-large"
description: GerPT2
description_en: GerPT2
from_repo: https://huggingface.co/benjamin/gerpt2-large
icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
name: benjamin/gerpt2-large
Paper: null
Publisher: benjamin
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Text Generation"
sub_tag: "文本生成"
Example:
Datasets: ""
Publisher: "benjamin"
License: "mit"
Language: "German"
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
- sub_tag: 文本生成
sub_tag_en: Text Generation
tag: 自然语言处理
tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "e42aa4df",
"metadata": {},
"source": [
"# GerPT2\n"
]
},
{
"cell_type": "markdown",
"id": "08fd6403",
"metadata": {},
"source": [
"See the GPT2 model card for considerations on limitations and bias. See the GPT2 documentation for details on GPT2.\n"
]
},
{
"cell_type": "markdown",
"id": "8295e28d",
"metadata": {},
"source": [
"## Comparison to dbmdz/german-gpt2\n"
]
},
{
"cell_type": "markdown",
"id": "c0f50f67",
"metadata": {},
"source": [
"I evaluated both GerPT2-large and the other German GPT2, dbmdz/german-gpt2 on the [CC-100](http://data.statmt.org/cc-100/) dataset and on the German Wikipedia:\n"
]
},
{
"cell_type": "markdown",
"id": "6ecdc149",
"metadata": {},
"source": [
"| | CC-100 (PPL) | Wikipedia (PPL) |\n",
"|-------------------|--------------|-----------------|\n",
"| dbmdz/german-gpt2 | 49.47 | 62.92 |\n",
"| GerPT2 | 24.78 | 35.33 |\n",
"| GerPT2-large | __16.08__ | __23.26__ |\n",
"| | | |\n"
]
},
{
"cell_type": "markdown",
"id": "3cddd6a8",
"metadata": {},
"source": [
"See the script `evaluate.py` in the [GerPT2 Github repository](https://github.com/bminixhofer/gerpt2) for the code.\n"
]
},
{
"cell_type": "markdown",
"id": "d838da15",
"metadata": {},
"source": [
"## Usage\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "476bf523",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f509fec",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"benjamin/gerpt2-large\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "d135a538",
"metadata": {},
"source": [
"```\n",
"@misc{Minixhofer_GerPT2_German_large_2020,\n",
"author = {Minixhofer, Benjamin},\n",
"doi = {10.5281/zenodo.5509984},\n",
"month = {12},\n",
"title = {{GerPT2: German large and small versions of GPT2}},\n",
"url = {https://github.com/bminixhofer/gerpt2},\n",
"year = {2020}\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "63e09ad7",
"metadata": {},
"source": [
"## Acknowledgements\n"
]
},
{
"cell_type": "markdown",
"id": "d9dc51e1",
"metadata": {},
"source": [
"Thanks to [Hugging Face](https://huggingface.co) for awesome tools and infrastructure.\n",
"Huge thanks to [Artus Krohn-Grimberghe](https://twitter.com/artuskg) at [LYTiQ](https://www.lytiq.de/) for making this possible by sponsoring the resources used for training.\n",
"\n",
"> 此模型介绍及权重来源于[https://huggingface.co/benjamin/gerpt2-large](https://huggingface.co/benjamin/gerpt2-large),并转换为飞桨模型格式。\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "85c2e1a7",
"metadata": {},
"source": [
"# GerPT2\n"
]
},
{
"cell_type": "markdown",
"id": "595fe7cb",
"metadata": {},
"source": [
"See the GPT2 model card for considerations on limitations and bias. See the GPT2 documentation for details on GPT2.\n"
]
},
{
"cell_type": "markdown",
"id": "5b4f950b",
"metadata": {},
"source": [
"## Comparison to dbmdz/german-gpt2\n"
]
},
{
"cell_type": "markdown",
"id": "95be6eb8",
"metadata": {},
"source": [
"I evaluated both GerPT2-large and the other German GPT2, dbmdz/german-gpt2 on the [CC-100](http://data.statmt.org/cc-100/) dataset and on the German Wikipedia:\n"
]
},
{
"cell_type": "markdown",
"id": "8acd14be",
"metadata": {},
"source": [
"| | CC-100 (PPL) | Wikipedia (PPL) |\n",
"|-------------------|--------------|-----------------|\n",
"| dbmdz/german-gpt2 | 49.47 | 62.92 |\n",
"| GerPT2 | 24.78 | 35.33 |\n",
"| GerPT2-large | __16.08__ | __23.26__ |\n",
"| | | |\n"
]
},
{
"cell_type": "markdown",
"id": "6fa10d79",
"metadata": {},
"source": [
"See the script `evaluate.py` in the [GerPT2 Github repository](https://github.com/bminixhofer/gerpt2) for the code.\n"
]
},
{
"cell_type": "markdown",
"id": "a8514e1e",
"metadata": {},
"source": [
"## Usage\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4bc62c63",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63f78302",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"benjamin/gerpt2-large\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "563152f3",
"metadata": {},
"source": [
"```\n",
"@misc{Minixhofer_GerPT2_German_large_2020,\n",
"author = {Minixhofer, Benjamin},\n",
"doi = {10.5281/zenodo.5509984},\n",
"month = {12},\n",
"title = {{GerPT2: German large and small versions of GPT2}},\n",
"url = {https://github.com/bminixhofer/gerpt2},\n",
"year = {2020}\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "b0d67d21",
"metadata": {},
"source": [
"## Acknowledgements\n"
]
},
{
"cell_type": "markdown",
"id": "474c1c61",
"metadata": {},
"source": [
"Thanks to [Hugging Face](https://huggingface.co) for awesome tools and infrastructure.\n",
"Huge thanks to [Artus Krohn-Grimberghe](https://twitter.com/artuskg) at [LYTiQ](https://www.lytiq.de/) for making this possible by sponsoring the resources used for training.\n",
"\n",
"> The model introduction and model weights originate from [https://huggingface.co/benjamin/gerpt2-large](https://huggingface.co/benjamin/gerpt2-large) and were converted to PaddlePaddle format for ease of use in PaddleNLP.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
# 模型列表
## benjamin/gerpt2
| 模型名称 | 模型介绍 | 模型大小 | 模型下载 |
| --- | --- | --- | --- |
|benjamin/gerpt2| | 621.95MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/benjamin/gerpt2/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/benjamin/gerpt2/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/benjamin/gerpt2/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/benjamin/gerpt2/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/benjamin/gerpt2/vocab.json) |
也可以通过`paddlenlp` cli 工具来下载对应的模型权重,使用步骤如下所示:
* 安装paddlenlp
```shell
pip install --upgrade paddlenlp
```
* 下载命令行
```shell
paddlenlp download --cache-dir ./pretrained_models benjamin/gerpt2
```
有任何下载的问题都可以到[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)中发Issue提问。
\ No newline at end of file
# model list
## benjamin/gerpt2
| model | description | model_size | download |
| --- | --- | --- | --- |
|benjamin/gerpt2| | 621.95MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/benjamin/gerpt2/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/benjamin/gerpt2/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/benjamin/gerpt2/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/benjamin/gerpt2/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/benjamin/gerpt2/vocab.json) |
Or you can download all of the model files with the following steps:
* install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* download model with cli tool
```shell
paddlenlp download --cache-dir ./pretrained_models benjamin/gerpt2
```
If you have any problems with it, you can post an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "benjamin/gerpt2"
description: "GerPT2"
description_en: "GerPT2"
icon: ""
from_repo: "https://huggingface.co/benjamin/gerpt2"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Text Generation"
sub_tag: "文本生成"
Example:
Datasets: ""
Publisher: "benjamin"
License: "mit"
Language: "German"
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
Datasets: ''
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: Korean
License: apache-2.0
Model_Info:
name: "beomi/kcbert-base"
description: "KcBERT: Korean comments BERT"
description_en: "KcBERT: Korean comments BERT"
icon: ""
from_repo: "https://huggingface.co/beomi/kcbert-base"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Fill-Mask"
sub_tag: "槽位填充"
Example:
Datasets: ""
Publisher: "beomi"
License: "apache-2.0"
Language: "Korean"
description: 'KcBERT: Korean comments BERT'
description_en: 'KcBERT: Korean comments BERT'
from_repo: https://huggingface.co/beomi/kcbert-base
icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
name: beomi/kcbert-base
Paper:
- title: 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'
url: 'http://arxiv.org/abs/1810.04805v2'
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
- title: 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'
url: http://arxiv.org/abs/1810.04805v2
Publisher: beomi
Task:
- sub_tag: 槽位填充
sub_tag_en: Fill-Mask
tag: 自然语言处理
tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "8a51a2c8",
"metadata": {},
"source": [
"# KcBERT: Korean comments BERT\n"
]
},
{
"cell_type": "markdown",
"id": "29c7e5a4",
"metadata": {},
"source": [
"Kaggle에 학습을 위해 정제한(아래 `clean`처리를 거친) Dataset을 공개하였습니다!\n"
]
},
{
"cell_type": "markdown",
"id": "95a25c77",
"metadata": {},
"source": [
"직접 다운받으셔서 다양한 Task에 학습을 진행해보세요 :)\n"
]
},
{
"cell_type": "markdown",
"id": "edd96db1",
"metadata": {},
"source": [
"공개된 한국어 BERT는 대부분 한국어 위키, 뉴스 기사, 책 등 잘 정제된 데이터를 기반으로 학습한 모델입니다. 한편, 실제로 NSMC와 같은 댓글형 데이터셋은 정제되지 않았고 구어체 특징에 신조어가 많으며, 오탈자 등 공식적인 글쓰기에서 나타나지 않는 표현들이 빈번하게 등장합니다.\n"
]
},
{
"cell_type": "markdown",
"id": "a2df738b",
"metadata": {},
"source": [
"KcBERT는 위와 같은 특성의 데이터셋에 적용하기 위해, 네이버 뉴스에서 댓글과 대댓글을 수집해, 토크나이저와 BERT모델을 처음부터 학습한 Pretrained BERT 모델입니다.\n"
]
},
{
"cell_type": "markdown",
"id": "a0eb4ad8",
"metadata": {},
"source": [
"KcBERT는 Huggingface의 Transformers 라이브러리를 통해 간편히 불러와 사용할 수 있습니다. (별도의 파일 다운로드가 필요하지 않습니다.)\n"
]
},
{
"cell_type": "markdown",
"id": "d1c07267",
"metadata": {},
"source": [
"## KcBERT Performance\n"
]
},
{
"cell_type": "markdown",
"id": "52872aa3",
"metadata": {},
"source": [
"- Finetune 코드는 https://github.com/Beomi/KcBERT-finetune 에서 찾아보실 수 있습니다.\n"
]
},
{
"cell_type": "markdown",
"id": "fa15ccaf",
"metadata": {},
"source": [
"| | Size<br/>(용량) | **NSMC**<br/>(acc) | **Naver NER**<br/>(F1) | **PAWS**<br/>(acc) | **KorNLI**<br/>(acc) | **KorSTS**<br/>(spearman) | **Question Pair**<br/>(acc) | **KorQuaD (Dev)**<br/>(EM/F1) |\n",
"| :-------------------- | :---: | :----------------: | :--------------------: | :----------------: | :------------------: | :-----------------------: | :-------------------------: | :---------------------------: |\n",
"| KcBERT-Base | 417M | 89.62 | 84.34 | 66.95 | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 |\n",
"| KcBERT-Large | 1.2G | **90.68** | 85.53 | 70.15 | 76.99 | 77.49 | 94.06 | 62.16 / 86.64 |\n",
"| KoBERT | 351M | 89.63 | 86.11 | 80.65 | 79.00 | 79.64 | 93.93 | 52.81 / 80.27 |\n",
"| XLM-Roberta-Base | 1.03G | 89.49 | 86.26 | 82.95 | 79.92 | 79.09 | 93.53 | 64.70 / 88.94 |\n",
"| HanBERT | 614M | 90.16 | **87.31** | 82.40 | **80.89** | 83.33 | 94.19 | 78.74 / 92.02 |\n",
"| KoELECTRA-Base | 423M | **90.21** | 86.87 | 81.90 | 80.85 | 83.21 | 94.20 | 61.10 / 89.59 |\n",
"| KoELECTRA-Base-v2 | 423M | 89.70 | 87.02 | **83.90** | 80.61 | **84.30** | **94.72** | **84.34 / 92.58** |\n",
"| DistilKoBERT | 108M | 88.41 | 84.13 | 62.55 | 70.55 | 73.21 | 92.48 | 54.12 / 77.80 |\n"
]
},
{
"cell_type": "markdown",
"id": "5193845f",
"metadata": {},
"source": [
"\\*HanBERT의 Size는 Bert Model과 Tokenizer DB를 합친 것입니다.\n"
]
},
{
"cell_type": "markdown",
"id": "93aecc1a",
"metadata": {},
"source": [
"\\***config의 세팅을 그대로 하여 돌린 결과이며, hyperparameter tuning을 추가적으로 할 시 더 좋은 성능이 나올 수 있습니다.**\n"
]
},
{
"cell_type": "markdown",
"id": "6f889bbd",
"metadata": {},
"source": [
"## How to use\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "465d2dee",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f884ed37",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"beomi/kcbert-base\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "a92e65b7",
"metadata": {},
"source": [
"```\n",
"@inproceedings{lee2020kcbert,\n",
"title={KcBERT: Korean Comments BERT},\n",
"author={Lee, Junbum},\n",
"booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},\n",
"pages={437--440},\n",
"year={2020}\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "21364621",
"metadata": {},
"source": [
"- 논문집 다운로드 링크: http://hclt.kr/dwn/?v=bG5iOmNvbmZlcmVuY2U7aWR4OjMy (*혹은 http://hclt.kr/symp/?lnb=conference )\n"
]
},
{
"cell_type": "markdown",
"id": "45cdafe0",
"metadata": {},
"source": [
"## Acknowledgement\n"
]
},
{
"cell_type": "markdown",
"id": "a741fcf0",
"metadata": {},
"source": [
"KcBERT Model을 학습하는 GCP/TPU 환경은 TFRC 프로그램의 지원을 받았습니다.\n"
]
},
{
"cell_type": "markdown",
"id": "1c9655e9",
"metadata": {},
"source": [
"모델 학습 과정에서 많은 조언을 주신 [Monologg](https://github.com/monologg/) 님 감사합니다 :)\n"
]
},
{
"cell_type": "markdown",
"id": "85cb1e08",
"metadata": {},
"source": [
"## Reference\n"
]
},
{
"cell_type": "markdown",
"id": "227d89d2",
"metadata": {},
"source": [
"### Github Repos\n"
]
},
{
"cell_type": "markdown",
"id": "5e8f4de7",
"metadata": {},
"source": [
"- [BERT by Google](https://github.com/google-research/bert)\n",
"- [KoBERT by SKT](https://github.com/SKTBrain/KoBERT)\n",
"- [KoELECTRA by Monologg](https://github.com/monologg/KoELECTRA/)\n"
]
},
{
"cell_type": "markdown",
"id": "730bfede",
"metadata": {},
"source": [
"- [Transformers by Huggingface](https://github.com/huggingface/transformers)\n",
"- [Tokenizers by Hugginface](https://github.com/huggingface/tokenizers)\n"
]
},
{
"cell_type": "markdown",
"id": "66dbd496",
"metadata": {},
"source": [
"### Papers\n"
]
},
{
"cell_type": "markdown",
"id": "84fe619a",
"metadata": {},
"source": [
"- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)\n"
]
},
{
"cell_type": "markdown",
"id": "63bb3dd3",
"metadata": {},
"source": [
"### Blogs\n"
]
},
{
"cell_type": "markdown",
"id": "a5aa5385",
"metadata": {},
"source": [
"- [Monologg님의 KoELECTRA 학습기](https://monologg.kr/categories/NLP/ELECTRA/)\n"
]
},
{
"cell_type": "markdown",
"id": "bcbd3600",
"metadata": {},
"source": [
"> 此模型介绍及权重来源于[https://huggingface.co/beomi/kcbert-base](https://huggingface.co/beomi/kcbert-base),并转换为飞桨模型格式。\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "21e8b000",
"metadata": {},
"source": [
"# KcBERT: Korean comments BERT\n"
]
},
{
"cell_type": "markdown",
"id": "336ee0b8",
"metadata": {},
"source": [
"Kaggle에 학습을 위해 정제한(아래 `clean`처리를 거친) Dataset을 공개하였습니다!\n"
]
},
{
"cell_type": "markdown",
"id": "691c1f27",
"metadata": {},
"source": [
"직접 다운받으셔서 다양한 Task에 학습을 진행해보세요 :)\n"
]
},
{
"cell_type": "markdown",
"id": "36ec915c",
"metadata": {},
"source": [
"공개된 한국어 BERT는 대부분 한국어 위키, 뉴스 기사, 책 등 잘 정제된 데이터를 기반으로 학습한 모델입니다. 한편, 실제로 NSMC와 같은 댓글형 데이터셋은 정제되지 않았고 구어체 특징에 신조어가 많으며, 오탈자 등 공식적인 글쓰기에서 나타나지 않는 표현들이 빈번하게 등장합니다.\n"
]
},
{
"cell_type": "markdown",
"id": "b5b8d7d7",
"metadata": {},
"source": [
"KcBERT는 위와 같은 특성의 데이터셋에 적용하기 위해, 네이버 뉴스에서 댓글과 대댓글을 수집해, 토크나이저와 BERT모델을 처음부터 학습한 Pretrained BERT 모델입니다.\n"
]
},
{
"cell_type": "markdown",
"id": "b0095da8",
"metadata": {},
"source": [
"KcBERT는 Huggingface의 Transformers 라이브러리를 통해 간편히 불러와 사용할 수 있습니다. (별도의 파일 다운로드가 필요하지 않습니다.)\n"
]
},
{
"cell_type": "markdown",
"id": "4bf51d97",
"metadata": {},
"source": [
"## KcBERT Performance\n"
]
},
{
"cell_type": "markdown",
"id": "9679c8b9",
"metadata": {},
"source": [
"- Finetune 코드는 https://github.com/Beomi/KcBERT-finetune 에서 찾아보실 수 있습니다.\n"
]
},
{
"cell_type": "markdown",
"id": "486782a2",
"metadata": {},
"source": [
"| | Size<br/>(용량) | **NSMC**<br/>(acc) | **Naver NER**<br/>(F1) | **PAWS**<br/>(acc) | **KorNLI**<br/>(acc) | **KorSTS**<br/>(spearman) | **Question Pair**<br/>(acc) | **KorQuaD (Dev)**<br/>(EM/F1) |\n",
"| :-------------------- | :---: | :----------------: | :--------------------: | :----------------: | :------------------: | :-----------------------: | :-------------------------: | :---------------------------: |\n",
"| KcBERT-Base | 417M | 89.62 | 84.34 | 66.95 | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 |\n",
"| KcBERT-Large | 1.2G | **90.68** | 85.53 | 70.15 | 76.99 | 77.49 | 94.06 | 62.16 / 86.64 |\n",
"| KoBERT | 351M | 89.63 | 86.11 | 80.65 | 79.00 | 79.64 | 93.93 | 52.81 / 80.27 |\n",
"| XLM-Roberta-Base | 1.03G | 89.49 | 86.26 | 82.95 | 79.92 | 79.09 | 93.53 | 64.70 / 88.94 |\n",
"| HanBERT | 614M | 90.16 | **87.31** | 82.40 | **80.89** | 83.33 | 94.19 | 78.74 / 92.02 |\n",
"| KoELECTRA-Base | 423M | **90.21** | 86.87 | 81.90 | 80.85 | 83.21 | 94.20 | 61.10 / 89.59 |\n",
"| KoELECTRA-Base-v2 | 423M | 89.70 | 87.02 | **83.90** | 80.61 | **84.30** | **94.72** | **84.34 / 92.58** |\n",
"| DistilKoBERT | 108M | 88.41 | 84.13 | 62.55 | 70.55 | 73.21 | 92.48 | 54.12 / 77.80 |\n"
]
},
{
"cell_type": "markdown",
"id": "e86103a2",
"metadata": {},
"source": [
"\\*HanBERT의 Size는 Bert Model과 Tokenizer DB를 합친 것입니다.\n"
]
},
{
"cell_type": "markdown",
"id": "1078bc5d",
"metadata": {},
"source": [
"\\***config의 세팅을 그대로 하여 돌린 결과이며, hyperparameter tuning을 추가적으로 할 시 더 좋은 성능이 나올 수 있습니다.**\n"
]
},
{
"cell_type": "markdown",
"id": "8ac2ee11",
"metadata": {},
"source": [
"## How to use\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e171068a",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "38c7ad79",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"beomi/kcbert-base\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
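{
"cell_type": "markdown",
"id": "kcbert-fillmask-md",
"metadata": {},
"source": [
"This checkpoint is tagged for the fill-mask task. As a rough, hypothetical sketch (not part of the original card), the cell below assumes `AutoModelForMaskedLM` can load the converted weights together with their MLM head; the Korean example sentence is made up.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "kcbert-fillmask-code",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM\n",
"\n",
"# Hedged sketch: predict the token behind [MASK] (assumes the MLM head is in the converted weights).\n",
"tokenizer = AutoTokenizer.from_pretrained(\"beomi/kcbert-base\")\n",
"model = AutoModelForMaskedLM.from_pretrained(\"beomi/kcbert-base\")\n",
"model.eval()\n",
"\n",
"text = \"한국어 댓글 데이터로 학습한 [MASK] 모델입니다.\"  # made-up example sentence\n",
"input_ids = paddle.to_tensor([tokenizer(text)[\"input_ids\"]])\n",
"\n",
"with paddle.no_grad():\n",
"    logits = model(input_ids)  # shape: [batch, seq_len, vocab_size]\n",
"\n",
"# Look up the most likely token at every [MASK] position.\n",
"for batch_idx, seq_idx in (input_ids == tokenizer.mask_token_id).nonzero().tolist():\n",
"    predicted_id = int(paddle.argmax(logits[batch_idx, seq_idx]))\n",
"    print(tokenizer.convert_ids_to_tokens(predicted_id))"
]
},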
{
"cell_type": "markdown",
"id": "a794d15a",
"metadata": {},
"source": [
"```\n",
"@inproceedings{lee2020kcbert,\n",
"title={KcBERT: Korean Comments BERT},\n",
"author={Lee, Junbum},\n",
"booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},\n",
"pages={437--440},\n",
"year={2020}\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "c0183cbe",
"metadata": {},
"source": [
"- 논문집 다운로드 링크: http://hclt.kr/dwn/?v=bG5iOmNvbmZlcmVuY2U7aWR4OjMy (*혹은 http://hclt.kr/symp/?lnb=conference )\n"
]
},
{
"cell_type": "markdown",
"id": "ba768b26",
"metadata": {},
"source": [
"## Acknowledgement\n"
]
},
{
"cell_type": "markdown",
"id": "ea148064",
"metadata": {},
"source": [
"KcBERT Model을 학습하는 GCP/TPU 환경은 TFRC 프로그램의 지원을 받았습니다.\n"
]
},
{
"cell_type": "markdown",
"id": "78732669",
"metadata": {},
"source": [
"모델 학습 과정에서 많은 조언을 주신 [Monologg](https://github.com/monologg/) 님 감사합니다 :)\n"
]
},
{
"cell_type": "markdown",
"id": "5ffa9ed9",
"metadata": {},
"source": [
"## Reference\n"
]
},
{
"cell_type": "markdown",
"id": "ea69da89",
"metadata": {},
"source": [
"### Github Repos\n"
]
},
{
"cell_type": "markdown",
"id": "d72d564c",
"metadata": {},
"source": [
"- [BERT by Google](https://github.com/google-research/bert)\n",
"- [KoBERT by SKT](https://github.com/SKTBrain/KoBERT)\n",
"- [KoELECTRA by Monologg](https://github.com/monologg/KoELECTRA/)\n"
]
},
{
"cell_type": "markdown",
"id": "38503607",
"metadata": {},
"source": [
"- [Transformers by Huggingface](https://github.com/huggingface/transformers)\n",
"- [Tokenizers by Hugginface](https://github.com/huggingface/tokenizers)\n"
]
},
{
"cell_type": "markdown",
"id": "a71a565f",
"metadata": {},
"source": [
"### Papers\n"
]
},
{
"cell_type": "markdown",
"id": "9aa4d324",
"metadata": {},
"source": [
"- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)\n"
]
},
{
"cell_type": "markdown",
"id": "6b1ba932",
"metadata": {},
"source": [
"### Blogs\n"
]
},
{
"cell_type": "markdown",
"id": "5c9e32e1",
"metadata": {},
"source": [
"- [Monologg님의 KoELECTRA 학습기](https://monologg.kr/categories/NLP/ELECTRA/)\n"
]
},
{
"cell_type": "markdown",
"id": "0b551dcf",
"metadata": {},
"source": [
"> The model introduction and model weights originate from [https://huggingface.co/beomi/kcbert-base](https://huggingface.co/beomi/kcbert-base) and were converted to PaddlePaddle format for ease of use in PaddleNLP.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
# 模型列表
## bhadresh-savani/roberta-base-emotion
| 模型名称 | 模型介绍 | 模型大小 | 模型下载 |
| --- | --- | --- | --- |
|bhadresh-savani/roberta-base-emotion| | 475.53MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/bhadresh-savani/roberta-base-emotion/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/bhadresh-savani/roberta-base-emotion/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/bhadresh-savani/roberta-base-emotion/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/bhadresh-savani/roberta-base-emotion/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/bhadresh-savani/roberta-base-emotion/vocab.json)<br>[vocab.txt](https://bj.bcebos.com/paddlenlp/models/community/bhadresh-savani/roberta-base-emotion/vocab.txt) |
也可以通过`paddlenlp` cli 工具来下载对应的模型权重,使用步骤如下所示:
* 安装paddlenlp
```shell
pip install --upgrade paddlenlp
```
* 下载命令行
```shell
paddlenlp download --cache-dir ./pretrained_models bhadresh-savani/roberta-base-emotion
```
有任何下载的问题都可以到[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)中发Issue提问。
\ No newline at end of file
# model list
## bhadresh-savani/roberta-base-emotion
| model | description | model_size | download |
| --- | --- | --- | --- |
|bhadresh-savani/roberta-base-emotion| | 475.53MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/bhadresh-savani/roberta-base-emotion/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/bhadresh-savani/roberta-base-emotion/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/bhadresh-savani/roberta-base-emotion/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/bhadresh-savani/roberta-base-emotion/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/bhadresh-savani/roberta-base-emotion/vocab.json)<br>[vocab.txt](https://bj.bcebos.com/paddlenlp/models/community/bhadresh-savani/roberta-base-emotion/vocab.txt) |
Or you can download all of the model files with the following steps:
* install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* download model with cli tool
```shell
paddlenlp download --cache-dir ./pretrained_models bhadresh-savani/roberta-base-emotion
```
If you have any problems with it, you can post an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "bhadresh-savani/roberta-base-emotion"
description: "robert-base-emotion"
description_en: "robert-base-emotion"
icon: ""
from_repo: "https://huggingface.co/bhadresh-savani/roberta-base-emotion"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Text Classification"
sub_tag: "文本分类"
Example:
Datasets: "emotion,emotion"
Publisher: "bhadresh-savani"
License: "apache-2.0"
Language: "English"
Paper:
- title: 'RoBERTa: A Robustly Optimized BERT Pretraining Approach'
url: 'http://arxiv.org/abs/1907.11692v1'
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
# 模型列表
## cahya/bert-base-indonesian-522M
| 模型名称 | 模型介绍 | 模型大小 | 模型下载 |
| --- | --- | --- | --- |
|cahya/bert-base-indonesian-522M| | 518.25MB | [model_config.json](https://bj.bcebos.com/paddlenlp/models/community/cahya/bert-base-indonesian-522M/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/cahya/bert-base-indonesian-522M/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/cahya/bert-base-indonesian-522M/tokenizer_config.json)<br>[vocab.txt](https://bj.bcebos.com/paddlenlp/models/community/cahya/bert-base-indonesian-522M/vocab.txt) |
也可以通过`paddlenlp` cli 工具来下载对应的模型权重,使用步骤如下所示:
* 安装paddlenlp
```shell
pip install --upgrade paddlenlp
```
* 下载命令行
```shell
paddlenlp download --cache-dir ./pretrained_models cahya/bert-base-indonesian-522M
```
有任何下载的问题都可以到[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)中发Issue提问。
\ No newline at end of file
# model list
## cahya/bert-base-indonesian-522M
| model | description | model_size | download |
| --- | --- | --- | --- |
|cahya/bert-base-indonesian-522M| | 518.25MB | [model_config.json](https://bj.bcebos.com/paddlenlp/models/community/cahya/bert-base-indonesian-522M/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/cahya/bert-base-indonesian-522M/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/cahya/bert-base-indonesian-522M/tokenizer_config.json)<br>[vocab.txt](https://bj.bcebos.com/paddlenlp/models/community/cahya/bert-base-indonesian-522M/vocab.txt) |
Or you can download all of the model files with the following steps:
* install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* download model with cli tool
```shell
paddlenlp download --cache-dir ./pretrained_models cahya/bert-base-indonesian-522M
```
If you have any problems with it, you can post an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "cahya/bert-base-indonesian-522M"
description: "Indonesian BERT base model (uncased)"
description_en: "Indonesian BERT base model (uncased)"
icon: ""
from_repo: "https://huggingface.co/cahya/bert-base-indonesian-522M"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Fill-Mask"
sub_tag: "槽位填充"
Example:
Datasets: "wikipedia"
Publisher: "cahya"
License: "mit"
Language: "Indonesian"
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
# 模型列表
## cahya/gpt2-small-indonesian-522M
| 模型名称 | 模型介绍 | 模型大小 | 模型下载 |
| --- | --- | --- | --- |
|cahya/gpt2-small-indonesian-522M| | 621.95MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/cahya/gpt2-small-indonesian-522M/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/cahya/gpt2-small-indonesian-522M/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/cahya/gpt2-small-indonesian-522M/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/cahya/gpt2-small-indonesian-522M/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/cahya/gpt2-small-indonesian-522M/vocab.json) |
也可以通过`paddlenlp` cli 工具来下载对应的模型权重,使用步骤如下所示:
* 安装paddlenlp
```shell
pip install --upgrade paddlenlp
```
* 下载命令行
```shell
paddlenlp download --cache-dir ./pretrained_models cahya/gpt2-small-indonesian-522M
```
有任何下载的问题都可以到[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)中发Issue提问。
\ No newline at end of file
# model list
## cahya/gpt2-small-indonesian-522M
| model | description | model_size | download |
| --- | --- | --- | --- |
|cahya/gpt2-small-indonesian-522M| | 621.95MB | [merges.txt](https://bj.bcebos.com/paddlenlp/models/community/cahya/gpt2-small-indonesian-522M/merges.txt)<br>[model_config.json](https://bj.bcebos.com/paddlenlp/models/community/cahya/gpt2-small-indonesian-522M/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/cahya/gpt2-small-indonesian-522M/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/cahya/gpt2-small-indonesian-522M/tokenizer_config.json)<br>[vocab.json](https://bj.bcebos.com/paddlenlp/models/community/cahya/gpt2-small-indonesian-522M/vocab.json) |
Or you can download all of the model files with the following steps:
* install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* download model with cli tool
```shell
paddlenlp download --cache-dir ./pretrained_models cahya/gpt2-small-indonesian-522M
```
If you have any problems with it, you can post an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "cahya/gpt2-small-indonesian-522M"
description: "Indonesian GPT2 small model"
description_en: "Indonesian GPT2 small model"
icon: ""
from_repo: "https://huggingface.co/cahya/gpt2-small-indonesian-522M"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Text Generation"
sub_tag: "文本生成"
Example:
Datasets: ""
Publisher: "cahya"
License: "mit"
Language: "Indonesian"
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
# 模型列表
## ceshine/t5-paraphrase-paws-msrp-opinosis
| 模型名称 | 模型介绍 | 模型大小 | 模型下载 |
| --- | --- | --- | --- |
|ceshine/t5-paraphrase-paws-msrp-opinosis| | 1.11G | [model_config.json](https://bj.bcebos.com/paddlenlp/models/community/ceshine/t5-paraphrase-paws-msrp-opinosis/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/ceshine/t5-paraphrase-paws-msrp-opinosis/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/ceshine/t5-paraphrase-paws-msrp-opinosis/tokenizer_config.json) |
也可以通过`paddlenlp` cli 工具来下载对应的模型权重,使用步骤如下所示:
* 安装paddlenlp
```shell
pip install --upgrade paddlenlp
```
* 下载命令行
```shell
paddlenlp download --cache-dir ./pretrained_models ceshine/t5-paraphrase-paws-msrp-opinosis
```
有任何下载的问题都可以到[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)中发Issue提问。
\ No newline at end of file
# model list
## ceshine/t5-paraphrase-paws-msrp-opinosis
| model | description | model_size | download |
| --- | --- | --- | --- |
|ceshine/t5-paraphrase-paws-msrp-opinosis| | 1.11G | [model_config.json](https://bj.bcebos.com/paddlenlp/models/community/ceshine/t5-paraphrase-paws-msrp-opinosis/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/ceshine/t5-paraphrase-paws-msrp-opinosis/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/ceshine/t5-paraphrase-paws-msrp-opinosis/tokenizer_config.json) |
Or you can download all of the model files with the following steps:
* install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* download model with cli tool
```shell
paddlenlp download --cache-dir ./pretrained_models ceshine/t5-paraphrase-paws-msrp-opinosis
```
If you have any problems with it, you can post an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "ceshine/t5-paraphrase-paws-msrp-opinosis"
description: "T5-base Parapharasing model fine-tuned on PAWS, MSRP, and Opinosis"
description_en: "T5-base Parapharasing model fine-tuned on PAWS, MSRP, and Opinosis"
icon: ""
from_repo: "https://huggingface.co/ceshine/t5-paraphrase-paws-msrp-opinosis"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Text2Text Generation"
sub_tag: "文本生成"
Example:
Datasets: ""
Publisher: "ceshine"
License: "apache-2.0"
Language: "English"
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
# 模型列表
## ceshine/t5-paraphrase-quora-paws
| 模型名称 | 模型介绍 | 模型大小 | 模型下载 |
| --- | --- | --- | --- |
|ceshine/t5-paraphrase-quora-paws| | 1.11G | [model_config.json](https://bj.bcebos.com/paddlenlp/models/community/ceshine/t5-paraphrase-quora-paws/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/ceshine/t5-paraphrase-quora-paws/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/ceshine/t5-paraphrase-quora-paws/tokenizer_config.json) |
也可以通过`paddlenlp` cli 工具来下载对应的模型权重,使用步骤如下所示:
* 安装paddlenlp
```shell
pip install --upgrade paddlenlp
```
* 下载命令行
```shell
paddlenlp download --cache-dir ./pretrained_models ceshine/t5-paraphrase-quora-paws
```
有任何下载的问题都可以到[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)中发Issue提问。
\ No newline at end of file
# model list
## ceshine/t5-paraphrase-quora-paws
| model | description | model_size | download |
| --- | --- | --- | --- |
|ceshine/t5-paraphrase-quora-paws| | 1.11G | [model_config.json](https://bj.bcebos.com/paddlenlp/models/community/ceshine/t5-paraphrase-quora-paws/model_config.json)<br>[model_state.pdparams](https://bj.bcebos.com/paddlenlp/models/community/ceshine/t5-paraphrase-quora-paws/model_state.pdparams)<br>[tokenizer_config.json](https://bj.bcebos.com/paddlenlp/models/community/ceshine/t5-paraphrase-quora-paws/tokenizer_config.json) |
Or you can download all of the model files with the following steps:
* install paddlenlp
```shell
pip install --upgrade paddlenlp
```
* download model with cli tool
```shell
paddlenlp download --cache-dir ./pretrained_models ceshine/t5-paraphrase-quora-paws
```
If you have any problems with it, you can post an issue on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) to get support.
Model_Info:
name: "ceshine/t5-paraphrase-quora-paws"
description: "T5-base Parapharasing model fine-tuned on PAWS and Quora"
description_en: "T5-base Parapharasing model fine-tuned on PAWS and Quora"
icon: ""
from_repo: "https://huggingface.co/ceshine/t5-paraphrase-quora-paws"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Text2Text Generation"
sub_tag: "文本生成"
Example:
Datasets: ""
Publisher: "ceshine"
License: "apache-2.0"
Language: "English"
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
Datasets: ''
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: Russian,English
License: mit
Model_Info:
name: "cointegrated/rubert-tiny"
description: "pip install transformers sentencepiece"
description_en: "pip install transformers sentencepiece"
icon: ""
from_repo: "https://huggingface.co/cointegrated/rubert-tiny"
description: pip install transformers sentencepiece
description_en: pip install transformers sentencepiece
from_repo: https://huggingface.co/cointegrated/rubert-tiny
icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
name: cointegrated/rubert-tiny
Paper: null
Publisher: cointegrated
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Feature Extraction"
sub_tag: "特征抽取"
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Fill-Mask"
sub_tag: "槽位填充"
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Sentence Similarity"
sub_tag: "句子相似度"
Example:
Datasets: ""
Publisher: "cointegrated"
License: "mit"
Language: "Russian,English"
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
- sub_tag: 特征抽取
sub_tag_en: Feature Extraction
tag: 自然语言处理
tag_en: Natural Language Processing
- sub_tag: 槽位填充
sub_tag_en: Fill-Mask
tag: 自然语言处理
tag_en: Natural Language Processing
- sub_tag: 句子相似度
sub_tag_en: Sentence Similarity
tag: 自然语言处理
tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "83973edc",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"This is a very small distilled version of the bert-base-multilingual-cased model for Russian and English (45 MB, 12M parameters). There is also an **updated version of this model**, rubert-tiny2, with a larger vocabulary and better quality on practically all Russian NLU tasks.\n"
]
},
{
"cell_type": "markdown",
"id": "59944441",
"metadata": {},
"source": [
"This model is useful if you want to fine-tune it for a relatively simple Russian task (e.g. NER or sentiment classification), and you care more about speed and size than about accuracy. It is approximately x10 smaller and faster than a base-sized BERT. Its `[CLS]` embeddings can be used as a sentence representation aligned between Russian and English.\n"
]
},
{
"cell_type": "markdown",
"id": "c0e2918f",
"metadata": {},
"source": [
"It was trained on the [Yandex Translate corpus](https://translate.yandex.ru/corpus), [OPUS-100](https://huggingface.co/datasets/opus100) and Tatoeba, using MLM loss distilled from bert-base-multilingual-cased, translation ranking loss, and `[CLS]` embeddings distilled from LaBSE, rubert-base-cased-sentence, Laser and USE.\n"
]
},
{
"cell_type": "markdown",
"id": "b0c0158e",
"metadata": {},
"source": [
"There is a more detailed [description in Russian](https://habr.com/ru/post/562064/).\n"
]
},
{
"cell_type": "markdown",
"id": "28ce4026",
"metadata": {},
"source": [
"Sentence embeddings can be produced as follows:\n"
]
},
{
"cell_type": "markdown",
"id": "d521437a",
"metadata": {},
"source": [
"## How to use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da5acdb0",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "df2d3cc6",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"cointegrated/rubert-tiny\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "065bda47",
"metadata": {},
"source": [
"> 此模型介绍及权重来源于[https://huggingface.co/cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny),并转换为飞桨模型格式。\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "b59db37b",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"This is a very small distilled version of the bert-base-multilingual-cased model for Russian and English (45 MB, 12M parameters). There is also an **updated version of this model**, rubert-tiny2, with a larger vocabulary and better quality on practically all Russian NLU tasks.\n"
]
},
{
"cell_type": "markdown",
"id": "5e7c8c35",
"metadata": {},
"source": [
"This model is useful if you want to fine-tune it for a relatively simple Russian task (e.g. NER or sentiment classification), and you care more about speed and size than about accuracy. It is approximately x10 smaller and faster than a base-sized BERT. Its `[CLS]` embeddings can be used as a sentence representation aligned between Russian and English.\n"
]
},
{
"cell_type": "markdown",
"id": "bc3c5717",
"metadata": {},
"source": [
"It was trained on the [Yandex Translate corpus](https://translate.yandex.ru/corpus), [OPUS-100](https://huggingface.co/datasets/opus100) and Tatoeba, using MLM loss (distilled from bert-base-multilingual-cased\n",
"), translation ranking loss, and `[CLS]` embeddings distilled from LaBSE, rubert-base-cased-sentence, Laser and USE.\n"
]
},
{
"cell_type": "markdown",
"id": "2db0a3ee",
"metadata": {},
"source": [
"There is a more detailed [description in Russian](https://habr.com/ru/post/562064/).\n"
]
},
{
"cell_type": "markdown",
"id": "c3a52477",
"metadata": {},
"source": [
"Sentence embeddings can be produced as follows:\n"
]
},
{
"cell_type": "markdown",
"id": "add13de4",
"metadata": {},
"source": [
"## How to use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0a8f905",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "481d0ca6",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"cointegrated/rubert-tiny\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
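{
"cell_type": "markdown",
"id": "rubert-tiny-cls-md",
"metadata": {},
"source": [
"A minimal sketch of the `[CLS]`-based sentence embeddings mentioned above (not part of the original card): it reuses the converted encoder via `AutoModel` and `AutoTokenizer` and compares a made-up Russian and English sentence pair.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "rubert-tiny-cls-code",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"import paddle.nn.functional as F\n",
"from paddlenlp.transformers import AutoModel, AutoTokenizer\n",
"\n",
"# Hedged sketch: L2-normalized [CLS] embeddings as aligned Russian/English sentence representations.\n",
"tokenizer = AutoTokenizer.from_pretrained(\"cointegrated/rubert-tiny\")\n",
"model = AutoModel.from_pretrained(\"cointegrated/rubert-tiny\")\n",
"model.eval()\n",
"\n",
"def embed(text):\n",
"    input_ids = paddle.to_tensor([tokenizer(text)[\"input_ids\"]])\n",
"    with paddle.no_grad():\n",
"        sequence_output, pooled_output = model(input_ids)\n",
"    return F.normalize(sequence_output[:, 0], axis=-1)  # [CLS] token representation\n",
"\n",
"ru = embed(\"Привет, мир!\")  # made-up example sentences\n",
"en = embed(\"Hello, world!\")\n",
"print(float((ru * en).sum()))  # cosine similarity of the two sentence embeddings"
]
},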
{
"cell_type": "markdown",
"id": "e6df17e3",
"metadata": {},
"source": [
"> The model introduction and model weights originate from [https://huggingface.co/cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny) and were converted to PaddlePaddle format for ease of use in PaddleNLP.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Datasets: ''
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: Russian
License: mit
Model_Info:
name: "cointegrated/rubert-tiny2"
description: "pip install transformers sentencepiece"
description_en: "pip install transformers sentencepiece"
icon: ""
from_repo: "https://huggingface.co/cointegrated/rubert-tiny2"
description: pip install transformers sentencepiece
description_en: pip install transformers sentencepiece
from_repo: https://huggingface.co/cointegrated/rubert-tiny2
icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
name: cointegrated/rubert-tiny2
Paper: null
Publisher: cointegrated
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Feature Extraction"
sub_tag: "特征抽取"
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Fill-Mask"
sub_tag: "槽位填充"
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Sentence Similarity"
sub_tag: "句子相似度"
Example:
Datasets: ""
Publisher: "cointegrated"
License: "mit"
Language: "Russian"
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
- sub_tag: 特征抽取
sub_tag_en: Feature Extraction
tag: 自然语言处理
tag_en: Natural Language Processing
- sub_tag: 槽位填充
sub_tag_en: Fill-Mask
tag: 自然语言处理
tag_en: Natural Language Processing
- sub_tag: 句子相似度
sub_tag_en: Sentence Similarity
tag: 自然语言处理
tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "9eef057a",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"This is an updated version of cointegrated/rubert-tiny: a small Russian BERT-based encoder with high-quality sentence embeddings. This [post in Russian](https://habr.com/ru/post/669674/) gives more details.\n"
]
},
{
"cell_type": "markdown",
"id": "08d9a049",
"metadata": {},
"source": [
"The differences from the previous version include:\n",
"- a larger vocabulary: 83828 tokens instead of 29564;\n",
"- larger supported sequences: 2048 instead of 512;\n",
"- sentence embeddings approximate LaBSE closer than before;\n",
"- meaningful segment embeddings (tuned on the NLI task)\n",
"- the model is focused only on Russian.\n"
]
},
{
"cell_type": "markdown",
"id": "8a7ba50b",
"metadata": {},
"source": [
"The model should be used as is to produce sentence embeddings (e.g. for KNN classification of short texts) or fine-tuned for a downstream task.\n"
]
},
{
"cell_type": "markdown",
"id": "184e1cc6",
"metadata": {},
"source": [
"Sentence embeddings can be produced as follows:\n"
]
},
{
"cell_type": "markdown",
"id": "a9613056",
"metadata": {},
"source": [
"## How to use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d60b7b64",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "716f2b63",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"cointegrated/rubert-tiny2\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "0ba8c599",
"metadata": {},
"source": [
"> 此模型介绍及权重来源于[https://huggingface.co/cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2),并转换为飞桨模型格式。\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "db267b71",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"This is an updated version of cointegrated/rubert-tiny: a small Russian BERT-based encoder with high-quality sentence embeddings. This [post in Russian](https://habr.com/ru/post/669674/) gives more details.\n"
]
},
{
"cell_type": "markdown",
"id": "801acf5c",
"metadata": {},
"source": [
"The differences from the previous version include:\n",
"- a larger vocabulary: 83828 tokens instead of 29564;\n",
"- larger supported sequences: 2048 instead of 512;\n",
"- sentence embeddings approximate LaBSE closer than before;\n",
"- meaningful segment embeddings (tuned on the NLI task)\n",
"- the model is focused only on Russian.\n"
]
},
{
"cell_type": "markdown",
"id": "f2c7dbc1",
"metadata": {},
"source": [
"The model should be used as is to produce sentence embeddings (e.g. for KNN classification of short texts) or fine-tuned for a downstream task.\n"
]
},
{
"cell_type": "markdown",
"id": "9ff63df2",
"metadata": {},
"source": [
"Sentence embeddings can be produced as follows:\n"
]
},
{
"cell_type": "markdown",
"id": "2b073558",
"metadata": {},
"source": [
"## how to use"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c98c0cce",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81978806",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"cointegrated/rubert-tiny2\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
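{
 "cell_type": "markdown",
 "id": "7a40c2de",
 "metadata": {},
 "source": [
  "The cell above only checks that the converted weights load and run on random token ids. Below is a minimal, hedged sketch of producing an actual sentence embedding. It assumes the converted checkpoint also ships a compatible tokenizer and that the encoder returns `(sequence_output, pooled_output)`; following the original model card, it takes the L2-normalized `[CLS]` vector as the sentence embedding.\n"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "9b31fa6c",
 "metadata": {},
 "outputs": [],
 "source": [
  "import paddle\n",
  "import paddle.nn.functional as F\n",
  "from paddlenlp.transformers import AutoModel, AutoTokenizer\n",
  "\n",
  "# Assumption: the community conversion also provides the tokenizer files.\n",
  "tokenizer = AutoTokenizer.from_pretrained(\"cointegrated/rubert-tiny2\")\n",
  "model = AutoModel.from_pretrained(\"cointegrated/rubert-tiny2\")\n",
  "model.eval()\n",
  "\n",
  "encoded = tokenizer(\"Привет, мир!\")\n",
  "input_ids = paddle.to_tensor([encoded[\"input_ids\"]])\n",
  "token_type_ids = paddle.to_tensor([encoded[\"token_type_ids\"]])\n",
  "\n",
  "with paddle.no_grad():\n",
  "    sequence_output, pooled_output = model(input_ids, token_type_ids=token_type_ids)\n",
  "\n",
  "# L2-normalized [CLS] vector as the sentence embedding.\n",
  "sentence_embedding = F.normalize(sequence_output[:, 0], axis=-1)\n",
  "print(sentence_embedding.shape)"
 ]
},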
{
"cell_type": "markdown",
"id": "33dbe378",
"metadata": {},
"source": [
"> The model introduction and model weights originate from [https://huggingface.co/cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) and were converted to PaddlePaddle format for ease of use in PaddleNLP.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Datasets: ''
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: ''
License: apache-2.0
Model_Info:
name: "cross-encoder/ms-marco-MiniLM-L-12-v2"
description: "Cross-Encoder for MS Marco"
description_en: "Cross-Encoder for MS Marco"
icon: ""
from_repo: "https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2"
description: Cross-Encoder for MS Marco
description_en: Cross-Encoder for MS Marco
from_repo: https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
name: cross-encoder/ms-marco-MiniLM-L-12-v2
Paper: null
Publisher: cross-encoder
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Text Classification"
sub_tag: "文本分类"
Example:
Datasets: ""
Publisher: "cross-encoder"
License: "apache-2.0"
Language: ""
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
- sub_tag: 文本分类
sub_tag_en: Text Classification
tag: 自然语言处理
tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "b14e9fee",
"metadata": {},
"source": [
"# Cross-Encoder for MS Marco\n"
]
},
{
"cell_type": "markdown",
"id": "770d5215",
"metadata": {},
"source": [
"This model was trained on the [MS Marco Passage Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) task.\n"
]
},
{
"cell_type": "markdown",
"id": "0e8686b5",
"metadata": {},
"source": [
"The model can be used for Information Retrieval: Given a query, encode the query will all possible passages (e.g. retrieved with ElasticSearch). Then sort the passages in a decreasing order. See [SBERT.net Retrieve & Re-rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) for more details. The training code is available here: [SBERT.net Training MS Marco](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/ms_marco)\n"
]
},
{
"cell_type": "markdown",
"id": "c437c78a",
"metadata": {},
"source": [
"## How to use\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f4581da",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "295c7df7",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"cross-encoder/ms-marco-MiniLM-L-12-v2\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
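{
 "cell_type": "markdown",
 "id": "61f7d2b0",
 "metadata": {},
 "source": [
  "The cell above only checks that the converted weights load and run on random ids. The sketch below illustrates the re-ranking workflow described earlier: score each (query, passage) pair and sort the passages by score in decreasing order. It assumes the converted checkpoint can be loaded with its sequence-classification (ranking) head and a compatible tokenizer; the query and passages are made up for illustration.\n"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "84a5c9e3",
 "metadata": {},
 "outputs": [],
 "source": [
  "import paddle\n",
  "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer\n",
  "\n",
  "# Assumptions: the converted checkpoint exposes the ranking head and tokenizer;\n",
  "# the query and passages below are illustrative only.\n",
  "model_name = \"cross-encoder/ms-marco-MiniLM-L-12-v2\"\n",
  "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
  "model = AutoModelForSequenceClassification.from_pretrained(model_name)\n",
  "model.eval()\n",
  "\n",
  "query = \"How many people live in Berlin?\"\n",
  "passages = [\n",
  "    \"Berlin has a population of roughly 3.6 million registered inhabitants.\",\n",
  "    \"New York City is famous for the Metropolitan Museum of Art.\",\n",
  "]\n",
  "\n",
  "scores = []\n",
  "with paddle.no_grad():\n",
  "    for passage in passages:\n",
  "        encoded = tokenizer(query, text_pair=passage)\n",
  "        logits = model(\n",
  "            paddle.to_tensor([encoded[\"input_ids\"]]),\n",
  "            token_type_ids=paddle.to_tensor([encoded[\"token_type_ids\"]]),\n",
  "        )\n",
  "        # One relevance logit per (query, passage) pair.\n",
  "        scores.append(float(logits.reshape([-1])[0]))\n",
  "\n",
  "# Higher score means more relevant; sort passages in decreasing order.\n",
  "for score, passage in sorted(zip(scores, passages), reverse=True):\n",
  "    print(f\"{score:.4f}\\t{passage}\")"
 ]
},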
{
"cell_type": "markdown",
"id": "706017d9",
"metadata": {},
"source": [
"## Performance\n",
"In the following table, we provide various pre-trained Cross-Encoders together with their performance on the [TREC Deep Learning 2019](https://microsoft.github.io/TREC-2019-Deep-Learning/) and the [MS Marco Passage Reranking](https://github.com/microsoft/MSMARCO-Passage-Ranking/) dataset.\n"
]
},
{
"cell_type": "markdown",
"id": "2aa6bf22",
"metadata": {},
"source": [
"| Model-Name | NDCG@10 (TREC DL 19) | MRR@10 (MS Marco Dev) | Docs / Sec |\n",
"| ------------- |:-------------| -----| --- |\n",
"| **Version 2 models** | | |\n",
"| cross-encoder/ms-marco-TinyBERT-L-2-v2 | 69.84 | 32.56 | 9000\n",
"| cross-encoder/ms-marco-MiniLM-L-2-v2 | 71.01 | 34.85 | 4100\n",
"| cross-encoder/ms-marco-MiniLM-L-4-v2 | 73.04 | 37.70 | 2500\n",
"| cross-encoder/ms-marco-MiniLM-L-6-v2 | 74.30 | 39.01 | 1800\n",
"| cross-encoder/ms-marco-MiniLM-L-12-v2 | 74.31 | 39.02 | 960\n",
"| **Version 1 models** | | |\n",
"| cross-encoder/ms-marco-TinyBERT-L-2 | 67.43 | 30.15 | 9000\n",
"| cross-encoder/ms-marco-TinyBERT-L-4 | 68.09 | 34.50 | 2900\n",
"| cross-encoder/ms-marco-TinyBERT-L-6 | 69.57 | 36.13 | 680\n",
"| cross-encoder/ms-marco-electra-base | 71.99 | 36.41 | 340\n",
"| **Other models** | | |\n",
"| nboost/pt-tinybert-msmarco | 63.63 | 28.80 | 2900\n",
"| nboost/pt-bert-base-uncased-msmarco | 70.94 | 34.75 | 340\n",
"| nboost/pt-bert-large-msmarco | 73.36 | 36.48 | 100\n",
"| Capreolus/electra-base-msmarco | 71.23 | 36.89 | 340\n",
"| amberoad/bert-multilingual-passage-reranking-msmarco | 68.40 | 35.54 | 330\n",
"| sebastian-hofstaetter/distilbert-cat-margin_mse-T2-msmarco | 72.82 | 37.88 | 720\n"
]
},
{
"cell_type": "markdown",
"id": "65eda465",
"metadata": {},
"source": [
"Note: Runtime was computed on a V100 GPU.\n",
"> 此模型介绍及权重来源于[https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2),并转换为飞桨模型格式。\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "366980e6",
"metadata": {},
"source": [
"# Cross-Encoder for MS Marco\n"
]
},
{
"cell_type": "markdown",
"id": "4c7d726e",
"metadata": {},
"source": [
"This model was trained on the [MS Marco Passage Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) task.\n"
]
},
{
"cell_type": "markdown",
"id": "1535e90f",
"metadata": {},
"source": [
"The model can be used for Information Retrieval: Given a query, encode the query will all possible passages (e.g. retrieved with ElasticSearch). Then sort the passages in a decreasing order. See [SBERT.net Retrieve & Re-rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) for more details. The training code is available here: [SBERT.net Training MS Marco](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/ms_marco)\n"
]
},
{
"cell_type": "markdown",
"id": "3eda3140",
"metadata": {},
"source": [
"## How to use\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "74d5bcd7",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59553cde",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"cross-encoder/ms-marco-MiniLM-L-12-v2\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
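{
 "cell_type": "markdown",
 "id": "2e6b90cf",
 "metadata": {},
 "source": [
  "The cell above only checks that the converted weights load and run on random ids. The sketch below illustrates the re-ranking workflow described earlier: score each (query, passage) pair and sort the passages by score in decreasing order. It assumes the converted checkpoint can be loaded with its sequence-classification (ranking) head and a compatible tokenizer; the query and passages are made up for illustration.\n"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "d17a43b8",
 "metadata": {},
 "outputs": [],
 "source": [
  "import paddle\n",
  "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer\n",
  "\n",
  "# Assumptions: the converted checkpoint exposes the ranking head and tokenizer;\n",
  "# the query and passages below are illustrative only.\n",
  "model_name = \"cross-encoder/ms-marco-MiniLM-L-12-v2\"\n",
  "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
  "model = AutoModelForSequenceClassification.from_pretrained(model_name)\n",
  "model.eval()\n",
  "\n",
  "query = \"How many people live in Berlin?\"\n",
  "passages = [\n",
  "    \"Berlin has a population of roughly 3.6 million registered inhabitants.\",\n",
  "    \"New York City is famous for the Metropolitan Museum of Art.\",\n",
  "]\n",
  "\n",
  "scores = []\n",
  "with paddle.no_grad():\n",
  "    for passage in passages:\n",
  "        encoded = tokenizer(query, text_pair=passage)\n",
  "        logits = model(\n",
  "            paddle.to_tensor([encoded[\"input_ids\"]]),\n",
  "            token_type_ids=paddle.to_tensor([encoded[\"token_type_ids\"]]),\n",
  "        )\n",
  "        # One relevance logit per (query, passage) pair.\n",
  "        scores.append(float(logits.reshape([-1])[0]))\n",
  "\n",
  "# Higher score means more relevant; sort passages in decreasing order.\n",
  "for score, passage in sorted(zip(scores, passages), reverse=True):\n",
  "    print(f\"{score:.4f}\\t{passage}\")"
 ]
},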
{
"cell_type": "markdown",
"id": "0b6883fa",
"metadata": {},
"source": [
"## Performance\n",
"In the following table, we provide various pre-trained Cross-Encoders together with their performance on the [TREC Deep Learning 2019](https://microsoft.github.io/TREC-2019-Deep-Learning/) and the [MS Marco Passage Reranking](https://github.com/microsoft/MSMARCO-Passage-Ranking/) dataset.\n"
]
},
{
"cell_type": "markdown",
"id": "e04ad9db",
"metadata": {},
"source": [
"| Model-Name | NDCG@10 (TREC DL 19) | MRR@10 (MS Marco Dev) | Docs / Sec |\n",
"| ------------- |:-------------| -----| --- |\n",
"| **Version 2 models** | | |\n",
"| cross-encoder/ms-marco-TinyBERT-L-2-v2 | 69.84 | 32.56 | 9000\n",
"| cross-encoder/ms-marco-MiniLM-L-2-v2 | 71.01 | 34.85 | 4100\n",
"| cross-encoder/ms-marco-MiniLM-L-4-v2 | 73.04 | 37.70 | 2500\n",
"| cross-encoder/ms-marco-MiniLM-L-6-v2 | 74.30 | 39.01 | 1800\n",
"| cross-encoder/ms-marco-MiniLM-L-12-v2 | 74.31 | 39.02 | 960\n",
"| **Version 1 models** | | |\n",
"| cross-encoder/ms-marco-TinyBERT-L-2 | 67.43 | 30.15 | 9000\n",
"| cross-encoder/ms-marco-TinyBERT-L-4 | 68.09 | 34.50 | 2900\n",
"| cross-encoder/ms-marco-TinyBERT-L-6 | 69.57 | 36.13 | 680\n",
"| cross-encoder/ms-marco-electra-base | 71.99 | 36.41 | 340\n",
"| **Other models** | | |\n",
"| nboost/pt-tinybert-msmarco | 63.63 | 28.80 | 2900\n",
"| nboost/pt-bert-base-uncased-msmarco | 70.94 | 34.75 | 340\n",
"| nboost/pt-bert-large-msmarco | 73.36 | 36.48 | 100\n",
"| Capreolus/electra-base-msmarco | 71.23 | 36.89 | 340\n",
"| amberoad/bert-multilingual-passage-reranking-msmarco | 68.40 | 35.54 | 330\n",
"| sebastian-hofstaetter/distilbert-cat-margin_mse-T2-msmarco | 72.82 | 37.88 | 720\n"
]
},
{
"cell_type": "markdown",
"id": "18e7124d",
"metadata": {},
"source": [
"Note: Runtime was computed on a V100 GPU.\n",
"> The model introduction and model weights originate from [https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2) and were converted to PaddlePaddle format for ease of use in PaddleNLP.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Datasets: ''
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: ''
License: apache-2.0
Model_Info:
name: "cross-encoder/ms-marco-TinyBERT-L-2"
description: "Cross-Encoder for MS Marco"
description_en: "Cross-Encoder for MS Marco"
icon: ""
from_repo: "https://huggingface.co/cross-encoder/ms-marco-TinyBERT-L-2"
description: Cross-Encoder for MS Marco
description_en: Cross-Encoder for MS Marco
from_repo: https://huggingface.co/cross-encoder/ms-marco-TinyBERT-L-2
icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
name: cross-encoder/ms-marco-TinyBERT-L-2
Paper: null
Publisher: cross-encoder
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Text Classification"
sub_tag: "文本分类"
Example:
Datasets: ""
Publisher: "cross-encoder"
License: "apache-2.0"
Language: ""
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
- sub_tag: 文本分类
sub_tag_en: Text Classification
tag: 自然语言处理
tag_en: Natural Language Processing
{
"cells": [
{
"cell_type": "markdown",
"id": "32947f83",
"metadata": {},
"source": [
"# Cross-Encoder for MS Marco\n"
]
},
{
"cell_type": "markdown",
"id": "d34eaa08",
"metadata": {},
"source": [
"This model was trained on the [MS Marco Passage Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) task.\n"
]
},
{
"cell_type": "markdown",
"id": "dcf2e434",
"metadata": {},
"source": [
"The model can be used for Information Retrieval: Given a query, encode the query will all possible passages (e.g. retrieved with ElasticSearch). Then sort the passages in a decreasing order. See [SBERT.net Retrieve & Re-rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) for more details. The training code is available here: [SBERT.net Training MS Marco](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/ms_marco)\n"
]
},
{
"cell_type": "markdown",
"id": "bb938635",
"metadata": {},
"source": [
"## How to use\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "463fcbb2",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3ac7704",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddlenlp.transformers import AutoModel\n",
"\n",
"model = AutoModel.from_pretrained(\"cross-encoder/ms-marco-TinyBERT-L-2\")\n",
"input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
"print(model(input_ids))"
]
},
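{
 "cell_type": "markdown",
 "id": "f3b2a178",
 "metadata": {},
 "source": [
  "A hedged sketch of the re-ranking usage described above: each (query, passage) pair receives a relevance score and the passages are sorted by that score. It assumes the converted cross-encoder/ms-marco-TinyBERT-L-2 checkpoint loads with its ranking head and a matching tokenizer; the example query and passages are illustrative only.\n"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "0c9d5e66",
 "metadata": {},
 "outputs": [],
 "source": [
  "import paddle\n",
  "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer\n",
  "\n",
  "# Assumptions: the converted checkpoint exposes the ranking head and tokenizer;\n",
  "# the query and passages below are illustrative only.\n",
  "model_name = \"cross-encoder/ms-marco-TinyBERT-L-2\"\n",
  "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
  "model = AutoModelForSequenceClassification.from_pretrained(model_name)\n",
  "model.eval()\n",
  "\n",
  "query = \"How many people live in Berlin?\"\n",
  "passages = [\n",
  "    \"Berlin has a population of roughly 3.6 million registered inhabitants.\",\n",
  "    \"New York City is famous for the Metropolitan Museum of Art.\",\n",
  "]\n",
  "\n",
  "scores = []\n",
  "with paddle.no_grad():\n",
  "    for passage in passages:\n",
  "        encoded = tokenizer(query, text_pair=passage)\n",
  "        logits = model(\n",
  "            paddle.to_tensor([encoded[\"input_ids\"]]),\n",
  "            token_type_ids=paddle.to_tensor([encoded[\"token_type_ids\"]]),\n",
  "        )\n",
  "        # One relevance logit per (query, passage) pair.\n",
  "        scores.append(float(logits.reshape([-1])[0]))\n",
  "\n",
  "# Higher score means more relevant; sort passages in decreasing order.\n",
  "for score, passage in sorted(zip(scores, passages), reverse=True):\n",
  "    print(f\"{score:.4f}\\t{passage}\")"
 ]
},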
{
"cell_type": "markdown",
"id": "e185e8d7",
"metadata": {},
"source": [
"## Performance\n",
"In the following table, we provide various pre-trained Cross-Encoders together with their performance on the [TREC Deep Learning 2019](https://microsoft.github.io/TREC-2019-Deep-Learning/) and the [MS Marco Passage Reranking](https://github.com/microsoft/MSMARCO-Passage-Ranking/) dataset.\n"
]
},
{
"cell_type": "markdown",
"id": "1b6ce4a0",
"metadata": {},
"source": [
"| Model-Name | NDCG@10 (TREC DL 19) | MRR@10 (MS Marco Dev) | Docs / Sec |\n",
"| ------------- |:-------------| -----| --- |\n",
"| **Version 2 models** | | |\n",
"| cross-encoder/ms-marco-TinyBERT-L-2-v2 | 69.84 | 32.56 | 9000\n",
"| cross-encoder/ms-marco-MiniLM-L-2-v2 | 71.01 | 34.85 | 4100\n",
"| cross-encoder/ms-marco-MiniLM-L-4-v2 | 73.04 | 37.70 | 2500\n",
"| cross-encoder/ms-marco-MiniLM-L-6-v2 | 74.30 | 39.01 | 1800\n",
"| cross-encoder/ms-marco-MiniLM-L-12-v2 | 74.31 | 39.02 | 960\n",
"| **Version 1 models** | | |\n",
"| cross-encoder/ms-marco-TinyBERT-L-2 | 67.43 | 30.15 | 9000\n",
"| cross-encoder/ms-marco-TinyBERT-L-4 | 68.09 | 34.50 | 2900\n",
"| cross-encoder/ms-marco-TinyBERT-L-6 | 69.57 | 36.13 | 680\n",
"| cross-encoder/ms-marco-electra-base | 71.99 | 36.41 | 340\n",
"| **Other models** | | |\n",
"| nboost/pt-tinybert-msmarco | 63.63 | 28.80 | 2900\n",
"| nboost/pt-bert-base-uncased-msmarco | 70.94 | 34.75 | 340\n",
"| nboost/pt-bert-large-msmarco | 73.36 | 36.48 | 100\n",
"| Capreolus/electra-base-msmarco | 71.23 | 36.89 | 340\n",
"| amberoad/bert-multilingual-passage-reranking-msmarco | 68.40 | 35.54 | 330\n",
"| sebastian-hofstaetter/distilbert-cat-margin_mse-T2-msmarco | 72.82 | 37.88 | 720\n"
]
},
{
"cell_type": "markdown",
"id": "478f9bd9",
"metadata": {},
"source": [
"Note: Runtime was computed on a V100 GPU.\n",
"> 此模型介绍及权重来源于[https://huggingface.co/cross-encoder/ms-marco-TinyBERT-L-2](https://huggingface.co/cross-encoder/ms-marco-TinyBERT-L-2),并转换为飞桨模型格式。\n"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}
Datasets: multi_nli,snli
Example: null
IfOnlineDemo: 0
IfTraining: 0
Language: English
License: apache-2.0
Model_Info:
name: "cross-encoder/nli-MiniLM2-L6-H768"
description: "Cross-Encoder for Natural Language Inference"
description_en: "Cross-Encoder for Natural Language Inference"
icon: ""
from_repo: "https://huggingface.co/cross-encoder/nli-MiniLM2-L6-H768"
description: Cross-Encoder for Natural Language Inference
description_en: Cross-Encoder for Natural Language Inference
from_repo: https://huggingface.co/cross-encoder/nli-MiniLM2-L6-H768
icon: https://paddlenlp.bj.bcebos.com/models/community/transformer-layer.png
name: cross-encoder/nli-MiniLM2-L6-H768
Paper: null
Publisher: cross-encoder
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Zero-Shot Classification"
sub_tag: "零样本分类"
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Text Classification"
sub_tag: "文本分类"
Example:
Datasets: "multi_nli,snli"
Publisher: "cross-encoder"
License: "apache-2.0"
Language: "English"
Paper:
IfTraining: 0
IfOnlineDemo: 0
\ No newline at end of file
- sub_tag: 零样本分类
sub_tag_en: Zero-Shot Classification
tag: 自然语言处理
tag_en: Natural Language Processing
- sub_tag: 文本分类
sub_tag_en: Text Classification
tag: 自然语言处理
tag_en: Natural Language Processing