提交 41b030e2 编写于 作者: C chenxuyi

update readme for distill

上级 d3310dbf
......@@ -16,8 +16,12 @@ English | [简体中文](./README.zh.md)
* [IR Relevance Task](#ir-relevance-task)
* [ERNIE 1.0: <strong>E</strong>nhanced <strong>R</strong>epresentation through k<strong>N</strong>owledge <strong>I</strong>nt<strong>E</strong>gration](#ernie-10-enhanced-representation-through-knowledge-integration)
* [Compare the ERNIE 1.0 and ERNIE 2.0](#compare-the-ernie-10-and-ernie-20)
* [Results](#results)
* [Results on English Datasets](#results-on-english-datasets)
* [Results on Chinese Datasets](#results-on-chinese-datasets)
* [Release Notes](#release-notes)
* [Communication](#communication)
* [Usage](#usage)
![ernie2.0_paper](.metas/ernie2.0_paper.png)
......@@ -109,21 +113,6 @@ Integrating both phrase information and named entity information enables the mod
| **Structure-aware** | | ✅ Sentence Reordering | ✅ Sentence Reordering <br> ✅ Sentence Distance |
| **Semantic-aware** | ✅ Next Sentence Prediction | ✅ Discourse Relation | ✅ Discourse Relation <br> ✅ IR Relevance |
## Release Notes
- Aug 21, 2019: featuers update: fp16 finetuning, multiprocess finetining.
- July 30, 2019: release ERNIE 2.0
- Apr 10, 2019: update ERNIE_stable-1.0.1.tar.gz, update config and vocab
- Mar 18, 2019: update ERNIE_stable.tgz
- Mar 15, 2019: release ERNIE 1.0
## Communication
- [Github Issues](https://github.com/PaddlePaddle/ERNIE/issues): bug reports, feature requests, install issues, usage issues, etc.
- QQ discussion group: 760439550 (ERNIE discussion group).
- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.
## Results
......@@ -626,6 +615,21 @@ LCQMC is a Chinese question semantic matching corpus published in COLING2018. [u
BQ Corpus (Bank Question corpus) is a Chinese corpus for sentence semantic equivalence identification. This dataset was published in EMNLP 2018. [url: https://www.aclweb.org/anthology/D18-1536]
```
## Release Notes
- Aug 21, 2019: featuers update: fp16 finetuning, multiprocess finetining.
- July 30, 2019: release ERNIE 2.0
- Apr 10, 2019: update ERNIE_stable-1.0.1.tar.gz, update config and vocab
- Mar 18, 2019: update ERNIE_stable.tgz
- Mar 15, 2019: release ERNIE 1.0
## Communication
- [Github Issues](https://github.com/PaddlePaddle/ERNIE/issues): bug reports, feature requests, install issues, usage issues, etc.
- QQ discussion group: 760439550 (ERNIE discussion group).
- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.
## Usage
* [Install PaddlePaddle](#install-paddlepaddle)
......@@ -645,7 +649,8 @@ BQ Corpus (Bank Question corpus) is a Chinese corpus for sentence semantic equiv
* [Machine Reading Comprehension](#machine-reading-comprehension)
* [Pre-training with ERNIE 1.0](#pre-training-with-ernie-10)
* [Data Preprocessing](#data-preprocessing)
* [PreTrain ERNIE1.0](#pretrain-ernie10)
* [Pretrain ERNIE1.0](#pretrain-ernie10)
* [Distillation](#distillation)
* [FAQ](#faq)
* [FAQ1: How to get sentence/tokens embedding of ERNIE?](#faq1-how-to-get-sentencetokens-embedding-of-ernie)
* [FAQ2: How to predict on new data with Fine-tuning model?](#faq2-how-to-predict-on-new-data-with-fine-tuning-model)
......@@ -654,7 +659,7 @@ BQ Corpus (Bank Question corpus) is a Chinese corpus for sentence semantic equiv
* [FAQ5: Can not find library: libnccl.so. Please try to add the lib path to LD_LIBRARY_PATH.](#faq5-can-not-find-library-libncclso-please-try-to-add-the-lib-path-to-ld_library_path)
## Install PaddlePaddle
### Install PaddlePaddle
This code base has been tested with Paddle Fluid 1.5.1 under Python2.
......@@ -671,11 +676,15 @@ If you have been armed with certain level of deep learning knowledge, and it hap
For more information about paddlepadde, Please refer to [PaddlePaddle Github](https://github.com/PaddlePaddle/Paddle) or [Official Website](https://www.paddlepaddle.org.cn/) for details.
Other dependency of ERNIE is listed in `requirements.txt`, you can install it by
```script
pip install -r requirements.txt
```
## Pre-trained Models & Datasets
### Pre-trained Models & Datasets
### Models
#### Models
| Model | Description |
| :------------------------------------------------- | :----------------------------------------------------------- |
......@@ -685,23 +694,23 @@ For more information about paddlepadde, Please refer to [PaddlePaddle Github](ht
| [ERNIE 2.0 Base for English](https://ernie.bj.bcebos.com/ERNIE_Base_en_stable-2.0.0.tar.gz) | with params, config and vocabs |
| [ERNIE 2.0 Large for English](https://ernie.bj.bcebos.com/ERNIE_Large_en_stable-2.0.0.tar.gz) | with params, config and vocabs |
### Datasets
#### Datasets
#### English Datasets
##### English Datasets
Download the [GLUE data](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it to some directory `${TASK_DATA_PATH}`
After the dataset is downloaded, you should run `sh ./script/en_glue/preprocess/cvt.sh $TASK_DATA_PATH` to convert the data format for training. If everything goes well, there will be a folder named `glue_data_processed` created with all the converted datas in it.
#### Chinese Datasets
##### Chinese Datasets
You can download Chinese Datasets from [here](https://ernie.bj.bcebos.com/task_data_zh.tgz)
## Fine-tuning
#### Fine-tuning
### Batchsize and GPU Settings
##### Batchsize and GPU Settings
In our experiments, we found that the batch size is important for different tasks. For users can more easily reproducing results, we list the batch size and gpu cards here:
......@@ -728,7 +737,7 @@ In our experiments, we found that the batch size is important for different task
\* *For MNLI, QNLI,we used 32GB V100, for other tasks we used 22GB P40*
### Multiprocessing and fp16 auto mix-precision finetune
#### Multiprocessing and fp16 auto mix-precision finetune
multiprocessing finetuning can be simply enabled with `finetune_launch.py` in your finetune script.
with multiprocessing finetune paddle can fully utilize your CPU/GPU capacity to accelerate finetuning.
......@@ -738,9 +747,9 @@ fp16 finetuning can be simply enable by specifing `--use_fp16 true` in your trai
dynamic loss scale is used to avoid gradient vanish.
### Classification
#### Classification
#### Single Sentence Classification Tasks
##### Single Sentence Classification Tasks
The code used to perform classification/regression finetuning is in `run_classifier.py`, we also provide the shell scripts for each task including best hyperpameters.
......@@ -798,7 +807,7 @@ Similarly, for the Chinese task `ChnSentCorp`, after setting the environment var
#### Sentence Pair Classification Tasks
##### Sentence Pair Classification Tasks
Take `RTE` as an example, the data should have 3 fields `text_a text_b label` with tsv format. Here is some example datas:
```
......@@ -834,9 +843,9 @@ testing ./data/test.tsv, save to output/test_out.5.2019-07-23-15-25-06.tsv.4.781
### Sequence Labeling
#### Sequence Labeling
#### Named Entity Recognition
##### Named Entity Recognition
Take `MSRA-NER(SIGHAN2006)` as an example, the data should have 2 fields, `text_a label`, with tsv format. Here is some example datas :
```
......@@ -853,7 +862,7 @@ Also, remember to set environmental variables like above, and run `sh script/zh_
[test evaluation] f1: 0.937390, precision: 0.925988, recall: 0.949077, elapsed time: 36.565929 s
```
### Machine Reading Comprehension
#### Machine Reading Comprehension
Take `DRCD` as an example, convert the data into SQUAD format firstly:
......@@ -896,9 +905,9 @@ Also, remember to set environmental variables like above, and run `sh script/zh_
```
## Pre-training with ERNIE 1.0
### Pre-training with ERNIE 1.0
### Data Preprocessing
#### Data Preprocessing
We construct the training dataset based on [Baidu Baike](https://en.wikipedia.org/wiki/Baidu_Baike), [Baidu Knows(Baidu Zhidao)](https://en.wikipedia.org/wiki/Baidu_Knows), [Baidu Tieba](https://en.wikipedia.org/wiki/Baidu_Tieba) for Chinese version ERNIE, and [Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Database_download), [Reddit](https://en.wikipedia.org/wiki/Reddit), [BookCorpus](https://github.com/soskek/bookcorpus) for English version ERNIE.
......@@ -912,7 +921,7 @@ Here are some train instances after processing (which can be found in [`data/dem
Each instance is composed of 5 fields, which are joined by `;`in one line, represented `token_ids; sentence_type_ids; position_ids; seg_labels; next_sentence_label` respectively. Especially, in the field`seg_labels`, 0 means the begin of one word, 1 means non-begin of one word, -1 means placeholder, the other number means `CLS` or `SEP`.
### PreTrain ERNIE 1.0
#### Pretrain ERNIE 1.0
The start entry for pretrain is [`script/zh_task/pretrain.sh`](./script/zh_task/pretrain.sh). Before we run the train program, remember to set CUDA、cuDNN、NCCL2 etc. in the environment variable LD_LIBRARY_PATH.
......@@ -932,10 +941,15 @@ epoch: 1, progress: 1/1, step: 50, loss: 10.360563, ppl: 16398.287109, next_sent
```
### Distillation
ERNIE provide a toolkit for data distillation to further accelerate your ineference, see <a href="./distill/README.md">here</a> for detail
## FAQ
### FAQ
### FAQ1: How to get sentence/tokens embedding of ERNIE?
#### FAQ1: How to get sentence/tokens embedding of ERNIE?
Run ```ernie_encoder.py ``` we can get the both sentence embedding and tokens embeddings. The input data format should be same as that mentioned in chapter [Fine-tuning](#fine-tuning).
......@@ -960,7 +974,7 @@ when finished running this script, `cls_emb.npy` and `top_layer_emb.npy `will b
### FAQ2: How to predict on new data with Fine-tuning model?
#### FAQ2: How to predict on new data with Fine-tuning model?
Take classification tasks for example, here is the script for batch prediction:
......@@ -984,18 +998,18 @@ Argument `init_checkpoint` is the path of the model, `predict_set` is the path
### FAQ3: Is the argument batch_size for one GPU card or for all GPU cards?
#### FAQ3: Is the argument batch_size for one GPU card or for all GPU cards?
For one GPU card.
### FAQ4: Can not find library: libcudnn.so. Please try to add the lib path to LD_LIBRARY_PATH.
#### FAQ4: Can not find library: libcudnn.so. Please try to add the lib path to LD_LIBRARY_PATH.
Export the path of cuda to LD_LIBRARY_PATH, e.g.: `export LD_LIBRARY_PATH=/home/work/cudnn/cudnn_v[your cudnn version]/cuda/lib64`
### FAQ5: Can not find library: libnccl.so. Please try to add the lib path to LD_LIBRARY_PATH.
#### FAQ5: Can not find library: libnccl.so. Please try to add the lib path to LD_LIBRARY_PATH.
Download [NCCL2](https://developer.nvidia.com/nccl/nccl-download), and export the library path to LD_LIBRARY_PATH, e.g.:`export LD_LIBRARY_PATH=/home/work/nccl/lib`
......@@ -16,8 +16,12 @@
* [IR Relevance Task](#ir-relevance-task)
* [ERNIE 1.0: <strong>E</strong>nhanced <strong>R</strong>epresentation through k<strong>N</strong>owledge <strong>I</strong>nt<strong>E</strong>gration](#ernie-10-enhanced-representation-through-knowledge-integration)
* [对比 ERNIE 1.0 和 ERNIE 2.0](#对比-ernie-10-和-ernie-20)
* [效果验证](#效果验证)
* [中文效果验证](#中文效果验证)
* [英文效果验证](#英文效果验证)
* [开源记录](#开源记录)
* [技术交流](#技术交流)
* [使用](#使用)
![ernie2.0_paper](.metas/ernie2.0_paper.png)
......@@ -105,26 +109,16 @@
| **Semantic-aware** | ✅ Next Sentence Prediction | ✅ Discourse Relation | ✅ Discourse Relation <br> ✅ IR Relevance |
## 开源记录
- 2019-07-30 发布 ERNIE 2.0
- 2019-04-10 更新: update ERNIE_stable-1.0.1.tar.gz, 将模型参数、配置 ernie_config.json、vocab.txt 打包发布
- 2019-03-18 更新: update ERNIE_stable.tgz
- 2019-03-15 发布 ERNIE 1.0
## 技术交流
- [Github Issues](https://github.com/PaddlePaddle/ERNIE/issues): bug reports, feature requests, install issues, usage issues, etc.
- ERNIE QQ 群: 760439550 (ERNIE discussion group).
- [论坛](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.
## 效果验证
## 中文效果验证
### 中文效果验证
我们在 9 个任务上验证 ERNIE 2.0 中文模型的效果。这些任务包括:自然语言推断任务 XNLI;阅读理解任务 DRCD、DuReader、CMRC2018;命名实体识别任务 MSRA-NER (SIGHAN2006);情感分析任务 ChnSentiCorp;语义相似度任务 BQ Corpus、LCQMC;问答任务 NLPCC2016-DBQA 。任务的详情和效果会在如下章节中介绍。
### 自然语言推断任务
#### 自然语言推断任务
<table>
<tbody>
......@@ -189,7 +183,7 @@
XNLI 是由 Facebook 和纽约大学的研究者联合构建的自然语言推断数据集,包括 15 种语言的数据。我们用其中的中文数据来评估模型的语言理解能力。[链接: https://github.com/facebookresearch/XNLI]
```
### 阅读理解任务
#### 阅读理解任务
<table>
<tbody>
......@@ -318,9 +312,7 @@ CMRC2018 是中文信息学会举办的评测,评测的任务是抽取类阅
DRCD 是台达研究院发布的繁体中文阅读理解数据集,目标是从篇章中抽取出连续片段作为答案。我们在实验时先将其转换成简体中文。[链接: https://github.com/DRCKnowledgeTeam/DRCD]
```
### 命名实体识别任务
#### 命名实体识别任务
<table>
<tbody>
......@@ -377,9 +369,7 @@ DRCD 是台达研究院发布的繁体中文阅读理解数据集,目标是从
MSRA-NER (SIGHAN2006) 数据集由微软亚研院发布,其目标是识别文本中具有特定意义的实体,包括人名、地名、机构名。
```
### 情感分析任务
#### 情感分析任务
<table>
<tbody>
......@@ -436,9 +426,7 @@ MSRA-NER (SIGHAN2006) 数据集由微软亚研院发布,其目标是识别文
ChnSentiCorp 是一个中文情感分析数据集,包含酒店、笔记本电脑和书籍的网购评论。
```
### 问答任务
#### 问答任务
<table>
<tbody>
......@@ -512,9 +500,7 @@ ChnSentiCorp 是一个中文情感分析数据集,包含酒店、笔记本电
NLPCC2016-DBQA 是由国际自然语言处理和中文计算会议 NLPCC 于 2016 年举办的评测任务,其目标是从候选中找到合适的文档作为问题的答案。[链接: http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf]
```
### 语义相似度
#### 语义相似度
<table>
<tbody>
......@@ -597,14 +583,14 @@ BQ Corpus 是在自然语言处理国际顶会 EMNLP 2018 发布的语义匹配
## 英文效果验证
### 英文效果验证
ERNIE 2.0 的英文效果验证在 GLUE 上进行。GLUE 评测的官方地址为 https://gluebenchmark.com/ ,该评测涵盖了不同类型任务的 10 个数据集,其中包含 11 个测试集,涉及到 Accuracy, F1-score, Spearman Corr,. Pearson Corr,. Matthew Corr., 5 类指标。GLUE 排行榜使用每个数据集的平均分作为总体得分,并以此为依据将不同算法进行排名。
### GLUE - 验证集结果
#### GLUE - 验证集结果
| <strong>数据集</strong> | <strong>CoLA</strong> | <strong>SST-2</strong> | <strong>MRPC</strong> | <strong>STS-B</strong> | <strong>QQP</strong> | <strong>MNLI-m</strong> | <strong>QNLI</strong> | <strong>RTE</strong> |
| ----------- | ---- | ----- | ---- | ----- | ---- | ---- | ---- | ---- |
......@@ -617,7 +603,7 @@ ERNIE 2.0 的英文效果验证在 GLUE 上进行。GLUE 评测的官方地址
### GLUE - 测试集结果
#### GLUE - 测试集结果
| <strong>数据集</strong> | - | <strong>CoLA</strong> | <strong>SST-2</strong> | <strong>MRPC</strong> | <strong>STS-B</strong> | <strong>QQP</strong> | <strong>MNLI-m</strong> | <strong>MNLI-mm</strong> | <strong>QNLI</strong> | <strong>RTE</strong> | <strong>WNLI</strong> |<strong>AX</strong>|
| ----------- | ----- | ---- | ----- | ---- | ----- | ---- | ------ | ------- | ---- | ---- | ---- | ---- |
......@@ -631,6 +617,19 @@ ERNIE 2.0 的英文效果验证在 GLUE 上进行。GLUE 评测的官方地址
由于 XLNet 暂未公布 GLUE 测试集上的单模型结果,所以我们只与 BERT 进行单模型比较。上表为ERNIE 2.0 单模型在 GLUE 测试集的表现结果。
## 开源记录
- 2019-07-30 发布 ERNIE 2.0
- 2019-04-10 更新: update ERNIE_stable-1.0.1.tar.gz, 将模型参数、配置 ernie_config.json、vocab.txt 打包发布
- 2019-03-18 更新: update ERNIE_stable.tgz
- 2019-03-15 发布 ERNIE 1.0
## 技术交流
- [Github Issues](https://github.com/PaddlePaddle/ERNIE/issues): bug reports, feature requests, install issues, usage issues, etc.
- ERNIE QQ 群: 760439550 (ERNIE discussion group).
- [论坛](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.
## 使用
* [PaddlePaddle 安装](#paddlepaddle安装)
* [模型&amp;数据](#模型数据)
......@@ -650,6 +649,10 @@ ERNIE 2.0 的英文效果验证在 GLUE 上进行。GLUE 评测的官方地址
* [预训练 (ERNIE 1.0)](#预训练-ernie-10)
* [数据预处理](#数据预处理)
* [开始训练](#开始训练)
* [蒸馏](#蒸馏)
* [上线](#上线)
* [生成inference_model](#生成inference_model)
* [在线预测](#在线预测)
* [FAQ](#faq)
* [FAQ1: 如何获取输入句子/词经过 ERNIE 编码后的 Embedding 表示?](#faq1-如何获取输入句子词经过-ernie-编码后的-embedding-表示)
* [FAQ2: 如何利用 Fine-tuning 得到的模型对新数据进行批量预测?](#faq2-如何利用-fine-tuning-得到的模型对新数据进行批量预测)
......@@ -672,6 +675,11 @@ ERNIE 2.0 的英文效果验证在 GLUE 上进行。GLUE 评测的官方地址
> - [训练神经网络](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/training/index_cn.html):介绍如何使用 Fluid 进行单机训练、多机训练、以及保存和载入模型变量
> - [模型评估与调试](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/evaluation_and_debugging/index_cn.html):介绍在 Fluid 下进行模型评估和调试的方法
ERNIE的其他依赖列在`requirements.txt`文件中,使用以下命令安装
```script
pip install -r requirements.txt
```
## 模型&数据
......@@ -918,6 +926,36 @@ epoch: 1, progress: 1/1, step: 50, loss: 10.360563, ppl: 16398.287109, next_sent
如果用自定义的真实数据进行训练,请参照[`script/zh_task/pretrain.sh`](./script/zh_task/pretrain.sh)脚本对参数做相应修改。
## 蒸馏
ERNIE提供了通过数据蒸馏从而达到模型压缩、加速的开发套件,具体开发流程请参考 <a href="./distill/README.md">这里</a>
## 上线
完成finetune之后只需几步操作即可生成inference\_model, PaddlePaddle可以在生产环境中加载生成的预测模型并进行高效地预测。
### 生成inference\_model
运行`classify_infer.py`或者`predict_classifier.py` 脚本时通过指定 `--save_inference_model_path` 便可生成 inference_model 到指定位置。
如果您采用 `propeller` 完成finetune,则 `BestInferenceExporter` 会在finetune过程中根据预测指标,挑最好的模型生成 inference_model . 使用 `propeller` 完成finetune的流程请参考 `propeller_xnli_demo.ipynb`
### 在线预测
随后您可以使用[PaddleInference C++ API](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_usage/deploy/inference/native_infer.html)将模型的前向预测代码联编到您的生产环境中。或者您可以使用我们为您构建好的python预测引擎来完成一个简单的服务。只需将本代码库中的 `./propeller` 文件夹放入您的 `PYTHONPATH` 中并执行如下指令,便可以开启一个propeller server:
```script
python -m propeller.tools.start_server -m /path/to/saved/model -p 8888
```
您可以在python脚本很方便地调用propeller server:
```python
from propeller.service.client import InferenceClient
client = InferenceClient('tcp://localhost:8113')
result = client(sentence_id, position_id, token_type_id, input_mask)
```
`client`的请求参数类型是numpy array,对应了save_inference_model时指定的输入tensor. 如果是使用`classify_infer.py` 生成的inference_model则请求参数有四个:(sentence_id, position_id, token_type_id, input_mask)。 如果是`propeller` 生成的inference_model, client的请求参数对应您`eval_dataset` 的元素类型。
## FAQ
### FAQ1: 如何获取输入句子/词经过 ERNIE 编码后的 Embedding 表示?
......@@ -981,3 +1019,4 @@ python -u predict_classifier.py \
### FAQ5: Can not find library: libnccl.so. Please try to add the lib path to LD_LIBRARY_PATH.
需要先下载 [NCCL](https://developer.nvidia.com/nccl/nccl-download),然后在 LD_LIBRARY_PATH 中添加 NCCL 库的路径,如`export LD_LIBRARY_PATH=/home/work/nccl/lib`
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册