*.pyc
*.un~
*.swp
*.egg-info/
export FLAGS_enable_parallel_graph=1
export FLAGS_sync_nccl_allreduce=1
BERT_BASE_PATH="chinese_L-12_H-768_A-12"
TASK_NAME='xnli'
DATA_PATH=data/xnli/XNLI-MT-1.0
CKPT_PATH=pretrain_model
train(){
python -u run_classifier.py --task_name ${TASK_NAME} \
--use_cuda true \
--do_train true \
--do_val false \
--do_test false \
--batch_size 8192 \
--in_tokens true \
--init_checkpoint pretrain_model/chinese_L-12_H-768_A-12/ \
--data_dir ${DATA_PATH} \
--vocab_path pretrain_model/chinese_L-12_H-768_A-12/vocab.txt \
--checkpoints ${CKPT_PATH} \
--save_steps 1000 \
--weight_decay 0.01 \
--warmup_proportion 0.0 \
--validation_steps 25 \
--epoch 1 \
--max_seq_len 512 \
--bert_config_path pretrain_model/chinese_L-12_H-768_A-12/bert_config.json \
--learning_rate 1e-4 \
--skip_steps 10 \
--random_seed 100 \
--enable_ce \
--shuffle false
}
export CUDA_VISIBLE_DEVICES=0
train | python _ce.py
export CUDA_VISIBLE_DEVICES=0,1,2,3
train | python _ce.py
Hi!
This directory has been deprecated.
Please visit the project at [models/PaddleNLP/language_representations_kit/BERT](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/language_representations_kit/BERT).
train() {
python train.py \
--train_path='data/train/sentence_file_*' \
--test_path='data/dev/sentence_file_*' \
--vocab_path data/vocabulary_min5k.txt \
--learning_rate 0.2 \
--use_gpu True \
--all_train_tokens 35479 \
--max_epoch 10 \
--log_interval 5 \
--dev_interval 20 \
--local True $@ \
--enable_ce \
--shuffle false \
--random_seed 100
}
export CUDA_VISIBLE_DEVICES=0
train | python _ce.py
export CUDA_VISIBLE_DEVICES=0,1,2,3
train | python _ce.py
Hi!
This directory has been deprecated.
Please visit the project at [models/PaddleNLP/language_representations_kit/ELMo](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/language_representations_kit/ELMo).
<div align="center">
<h1>
<font color="red">
The ERNIE project has moved <a href="../README.zh.md">here</a>
</font>
</h1>
</div>
## ERNIE: **E**nhanced **R**epresentation through k**N**owledge **I**nt**E**gration
**2019-04-10 update**: released ERNIE_stable-1.0.1.tar.gz, which packages the model parameters together with the config ernie_config.json and vocab.txt
**2019-03-18 update**: released ERNIE_stable.tgz
**ERNIE** learns real-world semantic knowledge by modeling the words, entities, and entity relations in massive data. Compared with **BERT**, which learns from the raw language signal, **ERNIE** directly models prior semantic knowledge units, which strengthens the model's semantic representation ability.
Here is an example:
```Learnt by BERT :哈 [mask] 滨是 [mask] 龙江的省会,[mask] 际冰 [mask] 文化名城。```
```Learnt by ERNIE:[mask] [mask] [mask] 是黑龙江的省会,国际 [mask] [mask] 文化名城。```
In **BERT**, the character 『尔』 can be inferred from the local co-occurrence of 『哈』 and 『滨』 alone, so the model learns nothing about the entity 『哈尔滨』 (Harbin). **ERNIE**, by learning representations of words and entities, can model the relation between 『哈尔滨』 and 『黑龙江』 (Heilongjiang), learning that 『哈尔滨』 is the capital of 『黑龙江』 and that 『哈尔滨』 is a city of ice and snow.
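To make the contrast concrete, here is a minimal sketch of the two masking styles (our own illustration, not the actual pre-training code; the mask rate and the example spans are hypothetical):
```python
import random

def char_level_mask(tokens, mask_rate=0.15):
    """BERT-style: mask individual characters independently."""
    return [('[MASK]' if random.random() < mask_rate else t) for t in tokens]

def entity_level_mask(tokens, spans, mask_rate=0.15):
    """ERNIE-style: mask whole word/entity spans given (start, end) boundaries."""
    out = list(tokens)
    for start, end in spans:  # boundaries come from word/entity segmentation
        if random.random() < mask_rate:
            for i in range(start, end):
                out[i] = '[MASK]'
    return out

tokens = list("哈尔滨是黑龙江的省会")
spans = [(0, 3), (3, 4), (4, 7), (7, 8), (8, 10)]  # 哈尔滨 | 是 | 黑龙江 | 的 | 省会
print(char_level_mask(tokens))
print(entity_level_mask(tokens, spans))
```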
As for training data, besides encyclopedia and news corpora, **ERNIE** also introduces forum dialogue data. It uses a **DLM** (Dialogue Language Model) to model the query-response dialogue structure: dialogue pairs are taken as input, a Dialogue Embedding marks the roles in the dialogue, and a Dialogue Response Loss learns the implicit relations within the dialogue, further improving the model's semantic representation ability.
We validated the model on 5 public Chinese datasets covering natural language inference, semantic similarity, named entity recognition, sentiment analysis, and question-answer matching; **ERNIE** achieves better results than **BERT** on all of them.
<table>
<tbody>
<tr>
<th><strong>Dataset</strong>
<br></th>
<th colspan="2"><strong>XNLI</strong></th>
<th colspan="2"><strong>LCQMC</strong></th>
<th colspan="2"><strong>MSRA-NER(SIGHAN 2006)</strong></th>
<th colspan="2"><strong>ChnSentiCorp</strong></th>
<th colspan="4"><strong>nlpcc-dbqa</strong></th></tr>
<tr>
<td rowspan="2">
<p>
<strong>Evaluation</strong></p>
<p>
<strong>Metric</strong>
<br></p>
</td>
<td colspan="2">
<strong>acc</strong>
<br></td>
<td colspan="2">
<strong>acc</strong>
<br></td>
<td colspan="2">
<strong>f1-score</strong>
<br></td>
<td colspan="2">
<strong>acc</strong>
<br></td>
<td colspan="2">
<strong>mrr</strong>
<br></td>
<td colspan="2">
<strong>f1-score</strong>
<br></td>
</tr>
<tr>
<th colspan="1" width="">
<strong>dev</strong>
<br></th>
<td colspan="1" width="">
<strong>test</strong>
<br></td>
<td colspan="1" width="">
<strong>dev</strong>
<br></td>
<td colspan="1" width="">
<strong>test</strong>
<br></td>
<td colspan="1" width="">
<strong>dev</strong>
<br></td>
<td colspan="1" width="">
<strong>test</strong>
<br></td>
<td colspan="1" width="">
<strong>dev</strong>
<br></td>
<td colspan="1" width="">
<strong>test</strong>
<br></td>
<td colspan="1" width="">
<strong>dev</strong>
<br></td>
<td colspan="1" width="">
<strong>test</strong>
<br></td>
<td colspan="1" width="">
<strong>dev</strong>
<br></td>
<td colspan="1" width="">
<strong>test</strong>
<br></td>
</tr>
<tr>
<td>
<strong>BERT
<br></strong></td>
<td>78.1</td>
<td>77.2</td>
<td>88.8</td>
<td>87.0</td>
<td>94.0
<br></td>
<td>
<span>92.6</span></td>
<td>94.6</td>
<td>94.3</td>
<td colspan="1">94.7</td>
<td colspan="1">94.6</td>
<td colspan="1">80.7</td>
<td colspan="1">80.8</td></tr>
<tr>
<td>
<strong>ERNIE
<br></strong></td>
<td>79.9 <span>(<strong>+1.8</strong>)</span></td>
<td>78.4 <span>(<strong>+1.2</strong>)</span></td>
<td>89.7 <span>(<strong>+0.9</strong>)</span></td>
<td>87.4 <span>(<strong>+0.4</strong>)</span></td>
<td>95.0 <span>(<strong>+1.0</strong>)</span></td>
<td>93.8 <span>(<strong>+1.2</strong>)</span></td>
<td>95.2 <span>(<strong>+0.6</strong>)</span></td>
<td>95.4 <span>(<strong>+1.1</strong>)</span></td>
<td colspan="1">95.0 <span>(<strong>+0.3</strong>)</span></td>
<td colspan="1">95.1 <span>(<strong>+0.5</strong>)</span></td>
<td colspan="1">82.3 <span>(<strong>+1.6</strong>)</span></td>
<td colspan="1">82.7 <span>(<strong>+1.9</strong>)</span></td></tr>
</tbody>
</table>
- **Natural Language Inference**: XNLI
```text
XNLI was jointly built by Facebook and New York University researchers to evaluate a model's multilingual sentence understanding. The task is to judge the relation between two sentences (contradiction, neutral, entailment). [url: https://github.com/facebookresearch/XNLI]
```
- **Semantic Similarity**: LCQMC
```text
LCQMC is a question matching dataset built by Harbin Institute of Technology and published at COLING 2018. The task is to judge whether two questions have the same meaning. [url: http://aclweb.org/anthology/C18-1166]
```
- **Named Entity Recognition**: MSRA-NER (SIGHAN 2006)
```text
The MSRA-NER (SIGHAN 2006) dataset was released by Microsoft Research Asia. The task is named entity recognition: identifying text spans with specific meaning, mainly person, location, and organization names.
```
- **Sentiment Analysis**: ChnSentiCorp
```text
ChnSentiCorp is a Chinese sentiment analysis dataset. The task is to judge the sentiment of a passage.
```
- **Retrieval Question Answering**: nlpcc-dbqa
```text
nlpcc-dbqa is an evaluation task organized in 2016 by NLPCC (the International Conference on Natural Language Processing and Chinese Computing). The task is to select the answers that actually answer a question. [url: http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf]
```
### Models & Data
1) Pre-trained model download
| Model | Description |
| :------| :------ |
| [model](https://ernie.bj.bcebos.com/ERNIE_stable.tgz) | pre-trained model parameters |
| [model (with config and vocab)](https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz) | pre-trained model parameters, vocabulary vocab.txt, and model config ernie_config.json |
2) [Task data download](https://ernie.bj.bcebos.com/task_data_zh.tgz)
### Installation
This project depends on Paddle Fluid 1.3.1; see the [installation guide](http://www.paddlepaddle.org/#quick-start).
**Note**: pre-training and finetuning were tested on P40 GPUs with 22 GB of memory; with less than 22 GB, some tasks may fail with out-of-memory errors.
### Pre-training
#### Data Preprocessing
Sentence-pair data with contextual relations is constructed from encyclopedia, news, and forum dialogue data. Baidu's internal lexical analysis tool segments the sentence pairs at character, word, and entity granularity; the CharTokenizer in [`tokenization.py`](tokenization.py) then tokenizes the segmented data, yielding the plain-text token sequence together with the segmentation boundaries; finally the plain text is mapped to ids with the vocabulary [`config/vocab.txt`](config/vocab.txt). During training, consecutive tokens are randomly masked according to the segmentation boundaries.
We provide part of the id-mapped training data, [`data/demo_train_set.gz`](./data/demo_train_set.gz), and validation data, [`data/demo_valid_set.gz`](./data/demo_valid_set.gz). Each line is one training sample, for example:
```
1 1048 492 1333 1361 1051 326 2508 5 1803 1827 98 164 133 2777 2696 983 121 4 19 9 634 551 844 85 14 2476 1895 33 13 983 121 23 7 1093 24 46 660 12043 2 1263 6 328 33 121 126 398 276 315 5 63 44 35 25 12043 2;0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1;0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55;-1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 -1 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 -1;0
```
Each sample consists of 5 fields separated by '`;`', with format `token_ids; sentence_type_ids; position_ids; seg_labels; next_sentence_label`. Here `seg_labels` encodes the segmentation boundaries: 0 marks the first token of a word, 1 a non-initial token, and -1 is a placeholder whose corresponding token is `CLS` or `SEP`.
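As a minimal sketch (the helper below is ours, not part of the repo), one such line can be unpacked into its fields like this:
```python
import gzip

def parse_sample(line):
    """Split one pre-training sample into its 5 ';'-separated fields."""
    token_ids, sent_type_ids, pos_ids, seg_labels, next_sent = line.rstrip().split(';')
    return {
        'token_ids': [int(i) for i in token_ids.split()],
        'sentence_type_ids': [int(i) for i in sent_type_ids.split()],
        'position_ids': [int(i) for i in pos_ids.split()],
        # 0 = word-initial, 1 = word-internal, -1 = placeholder for CLS/SEP
        'seg_labels': [int(i) for i in seg_labels.split()],
        'next_sentence_label': int(next_sent),
    }

with gzip.open('./data/demo_train_set.gz', 'rt') as f:
    sample = parse_sample(next(f))
print(len(sample['token_ids']), sample['next_sentence_label'])
```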
#### Starting the Training
The entry script for pre-training is [`script/pretrain.sh`](./script/pretrain.sh).
Before pre-training, add the CUDA, cuDNN, and NCCL2 dynamic library paths to the LD_LIBRARY_PATH environment variable; then run `bash script/pretrain.sh` to start pre-training on the demo data with the default configuration.
During pre-training, the current learning rate, the number of epochs over the training data, the total step count, the training loss, and the training speed are printed; as configured by --validation_steps ${N}, metrics on the validation set are printed every N steps:
```
current learning_rate:0.000001
epoch: 1, progress: 1/1, step: 30, loss: 10.540648, ppl: 19106.925781, next_sent_acc: 0.625000, speed: 0.849662 steps/s, file: ./data/demo_train_set.gz, mask_type: mask_word
feed_queue size 70
current learning_rate:0.000001
epoch: 1, progress: 1/1, step: 40, loss: 10.529287, ppl: 18056.654297, next_sent_acc: 0.531250, speed: 0.849549 steps/s, file: ./data/demo_train_set.gz, mask_type: mask_word
feed_queue size 70
current learning_rate:0.000001
epoch: 1, progress: 1/1, step: 50, loss: 10.360563, ppl: 16398.287109, next_sent_acc: 0.625000, speed: 0.843776 steps/s, file: ./data/demo_train_set.gz, mask_type: mask_word
```
To train on your own real data, adjust the parameters in [`script/pretrain.sh`](./script/pretrain.sh) accordingly.
### Fine-tuning Tasks
After ERNIE pre-training completes, the pre-trained parameters can be fine-tuned on specific NLP tasks. Below we show, on top of the pre-trained ERNIE model, how to fine-tune a classification task and a sequence labeling task. To run these tasks, first download the corresponding pre-trained model via the links in the [Models & Data](#models--data) section.
Unpack the downloaded model to `${MODEL_PATH}`; this path then contains the parameter directory `params`.
Unpack the downloaded task data to `${TASK_DATA_PATH}`; this path then contains the training and test data of the 5 tasks `LCQMC`, `XNLI`, `MSRA-NER`, `ChnSentCorp`, `nlpcc-dbqa`.
#### Single-Sentence and Sentence-Pair Classification Tasks
1) **Single-sentence classification**:
The `ChnSentiCorp` sentiment classification dataset serves as the single-sentence classification example. The data is a tsv file with 2 fields, `text_a label`; example data:
```
label text_a
0 当当网名不符实,订货多日不见送货,询问客服只会推托,只会要求用户再下订单。如此服务留不住顾客的。去别的网站买书服务更好。
0 XP的驱动不好找!我的17号提的货,现在就降价了100元,而且还送杀毒软件!
1 <荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦!
```
Run `bash script/run_ChnSentiCorp.sh` to start finetuning. When it finishes, results on the dev set and test set are printed:
```
[dev evaluation] ave loss: 0.189373, ave acc: 0.954167, data_num: 1200, elapsed time: 14.984404 s
[test evaluation] ave loss: 0.189387, ave acc: 0.950000, data_num: 1200, elapsed time: 14.737691 s
```
2) **Sentence-pair classification**
The `LCQMC` semantic similarity task serves as the sentence-pair classification example. The data is a tsv file with 3 fields, `text_a text_b label`; example data:
```
text_a text_b label
开初婚未育证明怎么弄? 初婚未育情况证明怎么开? 1
谁知道她是网络美女吗? 爱情这杯酒谁喝都会醉是什么歌 0
这腰带是什么牌子 护腰带什么牌子好 0
```
Run `bash script/run_lcqmc.sh` to start finetuning. When it finishes, results on the dev set and test set are printed:
```
[dev evaluation] ave loss: 0.290925, ave acc: 0.900704, data_num: 8802, elapsed time: 32.240948 s
[test evaluation] ave loss: 0.345714, ave acc: 0.878080, data_num: 12500, elapsed time: 39.738015 s
```
#### Sequence Labeling
1) **Named entity recognition**
`MSRA-NER (SIGHAN 2006)` serves as the example. The data is a tsv file with 2 fields, `text_a label` (a serialization sketch follows the example); example data:
```
text_a label
在 这 里 恕 弟 不 恭 之 罪 , 敢 在 尊 前 一 诤 : 前 人 论 书 , 每 曰 “ 字 字 有 来 历 , 笔 笔 有 出 处 ” , 细 读 公 字 , 何 尝 跳 出 前 人 藩 篱 , 自 隶 变 而 后 , 直 至 明 季 , 兄 有 何 新 出 ? O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
相 比 之 下 , 青 岛 海 牛 队 和 广 州 松 日 队 的 雨 中 之 战 虽 然 也 是 0 ∶ 0 , 但 乏 善 可 陈 。 O O O O O B-ORG I-ORG I-ORG I-ORG I-ORG O B-ORG I-ORG I-ORG I-ORG I-ORG O O O O O O O O O O O O O O O O O O O
理 由 多 多 , 最 无 奈 的 却 是 : 5 月 恰 逢 双 重 考 试 , 她 攻 读 的 博 士 学 位 论 文 要 通 考 ; 她 任 教 的 两 所 学 校 , 也 要 在 这 段 时 日 大 考 。 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
```
Run `bash script/run_msra_ner.sh` to start finetuning. When it finishes, results on the dev set and test set are printed:
```
[dev evaluation] f1: 0.951949, precision: 0.944636, recall: 0.959376, elapsed time: 19.156693 s
[test evaluation] f1: 0.937390, precision: 0.925988, recall: 0.949077, elapsed time: 36.565929 s
```
### FAQ
#### How to get the Embedding of an input sentence encoded by ERNIE?
Use `ernie_encoder.py` to extract the Embedding of an input sentence as well as the Embedding of each token in it. The data format is the same as the Fine-tuning data formats described in the [Fine-tuning Tasks](#fine-tuning-tasks) section. For example, to get sentence and token embeddings for the LCQMC dev set, the script looks like:
```
export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=7
python -u ernie_encoder.py \
--use_cuda true \
--batch_size 32 \
--output_dir "./test" \
--init_pretraining_params ${MODEL_PATH}/params \
--data_set ${TASK_DATA_PATH}/lcqmc/dev.tsv \
--vocab_path config/vocab.txt \
--max_seq_len 128 \
--ernie_config_path config/ernie_config.json
```
When the script finishes, `cls_emb.npy` (sentence embeddings) and `top_layer_emb.npy` (token embeddings) are written to the `test` directory under the current path. In practice, adjust the data path, the embeddings output path, and other settings in the example script and run it.
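A minimal sketch for consuming the two output files (the array shapes depend on the model's hidden size and your data, so treat them as assumptions):
```python
import numpy as np

cls_emb = np.load('./test/cls_emb.npy')              # one embedding per input sentence
top_layer_emb = np.load('./test/top_layer_emb.npy')  # top-layer embedding of each token
print(cls_emb.shape, top_layer_emb.shape)

# e.g. cosine similarity of the first two sentence embeddings
a, b = cls_emb[0], cls_emb[1]
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```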
#### How to get the Embedding of each token in an input sentence encoded by ERNIE?
[Same solution as above](#how-to-get-the-embedding-of-an-input-sentence-encoded-by-ernie)
#### How to run batch prediction on new data with a finetuned model?
Taking classification as the example, we provide a batch prediction script; usage:
```
python -u predict_classifier.py \
--use_cuda true \
--batch_size 32 \
--vocab_path config/vocab.txt \
--init_checkpoint "./checkpoints/step_100" \
--do_lower_case true \
--max_seq_len 128 \
--ernie_config_path config/ernie_config.json \
--do_predict true \
--predict_set ${TASK_DATA_PATH}/lcqmc/test.tsv \
--num_labels 2
```
In practice, set `init_checkpoint` to the model used for prediction, `predict_set` to the data file to be predicted, and `num_labels` to the number of classes.
**Note**: the data format of predict_set is a 1-column/2-column tsv file consisting of text_a and optionally text_b.
English | [简体中文](./README.zh.md)
**Try ERNIE with *eager execution*: please check out the `dygraph` branch.**
## ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
* [IR Relevance Task](#ir-relevance-task)
* [ERNIE 1.0: <strong>E</strong>nhanced <strong>R</strong>epresentation through k<strong>N</strong>owledge <strong>I</strong>nt<strong>E</strong>gration](#ernie-10-enhanced-representation-through-knowledge-integration)
* [Compare the ERNIE 1.0 and ERNIE 2.0](#compare-the-ernie-10-and-ernie-20)
* [Results](#results)
* [Results on English Datasets](#results-on-english-datasets)
* [Results on Chinese Datasets](#results-on-chinese-datasets)
* [Communication](#communication)
* [Usage](#usage)
![ernie2.0_paper](.metas/ernie2.0_paper.png)
| **Structure-aware** | | ✅ Sentence Reordering | ✅ Sentence Reordering <br> ✅ Sentence Distance |
| **Semantic-aware** | ✅ Next Sentence Prediction | ✅ Discourse Relation | ✅ Discourse Relation <br> ✅ IR Relevance |
## Release Notes
- July 30, 2019: release ERNIE 2.0
- Apr 10, 2019: update ERNIE_stable-1.0.1.tar.gz, update config and vocab
- Mar 18, 2019: update ERNIE_stable.tgz
- Mar 15, 2019: release ERNIE 1.0
## Results
*\*The DRCD dataset is converted from Traditional Chinese to Simplified Chinese based on the tool: https://github.com/skydark/nstools/tree/master/zhtools*

\* *The pre-training data of ERNIE 1.0 BASE does not contain instances whose length exceeds 128, while the other models are pre-trained with instances of length up to 512. This causes poorer performance of ERNIE 1.0 BASE on long-text tasks, so we released [ERNIE 1.0 Base (max-len-512)](https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz) on July 29th, 2019*
<tr>
<th><strong>Dataset</strong>
<br></th>
<th colspan="2"><center><strong>MSRA-NER (SIGHAN2006)</strong></center></th>
<tr>
<td rowspan="2">
<p>
</tbody>
</table>

- **MSRA-NER (SIGHAN2006)**
```text
MSRA-NER (SIGHAN2006) dataset is released by MSRA for recognizing the names of people, locations and organizations in text.
```
#### Results on Sentiment Analysis Task
- **BQ Corpus**
```text
BQ Corpus (Bank Question corpus) is a Chinese corpus for sentence semantic equivalence identification. This dataset was published in EMNLP 2018. [url: https://www.aclweb.org/anthology/D18-1536]
```
## Communication
- [Github Issues](https://github.com/PaddlePaddle/ERNIE/issues): bug reports, feature requests, install issues, usage issues, etc.
- QQ discussion group: 760439550 (ERNIE discussion group).
- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.
## Usage

* [Install PaddlePaddle](#install-paddlepaddle)
* [Pre-trained Models &amp; Datasets](#pre-trained-models--datasets)
* [Chinese Datasets](#chinese-datasets)
* [Fine-tuning](#fine-tuning)
* [Batchsize and GPU Settings](#batchsize-and-gpu-settings)
* [Multiprocessing and fp16 auto mix-precision finetune](#multiprocessing-and-fp16-auto-mix-precision-finetune)
* [Classification](#classification)
* [Single Sentence Classification Tasks](#single-sentence-classification-tasks)
* [Sentence Pair Classification Tasks](#sentence-pair-classification-tasks)
* [Machine Reading Comprehension](#machine-reading-comprehension)
* [Pre-training with ERNIE 1.0](#pre-training-with-ernie-10)
* [Data Preprocessing](#data-preprocessing)
* [Pretrain ERNIE1.0](#pretrain-ernie10)
* [Distillation](#distillation)
* [FAQ](#faq)
* [FAQ1: How to get sentence/tokens embedding of ERNIE?](#faq1-how-to-get-sentencetokens-embedding-of-ernie)
* [FAQ2: How to predict on new data with Fine-tuning model?](#faq2-how-to-predict-on-new-data-with-fine-tuning-model)
* [FAQ3: Is the argument batch_size for one GPU card or for all GPU cards?](#faq3-is-the--argument-batch_size-for-one-gpu-card-or-for-all-gpu-cards)
* [FAQ4: Can not find library: libcudnn.so. Please try to add the lib path to LD_LIBRARY_PATH.](#faq4-can-not-find-library-libcudnnso-please-try-to-add-the-lib-path-to-ld_library_path)
* [FAQ5: Can not find library: libnccl.so. Please try to add the lib path to LD_LIBRARY_PATH.](#faq5-can-not-find-library-libncclso-please-try-to-add-the-lib-path-to-ld_library_path)
* [FAQ6: Runtime error: `ModuleNotFoundError No module named propeller`](#faq6)
### Install PaddlePaddle

This code base has been tested with Paddle Fluid 1.6 under Python 2 / 3.5+. Since Paddle 1.6 changed some APIs, using a version before 1.6 may cause errors on NER tasks.

**\*Important\*** When you have finished installing Paddle Fluid, remember to update LD_LIBRARY_PATH for CUDA, cuDNN and NCCL2; for more information on PaddlePaddle setup, you can click [here](http://en.paddlepaddle.org/documentation/docs/en/1.5/beginners_guide/index_en.html) and [here](http://en.paddlepaddle.org/documentation/docs/en/1.5/beginners_guide/install/install_Ubuntu_en.html). Also, you can read the FAQ at the end of this document when you encounter errors.

For beginners of PaddlePaddle, the following documentation will walk you through installing PaddlePaddle:
For more information about PaddlePaddle, please refer to [PaddlePaddle Github](https://github.com/PaddlePaddle/Paddle) or the [Official Website](https://www.paddlepaddle.org.cn/) for details.

Other dependencies of ERNIE are listed in `requirements.txt`; you can install them with
```script
pip install -r requirements.txt
```
### Pre-trained Models & Datasets

#### Models

| Model | Description |
| :------------------------------------------------- | :----------------------------------------------------------- |
| [ERNIE 2.0 Base for English](https://ernie.bj.bcebos.com/ERNIE_Base_en_stable-2.0.0.tar.gz) | with params, config and vocabs |
| [ERNIE 2.0 Large for English](https://ernie.bj.bcebos.com/ERNIE_Large_en_stable-2.0.0.tar.gz) | with params, config and vocabs |

#### Datasets

##### English Datasets

Download the [GLUE data](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it to some directory `${TASK_DATA_PATH}`

After the dataset is downloaded, you should run `sh ./script/en_glue/preprocess/cvt.sh $TASK_DATA_PATH` to convert the data format for training. If everything goes well, there will be a folder named `glue_data_processed` created with all the converted data in it.

##### Chinese Datasets

You can download Chinese Datasets from [here](https://ernie.bj.bcebos.com/task_data_zh.tgz)

#### Fine-tuning

##### Batchsize and GPU Settings

In our experiments, we found that the batch size matters for different tasks. To help users reproduce results more easily, we list the batch size and GPU cards here:
| Dataset  | Batch Size       | GPU                  |
| -------- | ---------------- | -------------------- |
| CoLA     | 32 / 64 (base)   | 1                    |
| SST-2    | 64 / 256 (base)  | 8                    |
| STS-B    | 128              | 8                    |
| QQP      | 256              | 8                    |
| MNLI     | 256 / 512 (base) | 8                    |
| QNLI     | 256              | 8                    |
| RTE      | 16 / 4 (base)    | 1                    |
| MRPC     | 16 / 32 (base)   | 2                    |
| WNLI     | 8                | 1                    |
| XNLI     | 65536 (tokens)   | 8                    |
| CMRC2018 | 64               | 8 (large) / 4 (base) |
\* *For MNLI and QNLI we used 32GB V100; for other tasks we used 22GB P40*
#### Multiprocessing and fp16 auto mix-precision finetune

Multiprocessing finetuning can be enabled simply with `finetune_launch.py` in your finetune script. With multiprocessing, Paddle can fully utilize your CPU/GPU capacity to accelerate finetuning.

`finetune_launch.py` should be placed in front of your finetune command. Make sure to provide the number of processes and the device ids per node by specifying `--nproc_per_node` and `--selected_gpus`. The number of device ids should match `nproc_per_node` and `CUDA_VISIBLE_DEVICES`, and the indexing should start from 0.

fp16 finetuning can be enabled simply by specifying `--use_fp16 true` in your training script (make sure you are on a Tensor Core device). ERNIE casts the computation ops to fp16 precision while keeping storage in fp32 precision; an approximately 60% speedup is seen on XNLI finetuning. Dynamic loss scaling is used to avoid gradient underflow.
#### Classification
##### Single Sentence Classification Tasks
The code used to perform classification/regression finetuning is in `run_classifier.py`; we also provide the shell scripts for each task, including the best hyperparameters.
Take an English task `SST-2` and a Chinese task `ChnSentCorp` for example,
Step1: Download and unarchive the model in path `${MODEL_PATH}`, if everything goes well, there should be a folder named `params` in `$MODEL_PATH`;

Step2: Download and unarchive the data set in `${TASK_DATA_PATH}`, for English tasks, there should be 9 folders named `CoLA`, `MNLI`, `MRPC`, `QNLI`, `QQP`, `RTE`, `SST-2`, `STS-B`, `WNLI`; for Chinese tasks, there should be 6 folders named `cmrc2018`, `drc`, `xnli`, `msra-ner`, `chnsentcorp`, `nlpcc-dbqa` in `${TASK_DATA_PATH}`;

Step3: Follow the instructions below based on your own task type for starting your programs.
##### Sentence Pair Classification Tasks

Take `RTE` as an example, the data should have 3 fields `text_a text_b label` with tsv format. Here are some example data:
```
#### Sequence Labeling

##### Named Entity Recognition

Take `MSRA-NER(SIGHAN2006)` as an example, the data should have 2 fields, `text_a label`, with tsv format. Here are some example data:
```
[test evaluation] f1: 0.937390, precision: 0.925988, recall: 0.949077, elapsed time: 36.565929 s
```

#### Machine Reading Comprehension

Take `DRCD` as an example, first convert the data into SQuAD format:
```

### Pre-training with ERNIE 1.0

#### Data Preprocessing

We construct the training dataset based on [Baidu Baike](https://en.wikipedia.org/wiki/Baidu_Baike), [Baidu Knows(Baidu Zhidao)](https://en.wikipedia.org/wiki/Baidu_Knows), [Baidu Tieba](https://en.wikipedia.org/wiki/Baidu_Tieba) for Chinese version ERNIE, and [Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Database_download), [Reddit](https://en.wikipedia.org/wiki/Reddit), [BookCorpus](https://github.com/soskek/bookcorpus) for English version ERNIE.
Each instance is composed of 5 fields, which are joined by `;` in one line, representing `token_ids; sentence_type_ids; position_ids; seg_labels; next_sentence_label` respectively. In particular, in the field `seg_labels`, 0 means the beginning of one word, 1 means a non-beginning position, -1 means a placeholder, and the other numbers mark `CLS` or `SEP`.

#### Pretrain ERNIE 1.0

The start entry for pretrain is [`script/zh_task/pretrain.sh`](./script/zh_task/pretrain.sh). Before we run the train program, remember to set CUDA, cuDNN, NCCL2, etc. in the environment variable LD_LIBRARY_PATH.
```

### Distillation

ERNIE provides a toolkit for data distillation to further accelerate your inference; see <a href="./distill/README.md">here</a> for details.

### FAQ

#### FAQ1: How to get sentence/tokens embedding of ERNIE?

Run `ernie_encoder.py` to get both the sentence embeddings and the token embeddings. The input data format should be the same as that mentioned in the chapter [Fine-tuning](#fine-tuning).
Here is an example to get sentence embedding and token embedding for the LCQMC dev dataset:
```
export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=0

python -u ernie_encoder.py \
    --use_cuda true \
    --batch_size 32 \
    --output_dir "./test" \
    --init_pretraining_params ${MODEL_PATH}/params \
    --data_set ${TASK_DATA_PATH}/lcqmc/dev.tsv \
    --vocab_path config/vocab.txt \
    --max_seq_len 128 \
    --ernie_config_path config/ernie_config.json
```

When this script finishes, `cls_emb.npy` and `top_layer_emb.npy` will be generated under `./test`, storing the sentence embeddings and token embeddings respectively.
#### FAQ2: How to predict on new data with Fine-tuning model?

Take classification tasks for example, here is the script for batch prediction:
```
python -u infer_classifyer.py \
    --ernie_config_path ${MODEL_PATH}/ernie_config.json \
    --init_checkpoint "./checkpoints/step_100" \
    --save_inference_model_path ./saved_model \
    --predict_set ${TASK_DATA_PATH}/xnli/test.tsv \
    --vocab_path ${MODEL_PATH}/vocab.txt \
    --num_labels 3
```

Argument `init_checkpoint` is the path of the model, `predict_set` is the path of the test file, and `num_labels` is the number of target labels.
#### FAQ3: Is the argument batch_size for one GPU card or for all GPU cards?

For one GPU card.

#### FAQ4: Can not find library: libcudnn.so. Please try to add the lib path to LD_LIBRARY_PATH.

Export the path of cuda to LD_LIBRARY_PATH, e.g.: `export LD_LIBRARY_PATH=/home/work/cudnn/cudnn_v[your cudnn version]/cuda/lib64`

#### FAQ5: Can not find library: libnccl.so. Please try to add the lib path to LD_LIBRARY_PATH.

Download [NCCL2](https://developer.nvidia.com/nccl/nccl-download), and export the library path to LD_LIBRARY_PATH, e.g.: `export LD_LIBRARY_PATH=/home/work/nccl/lib`
#### FAQ6: Runtime error: `ModuleNotFoundError No module named propeller`<a name="faq6"></a>

You can add propeller to your PYTHONPATH with `export PYTHONPATH=./:$PYTHONPATH`
#### FAQ7: Cannot malloc XXX MB GPU memory.
Try reducing `batch_size` and `max_seq_len`, and set `FLAGS_eager_delete_tensor_gb=0.0`.
* [ERNIE Slim data distillation](#ernie-slim-data-distillation)
* [Three steps of ERNIE data distillation](#three-steps-of-ernie-data-distillation)
* [Data augmentation](#data-augmentation)
* [Tutorial](#tutorial)
* [Offline distillation](#offline-distillation)
* [Online distillation](#online-distillation)
* [Benchmarks](#benchmarks)
* [Case 1: users provide "unlabeled data"](#case1)
* [Case 2: users provide no "unlabeled data"](#case2)
* [FAQ](#faq)
# ERNIE Slim data distillation
Behind ERNIE's strong semantic understanding lies an equally large demand for compute to train and serve a model of this scale. Many industrial applications have strict performance requirements, and without effective compression the model cannot be deployed in practice.
![ernie_distill](../.metas/ernie_distill.png)
Therefore, as shown above, we built the **ERNIE Slim data distillation system** on top of [data distillation](https://arxiv.org/pdf/1712.04440.pdf). Its principle is to use data as a bridge to transfer the knowledge of the ERNIE model into a small model, achieving a prediction speedup of up to a thousand times at the cost of only a small loss in quality.
### Three steps of ERNIE data distillation
- **Step 1**. Fine-tune ERNIE on the labeled input data to obtain the Teacher Model
- **Step 2**. Use the ERNIE service to predict on the following unsupervised data:
1. large-scale unlabeled data provided by the user, which must come from the same source as the labeled data
2. augmented versions of the labeled data; the augmentation strategies are described in the next section
3. a mixture of the unlabeled and the augmented data in a certain ratio
- **Step 3.** Train the Student Model on the data from Step 2
### Data augmentation
We currently use three [data augmentation strategies](https://arxiv.org/pdf/1903.12136.pdf), which can be mixed in task-specific ratios (a minimal sketch follows the list):
1. Noising: each word in the original sample is replaced by the "UNK" token with some probability (e.g. 0.1)
2. Same-POS replacement: each word in the original sample is replaced, with some probability (e.g. 0.1), by a random word of the same part of speech from the dataset
3. N-sampling: a span of length m is cut from a random position of the original sample as a new sample, where m is a random value between 0 and the original sample length
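A minimal sketch of the three strategies (our own illustration; the real augmentation pipeline, tokenization, and POS tagger are not shown here):
```python
import random

def add_noise(words, p=0.1):
    """Strategy 1: replace each word by 'UNK' with probability p."""
    return [('UNK' if random.random() < p else w) for w in words]

def same_pos_replace(words, pos_tags, pos_lexicon, p=0.1):
    """Strategy 2: replace each word, with probability p, by a random
    word of the same part of speech observed in the dataset."""
    return [(random.choice(pos_lexicon[t]) if random.random() < p else w)
            for w, t in zip(words, pos_tags)]

def n_sampling(words):
    """Strategy 3: cut a span of random length m (0..len) from a random position."""
    m = random.randint(0, len(words))
    start = random.randint(0, len(words) - m)
    return words[start:start + m]
```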
# Tutorial
Using the three augmentation strategies above, we built an augmented dataset for chnsenticorp: the augmented data is 10x the original training data (96000 lines) and can be downloaded [here](https://ernie.bj.bcebos.com/distill_data.tar.gz). Put the downloaded `distill` folder into `${TASK_DATA_PATH}` and then run the scripts below to start distillation.
### Offline distillation
In offline distillation, the trained ERNIE model first predicts labels for the unsupervised data, and the student model then learns from those labels. Simply run
```script
sh ./distill/script/distill_chnsenticorp.sh
```
to start offline distillation.
The script performs the three steps described above: 1. fine-tune on the task data; 2. score the augmented data with the fine-tuned model; 3. train the Student model. The script uses hard-label distillation: in step 2 it directly predicts the label annotated by ERNIE.
The script involves two Python files: `./example/finetune_classifier.py` handles finetuning and prediction with the teacher model, and `distill/distill_chnsentocorp.py` handles training the student model. The pre-built augmented data is placed in `${TASK_DATA_PATH}/distill/chnsenticorp/student/unsup_train_aug`.
In step 2, the `--do_predict` flag switches the script into prediction mode:
```script
cat ${TASK_DATA_PATH}/distill/chnsenticorp/student/unsup_train_aug/part.0 |python3 -u ./example/finetune_classifier.py \
--do_predict \
--data_dir ${TASK_DATA_PATH}/distill/chnsenticorp/teacher \
--warm_start_from ${MODEL_PATH}/params \
--vocab_file ${MODEL_PATH}/vocab.txt \
...
```
The script reads plain text from standard input and writes scores to standard output; this is how the augmented unsupervised training corpus gets annotated. The final annotations are stored in `prediction_output/part.0`, with two columns: the plain text and the annotated label.
Step 3 starts the training of the student model:
```script
python3 ./distill/distill_chnsentocorp.py \
--data_dir ${TASK_DATA_PATH}/distill/chnsenticorp/student \
--vocab_file ${TASK_DATA_PATH}/distill/chnsenticorp/student/vocab.txt \
--unsupervise_data_dir ./prediction_output/ \
--max_seqlen 128 \
...
```
The training flow is the same as in step 1: `--data_dir` points to the supervised data and `--unsupervise_data_dir` to the ERNIE-annotated data. The Student model is a simple BOW model defined in `distill/distill_chnsentocorp.py`; to plug in a custom distillation model, just rewrite its model part.
If you already have unsupervised data, simply put it into `${TASK_DATA_PATH}/distill/chnsenticorp/student/unsup_train_aug`.
### Online distillation
In some scenarios the unsupervised data is so large that prediction becomes very time-consuming, or the distributions predicted by ERNIE are too large to be stored on disk in advance. For such scenarios we propose an **online distillation** scheme: after fine-tuning with `propeller` and using the `BestInferenceModelExporter`, `propeller` automatically saves the model with the best metrics in Paddle inference-model format and then starts a prediction service; while the Student model trains, it queries this service in real time to obtain ERNIE's prediction scores. Simply run
```
sh ./distill/script/distill_chnsenticorp_with_propeller_server.sh
```
to complete the flow above.
The flow consists of 3 steps: 1. finetune the ERNIE model; 2. start a `propeller` service from the ERNIE model with the best metrics; 3. query the service for the teacher model's annotations during Student training.
This flow involves two Python files: `example/finetune_classifier.py` and `distill/distill_chnsentocorp_with_propeller_server.py`. The first step is used exactly as in offline distillation.
The second step uses
```script
python3 -m propeller.tools.start_server -p 8113 -m ${teacher_dir}/best/inference/ &
```
to start an ERNIE prediction service.
The third step starts the synchronized training of the Student model:
```script
python3 ./distill/distill_chnsentocorp_with_propeller_server.py \
--data_dir ${TASK_DATA_PATH}/distill/chnsenticorp/student \
--vocab_file ${TASK_DATA_PATH}/distill/chnsenticorp/student/vocab.txt \
--teacher_vocab_file ${MODEL_PATH}/vocab.txt \
--max_seqlen 128 \
--teacher_max_seqlen 128 \
--server_batch_size 64 \
--teacher_host tcp://localhost:8113 \
--num_coroutine 10
```
This script character-tokenizes the augmented data under `${TASK_DATA_PATH}/distill/chnsenticorp/student/unsup_train_aug` and queries the `propeller` service. `--num_coroutine` sets the request concurrency, `--teacher_host` sets the IP and port of the service, and `--server_batch_size` sets the request batch size; in practice each batch of data is split into several chunks of size `--server_batch_size` when querying the service.
# Benchmarks
We distinguish two practical scenarios:
### Case 1: users provide "unlabeled data"<a name="case1"></a>
|Model | Low-quality comment detection (classification \| acc) | Chinese sentiment (classification \| acc) | Question detection (classification \| acc) | Search QA matching (matching \| positive/negative pair ratio) |
|---|---|---|---|---|
|ERNIE-Finetune | 90.6% | 96.2% | 97.5% | 4.25 |
|Non-ERNIE baseline (BOW)| 80.8% | 94.7% | 93.0% | 1.83 |
|**+ data distillation** | 87.2% | 95.8% | 96.3% | 3.30 |
### Case 2: users provide no "unlabeled data" (data generated via augmentation)<a name="case2"></a>
|Model |ChnSentiCorp |
|---|---|
|ERNIE-Finetune |95.4% |
|Non-ERNIE baseline (BOW)|90.1%|
|**+ data distillation** |91.4%|
|Non-ERNIE baseline (LSTM)|91.2%|
|**+ data distillation**|93.9%|
# FAQ
### FAQ1: Distillation while predicting fails with `Client call failed`
The error printed in the terminal is the client-side log; the server-side log precedes it. This usually means the server ran out of GPU memory; in that case, explicitly limit the request batch size with `--server_batch_size` in the student finetune script.
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
from random import random
from functools import reduce, partial
import logging
import numpy as np
import multiprocessing
import paddle
import paddle.fluid as F
import paddle.fluid.layers as L
from propeller import log
import propeller.paddle as propeller
from propeller.paddle.data import Dataset
from optimization import optimization
import utils.data
log.setLevel(logging.DEBUG)
class ClassificationBowModel(propeller.train.Model):
    """propeller Model wrapper for paddle-ERNIE """

    def __init__(self, config, mode, run_config):
        self.config = config
        self.mode = mode
        self.run_config = run_config
        self._param_initializer = F.initializer.TruncatedNormal(
            scale=config.initializer_range)
        self._emb_dtype = "float32"
        self._word_emb_name = "word_embedding"

    def forward(self, features):
        text_ids_a, = features

        def bow(ids):
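            # Embed the ids, zero out [PAD] (id 0) positions, sum over the
            # sequence axis, then squash the sum with softsign.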
            embed = L.embedding(
                input=ids,
                size=[self.config.vocab_size, self.config.emb_size],
                dtype=self._emb_dtype,
                param_attr=F.ParamAttr(
                    name=self._word_emb_name, initializer=self._param_initializer),
                is_sparse=False)
            zero = L.fill_constant(shape=[1], dtype='int64', value=0)
            pad = L.cast(L.logical_not(L.equal(ids, zero)), 'float32')
            sumed = L.reduce_sum(embed * pad, dim=1)
            sumed = L.softsign(sumed)
            return sumed

        sumed = bow(text_ids_a)

        fced = L.fc(
            input=sumed,
            size=self.config.emb_size,
            act='tanh',
            param_attr=F.ParamAttr(
                name="middle_fc.w_0", initializer=self._param_initializer),
            bias_attr="middle_fc.b_0")
        logits = L.fc(
            input=fced,
            size=self.config.num_label,
            act=None,
            param_attr=F.ParamAttr(
                name="pooler_fc.w_0", initializer=self._param_initializer),
            bias_attr="pooler_fc.b_0")

        if self.mode is propeller.RunMode.PREDICT:
            probs = L.softmax(logits)
            return probs
        else:
            return logits

    def loss(self, predictions, labels):
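        # Labels arrive as unnormalized scores (e.g. a one-hot vector scaled
        # by 9999); softmax turns them into a distribution so the same
        # soft-label cross entropy works for gold and teacher labels.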
        labels = L.softmax(labels)
        loss = L.softmax_with_cross_entropy(predictions, labels, soft_label=True)
        loss = L.mean(loss)
        return loss

    def backward(self, loss):
        scheduled_lr, _ = optimization(
            loss=loss,
            warmup_steps=int(self.run_config.max_steps * self.config.warmup_proportion),
            num_train_steps=self.run_config.max_steps,
            learning_rate=self.config.learning_rate,
            train_program=F.default_main_program(),
            startup_prog=F.default_startup_program(),
            weight_decay=self.config.weight_decay,
            scheduler="linear_warmup_decay",)
        propeller.summary.scalar('lr', scheduled_lr)

    def metrics(self, predictions, labels):
        predictions = L.argmax(predictions, axis=1)
        labels = L.argmax(labels, axis=1)
        #predictions = L.unsqueeze(predictions, axes=[1])
        acc = propeller.metrics.Acc(labels, predictions)
        #auc = propeller.metrics.Auc(labels, predictions)
        return {'acc': acc}
if __name__ == '__main__':
    parser = propeller.ArgumentParser('Distill model with Paddle')
    parser.add_argument('--max_seqlen', type=int, default=128)
    parser.add_argument('--vocab_file', type=str, required=True)
    parser.add_argument('--unsupervise_data_dir', type=str, required=True)
    parser.add_argument('--data_dir', type=str)
    args = parser.parse_args()

    run_config = propeller.parse_runconfig(args)
    hparams = propeller.parse_hparam(args)

    vocab = {j.strip().split(b'\t')[0].decode('utf8'): i for i, j in enumerate(open(args.vocab_file, 'rb'))}
    unk_id = vocab['[UNK]']

    char_tokenizer = utils.data.CharTokenizer(vocab.keys())
    space_tokenizer = utils.data.SpaceTokenizer(vocab.keys())

    supervise_feature_column = propeller.data.FeatureColumns([
        propeller.data.TextColumn('text_a', unk_id=unk_id, vocab_dict=vocab, tokenizer=space_tokenizer),
        propeller.data.LabelColumn('label'),
    ])

    def before(text_a, label):
        sentence_a = text_a[: args.max_seqlen]
        return sentence_a, label

    def after(sentence_a, label):
        batch_size = sentence_a.shape[0]
        onehot_label = np.zeros([batch_size, hparams.num_label], dtype=np.float32)
        onehot_label[np.arange(batch_size), label] = 9999.
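        # The 9999 scale makes softmax(labels) in loss() effectively one-hot.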
        sentence_a, = utils.data.expand_dims(sentence_a)
        return sentence_a, onehot_label

    train_ds = supervise_feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=True, use_gz=False) \
        .map(before) \
        .padded_batch(hparams.batch_size, (0, 0)) \
        .map(after)

    unsup_train_ds = supervise_feature_column.build_dataset('unsup_train', data_dir=args.unsupervise_data_dir, shuffle=True, repeat=True, use_gz=False) \
        .map(before) \
        .padded_batch(hparams.batch_size, (0, 0)) \
        .map(after)

    dev_ds = supervise_feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
        .map(before) \
        .padded_batch(hparams.batch_size, (0, 0)) \
        .map(after)
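    # Alternate batches from the gold-labeled and the teacher-labeled datasets.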
    train_ds = utils.data.interleave(train_ds, unsup_train_ds)

    shapes = ([-1, args.max_seqlen, 1], [-1, hparams.num_label])
    types = ('int64', 'float32')
    train_ds.data_shapes = shapes
    train_ds.data_types = types
    dev_ds.data_shapes = shapes
    dev_ds.data_types = types

    '''
    from tqdm import tqdm
    for slots in tqdm(train_ds):
        pass
    '''

    best_exporter = propeller.train.exporter.BestExporter(os.path.join(run_config.model_dir, 'best'), cmp_fn=lambda old, new: new['dev']['acc'] > old['dev']['acc'])
    propeller.train.train_and_eval(
        model_class_or_model_fn=ClassificationBowModel,
        params=hparams,
        run_config=run_config,
        train_dataset=train_ds,
        eval_dataset={'dev': dev_ds},
        exporters=[best_exporter])

    print('dev_acc3\t%.5f' % (best_exporter._best['dev']['acc']))
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
from random import random
from functools import reduce, partial
import logging
import numpy as np
import multiprocessing
import paddle
import paddle.fluid as F
import paddle.fluid.layers as L
from propeller import log
import propeller.paddle as propeller
from propeller.paddle.data import Dataset
from propeller.service.client import InferenceClient
from optimization import optimization
import utils.data
log.setLevel(logging.DEBUG)
class ClassificationBowModel(propeller.train.Model):
    """propeller Model wrapper for paddle-ERNIE """

    def __init__(self, config, mode, run_config):
        self.config = config
        self.mode = mode
        self.run_config = run_config
        self._param_initializer = F.initializer.TruncatedNormal(
            scale=config.initializer_range)
        self._emb_dtype = "float32"
        self._word_emb_name = "word_embedding"

    def forward(self, features):
        text_ids_a, = features

        def bow(ids):
            embed = L.embedding(
                input=ids,
                size=[self.config.vocab_size, self.config.emb_size],
                dtype=self._emb_dtype,
                param_attr=F.ParamAttr(
                    name=self._word_emb_name, initializer=self._param_initializer),
                is_sparse=False)
            zero = L.fill_constant(shape=[1], dtype='int64', value=0)
            pad = L.cast(L.logical_not(L.equal(ids, zero)), 'float32')
            sumed = L.reduce_sum(embed * pad, dim=1)
            sumed = L.softsign(sumed)
            return sumed

        sumed = bow(text_ids_a)

        fced = L.fc(
            input=sumed,
            size=self.config.emb_size,
            act='tanh',
            param_attr=F.ParamAttr(
                name="middle_fc.w_0", initializer=self._param_initializer),
            bias_attr="middle_fc.b_0")
        logits = L.fc(
            input=fced,
            size=self.config.num_label,
            act=None,
            param_attr=F.ParamAttr(
                name="pooler_fc.w_0", initializer=self._param_initializer),
            bias_attr="pooler_fc.b_0")

        if self.mode is propeller.RunMode.PREDICT:
            probs = L.softmax(logits)
            return probs
        else:
            return logits

    def loss(self, predictions, labels):
        labels = L.softmax(labels)
        loss = L.softmax_with_cross_entropy(predictions, labels, soft_label=True)
        loss = L.mean(loss)
        return loss

    def backward(self, loss):
        scheduled_lr, _ = optimization(
            loss=loss,
            warmup_steps=int(self.run_config.max_steps * self.config.warmup_proportion),
            num_train_steps=self.run_config.max_steps,
            learning_rate=self.config.learning_rate,
            train_program=F.default_main_program(),
            startup_prog=F.default_startup_program(),
            weight_decay=self.config.weight_decay,
            scheduler="linear_warmup_decay",)
        propeller.summary.scalar('lr', scheduled_lr)

    def metrics(self, predictions, labels):
        predictions = L.argmax(predictions, axis=1)
        labels = L.argmax(labels, axis=1)
        #predictions = L.unsqueeze(predictions, axes=[1])
        acc = propeller.metrics.Acc(labels, predictions)
        #auc = propeller.metrics.Auc(labels, predictions)
        return {'acc': acc}
if __name__ == '__main__':
    parser = propeller.ArgumentParser('distill model with ERNIE')
    parser.add_argument('--max_seqlen', type=int, default=128)
    parser.add_argument('--vocab_file', type=str, required=True)
    parser.add_argument('--teacher_vocab_file', type=str, required=True)
    parser.add_argument('--teacher_max_seqlen', type=int, default=128)
    parser.add_argument('--data_dir', type=str)
    parser.add_argument('--server_batch_size', type=int, default=64)
    parser.add_argument('--num_coroutine', type=int, default=1)
    parser.add_argument('--teacher_host', type=str, required=True)
    args = parser.parse_args()

    run_config = propeller.parse_runconfig(args)
    hparams = propeller.parse_hparam(args)

    teacher_vocab = {j.strip().split(b'\t')[0].decode('utf8'): i for i, j in enumerate(open(args.teacher_vocab_file, 'rb'))}
    vocab = {j.strip().split(b'\t')[0].decode('utf8'): i for i, j in enumerate(open(args.vocab_file, 'rb'))}

    teacher_sep_id = teacher_vocab['[SEP]']
    teacher_cls_id = teacher_vocab['[CLS]']
    teacher_unk_id = teacher_vocab['[UNK]']
    unk_id = vocab['[UNK]']

    char_tokenizer = utils.data.CharTokenizer(vocab.keys())
    space_tokenizer = utils.data.SpaceTokenizer(vocab.keys())

    supervise_feature_column = propeller.data.FeatureColumns([
        propeller.data.TextColumn('text_a', unk_id=unk_id, vocab_dict=vocab, tokenizer=space_tokenizer),
        propeller.data.LabelColumn('label'),
    ])
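    # Unsupervised samples carry two views of the same text: ids in the student
    # vocab (space-tokenized) and ids in the teacher's ERNIE vocab (char-tokenized).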
    unsupervise_feature_column = propeller.data.FeatureColumns([
        propeller.data.TextColumn('text_a', unk_id=unk_id, vocab_dict=vocab, tokenizer=space_tokenizer),
        propeller.data.TextColumn('teacher_text_a', unk_id=teacher_unk_id, vocab_dict=teacher_vocab, tokenizer=char_tokenizer),
    ])

    def before(text_a, label):
        sentence_a = text_a[: args.max_seqlen]
        return sentence_a, label

    def after(sentence_a, label):
        batch_size = sentence_a.shape[0]
        onehot_label = np.zeros([batch_size, hparams.num_label], dtype=np.float32)
        onehot_label[np.arange(batch_size), label] = 9999.
        sentence_a, = utils.data.expand_dims(sentence_a)
        return sentence_a, onehot_label

    train_ds = supervise_feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=True, use_gz=False) \
        .map(before) \
        .padded_batch(hparams.batch_size, (0, 0)) \
        .map(after)

    dev_ds = supervise_feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
        .map(before) \
        .padded_batch(hparams.batch_size, (0, 0)) \
        .map(after)

    def unsuperve_before(text_a, teacher_text_a):
        teacher_sentence, teacher_segments = utils.data.build_1_pair(teacher_text_a, max_seqlen=args.teacher_max_seqlen, cls_id=teacher_cls_id, sep_id=teacher_sep_id)
        sentence_a = text_a[: args.max_seqlen]
        return sentence_a, teacher_sentence, teacher_segments

    client = InferenceClient(args.teacher_host, batch_size=args.server_batch_size, num_coroutine=args.num_coroutine)
    log.info('teacher host %s' % args.teacher_host)

    def ask_teacher_for_label(sentence_a, teacher_sentence, teacher_segments):
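        # Query the propeller inference server for the teacher's scores on this batch.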
sentence_a, teacher_sentence, teacher_segments = utils.data.expand_dims(sentence_a, teacher_sentence, teacher_segments)
teacher_label, = client(teacher_sentence, teacher_segments)
teacher_label = teacher_label[:, :]
return sentence_a, teacher_label
unsup_train_ds = unsupervise_feature_column.build_dataset('unsup_train', data_dir=os.path.join(args.data_dir, 'unsup_train_aug'), shuffle=True, repeat=True, use_gz=False) \
.buffered(100) \
.map(unsuperve_before) \
.padded_batch(hparams.batch_size, (0, 0, 0)) \
.map(ask_teacher_for_label)
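# mix supervised batches with teacher-labeled batches so each training step
# sees both signals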
train_ds = utils.data.interleave(train_ds, unsup_train_ds)
shapes = ([-1, args.max_seqlen, 1], [-1, hparams.num_label])
types = ('int64', 'float32')
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
'''
from tqdm import tqdm
for slots in tqdm(train_ds):
pass
'''
best_exporter = propeller.train.exporter.BestExporter(os.path.join(run_config.model_dir, 'best'), cmp_fn=lambda old, new: new['dev']['acc'] > old['dev']['acc'])
propeller.train.train_and_eval(
model_class_or_model_fn=ClassificationBowModel,
params=hparams,
run_config=run_config,
train_dataset=train_ds,
eval_dataset={'dev': dev_ds},
exporters=[best_exporter])
print('dev_acc3\t%.5f' % (best_exporter._best['dev']['acc']))
set -x
export PYTHONPATH=.:./ernie/:${PYTHONPATH:-}
output_dir=./output/distill
teacher_dir=${output_dir}/teacher
student_dir=${output_dir}/student
# 1. finetune teacher
CUDA_VISIBLE_DEVICES=0 \
python3 -u ./example/finetune_classifier.py \
--data_dir ${TASK_DATA_PATH}/distill/chnsenticorp/teacher \
--warm_start_from ${MODEL_PATH}/params \
--vocab_file ${MODEL_PATH}/vocab.txt \
--max_seqlen 128 \
--run_config '{
"model_dir": "'${teacher_dir}'",
"max_steps": '$((10 * 9600 / 32))',
"save_steps": 100,
"log_steps": 10,
"max_ckpt": 1,
"skip_steps": 0,
"eval_steps": 100
}' \
--hparam ${MODEL_PATH}/ernie_config.json \
--hparam '{ # model definition
"sent_type_vocab_size": None, # default term in official config
"use_task_id": False,
"task_id": 0,
}' \
--hparam '{ # learn
"warmup_proportion": 0.1,
"weight_decay": 0.01,
"use_fp16": 0,
"learning_rate": 0.00005,
"num_label": 2,
"batch_size": 32
}'
(($?!=0)) && echo "Something went wrong at Step 1, please check" && exit -1
# 2. start a prediction server
export CUDA_VISIBLE_DEVICES=0
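# pipe the unlabeled, augmented student corpus through the finetuned teacher
# in --do_predict mode to collect its predicted label distributions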
cat ${TASK_DATA_PATH}/distill/chnsenticorp/student/unsup_train_aug/part.0 |awk -F"\t" '{print $2}' |python3 -u ./example/finetune_classifier.py \
--do_predict \
--data_dir ${TASK_DATA_PATH}/distill/chnsenticorp/teacher \
--warm_start_from ${MODEL_PATH}/params \
--vocab_file ${MODEL_PATH}/vocab.txt \
--max_seqlen 128 \
--run_config '{
"model_dir": "'${teacher_dir}'",
"log_steps": 10,
}' \
--hparam ${MODEL_PATH}/ernie_config.json \
--hparam '{ # model definition
"sent_type_vocab_size": None, # default term in official config
"use_task_id": False,
"task_id": 0,
}' \
--hparam '{ # learn
"warmup_proportion": 0.1,
"weight_decay": 0.01,
"use_fp16": 0,
"learning_rate": 0.00005,
"num_label": 2,
"batch_size": 100
}' > prediction_label
(($?!=0)) && echo "Something went wrong at Step 2, please check" && exit -1
mkdir prediction_output
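# pair every augmented sentence with the teacher's prediction to build the
# unsupervised distillation set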
paste ${TASK_DATA_PATH}/distill/chnsenticorp/student/unsup_train_aug/part.0 prediction_label |awk -F"\t" '{print $2"\t"$3}' > prediction_output/part.0
# 3. learn from teacher
export CUDA_VISIBLE_DEVICES=0
python3 ./distill/distill_chnsentocorp.py \
--data_dir ${TASK_DATA_PATH}/distill/chnsenticorp/student \
--vocab_file ${TASK_DATA_PATH}/distill/chnsenticorp/student/vocab.txt \
--unsupervise_data_dir ./prediction_output/ \
--max_seqlen 128 \
--run_config '{
"model_dir": "'${student_dir}'",
"max_steps": '$((100 * 9600 / 100))',
"save_steps": 1000,
"log_steps": 10,
"max_ckpt": 1,
"skip_steps": 0,
"eval_steps": 100
}' \
--hparam '{
"num_label": 2,
"vocab_size": 35000,
"emb_size": 128,
"initializer_range": 0.02,
}' \
--hparam '{ # learn
"warmup_proportion": 0.1,
"weight_decay": 0.00,
"learning_rate": 1e-4,
"batch_size": 100
}'
(($?!=0)) && echo "Something went wrong at Step 3, please check" && exit -1
set -x
export PYTHONPATH=.:./ernie/:${PYTHONPATH:-}
output_dir=./output/distill
teacher_dir=${output_dir}/teacher
student_dir=${output_dir}/student
# 1. finetune teacher
CUDA_VISIBLE_DEVICES=0 \
python3 -u ./example/finetune_classifier.py \
--data_dir ${TASK_DATA_PATH}/distill/chnsenticorp/teacher \
--warm_start_from ${MODEL_PATH}/params \
--vocab_file ${MODEL_PATH}/vocab.txt \
--max_seqlen 128 \
--run_config '{
"model_dir": "'${teacher_dir}'",
"max_steps": '$((10 * 9600 / 32))',
"save_steps": 100,
"log_steps": 10,
"max_ckpt": 1,
"skip_steps": 0,
"eval_steps": 100
}' \
--hparam ${MODEL_PATH}/ernie_config.json \
--hparam '{ # model definition
"sent_type_vocab_size": None, # default term in official config
"use_task_id": False,
"task_id": 0,
}' \
--hparam '{ # learn
"warmup_proportion": 0.1,
"weight_decay": 0.01,
"use_fp16": 0,
"learning_rate": 0.00005,
"num_label": 2,
"batch_size": 32
}'
(($?!=0)) && echo "Something went wrong at Step 1, please check" && exit -1
# 2. start a prediction server
CUDA_VISIBLE_DEVICES=1 \
python3 -m propeller.tools.start_server -p 8113 -m ${teacher_dir}/best/inference/ &
echo $! > pid.server
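# give the server a moment to load the inference model before the student
# starts querying it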
sleep 10
# 3. learn from teacher
export CUDA_VISIBLE_DEVICES=0
python3 ./distill/distill_chnsentocorp_with_propeller_server.py \
--data_dir ${TASK_DATA_PATH}/distill/chnsenticorp/student \
--vocab_file ${TASK_DATA_PATH}/distill/chnsenticorp/student/vocab.txt \
--teacher_vocab_file ${MODEL_PATH}/vocab.txt \
--max_seqlen 128 \
--teacher_max_seqlen 128 \
--server_batch_size 64 \
--teacher_host tcp://localhost:8113 \
--num_coroutine 10 \
--run_config '{
"model_dir": "'${student_dir}'",
"max_steps": '$((100 * 9600 / 100))',
"save_steps": 1000,
"log_steps": 10,
"max_ckpt": 1,
"skip_steps": 0,
"eval_steps": 100
}' \
--hparam '{ # model definition
"num_label": 2,
"vocab_size": 35000,
"emb_size": 128,
"initializer_range": 0.02,
}' \
--hparam '{ # learn
"warmup_proportion": 0.1,
"weight_decay": 0.00,
"learning_rate": 1e-4,
"batch_size": 100
}'
(($?!=0)) && echo "Something went wrong at Step 3, please check" && exit -1
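# stop the teacher server started in step 2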
ps -ef|grep 'propeller.tools.start_server' |awk '{print $2}'|xargs kill -9
@@ -208,7 +208,7 @@ def pad_batch_data(insts,
     if return_seq_lens:
         seq_lens = np.array([len(inst) for inst in insts])
-        return_list += [seq_lens.astype("int64").reshape([-1, 1])]
+        return_list += [seq_lens.astype("int64").reshape([-1])]

     return return_list if len(return_list) > 1 else return_list[0]
......
@@ -22,13 +22,16 @@ import argparse
 import numpy as np
 import multiprocessing
+import logging

 import paddle.fluid as fluid

 import reader.task_reader as task_reader
-from model.ernie import ErnieConfig, ErnieModel
-from utils.args import ArgumentGroup, print_arguments
+from model.ernie_v1 import ErnieConfig, ErnieModel
+from utils.args import ArgumentGroup, print_arguments, prepare_logger
 from utils.init import init_pretraining_params

+log = logging.getLogger()
+
 # yapf: disable
 parser = argparse.ArgumentParser(__doc__)
 model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
@@ -52,24 +55,21 @@ run_type_g.add_arg("use_cuda", bool, True, "If set, use G
 def create_model(args, pyreader_name, ernie_config):
-    pyreader = fluid.layers.py_reader(
-        capacity=50,
-        shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                [-1, args.max_seq_len, 1], [-1, 1]],
-        dtypes=['int64', 'int64', 'int64', 'int64', 'float', 'int64'],
-        lod_levels=[0, 0, 0, 0, 0, 0],
-        name=pyreader_name,
-        use_double_buffer=True)
-
-    (src_ids, sent_ids, pos_ids, task_ids, input_mask,
-     seq_lens) = fluid.layers.read_file(pyreader)
+    src_ids = fluid.layers.data(name='1', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    sent_ids = fluid.layers.data(name='2', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    pos_ids = fluid.layers.data(name='3', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    task_ids = fluid.layers.data(name='4', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    input_mask = fluid.layers.data(name='5', shape=[-1, args.max_seq_len, 1], dtype='float32')
+    seq_lens = fluid.layers.data(name='8', shape=[-1], dtype='int64')
+
+    pyreader = fluid.io.DataLoader.from_generator(feed_list=[src_ids, sent_ids, pos_ids, task_ids, input_mask, seq_lens],
+                                                  capacity=70,
+                                                  iterable=False)

     ernie = ErnieModel(
         src_ids=src_ids,
         position_ids=pos_ids,
         sentence_ids=sent_ids,
-        task_ids=task_ids,
         input_mask=input_mask,
         config=ernie_config)
@@ -129,8 +129,6 @@ def main(args):
     pyreader, graph_vars = create_model(
         args, pyreader_name='reader', ernie_config=ernie_config)

-    fluid.memory_optimize(input_program=infer_program)
-
     infer_program = infer_program.clone(for_test=True)

     exe.run(startup_prog)
@@ -145,7 +143,7 @@ def main(args):
     exec_strategy = fluid.ExecutionStrategy()
     exec_strategy.num_threads = dev_count

-    pyreader.decorate_tensor_provider(data_generator)
+    pyreader.set_batch_generator(data_generator)
     pyreader.start()

     total_cls_emb = []
@@ -169,15 +167,21 @@ def main(args):
     total_cls_emb = np.concatenate(total_cls_emb)
     total_top_layer_emb = np.concatenate(total_top_layer_emb)

+    if not os.path.exists(args.output_dir):
+        os.mkdir(args.output_dir)
+    else:
+        raise RuntimeError('output dir exists: %s' % args.output_dir)
+
     with open(os.path.join(args.output_dir, "cls_emb.npy"),
-              "w") as cls_emb_file:
+              "wb") as cls_emb_file:
         np.save(cls_emb_file, total_cls_emb)
     with open(os.path.join(args.output_dir, "top_layer_emb.npy"),
-              "w") as top_layer_emb_file:
+              "wb") as top_layer_emb_file:
         np.save(top_layer_emb_file, total_top_layer_emb)

 if __name__ == '__main__':
+    prepare_logger(log)
     args = parser.parse_args()
     print_arguments(args)
......
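The hunks above and below all apply the same `py_reader` → `DataLoader` migration. A minimal sketch of the pattern, assuming the Paddle 1.x fluid API used throughout this repo (placeholder names and shapes below are illustrative only):

```python
import paddle.fluid as fluid

# Before: fluid.layers.py_reader owned shapes/dtypes/lod_levels, and tensors
# were unpacked with fluid.layers.read_file(pyreader).
# After: declare ordinary data placeholders, then wrap them in a DataLoader.
src_ids = fluid.layers.data(
    name='src_ids', shape=[-1, 128, 1], dtype='int64')
input_mask = fluid.layers.data(
    name='input_mask', shape=[-1, 128, 1], dtype='float32')

loader = fluid.io.DataLoader.from_generator(
    feed_list=[src_ids, input_mask],  # placeholders replace shapes/dtypes
    capacity=70,
    iterable=False)  # non-iterable: drive it via loader.start() / reset()

# pyreader.decorate_tensor_provider(gen) becomes loader.set_batch_generator(gen)
```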
@@ -16,8 +16,11 @@
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+from __future__ import unicode_literals
+from __future__ import absolute_import

 import time
+import logging
 import numpy as np
 from scipy.stats import pearsonr, spearmanr
@@ -26,6 +29,7 @@ import paddle.fluid as fluid
 from model.ernie import ErnieModel

+log = logging.getLogger(__name__)

 def create_model(args,
                  pyreader_name,
@@ -35,34 +39,22 @@ def create_model(args,
                  is_classify=False,
                  is_regression=False,
                  ernie_version="1.0"):
+    src_ids = fluid.layers.data(name='eval_placeholder_0', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    sent_ids = fluid.layers.data(name='eval_placeholder_1', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    pos_ids = fluid.layers.data(name='eval_placeholder_2', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    input_mask = fluid.layers.data(name='eval_placeholder_3', shape=[-1, args.max_seq_len, 1], dtype='float32')
+    task_ids = fluid.layers.data(name='eval_placeholder_4', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    qids = fluid.layers.data(name='eval_placeholder_5', shape=[-1, 1], dtype='int64')
+
     if is_classify:
-        pyreader = fluid.layers.py_reader(
-            capacity=50,
-            shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                    [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                    [-1, args.max_seq_len, 1], [-1, 1], [-1, 1]],
-            dtypes=[
-                'int64', 'int64', 'int64', 'int64', 'float32', 'int64', 'int64'
-            ],
-            lod_levels=[0, 0, 0, 0, 0, 0, 0],
-            name=task_name + "_" + pyreader_name,
-            use_double_buffer=True)
+        labels = fluid.layers.data(name='6', shape=[-1, 1], dtype='int64')
     elif is_regression:
-        pyreader = fluid.layers.py_reader(
-            capacity=50,
-            shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                    [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                    [-1, args.max_seq_len, 1], [-1, 1], [-1, 1]],
-            dtypes=[
-                'int64', 'int64', 'int64', 'int64', 'float32', 'float32',
-                'int64'
-            ],
-            lod_levels=[0, 0, 0, 0, 0, 0, 0],
-            name=task_name + "_" + pyreader_name,
-            use_double_buffer=True)
-
-    (src_ids, sent_ids, pos_ids, task_ids, input_mask, labels,
-     qids) = fluid.layers.read_file(pyreader)
+        labels = fluid.layers.data(name='6', shape=[-1, 1], dtype='float32')
+
+    pyreader = fluid.io.DataLoader.from_generator(feed_list=[src_ids, sent_ids, pos_ids, task_ids, input_mask, labels, qids],
+                                                  capacity=70,
+                                                  iterable=False)

     ernie = ErnieModel(
         src_ids=src_ids,
@@ -88,8 +80,12 @@ def create_model(args,
             name=task_name + "_cls_out_b",
             initializer=fluid.initializer.Constant(0.)))

+    assert is_classify != is_regression, 'is_classify or is_regression must be true and only one of them can be true'
     if is_prediction:
+        if is_classify:
             probs = fluid.layers.softmax(logits)
+        else:
+            probs = logits
         feed_targets_name = [
             src_ids.name, sent_ids.name, pos_ids.name, input_mask.name
         ]
@@ -97,7 +93,6 @@ def create_model(args,
             feed_targets_name += [task_ids.name]
         return pyreader, probs, feed_targets_name

-    assert is_classify != is_regression, 'is_classify or is_regression must be true and only one of them can be true'
     num_seqs = fluid.layers.create_tensor(dtype='int64')
     if is_classify:
         ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
......
@@ -16,12 +16,15 @@
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+from __future__ import unicode_literals
+from __future__ import absolute_import

 import time
 import numpy as np
 import os
 import math
 import json
+import logging
 import collections
 import six
@@ -34,21 +37,21 @@ from model.ernie import ErnieModel
 import tokenization

+log = logging.getLogger(__name__)

 def create_model(args, pyreader_name, ernie_config, is_training):
-    pyreader = fluid.layers.py_reader(
-        capacity=50,
-        shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                [-1, args.max_seq_len, 1], [-1, 1], [-1, 1], [-1, 1]],
-        dtypes=[
-            'int64', 'int64', 'int64', 'int64', 'float32', 'int64', 'int64',
-            'int64'
-        ],
-        lod_levels=[0, 0, 0, 0, 0, 0, 0, 0],
-        name=pyreader_name,
-        use_double_buffer=True)
-
-    (src_ids, sent_ids, pos_ids, task_ids, input_mask, start_positions,
-     end_positions, unique_id) = fluid.layers.read_file(pyreader)
+    src_ids = fluid.layers.data(name='1', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    pos_ids = fluid.layers.data(name='2', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    sent_ids = fluid.layers.data(name='3', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    task_ids = fluid.layers.data(name='4', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    input_mask = fluid.layers.data(name='5', shape=[-1, args.max_seq_len, 1], dtype='float32')
+    start_positions = fluid.layers.data(name='6', shape=[-1, 1], dtype='int64')
+    end_positions = fluid.layers.data(name='7', shape=[-1, 1], dtype='int64')
+    unique_id = fluid.layers.data(name='8', shape=[-1, 1], dtype='int64')
+
+    pyreader = fluid.io.DataLoader.from_generator(feed_list=[
+        src_ids, sent_ids, pos_ids, task_ids, input_mask, start_positions,
+        end_positions, unique_id], capacity=50, iterable=False)

     ernie = ErnieModel(
         src_ids=src_ids,
@@ -151,7 +154,7 @@ def evaluate(exe,
             program=test_program, fetch_list=fetch_list)
         for idx in range(np_unique_ids.shape[0]):
             if len(all_results) % 1000 == 0:
-                print("Processing example: %d" % len(all_results))
+                log.info("Processing example: %d" % len(all_results))
             unique_id = int(np_unique_ids[idx])
             start_logits = [float(x) for x in np_start_logits[idx].flat]
             end_logits = [float(x) for x in np_end_logits[idx].flat]
@@ -179,7 +182,7 @@ def evaluate(exe,
         time_end = time.time()
         elapsed_time = time_end - time_begin

-        print(
+        log.info(
             "[%s evaluation] em: %f, f1: %f, avg: %f, questions: %d, elapsed time: %f"
             % (eval_phase, em, f1, avg, total, elapsed_time))
@@ -188,8 +191,8 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
                       max_answer_length, do_lower_case, output_prediction_file,
                       output_nbest_file):
     """Write final predictions to the json file and log-odds of null if needed."""
-    print("Writing predictions to: %s" % (output_prediction_file))
-    print("Writing nbest to: %s" % (output_nbest_file))
+    log.info("Writing predictions to: %s" % (output_prediction_file))
+    log.info("Writing nbest to: %s" % (output_nbest_file))

     example_index_to_features = collections.defaultdict(list)
     for feature in all_features:
......
@@ -15,6 +15,9 @@
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+from __future__ import unicode_literals
+from __future__ import absolute_import

 import os
 import time
@@ -23,28 +26,27 @@ import numpy as np
 import multiprocessing

 import paddle
+import logging
 import paddle.fluid as fluid
 from six.moves import xrange

 from model.ernie import ErnieModel

+log = logging.getLogger(__name__)

 def create_model(args, pyreader_name, ernie_config, is_prediction=False):
-    pyreader = fluid.layers.py_reader(
-        capacity=50,
-        shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
-                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], [-1, 1]],
-        dtypes=[
-            'int64', 'int64', 'int64', 'int64', 'float32', 'int64', 'int64'
-        ],
-        lod_levels=[0, 0, 0, 0, 0, 0, 0],
-        name=pyreader_name,
-        use_double_buffer=True)
-
-    (src_ids, sent_ids, pos_ids, task_ids, input_mask, labels,
-     seq_lens) = fluid.layers.read_file(pyreader)
+    src_ids = fluid.layers.data(name='1', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    sent_ids = fluid.layers.data(name='2', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    pos_ids = fluid.layers.data(name='3', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    task_ids = fluid.layers.data(name='4', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    input_mask = fluid.layers.data(name='5', shape=[-1, args.max_seq_len, 1], dtype='float32')
+    labels = fluid.layers.data(name='7', shape=[-1, args.max_seq_len, 1], dtype='int64')
+    seq_lens = fluid.layers.data(name='8', shape=[-1], dtype='int64')
+
+    pyreader = fluid.io.DataLoader.from_generator(feed_list=[src_ids, sent_ids, pos_ids, task_ids, input_mask, labels, seq_lens],
+                                                  capacity=70,
+                                                  iterable=False)

     ernie = ErnieModel(
         src_ids=src_ids,
@@ -70,9 +72,7 @@ def create_model(args, pyreader_name, ernie_config, is_prediction=False):
             initializer=fluid.initializer.Constant(0.)))

     infers = fluid.layers.argmax(logits, axis=2)
-    ret_labels = fluid.layers.reshape(x=labels, shape=[-1, 1])
     ret_infers = fluid.layers.reshape(x=infers, shape=[-1, 1])
-
     lod_labels = fluid.layers.sequence_unpad(labels, seq_lens)
     lod_infers = fluid.layers.sequence_unpad(infers, seq_lens)
@@ -92,18 +92,14 @@ def create_model(args, pyreader_name, ernie_config, is_prediction=False):
     ce_loss = ce_loss * input_mask
     loss = fluid.layers.mean(x=ce_loss)

-    if args.use_fp16 and args.loss_scaling > 1.0:
-        loss *= args.loss_scaling
-
     graph_vars = {
+        "inputs": src_ids,
         "loss": loss,
         "probs": probs,
-        "labels": ret_labels,
-        "infers": ret_infers,
+        "seqlen": seq_lens,
         "num_infer": num_infer,
         "num_label": num_label,
         "num_correct": num_correct,
-        "seq_lens": seq_lens
     }

     for k, v in graph_vars.items():
@@ -112,92 +108,12 @@ def create_model(args, pyreader_name, ernie_config, is_prediction=False):
     return pyreader, graph_vars

-def chunk_eval(np_labels, np_infers, np_lens, tag_num, dev_count=1):
-    def extract_bio_chunk(seq):
-        chunks = []
-        cur_chunk = None
-        null_index = tag_num - 1
-        for index in xrange(len(seq)):
-            tag = seq[index]
-            tag_type = tag // 2
-            tag_pos = tag % 2
-
-            if tag == null_index:
-                if cur_chunk is not None:
-                    chunks.append(cur_chunk)
-                cur_chunk = None
-                continue
-
-            if tag_pos == 0:
-                if cur_chunk is not None:
-                    chunks.append(cur_chunk)
-                    cur_chunk = {}
-                cur_chunk = {"st": index, "en": index + 1, "type": tag_type}
-            else:
-                if cur_chunk is None:
-                    cur_chunk = {"st": index, "en": index + 1, "type": tag_type}
-                    continue
-
-                if cur_chunk["type"] == tag_type:
-                    cur_chunk["en"] = index + 1
-                else:
-                    chunks.append(cur_chunk)
-                    cur_chunk = {"st": index, "en": index + 1, "type": tag_type}
-
-        if cur_chunk is not None:
-            chunks.append(cur_chunk)
-        return chunks
-
-    null_index = tag_num - 1
-    num_label = 0
-    num_infer = 0
-    num_correct = 0
-    labels = np_labels.reshape([-1]).astype(np.int32).tolist()
-    infers = np_infers.reshape([-1]).astype(np.int32).tolist()
-    all_lens = np_lens.reshape([dev_count, -1]).astype(np.int32).tolist()
-
-    base_index = 0
-    for dev_index in xrange(dev_count):
-        lens = all_lens[dev_index]
-        max_len = 0
-        for l in lens:
-            max_len = max(max_len, l)
-
-        for i in xrange(len(lens)):
-            seq_st = base_index + i * max_len + 1
-            seq_en = seq_st + (lens[i] - 2)
-            infer_chunks = extract_bio_chunk(infers[seq_st:seq_en])
-            label_chunks = extract_bio_chunk(labels[seq_st:seq_en])
-            num_infer += len(infer_chunks)
-            num_label += len(label_chunks)
-
-            infer_index = 0
-            label_index = 0
-            while label_index < len(label_chunks) \
-                   and infer_index < len(infer_chunks):
-                if infer_chunks[infer_index]["st"] \
-                    < label_chunks[label_index]["st"]:
-                    infer_index += 1
-                elif infer_chunks[infer_index]["st"] \
-                    > label_chunks[label_index]["st"]:
-                    label_index += 1
-                else:
-                    if infer_chunks[infer_index]["en"] \
-                        == label_chunks[label_index]["en"] \
-                        and infer_chunks[infer_index]["type"] \
-                        == label_chunks[label_index]["type"]:
-                        num_correct += 1
-
-                    infer_index += 1
-                    label_index += 1
-
-        base_index += max_len * len(lens)
-
-    return num_label, num_infer, num_correct
-
-    num_infer = np.sum(num_infer)
-    num_label = np.sum(num_label)
-    num_correct = np.sum(num_correct)
-
 def calculate_f1(num_label, num_infer, num_correct):
     if num_infer == 0:
         precision = 0.0
     else:
@@ -220,34 +136,12 @@ def evaluate(exe,
              pyreader,
              graph_vars,
              tag_num,
-             eval_phase,
              dev_count=1):
     fetch_list = [
         graph_vars["num_infer"].name, graph_vars["num_label"].name,
         graph_vars["num_correct"].name
     ]

-    if eval_phase == "train":
-        fetch_list.append(graph_vars["loss"].name)
-        if "learning_rate" in graph_vars:
-            fetch_list.append(graph_vars["learning_rate"].name)
-        outputs = exe.run(fetch_list=fetch_list)
-        np_num_infer, np_num_label, np_num_correct, np_loss = outputs[:4]
-        num_label = np.sum(np_num_label)
-        num_infer = np.sum(np_num_infer)
-        num_correct = np.sum(np_num_correct)
-        precision, recall, f1 = calculate_f1(num_label, num_infer, num_correct)
-        rets = {
-            "precision": precision,
-            "recall": recall,
-            "f1": f1,
-            "loss": np.mean(np_loss)
-        }
-        if "learning_rate" in graph_vars:
-            rets["lr"] = float(outputs[4][0])
-        return rets
-
-    else:
     total_label, total_infer, total_correct = 0.0, 0.0, 0.0
     time_begin = time.time()
     pyreader.start()
@@ -266,7 +160,59 @@ def evaluate(exe,
     precision, recall, f1 = calculate_f1(total_label, total_infer,
                                          total_correct)
     time_end = time.time()
-    print(
-        "[%s evaluation] f1: %f, precision: %f, recall: %f, elapsed time: %f s"
-        % (eval_phase, f1, precision, recall, time_end - time_begin))
+    return \
+        "[evaluation] f1: %f, precision: %f, recall: %f, elapsed time: %f s" \
+        % (f1, precision, recall, time_end - time_begin)
+
+
+def chunk_predict(np_inputs, np_probs, np_lens, dev_count=1):
+    inputs = np_inputs.reshape([-1]).astype(np.int32)
+    probs = np_probs.reshape([-1, np_probs.shape[-1]])
+    all_lens = np_lens.reshape([dev_count, -1]).astype(np.int32).tolist()
+    base_index = 0
+    out = []
+    for dev_index in xrange(dev_count):
+        lens = all_lens[dev_index]
+        max_len = 0
+        for l in lens:
+            max_len = max(max_len, l)
+
+        for i in xrange(len(lens)):
+            seq_st = base_index + i * max_len + 1
+            seq_en = seq_st + (lens[i] - 2)
+            prob = probs[seq_st:seq_en, :]
+            infers = np.argmax(prob, -1)
+            out.append((
+                inputs[seq_st:seq_en].tolist(),
+                infers.tolist(),
+                prob.tolist()))
+        base_index += max_len * len(lens)
+    return out
+
+
+def predict(exe,
+            test_program,
+            test_pyreader,
+            graph_vars,
+            dev_count=1):
+    fetch_list = [
+        graph_vars["inputs"].name,
+        graph_vars["probs"].name,
+        graph_vars["seqlen"].name,
+    ]
+
+    test_pyreader.start()
+    res = []
+    while True:
+        try:
+            inputs, probs, np_lens = exe.run(program=test_program,
+                                             fetch_list=fetch_list)
+            r = chunk_predict(inputs, probs, np_lens, dev_count)
+            res += r
+        except fluid.core.EOFException:
+            test_pyreader.reset()
+            break
+    return res
@@ -11,10 +11,12 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.

 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+from __future__ import unicode_literals
+from __future__ import absolute_import

 import os
 import time
@@ -47,10 +49,21 @@ train_g.add_arg("warmup_proportion", float, 0.1,
 train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
 train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
 train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
-train_g.add_arg("loss_scaling", float, 1.0,
+train_g.add_arg("use_dynamic_loss_scaling", bool, True, "Whether to use dynamic loss scaling.")
+train_g.add_arg("init_loss_scaling", float, 102400,
                 "Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
-train_g.add_arg("test_save", str, "test_result", "test_save")
+train_g.add_arg("test_save", str, "./checkpoints/test_result", "test_save")
 train_g.add_arg("metric", str, "simple_accuracy", "metric")
+train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive.")
+train_g.add_arg("decr_every_n_nan_or_inf", int, 2,
+                "Decreases loss scaling every n accumulated steps with nan or inf gradients.")
+train_g.add_arg("incr_ratio", float, 2.0,
+                "The multiplier to use when increasing the loss scaling.")
+train_g.add_arg("decr_ratio", float, 0.8,
+                "The less-than-one-multiplier to use when decreasing.")

 log_g = ArgumentGroup(parser, "logging", "logging related.")
 log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
@@ -86,6 +99,7 @@ data_g.add_arg("chunk_scheme", type=str, default="IOB", choices=["IO", "IOB", "
 run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
 run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
+run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
 run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
 run_type_g.add_arg("num_iteration_per_drop_scope", int, 10, "Iteration intervals to drop scope.")
 run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import absolute_import
import sys
import subprocess
import os
import six
import copy
import argparse
import time
import logging
from utils.args import ArgumentGroup, print_arguments, prepare_logger
from finetune_args import parser as worker_parser
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
multip_g = ArgumentGroup(parser, "multiprocessing",
"start paddle training using multi-processing mode.")
multip_g.add_arg("node_ips", str, None,
"paddle trainer ips")
multip_g.add_arg("node_id", int, 0,
"the trainer id of the node for multi-node distributed training.")
multip_g.add_arg("print_config", bool, True,
"print the config of multi-processing mode.")
multip_g.add_arg("current_node_ip", str, None,
"the ip of current node.")
multip_g.add_arg("split_log_path", str, "log",
"log path for each trainer.")
multip_g.add_arg("log_prefix", str, "",
"the prefix name of job log.")
multip_g.add_arg("nproc_per_node", int, 8,
"the number of process to use on each node.")
multip_g.add_arg("selected_gpus", str, "0,1,2,3,4,5,6,7",
"the gpus selected to use.")
multip_g.add_arg("training_script", str, None, "the program/script to be lauched "
"in parallel followed by all the arguments", positional_arg=True)
multip_g.add_arg("training_script_args", str, None,
"training script args", positional_arg=True, nargs=argparse.REMAINDER)
# yapf: enable
log = logging.getLogger()
def start_procs(args):
procs = []
log_fns = []
default_env = os.environ.copy()
node_id = args.node_id
node_ips = [x.strip() for x in args.node_ips.split(',')]
current_ip = args.current_node_ip
if args.current_node_ip is None:
assert len(node_ips) == 1
current_ip = node_ips[0]
log.info(current_ip)
num_nodes = len(node_ips)
selected_gpus = [x.strip() for x in args.selected_gpus.split(',')]
selected_gpu_num = len(selected_gpus)
all_trainer_endpoints = ""
for ip in node_ips:
for i in range(args.nproc_per_node):
if all_trainer_endpoints != "":
all_trainer_endpoints += ","
all_trainer_endpoints += "%s:617%d" % (ip, i)
nranks = num_nodes * args.nproc_per_node
# intended: ceil(selected_gpu_num / nproc_per_node) GPUs per process
gpus_per_proc = selected_gpu_num % args.nproc_per_node
if gpus_per_proc == 0:
gpus_per_proc = selected_gpu_num // args.nproc_per_node
else:
gpus_per_proc = selected_gpu_num // args.nproc_per_node + 1
selected_gpus_per_proc = [selected_gpus[i:i + gpus_per_proc] for i in range(0, len(selected_gpus), gpus_per_proc)]
if args.print_config:
log.info("all_trainer_endpoints: %s"
", node_id: %s"
", current_ip: %s"
", num_nodes: %s"
", node_ips: %s"
", gpus_per_proc: %s"
", selected_gpus_per_proc: %s"
", nranks: %s" % (
all_trainer_endpoints,
node_id,
current_ip,
num_nodes,
node_ips,
gpus_per_proc,
selected_gpus_per_proc,
nranks))
current_env = copy.copy(default_env)
procs = []
cmds = []
log_fns = []
for i in range(0, args.nproc_per_node):
trainer_id = node_id * args.nproc_per_node + i
assert current_ip is not None
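# the PADDLE_* variables tell each worker its rank, its own endpoint, and
# the full peer list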
current_env.update({
"FLAGS_selected_gpus": "%s" % ",".join([str(s) for s in selected_gpus_per_proc[i]]),
"PADDLE_TRAINER_ID" : "%d" % trainer_id,
"PADDLE_CURRENT_ENDPOINT": "%s:617%d" % (current_ip, i),
"PADDLE_TRAINERS_NUM": "%d" % nranks,
"PADDLE_TRAINER_ENDPOINTS": all_trainer_endpoints,
"PADDLE_NODES_NUM": "%d" % num_nodes
})
try:
idx = args.training_script_args.index('--is_distributed')
args.training_script_args[idx + 1] = 'true'
except ValueError:
args.training_script_args += ['--is_distributed', 'true']
cmd = [sys.executable, "-u",
args.training_script] + args.training_script_args
cmds.append(cmd)
if args.split_log_path:
logdir = "%s/%sjob.log.%d" % (args.split_log_path, args.log_prefix, trainer_id)
try:
os.mkdir(os.path.dirname(logdir))
except OSError:
pass
fn = open(logdir, "a")
log_fns.append(fn)
process = subprocess.Popen(cmd, env=current_env, stdout=fn, stderr=fn)
log.info('subprocess launched, check log at %s' % logdir)
else:
process = subprocess.Popen(cmd, env=current_env)
log.info('subprocess launched')
procs.append(process)
try:
for i in range(len(procs)):
proc = procs[i]
proc.wait()
if len(log_fns) > 0:
log_fns[i].close()
if proc.returncode != 0:
raise subprocess.CalledProcessError(returncode=procs[i].returncode,
cmd=cmds[i])
else:
log.info("proc %d finsh" % i)
except KeyboardInterrupt as e:
for p in procs:
log.info('killing %s' % p)
p.terminate()
def main(args):
if args.print_config:
print_arguments(args)
start_procs(args)
if __name__ == "__main__":
prepare_logger(log)
lanch_args = parser.parse_args()
finetuning_args = worker_parser.parse_args(
lanch_args.training_script_args)
init_path = finetuning_args.init_pretraining_params
log.info("init model: %s" % init_path)
if not finetuning_args.use_fp16:
os.system('rename .master "" ' + init_path + '/*.master')
main(lanch_args)
@@ -11,16 +11,18 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""Inference by """
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+from __future__ import unicode_literals
+from __future__ import absolute_import

 import os
 import time
 import argparse
 import numpy as np
+import logging
 import multiprocessing

 # NOTE(paddle-dev): All of these flags should be
@@ -39,14 +41,14 @@ from reader.task_reader import ClassifyReader
 from model.ernie import ErnieConfig
 from finetune.classifier import create_model

-from utils.args import ArgumentGroup, print_arguments
+from utils.args import print_arguments, check_cuda, prepare_logger, ArgumentGroup
 from utils.init import init_pretraining_params
 from finetune_args import parser

 # yapf: disable
 parser = argparse.ArgumentParser(__doc__)
 model_g = ArgumentGroup(parser, "model", "options to init, resume and save model.")
-model_g.add_arg("ernie_config_path", str, None, "Path to the json file for bert model config.")
+model_g.add_arg("ernie_config_path", str, None, "Path to the json file for ernie model config.")
 model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
 model_g.add_arg("save_inference_model_path", str, "inference_model", "If set, save the inference model to this path.")
 model_g.add_arg("use_fp16", bool, False, "Whether to resume parameters from fp16 checkpoint.")
@@ -66,6 +68,7 @@ run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for trai
 run_type_g.add_arg("do_prediction", bool, True, "Whether to do prediction on test set.")

 args = parser.parse_args()
+log = logging.getLogger()
 # yapf: enable.

 def main(args):
@@ -113,7 +116,7 @@ def main(args):
         _, ckpt_dir = os.path.split(args.init_checkpoint.rstrip('/'))
         dir_name = ckpt_dir + '_inference_model'
         model_path = os.path.join(args.save_inference_model_path, dir_name)
-        print("save inference model to %s" % model_path)
+        log.info("save inference model to %s" % model_path)
         fluid.io.save_inference_model(
             model_path,
             feed_target_names, [probs],
@@ -125,8 +128,12 @@ def main(args):
     #config = AnalysisConfig(os.path.join(model_path, "__model__"), os.path.join(model_path, ""))
     config = AnalysisConfig(model_path)
     if not args.use_cuda:
-        print("disable gpu")
+        log.info("disable gpu")
         config.disable_gpu()
+        config.switch_ir_optim(True)
+    else:
+        log.info("using gpu")
+        config.enable_use_gpu(1024)

     # Create PaddlePredictor
     predictor = create_paddle_predictor(config)
@@ -137,7 +144,7 @@ def main(args):
         epoch=1,
         shuffle=False)

-    print("-------------- prediction results --------------")
+    log.info("-------------- prediction results --------------")
     np.set_printoptions(precision=4, suppress=True)
     index = 0
     total_time = 0
@@ -156,32 +163,20 @@ def main(args):
         # parse outputs
         output = outputs[0]
-        print(output.name)
-        output_data = output.data.float_data()
-        #assert len(output_data) == args.num_labels * args.batch_size
-        batch_result = np.array(output_data).reshape((-1, args.num_labels))
+        batch_result = output.as_ndarray()
         for single_example_probs in batch_result:
-            print("{} example\t{}".format(index, single_example_probs))
+            print('\t'.join(map(str, single_example_probs.tolist())))
             index += 1
-    print("qps:{}\ttotal_time:{}\ttotal_example:{}\tbatch_size:{}".format(index/total_time, total_time, index, args.batch_size))
+    log.info("qps:{}\ttotal_time:{}\ttotal_example:{}\tbatch_size:{}".format(index/total_time, total_time, index, args.batch_size))

 def array2tensor(ndarray):
     """ convert numpy array to PaddleTensor"""
     assert isinstance(ndarray, np.ndarray), "input type must be np.ndarray"
-    tensor = PaddleTensor()
-    tensor.name = "data"
-    tensor.shape = ndarray.shape
-    if "float" in str(ndarray.dtype):
-        tensor.dtype = PaddleDType.FLOAT32
-    elif "int" in str(ndarray.dtype):
-        tensor.dtype = PaddleDType.INT64
-    else:
-        raise ValueError("{} type ndarray is unsupported".format(tensor.dtype))
-    tensor.data = PaddleBuf(ndarray.flatten().tolist())
+    tensor = PaddleTensor(data=ndarray)
     return tensor

 if __name__ == '__main__':
+    prepare_logger(log)
     print_arguments(args)
     main(args)
@@ -16,14 +16,19 @@
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+from __future__ import unicode_literals
+from __future__ import absolute_import

 import json
 import six
+import logging
 import paddle.fluid as fluid

+from io import open
+from paddle.fluid.layers import core
 from model.transformer_encoder import encoder, pre_process_layer

+log = logging.getLogger(__name__)

 class ErnieConfig(object):
     def __init__(self, config_path):
@@ -31,7 +36,7 @@ class ErnieConfig(object):
     def _parse(self, config_path):
         try:
-            with open(config_path) as json_file:
+            with open(config_path, 'r', encoding='utf8') as json_file:
                 config_dict = json.load(json_file)
         except Exception:
             raise IOError("Error in parsing Ernie model config file '%s'" %
@@ -44,8 +49,8 @@ class ErnieConfig(object):
     def print_config(self):
         for arg, value in sorted(six.iteritems(self._config_dict)):
-            print('%s: %s' % (arg, value))
-        print('------------------------------------------------')
+            log.info('%s: %s' % (arg, value))
+        log.info('------------------------------------------------')

 class ErnieModel(object):
@@ -81,8 +86,8 @@ class ErnieModel(object):
         self._pos_emb_name = "pos_embedding"
         self._sent_emb_name = "sent_embedding"
         self._task_emb_name = "task_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
-        self._emb_dtype = "float32"
+        self._dtype = core.VarDesc.VarType.FP16 if use_fp16 else core.VarDesc.VarType.FP32
+        self._emb_dtype = core.VarDesc.VarType.FP32

         # Initialize all weigths by truncated normal initializer, and all biases
         # will be initialized by constant zero by default.
@@ -134,7 +139,7 @@ class ErnieModel(object):
         emb_out = pre_process_layer(
             emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')

-        if self._dtype == "float16":
+        if self._dtype == core.VarDesc.VarType.FP16:
             emb_out = fluid.layers.cast(x=emb_out, dtype=self._dtype)
             input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
         self_attn_mask = fluid.layers.matmul(
@@ -163,6 +168,10 @@ class ErnieModel(object):
             postprocess_cmd="dan",
             param_initializer=self._param_initializer,
             name='encoder')
+        if self._dtype == core.VarDesc.VarType.FP16:
+            self._enc_out = fluid.layers.cast(
+                x=self._enc_out, dtype=self._emb_dtype)

     def get_sequence_output(self):
         return self._enc_out
@@ -171,9 +180,6 @@ class ErnieModel(object):
         """Get the first feature of each sequence for classification"""
         next_sent_feat = fluid.layers.slice(
             input=self._enc_out, axes=[1], starts=[0], ends=[1])
-        if self._dtype == "float16":
-            next_sent_feat = fluid.layers.cast(
-                x=next_sent_feat, dtype=self._emb_dtype)
         next_sent_feat = fluid.layers.fc(
             input=next_sent_feat,
             size=self._emb_size,
@@ -194,8 +200,6 @@ class ErnieModel(object):
             x=self._enc_out, shape=[-1, self._emb_size])
         # extract masked tokens' feature
         mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
-        if self._dtype == "float16":
-            mask_feat = fluid.layers.cast(x=mask_feat, dtype=self._emb_dtype)

         # transform: fc
         mask_trans_feat = fluid.layers.fc(
......
@@ -16,14 +16,19 @@
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+from __future__ import unicode_literals
+from __future__ import absolute_import

 import json
+import logging
 import six
 import paddle.fluid as fluid

+from io import open
+from paddle.fluid.layers import core
 from model.transformer_encoder import encoder, pre_process_layer

+log = logging.getLogger(__name__)

 class ErnieConfig(object):
     def __init__(self, config_path):
@@ -31,7 +36,7 @@ class ErnieConfig(object):
     def _parse(self, config_path):
         try:
-            with open(config_path) as json_file:
+            with open(config_path, 'r', encoding='utf8') as json_file:
                 config_dict = json.load(json_file)
         except Exception:
             raise IOError("Error in parsing Ernie model config file '%s'" %
@@ -44,8 +49,8 @@ class ErnieConfig(object):
     def print_config(self):
         for arg, value in sorted(six.iteritems(self._config_dict)):
-            print('%s: %s' % (arg, value))
-        print('------------------------------------------------')
+            log.info('%s: %s' % (arg, value))
+        log.info('------------------------------------------------')

 class ErnieModel(object):
@@ -72,7 +77,7 @@ class ErnieModel(object):
         self._word_emb_name = "word_embedding"
         self._pos_emb_name = "pos_embedding"
         self._sent_emb_name = "sent_embedding"
-        self._dtype = "float16" if use_fp16 else "float32"
+        self._dtype = core.VarDesc.VarType.FP16 if use_fp16 else core.VarDesc.VarType.FP32

         # Initialize all weigths by truncated normal initializer, and all biases
         # will be initialized by constant zero by default.
@@ -110,7 +115,7 @@ class ErnieModel(object):
         emb_out = pre_process_layer(
             emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')

-        if self._dtype == "float16":
+        if self._dtype == core.VarDesc.VarType.FP16:
             input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
         self_attn_mask = fluid.layers.matmul(
             x=input_mask, y=input_mask, transpose_y=True)
......
@@ -16,10 +16,13 @@
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+from __future__ import unicode_literals
+from __future__ import absolute_import

 import numpy as np
 import paddle.fluid as fluid

-from utils.fp16 import create_master_params_grads, master_param_to_train_param
+from utils.fp16 import create_master_params_grads, master_param_to_train_param, apply_dynamic_loss_scaling

 def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
......
@@ -42,8 +42,18 @@ train_g.add_arg("warmup_steps", int, 5000, "Total steps to perform wa
 train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
 train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
 train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
-train_g.add_arg("loss_scaling", float, 1.0,
+train_g.add_arg("use_dynamic_loss_scaling", bool, True, "Whether to use dynamic loss scaling.")
+train_g.add_arg("init_loss_scaling", float, 102400,
                 "Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
+train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive.")
+train_g.add_arg("decr_every_n_nan_or_inf", int, 2,
+                "Decreases loss scaling every n accumulated steps with nan or inf gradients.")
+train_g.add_arg("incr_ratio", float, 2.0,
+                "The multiplier to use when increasing the loss scaling.")
+train_g.add_arg("decr_ratio", float, 0.8,
+                "The less-than-one-multiplier to use when decreasing.")

 log_g = ArgumentGroup(parser, "logging", "logging related.")
 log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
......
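The `use_dynamic_loss_scaling` flags above replace the old fixed `loss_scaling`. A minimal sketch of the update rule these arguments appear to configure (semantics inferred from the help strings, not the repo's actual `apply_dynamic_loss_scaling` implementation):

```python
def update_loss_scaling(state, found_inf,
                        incr_every_n_steps=100, decr_every_n_nan_or_inf=2,
                        incr_ratio=2.0, decr_ratio=0.8):
    """One training step of a dynamic loss-scaling policy.

    state: dict with 'scale' (starts at init_loss_scaling) and two counters.
    found_inf: True when this step produced nan/inf gradients (step skipped).
    """
    if found_inf:
        state['good'] = 0
        state['bad'] += 1
        if state['bad'] >= decr_every_n_nan_or_inf:   # repeated overflows:
            state['scale'] *= decr_ratio              # shrink the scale
            state['bad'] = 0
    else:
        state['bad'] = 0
        state['good'] += 1
        if state['good'] >= incr_every_n_steps:       # long stable run:
            state['scale'] *= incr_ratio              # grow the scale
            state['good'] = 0
    return state

# e.g. starting from init_loss_scaling=102400, 100 clean steps in a row
# double the scale, while two overflow steps in a row multiply it by 0.8.
```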
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import absolute_import
from __future__ import division
import sys
import subprocess
import os
import six
import copy
import argparse
import time
import logging
from utils.args import ArgumentGroup, print_arguments, prepare_logger
from pretrain_args import parser as worker_parser
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
multip_g = ArgumentGroup(parser, "multiprocessing",
"start paddle training using multi-processing mode.")
multip_g.add_arg("node_ips", str, None,
"paddle trainer ips")
multip_g.add_arg("node_id", int, 0,
"the trainer id of the node for multi-node distributed training.")
multip_g.add_arg("print_config", bool, True,
"print the config of multi-processing mode.")
multip_g.add_arg("current_node_ip", str, None,
"the ip of current node.")
multip_g.add_arg("split_log_path", str, "./log",
"log path for each trainer.")
multip_g.add_arg("log_prefix", str, "",
"the prefix name of job log.")
multip_g.add_arg("nproc_per_node", int, 8,
"the number of process to use on each node.")
multip_g.add_arg("selected_gpus", str, "0,1,2,3,4,5,6,7",
"the gpus selected to use.")
multip_g.add_arg("training_script", str, None, "the program/script to be lauched "
"in parallel followed by all the arguments", positional_arg=True)
multip_g.add_arg("training_script_args", str, None,
"training script args", positional_arg=True, nargs=argparse.REMAINDER)
# yapf: enable
log = logging.getLogger()
def start_procs(args):
procs = []
log_fns = []
default_env = os.environ.copy()
node_id = args.node_id
node_ips = [x.strip() for x in args.node_ips.split(',')]
current_ip = args.current_node_ip
if args.current_node_ip is None:
assert len(node_ips) == 1
current_ip = node_ips[0]
log.info(current_ip)
num_nodes = len(node_ips)
selected_gpus = [x.strip() for x in args.selected_gpus.split(',')]
selected_gpu_num = len(selected_gpus)
all_trainer_endpoints = ""
for ip in node_ips:
for i in range(args.nproc_per_node):
if all_trainer_endpoints != "":
all_trainer_endpoints += ","
all_trainer_endpoints += "%s:617%d" % (ip, i)
nranks = num_nodes * args.nproc_per_node
# split the selected gpus evenly over the processes on this node
gpus_per_proc = selected_gpu_num % args.nproc_per_node
if gpus_per_proc == 0:
gpus_per_proc = selected_gpu_num // args.nproc_per_node
else:
gpus_per_proc = selected_gpu_num // args.nproc_per_node + 1
log.info(gpus_per_proc)
selected_gpus_per_proc = [selected_gpus[i:i + gpus_per_proc] for i in range(0, len(selected_gpus), gpus_per_proc)]
if args.print_config:
log.info("all_trainer_endpoints: %s"
", node_id: %s"
", current_ip: %s"
", num_nodes: %s"
", node_ips: %s"
", gpus_per_proc: %s"
", selected_gpus_per_proc: %s"
", nranks: %s" % (
all_trainer_endpoints,
node_id,
current_ip,
num_nodes,
node_ips,
gpus_per_proc,
selected_gpus_per_proc,
nranks))
current_env = copy.copy(default_env)
procs = []
cmds = []
log_fns = []
for i in range(0, args.nproc_per_node):
trainer_id = node_id * args.nproc_per_node + i
current_env.update({
"FLAGS_selected_gpus": "%s" % ",".join([str(s) for s in selected_gpus_per_proc[i]]),
"PADDLE_TRAINER_ID" : "%d" % trainer_id,
"PADDLE_CURRENT_ENDPOINT": "%s:617%d" % (current_ip, i),
"PADDLE_TRAINERS_NUM": "%d" % nranks,
"PADDLE_TRAINER_ENDPOINTS": all_trainer_endpoints,
"PADDLE_NODES_NUM": "%d" % num_nodes
})
try:
idx = args.training_script_args.index('--is_distributed')
args.training_script_args[idx + 1] = 'true'
except ValueError:
args.training_script_args += ['--is_distributed', 'true']
cmd = [sys.executable, "-u",
args.training_script] + args.training_script_args
cmds.append(cmd)
if args.split_log_path:
logdir = "%s/%sjob.log.%d" % (args.split_log_path, args.log_prefix, trainer_id)
try:
os.mkdir(os.path.dirname(logdir))
except OSError:
pass
fn = open(logdir, "a")
log_fns.append(fn)
process = subprocess.Popen(cmd, env=current_env, stdout=fn, stderr=fn)
log.info('subprocess launched, check log at %s' % logdir)
else:
process = subprocess.Popen(cmd, env=current_env)
log.info('subprocess launched')
procs.append(process)
try:
for i in range(len(procs)):
proc = procs[i]
proc.wait()
if len(log_fns) > 0:
log_fns[i].close()
if proc.returncode != 0:
raise subprocess.CalledProcessError(returncode=procs[i].returncode,
cmd=cmds[i])
else:
log.info("proc %d finsh" % i)
except KeyboardInterrupt as e:
for p in procs:
log.info('killing %s' % p)
p.terminate()
def main(args):
if args.print_config:
print_arguments(args)
start_procs(args)
if __name__ == "__main__":
prepare_logger(log)
launch_args = parser.parse_args()
pretraining_args = worker_parser.parse_args(
launch_args.training_script_args)
init_path = pretraining_args.init_checkpoint
if init_path and not pretraining_args.use_fp16:
# strip the fp16 master-weight suffix from checkpoint files before fp32 init
os.system('rename .master "" ' + init_path + '/*.master')
main(launch_args)
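For reference, the per-worker environment assembled by start_procs can be sketched as a pure function (illustrative only; the 617x port scheme and variable names are taken from the launcher code above):

```python
def worker_envs(node_ips, nproc_per_node, node_id):
    """Reconstruct the env dicts the launcher exports to each subprocess."""
    endpoints = ",".join("%s:617%d" % (ip, i)
                         for ip in node_ips
                         for i in range(nproc_per_node))
    nranks = len(node_ips) * nproc_per_node
    current_ip = node_ips[node_id]
    return [{
        "PADDLE_TRAINER_ID": str(node_id * nproc_per_node + i),
        "PADDLE_CURRENT_ENDPOINT": "%s:617%d" % (current_ip, i),
        "PADDLE_TRAINERS_NUM": str(nranks),
        "PADDLE_TRAINER_ENDPOINTS": endpoints,
        "PADDLE_NODES_NUM": str(len(node_ips)),
    } for i in range(nproc_per_node)]

# two nodes x 8 procs -> 16 ranks; node 0 owns trainer ids 0..7
envs = worker_envs(["10.1.1.1", "10.1.1.2"], 8, 0)
```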
@@ -11,9 +11,11 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import numpy as np
@@ -36,8 +38,10 @@ class ErnieDataReader(object):
filelist,
vocab_path,
batch_size=4096,
in_tokens=True,
max_seq_len=512,
shuffle_files=True,
random_seed=1,
epoch=100,
voc_size=0,
is_test=False,
@@ -46,6 +50,8 @@ class ErnieDataReader(object):
self.vocab = self.load_vocab(vocab_path)
self.filelist = filelist
self.batch_size = batch_size
self.in_tokens = in_tokens
self.random_seed = random_seed
self.shuffle_files = shuffle_files
self.epoch = epoch
self.current_epoch = 0
@@ -60,13 +66,43 @@ class ErnieDataReader(object):
self.mask_id = self.vocab["[MASK]"]
self.is_test = is_test
self.generate_neg_sample = generate_neg_sample
self.trainer_id = 0
self.trainer_nums = 1
self.files = open(filelist).readlines()
self.total_file = len(self.files)
if self.is_test:
self.epoch = 1
self.shuffle_files = False
self.global_rng = np.random.RandomState(random_seed)
if self.shuffle_files:
if os.getenv("PADDLE_TRAINER_ID"):
self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID"))
if os.getenv("PADDLE_NODES_NUM"):
self.trainer_nums = int(os.getenv("PADDLE_TRAINERS_NUM"))
# renew total_file
self.total_file = len(self.files) // self.trainer_nums * self.trainer_nums
if len(self.files) < self.trainer_nums:
raise RuntimeError('not enough train files to shard, file:%d num_trainer:%d' % (len(self.files), self.trainer_nums))
tmp_files = []
for each in range(epoch):
each_files = self.files
self.global_rng.shuffle(each_files)
tmp_files += each_files
self.files = tmp_files
# renew epochs: schedule length divided by files consumed per epoch
self.epoch = len(self.files) // self.total_file
assert self.total_file > 0, \
"[Error] data_dir is empty or has fewer than %d files" % self.trainer_nums
if self.in_tokens:
assert self.batch_size > 100, "when in_tokens is True, batch_size is a total \
token count and should not be set too small."
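The shuffling-and-sharding scheme above reduces to: truncate the file list to a multiple of the trainer count, build a per-epoch schedule with a seed shared by all trainers, then let each trainer take every trainer_nums-th file. A minimal sketch (function name illustrative, not repo code):

```python
import numpy as np

def shard_files(files, epoch, trainer_id, trainer_nums, seed=1):
    rng = np.random.RandomState(seed)   # same seed on every trainer
    total_file = len(files) // trainer_nums * trainer_nums
    schedule = []
    for _ in range(epoch):
        shuffled = list(files)
        rng.shuffle(shuffled)
        schedule += shuffled
    # slice one epoch of the schedule, then stride by trainer id
    return [f for i, f in enumerate(schedule[:total_file])
            if i % trainer_nums == trainer_id]

print(shard_files(['a.gz', 'b.gz', 'c.gz', 'd.gz'], 2, trainer_id=0, trainer_nums=2))
```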
def get_progress(self):
"""return current progress of training data
"""
@@ -75,13 +111,16 @@ class ErnieDataReader(object):
def parse_line(self, line, max_seq_len=512):
""" parse one line to token_ids, sentence_ids, pos_ids, label
"""
line = line.strip().split(";")
assert len(line) == 5, \
"One sample must have %d fields!" % 5
(token_ids, sent_ids, pos_ids, seg_labels, label) = line
token_ids = [int(token) for token in token_ids.split(" ")]
sent_ids = [int(token) for token in sent_ids.split(" ")]
pos_ids = [int(token) for token in pos_ids.split(" ")]
seg_labels = [int(seg_label) for seg_label in seg_labels.split(" ")]
assert len(token_ids) == len(sent_ids) == len(pos_ids) == len(
seg_labels
), "[Must be true]len(token_ids) == len(sent_ids) == len(pos_ids) == len(seg_labels)"
@@ -94,6 +133,7 @@ class ErnieDataReader(object):
assert file.endswith('.gz'), "[ERROR] %s is not a gzip file" % file
with gzip.open(file, "rb") as f:
for line in f:
line = line.decode('utf8')
parsed_line = self.parse_line(
line, max_seq_len=self.max_seq_len)
if parsed_line is None:
@@ -233,34 +273,62 @@ class ErnieDataReader(object):
(num_total_miss, pos_sample_num * 2,
num_total_miss / (pos_sample_num * 2)))
def shuffle_samples(self, sample_generator, buffer=1000):
samples = []
try:
while True:
while len(samples) < buffer:
sample = next(sample_generator)
samples.append(sample)
np.random.shuffle(samples)
for sample in samples:
yield sample
samples = []
except StopIteration:
print("stopiteration: reach end of file")
if len(samples) == 0:
yield None
else:
np.random.shuffle(samples)
for sample in samples:
yield sample
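shuffle_samples above is a standard buffered shuffle: fill a fixed-size window, shuffle it, drain it, repeat, so memory stays bounded at roughly the buffer size. In miniature (a hedged equivalent, without the end-of-stream None sentinel the method yields):

```python
import random

def buffered_shuffle(gen, buffer=1000, seed=0):
    rng = random.Random(seed)
    buf = []
    for sample in gen:
        buf.append(sample)
        if len(buf) == buffer:
            rng.shuffle(buf)
            for s in buf:
                yield s
            buf = []
    rng.shuffle(buf)        # flush the partial tail
    for s in buf:
        yield s

print(list(buffered_shuffle(iter(range(8)), buffer=4)))
```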
def data_generator(self):
"""
data_generator
"""
def wrapper():
def reader():
for epoch in range(self.epoch):
self.current_epoch = epoch + 1
files = self.files
# during training, data are sliced by trainers
if self.shuffle_files:
start = epoch * self.total_file
end = start + self.total_file
files = [file_ for index, file_ in enumerate(self.files[start:end]) \
if index % self.trainer_nums == self.trainer_id]
for index, file_ in enumerate(files):
file_, mask_word_prob = file_.strip().split("\t")
mask_word = (np.random.random() < float(mask_word_prob))
self.current_file_index = (index + 1) * self.trainer_nums
self.current_file = file_
if mask_word:
self.mask_type = "mask_word"
else:
self.mask_type = "mask_char"
sample_generator = self.read_file(file_)
if not self.is_test:
if self.generate_neg_sample:
sample_generator = self.mixin_negtive_samples(
sample_generator)
else:
# shuffle buffered samples
sample_generator = self.shuffle_samples(
sample_generator)
for sample in sample_generator:
if sample is None:
continue
@@ -272,7 +340,11 @@ class ErnieDataReader(object):
for parsed_line in reader():
token_ids, sent_ids, pos_ids, label, seg_labels, mask_word = parsed_line
max_len = max(max_len, len(token_ids))
if self.in_tokens:
to_append = (len(batch) + 1) * max_len <= batch_size
else:
to_append = len(batch) < batch_size
if to_append:
batch.append(parsed_line)
total_token_num += len(token_ids)
else:
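With in_tokens enabled, batch_size is a token budget: an example joins the batch only if padding everything to the running max length still fits. A self-contained illustration of that test:

```python
def batch_by_tokens(seq_lens, batch_size_tokens):
    """Group sequence lengths so padded batch size stays under the budget."""
    batches, batch, max_len = [], [], 0
    for n in seq_lens:
        max_len = max(max_len, n)
        # cost after padding = (examples + 1) * longest sequence so far
        if (len(batch) + 1) * max_len <= batch_size_tokens:
            batch.append(n)
        else:
            batches.append(batch)
            batch, max_len = [n], n
    if batch:
        batches.append(batch)
    return batches

print(batch_by_tokens([5, 7, 3, 12, 4], batch_size_tokens=20))
# [[5, 7], [3], [12], [4]]
```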
@@ -11,18 +11,41 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import sys
import os
import json
import random
import logging
import numpy as np
import six
from io import open
from collections import namedtuple
import tokenization
from batching import pad_batch_data
log = logging.getLogger(__name__)
if six.PY3:
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
def csv_reader(fd, delimiter='\t'):
def gen():
for i in fd:
yield i.rstrip('\n').split(delimiter)
return gen()
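The hand-rolled csv_reader replaces csv.reader so that quote characters inside text fields are never interpreted, and every tab is a field boundary. A quick illustration of the behavioral difference (assuming Python 3, as the six.PY3 block above suggests):

```python
import csv
import io

line = '"a\tb"\tlabel'
# csv.reader honors the quote character and swallows the quoted tab:
print(next(csv.reader(io.StringIO(line), delimiter='\t')))   # ['a\tb', 'label']
# csv_reader-style splitting treats every tab as a field boundary:
print(line.rstrip('\n').split('\t'))                         # ['"a', 'b"', 'label']
```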
class BaseReader(object):
def __init__(self,
vocab_path,
@@ -58,7 +81,7 @@ class BaseReader(object):
self.num_examples = 0
if label_map_config:
with open(label_map_config, encoding='utf8') as f:
self.label_map = json.load(f)
else:
self.label_map = None
@@ -69,8 +92,8 @@ class BaseReader(object):
def _read_tsv(self, input_file, quotechar=None):
"""Reads a tab separated value file."""
with open(input_file, 'r', encoding='utf8') as f:
reader = csv_reader(f)
headers = next(reader)
Example = namedtuple('Example', headers)
@@ -242,15 +265,21 @@ class BaseReader(object):
for batch in all_dev_batches:
yield batch
all_dev_batches = []
def f():
try:
for i in wrapper():
yield i
except Exception as e:
import traceback
traceback.print_exc()
return f
class ClassifyReader(BaseReader):
def _read_tsv(self, input_file, quotechar=None):
"""Reads a tab separated value file."""
with open(input_file, 'r', encoding='utf8') as f:
reader = csv_reader(f)
headers = next(reader)
text_indices = [
index for index, h in enumerate(headers) if h != "label"
@@ -472,7 +501,7 @@ class MRCReader(BaseReader):
def _read_json(self, input_file, is_training):
examples = []
with open(input_file, "r", encoding='utf8') as f:
input_data = json.load(f)["data"]
for entry in input_data:
for paragraph in entry["paragraphs"]:
@@ -507,7 +536,7 @@ class MRCReader(BaseReader):
actual_text = " ".join(doc_tokens[start_pos:(end_pos
+ 1)])
if actual_text.find(orig_answer_text) == -1:
log.info("Could not find answer: '%s' vs. '%s'",
actual_text, orig_answer_text)
continue
else:
@@ -16,9 +16,12 @@
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import logging
import multiprocessing
# NOTE(paddle-dev): All of these flags should be
@@ -32,12 +35,13 @@ import reader.task_reader as task_reader
from model.ernie import ErnieConfig
from finetune.classifier import create_model, evaluate, predict
from optimization import optimization
from utils.args import print_arguments, check_cuda, prepare_logger
from utils.init import init_pretraining_params, init_checkpoint
from utils.cards import get_cards
from finetune_args import parser
args = parser.parse_args()
log = logging.getLogger()
def main(args):
@@ -45,8 +49,9 @@ def main(args):
ernie_config.print_config()
if args.use_cuda:
dev_list = fluid.cuda_places()
place = dev_list[0]
dev_count = len(dev_list)
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
@@ -75,8 +80,6 @@ def main(args):
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
train_data_generator = reader.data_generator(
input_file=args.train_set,
@@ -89,16 +92,18 @@ def main(args):
num_train_examples = reader.get_num_examples(args.train_set)
if args.in_tokens:
if args.batch_size < args.max_seq_len:
raise ValueError('if in_tokens=True, batch_size should be greater than max_seq_len, got batch_size:%d seqlen:%d' % (args.batch_size, args.max_seq_len))
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // dev_count
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d" % dev_count) log.info("Device count: %d" % dev_count)
print("Num train examples: %d" % num_train_examples) log.info("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps) log.info("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps) log.info("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program() train_program = fluid.Program()
if args.random_seed is not None and args.enable_ce: if args.random_seed is not None and args.enable_ce:
@@ -121,7 +126,13 @@ def main(args):
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
init_loss_scaling=args.init_loss_scaling,
incr_every_n_steps=args.incr_every_n_steps,
decr_every_n_nan_or_inf=args.decr_every_n_nan_or_inf,
incr_ratio=args.incr_ratio,
decr_ratio=args.decr_ratio)
if args.verbose:
if args.in_tokens:
@@ -131,7 +142,7 @@ def main(args):
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size)
log.info("Theoretical memory usage in training: %.3f - %.3f %s" %
(lower_mem, upper_mem, unit))
if args.do_val or args.do_test:
@@ -148,11 +159,36 @@ def main(args):
test_prog = test_prog.clone(for_test=True)
nccl2_num_trainers = 1
nccl2_trainer_id = 0
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
log.info("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program if args.do_train else test_prog,
startup_program=startup_prog)
nccl2_num_trainers = trainers_num
nccl2_trainer_id = trainer_id
exe = fluid.Executor(place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
log.warning(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
@@ -192,7 +228,7 @@ def main(args):
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
train_pyreader.set_batch_generator(train_data_generator)
else:
train_exe = None
@@ -236,14 +272,14 @@ def main(args):
verbose += "learning rate: %f" % (
outputs["learning_rate"]
if warmup_steps > 0 else args.learning_rate)
log.info(verbose)
current_example, current_epoch = reader.get_train_progress()
time_end = time.time()
used_time = time_end - time_begin
if args.is_classify:
log.info(
"epoch: %d, progress: %d/%d, step: %d, ave loss: %f, "
"ave acc: %f, speed: %f steps/s" %
(current_epoch, current_example, num_train_examples,
@@ -252,7 +288,7 @@ def main(args):
ce_info.append(
[outputs["loss"], outputs["accuracy"], used_time])
if args.is_regression:
log.info(
"epoch: %d, progress: %d/%d, step: %d, ave loss: %f, "
" speed: %f steps/s" %
(current_epoch, current_example, num_train_examples,
@@ -260,6 +296,7 @@ def main(args):
args.skip_steps / used_time))
time_begin = time.time()
if nccl2_trainer_id == 0:
if steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
@@ -295,10 +332,10 @@ def main(args):
ce_acc = ce_info[-2][1]
ce_time = ce_info[-2][2]
except:
log.info("ce info error")
log.info("kpis\ttrain_duration_card%s\t%s" % (card_num, ce_time))
log.info("kpis\ttrain_loss_card%s\t%f" % (card_num, ce_loss))
log.info("kpis\ttrain_acc_card%s\t%f" % (card_num, ce_acc))
# final eval on dev set
if args.do_val:
@@ -312,7 +349,7 @@ def main(args):
# final eval on diagnostic, hack for glue-ax
if args.diagnostic:
test_pyreader.set_batch_generator(
reader.data_generator(
args.diagnostic,
batch_size=args.batch_size,
@@ -320,7 +357,7 @@ def main(args):
dev_count=1,
shuffle=False))
print("Final diagnostic") log.info("Final diagnostic")
qids, preds, probs = predict( qids, preds, probs = predict(
test_exe, test_exe,
test_prog, test_prog,
...@@ -334,22 +371,23 @@ def main(args): ...@@ -334,22 +371,23 @@ def main(args):
for id, s, p in zip(qids, preds, probs): for id, s, p in zip(qids, preds, probs):
f.write('{}\t{}\t{}\n'.format(id, s, p)) f.write('{}\t{}\t{}\n'.format(id, s, p))
print("Done final diagnostic, saving to {}".format( log.info("Done final diagnostic, saving to {}".format(
args.diagnostic_save)) args.diagnostic_save))
def evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars,
epoch, steps):
# evaluate dev set
batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size
for ds in args.dev_set.split(','):
test_pyreader.set_batch_generator(
reader.data_generator(
ds,
batch_size=batch_size,
epoch=1,
dev_count=1,
shuffle=False))
print("validation result of dataset {}:".format(ds)) log.info("validation result of dataset {}:".format(ds))
evaluate_info = evaluate( evaluate_info = evaluate(
exe, exe,
test_prog, test_prog,
@@ -359,7 +397,7 @@ def evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars,
metric=args.metric,
is_classify=args.is_classify,
is_regression=args.is_regression)
log.info(evaluate_info + ', file: {}, epoch: {}, steps: {}'.format(
ds, epoch, steps))
@@ -368,18 +406,19 @@ def predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars,
test_sets = args.test_set.split(',')
save_dirs = args.test_save.split(',')
assert len(test_sets) == len(save_dirs)
batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size
for test_f, save_f in zip(test_sets, save_dirs):
test_pyreader.set_batch_generator(
reader.data_generator(
test_f,
batch_size=batch_size,
epoch=1,
dev_count=1,
shuffle=False))
save_path = save_f + '.' + str(epoch) + '.' + str(steps)
log.info("testing {}, save to {}".format(test_f, save_path))
qids, preds, probs = predict(
exe,
test_prog,
@@ -393,11 +432,16 @@ def predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars,
os.makedirs(save_dir)
with open(save_path, 'w') as f:
if len(qids) == 0:
for s, p in zip(preds, probs):
f.write('{}\t{}\n'.format(s, p))
else:
for id, s, p in zip(qids, preds, probs):
f.write('{}\t{}\t{}\n'.format(id, s, p))
if __name__ == '__main__':
prepare_logger(log)
print_arguments(args)
check_cuda(args.use_cuda)
main(args)
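The is_distributed branch above (and its twins in the MRC and sequence-labeling scripts below) is driven entirely by the environment exported by the multi-process launcher. A small, illustrative reader of that contract (not repo code):

```python
import os

def dist_context():
    """Return (trainer_id, trainer_count, current_endpoint) from launcher env."""
    endpoints = [e for e in os.getenv("PADDLE_TRAINER_ENDPOINTS", "").split(",") if e]
    if not endpoints:
        return 0, 1, None                       # single-process fallback
    trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
    current = os.getenv("PADDLE_CURRENT_ENDPOINT")
    assert current in endpoints, "current endpoint missing from endpoint list"
    return trainer_id, len(endpoints), current
```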
@@ -16,9 +16,11 @@
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import logging
import multiprocessing
# NOTE(paddle-dev): All of these flags should be
@@ -32,11 +34,12 @@ import reader.task_reader as task_reader
from model.ernie import ErnieConfig
from finetune.mrc import create_model, evaluate
from optimization import optimization
from utils.args import print_arguments, prepare_logger
from utils.init import init_pretraining_params, init_checkpoint
from finetune_args import parser
args = parser.parse_args()
log = logging.getLogger()
def main(args):
@@ -44,8 +47,9 @@ def main(args):
ernie_config.print_config()
if args.use_cuda:
dev_list = fluid.cuda_places()
place = dev_list[0]
dev_count = len(dev_list)
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
@@ -70,6 +74,8 @@ def main(args):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
if args.do_test:
assert args.test_save is not None
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
@@ -77,27 +83,30 @@ def main(args):
if args.predict_batch_size is None:
args.predict_batch_size = args.batch_size
if args.do_train:
trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
train_data_generator = reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
dev_count=trainers_num,
shuffle=True,
phase="train")
num_train_examples = reader.get_num_examples("train")
if args.in_tokens:
if args.batch_size < args.max_seq_len:
raise ValueError('if in_tokens=True, batch_size should be greater than max_seq_len, got batch_size:%d seqlen:%d' % (args.batch_size, args.max_seq_len))
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // dev_count
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d" % dev_count) log.info("Device count: %d" % dev_count)
print("Num train examples: %d" % num_train_examples) log.info("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps) log.info("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps) log.info("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program() train_program = fluid.Program()
@@ -108,7 +117,7 @@ def main(args):
pyreader_name='train_reader',
ernie_config=ernie_config,
is_training=True)
scheduled_lr, _ = optimization(
loss=graph_vars["loss"],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
@@ -117,7 +126,13 @@ def main(args):
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
init_loss_scaling=args.init_loss_scaling,
incr_every_n_steps=args.incr_every_n_steps,
decr_every_n_nan_or_inf=args.decr_every_n_nan_or_inf,
incr_ratio=args.incr_ratio,
decr_ratio=args.decr_ratio)
if args.verbose:
if args.in_tokens:
@@ -127,7 +142,7 @@ def main(args):
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size)
log.info("Theoretical memory usage in training: %.3f - %.3f %s" %
(lower_mem, upper_mem, unit))
if args.do_val or args.do_test:
@@ -144,11 +159,36 @@ def main(args):
nccl2_num_trainers = 1
nccl2_trainer_id = 0
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
log.info("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program if args.do_train else test_prog,
startup_program=startup_prog)
nccl2_num_trainers = trainers_num
nccl2_trainer_id = trainer_id
exe = fluid.Executor(place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
log.warning(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
@@ -188,7 +228,7 @@ def main(args):
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
train_pyreader.set_batch_generator(train_data_generator)
else:
train_exe = None
@@ -214,12 +254,12 @@ def main(args):
verbose += "learning rate: %f" % (
outputs["learning_rate"]
if warmup_steps > 0 else args.learning_rate)
log.info(verbose)
current_example, current_epoch = reader.get_train_progress()
time_end = time.time()
used_time = time_end - time_begin
log.info("epoch: %d, progress: %d/%d, step: %d, ave loss: %f, "
"speed: %f steps/s" %
(current_epoch, current_example, num_train_examples,
steps, outputs["loss"], args.skip_steps / used_time))
@@ -232,7 +272,7 @@ def main(args):
if steps % args.validation_steps == 0:
if args.do_val:
test_pyreader.set_batch_generator(
reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
@@ -251,7 +291,7 @@ def main(args):
args=args)
if args.do_test:
test_pyreader.set_batch_generator(
reader.data_generator(
args.test_set,
batch_size=args.batch_size,
@@ -277,8 +317,8 @@ def main(args):
# final eval on dev set
if args.do_val:
log.info("Final validation result:")
test_pyreader.set_batch_generator(
reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
@@ -298,8 +338,8 @@ def main(args):
# final eval on test set
if args.do_test:
log.info("Final test result:")
test_pyreader.set_batch_generator(
reader.data_generator(
args.test_set,
batch_size=args.batch_size,
@@ -319,7 +359,8 @@ def main(args):
if __name__ == '__main__':
prepare_logger(log)
print_arguments(args)
scope = fluid.core.Scope()
with fluid.scope_guard(scope):
main(args)
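All three fine-tuning entry points size their schedule the same way; factored out, the max_train_steps arithmetic above is:

```python
def compute_max_train_steps(epoch, num_examples, batch_size, max_seq_len,
                            dev_count, in_tokens):
    """Mirror of the max_train_steps computation in the run_*.py scripts."""
    if in_tokens:
        # batch_size is a token budget; examples per batch is roughly
        # batch_size // max_seq_len in the worst (fully padded) case
        return epoch * num_examples // (batch_size // max_seq_len) // dev_count
    return epoch * num_examples // batch_size // dev_count

# e.g. 3 epochs, 100k examples, 8192-token batches, seq len 512, 8 devices
print(compute_max_train_steps(3, 100000, 8192, 512, 8, True))  # 2343
```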
@@ -16,10 +16,15 @@
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import six
import logging
import multiprocessing
from io import open
# NOTE(paddle-dev): All of these flags should be
# set before `import paddle`. Otherwise, it would
@@ -32,11 +37,12 @@ import reader.task_reader as task_reader
from model.ernie import ErnieConfig
from optimization import optimization
from utils.init import init_pretraining_params, init_checkpoint
from utils.args import print_arguments, check_cuda, prepare_logger
from finetune.sequence_label import create_model, evaluate, predict, calculate_f1
from finetune_args import parser
args = parser.parse_args()
log = logging.getLogger()
def main(args):
@@ -44,12 +50,12 @@ def main(args):
ernie_config.print_config()
if args.use_cuda:
dev_list = fluid.cuda_places()
place = dev_list[0]
dev_count = len(dev_list)
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
reader = task_reader.SequenceLabelReader(
vocab_path=args.vocab_path,
@@ -79,16 +85,19 @@ def main(args):
num_train_examples = reader.get_num_examples(args.train_set)
if args.in_tokens:
if args.batch_size < args.max_seq_len:
raise ValueError('if in_tokens=True, batch_size should be greater than max_seq_len, got batch_size:%d seqlen:%d' % (args.batch_size, args.max_seq_len))
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // dev_count
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d" % dev_count) log.info("Device count: %d" % dev_count)
print("Num train examples: %d" % num_train_examples) log.info("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps) log.info("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps) log.info("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program() train_program = fluid.Program()
@@ -107,7 +116,13 @@ def main(args):
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
init_loss_scaling=args.init_loss_scaling,
incr_every_n_steps=args.incr_every_n_steps,
decr_every_n_nan_or_inf=args.decr_every_n_nan_or_inf,
incr_ratio=args.incr_ratio,
decr_ratio=args.decr_ratio)
if args.verbose:
if args.in_tokens:
@@ -117,7 +132,7 @@ def main(args):
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size)
log.info("Theoretical memory usage in training: %.3f - %.3f %s" %
(lower_mem, upper_mem, unit))
if args.do_val or args.do_test:
@@ -131,11 +146,38 @@ def main(args):
test_prog = test_prog.clone(for_test=True)
nccl2_num_trainers = 1
nccl2_trainer_id = 0
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
log.info("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program if args.do_train else test_prog,
startup_program=startup_prog)
nccl2_num_trainers = trainers_num
nccl2_trainer_id = trainer_id
exe = fluid.Executor(place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
log.info(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
@@ -171,9 +213,11 @@ def main(args):
use_cuda=args.use_cuda,
loss_name=graph_vars["loss"].name,
exec_strategy=exec_strategy,
main_program=train_program,
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
train_pyreader.set_batch_generator(train_data_generator)
else:
train_exe = None
@@ -186,7 +230,6 @@ def main(args):
if args.do_train:
train_pyreader.start()
steps = 0
graph_vars["learning_rate"] = scheduled_lr
time_begin = time.time()
@@ -196,54 +239,47 @@ def main(args):
if steps % args.skip_steps != 0:
train_exe.run(fetch_list=[])
else:
fetch_list = [
graph_vars["num_infer"].name, graph_vars["num_label"].name,
graph_vars["num_correct"].name,
graph_vars["loss"].name,
graph_vars['learning_rate'].name,
]
out = train_exe.run(fetch_list=fetch_list)
num_infer, num_label, num_correct, np_loss, np_lr = out
lr = float(np_lr[0])
loss = np_loss.mean()
precision, recall, f1 = calculate_f1(num_label, num_infer, num_correct)
if args.verbose:
log.info("train pyreader queue size: %d, learning rate: %f" % (train_pyreader.queue.size(),
lr if warmup_steps > 0 else args.learning_rate))
current_example, current_epoch = reader.get_train_progress()
time_end = time.time()
used_time = time_end - time_begin
print("epoch: %d, progress: %d/%d, step: %d, loss: %f, " log.info("epoch: %d, progress: %d/%d, step: %d, loss: %f, "
"f1: %f, precision: %f, recall: %f, speed: %f steps/s" "f1: %f, precision: %f, recall: %f, speed: %f steps/s"
% (current_epoch, current_example, num_train_examples, % (current_epoch, current_example, num_train_examples,
steps, outputs["loss"], outputs["f1"], steps, loss, f1, precision, recall,
outputs["precision"], outputs["recall"],
args.skip_steps / used_time)) args.skip_steps / used_time))
time_begin = time.time() time_begin = time.time()
if nccl2_trainer_id == 0 and steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if nccl2_trainer_id == 0 and steps % args.validation_steps == 0:
# evaluate dev set
if args.do_val:
evaluate_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
current_epoch, steps)
# evaluate test set
if args.do_test:
predict_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
current_epoch, steps)
except fluid.core.EOFException:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
@@ -252,31 +288,71 @@ def main(args):
break
# final eval on dev set
if nccl2_trainer_id == 0 and args.do_val:
if not args.do_train:
current_example, current_epoch = reader.get_train_progress()
evaluate_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
current_epoch, 'final')
if nccl2_trainer_id == 0 and args.do_test:
if not args.do_train:
current_example, current_epoch = reader.get_train_progress()
predict_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
current_epoch, 'final')
def evaluate_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
epoch, steps):
# evaluate dev set
batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size
for ds in args.dev_set.split(','):  # single card eval
test_pyreader.set_batch_generator(
reader.data_generator(
ds,
batch_size=batch_size,
epoch=1,
dev_count=1,
shuffle=False))
log.info("validation result of dataset {}:".format(ds))
info = evaluate(exe, test_prog, test_pyreader, graph_vars,
args.num_labels)
log.info(info + ', file: {}, epoch: {}, steps: {}'.format(
ds, epoch, steps))
def predict_wrapper(reader, exe, test_prog, test_pyreader, graph_vars,
epoch, steps):
test_sets = args.test_set.split(',')
save_dirs = args.test_save.split(',')
assert len(test_sets) == len(save_dirs), 'number of test_sets & test_save not match, got %d vs %d' % (len(test_sets), len(save_dirs))
batch_size = args.batch_size if args.predict_batch_size is None else args.predict_batch_size
for test_f, save_f in zip(test_sets, save_dirs):
test_pyreader.set_batch_generator(reader.data_generator(
test_f,
batch_size=batch_size,
epoch=1,
dev_count=1,
shuffle=False))
save_path = save_f + '.' + str(epoch) + '.' + str(steps)
log.info("testing {}, save to {}".format(test_f, save_path))
res = predict(exe, test_prog, test_pyreader, graph_vars, dev_count=1)
save_dir = os.path.dirname(save_path)
if not os.path.exists(save_dir):
os.makedirs(save_dir)
tokenizer = reader.tokenizer
rev_label_map = {v: k for k, v in six.iteritems(reader.label_map)}
with open(save_path, 'w', encoding='utf8') as f:
for id, s, p in res:
id = ' '.join(tokenizer.convert_ids_to_tokens(id))
p = ' '.join(['%.5f' % pp[ss] for ss, pp in zip(s, p)])
s = ' '.join([rev_label_map[ss] for ss in s])
f.write('{}\t{}\t{}\n'.format(id, s, p))
if __name__ == '__main__':
prepare_logger(log)
print_arguments(args)
check_cuda(args.use_cuda)
main(args)
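calculate_f1 is imported from finetune.sequence_label above; assuming the usual chunk-level definition over the three fetched counters, it reduces to (function name here is illustrative):

```python
def f1_from_counts(num_label, num_infer, num_correct):
    """precision = correct/predicted chunks, recall = correct/gold chunks."""
    precision = num_correct / num_infer if num_infer else 0.0
    recall = num_correct / num_label if num_label else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

print(f1_from_counts(num_label=10, num_infer=8, num_correct=6))
# (0.75, 0.6, 0.666...)
```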
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import sys
import os
import argparse
from propeller.service.client import InferenceClient
from propeller import log
import six
import utils.data
from time import time
import numpy as np
class ErnieClient(InferenceClient):
def __init__(self,
vocab_file,
host='localhost',
port=8888,
batch_size=32,
num_coroutine=1,
timeout=10.,
max_seqlen=128):
host_port = 'tcp://%s:%d' % (host, port)
super(ErnieClient, self).__init__(host_port, batch_size=batch_size, num_coroutine=num_coroutine, timeout=timeout)
self.vocab = {j.strip().split(b'\t')[0].decode('utf8'): i for i, j in enumerate(open(vocab_file, 'rb'))}
self.tokenizer = utils.data.CharTokenizer(self.vocab.keys())
self.max_seqlen = max_seqlen
self.cls_id = self.vocab['[CLS]']
self.sep_id = self.vocab['[SEP]']
def txt_2_id(self, text):
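# map raw text to vocabulary ids with the char tokenizer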
ids = np.array([self.vocab[i] for i in self.tokenizer(text)])
return ids
def pad_and_batch(self, ids):
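# zero-pad each id sequence to the batch max length, stack into one array,
# then append a trailing unit axis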
max_len = max(map(len, ids))
padded = np.stack([np.pad(i, [[0, max_len - len(i)]], mode='constant') for i in ids])
padded = np.expand_dims(padded, axis=-1)
return padded
def __call__(self, text_a, text_b=None):
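# encode a batch of single sentences (text_a) or sentence pairs
# (text_a, text_b) and return the embeddings produced by the server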
if text_b is not None and len(text_a) != len(text_b):
raise ValueError('text_b has %d entries but text_a has %d' % (len(text_b), len(text_a)))
text_a = [i.encode('utf8') if isinstance(i, six.string_types) else i for i in text_a]
if text_b is not None:
text_b = [i.encode('utf8') if isinstance(i, six.string_types) else i for i in text_b]
ids_a = map(self.txt_2_id, text_a)
if text_b is not None:
ids_b = map(self.txt_2_id, text_b)
ret = [utils.data.build_2_pair(a, b, self.max_seqlen, self.cls_id, self.sep_id) for a, b in zip(ids_a, ids_b)]
else:
ret = [utils.data.build_1_pair(a, self.max_seqlen, self.cls_id, self.sep_id) for a in ids_a]
sen_ids, token_type_ids = zip(*ret)
sen_ids = self.pad_and_batch(sen_ids)
token_type_ids = self.pad_and_batch(token_type_ids)
ret, = super(ErnieClient, self).__call__(sen_ids, token_type_ids)
return ret
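# Usage sketch (assumes a propeller InferenceServer is already serving an
# ERNIE inference_model on localhost:8888 and that vocab.txt matches it):
#   client = ErnieClient('vocab.txt', host='localhost', port=8888)
#   emb = client([u'text to encode'])               # single sentences
#   emb = client([u'query text'], [u'title text'])  # sentence pairs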
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='ernie_encoder_client')
parser.add_argument('--host', type=str, default='localhost')
parser.add_argument('-i', '--input', type=str, required=True)
parser.add_argument('-o', '--output', type=str, required=True)
parser.add_argument('-p', '--port', type=int, default=8888)
parser.add_argument('--batch_size', type=int, default=32)
parser.add_argument('--num_coroutine', type=int, default=1)
parser.add_argument('--vocab', type=str, required=True)
args = parser.parse_args()
client = ErnieClient(args.vocab, args.host, args.port, batch_size=args.batch_size, num_coroutine=args.num_coroutine)
inputs = [i.strip().split(b'\t') for i in open(args.input, 'rb').readlines()]
if len(inputs) == 0:
raise ValueError('empty input')
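# dispatch requests in chunks of num_coroutine * batch_size so that all
# client coroutines stay busy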
send_batch = args.num_coroutine * args.batch_size
send_num = len(inputs) // send_batch + 1
rets = []
start = time()
for i in range(send_num):
slice = inputs[i * send_batch: (i + 1) * send_batch]
if len(slice) == 0:
continue
columns = list(zip(*slice))
if len(columns) > 2:
raise ValueError('inputs file has more than 2 columns')
ret = client(*columns)
if len(ret.shape) == 3:
ret = ret[:, 0, :] # take cls
rets.append(ret)
end = time()
with open(args.output, 'wb') as outf:
arr = np.concatenate(rets, 0)
np.save(outf, arr)
log.info('query num: %d average latency %.5f' % (len(inputs), (end - start)/len(inputs)))
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import sys
import os
import argparse
import logging
import logging.handlers
import re
from propeller.service.server import InferenceServer
from propeller import log
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('-m', '--model_dir', type=str, required=True)
parser.add_argument('-p', '--port', type=int, default=8888)
parser.add_argument('-v', '--verbose', action='store_true')
parser.add_argument('--encode_layer', type=str, choices=[
'pooler',
'layer12',
'layer11',
'layer10',
'layer9',
'layer8',
'layer7',
'layer6',
'layer5',
'layer4',
'layer3',
'layer2',
'layer1',
], default='pooler')
args = parser.parse_args()
if args.verbose:
log.setLevel(logging.DEBUG)
cuda_env = os.getenv("CUDA_VISIBLE_DEVICES")
if cuda_env is None:
raise RuntimeError('CUDA_VISIBLE_DEVICES not set')
if not os.path.exists(args.model_dir):
raise ValueError('model_dir not found: %s' % args.model_dir)
n_devices = len(cuda_env.split(","))
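# pick the sub-directory of model_dir that matches the requested output:
# 'pooler', or 'enc<N>' for transformer layer N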
if args.encode_layer.lower() == 'pooler':
model_dir = os.path.join(args.model_dir, 'pooler')
else:
pat = re.compile(r'layer(\d+)')
match = pat.match(args.encode_layer.lower())
layer = int(match.group(1))
model_dir = os.path.join(args.model_dir, 'enc%d' % layer)
server = InferenceServer(model_dir, n_devices)
log.info('propeller server listening on port %d' % args.port)
server.listen(args.port)
@@ -17,6 +17,10 @@

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

from io import open

import collections
import unicodedata

@@ -69,7 +73,7 @@ def printable_text(text):

def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    with open(vocab_file, encoding='utf8') as fin:
        for num, line in enumerate(fin):
            items = convert_to_unicode(line.strip()).split("\t")
            if len(items) > 2:
...
@@ -12,14 +12,16 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""ERNIE pretraining."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import time
import multiprocessing
import logging

import numpy as np
import paddle.fluid as fluid

@@ -27,31 +29,29 @@ import paddle.fluid as fluid
from reader.pretraining import ErnieDataReader
from model.ernie_v1 import ErnieModel, ErnieConfig
from optimization import optimization
from utils.args import print_arguments, check_cuda, prepare_logger
from utils.init import init_checkpoint, init_pretraining_params
from pretrain_args import parser

log = logging.getLogger()

args = parser.parse_args()
# yapf: enable.
def create_model(pyreader_name, ernie_config):
    src_ids = fluid.layers.data(name='1', shape=[-1, args.max_seq_len, 1], dtype='int64')
    pos_ids = fluid.layers.data(name='2', shape=[-1, args.max_seq_len, 1], dtype='int64')
    sent_ids = fluid.layers.data(name='3', shape=[-1, args.max_seq_len, 1], dtype='int64')
    input_mask = fluid.layers.data(name='4', shape=[-1, args.max_seq_len, 1], dtype='float32')
    mask_label = fluid.layers.data(name='5', shape=[-1, 1], dtype='int64')
    mask_pos = fluid.layers.data(name='6', shape=[-1, 1], dtype='int64')
    labels = fluid.layers.data(name='r', shape=[-1, 1], dtype='int64')

    pyreader = fluid.io.DataLoader.from_generator(feed_list=[
        src_ids, pos_ids, sent_ids, input_mask, mask_label, mask_pos, labels
    ], capacity=70, iterable=False)

    ernie = ErnieModel(
        src_ids=src_ids,
@@ -65,9 +65,6 @@ def create_model(pyreader_name, ernie_config):
    next_sent_acc, mask_lm_loss, total_loss = ernie.get_pretraining_output(
        mask_label, mask_pos, labels)

    return pyreader, next_sent_acc, mask_lm_loss, total_loss
@@ -97,7 +94,7 @@ def predict_wrapper(args,

    def predict(exe=exe, pyreader=pyreader):
        pyreader.set_batch_generator(data_reader.data_generator())
        pyreader.start()

        cost = 0
@@ -114,7 +111,7 @@ def predict_wrapper(args,
                cost += each_total_cost
                steps += 1
                if args.do_test and steps % args.skip_steps == 0:
                    log.info("[test_set] steps: %d" % steps)

        except fluid.core.EOFException:
            pyreader.reset()
@@ -151,9 +148,9 @@ def test(args):
        pyreader=test_pyreader,
        fetch_list=[next_sent_acc.name, mask_lm_loss.name, total_loss.name])

    log.info("test begin")
    loss, lm_loss, acc, steps, speed = predict()
    log.info(
        "[test_set] loss: %f, global ppl: %f, next_sent_acc: %f, speed: %f steps/s"
        % (np.mean(np.array(loss) / steps),
           np.exp(np.mean(np.array(lm_loss) / steps)),
@@ -161,7 +158,7 @@ def test(args):

def train(args):
    log.info("pretraining start")
    ernie_config = ErnieConfig(args.ernie_config_path)
    ernie_config.print_config()
@@ -171,7 +168,7 @@ def train(args):
        with fluid.unique_name.guard():
            train_pyreader, next_sent_acc, mask_lm_loss, total_loss = create_model(
                pyreader_name='train_reader', ernie_config=ernie_config)
            scheduled_lr, _ = optimization(
                loss=total_loss,
                warmup_steps=args.warmup_steps,
                num_train_steps=args.num_train_steps,
@@ -180,13 +177,14 @@ def train(args):
                startup_prog=startup_prog,
                weight_decay=args.weight_decay,
                scheduler=args.lr_scheduler,
                use_fp16=args.use_fp16,
                use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
                init_loss_scaling=args.init_loss_scaling,
                incr_every_n_steps=args.incr_every_n_steps,
                decr_every_n_nan_or_inf=args.decr_every_n_nan_or_inf,
                incr_ratio=args.incr_ratio,
                decr_ratio=args.decr_ratio)

    test_prog = fluid.Program()
    with fluid.program_guard(test_prog, startup_prog):
@@ -196,31 +194,34 @@ def train(args):
    test_prog = test_prog.clone(for_test=True)
    if len(fluid.cuda_places()) == 0:
        raise RuntimeError('no CUDA device found, check your env setting')

    if args.use_cuda:
        place = fluid.cuda_places()[0]
        dev_count = fluid.core.get_cuda_device_count()
    else:
        place = fluid.CPUPlace()
        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))

    log.info("Device count %d" % dev_count)
    log.info("theoretical memory usage: ")
    log.info(fluid.contrib.memory_usage(
        program=train_program, batch_size=args.batch_size // args.max_seq_len))

    nccl2_num_trainers = 1
    nccl2_trainer_id = 0
    log.info("args.is_distributed: %s" % args.is_distributed)
    if args.is_distributed:
        worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
        worker_endpoints = worker_endpoints_env.split(",")
        trainers_num = len(worker_endpoints)
        current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
        trainer_id = worker_endpoints.index(current_endpoint)
        if trainer_id == 0:
            log.info("train_id == 0, sleep 60s")
            time.sleep(60)
        log.info("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
                      current_endpoint, trainer_id))
@@ -281,7 +282,7 @@ def train(args):
            next_sent_acc.name, mask_lm_loss.name, total_loss.name
        ])

    train_pyreader.set_batch_generator(data_reader.data_generator())
    train_pyreader.start()
    steps = 0
    cost = []
@@ -309,13 +310,13 @@ def train(args):
                lm_cost.extend(each_mask_lm_cost)
                cost.extend(each_total_cost)
                log.info("feed_queue size %d" % train_pyreader.queue.size())
                time_end = time.time()
                used_time = time_end - time_begin
                epoch, current_file_index, total_file, current_file, mask_type = data_reader.get_progress(
                )
                log.info("current learning_rate:%f" % np_lr[0])
                log.info(
                    "epoch: %d, progress: %d/%d, step: %d, loss: %f, "
                    "ppl: %f, next_sent_acc: %f, speed: %f steps/s, file: %s, mask_type: %s"
                    % (epoch, current_file_index, total_file, steps,
@@ -335,7 +336,7 @@ def train(args):
            if args.valid_filelist and steps % args.validation_steps == 0:
                vali_cost, vali_lm_cost, vali_acc, vali_steps, vali_speed = predict(
                )
                log.info("[validation_set] epoch: %d, step: %d, "
                         "loss: %f, global ppl: %f, batch-averaged ppl: %f, "
                         "next_sent_acc: %f, speed: %f steps/s" %
                         (epoch, steps, np.mean(np.array(vali_cost) / vali_steps),
@@ -349,6 +350,7 @@ def train(args):

if __name__ == '__main__':
    prepare_logger(log)
    print_arguments(args)
    check_cuda(args.use_cuda)
    if args.do_test:
...
@@ -12,17 +12,37 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import six
import os
import sys
import argparse
import logging

import paddle.fluid as fluid

log = logging.getLogger(__name__)


def prepare_logger(logger, debug=False, save_to_file=None):
    formatter = logging.Formatter(fmt='[%(levelname)s] %(asctime)s [%(filename)12s:%(lineno)5d]:\t%(message)s')
    console_hdl = logging.StreamHandler()
    console_hdl.setFormatter(formatter)
    logger.addHandler(console_hdl)
    if save_to_file is not None and not os.path.exists(save_to_file):
        file_hdl = logging.FileHandler(save_to_file)
        file_hdl.setFormatter(formatter)
        logger.addHandler(file_hdl)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False


def str2bool(v):
    # argparse does not parse strings like "true"/"False" into python
    # booleans directly, so do it here
@@ -33,10 +53,11 @@ class ArgumentGroup(object):
    def __init__(self, parser, title, des):
        self._group = parser.add_argument_group(title=title, description=des)

    def add_arg(self, name, type, default, help, positional_arg=False, **kwargs):
        prefix = "" if positional_arg else "--"
        type = str2bool if type == bool else type
        self._group.add_argument(
            prefix + name,
            default=default,
            type=type,
            help=help + ' Default: %(default)s.',
@@ -44,10 +65,10 @@ class ArgumentGroup(object):

def print_arguments(args):
    log.info('----------- Configuration Arguments -----------')
    for arg, value in sorted(six.iteritems(vars(args))):
        log.info('%s: %s' % (arg, value))
    log.info('------------------------------------------------')


def check_cuda(use_cuda, err = \
@@ -56,7 +77,7 @@ def check_cuda(use_cuda, err = \
    ):
    try:
        if use_cuda == True and fluid.is_compiled_with_cuda() == False:
            log.error(err)
            sys.exit(1)
    except Exception as e:
        pass
@@ -11,7 +11,11 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
...
# -*- coding: utf-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Evaluation script for CMRC 2018
version: v5
@@ -6,22 +19,25 @@ Note:
v5 formatted output, add usage description
v4 fixed segmentation issues
'''
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from collections import Counter, OrderedDict
import string
import re
import argparse
import json
import sys
import nltk
import pdb


# split Chinese with English
def mixed_segmentation(in_str, rm_punc=False):
    in_str = in_str.lower().strip()
    segs_out = []
    temp_str = ""
    sp_char = [
@@ -32,7 +48,7 @@ def mixed_segmentation(in_str, rm_punc=False):
    for char in in_str:
        if rm_punc and char in sp_char:
            continue
        if re.search(r'[\u4e00-\u9fa5]', char) or char in sp_char:
            if temp_str != "":
                ss = nltk.word_tokenize(temp_str)
                segs_out.extend(ss)
@@ -51,7 +67,7 @@ def mixed_segmentation(in_str, rm_punc=False):

# remove punctuation
def remove_punctuation(in_str):
    in_str = in_str.lower().strip()
    sp_char = [
        '-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', ',', '。', ':',
        '?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、', '「', '」', '(',
@@ -102,7 +118,7 @@ def evaluate(ground_truth_file, prediction_file):
            skip_count += 1
            continue

        prediction = prediction_file[query_id]
        f1 += calc_f1_score(answers, prediction)
        em += calc_em_score(answers, prediction)
@@ -139,8 +155,8 @@ def calc_em_score(answers, prediction):

def eval_file(dataset_file, prediction_file):
    ground_truth_file = json.load(open(dataset_file, 'r'))
    prediction_file = json.load(open(prediction_file, 'r'))
    F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
    AVG = (EM + F1) * 0.5
    return EM, F1, AVG, TOTAL
...
@@ -12,25 +12,35 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import six
import ast
import copy
import logging

import numpy as np
import paddle.fluid as fluid

log = logging.getLogger(__name__)


def cast_fp32_to_fp16(exe, main_program):
    log.info("Cast parameters to float16 data format.")
    for param in main_program.global_block().all_parameters():
        if not param.name.endswith(".master"):
            param_t = fluid.global_scope().find_var(param.name).get_tensor()
            data = np.array(param_t)
            if param.name.startswith("encoder_layer") \
                    and "layer_norm" not in param.name:
                param_t.set(np.float16(data).view(np.uint16), exe.place)

            # load fp32
            master_param_var = fluid.global_scope().find_var(param.name +
                                                             ".master")
            if master_param_var is not None:
@@ -51,7 +61,7 @@ def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False):
        init_checkpoint_path,
        main_program=main_program,
        predicate=existed_persitables)
    log.info("Load model from {}".format(init_checkpoint_path))

    if use_fp16:
        cast_fp32_to_fp16(exe, main_program)
@@ -74,7 +84,7 @@ def init_pretraining_params(exe,
        pretraining_params_path,
        main_program=main_program,
        predicate=existed_params)
    log.info("Load pretraining parameters from {}.".format(
        pretraining_params_path))

    if use_fp16:
...
# ERNIE fast inference (C++)

The ERNIE C++ fast inference API provides a more efficient online prediction option: it can be compiled directly into a production environment for better performance.
It is implemented on top of [fluid inference](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_usage/deploy/inference/native_infer.html).

**Please make sure your fluid inference version is higher than 1.6.3 to get correct prediction results.**

This page provides a demo benchmark for ERNIE C++ fast inference.

## Preparation

The demo data is taken from the test split of the XNLI dataset and is located in ./data. It uses a plain-text id format in which one line represents one batch and contains four fields:

```text
src_ids, pos_ids, sent_ids, self_attn_mask
```

Fields are separated by semicolons (;). Each field consists of a `shape` part and a `data` part separated by a colon (:); values inside `shape` and `data` are separated by spaces. `self_attn_mask` is of type FLOAT32; the remaining fields are INT64.
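To make the layout concrete, here is a minimal Python sketch of a serializer for this format; `serialize_batch` and every value in the toy batch below are invented for illustration and are not taken from the demo data:

```python
import numpy as np

def serialize_batch(fields):
    # one "<shape>:<data>" chunk per field, chunks joined by ";"
    parts = []
    for arr in fields:
        shape = ' '.join(str(d) for d in arr.shape)
        data = ' '.join(str(v) for v in arr.flatten().tolist())
        parts.append(shape + ':' + data)
    return ';'.join(parts)

# a toy batch: one sequence of length 3
src_ids = np.array([[101, 2769, 102]], dtype=np.int64)
pos_ids = np.array([[0, 1, 2]], dtype=np.int64)
sent_ids = np.zeros((1, 3), dtype=np.int64)
self_attn_mask = np.ones((1, 3, 3), dtype=np.float32)  # FLOAT32, as noted above

print(serialize_batch([src_ids, pos_ids, sent_ids, self_attn_mask]))
```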
ERNIE fast inference takes a model in inference\_model format as input; see [here](../README.zh.md#生成inference_model) for how to generate an inference\_model.
**An inference\_model produced by propeller only needs the `src_ids` and `sent_ids` fields, so the data file has to be adapted accordingly.**

## Build and run

To build this demo, the C++ compiler needs to support the C++11 standard.

Download the matching [fluid_inference library](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_usage/deploy/inference/build_and_install_lib_cn.html); pick the build that matches your paddle version and configuration (with or without avx and mkl, cuda/cudnn version) and unpack it. This yields a `fluid_inference` directory; place it in the same directory as `inference.cc`.

Build with:
``` bash
cd ./gpu # cd ./cpu
mkdir build
cd build
cmake ..
make
```
Run with:
```
./run.sh ../data/sample /path/to/inference_mode_dir
```
## Benchmark

Test samples: XNLI test set, with input BatchSize=1, SequenceLength=128.
Each measurement is repeated 5 times and averaged.

| Test environment | Latency (ms) |
| ----- | ----- |
| CPU (Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz, 20 threads) | 29.8818 |
| GPU (P4) | 8.5 |
CMAKE_MINIMUM_REQUIRED(VERSION 3.2)
PROJECT(inference_demo)
SET(CMAKE_C_COMPILER gcc)
SET(CMAKE_CXX_COMPILER g++)
ADD_COMPILE_OPTIONS(-std=c++11 -g)
SET(FLUID_INFER_LIB fluid_inference)
SET(FLUID_INC_PATH ${FLUID_INFER_LIB}/paddle/include)
SET(FLUID_LIB_PATH ${FLUID_INFER_LIB}/paddle/lib)
SET(GLOG_INC_PATH ${FLUID_INFER_LIB}/third_party/install/glog/include)
SET(GLOG_LIB_PATH ${FLUID_INFER_LIB}/third_party/install/glog/lib)
SET(GFLAGS_INC_PATH ${FLUID_INFER_LIB}/third_party/install/gflags/include)
SET(GFLAGS_LIB_PATH ${FLUID_INFER_LIB}/third_party/install/gflags/lib)
SET(MKLDNN_LIB_PATH ${FLUID_INFER_LIB}/third_party/install/mkldnn/lib)
SET(MKLML_LIB_PATH ${FLUID_INFER_LIB}/third_party/install/mklml/lib)
INCLUDE_DIRECTORIES(${FLUID_INC_PATH})
INCLUDE_DIRECTORIES(${GLOG_INC_PATH})
INCLUDE_DIRECTORIES(${GFLAGS_INC_PATH})
LINK_DIRECTORIES(${FLUID_LIB_PATH})
LINK_DIRECTORIES(${GLOG_LIB_PATH})
LINK_DIRECTORIES(${GFLAGS_LIB_PATH})
LINK_DIRECTORIES(${MKLML_LIB_PATH})
LINK_DIRECTORIES(${MKLDNN_LIB_PATH})
ADD_EXECUTABLE(inference inference.cc)
TARGET_LINK_LIBRARIES(inference dl paddle_fluid glog gflags pthread)
set -x
(($# != 2)) && echo "${0} data model" && exit -1
export LD_LIBRARY_PATH=fluid_inference/third_party/install/mkldnn/lib:fluid_inference/third_party/install/mklml/lib:fluid_inference/paddle/lib/:/home/work/cuda-9.0/lib64/:/home/work/cudnn/cudnn_v7_3_1_cuda9.0/lib64/:$LD_LIBRARY_PATH \
./build/inference --logtostderr \
--model_dir $2 \
--data $1 \
--repeat 5 \
--output_prediction true \
--use_gpu true \
--device 0
set -x
(($# != 2)) && echo "${0} data model" && exit -1
export LD_LIBRARY_PATH=fluid_inference/third_party/install/mkldnn/lib:fluid_inference/third_party/install/mklml/lib:fluid_inference/paddle/lib/:/home/work/cuda-9.0/lib64/:/home/work/cudnn/cudnn_v7_3_1_cuda9.0/lib64/:$LD_LIBRARY_PATH
./build/inference --logtostderr \
--model_dir $2 \
--data $1 \
--repeat 5 \
--output_prediction true \
--use_gpu true \
--device 0