Unverified · commit de4063b5 · authored by Meiyim · committed by GitHub

Paddle 2.0 (#604)

* update to paddle 2.

* update readme

* upgrade multi card finetune example

* use paddle.AdamW, use grad acc

* bump propeller

* remove grad acc

* fix ner

* update propeller & distributed sample

* wip

* +seq2seq

* format

* fix ernie-gen

* fix pretrain

* fix static

* update propeller  for py37 compat

* fix pretrain static

* up readme

* update readme

* static pretrain

* move optimization out of core library

* ner use `cross_entropy`, use `ignore index`

* fix dygraph pretrain: add stop criteria

* bugfix, LN wrong initialize

* add grad acc for classification task

* seq2seq use fp32 when decoding

* use `paddle.io.DataLoader`

* + distill

* update readme

* update distill fig link

* propeller use vdl

* do not use pure fp16 for static graph
Co-authored-by: chenxuyi <work@yq01-qianmo-com-255-129-11.yq01.baidu.com>
Parent 881bd978
......@@ -15,4 +15,3 @@ markComment: >
Thank you for your contributions.
# Comment to post when closing a stale issue. Set to `false` to disable
closeComment: false
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
- repo: https://github.com/PaddlePaddle/mirrors-yapf.git
rev: 0d79c0c469bab64f7229c9aca2b1186ef47f0e37
hooks:
- id: yapf
files: (.*\.(py|bzl)|BUILD|.*\.BUILD|WORKSPACE)$
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: 5bf6c09bfa1297d3692cadd621ef95f1284e33c0
hooks:
- id: check-added-large-files
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
files: (?!.*third_party)^.*$ | (?!.*book)^.*$
- id: end-of-file-fixer
......@@ -11,7 +11,13 @@ ERNIE 2.0 builds a strong basic for nearly every NLP tasks: Text Classification,
[\[more information\]](https://wenxin.baidu.com/)
# News
- Dec.29.2020:
- Pretrain and finetune ERNIE with [PaddlePaddle v2.0](https://github.com/PaddlePaddle/Paddle/tree/release/2.0-rc).
- New AMP(auto mixed precision) feature for every demo in this repo.
- Introducing `Gradient accumulation`, run `ERNIE-large` with only 8G memory.
- Sept.24.2020:
- [`ERNIE-ViL`](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil) is **available** now!
- A **knowledge-enhanced** joint representation for vision-language tasks.
- Constructing three **Scene Graph Prediction** tasks utilizing structured knowledge.
......@@ -20,20 +26,19 @@ ERNIE 2.0 builds a strong basic for nearly every NLP tasks: Text Classification,
- May.20.2020:
- Try ERNIE in "`dygraph`", with:
- Pretrain and finetune ERNIE with [PaddlePaddle v1.8](https://github.com/PaddlePaddle/Paddle/tree/release/1.8).
- Eager execution with `paddle.fluid.dygraph`.
- Distributed training.
- Easy deployment.
- Learn NLP in Aistudio tutorials.
- Backward compatibility for old-style checkpoints
- [`ERNIE-GEN`](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gen) is **available** now! ([link here](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gen))
- the **state-of-the-art** pre-trained model for generation tasks, accepted by `IJCAI-2020`.
- A novel **span-by-span generation pre-training task**.
- An **infilling generation** mechanism and a **noise-aware generation** method.
- Implemented by a carefully designed **Multi-Flow Attention** architecture.
- You are able to `download` all models including `base/large/large-430G`.
- Apr.30.2020: Release [ERNIESage](https://github.com/PaddlePaddle/PGL/tree/master/examples/erniesage), a novel Graph Neural Network Model using ERNIE as its aggregator. It is implemented through [PGL](https://github.com/PaddlePaddle/PGL)
- Mar.27.2020: [Champion on 5 SemEval2020 sub tasks](https://www.jiqizhixin.com/articles/2020-03-27-8)
- Dec.26.2019: [1st place on GLUE leaderboard](https://www.technologyreview.com/2019/12/26/131372/ai-baidu-ernie-google-bert-natural-language-glue/)
......@@ -41,7 +46,7 @@ ERNIE 2.0 builds a strong basic for nearly every NLP tasks: Text Classification,
- Jul.7.2019: [Introducing ERNIE2.0](https://www.jiqizhixin.com/articles/2019-07-31-10)
- Mar.16.2019: [Introducing ERNIE1.0](https://www.jiqizhixin.com/articles/2019-03-16-3)
# Table of contents
* [Tutorials](#tutorials)
* [Setup](#setup)
......@@ -54,18 +59,16 @@ ERNIE 2.0 builds a strong basic for nearly every NLP tasks: Text Classification,
```python
import numpy as np
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel
model = ErnieModel.from_pretrained('ernie-1.0') # Try to get pretrained model from server, make sure you have network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0)) # insert extra `batch` dimension
pooled, encoded = model(ids) # eager execution
print(pooled.numpy()) # convert results to numpy
......@@ -95,7 +98,7 @@ This repo requires PaddlePaddle 1.7.0+, please see [here](https://www.paddlepadd
pip install paddle-ernie
```
or
```shell
git clone https://github.com/PaddlePaddle/ERNIE.git --depth 1
......@@ -117,10 +120,10 @@ pip install -e .
| [ERNIE Gen Large 430G for English](https://ernie-github.cdn.bcebos.com/model-ernie-gen-large-430g-en.1.tar.gz)| Layer:24, Hidden:1024, Heads:16 + 430G pretrain corpus | ernie-gen-large-430g-en |
##### 4. download datasets
**English Datasets**
Download the [GLUE datasets](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
the `--data_dir` option in the following section assumes a directory tree like this:
......@@ -152,11 +155,16 @@ see [demo](https://ernie-github.cdn.bcebos.com/data-mnli-m.tar.gz) data for MNLI
- try eager execution with the `dygraph` model:
```script
python3 ./demo/finetune_classifier.py \
--from_pretrained ernie-1.0 \
--data_dir ./data/xnli
```
- specify `--use_amp` to activate AMP training.
- `--bsz` denotes the global batch size for one optimization step; `--micro_bsz` denotes the maximum batch size for each GPU device.
If `--micro_bsz < --bsz`, gradient accumulation will be activated (see the sketch below).
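A minimal sketch of how gradient accumulation emulates the global `--bsz` with `--micro_bsz`-sized batches (names such as `model`, `opt` and `dataloader` are placeholders; the actual logic lives in `demo/finetune_classifier.py`):

```python
# Illustrative only: one optimizer step per `bsz` samples, fed as micro-batches.
acc_step = bsz // micro_bsz                           # micro-batches per optimizer step
for i, (ids, sids, label) in enumerate(dataloader):   # each batch holds micro_bsz samples
    loss, _ = model(ids, sids, labels=label)
    (loss / acc_step).backward()                      # scale so gradients average over the global batch
    if (i + 1) % acc_step == 0:
        opt.step()                                    # one update per global batch
        opt.clear_grad()
```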
- Distributed finetune
`paddle.distributed.launch` is a process manager; we use it to launch one Python process on each available GPU device:
......@@ -165,15 +173,15 @@ When in distributed training, `max_steps` is used as stopping criteria rather th
You could calculate `max_steps` with `EPOCH * NUM_TRAIN_EXAMPLES / TOTAL_BATCH`.
Also note that we shard the training data according to device id to prevent overfitting.
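For example, finetuning on MNLI (roughly 390k training examples) for 3 epochs with a total batch size of 128 gives `max_steps ≈ 3 * 390000 / 128 ≈ 9000`, in the same ballpark as the `--max_steps 10000` used in the demo below.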
demo:
(make sure you have more than 2 GPUs,
online model download cannot work in `paddle.distributed.launch`,
you need to run single-card finetuning first to get a pretrained model, or download and extract one manually from [here](#section-pretrained-models)):
```script
python3 -m paddle.distributed.launch \
./demo/finetune_classifier_distributed.py \
--data_dir data/mnli \
--max_steps 10000 \
--from_pretrained ernie-2.0-en
......@@ -182,11 +190,12 @@ python3 -m paddle.distributed.launch \
Many other demo Python scripts:
1. [Sentiment Analysis](./demo/finetune_sentiment_analysis.py)
1. [Semantic Similarity](./demo/finetune_classifier.py)
1. [Named Entity Recognition (NER)](./demo/finetune_ner.py)
1. [Machine Reading Comprehension](./demo/finetune_mrc.py)
1. [Text generation](./demo/seq2seq/README.md)
1. [Text classification with `paddle.static` API](./demo/finetune_classifier_static.py)
......@@ -220,7 +229,7 @@ see [here](./demo/pretrain/README.md)
# Online inference
If `--inference_model_dir` is passed to `finetune_classifier.py`,
a deployable model will be generated at the end of finetuning and your model is ready to serve.
For details about online inference, see [C++ inference API](./inference/README.md),
......@@ -244,14 +253,14 @@ sids = np.expand_dims(sids, 0)
result = client(ids, sids)
```
A pre-made `inference model` for ernie-1.0 can be downloaded [here](https://ernie.bj.bcebos.com/ernie1.0_zh_inference_model.tar.gz).
It can be used for feature-based finetuning or feature extraction.
# Distillation
Knowledge distillation is a good way to compress and accelerate ERNIE.
For details about distillation, see [here](./demo/distill/README.md)
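A minimal sketch of the objective used by the distillation demo (hard-label cross entropy plus a KL term between student and teacher logits); `teacher`, `student` and the batch tensors below are placeholders, see `demo/distill/distill.py` for the working code:

```python
import paddle as P
import paddle.nn.functional as F

def distill_loss(student_logits, teacher_logits, hard_loss):
    # soft-label term: KL divergence between student and teacher output distributions
    loss_kd = F.kl_div(F.log_softmax(student_logits), F.softmax(teacher_logits.detach()))
    return hard_loss + loss_kd

# usage sketch:
#   with P.no_grad():
#       _, teacher_logits = teacher(ids, sids)
#   hard_loss, student_logits = student(ids_student, labels=label)
#   distill_loss(student_logits, teacher_logits, hard_loss).backward()
```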
# Citation
......@@ -271,7 +280,7 @@ For details about distillation, see [here](./distill/README.md)
title={ERNIE 2.0: A Continual Pre-training Framework for Language Understanding},
author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:1907.12412},
year={2019}
}
```
......@@ -306,4 +315,3 @@ For full reproduction of paper results, please checkout to `repro` branch of thi
- QQ discussion group: 760439550 (ERNIE discussion group).
- QQ discussion group: 958422639 (ERNIE discussion group-v2).
- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.
......@@ -10,16 +10,20 @@ ERNIE是百度开创性提出的基于知识增强的持续学习语义理解框
# 新闻
- 2020.12.29:
- `ERNIE`开源工具套件全面升级 [PaddlePaddle v2.0](https://github.com/PaddlePaddle/Paddle/tree/release/2.0-rc)
- 所有demo教程均引入AMP(混合精度训练), 平均提速达2.3倍。
- 引入`Gradient accumulation`, 8G显存也可运行`ERNIE-large`模型。
- 2020.9.24:
- `ERNIE-ViL` 模型正式开源! ([点击进入](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil))
- 面向视觉-语言知识增强的预训练框架,首次在视觉-语言预训练引入结构化的知识。
- 利用场景图中的知识,构建了物体、属性和关系预测任务,精细刻画模态间细粒度语义对齐。
- 五项视觉-语言下游任务取得最好效果,[视觉常识推理榜单](https://visualcommonsense.com/)取得第一。
- 2020.5.20:
- 欢迎试用`动态图`实现的 ERNIE:
- 基于[PaddlePaddle v1.8](https://github.com/PaddlePaddle/Paddle/tree/release/1.8)使用 ERNIE 进行 Pretrain 和 Finetune.
- 动态执行, 所见即所得。
- 大规模分布式训练。
- 易于部署。
......@@ -52,18 +56,16 @@ ERNIE是百度开创性提出的基于知识增强的持续学习语义理解框
# 快速上手
```python
import numpy as np
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel
model = ErnieModel.from_pretrained('ernie-1.0') # Try to get pretrained model from server, make sure you have network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0)) # insert extra `batch` dimension
pooled, encoded = model(ids) # eager execution
print(pooled.numpy()) # convert results to numpy
......@@ -71,7 +73,7 @@ print(pooled.numpy()) # convert results to numpy
# 教程
手边没有GPU?欢迎在[AIStudio](https://aistudio.baidu.com/aistudio/index)中直接试用 ERNIE.
(请选择最新版本的教程并申请GPU运行环境)
1. [从0开始学ERNIE](https://aistudio.baidu.com/studio/edu/group/quick/join/314947)
......@@ -159,11 +161,16 @@ data/xnli
- 使用 `动态图` 模型进行finetune:
```script
python3 ./ernie_d/demo/finetune_classifier.py \
--from_pretrained ernie-1.0 \
--data_dir ./data/xnli
```
- 加入`--use_amp`以启用AMP功能(请在支持`TensorCore`设备上启用AMP)
- 通过`--bsz`指定全局batch\_size(一步优化中模型所能见到的样本数), 通过`--micro_bsz` 指定输入给每一张GPU卡的样本数
当 `--bsz > --micro_bsz` 时,脚本会自动开启梯度累积功能.
- 分布式 finetune
`paddle.distributed.launch` 是一个进程管理器,我们采用它在每一张GPU上启动一个python进程,并配置相应的环境变量以进行分布式训练:
......@@ -177,7 +184,7 @@ python3 ./ernie_d/demo/finetune_classifier_dygraph.py \
```script
python3 -m paddle.distributed.launch \
./demo/finetune_classifier_distributed.py \
--data_dir data/mnli \
--max_steps 10000 \
--from_pretrained ernie-2.0-en
......@@ -186,11 +193,12 @@ python3 -m paddle.distributed.launch \
更多示例脚本:
1. [情感分析](./demo/finetune_sentiment_analysis.py)
1. [语义匹配](./demo/finetune_classifier.py)
1. [命名实体识别(NER)](./demo/finetune_ner.py)
1. [机器阅读理解](./demo/finetune_mrc.py) (需要多卡环境运行;参见上面"分布式 finetune"一节)
1. [文本摘要生成](./demo/seq2seq/README.md)
1. [使用静态图完成文本分类](./demo/finetune_classifier_static.py)
**推荐超参数设置:**
......@@ -221,7 +229,7 @@ python3 -m paddle.distributed.launch \
# 在线预测
如果`finetune_classifier.py`中指定了`--inference_model_dir`参数,finetune脚本会将你的模型序列化并产出可以直接部署线上预测的`inference_model`.
关于生产环境中使用线上预测代码的实现细节,请见[C++ inference API](./inference/README.md).
或者你可以使用`propeller`启动一个多GPU预测服务(需要GPU环境),只需执行:
......@@ -254,7 +262,7 @@ ids = np.expand_dims(ids, -1) # ids.shape==[BATCH, SEQLEN, 1]
# 蒸馏
知识蒸馏是进行ERNIE模型压缩、加速的有效方式;关于知识蒸馏的实现细节请参见[这里](./demo/distill/README.md)
# 文献引用
......@@ -274,7 +282,7 @@ ids = np.expand_dims(ids, -1) # ids.shape==[BATCH, SEQLEN, 1]
title={ERNIE 2.0: A Continual Pre-training Framework for Language Understanding},
author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:1907.12412},
year={2019}
}
```
......@@ -309,4 +317,3 @@ ids = np.expand_dims(ids, -1) # ids.shape==[BATCH, SEQLEN, 1]
- QQ 群: 760439550 (ERNIE discussion group).
- QQ 2群: 958422639 (ERNIE discussion group-v2).
- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.
......@@ -9,7 +9,7 @@
# ERNIE Slim 数据蒸馏
在ERNIE强大的语义理解能力背后,是需要同样强大的算力才能支撑起如此大规模模型的训练和预测。很多工业应用场景对性能要求较高,若不能有效压缩则无法实际应用。
![ernie_distill](../../.metas/ernie_distill.png)
因此,如上图所示,我们基于[数据蒸馏技术](https://arxiv.org/pdf/1712.04440.pdf)构建了**ERNIE Slim数据蒸馏系统**。它的原理是通过数据作为桥梁,将ERNIE模型的知识迁移至小模型,以达到损失很小的效果却能达到上千倍的预测速度提升的效果。
......@@ -18,11 +18,11 @@
- **Step 1**. 使用ERNIE模型对输入标注数据对进行fine-tune,得到Teacher Model
- **Step 2**. 使用ERNIE Service对以下无监督数据进行预测:
1. 用户提供的大规模无标注数据,需与标注数据同源
2. 对标注数据进行数据增强,具体增强策略见下节
3. 对无标注数据和数据增强数据进行一定比例混合
- **Step 3.** 使用步骤2的数据训练出Student Model
......@@ -59,7 +59,6 @@ python ./distill/distill.py
|---|---|
|ERNIE-Finetune |95.4% |
|非ERNIE基线(BOW)|90.1%|
|**+ 数据蒸馏** |91.4%|
|非ERNIE基线(LSTM)|91.2%|
|**+ 数据蒸馏**|93.9%|
......@@ -13,51 +13,66 @@
# limitations under the License.
import sys
import os
import numpy as np
from sklearn.metrics import f1_score
import paddle as P
import paddle.fluid as F
import paddle.fluid.layers as L
import paddle.fluid.dygraph as D
import propeller.paddle as propeller
from paddle.nn import functional as F
import propeller.paddle as propeller
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModelForSequenceClassification
from ernie.optimization import AdamW, LinearDecay
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
# 本例子采用chnsenticorp中文情感识别任务作为示范;并且事先通过数据增强扩充了蒸馏所需的无监督数据
#
#
# 下载数据;并存放在 ./chnsenticorp-data/
# 数据分为3列:原文;空格切词;情感标签
# 其中第一列为ERNIE的输入;第二列为BoW词袋模型的输入
# 事先统计好的BoW 词典在 ./chnsenticorp-data/vocab.bow.txt
# 定义finetune teacher模型所需要的超参数
DATA_DIR='./chnsenticorp-data/'
SEQLEN=256
BATCH=32
EPOCH=10
LR=5e-5
DATA_DIR = './chnsenticorp-data/'
SEQLEN = 256
BATCH = 32
EPOCH = 10
LR = 5e-5
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
student_vocab = {i.strip(): l for l, i in enumerate(open(os.path.join(DATA_DIR, 'vocab.bow.txt')).readlines())}
student_vocab = {
i.strip(): l
for l, i in enumerate(
open(
os.path.join(DATA_DIR, 'vocab.bow.txt'), encoding='utf8')
.readlines())
}
def space_tokenizer(i):
return i.decode('utf8').split()
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('seg_a', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.TextColumn('seg_a_student', unk_id=student_vocab['[UNK]'], vocab_dict=student_vocab, tokenizer=space_tokenizer),
propeller.data.LabelColumn('label', vocab_dict={
b"0": 0,
b"1": 1,
}),
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'seg_a_student',
unk_id=student_vocab['[UNK]'],
vocab_dict=student_vocab,
tokenizer=space_tokenizer),
propeller.data.LabelColumn(
'label', vocab_dict={
b"0": 0,
b"1": 1,
}),
])
def map_fn(seg_a, seg_a_student, label):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=SEQLEN)
sentence, segments = tokenizer.build_for_ernie(seg_a)
......@@ -74,9 +89,9 @@ train_ds_unlabel = feature_column.build_dataset('train-da', data_dir=os.path.joi
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(DATA_DIR, 'dev/'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(BATCH,)
shapes = ([-1,SEQLEN],[-1,SEQLEN], [-1, SEQLEN], [-1])
shapes = ([-1, SEQLEN], [-1, SEQLEN], [-1, SEQLEN], [-1])
types = ('int64', 'int64', 'int64', 'int64')
train_ds.data_shapes = shapes
......@@ -86,154 +101,198 @@ train_ds_unlabel.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
place = F.CUDAPlace(0)
D.guard(place).__enter__()
place = P.CUDAPlace(0)
def evaluate_teacher(model, dataset):
all_pred, all_label = [], []
with D.base._switch_tracer_mode_guard_(is_train=False):
with P.no_grad():
model.eval()
for step, (ids_student, ids, _, labels) in enumerate(dataset.start()):
for step, (ids_student, ids, _, labels) in enumerate(
P.io.DataLoader(
dataset, places=place, batch_size=None)):
_, logits = model(ids)
pred = L.argmax(logits, -1)
pred = logits.argmax(-1)
all_pred.extend(pred.numpy())
all_label.extend(labels.numpy())
f1 = f1_score(all_label, all_pred, average='macro')
model.train()
return f1
teacher_model = ErnieModelForSequenceClassification.from_pretrained('ernie-1.0', num_labels=2)
teacher_model = ErnieModelForSequenceClassification.from_pretrained(
'ernie-1.0', num_labels=2)
teacher_model.train()
if not os.path.exists('./teacher_model.pdparams'):
g_clip = F.clip.GradientClipByGlobalNorm(1.0)
opt = AdamW(learning_rate=LinearDecay(LR, 9600*EPOCH*0.1/BATCH, 9600*EPOCH/BATCH), parameter_list=teacher_model.parameters(), weight_decay=0.01, grad_clip=g_clip)
if not os.path.exists('./teacher_model.bin'):
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
lr_scheduler = P.optimizer.lr.LambdaDecay(
LR,
get_warmup_and_linear_decay(9600 * EPOCH / BATCH,
9600 * EPOCH * 0.1 / BATCH))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=teacher_model.parameters(),
weight_decay=0.01,
grad_clip=g_clip)
for epoch in range(EPOCH):
for step, (ids_student, ids, sids, labels) in enumerate(train_ds.start(place)):
for step, (ids_student, ids, sids, labels) in enumerate(
P.io.DataLoader(
train_ds, places=place, batch_size=None)):
loss, logits = teacher_model(ids, labels=labels)
loss.backward()
if step % 10 == 0:
print('[step %03d] teacher train loss %.5f lr %.3e' % (step, loss.numpy(), opt.current_step_lr()))
opt.minimize(loss)
opt.step()
lr_scheduler.step()
teacher_model.clear_gradients()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l, _lr)
print(msg)
if step % 100 == 0:
f1 = evaluate_teacher(teacher_model, dev_ds)
print('teacher f1: %.5f' %f1)
D.save_dygraph(teacher_model.state_dict(), './teacher_model')
print('teacher f1: %.5f' % f1)
P.save(teacher_model.state_dict(), './teacher_model.bin')
else:
state_dict, _ = D.load_dygraph('./teacher_model')
teacher_model.set_dict(state_dict)
state_dict = P.load('./teacher_model.bin')
teacher_model.set_state_dict(state_dict)
f1 = evaluate_teacher(teacher_model, dev_ds)
print('teacher f1: %.5f' % f1)
# 定义finetune student 模型所需要的超参数
SEQLEN=256
BATCH=100
EPOCH=10
LR=1e-4
SEQLEN = 256
BATCH = 32
EPOCH = 10
LR = 1e-4
def evaluate_student(model, dataset):
all_pred, all_label = [], []
with D.base._switch_tracer_mode_guard_(is_train=False):
with P.no_grad():
model.eval()
for step, (ids_student, ids, _, labels) in enumerate(dataset.start()):
for step, (ids_student, ids, _, labels) in enumerate(
P.io.DataLoader(
dataset, places=place, batch_size=None)):
_, logits = model(ids_student)
pred = L.argmax(logits, -1)
pred = logits.argmax(-1)
all_pred.extend(pred.numpy())
all_label.extend(labels.numpy())
f1 = f1_score(all_label, all_pred, average='macro')
model.train()
return f1
class BOW(D.Layer):
class BOW(P.nn.Layer):
def __init__(self):
super().__init__()
self.emb = D.Embedding([len(student_vocab), 128], padding_idx=0)
self.fc = D.Linear(128, 2)
self.emb = P.nn.Embedding(len(student_vocab), 128, padding_idx=0)
self.fc = P.nn.Linear(128, 2)
def forward(self, ids, labels=None):
embbed = self.emb(ids)
pad_mask = L.unsqueeze(L.cast(ids!=0, 'float32'), [-1])
pad_mask = (ids != 0).cast('float32').unsqueeze(-1)
embbed = L.reduce_sum(embbed * pad_mask, 1)
embbed = L.softsign(embbed)
embbed = (embbed * pad_mask).sum(1)
embbed = F.softsign(embbed)
logits = self.fc(embbed)
if labels is not None:
if len(labels.shape)==1:
labels = L.reshape(labels, [-1, 1])
loss = L.softmax_with_cross_entropy(logits, labels)
loss = L.reduce_mean(loss)
if len(labels.shape) == 1:
labels = labels.reshape([-1, 1])
loss = F.cross_entropy(logits, labels).mean()
else:
loss = None
return loss, logits
class CNN(D.Layer):
class CNN(P.nn.Layer):
def __init__(self):
super().__init__()
self.emb = D.Embedding([30002, 128], padding_idx=0)
self.cnn = D.Conv2D(128, 128, (1, 3), padding=(0, 1), act='relu')
self.pool = D.Pool2D((1, 3), pool_padding=(0, 1))
self.fc = D.Linear(128, 2)
self.emb = P.nn.Embedding(30002, 128, padding_idx=0)
self.cnn = P.nn.Conv2D(128, 128, (1, 3), padding=(0, 1), act='relu')
self.pool = P.nn.Pool2D((1, 3), pool_padding=(0, 1))
self.fc = P.nn.Linear(128, 2)
def forward(self, ids, labels=None):
embbed = self.emb(ids)
#d_batch, d_seqlen = ids.shape
hidden = embbed
hidden = L.transpose(hidden, [0, 2, 1]) #change to NCWH
hidden = L.unsqueeze(hidden, [2])
hidden = hidden.transpose([0, 2, 1]).unsqueeze(2) #change to NCWH
hidden = self.cnn(hidden)
hidden = self.pool(hidden)
hidden = L.squeeze(hidden, [2])
hidden = L.transpose(hidden, [0, 2, 1])
pad_mask = L.unsqueeze(L.cast(ids!=0, 'float32'), [-1])
hidden = L.softsign(L.reduce_sum(hidden * pad_mask, 1))
hidden = self.pool(hidden).squeeze(2).transpose([0, 2, 1])
pad_mask = (ids != 0).cast('float32').unsqueeze(-1)
hidden = F.softsign((hidden * pad_mask).sum(1))
logits = self.fc(hidden)
if labels is not None:
if len(labels.shape)==1:
labels = L.reshape(labels, [-1, 1])
loss = L.softmax_with_cross_entropy(logits, labels)
loss = L.reduce_mean(loss)
if len(labels.shape) == 1:
labels = labels.reshape([-1, 1])
loss = F.cross_entropy(logits, labels).mean()
else:
loss = None
return loss, logits
def KL(pred, target):
pred = L.log(L.softmax(pred))
target = L.softmax(target)
loss = L.kldiv_loss(pred, target)
pred = F.log_softmax(pred)
target = F.softmax(target)
loss = F.kl_div(pred, target)
return loss
teacher_model.eval()
model = BOW()
g_clip = F.clip.GradientClipByGlobalNorm(1.0) #experimental
opt = AdamW(learning_rate=LR, parameter_list=model.parameters(), weight_decay=0.01, grad_clip=g_clip)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
lr_scheduler = P.optimizer.lr.LambdaDecay(
LR,
get_warmup_and_linear_decay(9600 * EPOCH / BATCH,
9600 * EPOCH * 0.1 / BATCH))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=0.01,
grad_clip=g_clip)
model.train()
for epoch in range(EPOCH):
for step, (ids_student, ids, sids, label) in enumerate(train_ds.start(place)):
_, logits_t = teacher_model(ids, sids) # teacher 模型输出logits
logits_t.stop_gradient=True
_, logits_s = model(ids_student) # student 模型输出logits
for epoch in range(EPOCH - 1):
for step, (
ids_student, ids, sids, label
) in enumerate(P.io.DataLoader(
train_ds, places=place, batch_size=None)):
with P.no_grad():
_, logits_t = teacher_model(ids, sids) # teacher 模型输出logits
_, logits_s = model(ids_student) # student 模型输出logits
loss_ce, _ = model(ids_student, labels=label)
loss_kd = KL(logits_s, logits_t) # 由KL divergence度量两个分布的距离
loss_kd = KL(logits_s, logits_t.detach()) # 由KL divergence度量两个分布的距离
loss = loss_ce + loss_kd
loss.backward()
if step % 10 == 0:
print('[step %03d] distill train loss %.5f lr %.3e' % (step, loss.numpy(), opt.current_step_lr()))
opt.minimize(loss)
opt.step()
lr_scheduler.step()
model.clear_gradients()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l, _lr)
print(msg)
f1 = evaluate_student(model, dev_ds)
print('student f1 %.5f' % f1)
# 最后再加一轮hard label训练巩固结果
for step, (ids_student, ids, sids, label) in enumerate(train_ds.start(place)):
for step, (
ids_student, ids, sids, label
) in enumerate(P.io.DataLoader(
train_ds, places=place, batch_size=None)):
loss, _ = model(ids_student, labels=label)
loss.backward()
if step % 10 == 0:
print('[step %03d] train loss %.5f lr %.3e' % (step, loss.numpy(), opt.current_step_lr()))
opt.minimize(loss)
opt.step()
model.clear_gradients()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l, _lr)
print(msg)
f1 = evaluate_student(model, dev_ds)
print('final f1 %.5f' % f1)
......@@ -11,204 +11,257 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
import os
import re
import time
import logging
from random import random
import json
from random import random
from functools import reduce, partial
from visualdl import LogWriter
import numpy as np
import multiprocessing
import tempfile
import re
import paddle
import paddle.fluid as F
import paddle.fluid.layers as L
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from ernie.optimization import optimization
#import utils.data
import logging
import argparse
from pathlib import Path
import paddle as P
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
def model_fn(features, mode, params, run_config):
ernie = ErnieModelForSequenceClassification(params, name='')
if not params is propeller.RunMode.TRAIN:
ernie.eval()
metrics, loss = None, None
if mode is propeller.RunMode.PREDICT:
src_ids, sent_ids = features
_, logits = ernie(src_ids, sent_ids)
predictions = [logits,]
else:
src_ids, sent_ids, labels = features
if mode is propeller.RunMode.EVAL:
loss, logits = ernie(src_ids, sent_ids, labels=labels)
pred = L.argmax(logits, axis=1)
acc = propeller.metrics.Acc(labels, pred)
metrics = {'acc': acc}
predictions = [pred]
else:
loss, logits = ernie(src_ids, sent_ids, labels=labels)
scheduled_lr, _ = optimization(
loss=loss,
warmup_steps=int(run_config.max_steps * params['warmup_proportion']),
num_train_steps=run_config.max_steps,
learning_rate=params['learning_rate'],
train_program=F.default_main_program(),
startup_prog=F.default_startup_program(),
use_fp16=params.use_fp16,
weight_decay=params['weight_decay'],
scheduler="linear_warmup_decay",
)
propeller.summary.scalar('lr', scheduled_lr)
predictions = [logits,]
return propeller.ModelSpec(loss=loss, mode=mode, metrics=metrics, predictions=predictions)
if __name__ == '__main__':
parser = propeller.ArgumentParser('DAN model with Paddle')
parser.add_argument('--do_predict', action='store_true')
parser.add_argument('--max_seqlen', type=int, default=128)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--from_pretrained', type=str, required=True)
parser.add_argument('--warm_start_from', type=str)
parser.add_argument('--epoch', type=int, default=3)
parser.add_argument('--use_fp16', action='store_true')
args = parser.parse_args()
if not os.path.exists(args.from_pretrained):
raise ValueError('--from_pretrained not found: %s' % args.from_pretrained)
cfg_file_path = os.path.join(args.from_pretrained, 'ernie_config.json')
param_path = os.path.join(args.from_pretrained, 'params')
vocab_path = os.path.join(args.from_pretrained, 'vocab.txt')
assert os.path.exists(cfg_file_path) and os.path.exists(param_path) and os.path.exists(vocab_path)
hparams_cli = propeller.parse_hparam(args)
hparams_config_file = json.loads(open(cfg_file_path).read())
default_hparams = propeller.HParams(
batch_size=32,
num_labels=3,
warmup_proportion=0.1,
learning_rate=5e-5,
weight_decay=0.01,
use_task_id=False,
use_fp16=args.use_fp16,
)
hparams = default_hparams.join(propeller.HParams(**hparams_config_file)).join(hparams_cli)
default_run_config=dict(
max_steps=args.epoch * 390000 / hparams.batch_size,
save_steps=1000,
log_steps=10,
max_ckpt=1,
skip_steps=0,
model_dir=tempfile.mkdtemp(),
eval_steps=100)
run_config = dict(default_run_config, **json.loads(args.run_config))
run_config = propeller.RunConfig(**run_config)
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
unk_id = tokenizer.vocab['[UNK]']
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
if not args.do_predict:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('title', unk_id=unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.TextColumn('comment', unk_id=unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn('label', vocab_dict={
b"contradictory": 0,
b"contradiction": 0,
b"entailment": 1,
b"neutral": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
#label = np.expand_dims(label, -1) #
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=True, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size)
test_ds = feature_column.build_dataset('test', data_dir=os.path.join(args.data_dir, 'test'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size) \
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
test_ds.data_shapes = shapes
test_ds.data_types = types
varname_to_warmstart = re.compile(r'^encoder.*[wb]_0$|^.*embedding$|^.*bias$|^.*scale$|^pooled_fc.[wb]_0$')
ws = propeller.WarmStartSetting(
predicate_fn=lambda v: varname_to_warmstart.match(v.name) and os.path.exists(os.path.join(param_path, v.name)),
from_dir=param_path,
)
best_exporter = propeller.train.exporter.BestExporter(os.path.join(run_config.model_dir, 'best'), cmp_fn=lambda old, new: new['dev']['acc'] > old['dev']['acc'])
propeller.train.train_and_eval(
model_class_or_model_fn=model_fn,
params=hparams,
run_config=run_config,
train_dataset=train_ds,
eval_dataset={'dev': dev_ds, 'test': test_ds},
warm_start_setting=ws,
exporters=[best_exporter])
print('dev_acc3\t%.5f\ntest_acc3\t%.5f' % (best_exporter._best['dev']['acc'], best_exporter._best['test']['acc']))
else:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('title', unk_id=unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.TextColumn('comment', unk_id=unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
])
def map_fn(seg_a, seg_b):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments
predict_ds = feature_column.build_dataset_from_stdin('predict') \
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
#from ernie.optimization import AdamW, LinearDecay
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
parser = argparse.ArgumentParser('classify model with ERNIE')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument(
'--max_seqlen',
type=int,
default=128,
help='max sentence length, should not be greater than 512')
parser.add_argument(
'--bsz',
type=int,
default=128,
help='global batch size for each optimizer step')
parser.add_argument(
'--micro_bsz',
type=int,
default=32,
help='batch size for each device. if `--bsz` > `--micro_bsz` * num_device, will do grad accumulate'
)
parser.add_argument('--epoch', type=int, default=3, help='epoch')
parser.add_argument(
'--data_dir',
type=str,
required=True,
help='data directory includes train / develop data')
parser.add_argument(
'--use_lr_decay',
action='store_true',
help='if set, learning rate will decay to zero at `max_steps`')
parser.add_argument(
'--warmup_proportion',
type=float,
default=0.1,
help='if use_lr_decay is set, '
'learning rate will raise to `lr` at `warmup_proportion` * `max_steps` and decay to 0. at `max_steps`'
)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument(
'--inference_model_dir',
type=Path,
default=None,
help='inference model output directory')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--max_steps',
type=int,
default=None,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument(
'--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument(
'--init_checkpoint',
type=str,
default=None,
help='checkpoint to warm start from')
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
args = parser.parse_args()
if args.bsz > args.micro_bsz:
assert args.bsz % args.micro_bsz == 0, 'cannot perform gradient accumulate with bsz:%d micro_bsz:%d' % (
args.bsz, args.micro_bsz)
acc_step = args.bsz // args.micro_bsz
log.info(
'performing gradient accumulate: global_bsz:%d, micro_bsz:%d, accumulate_steps:%d'
% (args.bsz, args.micro_bsz, acc_step))
args.bsz = args.micro_bsz
else:
acc_step = 1
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'seg_b',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn(
'label',
vocab_dict={
b"contradictory": 0,
b"contradiction": 0,
b"entailment": 1,
b"neutral": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size) \
predict_ds.data_shapes = shapes[: -1]
predict_ds.data_types = types[: -1]
est = propeller.Learner(model_fn, run_config, hparams)
for res, in est.predict(predict_ds, ckpt=-1):
print('%d\t%.5f\t%.5f\t%.5f' % (np.argmax(res), res[0], res[1], res[2]))
.padded_batch(args.bsz, (0, 0, 0))
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
place = P.CUDAPlace(0)
model = ErnieModelForSequenceClassification.from_pretrained(
args.from_pretrained, num_labels=3, name='')
if args.init_checkpoint is not None:
log.info('loading checkpoint from %s' % args.init_checkpoint)
sd = P.load(args.init_checkpoint)
model.set_state_dict(sd)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
param_name_to_exclue_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
if args.use_lr_decay:
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(
args.max_steps, int(args.warmup_proportion * args.max_steps)))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=args.wd,
apply_decay_param_fun=lambda n: param_name_to_exclue_from_weight_decay.match(n),
grad_clip=g_clip)
else:
lr_scheduler = None
opt = P.optimizer.Adam(
args.lr,
parameters=model.parameters(),
weight_decay=args.wd,
apply_decay_param_fun=lambda n: param_name_to_exclue_from_weight_decay.match(n),
grad_clip=g_clip)
scaler = P.amp.GradScaler(enable=args.use_amp)
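# GradScaler performs dynamic loss scaling for AMP; with enable=False it is a pass-through,
# so the same training loop works with and without --use_amp.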
step, inter_step = 0, 0
with LogWriter(
logdir=str(create_if_not_exists(args.save_dir / 'vdl'))) as log_writer:
with P.amp.auto_cast(enable=args.use_amp):
for epoch in range(args.epoch):
for ids, sids, label in P.io.DataLoader(
train_ds, places=P.CUDAPlace(0), batch_size=None):
inter_step += 1
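# gradient accumulation: each micro-batch contributes loss / acc_step to the gradients;
# the optimizer only steps once inter_step is a multiple of acc_step (see the check below).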
loss, _ = model(ids, sids, labels=label)
loss /= acc_step
loss = scaler.scale(loss)
loss.backward()
if inter_step % acc_step != 0:
continue
step += 1
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler and lr_scheduler.step()
if step % 10 == 0:
_lr = lr_scheduler.get_lr() if lr_scheduler else args.lr
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[step-%d] train loss %.5f lr %.3e scaling %.3e' % (
step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l,
_lr)
log.debug(msg)
log_writer.add_scalar('loss', _l, step=step)
log_writer.add_scalar('lr', _lr, step=step)
if step % 100 == 0:
acc = []
with P.no_grad():
model.eval()
for ids, sids, label in P.io.DataLoader(
dev_ds, places=P.CUDAPlace(0),
batch_size=None):
loss, logits = model(ids, sids, labels=label)
#print('\n'.join(map(str, logits.numpy().tolist())))
a = (logits.argmax(-1) == label)
acc.append(a.numpy())
model.train()
acc = np.concatenate(acc).mean()
log_writer.add_scalar('eval/acc', acc, step=step)
log.debug('acc %.5f' % acc)
if args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
if args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
if args.inference_model_dir is not None:
class InferenceModel(ErnieModelForSequenceClassification):
def forward(self, ids, sids):
_, logits = super(InferenceModel, self).forward(ids, sids)
return logits
model.__class__ = InferenceModel
log.debug('saving inference model')
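# trace the dygraph model with dummy int64 inputs and export a static inference program for deployment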
src_placeholder = P.zeros([2, 2], dtype='int64')
sent_placehodler = P.zeros([2, 2], dtype='int64')
_, static = P.jit.TracedLayer.trace(
model, inputs=[src_placeholder, sent_placehodler])
static.save_inference_model(str(args.inference_model_dir))
#class InferenceModel(ErnieModelForSequenceClassification):
# @P.jit.to_static
# def forward(self, ids, sids):
# _, logits = super(InferenceModel, self).forward(ids, sids, labels=None)
# return logits
#model.__class__ = InferenceModel
#src_placeholder = P.zeros([2, 2], dtype='int64')
#sent_placehodler = P.zeros([2, 2], dtype='int64')
#P.jit.save(model, args.inference_model_dir, input_var=[src_placeholder, sent_placehodler])
log.debug('done')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import time
import logging
import json
import re
from random import random
from functools import reduce, partial
import numpy as np
import logging
#from visualdl import LogWriter
from pathlib import Path
import paddle as P
from propeller import log
import propeller.paddle as propeller
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
#from ernie.optimization import AdamW, LinearDecay
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
parser = propeller.ArgumentParser('classify model with ERNIE')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument(
'--max_seqlen',
type=int,
default=128,
help='max sentence length, should not be greater than 512')
parser.add_argument('--bsz', type=int, default=32, help='batchsize')
parser.add_argument(
'--data_dir',
type=str,
required=True,
help='data directory includes train / develop data')
parser.add_argument(
'--max_steps',
type=int,
required=True,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument(
'--init_checkpoint',
type=str,
default=None,
help='checkpoint to warm start from')
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
args = parser.parse_args()
env = P.distributed.ParallelEnv()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'seg_b',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn(
'label', vocab_dict={
b"0": 0,
b"1": 1,
b"2": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'),
shuffle=True, repeat=True, use_gz=False, shard=True) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
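# shard=True splits the training data across replicas, so each GPU sees a distinct subset (see the README note on sharding)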
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'),
shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
P.distributed.init_parallel_env()
model = ErnieModelForSequenceClassification.from_pretrained(
args.from_pretrained, num_labels=3, name='')
if args.init_checkpoint is not None:
log.info('loading checkpoint from %s' % args.init_checkpoint)
sd = P.load(args.init_checkpoint)
model.set_state_dict(sd)
model = P.DataParallel(model)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
param_name_to_exclue_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(args.max_steps,
int(args.warmup_proportion * args.max_steps)))
opt = P.optimizer.AdamW(
learning_rate=lr_scheduler,
parameters=model.parameters(),
apply_decay_param_fun=lambda n: param_name_to_exclue_from_weight_decay.match(n),
weight_decay=args.wd,
grad_clip=g_clip)
scaler = P.amp.GradScaler(enable=args.use_amp)
step = 0
create_if_not_exists(args.save_dir)
#with LogWriter(logdir=str(create_if_not_exists(args.save_dir / 'vdl-%d' % env.dev_id))) as log_writer:
with P.amp.auto_cast(enable=args.use_amp):
for ids, sids, label in P.io.DataLoader(
train_ds, places=P.CUDAPlace(env.dev_id), batch_size=None):
step += 1
loss, _ = model(ids, sids, labels=label)
loss = scaler.scale(loss)
loss.backward()
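# scaler.minimize unscales the gradients and runs the optimizer step;
# when AMP is enabled the update is skipped automatically if inf/nan gradients are found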
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
# do logging
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e scaling %.3e' % (
env.dev_id, step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e' % (
env.dev_id, step, _l, _lr)
log.debug(msg)
#log_writer.add_scalar('loss', _l, step=step)
#log_writer.add_scalar('lr', _lr, step=step)
# do saving
if step % 100 == 0 and env.dev_id == 0:
acc = []
with P.no_grad():
model.eval()
for d in P.io.DataLoader(
dev_ds, places=P.CUDAPlace(env.dev_id),
batch_size=None):
ids, sids, label = d
loss, logits = model(ids, sids, labels=label)
a = (logits.argmax(-1) == label)
acc.append(a.numpy())
model.train()
acc = np.concatenate(acc).mean()
#log_writer.add_scalar('eval/acc', acc, step=step)
log.debug('acc %.5f' % acc)
if args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
# exit
if step > args.max_steps:
break
if args.save_dir is not None and env.dev_id == 0:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
log.debug('done')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import json
from random import random
from tqdm import tqdm
from functools import reduce, partial
import numpy as np
import logging
import argparse
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as FD
import paddle.fluid.layers as L
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from ernie.optimization import AdamW, LinearDecay
if __name__ == '__main__':
parser = argparse.ArgumentParser('classify model with ERNIE')
parser.add_argument('--from_pretrained', type=str, required=True, help='pretrained model directory or tag')
parser.add_argument('--max_seqlen', type=int, default=128, help='max sentence length, should not greater than 512')
parser.add_argument('--bsz', type=int, default=32, help='batchsize')
parser.add_argument('--epoch', type=int, default=3, help='epoch')
parser.add_argument('--data_dir', type=str, required=True, help='data directory includes train / develop data')
parser.add_argument('--use_lr_decay', action='store_true', help='if set, learning rate will decay to zero at `max_steps`')
parser.add_argument('--warmup_proportion', type=float, default=0.1, help='if use_lr_decay is set, '
'learning rate will raise to `lr` at `warmup_proportion` * `max_steps` and decay to 0. at `max_steps`')
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument('--inference_model_dir', type=str, default=None, help='inference model output directory')
parser.add_argument('--save_dir', type=str, default=None, help='model output directory')
parser.add_argument('--max_steps', type=int, default=None, help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument('--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument('--init_checkpoint', type=str, default=None, help='checkpoint to warm start from')
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('seg_a', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.TextColumn('seg_b', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn('label', vocab_dict={
b"contradictory": 0,
b"contradiction": 0,
b"entailment": 1,
b"neutral": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
place = F.CUDAPlace(0)
with FD.guard(place):
model = ErnieModelForSequenceClassification.from_pretrained(args.from_pretrained, num_labels=3, name='')
if args.init_checkpoint is not None:
log.info('loading checkpoint from %s' % args.init_checkpoint)
sd, _ = FD.load_dygraph(args.init_checkpoint)
model.set_dict(sd)
g_clip = F.clip.GradientClipByGlobalNorm(1.0) #experimental
if args.use_lr_decay:
opt = AdamW(learning_rate=LinearDecay(args.lr, int(args.warmup_proportion * args.max_steps), args.max_steps), parameter_list=model.parameters(), weight_decay=args.wd, grad_clip=g_clip)
else:
opt = AdamW(args.lr, parameter_list=model.parameters(), weight_decay=args.wd, grad_clip=g_clip)
for epoch in range(args.epoch):
for step, d in enumerate(tqdm(train_ds.start(place), desc='training')):
ids, sids, label = d
loss, _ = model(ids, sids, labels=label)
loss.backward()
if step % 10 == 0:
log.debug('train loss %.5f lr %.3e' % (loss.numpy(), opt.current_step_lr()))
opt.minimize(loss)
model.clear_gradients()
if step % 100 == 0:
acc = []
with FD.base._switch_tracer_mode_guard_(is_train=False):
model.eval()
for step, d in enumerate(tqdm(dev_ds.start(place), desc='evaluating %d' % epoch)):
ids, sids, label = d
loss, logits = model(ids, sids, labels=label)
#print('\n'.join(map(str, logits.numpy().tolist())))
a = L.argmax(logits, -1) == label
acc.append(a.numpy())
model.train()
log.debug('acc %.5f' % np.concatenate(acc).mean())
if args.save_dir is not None:
F.save_dygraph(model.state_dict(), args.save_dir)
if args.inference_model_dir is not None:
log.debug('saving inference model')
class InferemceModel(ErnieModelForSequenceClassification):
def forward(self, *args, **kwargs):
_, logits = super(InferemceModel, self).forward(*args, **kwargs)
return logits
model.__class__ = InferemceModel #dynamic change model type, to make sure forward output doesn't contain `None`
src_placeholder = FD.to_variable(np.ones([1, 1], dtype=np.int64))
sent_placehodler = FD.to_variable(np.zeros([1, 1], dtype=np.int64))
model(src_placeholder, sent_placehodler)
_, static_model = FD.TracedLayer.trace(model, inputs=[src_placeholder, sent_placehodler])
static_model.save_inference_model(args.inference_model_dir)
log.debug('done')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import json
from random import random
from tqdm import tqdm
from functools import reduce, partial
import numpy as np
import logging
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as FD
import paddle.fluid.layers as L
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from ernie.optimization import AdamW, LinearDecay
if __name__ == '__main__':
parser = propeller.ArgumentParser('classify model with ERNIE')
parser.add_argument('--from_pretrained', type=str, required=True, help='pretrained model directory or tag')
parser.add_argument('--max_seqlen', type=int, default=128, help='max sentence length, should not greater than 512')
parser.add_argument('--bsz', type=int, default=32, help='batchsize')
parser.add_argument('--data_dir', type=str, required=True, help='data directory includes train / develop data')
parser.add_argument('--max_steps', type=int, required=True, help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument('--save_dir', type=str, default=None, help='model output directory')
parser.add_argument('--wd', type=int, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument('--init_checkpoint', type=str, default=None, help='checkpoint to warm start from')
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('seg_a', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.TextColumn('seg_b', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn('label', vocab_dict={
b"0": 0,
b"1": 1,
b"2": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=False, repeat=True, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
train_ds = train_ds.shard(propeller.train.distribution.status.num_replica, propeller.train.distribution.status.replica_id)
log.debug('shard %d/%d'%(propeller.train.distribution.status.num_replica, propeller.train.distribution.status.replica_id))
train_ds = train_ds.shuffle(10000)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
place = F.CUDAPlace(FD.parallel.Env().dev_id)
with FD.guard(place):
ctx = FD.parallel.prepare_context()
model = ErnieModelForSequenceClassification.from_pretrained(args.from_pretrained, num_labels=3, name='')
if args.init_checkpoint is not None:
log.info('loading checkpoint from %s' % args.init_checkpoint)
sd, _ = FD.load_dygraph(args.init_checkpoint)
model.set_dict(sd)
model = FD.parallel.DataParallel(model, ctx)
g_clip = F.clip.GradientClipByGlobalNorm(1.0) #experimental
opt = AdamW(learning_rate=LinearDecay(
args.lr,
int(args.warmup_proportion * args.max_steps),
args.max_steps),
parameter_list=model.parameters(),
weight_decay=args.wd,
grad_clip=g_clip)
for step, d in enumerate(tqdm(train_ds.start(place), desc='training')):
ids, sids, label = d
loss, _ = model(ids, sids, labels=label)
scaled_loss = model.scale_loss(loss)
scaled_loss.backward()
model.apply_collective_grads()
opt.minimize(scaled_loss)
model.clear_gradients()
if step % 10 == 0:
log.debug('train loss %.5f, lr %.3e' % (loss.numpy(), opt.current_step_lr()))
if step % 100 == 0 and FD.parallel.Env().dev_id == 0:
acc = []
with FD.base._switch_tracer_mode_guard_(is_train=False):
model.eval()
for eval_step, d in enumerate(tqdm(dev_ds.start(place), desc='evaluating')):  # separate name so the outer training `step` (used by the max_steps check) is not clobbered
ids, sids, label = d
loss, logits = model(ids, sids, labels=label)
#print('\n'.join(map(str, logits.numpy().tolist())))
a = L.argmax(logits, -1) == label
acc.append(a.numpy())
model.train()
log.debug('acc %.5f' % np.concatenate(acc).mean())
if step > args.max_steps:
break
if args.save_dir is not None:
F.save_dygraph(model.state_dict(), args.save_dir)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
import os
import re
import time
import logging
from random import random
import json
from functools import reduce, partial
import numpy as np
import multiprocessing
import tempfile
import re
import paddle as P
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from demo.optimization import optimization
#import utils.data
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
def model_fn(features, mode, params, run_config):
ernie = ErnieModelForSequenceClassification(params, name='')
if mode is not propeller.RunMode.TRAIN:
ernie.eval()
else:
ernie.train()
metrics, loss = None, None
if mode is propeller.RunMode.PREDICT:
src_ids, sent_ids = features
_, logits = ernie(src_ids, sent_ids)
predictions = [logits, ]
else:
src_ids, sent_ids, labels = features
if mode is propeller.RunMode.EVAL:
loss, logits = ernie(src_ids, sent_ids, labels=labels)
pred = logits.argmax(axis=1)
acc = propeller.metrics.Acc(labels, pred)
metrics = {'acc': acc}
predictions = [pred]
train_hooks = None
else:
loss, logits = ernie(src_ids, sent_ids, labels=labels)
lr_step_hook, loss_scale_coef = optimization(
loss=loss,
warmup_steps=int(run_config.max_steps *
params['warmup_proportion']),
num_train_steps=run_config.max_steps,
learning_rate=params['learning_rate'],
train_program=P.static.default_main_program(),
startup_prog=P.static.default_startup_program(),
use_fp16=args.use_amp,
weight_decay=params['weight_decay'],
scheduler="linear_warmup_decay", )
scheduled_lr = P.static.default_main_program().global_block().var(
'learning_rate_0')
propeller.summary.scalar('lr', scheduled_lr)
predictions = [logits, ]
train_hooks = [lr_step_hook]
return propeller.ModelSpec(
loss=loss,
mode=mode,
metrics=metrics,
predictions=predictions,
train_hooks=train_hooks)
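# (descriptive comment, added for clarity) propeller drives this model_fn in three RunModes,
# much like a tf.Estimator: TRAIN needs a loss plus the lr-step hook, EVAL needs metrics,
# and PREDICT only needs prediction tensors; ModelSpec simply bundles whichever of those
# the current mode requires.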
if __name__ == '__main__':
parser = propeller.ArgumentParser('classify model with ERNIE')
parser.add_argument('--do_predict', action='store_true')
parser.add_argument('--max_seqlen', type=int, default=128)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--from_pretrained', type=str, required=True)
parser.add_argument('--warm_start_from', type=str)
parser.add_argument('--epoch', type=int, default=3)
parser.add_argument('--use_amp', action='store_true')
args = parser.parse_args()
P.enable_static()
if not os.path.exists(args.from_pretrained):
raise ValueError('--from_pretrained not found: %s' %
args.from_pretrained)
cfg_file_path = os.path.join(args.from_pretrained, 'ernie_config.json')
param_path = os.path.join(args.from_pretrained, 'params')
vocab_path = os.path.join(args.from_pretrained, 'vocab.txt')
assert os.path.exists(cfg_file_path) and os.path.exists(
param_path) and os.path.exists(vocab_path)
hparams_cli = propeller.parse_hparam(args)
hparams_config_file = json.loads(open(cfg_file_path).read())
default_hparams = propeller.HParams(
batch_size=32,
num_labels=3,
warmup_proportion=0.1,
learning_rate=5e-5,
weight_decay=0.01,
use_task_id=False,
use_fp16=args.use_amp)
hparams = default_hparams.join(propeller.HParams(
**hparams_config_file)).join(hparams_cli)
default_run_config = dict(
max_steps=args.epoch * 390000 / hparams.batch_size,
save_steps=1000,
log_steps=10,
max_ckpt=1,
skip_steps=0,
model_dir=tempfile.mkdtemp(),
eval_steps=100)
run_config = dict(default_run_config, **json.loads(args.run_config))
run_config = propeller.RunConfig(**run_config)
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
unk_id = tokenizer.vocab['[UNK]']
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
if not args.do_predict:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'title',
unk_id=unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'comment',
unk_id=unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn(
'label',
vocab_dict={
b"contradictory": 0,
b"contradiction": 0,
b"entailment": 1,
b"neutral": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(
seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
#label = np.expand_dims(label, -1) #
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=True, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size)
test_ds = feature_column.build_dataset('test', data_dir=os.path.join(args.data_dir, 'test'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size)
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
test_ds.data_shapes = shapes
test_ds.data_types = types
varname_to_warmstart = re.compile(
r'^encoder.*[wb]_0$|^.*embedding$|^.*bias$|^.*scale$|^pooled_fc.[wb]_0$'
)
ws = propeller.WarmStartSetting(
predicate_fn=lambda v: varname_to_warmstart.match(v.name) and os.path.exists(os.path.join(param_path, v.name)),
from_dir=param_path,
)
best_exporter = propeller.train.exporter.BestExporter(
os.path.join(run_config.model_dir, 'best'),
cmp_fn=lambda old, new: new['dev']['acc'] > old['dev']['acc'])
propeller.train.train_and_eval(
model_class_or_model_fn=model_fn,
params=hparams,
run_config=run_config,
train_dataset=train_ds,
eval_dataset={'dev': dev_ds,
'test': test_ds},
warm_start_setting=ws,
exporters=[best_exporter])
print('dev_acc3\t%.5f\ntest_acc3\t%.5f' %
(best_exporter._best['dev']['acc'],
best_exporter._best['test']['acc']))
else:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'title',
unk_id=unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'comment',
unk_id=unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
])
def map_fn(seg_a, seg_b):
seg_a, seg_b = tokenizer.truncate(
seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments
predict_ds = feature_column.build_dataset_from_stdin('predict') \
.map(map_fn) \
.padded_batch(hparams.batch_size)
predict_ds.data_shapes = shapes[:-1]
predict_ds.data_types = types[:-1]
est = propeller.Learner(model_fn, run_config, hparams)
for res, in est.predict(predict_ds, ckpt=-1):
print('%d\t%.5f\t%.5f\t%.5f' %
(np.argmax(res), res[0], res[1], res[2]))
......@@ -17,114 +17,187 @@ from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import os
import re
import time
import logging
import json
from pathlib import Path
from random import random
from tqdm import tqdm
from functools import reduce, partial
import pickle
import argparse
from functools import partial
from io import open
import numpy as np
import logging
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as D
import paddle.fluid.layers as L
import paddle as P
from propeller import log
import propeller.paddle as propeller
from ernie.modeling_ernie import ErnieModel, ErnieModelForQuestionAnswering
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from ernie.optimization import AdamW, LinearDecay
#from ernie.optimization import AdamW, LinearDecay
from demo.mrc import mrc_reader
from demo.mrc import mrc_metrics
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
def evaluate(model, ds, all_examples, all_features, tokenizer, args):
dev_file = json.loads(open(args.dev_file).read())
with D.base._switch_tracer_mode_guard_(is_train=False):
dev_file = json.loads(open(args.dev_file, encoding='utf8').read())
with P.no_grad():
log.debug('start eval')
model.eval()
all_res = []
for step, (uids, token_ids, token_type_ids, _, __) in enumerate(ds.start(place)):
_ , start_logits, end_logits = model(token_ids, token_type_ids)
res = [mrc_metrics.RawResult(unique_id=u, start_logits=s, end_logits=e)
for u, s, e in zip(uids.numpy(), start_logits.numpy(), end_logits.numpy())]
all_res = []
for step, (uids, token_ids, token_type_ids, _, __) in enumerate(
P.io.DataLoader(
ds, places=P.CUDAPlace(env.dev_id), batch_size=None)):
_, start_logits, end_logits = model(token_ids, token_type_ids)
res = [
mrc_metrics.RawResult(
unique_id=u, start_logits=s, end_logits=e)
for u, s, e in zip(uids.numpy(),
start_logits.numpy(), end_logits.numpy())
]
all_res += res
open('all_res', 'wb').write(pickle.dumps(all_res))
all_pred, all_nbests = mrc_metrics.make_results(
tokenizer,
all_examples,
all_features,
all_res,
n_best_size=args.n_best_size,
max_answer_length=args.max_answer_length,
do_lower_case=tokenizer.lower)
tokenizer,
all_examples,
all_features,
all_res,
n_best_size=args.n_best_size,
max_answer_length=args.max_answer_length,
do_lower_case=tokenizer.lower)
f1, em, _, __ = mrc_metrics.evaluate(dev_file, all_pred)
model.train()
log.debug('done eval')
return f1, em
def train(model, train_dataset, dev_dataset, dev_examples, dev_features, tokenizer, args):
ctx = D.parallel.prepare_context()
model = D.parallel.DataParallel(model, ctx)
def train(model, train_dataset, dev_dataset, dev_examples, dev_features,
tokenizer, args):
model = P.DataParallel(model)
max_steps = len(train_features) * args.epoch // args.bsz
g_clip = F.clip.GradientClipByGlobalNorm(1.0) #experimental
opt = AdamW(learning_rate=args.lr, parameter_list=model.parameters(), weight_decay=args.wd, grad_clip=g_clip)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(max_steps,
int(args.warmup_proportion * max_steps)))
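# (descriptive comment, added for clarity) LambdaDecay multiplies the base lr by the factor
# returned by get_warmup_and_linear_decay: per the helper's name and the --warmup_proportion
# help text, that is a linear warm-up over the first warmup_proportion * max_steps steps,
# then a linear decay towards 0 at max_steps. The schedule only advances when
# lr_scheduler.step() is called after each optimizer update in the loop below.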
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=args.wd,
grad_clip=g_clip)
train_dataset = train_dataset \
.repeat() \
.shard(D.parallel.Env().nranks, D.parallel.Env().dev_id) \
.shuffle(1000) \
.padded_batch(args.bsz)
.cache_shuffle_shard(env.nranks, env.dev_id, drop_last=True) \
.padded_batch(args.bsz)
log.debug('init training with args: %s' % repr(args))
for step, (_, token_ids, token_type_ids, start_pos, end_pos) in enumerate(train_dataset.start(place)):
loss, _, __ = model(token_ids, token_type_ids, start_pos=start_pos, end_pos=end_pos)
scaled_loss = model.scale_loss(loss)
scaled_loss.backward()
model.apply_collective_grads()
opt.minimize(scaled_loss)
model.clear_gradients()
if D.parallel.Env().dev_id == 0 and step % 10 == 0:
log.debug('[step %d] train loss %.5f lr %.3e' % (step, loss.numpy(), opt.current_step_lr()))
if D.parallel.Env().dev_id == 0 and step % 100 == 0:
f1, em = evaluate(model, dev_dataset, dev_examples, dev_features, tokenizer, args)
log.debug('[step %d] eval result: f1 %.5f em %.5f' % (step, f1, em))
if step > max_steps:
break
scaler = P.amp.GradScaler(enable=args.use_amp)
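# (descriptive comment, added for clarity) Standard dynamic-loss-scaling AMP pattern: under
# auto_cast the forward pass runs eligible ops in fp16, scaler.scale() multiplies the loss
# before backward() so small fp16 gradients do not underflow, and scaler.minimize() unscales
# the gradients, skips the update when inf/nan gradients are detected, and adapts the loss
# scale. With --use_amp unset the scaler is disabled and this reduces to plain fp32 training.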
create_if_not_exists(args.save_dir)
with P.amp.auto_cast(enable=args.use_amp):
for step, (_, token_ids, token_type_ids, start_pos,
end_pos) in enumerate(
P.io.DataLoader(
train_dataset,
places=P.CUDAPlace(env.dev_id),
batch_size=None)):
loss, _, __ = model(
token_ids,
token_type_ids,
start_pos=start_pos,
end_pos=end_pos)
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
if env.dev_id == 0 and step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e scaling %.3e' % (
env.dev_id, step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e' % (
env.dev_id, step, _l, _lr)
log.debug(msg)
if env.dev_id == 0 and step % 100 == 0:
f1, em = evaluate(model, dev_dataset, dev_examples,
dev_features, tokenizer, args)
log.debug('[step %d] eval result: f1 %.5f em %.5f' %
(step, f1, em))
if env.dev_id == 0 and args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
if step > max_steps:
break
if __name__ == "__main__":
parser = argparse.ArgumentParser('MRC model with ERNIE')
parser.add_argument('--from_pretrained', type=str, required=True, help='pretrained model directory or tag')
parser.add_argument('--max_seqlen', type=int, default=512, help='max sentence length, should not be greater than 512')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument(
'--max_seqlen',
type=int,
default=512,
help='max sentence length, should not be greater than 512')
parser.add_argument('--bsz', type=int, default=8, help='batchsize')
parser.add_argument('--epoch', type=int, default=2, help='epoch')
parser.add_argument('--train_file', type=str, required=True, help='path to training data file')
parser.add_argument('--dev_file', type=str, required=True, help='path to development data file')
parser.add_argument(
'--train_file',
type=str,
required=True,
help='path to training data file')
parser.add_argument(
'--dev_file',
type=str,
required=True,
help='path to development data file')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=3e-5, help='learning rate')
parser.add_argument('--save_dir', type=str, default=None, help='model output directory')
parser.add_argument('--n_best_size', type=int, default=20, help='nbest prediction to keep')
parser.add_argument('--max_answer_length', type=int, default=100, help='max answer span')
parser.add_argument('--wd', type=float, default=0.00, help='weight decay, aka L2 regularizer')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--n_best_size', type=int, default=20, help='nbest prediction to keep')
parser.add_argument(
'--max_answer_length', type=int, default=100, help='max answer span')
parser.add_argument(
'--wd',
type=float,
default=0.01,
help='weight decay, aka L2 regularizer')
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
args = parser.parse_args()
env = P.distributed.ParallelEnv()
P.distributed.init_parallel_env()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
if not os.path.exists(args.train_file):
......@@ -134,43 +207,41 @@ if __name__ == "__main__":
log.info('making train/dev data...')
train_examples = mrc_reader.read_files(args.train_file, is_training=True)
train_features = mrc_reader.convert_example_to_features(train_examples, args.max_seqlen, tokenizer, is_training=True)
train_features = mrc_reader.convert_example_to_features(
train_examples, args.max_seqlen, tokenizer, is_training=True)
dev_examples = mrc_reader.read_files(args.dev_file, is_training=False)
dev_features = mrc_reader.convert_example_to_features(dev_examples, args.max_seqlen, tokenizer, is_training=False)
dev_features = mrc_reader.convert_example_to_features(
dev_examples, args.max_seqlen, tokenizer, is_training=False)
log.info('train examples: %d, features: %d' % (len(train_examples), len(train_features)))
log.info('train examples: %d, features: %d' %
(len(train_examples), len(train_features)))
def map_fn(unique_id, example_index, doc_span_index, tokens, token_to_orig_map, token_is_max_context, token_ids, position_ids, text_type_ids, start_position, end_position):
def map_fn(unique_id, example_index, doc_span_index, tokens,
token_to_orig_map, token_is_max_context, token_ids,
position_ids, text_type_ids, start_position, end_position):
if start_position is None:
start_position = 0
if end_position is None:
end_position = 0
return np.array(unique_id), np.array(token_ids), np.array(text_type_ids), np.array(start_position), np.array(end_position)
return np.array(unique_id), np.array(token_ids), np.array(
text_type_ids), np.array(start_position), np.array(end_position)
train_dataset = propeller.data.Dataset.from_list(train_features).map(map_fn)
train_dataset = propeller.data.Dataset.from_list(train_features).map(
map_fn)
dev_dataset = propeller.data.Dataset.from_list(dev_features).map(map_fn).padded_batch(args.bsz)
shapes = ([-1], [-1, args.max_seqlen], [-1, args.max_seqlen], [-1], [-1])
types = ('int64', 'int64', 'int64', 'int64', 'int64')
dev_dataset = propeller.data.Dataset.from_list(dev_features).map(
map_fn).padded_batch(args.bsz)
train_dataset.name = 'train'
dev_dataset.name = 'dev'
model = ErnieModelForQuestionAnswering.from_pretrained(
args.from_pretrained, name='')
train_dataset.data_shapes = shapes
train_dataset.data_types = types
dev_dataset.data_shapes = shapes
dev_dataset.data_types = types
train(model, train_dataset, dev_dataset, dev_examples, dev_features,
tokenizer, args)
place = F.CUDAPlace(D.parallel.Env().dev_id)
D.guard(place).__enter__()
model = ErnieModelForQuestionAnswering.from_pretrained(args.from_pretrained, name='')
train(model, train_dataset, dev_dataset, dev_examples, dev_features, tokenizer, args)
if D.parallel.Env().dev_id == 0:
f1, em = evaluate(model, dev_dataset, dev_examples, dev_features, tokenizer, args)
if env.dev_id == 0:
f1, em = evaluate(model, dev_dataset, dev_examples, dev_features,
tokenizer, args)
log.debug('final eval result: f1 %.5f em %.5f' % (f1, em))
if D.parallel.Env().dev_id == 0 and args.save_dir is not None:
F.save_dygraph(model.state_dict(), args.save_dir)
if env.dev_id == 0 and args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import six
import json
from random import random
from tqdm import tqdm
from collections import OrderedDict
from functools import reduce, partial
from pathlib import Path
from visualdl import LogWriter
import numpy as np
import multiprocessing
import pickle
import logging
from sklearn.metrics import f1_score
import paddle as P
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification, ErnieModelForTokenClassification
from ernie.tokenizing_ernie import ErnieTokenizer
#from ernie.optimization import AdamW, LinearDecay
parser = propeller.ArgumentParser('NER model with ERNIE')
parser.add_argument('--max_seqlen', type=int, default=256)
parser.add_argument('--bsz', type=int, default=32)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--epoch', type=int, default=6)
parser.add_argument(
'--warmup_proportion',
type=float,
default=0.1,
help='if use_lr_decay is set, '
'learning rate will rise to `lr` at `warmup_proportion` * `max_steps` and decay to 0 at `max_steps`'
)
parser.add_argument(
'--max_steps',
type=int,
required=True,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE, used in learning rate scheduler'
)
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
parser.add_argument('--from_pretrained', type=Path, required=True)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
def tokenizer_func(inputs):
ret = inputs.split(b'\2')
tokens, orig_pos = [], []
for i, r in enumerate(ret):
t = tokenizer.tokenize(r)
for tt in t:
tokens.append(tt)
orig_pos.append(i)
assert len(tokens) == len(orig_pos)
return tokens + orig_pos
def tokenizer_func_for_label(inputs):
return inputs.split(b'\2')
feature_map = {
b"B-PER": 0,
b"I-PER": 1,
b"B-ORG": 2,
b"I-ORG": 3,
b"B-LOC": 4,
b"I-LOC": 5,
b"O": 6,
}
other_tag_id = feature_map[b'O']
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'text_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer_func), propeller.data.TextColumn(
'label',
unk_id=other_tag_id,
vocab_dict=feature_map,
tokenizer=tokenizer_func_for_label, )
])
def before(seg, label):
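# (descriptive comment, added for clarity) `seg` comes from tokenizer_func above: the sub-word
# tokens followed by, for each token, the index of the \2-separated source token it was produced
# from. np.split recovers the two halves, and indexing the word-level labels with orig_pos
# projects each tag onto every sub-word piece of that word.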
seg, orig_pos = np.split(seg, 2)
aligned_label = label[orig_pos]
seg, _ = tokenizer.truncate(seg, [], args.max_seqlen)
aligned_label, _ = tokenizer.truncate(aligned_label, [], args.max_seqlen)
orig_pos, _ = tokenizer.truncate(orig_pos, [], args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(
seg
) #utils.data.build_1_pair(seg, max_seqlen=args.max_seqlen, cls_id=cls_id, sep_id=sep_id)
aligned_label = np.concatenate([[0], aligned_label, [0]], 0)
orig_pos = np.concatenate([[0], orig_pos, [0]])
assert len(aligned_label) == len(sentence) == len(orig_pos), (
len(aligned_label), len(sentence), len(orig_pos)) # aligned
return sentence, segments, aligned_label, label, orig_pos
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0,0,-100, other_tag_id + 1, 0))
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0,0,-100, other_tag_id + 1,0))
test_ds = feature_column.build_dataset('test', data_dir=os.path.join(args.data_dir, 'test'), shuffle=False, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0,0,-100, other_tag_id + 1,0))
def evaluate(model, dataset):
model.eval()
with P.no_grad():
chunkf1 = propeller.metrics.ChunkF1(None, None, None, len(feature_map))
for step, (ids, sids, aligned_label, label, orig_pos
) in enumerate(P.io.DataLoader(
dataset, batch_size=None)):
loss, logits = model(ids, sids)
#print('\n'.join(map(str, logits.numpy().tolist())))
assert orig_pos.shape[0] == logits.shape[0] == ids.shape[
0] == label.shape[0]
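# (descriptive comment, added for clarity) Undo the sub-word expansion from `before`: logits
# sharing the same orig_pos are grouped and averaged back to one vector per original token,
# special tokens (ids <= [MASK], the largest special id) are dropped, and the argmax of each
# merged vector is scored against the un-aligned word-level labels with ChunkF1.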
for pos, lo, la, id in zip(orig_pos.numpy(),
logits.numpy(),
label.numpy(), ids.numpy()):
_dic = OrderedDict()
assert len(pos) == len(lo) == len(id)
for _pos, _lo, _id in zip(pos, lo, id):
if _id > tokenizer.mask_id: # [MASK] is the largest special token
_dic.setdefault(_pos, []).append(_lo)
merged_lo = np.array(
[np.array(l).mean(0) for _, l in six.iteritems(_dic)])
merged_preds = np.argmax(merged_lo, -1)
la = la[np.where(la != (other_tag_id + 1))] #remove pad
if len(la) > len(merged_preds):
log.warn(
'accuracy loss due to truncation: label len:%d, truncate to %d'
% (len(la), len(merged_preds)))
merged_preds = np.pad(merged_preds,
[0, len(la) - len(merged_preds)],
mode='constant',
constant_values=7)
else:
assert len(la) == len(
merged_preds
), 'expect label == prediction, got %d vs %d' % (
la.shape, merged_preds.shape)
chunkf1.update((merged_preds, la, np.array(len(la))))
#f1 = f1_score(np.concatenate(all_label), np.concatenate(all_pred), average='macro')
f1 = chunkf1.eval()
model.train()
return f1
model = ErnieModelForTokenClassification.from_pretrained(
args.from_pretrained,
num_labels=len(feature_map),
name='',
has_pooler=False)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
param_name_to_exclude_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(args.max_steps,
int(args.warmup_proportion * args.max_steps)))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=args.wd,
apply_decay_param_fun=lambda n: not param_name_to_exclude_from_weight_decay.match(n),  # decay everything except LayerNorm and bias params
grad_clip=g_clip)
scaler = P.amp.GradScaler(enable=args.use_amp)
with LogWriter(
logdir=str(create_if_not_exists(args.save_dir / 'vdl'))) as log_writer:
with P.amp.auto_cast(enable=args.use_amp):
for epoch in range(args.epoch):
for step, (
ids, sids, aligned_label, label, orig_pos
) in enumerate(P.io.DataLoader(
train_ds, batch_size=None)):
loss, logits = model(ids, sids, labels=aligned_label)
#loss, logits = model(ids, sids, labels=aligned_label, loss_weights=P.cast(ids != 0, 'float32'))
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[step-%d] train loss %.5f lr %.3e scaling %.3e' % (
step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l,
_lr)
log.debug(msg)
log_writer.add_scalar('loss', _l, step=step)
log_writer.add_scalar('lr', _lr, step=step)
if step % 100 == 0:
f1 = evaluate(model, dev_ds)
log.debug('eval f1: %.5f' % f1)
log_writer.add_scalar('eval/f1', f1, step=step)
if args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
f1 = evaluate(model, dev_ds)
log.debug('final eval f1: %.5f' % f1)
log_writer.add_scalar('eval/f1', f1, step=step)
if args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import six
import json
from random import random
from tqdm import tqdm
from collections import OrderedDict
from functools import reduce, partial
import numpy as np
import multiprocessing
import pickle
import logging
from sklearn.metrics import f1_score
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as FD
import paddle.fluid.layers as L
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification, ErnieModelForTokenClassification
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.optimization import AdamW, LinearDecay
if __name__ == '__main__':
parser = propeller.ArgumentParser('NER model with ERNIE')
parser.add_argument('--max_seqlen', type=int, default=256)
parser.add_argument('--bsz', type=int, default=32)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--epoch', type=int, default=6)
parser.add_argument('--warmup_proportion', type=float, default=0.1, help='if use_lr_decay is set, '
'learning rate will rise to `lr` at `warmup_proportion` * `max_steps` and decay to 0 at `max_steps`')
parser.add_argument('--max_steps', type=int, required=True,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE, used in learning rate scheduler')
parser.add_argument('--from_pretrained', type=str, required=True)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument('--save_dir', type=str, default=None, help='model output directory')
parser.add_argument('--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
def tokenizer_func(inputs):
ret = inputs.split(b'\2')
tokens, orig_pos = [], []
for i, r in enumerate(ret):
t = tokenizer.tokenize(r)
for tt in t:
tokens.append(tt)
orig_pos.append(i)
assert len(tokens) == len(orig_pos)
return tokens + orig_pos
def tokenizer_func_for_label(inputs):
return inputs.split(b'\2')
feature_map = {
b"B-PER": 0,
b"I-PER": 1,
b"B-ORG": 2,
b"I-ORG": 3,
b"B-LOC": 4,
b"I-LOC": 5,
b"O": 6,
}
other_tag_id = feature_map[b'O']
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('text_a', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer_func),
propeller.data.TextColumn('label', unk_id=other_tag_id, vocab_dict=feature_map,
tokenizer=tokenizer_func_for_label,)
])
def before(seg, label):
seg, orig_pos = np.split(seg, 2)
aligned_label = label[orig_pos]
seg, _ = tokenizer.truncate(seg, [], args.max_seqlen)
aligned_label, _ = tokenizer.truncate(aligned_label, [], args.max_seqlen)
orig_pos, _ = tokenizer.truncate(orig_pos, [], args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg) #utils.data.build_1_pair(seg, max_seqlen=args.max_seqlen, cls_id=cls_id, sep_id=sep_id)
aligned_label = np.concatenate([[0], aligned_label, [0]], 0)
orig_pos = np.concatenate([[0], orig_pos, [0]])
assert len(aligned_label) == len(sentence) == len(orig_pos), (len(aligned_label), len(sentence), len(orig_pos)) # aligned
return sentence, segments, aligned_label, label, orig_pos
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0,0,0, other_tag_id + 1, 0))
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0,0,0, other_tag_id + 1,0))
test_ds = feature_column.build_dataset('test', data_dir=os.path.join(args.data_dir, 'test'), shuffle=False, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0,0,0, other_tag_id + 1,0))
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1, args.max_seqlen])
types = ('int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
test_ds.data_shapes = shapes
test_ds.data_types = types
place = F.CUDAPlace(0)
@FD.no_grad
def evaluate(model, dataset):
model.eval()
chunkf1 = propeller.metrics.ChunkF1(None, None, None, len(feature_map))
for step, (ids, sids, aligned_label, label, orig_pos) in enumerate(tqdm(dataset.start(place))):
loss, logits = model(ids, sids)
#print('\n'.join(map(str, logits.numpy().tolist())))
assert orig_pos.shape[0] == logits.shape[0] == ids.shape[0] == label.shape[0]
for pos, lo, la, id in zip(orig_pos.numpy(), logits.numpy(), label.numpy(), ids.numpy()):
_dic = OrderedDict()
assert len(pos) ==len(lo) == len(id)
for _pos, _lo, _id in zip(pos, lo, id):
if _id > tokenizer.mask_id: # [MASK] is the largest special token
_dic.setdefault(_pos, []).append(_lo)
merged_lo = np.array([np.array(l).mean(0) for _, l in six.iteritems(_dic)])
merged_preds = np.argmax(merged_lo, -1)
la = la[np.where(la != (other_tag_id + 1))] #remove pad
if len(la) > len(merged_preds):
log.warn('accuracy loss due to truncation: label len:%d, truncate to %d' % (len(la), len(merged_preds)))
merged_preds = np.pad(merged_preds, [0, len(la) - len(merged_preds)], mode='constant', constant_values=7)
else:
assert len(la) == len(merged_preds), 'expect label == prediction, got %d vs %d' % (la.shape, merged_preds.shape)
chunkf1.update((merged_preds, la, np.array(len(la))))
#f1 = f1_score(np.concatenate(all_label), np.concatenate(all_pred), average='macro')
f1 = chunkf1.eval()
model.train()
return f1
with FD.guard(place):
model = ErnieModelForTokenClassification.from_pretrained(args.from_pretrained, num_labels=len(feature_map), name='', has_pooler=False)
g_clip = F.clip.GradientClipByGlobalNorm(1.0) #experimental
opt = AdamW(
learning_rate=LinearDecay(args.lr, int(args.warmup_proportion * args.max_steps), args.max_steps),
parameter_list=model.parameters(),
weight_decay=args.wd, grad_clip=g_clip)
#opt = F.optimizer.AdamOptimizer(learning_rate=LinearDecay(args.lr, args.warmup_steps, args.max_steps), parameter_list=model.parameters())
for epoch in range(args.epoch):
for step, (ids, sids, aligned_label, label, orig_pos) in enumerate(tqdm(train_ds.start(place))):
loss, logits = model(ids, sids, labels=aligned_label, loss_weights=L.cast(ids > tokenizer.mask_id, 'float32')) # [MASK] is the largest special token
loss.backward()
if step % 10 == 0 :
log.debug('train loss %.5f, lr %.3e' % (loss.numpy(), opt.current_step_lr()))
opt.minimize(loss)
model.clear_gradients()
if step % 100 == 0 :
f1 = evaluate(model, dev_ds)
log.debug('eval f1: %.5f' % f1)
f1 = evaluate(model, dev_ds)
log.debug('final eval f1: %.5f' % f1)
if args.save_dir is not None:
F.save_dygraph(model.state_dict(), args.save_dir)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import json
from random import random
from tqdm import tqdm
from functools import reduce, partial
from pathlib import Path
from visualdl import LogWriter
import numpy as np
import logging
import argparse
import paddle as P
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
log = logging.getLogger()
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
#from ernie.optimization import AdamW, LinearDecay
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
parser = argparse.ArgumentParser('classify model with ERNIE')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument(
'--max_seqlen',
type=int,
default=128,
help='max sentence length, should not be greater than 512')
parser.add_argument('--bsz', type=int, default=32, help='batchsize')
parser.add_argument('--epoch', type=int, default=3, help='epoch')
parser.add_argument(
'--data_dir',
type=str,
required=True,
help='data directory includes train / develop data')
parser.add_argument(
'--max_steps',
type=int,
required=True,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument('--eval', action='store_true')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
model = ErnieModelForSequenceClassification.from_pretrained(
args.from_pretrained, num_labels=3, name='')
if not args.eval:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn('label'),
])
def map_fn(seg_a, label):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, [])
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(
args.max_steps, int(args.warmup_proportion * args.max_steps)))
param_name_to_exclude_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=args.wd,
apply_decay_param_fun=lambda n: not param_name_to_exclude_from_weight_decay.match(n),  # decay everything except LayerNorm and bias params
grad_clip=g_clip)
scaler = P.amp.GradScaler(enable=args.use_amp)
with LogWriter(logdir=str(create_if_not_exists(args.save_dir /
'vdl'))) as log_writer:
with P.amp.auto_cast(enable=args.use_amp):
for epoch in range(args.epoch):
for step, d in enumerate(
P.io.DataLoader(
train_ds, places=P.CUDAPlace(0), batch_size=None)):
ids, sids, label = d
loss, _ = model(ids, sids, labels=label)
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[step-%d] train loss %.5f lr %.3e scaling %.3e' % (
step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (
step, _l, _lr)
log.debug(msg)
log_writer.add_scalar('loss', _l, step=step)
log_writer.add_scalar('lr', _lr, step=step)
if step % 100 == 0:
acc = []
with P.no_grad():
model.eval()
for step, d in enumerate(
P.io.DataLoader(
dev_ds,
places=P.CUDAPlace(0),
batch_size=None)):
ids, sids, label = d
loss, logits = model(ids, sids, labels=label)
a = (logits.argmax(-1) == label)
acc.append(a.numpy())
model.train()
acc = np.concatenate(acc).mean()
log_writer.add_scalar('eval/acc', acc, step=step)
log.debug('acc %.5f' % acc)
if args.save_dir is not None:
P.save(model.state_dict(),
args.save_dir / 'ckpt.bin')
if args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
else:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
])
sd = P.load(str(args.save_dir / 'ckpt.bin'))  # paddle.load returns the state dict itself, not a tuple
model.set_dict(sd)
model.eval()
def map_fn(seg_a):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, [])
return sentence, segments
predict_ds = feature_column.build_dataset_from_stdin('predict') \
.map(map_fn) \
.padded_batch(args.bsz)
for step, (ids, sids) in enumerate(
P.io.DataLoader(
predict_ds, places=P.CUDAPlace(0), batch_size=None)):
_, logits = model(ids, sids)
pred = logits.numpy().argmax(-1)
print('\n'.join(map(str, pred.tolist())))
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import json
from random import random
from tqdm import tqdm
from functools import reduce, partial
import numpy as np
import logging
import argparse
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as FD
import paddle.fluid.layers as L
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
log = logging.getLogger()
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from ernie.optimization import AdamW, LinearDecay
if __name__ == '__main__':
parser = argparse.ArgumentParser('classify model with ERNIE')
parser.add_argument('--from_pretrained', type=str, required=True, help='pretrained model directory or tag')
parser.add_argument('--max_seqlen', type=int, default=128, help='max sentence length, should not be greater than 512')
parser.add_argument('--bsz', type=int, default=32, help='batchsize')
parser.add_argument('--epoch', type=int, default=3, help='epoch')
parser.add_argument('--data_dir', type=str, required=True, help='data directory includes train / develop data')
parser.add_argument('--max_steps', type=int, required=True, help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument('--eval', action='store_true')
parser.add_argument('--save_dir', type=str, default=None, help='model output directory')
parser.add_argument('--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
place = F.CUDAPlace(0)
with FD.guard(place):
model = ErnieModelForSequenceClassification.from_pretrained(args.from_pretrained, num_labels=3, name='')
if not args.eval:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('seg_a', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn('label'),
])
def map_fn(seg_a, label):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, [])
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz)
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
g_clip = F.clip.GradientClipByGlobalNorm(1.0) #experimental
opt = AdamW(learning_rate=LinearDecay(
args.lr,
int(args.warmup_proportion * args.max_steps), args.max_steps),
parameter_list=model.parameters(),
weight_decay=args.wd,
grad_clip=g_clip)
for epoch in range(args.epoch):
for step, d in enumerate(tqdm(train_ds.start(place), desc='training')):
ids, sids, label = d
loss, _ = model(ids, sids, labels=label)
loss.backward()
if step % 10 == 0:
log.debug('train loss %.5f lr %.3e' % (loss.numpy(), opt.current_step_lr()))
opt.minimize(loss)
model.clear_gradients()
if step % 100 == 0:
acc = []
with FD.base._switch_tracer_mode_guard_(is_train=False):
model.eval()
for step, d in enumerate(tqdm(dev_ds.start(place), desc='evaluating %d' % epoch)):
ids, sids, label = d
loss, logits = model(ids, sids, labels=label)
#print('\n'.join(map(str, logits.numpy().tolist())))
a = L.argmax(logits, -1) == label
acc.append(a.numpy())
model.train()
log.debug('acc %.5f' % np.concatenate(acc).mean())
if args.save_dir is not None:
F.save_dygraph(model.state_dict(), args.save_dir)
else:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('seg_a', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
])
assert args.save_dir is not None
sd, _ = FD.load_dygraph(args.save_dir)
model.set_dict(sd)
model.eval()
def map_fn(seg_a):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, [])
return sentence, segments
predict_ds = feature_column.build_dataset_from_stdin('predict') \
.map(map_fn) \
.padded_batch(args.bsz)
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen])
types = ('int64', 'int64')
predict_ds.data_shapes = shapes
predict_ds.data_types = types
for step, (ids, sids) in enumerate(predict_ds.start(place)):
_, logits = model(ids, sids)
pred = logits.numpy().argmax(-1)
print('\n'.join(map(str, pred.tolist())))
......@@ -17,7 +17,6 @@ from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import sys
import re
import six
......@@ -29,7 +28,8 @@ import nltk
import unicodedata
from collections import namedtuple
RawResult = namedtuple("RawResult", ["unique_id", "start_logits", "end_logits"])
RawResult = namedtuple("RawResult",
["unique_id", "start_logits", "end_logits"])
log = logging.getLogger(__name__)
......@@ -340,7 +340,7 @@ def _get_final_text(pred_text, orig_text, tokenizer):
def make_results(vocab, all_examples, all_features, all_results, n_best_size,
max_answer_length, do_lower_case):
max_answer_length, do_lower_case):
"""Write final predictions to the json file and log-odds of null if needed."""
tokenizer = _BasicTokenizer(do_lower_case)
example_index_to_features = collections.defaultdict(list)
......@@ -384,7 +384,8 @@ def make_results(vocab, all_examples, all_features, all_results, n_best_size,
continue
if end_index not in feature.token_to_orig_map:
continue
if not feature.token_is_max_context.get(start_index, False):
if not feature.token_is_max_context.get(start_index,
False):
continue
if end_index < start_index:
continue
......@@ -414,8 +415,8 @@ def make_results(vocab, all_examples, all_features, all_results, n_best_size,
break
feature = features[pred.feature_index]
if pred.start_index > 0: # this is a non-null prediction
tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1
)]
tok_tokens = feature.tokens[pred.start_index:(pred.end_index +
1)]
orig_doc_start = feature.token_to_orig_map[pred.start_index]
orig_doc_end = feature.token_to_orig_map[pred.end_index]
orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end +
......@@ -483,9 +484,11 @@ def mixed_segmentation(in_str, rm_punc=False):
in_str = in_str.lower().strip()
segs_out = []
temp_str = ""
sp_char = ['-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=',
',', '。', ':', '?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、',
'「', '」', '(', ')', '-', '~', '『', '』']
sp_char = [
'-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', ',', '。', ':',
'?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、', '「', '」', '(',
')', '-', '~', '『', '』'
]
for char in in_str:
if rm_punc and char in sp_char:
continue
......@@ -510,9 +513,11 @@ def mixed_segmentation(in_str, rm_punc=False):
def remove_punctuation(in_str):
"""remove punctuation"""
in_str = in_str.lower().strip()
sp_char = ['-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=',
',', '。', ':', '?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、',
'「', '」', '(', ')', '-', '~', '『', '』']
sp_char = [
'-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', ',', '。', ':',
'?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、', '「', '」', '(',
')', '-', '~', '『', '』'
]
out_segs = []
for char in in_str:
if char in sp_char:
......@@ -525,7 +530,7 @@ def remove_punctuation(in_str):
# find longest common string
def find_lcs(s1, s2):
"""find_lcs"""
m = [[0 for i in range(len(s2)+1)] for j in range(len(s1)+1)]
m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)]
mmax = 0
p = 0
for i in range(len(s1)):
......@@ -535,7 +540,7 @@ def find_lcs(s1, s2):
if m[i + 1][j + 1] > mmax:
mmax = m[i + 1][j + 1]
p = i + 1
return s1[p - mmax: p], mmax
return s1[p - mmax:p], mmax
def calc_f1_score(answers, prediction):
......@@ -548,9 +553,9 @@ def calc_f1_score(answers, prediction):
if lcs_len == 0:
f1_scores.append(0)
continue
precision = 1.0 * lcs_len / len(prediction_segs)
recall = 1.0 * lcs_len / len(ans_segs)
f1 = (2 * precision * recall) / (precision + recall)
precision = 1.0 * lcs_len / len(prediction_segs)
recall = 1.0 * lcs_len / len(ans_segs)
f1 = (2 * precision * recall) / (precision + recall)
f1_scores.append(f1)
return max(f1_scores)
......@@ -578,20 +583,20 @@ def evaluate(ground_truth_file, prediction_file):
context_text = instance['context'].strip()
for qas in instance['qas']:
total_count += 1
query_id = qas['id'].strip()
query_text = qas['question'].strip()
answers = [ans["text"] for ans in qas["answers"]]
query_id = qas['id'].strip()
query_text = qas['question'].strip()
answers = [ans["text"] for ans in qas["answers"]]
if query_id not in prediction_file:
sys.stderr.write('Unanswered question: {}\n'.format(query_id))
sys.stderr.write('Unanswered question: {}\n'.format(
query_id))
skip_count += 1
continue
prediction = prediction_file[query_id]
prediction = prediction_file[query_id]
f1 += calc_f1_score(answers, prediction)
em += calc_em_score(answers, prediction)
f1_score = f1 / total_count
em_score = em / total_count
return [f1_score, em_score, total_count, skip_count]
......@@ -20,33 +20,26 @@ from __future__ import unicode_literals
import sys
import argparse
import logging
from functools import partial
from io import open
open = partial(open, encoding='utf-8')
import json
from collections import namedtuple
log = logging.getLogger(__name__)
Example = namedtuple('Example', [
'qas_id', 'question_text', 'doc_tokens', 'orig_answer_text',
'start_position', 'end_position'
])
Example = namedtuple('Example',
['qas_id',
'question_text',
'doc_tokens',
'orig_answer_text',
'start_position',
'end_position'])
Feature = namedtuple("Feature",
["unique_id",
"example_index",
"doc_span_index",
"tokens",
"token_to_orig_map",
"token_is_max_context",
"token_ids",
"position_ids",
"text_type_ids",
"start_position",
"end_position"])
Feature = namedtuple("Feature", [
"unique_id", "example_index", "doc_span_index", "tokens",
"token_to_orig_map", "token_is_max_context", "token_ids", "position_ids",
"text_type_ids", "start_position", "end_position"
])
def _tokenize_chinese_chars(text):
......@@ -113,7 +106,8 @@ def _check_is_max_context(doc_spans, cur_span_index, position):
return cur_span_index == best_span_index
def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, orig_answer_text):
def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer,
orig_answer_text):
"""improve answer span"""
tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text))
......@@ -140,7 +134,7 @@ def read_files(input_file, is_training):
start_pos = None
end_pos = None
orig_answer_text = None
if is_training:
if len(qa["answers"]) != 1:
raise ValueError(
......@@ -151,17 +145,20 @@ def read_files(input_file, is_training):
orig_answer_text = answer["text"]
answer_offset = answer["answer_start"]
answer_length = len(orig_answer_text)
doc_tokens = [paragraph_text[:answer_offset],
paragraph_text[answer_offset: answer_offset + answer_length],
paragraph_text[answer_offset + answer_length:]]
doc_tokens = [
paragraph_text[:answer_offset], paragraph_text[
answer_offset:answer_offset + answer_length],
paragraph_text[answer_offset + answer_length:]
]
start_pos = 1
end_pos = 1
actual_text = " ".join(doc_tokens[start_pos:(end_pos + 1)])
actual_text = " ".join(doc_tokens[start_pos:(end_pos +
1)])
if actual_text.find(orig_answer_text) == -1:
log.info("Could not find answer: '%s' vs. '%s'",
actual_text, orig_answer_text)
actual_text, orig_answer_text)
continue
else:
doc_tokens = _tokenize_chinese_chars(paragraph_text)
......@@ -177,7 +174,13 @@ def read_files(input_file, is_training):
return examples
def convert_example_to_features(examples, max_seq_length, tokenizer, is_training, doc_stride=128, max_query_length=64):
def convert_example_to_features(examples,
max_seq_length,
tokenizer,
is_training,
doc_stride=128,
max_query_length=64):
"""convert example to feature"""
features = []
unique_id = 1000000000
......@@ -185,7 +188,7 @@ def convert_example_to_features(examples, max_seq_length, tokenizer, is_training
for (example_index, example) in enumerate(examples):
query_tokens = tokenizer.tokenize(example.question_text)
if len(query_tokens) > max_query_length:
query_tokens = query_tokens[0: max_query_length]
query_tokens = query_tokens[0:max_query_length]
tok_to_orig_index = []
orig_to_tok_index = []
all_doc_tokens = []
......@@ -202,7 +205,8 @@ def convert_example_to_features(examples, max_seq_length, tokenizer, is_training
if is_training:
tok_start_position = orig_to_tok_index[example.start_position]
if example.end_position < len(example.doc_tokens) - 1:
tok_end_position = orig_to_tok_index[example.end_position + 1] - 1
tok_end_position = orig_to_tok_index[example.end_position +
1] - 1
else:
tok_end_position = len(all_doc_tokens) - 1
(tok_start_position, tok_end_position) = _improve_answer_span(
......@@ -297,4 +301,3 @@ if __name__ == "__main__":
features = convert_example_to_features(examples, 512, tokenizer, True)
log.debug(len(examples))
log.debug(len(features))
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import logging
import re
import numpy as np
import paddle as P
import paddle.distributed.fleet as fleet
from propeller.paddle.train.hooks import RunHook
log = logging.getLogger(__name__)
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
def optimization(
loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
scheduler='linear_warmup_decay',
use_fp16=False, ):
"""do backword for static"""
def exclude_from_weight_decay(param):
# strip the `.master` suffix of fp16 master weights; rstrip('.master') would drop arbitrary trailing characters
name = param[:-len('.master')] if param.endswith('.master') else param
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
g_clip = P.nn.ClipGradByGlobalNorm(1.0)
lr_scheduler = P.optimizer.lr.LambdaDecay(
learning_rate,
get_warmup_and_linear_decay(num_train_steps, warmup_steps))
optimizer = P.optimizer.AdamW(
learning_rate=lr_scheduler,
weight_decay=weight_decay,
grad_clip=g_clip,
apply_decay_param_fun=exclude_from_weight_decay)
if use_fp16:
log.info('AMP activated')
if weight_decay > 0.:
raise ValueError(
'paddle amp will ignore `weight_decay`, see https://github.com/PaddlePaddle/Paddle/issues/29794'
)
#amp_list = P.fluid.contrib.mixed_precision.AutoMixedPrecisionLists(
# custom_white_list=['softmax', 'layer_norm', 'gelu'])
optimizer = P.fluid.contrib.mixed_precision.decorate(
optimizer, init_loss_scaling=3**15, use_dynamic_loss_scaling=True)
_, param_grads = optimizer.minimize(loss)
loss_scaling = P.static.default_main_program().global_block().var(
'loss_scaling_0')
else:
_, param_grads = optimizer.minimize(loss)
loss_scaling = None
class LRStepHook(RunHook):
def after_run(self, _, __):
lr_scheduler.step()
log.debug('lr step: %.5f' % lr_scheduler.get_lr())
return LRStepHook(), loss_scaling
......@@ -4,7 +4,7 @@ only **mask word** strategy from [Ernie1.0](https://arxiv.org/pdf/1904.09223.pdf
1. make pretrain data
we use documents from multiple data sources (e.g. Wikipedia) to pretrain.
input text should be segmented with spaces (even in Chinese; this segmentation is used for *mask word*).
each line corresponds to a *sentence*.
an empty line indicates the end of a document.
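A minimal sketch of that layout (the file name and the sentences below are made up purely for illustration):

```python
# Write a tiny corpus in the expected pretrain format: each line is one
# space-segmented sentence, and an empty line marks the end of a document.
# The file name and the sentences are hypothetical examples.
docs = [
    ["ernie 是 百度 提出 的 预训练 模型", "它 支持 多种 下游 任务"],
    ["paddle 2.0 提供 动态图 训练 接口"],
]

with open("pretrain_corpus.txt", "w", encoding="utf8") as f:
    for doc in docs:
        for sentence in doc:
            f.write(sentence + "\n")
        f.write("\n")  # empty line: end of this document
```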
......@@ -41,4 +41,3 @@ python3 -m paddle.distributed.launch \
--from_pretrained /path/to/ernie1.0_pretrain_dir/
```
......@@ -15,16 +15,21 @@ import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
def gen_segs(segment_piece):
if len(segment_piece) == 0:
return []
else:
return [min(segment_piece)] * len(segment_piece)
whit_space_pat = re.compile(r'\S+')
def segment(inputs, inputs_segment):
ret = [r.span() for r in whit_space_pat.finditer(inputs)]
ret = [(inputs[s: e], gen_segs(inputs_segment[s: e])) for i, (s, e) in enumerate(ret)]
ret = [(inputs[s:e], gen_segs(inputs_segment[s:e]))
for i, (s, e) in enumerate(ret)]
return ret
......@@ -36,11 +41,13 @@ def tokenize(sen, seg_info):
sen = sen.lower()
res_word, res_segments = [], []
for match in pat.finditer(sen):
words, pos = _wordpiece(match.group(0), vocab=vocab_set, unk_token='[UNK]')
words, pos = _wordpiece(
match.group(0), vocab=vocab_set, unk_token='[UNK]')
start_of_word = match.span()[0]
for w, p in zip(words, pos):
res_word.append(w)
res_segments.append(gen_segs(seg_info[p[0] + start_of_word: p[1] + start_of_word]))
res_segments.append(
gen_segs(seg_info[p[0] + start_of_word:p[1] + start_of_word]))
return res_word, res_segments
......@@ -63,22 +70,32 @@ def parse_txt(line):
print('****', file=sys.stderr)
ret_line = [vocab.get(r, vocab['[UNK]']) for r in ret_line]
ret_seginfo = [[-1] if i == [] else i for i in ret_seginfo] #for sentence piece only
ret_seginfo = [[-1] if i == [] else i
for i in ret_seginfo] #for sentence piece only
ret_seginfo = [min(i) for i in ret_seginfo]
return ret_line, ret_seginfo
def build_example(slots):
txt, seginfo = slots
txt_fe_list = feature_pb2.FeatureList(feature=[feature_pb2.Feature(int64_list=feature_pb2.Int64List(value=t)) for t in txt])
segsinfo_fe_list = feature_pb2.FeatureList(feature=[feature_pb2.Feature(int64_list=feature_pb2.Int64List(value=s)) for s in seginfo])
assert len(txt_fe_list.feature) == len(segsinfo_fe_list.feature), 'txt[%d] and seginfo[%d] size not match' % (len(txt_fe_list.feature), len(segsinfo_fe_list.feature))
txt_fe_list = feature_pb2.FeatureList(feature=[
feature_pb2.Feature(int64_list=feature_pb2.Int64List(value=t))
for t in txt
])
segsinfo_fe_list = feature_pb2.FeatureList(feature=[
feature_pb2.Feature(int64_list=feature_pb2.Int64List(value=s))
for s in seginfo
])
assert len(txt_fe_list.feature) == len(
segsinfo_fe_list.feature), 'txt[%d] and seginfo[%d] size not match' % (
len(txt_fe_list.feature), len(segsinfo_fe_list.feature))
features = {
'txt': txt_fe_list,
'txt': txt_fe_list,
'segs': segsinfo_fe_list,
}
ex = example_pb2.SequenceExample(feature_lists=feature_pb2.FeatureLists(feature_list=features))
ex = example_pb2.SequenceExample(feature_lists=feature_pb2.FeatureLists(
feature_list=features))
return ex
......@@ -122,15 +139,17 @@ if __name__ == '__main__':
args = parser.parse_args()
log.setLevel(logging.DEBUG)
from ernie.tokenizing_ernie import _wordpiece
pat = re.compile(r'([a-zA-Z0-9]+|\S)')
vocab = {j.strip().split(b'\t')[0].decode('utf8'): i for i, j in enumerate(open(args.vocab, 'rb'))}
vocab = {
j.strip().split(b'\t')[0].decode('utf8'): i
for i, j in enumerate(open(args.vocab, 'rb'))
}
vocab_set = set(vocab.keys())
with open(args.src, 'rb') as from_file, gzip.open(args.tgt, 'wb') as to_file:
with open(args.src, 'rb') as from_file, gzip.open(args.tgt,
'wb') as to_file:
log.info('making gz from bb %s ==> %s' % (from_file, to_file))
build_bb(from_file, to_file)
log.info('done: %s' % to_file)
......@@ -24,12 +24,11 @@ import re
import logging
import six
from glob import glob
from pathlib import Path
from functools import reduce, partial
import itertools
import paddle
import paddle.fluid as F
import paddle.fluid.layers as L
import paddle as P
import sentencepiece as spm
import json
......@@ -39,12 +38,13 @@ import random as r
from ernie.modeling_ernie import ErnieModelForPretraining
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.optimization import optimization
#from ernie.optimization import AdamW, LinearDecay
import propeller as propeller_base
import propeller.paddle as propeller
from propeller.paddle.data import Dataset
from propeller import log
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
......@@ -53,6 +53,7 @@ if six.PY3:
from itertools import accumulate
else:
import operator
def accumulate(iterable, func=operator.add, initial=None):
'Return running totals'
# accumulate([1,2,3,4,5]) --> 1 3 6 10 15
......@@ -71,43 +72,10 @@ else:
yield total
def ernie_pretrain_model_fn(features, mode, params, run_config):
"""propeller Model wraper for paddle-ERNIE """
src_ids, sent_ids, mlm_label, mask_pos, nsp_label = features
ernie = ErnieModelForPretraining(params, name='')
total_loss, mlm_loss, nsp_loss = ernie(src_ids, sent_ids, labels=mlm_label, mlm_pos=mask_pos, nsp_labels=nsp_label)
metrics = None
inf_spec = None
propeller.summary.scalar('loss', total_loss)
propeller.summary.scalar('nsp-loss', nsp_loss)
propeller.summary.scalar('mlm-loss', mlm_loss)
scheduled_lr, loss_scale_coef = optimization(
loss=total_loss,
warmup_steps=params['warmup_steps'],
num_train_steps=run_config.max_steps,
learning_rate=params['learning_rate'],
train_program=F.default_main_program(),
startup_prog=F.default_startup_program(),
weight_decay=params['weight_decay'],
scheduler="linear_warmup_decay",
use_fp16=params['use_fp16'],
)
propeller.summary.scalar('lr', scheduled_lr)
if params['use_fp16']:
propeller.summary.scalar('loss_scale', loss_scale_coef)
pred = [total_loss]
return propeller.ModelSpec(loss=total_loss, mode=mode, metrics=metrics, predictions=pred)
def truncate_sentence(seq, from_length, to_length):
random_begin = np.random.randint(0, np.maximum(0, from_length - to_length) + 1)
return seq[random_begin: random_begin + to_length]
random_begin = np.random.randint(
0, np.maximum(0, from_length - to_length) + 1)
return seq[random_begin:random_begin + to_length]
def build_pair(seg_a, seg_b, max_seqlen, vocab):
......@@ -119,9 +87,11 @@ def build_pair(seg_a, seg_b, max_seqlen, vocab):
ml = max_seqlen - 3
half_ml = ml // 2
if a_len > b_len:
a_len_truncated, b_len_truncated = np.maximum(half_ml, ml - b_len), np.minimum(half_ml, b_len)
a_len_truncated, b_len_truncated = np.maximum(
half_ml, ml - b_len), np.minimum(half_ml, b_len)
else:
a_len_truncated, b_len_truncated = np.minimum(half_ml, a_len), np.maximum(half_ml, ml - a_len)
a_len_truncated, b_len_truncated = np.minimum(
half_ml, a_len), np.maximum(half_ml, ml - a_len)
seg_a = truncate_sentence(seg_a, a_len, a_len_truncated)
seg_b = truncate_sentence(seg_b, b_len, b_len_truncated)
......@@ -131,9 +101,11 @@ def build_pair(seg_a, seg_b, max_seqlen, vocab):
token_type_a = np.ones_like(seg_a_txt, dtype=np.int64) * 0
token_type_b = np.ones_like(seg_b_txt, dtype=np.int64) * 1
sen_emb = np.concatenate([[cls_id], seg_a_txt, [sep_id], seg_b_txt, [sep_id]], 0)
sen_emb = np.concatenate(
[[cls_id], seg_a_txt, [sep_id], seg_b_txt, [sep_id]], 0)
info_emb = np.concatenate([[-1], seg_a_info, [-1], seg_b_info, [-1]], 0)
token_type_emb = np.concatenate([[0], token_type_a, [0], token_type_b, [1]], 0)
token_type_emb = np.concatenate(
[[0], token_type_a, [0], token_type_b, [1]], 0)
return sen_emb, info_emb, token_type_emb
......@@ -145,24 +117,25 @@ def apply_mask(sentence, seg_info, mask_rate, vocab_size, vocab):
batch_size, seqlen = shape
invalid_pos = np.where(seg_info == -1)
seg_info += 1 #no more =1
seg_info += 1 #no more =1
seg_info_flatten = seg_info.reshape([-1])
seg_info_incr = seg_info_flatten - np.roll(seg_info_flatten, shift=1)
seg_info = np.add.accumulate(np.array([0 if s == 0 else 1 for s in seg_info_incr])).reshape(shape)
seg_info = np.add.accumulate(
np.array([0 if s == 0 else 1 for s in seg_info_incr])).reshape(shape)
seg_info[invalid_pos] = -1
u_seginfo = np.array([i for i in np.unique(seg_info) if i != -1])
np.random.shuffle(u_seginfo)
sample_num = max(1, int(len(u_seginfo) * mask_rate))
u_seginfo = u_seginfo[: sample_num]
u_seginfo = u_seginfo[:sample_num]
mask = reduce(np.logical_or, [seg_info == i for i in u_seginfo])
mask[:, 0] = False # ignore CLS head
mask[:, 0] = False # ignore CLS head
rand = np.random.rand(*shape)
choose_original = rand < 0.1 #
choose_random_id = (0.1 < rand) & (rand < 0.2) #
choose_mask_id = 0.2 < rand #
choose_original = rand < 0.1 #
choose_random_id = (0.1 < rand) & (rand < 0.2) #
choose_mask_id = 0.2 < rand #
random_id = np.random.randint(1, vocab_size, size=shape)
replace_id = mask_id * choose_mask_id + \
......@@ -172,30 +145,39 @@ def apply_mask(sentence, seg_info, mask_rate, vocab_size, vocab):
mask_pos = np.where(mask)
#mask_pos_flatten = list(map(lambda idx: idx[0] * seqlen + idx[1], zip(*mask_pos))) #transpose
mask_label = sentence[mask_pos]
sentence[mask_pos] = replace_id[mask_pos] #overwrite
sentence[mask_pos] = replace_id[mask_pos] #overwrite
#log.debug(mask_pos_flatten)
return sentence, np.stack(mask_pos, -1), mask_label
def make_pretrain_dataset(name, dir, vocab, hparams, args):
def make_pretrain_dataset(name, dir, vocab, args):
gz_files = glob(dir)
if not gz_files:
raise ValueError('train data not found in %s' % dir)
raise ValueError('train data not found in %s' % gz_files)
log.info('read from %s' % '\n'.join(gz_files))
max_input_seqlen = args.max_seqlen
max_pretrain_seqlen = lambda: max_input_seqlen if r.random() > 0.15 else r.randint(1, max_input_seqlen) # short sentence rate
max_input_seqlen = args.max_seqlen
max_pretrain_seqlen = lambda: max_input_seqlen if r.random() > 0.15 else r.randint(1, max_input_seqlen) # short sentence rate
def _parse_gz(record_str): # function that takes python_str as input
ex = propeller.data.example_pb2.SequenceExample()
def _parse_gz(record_str): # function that takes python_str as input
ex = propeller_base.data.example_pb2.SequenceExample()
ex.ParseFromString(record_str)
doc = [np.array(f.int64_list.value, dtype=np.int64) for f in ex.feature_lists.feature_list['txt'].feature]
doc_seg = [np.array(f.int64_list.value, dtype=np.int64) for f in ex.feature_lists.feature_list['segs'].feature]
doc = [
np.array(
f.int64_list.value, dtype=np.int64)
for f in ex.feature_lists.feature_list['txt'].feature
]
doc_seg = [
np.array(
f.int64_list.value, dtype=np.int64)
for f in ex.feature_lists.feature_list['segs'].feature
]
return doc, doc_seg
def bb_to_segments(filename):
ds = Dataset.from_record_file(filename).map(_parse_gz)
iterable = iter(ds)
def gen():
buf, size = [], 0
iterator = iter(ds)
......@@ -205,7 +187,9 @@ def make_pretrain_dataset(name, dir, vocab, hparams, args):
#line = np.array(sp_model.SampleEncodeAsIds(line, -1, 0.1), dtype=np.int64) # 0.1 means large variance on sentence piece result
if len(line) == 0:
continue
line = np.array(line) # 0.1 means large variance on sentence piece result
line = np.array(
line
) # 0.1 means large variance on sentence piece result
line_seg = np.array(line_seg)
size += len(line)
buf.append(np.stack([line, line_seg]).transpose())
......@@ -213,8 +197,9 @@ def make_pretrain_dataset(name, dir, vocab, hparams, args):
yield buf,
buf, size = [], 0
if len(buf) != 0:
yield buf,
yield buf,
buf, size = [], 0
return Dataset.from_generator_func(gen)
def sample_negative(dataset):
......@@ -228,10 +213,13 @@ def make_pretrain_dataset(name, dir, vocab, hparams, args):
seqlen_a = r.randint(1, seqlen)
seqlen_b = seqlen - seqlen_a
len_a = list(accumulate([len(c) for c in chunk_a]))
buf_a = [c for c, l in zip(chunk_a, len_a) if l < seqlen_a] #always take the first one
buf_b = [c for c, l in zip(chunk_a, len_a) if seqlen_a <= l < seqlen]
buf_a = [c for c, l in zip(chunk_a, len_a)
if l < seqlen_a] #always take the first one
buf_b = [
c for c, l in zip(chunk_a, len_a) if seqlen_a <= l < seqlen
]
if r.random() < 0.5: #pos or neg
if r.random() < 0.5: #pos or neg
label = np.int64(1)
else:
label = np.int64(0)
......@@ -243,7 +231,9 @@ def make_pretrain_dataset(name, dir, vocab, hparams, args):
b = np.concatenate(buf_b)
#log.debug(a)
#log.debug(b)
sample, seg_info, token_type = build_pair(a, b, args.max_seqlen, vocab) #negative sample might exceed max seqlen
sample, seg_info, token_type = build_pair(
a, b, args.max_seqlen,
vocab) #negative sample might exceed max seqlen
yield sample, seg_info, token_type, label
ds = propeller.data.Dataset.from_generator_func(gen)
......@@ -251,14 +241,20 @@ def make_pretrain_dataset(name, dir, vocab, hparams, args):
def after(sentence, seg_info, segments, label):
batch_size, seqlen = sentence.shape
sentence, mask_pos, mlm_label = apply_mask(sentence, seg_info, args.mask_rate, hparams.vocab_size, vocab)
sentence, mask_pos, mlm_label = apply_mask(sentence, seg_info,
args.mask_rate,
len(vocab), vocab)
ra = r.random()
if ra < args.check:
print('***')
print('\n'.join([str(j) + '\t' + '|'.join(map(str, i)) for i, j in zip(sentence.tolist(), label)]))
print('\n'.join([
str(j) + '\t' + '|'.join(map(str, i))
for i, j in zip(sentence.tolist(), label)
]))
print('***')
print('\n'.join(['|'.join(map(str, i)) for i in seg_info.tolist()]))
print('\n'.join(
['|'.join(map(str, i)) for i in seg_info.tolist()]))
print('***')
print('|'.join(map(str, mlm_label.tolist())))
print('***')
......@@ -269,13 +265,21 @@ def make_pretrain_dataset(name, dir, vocab, hparams, args):
dataset = Dataset.from_list(gz_files)
if propeller.train.distribution.status.mode == propeller.train.distribution.DistributionMode.NCCL:
log.info('Apply sharding in distribution env')
dataset = dataset.shard(propeller.train.distribution.status.num_replica, propeller.train.distribution.status.replica_id)
if len(gz_files) < propeller.train.distribution.status.num_replica:
raise ValueError(
'not enough train file to shard: # of train files: %d, # of workers %d'
% (len(gz_files),
propeller.train.distribution.status.num_replica))
dataset = dataset.shard(env.nranks, env.dev_id)
dataset = dataset.repeat().shuffle(buffer_size=len(gz_files))
dataset = dataset.interleave(map_fn=bb_to_segments, cycle_length=len(gz_files), block_length=1)
dataset = dataset.shuffle(buffer_size=1000) #must shuffle to ensure negative sample randomness
dataset = dataset.interleave(
map_fn=bb_to_segments, cycle_length=len(gz_files), block_length=1)
dataset = dataset.shuffle(
buffer_size=1000) #must shuffle to ensure negative sample randomness
dataset = sample_negative(dataset)
dataset = dataset.padded_batch(hparams.batch_size, (0, 0, 0, 0)).map(after)
dataset = dataset.padded_batch(args.bsz, (0, 0, 0, 0)).map(after)
dataset.name = name
return dataset
......@@ -287,68 +291,110 @@ if __name__ == '__main__':
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
parser = propeller.ArgumentParser('DAN model with Paddle')
parser.add_argument('--max_seqlen', type=int, default=256)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--from_pretrained', type=str, default=None)
parser.add_argument('--mask_rate', type=float, default=0.15)
parser.add_argument('--check', type=float, default=0.)
args = parser.parse_args()
if not os.path.exists(args.from_pretrained):
raise ValueError('--from_pretrained not found: %s' % args.from_pretrained)
cfg_file_path = os.path.join(args.from_pretrained, 'ernie_config.json')
param_path = os.path.join(args.from_pretrained, 'params')
vocab_path = os.path.join(args.from_pretrained, 'vocab.txt')
assert os.path.exists(cfg_file_path) and os.path.exists(param_path) and os.path.exists(vocab_path)
hparams_cli = propeller.parse_hparam(args)
hparams_config_file = json.loads(open(cfg_file_path).read())
default_hparams = propeller.HParams(
batch_size=50,
warmup_steps=10000,
learning_rate=1e-4,
weight_decay=0.01,
use_fp16=False,
parser.add_argument(
'--max_seqlen',
type=int,
default=256,
help='max sequence length, documents from pretrain data will expand to this length'
)
parser.add_argument(
'--data_dir',
type=str,
required=True,
help='protobuf pretrain data directory')
parser.add_argument(
'--mask_rate',
type=float,
default=0.15,
help='probability of an input token to be masked')
parser.add_argument(
'--check', type=float, default=0., help='probability of debug info')
parser.add_argument(
'--warmup_steps', type=int, default=10000, help='warmup steps')
parser.add_argument(
'--max_steps', type=int, default=1000000, help='max pretrain steps')
parser.add_argument('--lr', type=float, default=1e-4, help='learning_rate')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model dir')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output_dir')
parser.add_argument(
'--wd',
type=float,
default=0.01,
help='weight decay, aka L2 regularizer')
parser.add_argument('--bsz', type=int, default=50)
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore-compatible devices'
)
hparams = default_hparams.join(propeller.HParams(**hparams_config_file)).join(hparams_cli)
default_run_config=dict(
max_steps=1000000,
save_steps=10000,
log_steps=10,
max_ckpt=3,
skip_steps=0,
eval_steps=-1)
args = parser.parse_args()
run_config = dict(default_run_config, **json.loads(args.run_config))
run_config = propeller.RunConfig(**run_config)
P.distributed.init_parallel_env()
env = P.distributed.ParallelEnv()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
train_ds = make_pretrain_dataset('train', args.data_dir,
vocab=tokenizer.vocab, hparams=hparams, args=args)
seq_shape = [-1, args.max_seqlen]
ints_shape = [-1,]
shapes = (seq_shape, seq_shape, ints_shape, [-1, 2], ints_shape)
types = ('int64', 'int64', 'int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
ws = None
#varname_to_warmstart = re.compile(r'^encoder.*[wb]_0$|^.*embedding$|^.*bias$|^.*scale$|^pooled_fc.[wb]_0$')
varname_to_warmstart = re.compile(r'.*')
if args.from_pretrained is not None:
warm_start_dir = os.path.join(args.from_pretrained, 'params')
ws = propeller.WarmStartSetting(
predicate_fn=lambda v: varname_to_warmstart.match(v.name) and os.path.exists(os.path.join(warm_start_dir, v.name)),
from_dir=warm_start_dir
)
ernie_learner = propeller.Learner(ernie_pretrain_model_fn, run_config, params=hparams, warm_start_setting=ws)
ernie_learner.train(train_ds)
train_ds = make_pretrain_dataset(
'train', args.data_dir, vocab=tokenizer.vocab, args=args)
model = ErnieModelForPretraining.from_pretrained(args.from_pretrained)
param_name_to_exclue_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(args.max_steps, args.warmup_steps))
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
opt = P.optimizer.AdamW(
learning_rate=lr_scheduler,
parameters=model.parameters(),
apply_decay_param_fun=lambda n: param_name_to_exclue_from_weight_decay.match(n),
weight_decay=args.wd,
grad_clip=g_clip)
model = P.DataParallel(model)
scaler = P.amp.GradScaler(enable=args.use_amp)
create_if_not_exists(args.save_dir)
with P.amp.auto_cast(args.use_amp):
for step, samples in enumerate(
P.io.DataLoader(
train_ds, places=P.CUDAPlace(env.dev_id), batch_size=0)):
(src_ids, sent_ids, mlm_label, mask_pos, nsp_label) = samples
loss, mlmloss, nsploss = model(
src_ids,
sent_ids,
labels=mlm_label,
mlm_pos=mask_pos,
nsp_labels=nsp_label)
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e scaling %.3e' % (
env.dev_id, step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e' % (
env.dev_id, step, _l, _lr)
log.debug(msg)
if step % 1000 == 0 and env.dev_id == 0:
log.debug('saving...')
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
if step > args.max_steps:
break
log.info('done')
......@@ -24,14 +24,11 @@ import re
import logging
import six
from glob import glob
from pathlib import Path
from functools import reduce, partial
import itertools
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as D
import paddle.fluid.layers as L
import sentencepiece as spm
import paddle as P
import json
from tqdm import tqdm
......@@ -40,9 +37,10 @@ import random as r
from ernie.modeling_ernie import ErnieModelForPretraining
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.optimization import AdamW, LinearDecay
from demo.optimization import optimization
import propeller.paddle as propeller
import propeller as propeller_base
from propeller.paddle.data import Dataset
from propeller import log
......@@ -54,6 +52,7 @@ if six.PY3:
from itertools import accumulate
else:
import operator
def accumulate(iterable, func=operator.add, initial=None):
'Return running totals'
# accumulate([1,2,3,4,5]) --> 1 3 6 10 15
......@@ -72,9 +71,54 @@ else:
yield total
def ernie_pretrain_model_fn(features, mode, params, run_config):
"""propeller Model wraper for paddle-ERNIE """
src_ids, sent_ids, mlm_label, mask_pos, nsp_label = features
ernie = ErnieModelForPretraining(params, name='')
total_loss, mlm_loss, nsp_loss = ernie(
src_ids,
sent_ids,
labels=mlm_label,
mlm_pos=mask_pos,
nsp_labels=nsp_label)
metrics = None
inf_spec = None
propeller.summary.scalar('loss', total_loss)
propeller.summary.scalar('nsp-loss', nsp_loss)
propeller.summary.scalar('mlm-loss', mlm_loss)
lr_step_hook, loss_scale_coef = optimization(
loss=total_loss,
warmup_steps=params['warmup_steps'],
num_train_steps=run_config.max_steps,
learning_rate=params['learning_rate'],
train_program=P.static.default_main_program(),
startup_prog=P.static.default_startup_program(),
weight_decay=params['weight_decay'],
scheduler="linear_warmup_decay",
use_fp16=args.use_amp, )
scheduled_lr = P.static.default_main_program().global_block().var(
'learning_rate_0')
propeller.summary.scalar('lr', scheduled_lr)
if args.use_amp:
propeller.summary.scalar('loss_scaling', loss_scale_coef)
pred = [total_loss]
return propeller.ModelSpec(
loss=total_loss,
mode=mode,
metrics=metrics,
predictions=pred,
train_hooks=[lr_step_hook])
def truncate_sentence(seq, from_length, to_length):
random_begin = np.random.randint(0, np.maximum(0, from_length - to_length) + 1)
return seq[random_begin: random_begin + to_length]
random_begin = np.random.randint(
0, np.maximum(0, from_length - to_length) + 1)
return seq[random_begin:random_begin + to_length]
def build_pair(seg_a, seg_b, max_seqlen, vocab):
......@@ -86,9 +130,11 @@ def build_pair(seg_a, seg_b, max_seqlen, vocab):
ml = max_seqlen - 3
half_ml = ml // 2
if a_len > b_len:
a_len_truncated, b_len_truncated = np.maximum(half_ml, ml - b_len), np.minimum(half_ml, b_len)
a_len_truncated, b_len_truncated = np.maximum(
half_ml, ml - b_len), np.minimum(half_ml, b_len)
else:
a_len_truncated, b_len_truncated = np.minimum(half_ml, a_len), np.maximum(half_ml, ml - a_len)
a_len_truncated, b_len_truncated = np.minimum(
half_ml, a_len), np.maximum(half_ml, ml - a_len)
seg_a = truncate_sentence(seg_a, a_len, a_len_truncated)
seg_b = truncate_sentence(seg_b, b_len, b_len_truncated)
......@@ -98,9 +144,11 @@ def build_pair(seg_a, seg_b, max_seqlen, vocab):
token_type_a = np.ones_like(seg_a_txt, dtype=np.int64) * 0
token_type_b = np.ones_like(seg_b_txt, dtype=np.int64) * 1
sen_emb = np.concatenate([[cls_id], seg_a_txt, [sep_id], seg_b_txt, [sep_id]], 0)
sen_emb = np.concatenate(
[[cls_id], seg_a_txt, [sep_id], seg_b_txt, [sep_id]], 0)
info_emb = np.concatenate([[-1], seg_a_info, [-1], seg_b_info, [-1]], 0)
token_type_emb = np.concatenate([[0], token_type_a, [0], token_type_b, [1]], 0)
token_type_emb = np.concatenate(
[[0], token_type_a, [0], token_type_b, [1]], 0)
return sen_emb, info_emb, token_type_emb
......@@ -112,24 +160,25 @@ def apply_mask(sentence, seg_info, mask_rate, vocab_size, vocab):
batch_size, seqlen = shape
invalid_pos = np.where(seg_info == -1)
seg_info += 1 #no more =1
seg_info += 1 #no more =1
seg_info_flatten = seg_info.reshape([-1])
seg_info_incr = seg_info_flatten - np.roll(seg_info_flatten, shift=1)
seg_info = np.add.accumulate(np.array([0 if s == 0 else 1 for s in seg_info_incr])).reshape(shape)
seg_info = np.add.accumulate(
np.array([0 if s == 0 else 1 for s in seg_info_incr])).reshape(shape)
seg_info[invalid_pos] = -1
u_seginfo = np.array([i for i in np.unique(seg_info) if i != -1])
np.random.shuffle(u_seginfo)
sample_num = max(1, int(len(u_seginfo) * mask_rate))
u_seginfo = u_seginfo[: sample_num]
u_seginfo = u_seginfo[:sample_num]
mask = reduce(np.logical_or, [seg_info == i for i in u_seginfo])
mask[:, 0] = False # ignore CLS head
mask[:, 0] = False # ignore CLS head
rand = np.random.rand(*shape)
choose_original = rand < 0.1 #
choose_random_id = (0.1 < rand) & (rand < 0.2) #
choose_mask_id = 0.2 < rand #
choose_original = rand < 0.1 #
choose_random_id = (0.1 < rand) & (rand < 0.2) #
choose_mask_id = 0.2 < rand #
random_id = np.random.randint(1, vocab_size, size=shape)
replace_id = mask_id * choose_mask_id + \
......@@ -139,30 +188,39 @@ def apply_mask(sentence, seg_info, mask_rate, vocab_size, vocab):
mask_pos = np.where(mask)
#mask_pos_flatten = list(map(lambda idx: idx[0] * seqlen + idx[1], zip(*mask_pos))) #transpose
mask_label = sentence[mask_pos]
sentence[mask_pos] = replace_id[mask_pos] #overwrite
sentence[mask_pos] = replace_id[mask_pos] #overwrite
#log.debug(mask_pos_flatten)
return sentence, np.stack(mask_pos, -1), mask_label
def make_pretrain_dataset(name, dir, vocab, args):
def make_pretrain_dataset(name, dir, vocab, hparams, args):
gz_files = glob(dir)
if not gz_files:
raise ValueError('train data not found in %s' % gz_files)
raise ValueError('train data not found in %s' % dir)
log.info('read from %s' % '\n'.join(gz_files))
max_input_seqlen = args.max_seqlen
max_pretrain_seqlen = lambda: max_input_seqlen if r.random() > 0.15 else r.randint(1, max_input_seqlen) # short sentence rate
max_input_seqlen = args.max_seqlen
max_pretrain_seqlen = lambda: max_input_seqlen if r.random() > 0.15 else r.randint(1, max_input_seqlen) # short sentence rate
def _parse_gz(record_str): # function that takes python_str as input
ex = propeller.data.example_pb2.SequenceExample()
def _parse_gz(record_str): # function that takes python_str as input
ex = propeller_base.data.example_pb2.SequenceExample()
ex.ParseFromString(record_str)
doc = [np.array(f.int64_list.value, dtype=np.int64) for f in ex.feature_lists.feature_list['txt'].feature]
doc_seg = [np.array(f.int64_list.value, dtype=np.int64) for f in ex.feature_lists.feature_list['segs'].feature]
doc = [
np.array(
f.int64_list.value, dtype=np.int64)
for f in ex.feature_lists.feature_list['txt'].feature
]
doc_seg = [
np.array(
f.int64_list.value, dtype=np.int64)
for f in ex.feature_lists.feature_list['segs'].feature
]
return doc, doc_seg
def bb_to_segments(filename):
ds = Dataset.from_record_file(filename).map(_parse_gz)
iterable = iter(ds)
def gen():
buf, size = [], 0
iterator = iter(ds)
......@@ -172,7 +230,9 @@ def make_pretrain_dataset(name, dir, vocab, args):
#line = np.array(sp_model.SampleEncodeAsIds(line, -1, 0.1), dtype=np.int64) # 0.1 means large variance on sentence piece result
if len(line) == 0:
continue
line = np.array(line) # 0.1 means large variance on sentence piece result
line = np.array(
line
) # 0.1 means large variance on sentence piece result
line_seg = np.array(line_seg)
size += len(line)
buf.append(np.stack([line, line_seg]).transpose())
......@@ -180,8 +240,9 @@ def make_pretrain_dataset(name, dir, vocab, args):
yield buf,
buf, size = [], 0
if len(buf) != 0:
yield buf,
yield buf,
buf, size = [], 0
return Dataset.from_generator_func(gen)
def sample_negative(dataset):
......@@ -195,10 +256,13 @@ def make_pretrain_dataset(name, dir, vocab, args):
seqlen_a = r.randint(1, seqlen)
seqlen_b = seqlen - seqlen_a
len_a = list(accumulate([len(c) for c in chunk_a]))
buf_a = [c for c, l in zip(chunk_a, len_a) if l < seqlen_a] #always take the first one
buf_b = [c for c, l in zip(chunk_a, len_a) if seqlen_a <= l < seqlen]
buf_a = [c for c, l in zip(chunk_a, len_a)
if l < seqlen_a] #always take the first one
buf_b = [
c for c, l in zip(chunk_a, len_a) if seqlen_a <= l < seqlen
]
if r.random() < 0.5: #pos or neg
if r.random() < 0.5: #pos or neg
label = np.int64(1)
else:
label = np.int64(0)
......@@ -210,7 +274,9 @@ def make_pretrain_dataset(name, dir, vocab, args):
b = np.concatenate(buf_b)
#log.debug(a)
#log.debug(b)
sample, seg_info, token_type = build_pair(a, b, args.max_seqlen, vocab) #negative sample might exceed max seqlen
sample, seg_info, token_type = build_pair(
a, b, args.max_seqlen,
vocab) #negative sample might exceed max seqlen
yield sample, seg_info, token_type, label
ds = propeller.data.Dataset.from_generator_func(gen)
......@@ -218,14 +284,19 @@ def make_pretrain_dataset(name, dir, vocab, args):
def after(sentence, seg_info, segments, label):
batch_size, seqlen = sentence.shape
sentence, mask_pos, mlm_label = apply_mask(sentence, seg_info, args.mask_rate, len(vocab), vocab)
sentence, mask_pos, mlm_label = apply_mask(
sentence, seg_info, args.mask_rate, hparams.vocab_size, vocab)
ra = r.random()
if ra < args.check:
print('***')
print('\n'.join([str(j) + '\t' + '|'.join(map(str, i)) for i, j in zip(sentence.tolist(), label)]))
print('\n'.join([
str(j) + '\t' + '|'.join(map(str, i))
for i, j in zip(sentence.tolist(), label)
]))
print('***')
print('\n'.join(['|'.join(map(str, i)) for i in seg_info.tolist()]))
print('\n'.join(
['|'.join(map(str, i)) for i in seg_info.tolist()]))
print('***')
print('|'.join(map(str, mlm_label.tolist())))
print('***')
......@@ -236,15 +307,17 @@ def make_pretrain_dataset(name, dir, vocab, args):
dataset = Dataset.from_list(gz_files)
if propeller.train.distribution.status.mode == propeller.train.distribution.DistributionMode.NCCL:
log.info('Apply sharding in distribution env')
if len(gz_files) < propeller.train.distribution.status.num_replica:
raise ValueError('not enough train file to shard: # of train files: %d, # of workers %d' % (len(gz_files), propeller.train.distribution.status.num_replica))
dataset = dataset.shard(propeller.train.distribution.status.num_replica, propeller.train.distribution.status.replica_id)
dataset = dataset.shard(
propeller.train.distribution.status.num_replica,
propeller.train.distribution.status.replica_id)
dataset = dataset.repeat().shuffle(buffer_size=len(gz_files))
dataset = dataset.interleave(map_fn=bb_to_segments, cycle_length=len(gz_files), block_length=1)
dataset = dataset.shuffle(buffer_size=1000) #must shuffle to ensure negative sample randomness
dataset = dataset.interleave(
map_fn=bb_to_segments, cycle_length=len(gz_files), block_length=1)
dataset = dataset.shuffle(
buffer_size=1000) #must shuffle to ensure negative sample randomness
dataset = sample_negative(dataset)
dataset = dataset.padded_batch(args.bsz, (0, 0, 0, 0)).map(after)
dataset = dataset.padded_batch(hparams.batch_size, (0, 0, 0, 0)).map(after)
dataset.name = name
return dataset
......@@ -256,53 +329,77 @@ if __name__ == '__main__':
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
parser = propeller.ArgumentParser('DAN model with Paddle')
parser.add_argument('--max_seqlen', type=int, default=256, help='max sequence length, documents from pretrain data will expand to this length')
parser.add_argument('--data_dir', type=str, required=True, help='protobuf pretrain data directory')
parser.add_argument('--mask_rate', type=float, default=0.15, help='probability of an input token to be masked')
parser.add_argument('--check', type=float, default=0., help='probability of debug info')
parser.add_argument('--warmup_steps', type=int, default=10000, help='warmup steps')
parser.add_argument('--max_steps', type=int, default=1000000, help='max pretrain steps')
parser.add_argument('--lr', type=float, default=1e-4, help='learning_rate')
parser.add_argument('--from_pretrained', type=str, required=True, help='pretrained model dir')
parser.add_argument('--save_dir', type=str, default=None, help='model output_dir')
parser.add_argument('--bsz', type=int, default=50)
parser.add_argument('--max_seqlen', type=int, default=256)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--from_pretrained', type=Path, default=None)
parser.add_argument('--use_amp', action='store_true')
parser.add_argument('--mask_rate', type=float, default=0.15)
parser.add_argument('--check', type=float, default=0.)
args = parser.parse_args()
P.enable_static()
if not os.path.exists(args.from_pretrained):
raise ValueError('--from_pretrained not found: %s' %
args.from_pretrained)
cfg_file_path = os.path.join(args.from_pretrained, 'ernie_config.json')
param_path = os.path.join(args.from_pretrained, 'params')
vocab_path = os.path.join(args.from_pretrained, 'vocab.txt')
assert os.path.exists(cfg_file_path) and os.path.exists(
param_path) and os.path.exists(vocab_path)
hparams_cli = propeller.parse_hparam(args)
hparams_config_file = json.loads(open(cfg_file_path).read())
default_hparams = propeller.HParams(
batch_size=50,
warmup_steps=10000,
learning_rate=1e-4,
weight_decay=0.01, )
hparams = default_hparams.join(propeller.HParams(
**hparams_config_file)).join(hparams_cli)
default_run_config = dict(
max_steps=1000000,
save_steps=10000,
log_steps=10,
max_ckpt=3,
skip_steps=0,
eval_steps=-1)
run_config = dict(default_run_config, **json.loads(args.run_config))
run_config = propeller.RunConfig(**run_config)
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
train_ds = make_pretrain_dataset('train', args.data_dir,
vocab=tokenizer.vocab, args=args)
train_ds = make_pretrain_dataset(
'train',
args.data_dir,
vocab=tokenizer.vocab,
hparams=hparams,
args=args)
seq_shape = [-1, args.max_seqlen]
ints_shape = [-1,]
shapes = (seq_shape, seq_shape, ints_shape, [-1, 2], ints_shape)
ints_shape = [-1, ]
shapes = (seq_shape, seq_shape, ints_shape, [-1, 2], ints_shape)
types = ('int64', 'int64', 'int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
place = F.CUDAPlace(D.parallel.Env().dev_id)
with D.guard(place):
model = ErnieModelForPretraining.from_pretrained(args.from_pretrained)
opt = AdamW(learning_rate=LinearDecay(args.lr, args.warmup_steps, args.max_steps), parameter_list=model.parameters(), weight_decay=0.01)
ctx = D.parallel.prepare_context()
model = D.parallel.DataParallel(model, ctx)
for step, samples in enumerate(tqdm(train_ds.start(place))):
(src_ids, sent_ids, mlm_label, mask_pos, nsp_label) = samples
loss, mlmloss, nsploss = model(src_ids, sent_ids, labels=mlm_label, mlm_pos=mask_pos, nsp_labels=nsp_label)
scaled_loss = model.scale_loss(loss)
scaled_loss.backward()
model.apply_collective_grads()
opt.minimize(scaled_loss)
model.clear_gradients()
if step % 10 == 0:
log.debug('train loss %.5f scaled loss %.5f' % (loss.numpy(), scaled_loss.numpy()))
if step % 10000 == 0 and D.parallel.Env().dev_id == 0 and args.save_dir is not None:
F.save_dygraph(model.state_dict(), args.save_dir)
ws = None
#varname_to_warmstart = re.compile(r'^encoder.*[wb]_0$|^.*embedding$|^.*bias$|^.*scale$|^pooled_fc.[wb]_0$')
varname_to_warmstart = re.compile(r'.*')
if args.from_pretrained is not None:
warm_start_dir = os.path.join(args.from_pretrained, 'params')
ws = propeller.WarmStartSetting(
predicate_fn=lambda v: varname_to_warmstart.match(v.name) and os.path.exists(os.path.join(warm_start_dir, v.name)),
from_dir=warm_start_dir
)
ernie_learner = propeller.Learner(
ernie_pretrain_model_fn,
run_config,
params=hparams,
warm_start_setting=ws)
ernie_learner.train(train_ds)
......@@ -23,15 +23,15 @@ python3 -m paddle.distributed.launch \
--max_steps $((287113*30/64))
```
Note that you need more than 2 GPUs to run the finetuning.
During multi-gpu finetuning, `max_steps` is used as the stop criterion rather than `epoch`, to prevent deadlock.
We simply calculate `max_steps` with: `EPOCH * NUM_TRAIN_EXAMPLES / TOTAL_BATCH`.
This demo script will save a finetuned model at `--save_dir`, run multi-gpu prediction every `--eval_steps`, and save prediction results at `--predict_output_dir`.
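As a concrete check of that formula, using only the numbers from the example command above (287,113 CNN/DailyMail training examples, 30 epochs, a total batch size of 64 across all cards), the shell expression `$((287113*30/64))` amounts to:

```python
# Sketch of the max_steps arithmetic from the example command above.
# Integer division matches the shell's $(( )) arithmetic.
epochs = 30
num_train_examples = 287113
total_batch = 64  # per-card batch size times the number of cards

max_steps = epochs * num_train_examples // total_batch
print(max_steps)  # -> 134584
```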
### Evaluation
While finetuning, a series of prediction files is generated.
First you need to sort and join all files with:
```shell
......@@ -40,13 +40,13 @@ sort -t$'\t' -k1n ./pred/pred.step60000.* |awk -F"\t" '{print $2}'> final_predic
then use `./eval_cnndm/cnndm_eval.sh` to calculate all metrics
(`pyrouge` is required to evaluate CNN/Daily Mail.)
```shell
sh cnndm_eval.sh final_prediction ./data/cnndm/dev.summary
```
### Inference
### Inference
......@@ -57,5 +57,3 @@ cat one_column_source_text| python3 demo/seq2seq/decode.py \
--save_dir ./model_cnndm \
--bsz 8
```
```bash
_____ ____ _ _ ___ _____ ____ _____ _ _
| ____| _ \| \ | |_ _| ____| / ___| ____| \ | |
| _| | |_) | \| || || _| _____| | _| _| | \| |
| |___| _ <| |\ || || |__|_____| |_| | |___| |\ |
......
![ernie_vil](.meta/ernie-vil.png)
The `ERNIE-ViL` (including our pre-trained models and VCR task-pretrained models) has been released at [here](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil).
......@@ -18,16 +18,13 @@ from __future__ import print_function
from __future__ import unicode_literals
import paddle
paddle_version = [int(i) for i in paddle.__version__.split('.')]
if paddle_version[1] < 7:
raise RuntimeError('paddle-ernie requires paddle 1.7+, got %s' %
if paddle.__version__ != '0.0.0' and paddle.__version__ < '2.0.0':
raise RuntimeError('propeller 0.2 requires paddle 2.0+, got %s' %
paddle.__version__)
from ernie.modeling_ernie import ErnieModel
from ernie.modeling_ernie import (ErnieModelForSequenceClassification,
ErnieModelForTokenClassification,
ErnieModelForQuestionAnswering,
ErnieModelForPretraining)
from ernie.modeling_ernie import (
ErnieModelForSequenceClassification, ErnieModelForTokenClassification,
ErnieModelForQuestionAnswering, ErnieModelForPretraining)
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer