Commit 3b9d92fc authored by chenxuyi

update to paddle 2.0

Parent 881bd978
......@@ -15,4 +15,3 @@ markComment: >
Thank you for your contributions.
# Comment to post when closing a stale issue. Set to `false` to disable
closeComment: false
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
- repo: https://github.com/PaddlePaddle/mirrors-yapf.git
rev: 0d79c0c469bab64f7229c9aca2b1186ef47f0e37
hooks:
- id: yapf
files: (.*\.(py|bzl)|BUILD|.*\.BUILD|WORKSPACE)$
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: 5bf6c09bfa1297d3692cadd621ef95f1284e33c0
hooks:
- id: check-added-large-files
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
files: (?!.*third_party)^.*$ | (?!.*book)^.*$
- id: end-of-file-fixer
......@@ -11,6 +11,12 @@ ERNIE 2.0 builds a strong basic for nearly every NLP tasks: Text Classification,
[\[more information\]](https://wenxin.baidu.com/)
# News
- Dec.29.2020:
- Pretrain and finetune ERNIE with [PaddlePaddle v2.0](https://github.com/PaddlePaddle/Paddle/tree/release/2.0-rc).
- New AMP (automatic mixed precision) support for every demo in this repo.
- Introduced `gradient accumulation`: finetune `ERNIE-large` with only 8GB of GPU memory.
- Sept.24.2020:
- [`ERNIE-ViL`](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil) is **available** now!
- **Knowledge-enhanced** joint representations for vision-language tasks.
......@@ -20,7 +26,6 @@ ERNIE 2.0 builds a strong basic for nearly every NLP tasks: Text Classification,
- May.20.2020:
- Try ERNIE in "`dygraph`", with:
- Pretrain and finetune ERNIE with [PaddlePaddle v1.8](https://github.com/PaddlePaddle/Paddle/tree/release/1.8).
- Eager execution with `paddle.fluid.dygraph`.
- Distributed training.
- Easy deployment.
......@@ -54,18 +59,16 @@ ERNIE 2.0 builds a strong basic for nearly every NLP tasks: Text Classification,
```python
import numpy as np
import paddle.fluid.dygraph as D
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel
D.guard().__enter__() # activate paddle `dygraph` mode
model = ErnieModel.from_pretrained('ernie-1.0') # fetch the pretrained model from the server; make sure you have a network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
ids, _ = tokenizer.encode('hello world')
ids = D.to_variable(np.expand_dims(ids, 0)) # insert extra `batch` dimension
ids = P.to_tensor(np.expand_dims(ids, 0)) # insert extra `batch` dimension
pooled, encoded = model(ids) # eager execution
print(pooled.numpy()) # convert results to numpy
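# a short note on the outputs (explanatory comment, assuming the base ernie-1.0 config):
# `pooled` is the sentence-level feature taken from the [CLS] position, shape roughly [1, hidden_size] (768 here);
# `encoded` holds the per-token features, shape roughly [1, seqlen, hidden_size]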
......@@ -152,11 +155,16 @@ see [demo](https://ernie-github.cdn.bcebos.com/data-mnli-m.tar.gz) data for MNLI
- try eager execution with the `dygraph` model:
```script
python3 ./ernie_d/demo/finetune_classifier_dygraph.py \
python3 ./demo/finetune_classifier.py \
--from_pretrained ernie-1.0 \
--data_dir ./data/xnli
```
- specify `--use_amp` to activate AMP training.
- `--bsz` denotes the global batch size for one optimization step; `--micro_bsz` denotes the maximum batch size for each GPU device.
If `--micro_bsz < --bsz`, gradient accumulation is activated (see the sketch below).
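
A minimal sketch of what this accumulation amounts to (`model`, `opt`, `train_loader`, `bsz` and `micro_bsz` are placeholders here; the actual logic lives in `./demo/finetune_classifier.py`):

```python
# assume model, opt, train_loader, bsz and micro_bsz are already built as in the demo
acc_step = bsz // micro_bsz                 # e.g. 64 // 8 -> update once every 8 micro batches
for i, (ids, sids, label) in enumerate(train_loader):
    loss, _ = model(ids, sids, labels=label)
    (loss / acc_step).backward()            # scale so the accumulated gradients match the global batch
    if (i + 1) % acc_step == 0:
        opt.step()                          # one optimizer step per global batch
        opt.clear_grad()
```

Each GPU only ever holds a `micro_bsz`-sized batch, which is what lets `ERNIE-large` fit into 8GB of memory.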
- Distributed finetune
`paddle.distributed.launch` is a process manager; we use it to launch one python process per available GPU device:
......@@ -173,7 +181,7 @@ you need to run single card finetuning first to get pretrained model, or donwloa
```script
python3 -m paddle.distributed.launch \
./demo/finetune_classifier_dygraph_distributed.py \
./demo/finetune_classifier_distributed.py \
--data_dir data/mnli \
--max_steps 10000 \
--from_pretrained ernie-2.0-en
......@@ -182,11 +190,12 @@ python3 -m paddle.distributed.launch \
many other demo python scripts:
1. [Sentiment Analysis](./demo/finetune_sentiment_analysis_dygraph.py)
1. [Semantic Similarity](./demo/finetune_classifier_dygraph.py)
1. [Named Entity Recognition (NER)](./demo/finetune_ner_dygraph.py)
1. [Machine Reading Comprehension](./demo/finetune_mrc_dygraph.py)
1. [Sentiment Analysis](./demo/finetune_sentiment_analysis.py)
1. [Semantic Similarity](./demo/finetune_classifier.py)
1. [Named Entity Recognition (NER)](./demo/finetune_ner.py)
1. [Machine Reading Comprehension](./demo/finetune_mrc.py)
1. [Text generation](./demo/seq2seq/README.md)
1. [Text classification with `paddle.static` API](./demo/finetune_classifier_static.py)
......@@ -251,7 +260,7 @@ It can be used for feature-based finetuning or feature extraction.
Knowledge distillation is a good way to compress and accelerate ERNIE.
For details about distillation, see [here](./distill/README.md)
For details about distillation, see [here](./demo/distill/README.md)
# Citation
......@@ -306,4 +315,3 @@ For full reproduction of paper results, please checkout to `repro` branch of thi
- QQ discussion group: 760439550 (ERNIE discussion group).
- QQ discussion group: 958422639 (ERNIE discussion group-v2).
- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.
......@@ -10,6 +10,11 @@ ERNIE是百度开创性提出的基于知识增强的持续学习语义理解框
# News
- 2020.12.29:
- The `ERNIE` open-source toolkit has been fully upgraded to [PaddlePaddle v2.0](https://github.com/PaddlePaddle/Paddle/tree/release/2.0-rc)
- All demo tutorials now use AMP (mixed precision training), with an average speedup of 2.3x.
- `Gradient accumulation` is introduced, so the `ERNIE-large` model can run with only 8GB of GPU memory.
- 2020.9.24:
- The `ERNIE-ViL` model is now open source! ([click here](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-vil))
- A knowledge-enhanced pre-training framework for vision-language tasks, the first to bring structured knowledge into vision-language pre-training.
......@@ -19,7 +24,6 @@ ERNIE是百度开创性提出的基于知识增强的持续学习语义理解框
- 2020.5.20:
- Try the `dygraph` implementation of ERNIE:
- Pretrain and finetune ERNIE with [PaddlePaddle v1.8](https://github.com/PaddlePaddle/Paddle/tree/release/1.8).
- Eager execution: what you see is what you get.
- Large-scale distributed training.
- Easy deployment.
......@@ -52,18 +56,16 @@ ERNIE是百度开创性提出的基于知识增强的持续学习语义理解框
# Quick start
```python
import numpy as np
import paddle.fluid.dygraph as D
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel
D.guard().__enter__() # activate paddle `dygraph` mode
model = ErnieModel.from_pretrained('ernie-1.0') # fetch the pretrained model from the server; make sure you have a network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
ids, _ = tokenizer.encode('hello world')
ids = D.to_variable(np.expand_dims(ids, 0)) # insert extra `batch` dimension
ids = P.to_tensor(np.expand_dims(ids, 0)) # insert extra `batch` dimension
pooled, encoded = model(ids) # eager execution
print(pooled.numpy()) # convert results to numpy
......@@ -159,11 +161,16 @@ data/xnli
- Finetune with the `dygraph` model:
```script
python3 ./ernie_d/demo/finetune_classifier_dygraph.py \
python3 ./ernie_d/demo/finetune_classifier.py \
--from_pretrained ernie-1.0 \
--data_dir ./data/xnli
```
- Add `--use_amp` to enable AMP (please enable it only on `TensorCore`-capable devices); a minimal sketch of what it does is given below.
- Use `--bsz` to set the global batch size (the number of samples the model sees in one optimization step) and `--micro_bsz` to set the number of samples fed to each GPU card.
When `--bsz > --micro_bsz`, the script automatically enables gradient accumulation.
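
A minimal sketch of what `--use_amp` switches on under PaddlePaddle 2.0 (`model`, `opt` and the input tensors are assumed to be built as in the demo script):

```python
import paddle as P

scaler = P.amp.GradScaler(enable=True)      # dynamic loss scaling
with P.amp.auto_cast(enable=True):          # run the forward pass in mixed precision
    loss, _ = model(ids, sids, labels=label)
loss = scaler.scale(loss)                   # scale the loss to avoid fp16 underflow
loss.backward()
scaler.minimize(opt, loss)                  # unscale the gradients and take one optimizer step
model.clear_gradients()
```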
- Distributed finetune
`paddle.distributed.launch` is a process manager; we use it to launch one python process on each GPU and set the corresponding environment variables for distributed training:
......@@ -177,7 +184,7 @@ python3 ./ernie_d/demo/finetune_classifier_dygraph.py \
```script
python3 -m paddle.distributed.launch \
./demo/finetune_classifier_dygraph_distributed.py \
./demo/finetune_classifier_distributed.py \
--data_dir data/mnli \
--max_steps 10000 \
--from_pretrained ernie2.0-en
......@@ -186,11 +193,12 @@ python3 -m paddle.distributed.launch \
More demo scripts:
1. [Sentiment Analysis](./demo/finetune_sentiment_analysis_dygraph.py)
1. [Semantic Similarity](./demo/finetune_classifier_dygraph.py)
1. [Named Entity Recognition (NER)](./demo/finetune_ner_dygraph.py)
1. [Machine Reading Comprehension](./demo/finetune_mrc_dygraph.py) (requires a multi-GPU environment; see the "Distributed finetune" section above)
1. [Sentiment Analysis](./demo/finetune_sentiment_analysis.py)
1. [Semantic Similarity](./demo/finetune_classifier.py)
1. [Named Entity Recognition (NER)](./demo/finetune_ner.py)
1. [Machine Reading Comprehension](./demo/finetune_mrc.py) (requires a multi-GPU environment; see the "Distributed finetune" section above)
1. [Text summarization](./demo/seq2seq/README.md)
1. [Text classification with the static graph API](./demo/finetune_classifier_static.py)
**Recommended hyperparameters:**
......@@ -221,7 +229,7 @@ python3 -m paddle.distributed.launch \
# Online inference
If `--inference_model_dir` is specified in `finetune_classifier_dygraph.py`, the finetune script will serialize your model and produce an `inference_model` that can be deployed directly for online inference.
If `--inference_model_dir` is specified in `finetune_classifier.py`, the finetune script will serialize your model and produce an `inference_model` that can be deployed directly for online inference.
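
A rough sketch of loading the exported `inference_model` with the Paddle Inference Python API (the directory name and the ids-then-sids input order are assumptions for illustration, not something the finetune script guarantees):

```python
import numpy as np
from paddle.inference import Config, create_predictor

config = Config('./inference_model_dir')    # directory written via --inference_model_dir (placeholder path)
config.enable_use_gpu(100, 0)               # 100 MB initial GPU memory pool on card 0
predictor = create_predictor(config)

ids = np.zeros([1, 128], dtype='int64')     # token ids, e.g. from ErnieTokenizer.encode
sids = np.zeros([1, 128], dtype='int64')    # segment ids
names = predictor.get_input_names()
predictor.get_input_handle(names[0]).copy_from_cpu(ids)
predictor.get_input_handle(names[1]).copy_from_cpu(sids)
predictor.run()
logits = predictor.get_output_handle(predictor.get_output_names()[0]).copy_to_cpu()
```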
For details on using the online inference code in a production environment, see the [C++ inference API](./inference/README.md).
Alternatively, you can start a multi-GPU inference service with `propeller` (a GPU environment is required) by simply running:
......@@ -254,7 +262,7 @@ ids = np.expand_dims(ids, -1) # ids.shape==[BATCH, SEQLEN, 1]
# Distillation
Knowledge distillation is an effective way to compress and accelerate ERNIE; for implementation details, see [here](./distill/README.md)
Knowledge distillation is an effective way to compress and accelerate ERNIE; for implementation details, see [here](./demo/distill/README.md)
# Citation
......@@ -309,4 +317,3 @@ ids = np.expand_dims(ids, -1) # ids.shape==[BATCH, SEQLEN, 1]
- QQ 群: 760439550 (ERNIE discussion group).
- QQ 2群: 958422639 (ERNIE discussion group-v2).
- [Forums](http://ai.baidu.com/forum/topic/list/168?pageNo=1): discuss implementations, research, etc.
......@@ -9,7 +9,7 @@
# ERNIE Slim data distillation
Behind ERNIE's strong language understanding lies an equally strong demand for compute to support training and inference at this scale. Many industrial applications have tight performance requirements, so the model cannot be deployed in practice unless it is compressed effectively.
![ernie_distill](../.metas/ernie_distill.png)
![ernie_distill](../../.metas/ernie_distill.png)
Therefore, as shown in the figure above, we built the **ERNIE Slim data distillation system** on top of [data distillation](https://arxiv.org/pdf/1712.04440.pdf). Its idea is to use data as a bridge to transfer the knowledge of the ERNIE model into a small model, achieving an inference speedup of up to three orders of magnitude with only a small loss in accuracy.
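
A conceptual sketch of this pipeline (`teacher`, `student`, `opt` and `unlabeled_loader` are placeholders; the runnable version, which additionally distills teacher logits, is the `distill.py` script next to this README):

```python
import paddle as P

# step 1: the finetuned ERNIE teacher labels the augmented, unlabeled text
teacher.eval()
pseudo = []
with P.no_grad():
    for ids_teacher, ids_student in unlabeled_loader:
        _, logits = teacher(ids_teacher)
        pseudo.append((ids_student, logits.argmax(-1)))

# step 2: the small student model trains on the pseudo labels
student.train()
for ids_student, label in pseudo:
    loss, _ = student(ids_student, labels=label)
    loss.backward()
    opt.step()
    opt.clear_grad()
```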
......@@ -62,4 +62,3 @@ python ./distill/distill.py
|**+ data distillation** |91.4%|
|non-ERNIE baseline (LSTM)|91.2%|
|**+ data distillation**|93.9%|
......@@ -18,15 +18,12 @@ import os
import numpy as np
from sklearn.metrics import f1_score
import paddle as P
import paddle.fluid as F
import paddle.fluid.layers as L
import paddle.fluid.dygraph as D
from paddle.nn import functional as F
import propeller.paddle as propeller
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModelForSequenceClassification
from ernie.optimization import AdamW, LinearDecay
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
# This example uses the chnsenticorp Chinese sentiment classification task as a demo;
# the unsupervised data needed for distillation has been expanded beforehand via data augmentation
......@@ -36,28 +33,46 @@ from ernie.optimization import AdamW, LinearDecay
# the pre-computed BoW vocabulary is at ./chnsenticorp-data/vocab.bow.txt
# hyperparameters for finetuning the teacher model
DATA_DIR='./chnsenticorp-data/'
SEQLEN=256
BATCH=32
EPOCH=10
LR=5e-5
DATA_DIR = './chnsenticorp-data/'
SEQLEN = 256
BATCH = 32
EPOCH = 10
LR = 5e-5
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
student_vocab = {i.strip(): l for l, i in enumerate(open(os.path.join(DATA_DIR, 'vocab.bow.txt')).readlines())}
student_vocab = {
i.strip(): l
for l, i in enumerate(
open(
os.path.join(DATA_DIR, 'vocab.bow.txt'), encoding='utf8')
.readlines())
}
def space_tokenizer(i):
return i.decode('utf8').split()
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('seg_a', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.TextColumn('seg_a_student', unk_id=student_vocab['[UNK]'], vocab_dict=student_vocab, tokenizer=space_tokenizer),
propeller.data.LabelColumn('label', vocab_dict={
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'seg_a_student',
unk_id=student_vocab['[UNK]'],
vocab_dict=student_vocab,
tokenizer=space_tokenizer),
propeller.data.LabelColumn(
'label', vocab_dict={
b"0": 0,
b"1": 1,
}),
])
def map_fn(seg_a, seg_a_student, label):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=SEQLEN)
sentence, segments = tokenizer.build_for_ernie(seg_a)
......@@ -76,7 +91,7 @@ dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(DATA_DIR, 'de
.map(map_fn) \
.padded_batch(BATCH,)
shapes = ([-1,SEQLEN],[-1,SEQLEN], [-1, SEQLEN], [-1])
shapes = ([-1, SEQLEN], [-1, SEQLEN], [-1, SEQLEN], [-1])
types = ('int64', 'int64', 'int64', 'int64')
train_ds.data_shapes = shapes
......@@ -86,16 +101,18 @@ train_ds_unlabel.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
place = F.CUDAPlace(0)
D.guard(place).__enter__()
place = P.CUDAPlace(0)
def evaluate_teacher(model, dataset):
all_pred, all_label = [], []
with D.base._switch_tracer_mode_guard_(is_train=False):
with P.no_grad():
model.eval()
for step, (ids_student, ids, _, labels) in enumerate(dataset.start()):
for step, (ids_student, ids, _, labels) in enumerate(
P.io.DataLoader(
dataset, places=place, batch_size=None)):
_, logits = model(ids)
pred = L.argmax(logits, -1)
pred = logits.argmax(-1)
all_pred.extend(pred.numpy())
all_label.extend(labels.numpy())
f1 = f1_score(all_label, all_pred, average='macro')
......@@ -103,44 +120,62 @@ def evaluate_teacher(model, dataset):
return f1
teacher_model = ErnieModelForSequenceClassification.from_pretrained('ernie-1.0', num_labels=2)
teacher_model = ErnieModelForSequenceClassification.from_pretrained(
'ernie-1.0', num_labels=2)
teacher_model.train()
if not os.path.exists('./teacher_model.pdparams'):
g_clip = F.clip.GradientClipByGlobalNorm(1.0)
opt = AdamW(learning_rate=LinearDecay(LR, 9600*EPOCH*0.1/BATCH, 9600*EPOCH/BATCH), parameter_list=teacher_model.parameters(), weight_decay=0.01, grad_clip=g_clip)
if not os.path.exists('./teacher_model.bin'):
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
lr_scheduler = P.optimizer.lr.LambdaDecay(
LR,
get_warmup_and_linear_decay(9600 * EPOCH / BATCH,
9600 * EPOCH * 0.1 / BATCH))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=teacher_model.parameters(),
weight_decay=0.01,
grad_clip=g_clip)
for epoch in range(EPOCH):
for step, (ids_student, ids, sids, labels) in enumerate(train_ds.start(place)):
for step, (ids_student, ids, sids, labels) in enumerate(
P.io.DataLoader(
train_ds, places=place, batch_size=None)):
loss, logits = teacher_model(ids, labels=labels)
loss.backward()
if step % 10 == 0:
print('[step %03d] teacher train loss %.5f lr %.3e' % (step, loss.numpy(), opt.current_step_lr()))
opt.minimize(loss)
opt.step()
lr_scheduler.step()
teacher_model.clear_gradients()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l, _lr)
print(msg)
if step % 100 == 0:
f1 = evaluate_teacher(teacher_model, dev_ds)
print('teacher f1: %.5f' %f1)
D.save_dygraph(teacher_model.state_dict(), './teacher_model')
print('teacher f1: %.5f' % f1)
P.save(teacher_model.state_dict(), './teacher_model.bin')
else:
state_dict, _ = D.load_dygraph('./teacher_model')
teacher_model.set_dict(state_dict)
state_dict = P.load('./teacher_model.bin')
teacher_model.set_state_dict(state_dict)
f1 = evaluate_teacher(teacher_model, dev_ds)
print('teacher f1: %.5f' %f1)
print('teacher f1: %.5f' % f1)
# hyperparameters for finetuning the student model
SEQLEN=256
BATCH=100
EPOCH=10
LR=1e-4
SEQLEN = 256
BATCH = 32
EPOCH = 10
LR = 1e-4
def evaluate_student(model, dataset):
all_pred, all_label = [], []
with D.base._switch_tracer_mode_guard_(is_train=False):
with P.no_grad():
model.eval()
for step, (ids_student, ids, _, labels) in enumerate(dataset.start()):
for step, (ids_student, ids, _, labels) in enumerate(
P.io.DataLoader(
dataset, places=place, batch_size=None)):
_, logits = model(ids_student)
pred = L.argmax(logits, -1)
pred = logits.argmax(-1)
all_pred.extend(pred.numpy())
all_label.extend(labels.numpy())
f1 = f1_score(all_label, all_pred, average='macro')
......@@ -148,92 +183,116 @@ def evaluate_student(model, dataset):
return f1
class BOW(D.Layer):
class BOW(P.nn.Layer):
def __init__(self):
super().__init__()
self.emb = D.Embedding([len(student_vocab), 128], padding_idx=0)
self.fc = D.Linear(128, 2)
self.emb = P.nn.Embedding(len(student_vocab), 128, padding_idx=0)
self.fc = P.nn.Linear(128, 2)
def forward(self, ids, labels=None):
embbed = self.emb(ids)
pad_mask = L.unsqueeze(L.cast(ids!=0, 'float32'), [-1])
pad_mask = (ids != 0).cast('float32').unsqueeze(-1)
embbed = L.reduce_sum(embbed * pad_mask, 1)
embbed = L.softsign(embbed)
embbed = (embbed * pad_mask).sum(1)
embbed = F.softsign(embbed)
logits = self.fc(embbed)
if labels is not None:
if len(labels.shape)==1:
labels = L.reshape(labels, [-1, 1])
loss = L.softmax_with_cross_entropy(logits, labels)
loss = L.reduce_mean(loss)
if len(labels.shape) == 1:
labels = labels.reshape([-1, 1])
loss = F.cross_entropy(logits, labels).mean()
else:
loss = None
return loss, logits
class CNN(D.Layer):
class CNN(P.nn.Layer):
def __init__(self):
super().__init__()
self.emb = D.Embedding([30002, 128], padding_idx=0)
self.cnn = D.Conv2D(128, 128, (1, 3), padding=(0, 1), act='relu')
self.pool = D.Pool2D((1, 3), pool_padding=(0, 1))
self.fc = D.Linear(128, 2)
self.emb = P.nn.Embedding(30002, 128, padding_idx=0)
# paddle.nn.Conv2D no longer takes an `act` argument, so the ReLU is attached explicitly
self.cnn = P.nn.Sequential(P.nn.Conv2D(128, 128, (1, 3), padding=(0, 1)), P.nn.ReLU())
self.pool = P.nn.MaxPool2D((1, 3), stride=1, padding=(0, 1))
self.fc = P.nn.Linear(128, 2)
def forward(self, ids, labels=None):
embbed = self.emb(ids)
#d_batch, d_seqlen = ids.shape
hidden = embbed
hidden = L.transpose(hidden, [0, 2, 1]) #change to NCWH
hidden = L.unsqueeze(hidden, [2])
hidden = hidden.transpose([0, 2, 1]).unsqueeze(2) #change to NCWH
hidden = self.cnn(hidden)
hidden = self.pool(hidden)
hidden = L.squeeze(hidden, [2])
hidden = L.transpose(hidden, [0, 2, 1])
pad_mask = L.unsqueeze(L.cast(ids!=0, 'float32'), [-1])
hidden = L.softsign(L.reduce_sum(hidden * pad_mask, 1))
hidden = self.pool(hidden).squeeze(2).transpose([0, 2, 1])
pad_mask = (ids != 0).cast('float32').unsqueeze(-1)
hidden = F.softsign((hidden * pad_mask).sum(1))
logits = self.fc(hidden)
if labels is not None:
if len(labels.shape)==1:
labels = L.reshape(labels, [-1, 1])
loss = L.softmax_with_cross_entropy(logits, labels)
loss = L.reduce_mean(loss)
if len(labels.shape) == 1:
labels = labels.reshape([-1, 1])
loss = F.cross_entropy(logits, labels).mean()
else:
loss = None
return loss, logits
def KL(pred, target):
pred = L.log(L.softmax(pred))
target = L.softmax(target)
loss = L.kldiv_loss(pred, target)
pred = F.log_softmax(pred)
target = F.softmax(target)
loss = F.kl_div(pred, target)
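# note: F.kl_div expects log-probabilities as its first argument and plain probabilities
# as its second, so this measures how far the student's predicted distribution is from the teacher's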
return loss
teacher_model.eval()
model = BOW()
g_clip = F.clip.GradientClipByGlobalNorm(1.0) #experimental
opt = AdamW(learning_rate=LR, parameter_list=model.parameters(), weight_decay=0.01, grad_clip=g_clip)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
lr_scheduler = P.optimizer.lr.LambdaDecay(
LR,
get_warmup_and_linear_decay(9600 * EPOCH / BATCH,
9600 * EPOCH * 0.1 / BATCH))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=0.01,
grad_clip=g_clip)
model.train()
for epoch in range(EPOCH):
for step, (ids_student, ids, sids, label) in enumerate(train_ds.start(place)):
for epoch in range(EPOCH - 1):
for step, (
ids_student, ids, sids, label
) in enumerate(P.io.DataLoader(
train_ds, places=place, batch_size=None)):
with P.no_grad():
_, logits_t = teacher_model(ids, sids)  # teacher logits
logits_t.stop_gradient = True
_, logits_s = model(ids_student)  # student logits
loss_ce, _ = model(ids_student, labels=label)
loss_kd = KL(logits_s, logits_t)  # KL divergence measures the distance between the two distributions
loss_kd = KL(logits_s, logits_t.detach())  # KL divergence measures the distance between the two distributions
loss = loss_ce + loss_kd
loss.backward()
if step % 10 == 0:
print('[step %03d] distill train loss %.5f lr %.3e' % (step, loss.numpy(), opt.current_step_lr()))
opt.minimize(loss)
opt.step()
lr_scheduler.step()
model.clear_gradients()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l, _lr)
print(msg)
f1 = evaluate_student(model, dev_ds)
print('student f1 %.5f' % f1)
# finally, one more round of hard-label training to consolidate the result
for step, (ids_student, ids, sids, label) in enumerate(train_ds.start(place)):
for step, (
ids_student, ids, sids, label
) in enumerate(P.io.DataLoader(
train_ds, places=place, batch_size=None)):
loss, _ = model(ids_student, labels=label)
loss.backward()
if step % 10 == 0:
print('[step %03d] train loss %.5f lr %.3e' % (step, loss.numpy(), opt.current_step_lr()))
opt.minimize(loss)
opt.step()
model.clear_gradients()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l, _lr)
print(msg)
f1 = evaluate_student(model, dev_ds)
print('final f1 %.5f' % f1)
......@@ -11,204 +11,258 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
import os
import re
import time
import logging
from random import random
import json
from random import random
from functools import reduce, partial
from visualdl import LogWriter
import numpy as np
import multiprocessing
import tempfile
import re
import paddle
import paddle.fluid as F
import paddle.fluid.layers as L
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from ernie.optimization import optimization
#import utils.data
import logging
import argparse
from pathlib import Path
import paddle as P
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
def model_fn(features, mode, params, run_config):
ernie = ErnieModelForSequenceClassification(params, name='')
if mode is not propeller.RunMode.TRAIN:
ernie.eval()
metrics, loss = None, None
if mode is propeller.RunMode.PREDICT:
src_ids, sent_ids = features
_, logits = ernie(src_ids, sent_ids)
predictions = [logits,]
else:
src_ids, sent_ids, labels = features
if mode is propeller.RunMode.EVAL:
loss, logits = ernie(src_ids, sent_ids, labels=labels)
pred = L.argmax(logits, axis=1)
acc = propeller.metrics.Acc(labels, pred)
metrics = {'acc': acc}
predictions = [pred]
else:
loss, logits = ernie(src_ids, sent_ids, labels=labels)
scheduled_lr, _ = optimization(
loss=loss,
warmup_steps=int(run_config.max_steps * params['warmup_proportion']),
num_train_steps=run_config.max_steps,
learning_rate=params['learning_rate'],
train_program=F.default_main_program(),
startup_prog=F.default_startup_program(),
use_fp16=params.use_fp16,
weight_decay=params['weight_decay'],
scheduler="linear_warmup_decay",
)
propeller.summary.scalar('lr', scheduled_lr)
predictions = [logits,]
return propeller.ModelSpec(loss=loss, mode=mode, metrics=metrics, predictions=predictions)
if __name__ == '__main__':
parser = propeller.ArgumentParser('DAN model with Paddle')
parser.add_argument('--do_predict', action='store_true')
parser.add_argument('--max_seqlen', type=int, default=128)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--from_pretrained', type=str, required=True)
parser.add_argument('--warm_start_from', type=str)
parser.add_argument('--epoch', type=int, default=3)
parser.add_argument('--use_fp16', action='store_true')
args = parser.parse_args()
if not os.path.exists(args.from_pretrained):
raise ValueError('--from_pretrained not found: %s' % args.from_pretrained)
cfg_file_path = os.path.join(args.from_pretrained, 'ernie_config.json')
param_path = os.path.join(args.from_pretrained, 'params')
vocab_path = os.path.join(args.from_pretrained, 'vocab.txt')
assert os.path.exists(cfg_file_path) and os.path.exists(param_path) and os.path.exists(vocab_path)
hparams_cli = propeller.parse_hparam(args)
hparams_config_file = json.loads(open(cfg_file_path).read())
default_hparams = propeller.HParams(
batch_size=32,
num_labels=3,
warmup_proportion=0.1,
learning_rate=5e-5,
weight_decay=0.01,
use_task_id=False,
use_fp16=args.use_fp16,
)
hparams = default_hparams.join(propeller.HParams(**hparams_config_file)).join(hparams_cli)
default_run_config=dict(
max_steps=args.epoch * 390000 / hparams.batch_size,
save_steps=1000,
log_steps=10,
max_ckpt=1,
skip_steps=0,
model_dir=tempfile.mkdtemp(),
eval_steps=100)
run_config = dict(default_run_config, **json.loads(args.run_config))
run_config = propeller.RunConfig(**run_config)
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
unk_id = tokenizer.vocab['[UNK]']
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
if not args.do_predict:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('title', unk_id=unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.TextColumn('comment', unk_id=unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn('label', vocab_dict={
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
#from ernie.optimization import AdamW, LinearDecay
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
parser = argparse.ArgumentParser('classify model with ERNIE')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument(
'--max_seqlen',
type=int,
default=128,
help='max sentence length, should not be greater than 512')
parser.add_argument(
'--bsz',
type=int,
default=128,
help='global batch size for each optimizer step')
parser.add_argument(
'--micro_bsz',
type=int,
default=32,
help='batch size for each device; if `--bsz` > `--micro_bsz` * num_device, gradient accumulation will be used'
)
parser.add_argument('--epoch', type=int, default=3, help='epoch')
parser.add_argument(
'--data_dir',
type=str,
required=True,
help='data directory includes train / develop data')
parser.add_argument(
'--use_lr_decay',
action='store_true',
help='if set, learning rate will decay to zero at `max_steps`')
parser.add_argument(
'--warmup_proportion',
type=float,
default=0.1,
help='if use_lr_decay is set, '
'learning rate will rise to `lr` at `warmup_proportion` * `max_steps` and decay to 0. at `max_steps`'
)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument(
'--inference_model_dir',
type=Path,
default=None,
help='inference model output directory')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--max_steps',
type=int,
default=None,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument(
'--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument(
'--init_checkpoint',
type=str,
default=None,
help='checkpoint to warm start from')
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
args = parser.parse_args()
if args.bsz > args.micro_bsz:
assert args.bsz % args.micro_bsz == 0, 'cannot perform gradient accumulate with bsz:%d micro_bsz:%d' % (
args.bsz, args.micro_bsz)
acc_step = args.bsz // args.micro_bsz
log.info(
'performing gradient accumulate: global_bsz:%d, micro_bsz:%d, accumulate_steps:%d'
% (args.bsz, args.micro_bsz, acc_step))
args.bsz = args.micro_bsz
else:
acc_step = 1
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'seg_b',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn(
'label',
vocab_dict={
b"contradictory": 0,
b"contradiction": 0,
b"entailment": 1,
b"neutral": 2,
}),
])
])
def map_fn(seg_a, seg_b, label):
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
#label = np.expand_dims(label, -1) #
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=True, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size)
.padded_batch(args.bsz, (0, 0, 0))
test_ds = feature_column.build_dataset('test', data_dir=os.path.join(args.data_dir, 'test'), shuffle=False, repeat=False, use_gz=False) \
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size) \
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
test_ds.data_shapes = shapes
test_ds.data_types = types
varname_to_warmstart = re.compile(r'^encoder.*[wb]_0$|^.*embedding$|^.*bias$|^.*scale$|^pooled_fc.[wb]_0$')
ws = propeller.WarmStartSetting(
predicate_fn=lambda v: varname_to_warmstart.match(v.name) and os.path.exists(os.path.join(param_path, v.name)),
from_dir=param_path,
)
best_exporter = propeller.train.exporter.BestExporter(os.path.join(run_config.model_dir, 'best'), cmp_fn=lambda old, new: new['dev']['acc'] > old['dev']['acc'])
propeller.train.train_and_eval(
model_class_or_model_fn=model_fn,
params=hparams,
run_config=run_config,
train_dataset=train_ds,
eval_dataset={'dev': dev_ds, 'test': test_ds},
warm_start_setting=ws,
exporters=[best_exporter])
print('dev_acc3\t%.5f\ntest_acc3\t%.5f' % (best_exporter._best['dev']['acc'], best_exporter._best['test']['acc']))
.padded_batch(args.bsz, (0, 0, 0))
place = P.CUDAPlace(0)
model = ErnieModelForSequenceClassification.from_pretrained(
args.from_pretrained, num_labels=3, name='')
if args.init_checkpoint is not None:
log.info('loading checkpoint from %s' % args.init_checkpoint)
sd = P.load(args.init_checkpoint)
model.set_state_dict(sd)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
param_name_to_exclue_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
if args.use_lr_decay:
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(
args.max_steps, int(args.warmup_proportion * args.max_steps)))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=args.wd,
apply_decay_param_fun=lambda n: not param_name_to_exclue_from_weight_decay.match(n),
grad_clip=g_clip)
else:
lr_scheduler = None
opt = P.optimizer.AdamW(
args.lr,
parameters=model.parameters(),
weight_decay=args.wd,
apply_decay_param_fun=lambda n: not param_name_to_exclue_from_weight_decay.match(n),
grad_clip=g_clip)
scaler = P.amp.GradScaler(enable=args.use_amp)
step, inter_step = 0, 0
with LogWriter(
logdir=str(create_if_not_exists(args.save_dir / 'vdl'))) as log_writer:
with P.amp.auto_cast(enable=args.use_amp):
for epoch in range(args.epoch):
for ids, sids, label in P.io.DataLoader(
train_ds, places=P.CUDAPlace(0), batch_size=None):
inter_step += 1
loss, _ = model(ids, sids, labels=label)
loss /= acc_step
loss = scaler.scale(loss)
loss.backward()
if inter_step % acc_step != 0:
continue
step += 1
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler and lr_scheduler.step()
if step % 10 == 0:
_lr = lr_scheduler.get_lr(
) if args.use_lr_decay else args.lr
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[step-%d] train loss %.5f lr %.3e scaling %.3e' % (
step, _l, _lr, scaler._scale.numpy())
else:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('title', unk_id=unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.TextColumn('comment', unk_id=unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
])
def map_fn(seg_a, seg_b):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments
predict_ds = feature_column.build_dataset_from_stdin('predict') \
.map(map_fn) \
.padded_batch(hparams.batch_size) \
predict_ds.data_shapes = shapes[: -1]
predict_ds.data_types = types[: -1]
est = propeller.Learner(model_fn, run_config, hparams)
for res, in est.predict(predict_ds, ckpt=-1):
print('%d\t%.5f\t%.5f\t%.5f' % (np.argmax(res), res[0], res[1], res[2]))
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l,
_lr)
log.debug(msg)
log_writer.add_scalar('loss', _l, step=step)
log_writer.add_scalar('lr', _lr, step=step)
if step % 100 == 0:
acc = []
with P.no_grad():
model.eval()
for ids, sids, label in P.io.DataLoader(
dev_ds, places=P.CUDAPlace(0),
batch_size=None):
loss, logits = model(ids, sids, labels=label)
#print('\n'.join(map(str, logits.numpy().tolist())))
a = (logits.argmax(-1) == label)
acc.append(a.numpy())
model.train()
acc = np.concatenate(acc).mean()
log_writer.add_scalar('eval/acc', acc, step=step)
log.debug('acc %.5f' % acc)
if args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
if args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
if args.inference_model_dir is not None:
class InferenceModel(ErnieModelForSequenceClassification):
def forward(self, ids, sids):
_, logits = super(InferenceModel, self).forward(ids, sids)
return logits
model.__class__ = InferenceModel
log.debug('saving inference model')
src_placeholder = P.zeros([2, 2], dtype='int64')
sent_placehodler = P.zeros([2, 2], dtype='int64')
_, static = P.jit.TracedLayer.trace(
model, inputs=[src_placeholder, sent_placehodler])
static.save_inference_model(str(args.inference_model_dir))
#class InferenceModel(ErnieModelForSequenceClassification):
# @P.jit.to_static
# def forward(self, ids, sids):
# _, logits = super(InferenceModel, self).forward(ids, sids, labels=None)
# return logits
#model.__class__ = InferenceModel
#src_placeholder = P.zeros([2, 2], dtype='int64')
#sent_placehodler = P.zeros([2, 2], dtype='int64')
#P.jit.save(model, args.inference_model_dir, input_var=[src_placeholder, sent_placehodler])
log.debug('done')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import time
import logging
import json
import re
from random import random
from functools import reduce, partial
import numpy as np
import logging
#from visualdl import LogWriter
from pathlib import Path
import paddle as P
from propeller import log
import propeller.paddle as propeller
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
#from ernie.optimization import AdamW, LinearDecay
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
parser = propeller.ArgumentParser('classify model with ERNIE')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument(
'--max_seqlen',
type=int,
default=128,
help='max sentence length, should not be greater than 512')
parser.add_argument('--bsz', type=int, default=32, help='batchsize')
parser.add_argument(
'--data_dir',
type=str,
required=True,
help='data directory includes train / develop data')
parser.add_argument(
'--max_steps',
type=int,
required=True,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument(
'--init_checkpoint',
type=str,
default=None,
help='checkpoint to warm start from')
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
args = parser.parse_args()
env = P.distributed.ParallelEnv()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'seg_b',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn(
'label', vocab_dict={
b"0": 0,
b"1": 1,
b"2": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'),
shuffle=True, repeat=True, use_gz=False, shard=True) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'),
shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
P.distributed.init_parallel_env()
model = ErnieModelForSequenceClassification.from_pretrained(
args.from_pretrained, num_labels=3, name='')
if args.init_checkpoint is not None:
log.info('loading checkpoint from %s' % args.init_checkpoint)
sd = P.load(args.init_checkpoint)
model.set_state_dict(sd)
model = P.DataParallel(model)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
param_name_to_exclue_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(args.max_steps,
int(args.warmup_proportion * args.max_steps)))
opt = P.optimizer.AdamW(
learning_rate=lr_scheduler,
parameters=model.parameters(),
apply_decay_param_fun=lambda n: not param_name_to_exclue_from_weight_decay.match(n),
weight_decay=args.wd,
grad_clip=g_clip)
scaler = P.amp.GradScaler(enable=args.use_amp)
step = 0
create_if_not_exists(args.save_dir)
#with LogWriter(logdir=str(create_if_not_exists(args.save_dir / 'vdl-%d' % env.dev_id))) as log_writer:
with P.amp.auto_cast(enable=args.use_amp):
for ids, sids, label in P.io.DataLoader(
train_ds, places=P.CUDAPlace(env.dev_id), batch_size=None):
step += 1
loss, _ = model(ids, sids, labels=label)
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
# do logging
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e scaling %.3e' % (
env.dev_id, step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e' % (
env.dev_id, step, _l, _lr)
log.debug(msg)
#log_writer.add_scalar('loss', _l, step=step)
#log_writer.add_scalar('lr', _lr, step=step)
# do saving
if step % 100 == 0 and env.dev_id == 0:
acc = []
with P.no_grad():
model.eval()
for d in P.io.DataLoader(
dev_ds, places=P.CUDAPlace(env.dev_id),
batch_size=None):
ids, sids, label = d
loss, logits = model(ids, sids, labels=label)
a = (logits.argmax(-1) == label)
acc.append(a.numpy())
model.train()
acc = np.concatenate(acc).mean()
#log_writer.add_scalar('eval/acc', acc, step=step)
log.debug('acc %.5f' % acc)
if args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
# exit
if step > args.max_steps:
break
if args.save_dir is not None and env.dev_id == 0:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
log.debug('done')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import json
from random import random
from tqdm import tqdm
from functools import reduce, partial
import numpy as np
import logging
import argparse
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as FD
import paddle.fluid.layers as L
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from ernie.optimization import AdamW, LinearDecay
if __name__ == '__main__':
parser = argparse.ArgumentParser('classify model with ERNIE')
parser.add_argument('--from_pretrained', type=str, required=True, help='pretrained model directory or tag')
parser.add_argument('--max_seqlen', type=int, default=128, help='max sentence length, should not be greater than 512')
parser.add_argument('--bsz', type=int, default=32, help='batchsize')
parser.add_argument('--epoch', type=int, default=3, help='epoch')
parser.add_argument('--data_dir', type=str, required=True, help='data directory includes train / develop data')
parser.add_argument('--use_lr_decay', action='store_true', help='if set, learning rate will decay to zero at `max_steps`')
parser.add_argument('--warmup_proportion', type=float, default=0.1, help='if use_lr_decay is set, '
'learning rate will rise to `lr` at `warmup_proportion` * `max_steps` and decay to 0. at `max_steps`')
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument('--inference_model_dir', type=str, default=None, help='inference model output directory')
parser.add_argument('--save_dir', type=str, default=None, help='model output directory')
parser.add_argument('--max_steps', type=int, default=None, help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument('--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument('--init_checkpoint', type=str, default=None, help='checkpoint to warm start from')
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('seg_a', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.TextColumn('seg_b', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn('label', vocab_dict={
b"contradictory": 0,
b"contradiction": 0,
b"entailment": 1,
b"neutral": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
place = F.CUDAPlace(0)
with FD.guard(place):
model = ErnieModelForSequenceClassification.from_pretrained(args.from_pretrained, num_labels=3, name='')
if args.init_checkpoint is not None:
log.info('loading checkpoint from %s' % args.init_checkpoint)
sd, _ = FD.load_dygraph(args.init_checkpoint)
model.set_dict(sd)
g_clip = F.clip.GradientClipByGlobalNorm(1.0) #experimental
if args.use_lr_decay:
opt = AdamW(learning_rate=LinearDecay(args.lr, int(args.warmup_proportion * args.max_steps), args.max_steps), parameter_list=model.parameters(), weight_decay=args.wd, grad_clip=g_clip)
else:
opt = AdamW(args.lr, parameter_list=model.parameters(), weight_decay=args.wd, grad_clip=g_clip)
for epoch in range(args.epoch):
for step, d in enumerate(tqdm(train_ds.start(place), desc='training')):
ids, sids, label = d
loss, _ = model(ids, sids, labels=label)
loss.backward()
if step % 10 == 0:
log.debug('train loss %.5f lr %.3e' % (loss.numpy(), opt.current_step_lr()))
opt.minimize(loss)
model.clear_gradients()
if step % 100 == 0:
acc = []
with FD.base._switch_tracer_mode_guard_(is_train=False):
model.eval()
for step, d in enumerate(tqdm(dev_ds.start(place), desc='evaluating %d' % epoch)):
ids, sids, label = d
loss, logits = model(ids, sids, labels=label)
#print('\n'.join(map(str, logits.numpy().tolist())))
a = L.argmax(logits, -1) == label
acc.append(a.numpy())
model.train()
log.debug('acc %.5f' % np.concatenate(acc).mean())
if args.save_dir is not None:
F.save_dygraph(model.state_dict(), args.save_dir)
if args.inference_model_dir is not None:
log.debug('saving inference model')
class InferenceModel(ErnieModelForSequenceClassification):
def forward(self, *args, **kwargs):
_, logits = super(InferenceModel, self).forward(*args, **kwargs)
return logits
model.__class__ = InferenceModel #dynamically change the model type so that the forward output doesn't contain `None`
src_placeholder = FD.to_variable(np.ones([1, 1], dtype=np.int64))
sent_placehodler = FD.to_variable(np.zeros([1, 1], dtype=np.int64))
model(src_placeholder, sent_placehodler)
_, static_model = FD.TracedLayer.trace(model, inputs=[src_placeholder, sent_placehodler])
static_model.save_inference_model(args.inference_model_dir)
log.debug('done')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import json
from random import random
from tqdm import tqdm
from functools import reduce, partial
import numpy as np
import logging
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as FD
import paddle.fluid.layers as L
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from ernie.optimization import AdamW, LinearDecay
if __name__ == '__main__':
parser = propeller.ArgumentParser('classify model with ERNIE')
parser.add_argument('--from_pretrained', type=str, required=True, help='pretrained model directory or tag')
parser.add_argument('--max_seqlen', type=int, default=128, help='max sentence length, should not be greater than 512')
parser.add_argument('--bsz', type=int, default=32, help='batchsize')
parser.add_argument('--data_dir', type=str, required=True, help='data directory includes train / develop data')
parser.add_argument('--max_steps', type=int, required=True, help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument('--save_dir', type=str, default=None, help='model output directory')
parser.add_argument('--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument('--init_checkpoint', type=str, default=None, help='checkpoint to warm start from')
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('seg_a', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.TextColumn('seg_b', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn('label', vocab_dict={
b"0": 0,
b"1": 1,
b"2": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=False, repeat=True, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
train_ds = train_ds.shard(propeller.train.distribution.status.num_replica, propeller.train.distribution.status.replica_id)
log.debug('shard %d/%d'%(propeller.train.distribution.status.num_replica, propeller.train.distribution.status.replica_id))
train_ds = train_ds.shuffle(10000)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz, (0, 0, 0))
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
place = F.CUDAPlace(FD.parallel.Env().dev_id)
with FD.guard(place):
ctx = FD.parallel.prepare_context()
model = ErnieModelForSequenceClassification.from_pretrained(args.from_pretrained, num_labels=3, name='')
if args.init_checkpoint is not None:
log.info('loading checkpoint from %s' % args.init_checkpoint)
sd, _ = FD.load_dygraph(args.init_checkpoint)
model.set_dict(sd)
model = FD.parallel.DataParallel(model, ctx)
g_clip = F.clip.GradientClipByGlobalNorm(1.0) #experimental
opt = AdamW(learning_rate=LinearDecay(
args.lr,
int(args.warmup_proportion * args.max_steps),
args.max_steps),
parameter_list=model.parameters(),
weight_decay=args.wd,
grad_clip=g_clip)
for step, d in enumerate(tqdm(train_ds.start(place), desc='training')):
ids, sids, label = d
loss, _ = model(ids, sids, labels=label)
scaled_loss = model.scale_loss(loss)
scaled_loss.backward()
model.apply_collective_grads()
opt.minimize(scaled_loss)
model.clear_gradients()
if step % 10 == 0:
log.debug('train loss %.5f, lr %.3e' % (loss.numpy(), opt.current_step_lr()))
if step % 100 == 0 and FD.parallel.Env().dev_id == 0:
acc = []
with FD.base._switch_tracer_mode_guard_(is_train=False):
model.eval()
for step, d in enumerate(tqdm(dev_ds.start(place), desc='evaluating')):
ids, sids, label = d
loss, logits = model(ids, sids, labels=label)
#print('\n'.join(map(str, logits.numpy().tolist())))
a = L.argmax(logits, -1) == label
acc.append(a.numpy())
model.train()
log.debug('acc %.5f' % np.concatenate(acc).mean())
if step > args.max_steps:
break
if args.save_dir is not None:
F.save_dygraph(model.state_dict(), args.save_dir)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
import os
import re
import time
import logging
from random import random
import json
from functools import reduce, partial
import numpy as np
import multiprocessing
import tempfile
import re
import paddle as P
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from demo.optimization import optimization
#import utils.data
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
def model_fn(features, mode, params, run_config):
ernie = ErnieModelForSequenceClassification(params, name='')
if mode is not propeller.RunMode.TRAIN:
ernie.eval()
else:
ernie.train()
metrics, loss = None, None
if mode is propeller.RunMode.PREDICT:
src_ids, sent_ids = features
_, logits = ernie(src_ids, sent_ids)
predictions = [logits, ]
else:
src_ids, sent_ids, labels = features
if mode is propeller.RunMode.EVAL:
loss, logits = ernie(src_ids, sent_ids, labels=labels)
pred = logits.argmax(axis=1)
acc = propeller.metrics.Acc(labels, pred)
metrics = {'acc': acc}
predictions = [pred]
train_hooks = None
else:
loss, logits = ernie(src_ids, sent_ids, labels=labels)
lr_step_hook, loss_scale_coef = optimization(
loss=loss,
warmup_steps=int(run_config.max_steps *
params['warmup_proportion']),
num_train_steps=run_config.max_steps,
learning_rate=params['learning_rate'],
train_program=P.static.default_main_program(),
startup_prog=P.static.default_startup_program(),
use_fp16=args.use_amp,
weight_decay=params['weight_decay'],
scheduler="linear_warmup_decay", )
scheduled_lr = P.static.default_main_program().global_block().var(
'learning_rate_0')
propeller.summary.scalar('lr', scheduled_lr)
predictions = [logits, ]
train_hooks = [lr_step_hook]
return propeller.ModelSpec(
loss=loss,
mode=mode,
metrics=metrics,
predictions=predictions,
train_hooks=train_hooks)
if __name__ == '__main__':
parser = propeller.ArgumentParser('DAN model with Paddle')
parser.add_argument('--do_predict', action='store_true')
parser.add_argument('--max_seqlen', type=int, default=128)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--from_pretrained', type=str, required=True)
parser.add_argument('--warm_start_from', type=str)
parser.add_argument('--epoch', type=int, default=3)
parser.add_argument('--use_amp', action='store_true')
args = parser.parse_args()
P.enable_static()
if not os.path.exists(args.from_pretrained):
raise ValueError('--from_pretrained not found: %s' %
args.from_pretrained)
cfg_file_path = os.path.join(args.from_pretrained, 'ernie_config.json')
param_path = os.path.join(args.from_pretrained, 'params')
vocab_path = os.path.join(args.from_pretrained, 'vocab.txt')
assert os.path.exists(cfg_file_path) and os.path.exists(
param_path) and os.path.exists(vocab_path)
hparams_cli = propeller.parse_hparam(args)
hparams_config_file = json.loads(open(cfg_file_path).read())
default_hparams = propeller.HParams(
batch_size=32,
num_labels=3,
warmup_proportion=0.1,
learning_rate=5e-5,
weight_decay=0.01,
use_task_id=False,
use_fp16=args.use_amp)
hparams = default_hparams.join(propeller.HParams(
**hparams_config_file)).join(hparams_cli)
default_run_config = dict(
max_steps=args.epoch * 390000 / hparams.batch_size,
save_steps=1000,
log_steps=10,
max_ckpt=1,
skip_steps=0,
model_dir=tempfile.mkdtemp(),
eval_steps=100)
run_config = dict(default_run_config, **json.loads(args.run_config))
run_config = propeller.RunConfig(**run_config)
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
unk_id = tokenizer.vocab['[UNK]']
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
if not args.do_predict:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'title',
unk_id=unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'comment',
unk_id=unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn(
'label',
vocab_dict={
b"contradictory": 0,
b"contradiction": 0,
b"entailment": 1,
b"neutral": 2,
}),
])
def map_fn(seg_a, seg_b, label):
seg_a, seg_b = tokenizer.truncate(
seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
#label = np.expand_dims(label, -1) #
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=True, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size)
test_ds = feature_column.build_dataset('test', data_dir=os.path.join(args.data_dir, 'test'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(hparams.batch_size) \
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
test_ds.data_shapes = shapes
test_ds.data_types = types
varname_to_warmstart = re.compile(
r'^encoder.*[wb]_0$|^.*embedding$|^.*bias$|^.*scale$|^pooled_fc.[wb]_0$'
)
ws = propeller.WarmStartSetting(
predicate_fn=lambda v: varname_to_warmstart.match(v.name) and os.path.exists(os.path.join(param_path, v.name)),
from_dir=param_path,
)
best_exporter = propeller.train.exporter.BestExporter(
os.path.join(run_config.model_dir, 'best'),
cmp_fn=lambda old, new: new['dev']['acc'] > old['dev']['acc'])
propeller.train.train_and_eval(
model_class_or_model_fn=model_fn,
params=hparams,
run_config=run_config,
train_dataset=train_ds,
eval_dataset={'dev': dev_ds,
'test': test_ds},
warm_start_setting=ws,
exporters=[best_exporter])
print('dev_acc3\t%.5f\ntest_acc3\t%.5f' %
(best_exporter._best['dev']['acc'],
best_exporter._best['test']['acc']))
else:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'title',
unk_id=unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.TextColumn(
'comment',
unk_id=unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
])
def map_fn(seg_a, seg_b):
seg_a, seg_b = tokenizer.truncate(
seg_a, seg_b, seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, seg_b)
return sentence, segments
predict_ds = feature_column.build_dataset_from_stdin('predict') \
.map(map_fn) \
.padded_batch(hparams.batch_size) \
predict_ds.data_shapes = shapes[:-1]
predict_ds.data_types = types[:-1]
est = propeller.Learner(model_fn, run_config, hparams)
for res, in est.predict(predict_ds, ckpt=-1):
print('%d\t%.5f\t%.5f\t%.5f' %
(np.argmax(res), res[0], res[1], res[2]))
......@@ -17,50 +17,56 @@ from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import os
import re
import time
import logging
import json
from pathlib import Path
from random import random
from tqdm import tqdm
from functools import reduce, partial
import pickle
import argparse
from functools import partial
from io import open
import numpy as np
import logging
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as D
import paddle.fluid.layers as L
import paddle as P
from propeller import log
import propeller.paddle as propeller
from ernie.modeling_ernie import ErnieModel, ErnieModelForQuestionAnswering
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from ernie.optimization import AdamW, LinearDecay
#from ernie.optimization import AdamW, LinearDecay
from demo.mrc import mrc_reader
from demo.mrc import mrc_metrics
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
def evaluate(model, ds, all_examples, all_features, tokenizer, args):
dev_file = json.loads(open(args.dev_file).read())
with D.base._switch_tracer_mode_guard_(is_train=False):
dev_file = json.loads(open(args.dev_file, encoding='utf8').read())
with P.no_grad():
log.debug('start eval')
model.eval()
all_res = []
for step, (uids, token_ids, token_type_ids, _, __) in enumerate(ds.start(place)):
_ , start_logits, end_logits = model(token_ids, token_type_ids)
res = [mrc_metrics.RawResult(unique_id=u, start_logits=s, end_logits=e)
for u, s, e in zip(uids.numpy(), start_logits.numpy(), end_logits.numpy())]
for step, (uids, token_ids, token_type_ids, _, __) in enumerate(
P.io.DataLoader(
ds, places=P.CUDAPlace(env.dev_id), batch_size=None)):
_, start_logits, end_logits = model(token_ids, token_type_ids)
res = [
mrc_metrics.RawResult(
unique_id=u, start_logits=s, end_logits=e)
for u, s, e in zip(uids.numpy(),
start_logits.numpy(), end_logits.numpy())
]
all_res += res
        with open('all_res', 'wb') as f:  # dump raw results to disk for inspection
            pickle.dump(all_res, f)
all_pred, all_nbests = mrc_metrics.make_results(
......@@ -77,54 +83,121 @@ def evaluate(model, ds, all_examples, all_features, tokenizer, args):
return f1, em
def train(model, train_dataset, dev_dataset, dev_examples, dev_features, tokenizer, args):
ctx = D.parallel.prepare_context()
model = D.parallel.DataParallel(model, ctx)
def train(model, train_dataset, dev_dataset, dev_examples, dev_features,
tokenizer, args):
model = P.DataParallel(model)
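    # number of optimization steps used for the LR schedule and the stop condition; `train_features` is the module-level list built in __main__ below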
max_steps = len(train_features) * args.epoch // args.bsz
g_clip = F.clip.GradientClipByGlobalNorm(1.0) #experimental
opt = AdamW(learning_rate=args.lr, parameter_list=model.parameters(), weight_decay=args.wd, grad_clip=g_clip)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(max_steps,
int(args.warmup_proportion * max_steps)))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=args.wd,
grad_clip=g_clip)
train_dataset = train_dataset \
.repeat() \
.shard(D.parallel.Env().nranks, D.parallel.Env().dev_id) \
.shuffle(1000) \
.cache_shuffle_shard(env.nranks, env.dev_id, drop_last=True) \
.padded_batch(args.bsz)
log.debug('init training with args: %s' % repr(args))
for step, (_, token_ids, token_type_ids, start_pos, end_pos) in enumerate(train_dataset.start(place)):
loss, _, __ = model(token_ids, token_type_ids, start_pos=start_pos, end_pos=end_pos)
scaled_loss = model.scale_loss(loss)
scaled_loss.backward()
model.apply_collective_grads()
opt.minimize(scaled_loss)
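    # GradScaler handles AMP loss scaling; with enable=False it passes the loss through unscaled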
scaler = P.amp.GradScaler(enable=args.use_amp)
create_if_not_exists(args.save_dir)
with P.amp.auto_cast(enable=args.use_amp):
for step, (_, token_ids, token_type_ids, start_pos,
end_pos) in enumerate(
P.io.DataLoader(
train_dataset,
places=P.CUDAPlace(env.dev_id),
batch_size=None)):
loss, _, __ = model(
token_ids,
token_type_ids,
start_pos=start_pos,
end_pos=end_pos)
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
if D.parallel.Env().dev_id == 0 and step % 10 == 0:
log.debug('[step %d] train loss %.5f lr %.3e' % (step, loss.numpy(), opt.current_step_lr()))
if D.parallel.Env().dev_id == 0 and step % 100 == 0:
f1, em = evaluate(model, dev_dataset, dev_examples, dev_features, tokenizer, args)
log.debug('[step %d] eval result: f1 %.5f em %.5f' % (step, f1, em))
lr_scheduler.step()
if env.dev_id == 0 and step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e scaling %.3e' % (
env.dev_id, step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e' % (
env.dev_id, step, _l, _lr)
log.debug(msg)
if env.dev_id == 0 and step % 100 == 0:
f1, em = evaluate(model, dev_dataset, dev_examples,
dev_features, tokenizer, args)
log.debug('[step %d] eval result: f1 %.5f em %.5f' %
(step, f1, em))
if env.dev_id == 0 and args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
if step > max_steps:
break
if __name__ == "__main__":
parser = argparse.ArgumentParser('MRC model with ERNIE')
parser.add_argument('--from_pretrained', type=str, required=True, help='pretrained model directory or tag')
    parser.add_argument('--max_seqlen', type=int, default=512, help='max sentence length; should not be greater than 512')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument(
'--max_seqlen',
type=int,
default=512,
        help='max sentence length; should not be greater than 512')
parser.add_argument('--bsz', type=int, default=8, help='batchsize')
parser.add_argument('--epoch', type=int, default=2, help='epoch')
    parser.add_argument('--train_file', type=str, required=True, help='path to the training data file')
    parser.add_argument('--dev_file', type=str, required=True, help='path to the development data file')
parser.add_argument(
'--train_file',
type=str,
required=True,
        help='path to the training data file')
parser.add_argument(
'--dev_file',
type=str,
required=True,
        help='path to the development data file')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=3e-5, help='learning rate')
parser.add_argument('--save_dir', type=str, default=None, help='model output directory')
parser.add_argument('--n_best_size', type=int, default=20, help='nbest prediction to keep')
parser.add_argument('--max_answer_length', type=int, default=100, help='max answer span')
parser.add_argument('--wd', type=float, default=0.00, help='weight decay, aka L2 regularizer')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--n_best_size', type=int, default=20, help='nbest prediction to keep')
parser.add_argument(
'--max_answer_length', type=int, default=100, help='max answer span')
parser.add_argument(
'--wd',
type=float,
default=0.01,
help='weight decay, aka L2 regularizer')
parser.add_argument(
'--use_amp',
action='store_true',
        help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
args = parser.parse_args()
env = P.distributed.ParallelEnv()
P.distributed.init_parallel_env()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
if not os.path.exists(args.train_file):
......@@ -134,43 +207,41 @@ if __name__ == "__main__":
log.info('making train/dev data...')
train_examples = mrc_reader.read_files(args.train_file, is_training=True)
train_features = mrc_reader.convert_example_to_features(train_examples, args.max_seqlen, tokenizer, is_training=True)
train_features = mrc_reader.convert_example_to_features(
train_examples, args.max_seqlen, tokenizer, is_training=True)
dev_examples = mrc_reader.read_files(args.dev_file, is_training=False)
dev_features = mrc_reader.convert_example_to_features(dev_examples, args.max_seqlen, tokenizer, is_training=False)
dev_features = mrc_reader.convert_example_to_features(
dev_examples, args.max_seqlen, tokenizer, is_training=False)
log.info('train examples: %d, features: %d' % (len(train_examples), len(train_features)))
log.info('train examples: %d, features: %d' %
(len(train_examples), len(train_features)))
def map_fn(unique_id, example_index, doc_span_index, tokens, token_to_orig_map, token_is_max_context, token_ids, position_ids, text_type_ids, start_position, end_position):
def map_fn(unique_id, example_index, doc_span_index, tokens,
token_to_orig_map, token_is_max_context, token_ids,
position_ids, text_type_ids, start_position, end_position):
if start_position is None:
start_position = 0
if end_position is None:
end_position = 0
return np.array(unique_id), np.array(token_ids), np.array(text_type_ids), np.array(start_position), np.array(end_position)
return np.array(unique_id), np.array(token_ids), np.array(
text_type_ids), np.array(start_position), np.array(end_position)
train_dataset = propeller.data.Dataset.from_list(train_features).map(map_fn)
train_dataset = propeller.data.Dataset.from_list(train_features).map(
map_fn)
dev_dataset = propeller.data.Dataset.from_list(dev_features).map(map_fn).padded_batch(args.bsz)
shapes = ([-1], [-1, args.max_seqlen], [-1, args.max_seqlen], [-1], [-1])
types = ('int64', 'int64', 'int64', 'int64', 'int64')
dev_dataset = propeller.data.Dataset.from_list(dev_features).map(
map_fn).padded_batch(args.bsz)
train_dataset.name = 'train'
dev_dataset.name = 'dev'
model = ErnieModelForQuestionAnswering.from_pretrained(
args.from_pretrained, name='')
train_dataset.data_shapes = shapes
train_dataset.data_types = types
dev_dataset.data_shapes = shapes
dev_dataset.data_types = types
train(model, train_dataset, dev_dataset, dev_examples, dev_features,
tokenizer, args)
place = F.CUDAPlace(D.parallel.Env().dev_id)
D.guard(place).__enter__()
model = ErnieModelForQuestionAnswering.from_pretrained(args.from_pretrained, name='')
train(model, train_dataset, dev_dataset, dev_examples, dev_features, tokenizer, args)
if D.parallel.Env().dev_id == 0:
f1, em = evaluate(model, dev_dataset, dev_examples, dev_features, tokenizer, args)
if env.dev_id == 0:
f1, em = evaluate(model, dev_dataset, dev_examples, dev_features,
tokenizer, args)
log.debug('final eval result: f1 %.5f em %.5f' % (f1, em))
if D.parallel.Env().dev_id == 0 and args.save_dir is not None:
F.save_dygraph(model.state_dict(), args.save_dir)
if env.dev_id == 0 and args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import six
import json
from random import random
from tqdm import tqdm
from collections import OrderedDict
from functools import reduce, partial
from pathlib import Path
from visualdl import LogWriter
import numpy as np
import multiprocessing
import pickle
import logging
from sklearn.metrics import f1_score
import paddle as P
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification, ErnieModelForTokenClassification
from ernie.tokenizing_ernie import ErnieTokenizer
#from ernie.optimization import AdamW, LinearDecay
parser = propeller.ArgumentParser('NER model with ERNIE')
parser.add_argument('--max_seqlen', type=int, default=256)
parser.add_argument('--bsz', type=int, default=32)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--epoch', type=int, default=6)
parser.add_argument(
'--warmup_proportion',
type=float,
default=0.1,
    help='if use_lr_decay is set, '
    'the learning rate will rise to `lr` at `warmup_proportion` * `max_steps` and then decay to 0 at `max_steps`'
)
parser.add_argument(
'--max_steps',
type=int,
required=True,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE, used in learning rate scheduler'
)
parser.add_argument(
'--use_amp',
action='store_true',
    help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
parser.add_argument('--from_pretrained', type=Path, required=True)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
def tokenizer_func(inputs):
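    # wordpiece-tokenize every '\2'-separated word and record, for each sub-token, the index of
    # the word it came from; both lists are returned as one flat list and split apart again in `before`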
ret = inputs.split(b'\2')
tokens, orig_pos = [], []
for i, r in enumerate(ret):
t = tokenizer.tokenize(r)
for tt in t:
tokens.append(tt)
orig_pos.append(i)
assert len(tokens) == len(orig_pos)
return tokens + orig_pos
def tokenizer_func_for_label(inputs):
return inputs.split(b'\2')
feature_map = {
b"B-PER": 0,
b"I-PER": 1,
b"B-ORG": 2,
b"I-ORG": 3,
b"B-LOC": 4,
b"I-LOC": 5,
b"O": 6,
}
other_tag_id = feature_map[b'O']
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'text_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer_func), propeller.data.TextColumn(
'label',
unk_id=other_tag_id,
vocab_dict=feature_map,
tokenizer=tokenizer_func_for_label, )
])
def before(seg, label):
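    # `seg` is the flat [sub-token ids | original word indices] sequence produced via tokenizer_func;
    # the word indices are used to align the word-level labels with the sub-token sequence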
seg, orig_pos = np.split(seg, 2)
aligned_label = label[orig_pos]
seg, _ = tokenizer.truncate(seg, [], args.max_seqlen)
aligned_label, _ = tokenizer.truncate(aligned_label, [], args.max_seqlen)
orig_pos, _ = tokenizer.truncate(orig_pos, [], args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(
seg
) #utils.data.build_1_pair(seg, max_seqlen=args.max_seqlen, cls_id=cls_id, sep_id=sep_id)
aligned_label = np.concatenate([[0], aligned_label, [0]], 0)
orig_pos = np.concatenate([[0], orig_pos, [0]])
assert len(aligned_label) == len(sentence) == len(orig_pos), (
        len(aligned_label), len(sentence), len(orig_pos))  # aligned
return sentence, segments, aligned_label, label, orig_pos
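# padded_batch pads ids/segments with 0, aligned labels with -100 (presumably the loss ignore_index), and raw labels with an out-of-range tag id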
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0,0,-100, other_tag_id + 1, 0)) \
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0,0,-100, other_tag_id + 1,0)) \
test_ds = feature_column.build_dataset('test', data_dir=os.path.join(args.data_dir, 'test'), shuffle=False, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0,0,-100, other_tag_id + 1,0)) \
def evaluate(model, dataset):
model.eval()
with P.no_grad():
chunkf1 = propeller.metrics.ChunkF1(None, None, None, len(feature_map))
for step, (ids, sids, aligned_label, label, orig_pos
) in enumerate(P.io.DataLoader(
dataset, batch_size=None)):
loss, logits = model(ids, sids)
#print('\n'.join(map(str, logits.numpy().tolist())))
assert orig_pos.shape[0] == logits.shape[0] == ids.shape[
0] == label.shape[0]
for pos, lo, la, id in zip(orig_pos.numpy(),
logits.numpy(),
label.numpy(), ids.numpy()):
_dic = OrderedDict()
assert len(pos) == len(lo) == len(id)
for _pos, _lo, _id in zip(pos, lo, id):
if _id > tokenizer.mask_id: # [MASK] is the largest special token
_dic.setdefault(_pos, []).append(_lo)
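                # average the logits of all sub-tokens that map back to the same original token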
merged_lo = np.array(
[np.array(l).mean(0) for _, l in six.iteritems(_dic)])
merged_preds = np.argmax(merged_lo, -1)
la = la[np.where(la != (other_tag_id + 1))] #remove pad
if len(la) > len(merged_preds):
log.warn(
'accuracy loss due to truncation: label len:%d, truncate to %d'
% (len(la), len(merged_preds)))
merged_preds = np.pad(merged_preds,
[0, len(la) - len(merged_preds)],
mode='constant',
constant_values=7)
else:
assert len(la) == len(
merged_preds
), 'expect label == prediction, got %d vs %d' % (
la.shape, merged_preds.shape)
chunkf1.update((merged_preds, la, np.array(len(la))))
#f1 = f1_score(np.concatenate(all_label), np.concatenate(all_pred), average='macro')
f1 = chunkf1.eval()
model.train()
return f1
model = ErnieModelForTokenClassification.from_pretrained(
args.from_pretrained,
num_labels=len(feature_map),
name='',
has_pooler=False)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
param_name_to_exclue_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(args.max_steps,
int(args.warmup_proportion * args.max_steps)))
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=args.wd,
apply_decay_param_fun=lambda n: not param_name_to_exclue_from_weight_decay.match(n),
grad_clip=g_clip)
scaler = P.amp.GradScaler(enable=args.use_amp)
with LogWriter(
logdir=str(create_if_not_exists(args.save_dir / 'vdl'))) as log_writer:
with P.amp.auto_cast(enable=args.use_amp):
for epoch in range(args.epoch):
for step, (
ids, sids, aligned_label, label, orig_pos
) in enumerate(P.io.DataLoader(
train_ds, batch_size=None)):
loss, logits = model(ids, sids, labels=aligned_label)
#loss, logits = model(ids, sids, labels=aligned_label, loss_weights=P.cast(ids != 0, 'float32'))
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[step-%d] train loss %.5f lr %.3e scaling %.3e' % (
step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (step, _l,
_lr)
log.debug(msg)
log_writer.add_scalar('loss', _l, step=step)
log_writer.add_scalar('lr', _lr, step=step)
if step % 100 == 0:
f1 = evaluate(model, dev_ds)
log.debug('eval f1: %.5f' % f1)
log_writer.add_scalar('eval/f1', f1, step=step)
if args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
f1 = evaluate(model, dev_ds)
log.debug('final eval f1: %.5f' % f1)
log_writer.add_scalar('eval/f1', f1, step=step)
if args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import six
import json
from random import random
from tqdm import tqdm
from collections import OrderedDict
from functools import reduce, partial
import numpy as np
import multiprocessing
import pickle
import logging
from sklearn.metrics import f1_score
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as FD
import paddle.fluid.layers as L
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification, ErnieModelForTokenClassification
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.optimization import AdamW, LinearDecay
if __name__ == '__main__':
parser = propeller.ArgumentParser('NER model with ERNIE')
parser.add_argument('--max_seqlen', type=int, default=256)
parser.add_argument('--bsz', type=int, default=32)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--epoch', type=int, default=6)
parser.add_argument('--warmup_proportion', type=float, default=0.1, help='if use_lr_decay is set, '
        'the learning rate will rise to `lr` at `warmup_proportion` * `max_steps` and then decay to 0 at `max_steps`')
parser.add_argument('--max_steps', type=int, required=True,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE, used in learning rate scheduler')
parser.add_argument('--from_pretrained', type=str, required=True)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument('--save_dir', type=str, default=None, help='model output directory')
parser.add_argument('--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
def tokenizer_func(inputs):
ret = inputs.split(b'\2')
tokens, orig_pos = [], []
for i, r in enumerate(ret):
t = tokenizer.tokenize(r)
for tt in t:
tokens.append(tt)
orig_pos.append(i)
assert len(tokens) == len(orig_pos)
return tokens + orig_pos
def tokenizer_func_for_label(inputs):
return inputs.split(b'\2')
feature_map = {
b"B-PER": 0,
b"I-PER": 1,
b"B-ORG": 2,
b"I-ORG": 3,
b"B-LOC": 4,
b"I-LOC": 5,
b"O": 6,
}
other_tag_id = feature_map[b'O']
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('text_a', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer_func),
propeller.data.TextColumn('label', unk_id=other_tag_id, vocab_dict=feature_map,
tokenizer=tokenizer_func_for_label,)
])
def before(seg, label):
seg, orig_pos = np.split(seg, 2)
aligned_label = label[orig_pos]
seg, _ = tokenizer.truncate(seg, [], args.max_seqlen)
aligned_label, _ = tokenizer.truncate(aligned_label, [], args.max_seqlen)
orig_pos, _ = tokenizer.truncate(orig_pos, [], args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg) #utils.data.build_1_pair(seg, max_seqlen=args.max_seqlen, cls_id=cls_id, sep_id=sep_id)
aligned_label = np.concatenate([[0], aligned_label, [0]], 0)
orig_pos = np.concatenate([[0], orig_pos, [0]])
        assert len(aligned_label) == len(sentence) == len(orig_pos), (len(aligned_label), len(sentence), len(orig_pos))  # aligned
return sentence, segments, aligned_label, label, orig_pos
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0,0,0, other_tag_id + 1, 0)) \
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0,0,0, other_tag_id + 1,0)) \
test_ds = feature_column.build_dataset('test', data_dir=os.path.join(args.data_dir, 'test'), shuffle=False, repeat=False, use_gz=False) \
.map(before) \
.padded_batch(args.bsz, (0,0,0, other_tag_id + 1,0)) \
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1, args.max_seqlen])
types = ('int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
test_ds.data_shapes = shapes
test_ds.data_types = types
place = F.CUDAPlace(0)
@FD.no_grad
def evaluate(model, dataset):
model.eval()
chunkf1 = propeller.metrics.ChunkF1(None, None, None, len(feature_map))
for step, (ids, sids, aligned_label, label, orig_pos) in enumerate(tqdm(dataset.start(place))):
loss, logits = model(ids, sids)
#print('\n'.join(map(str, logits.numpy().tolist())))
assert orig_pos.shape[0] == logits.shape[0] == ids.shape[0] == label.shape[0]
for pos, lo, la, id in zip(orig_pos.numpy(), logits.numpy(), label.numpy(), ids.numpy()):
_dic = OrderedDict()
assert len(pos) ==len(lo) == len(id)
for _pos, _lo, _id in zip(pos, lo, id):
if _id > tokenizer.mask_id: # [MASK] is the largest special token
_dic.setdefault(_pos, []).append(_lo)
merged_lo = np.array([np.array(l).mean(0) for _, l in six.iteritems(_dic)])
merged_preds = np.argmax(merged_lo, -1)
la = la[np.where(la != (other_tag_id + 1))] #remove pad
if len(la) > len(merged_preds):
log.warn('accuracy loss due to truncation: label len:%d, truncate to %d' % (len(la), len(merged_preds)))
merged_preds = np.pad(merged_preds, [0, len(la) - len(merged_preds)], mode='constant', constant_values=7)
else:
assert len(la) == len(merged_preds), 'expect label == prediction, got %d vs %d' % (la.shape, merged_preds.shape)
chunkf1.update((merged_preds, la, np.array(len(la))))
#f1 = f1_score(np.concatenate(all_label), np.concatenate(all_pred), average='macro')
f1 = chunkf1.eval()
model.train()
return f1
with FD.guard(place):
model = ErnieModelForTokenClassification.from_pretrained(args.from_pretrained, num_labels=len(feature_map), name='', has_pooler=False)
g_clip = F.clip.GradientClipByGlobalNorm(1.0) #experimental
opt = AdamW(
learning_rate=LinearDecay(args.lr, int(args.warmup_proportion * args.max_steps), args.max_steps),
parameter_list=model.parameters(),
weight_decay=args.wd, grad_clip=g_clip)
#opt = F.optimizer.AdamOptimizer(learning_rate=LinearDecay(args.lr, args.warmup_steps, args.max_steps), parameter_list=model.parameters())
for epoch in range(args.epoch):
for step, (ids, sids, aligned_label, label, orig_pos) in enumerate(tqdm(train_ds.start(place))):
loss, logits = model(ids, sids, labels=aligned_label, loss_weights=L.cast(ids > tokenizer.mask_id, 'float32')) # [MASK] is the largest special token
loss.backward()
if step % 10 == 0 :
log.debug('train loss %.5f, lr %.3e' % (loss.numpy(), opt.current_step_lr()))
opt.minimize(loss)
model.clear_gradients()
if step % 100 == 0 :
f1 = evaluate(model, dev_ds)
log.debug('eval f1: %.5f' % f1)
f1 = evaluate(model, dev_ds)
log.debug('final eval f1: %.5f' % f1)
if args.save_dir is not None:
F.save_dygraph(model.state_dict(), args.save_dir)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import json
from random import random
from tqdm import tqdm
from functools import reduce, partial
from pathlib import Path
from visualdl import LogWriter
import numpy as np
import logging
import argparse
import paddle as P
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
#from ernie.optimization import AdamW, LinearDecay
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
parser = argparse.ArgumentParser('classify model with ERNIE')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model directory or tag')
parser.add_argument(
'--max_seqlen',
type=int,
default=128,
    help='max sentence length; should not be greater than 512')
parser.add_argument('--bsz', type=int, default=32, help='batchsize')
parser.add_argument('--epoch', type=int, default=3, help='epoch')
parser.add_argument(
'--data_dir',
type=str,
required=True,
    help='data directory containing the train / dev data')
parser.add_argument(
'--max_steps',
type=int,
required=True,
help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument('--eval', action='store_true')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output directory')
parser.add_argument(
'--init_checkpoint',
type=str,
default=None,
help='checkpoint to warm start from')
parser.add_argument(
'--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
parser.add_argument(
'--use_amp',
action='store_true',
    help='only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices'
)
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
model = ErnieModelForSequenceClassification.from_pretrained(
args.from_pretrained, num_labels=3, name='')
if not args.eval:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn('label'),
])
def map_fn(seg_a, label):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, [])
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz)
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(
args.max_steps, int(args.warmup_proportion * args.max_steps)))
param_name_to_exclue_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
opt = P.optimizer.AdamW(
lr_scheduler,
parameters=model.parameters(),
weight_decay=args.wd,
apply_decay_param_fun=lambda n: not param_name_to_exclue_from_weight_decay.match(n),
grad_clip=g_clip)
scaler = P.amp.GradScaler(enable=args.use_amp)
with LogWriter(logdir=str(create_if_not_exists(args.save_dir /
'vdl'))) as log_writer:
with P.amp.auto_cast(enable=args.use_amp):
for epoch in range(args.epoch):
for step, d in enumerate(
P.io.DataLoader(
train_ds, places=P.CUDAPlace(0), batch_size=None)):
ids, sids, label = d
loss, _ = model(ids, sids, labels=label)
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[step-%d] train loss %.5f lr %.3e scaling %.3e' % (
step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[step-%d] train loss %.5f lr %.3e' % (
step, _l, _lr)
log.debug(msg)
log_writer.add_scalar('loss', _l, step=step)
log_writer.add_scalar('lr', _lr, step=step)
if step % 100 == 0:
acc = []
with P.no_grad():
model.eval()
for step, d in enumerate(
P.io.DataLoader(
dev_ds,
places=P.CUDAPlace(0),
batch_size=None)):
ids, sids, label = d
loss, logits = model(ids, sids, labels=label)
a = (logits.argmax(-1) == label)
acc.append(a.numpy())
model.train()
acc = np.concatenate(acc).mean()
log_writer.add_scalar('eval/acc', acc, step=step)
log.debug('acc %.5f' % acc)
if args.save_dir is not None:
P.save(model.state_dict(),
args.save_dir / 'ckpt.bin')
if args.save_dir is not None:
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
else:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn(
'seg_a',
unk_id=tokenizer.unk_id,
vocab_dict=tokenizer.vocab,
tokenizer=tokenizer.tokenize),
])
sd = P.load(args.init_checkpoint)
model.set_dict(sd)
model.eval()
def map_fn(seg_a):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, [])
return sentence, segments
predict_ds = feature_column.build_dataset_from_stdin('predict') \
.map(map_fn) \
.padded_batch(args.bsz)
for step, (ids, sids) in enumerate(
P.io.DataLoader(
predict_ds, places=P.CUDAPlace(0), batch_size=None)):
_, logits = model(ids, sids)
pred = logits.numpy().argmax(-1)
print('\n'.join(map(str, pred.tolist())))
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import time
import logging
import json
from random import random
from tqdm import tqdm
from functools import reduce, partial
import numpy as np
import logging
import argparse
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as FD
import paddle.fluid.layers as L
from propeller import log
import propeller.paddle as propeller
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
log = logging.getLogger()
#from model.bert import BertConfig, BertModelLayer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
from ernie.tokenizing_ernie import ErnieTokenizer, ErnieTinyTokenizer
from ernie.optimization import AdamW, LinearDecay
if __name__ == '__main__':
parser = argparse.ArgumentParser('classify model with ERNIE')
parser.add_argument('--from_pretrained', type=str, required=True, help='pretrained model directory or tag')
    parser.add_argument('--max_seqlen', type=int, default=128, help='max sentence length; should not be greater than 512')
parser.add_argument('--bsz', type=int, default=32, help='batchsize')
parser.add_argument('--epoch', type=int, default=3, help='epoch')
    parser.add_argument('--data_dir', type=str, required=True, help='data directory containing the train / dev data')
parser.add_argument('--max_steps', type=int, required=True, help='max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE')
parser.add_argument('--warmup_proportion', type=float, default=0.1)
parser.add_argument('--lr', type=float, default=5e-5, help='learning rate')
parser.add_argument('--eval', action='store_true')
parser.add_argument('--save_dir', type=str, default=None, help='model output directory')
parser.add_argument('--wd', type=float, default=0.01, help='weight decay, aka L2 regularizer')
args = parser.parse_args()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
#tokenizer = ErnieTinyTokenizer.from_pretrained(args.from_pretrained)
place = F.CUDAPlace(0)
with FD.guard(place):
model = ErnieModelForSequenceClassification.from_pretrained(args.from_pretrained, num_labels=3, name='')
if not args.eval:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('seg_a', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
propeller.data.LabelColumn('label'),
])
def map_fn(seg_a, label):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, [])
return sentence, segments, label
train_ds = feature_column.build_dataset('train', data_dir=os.path.join(args.data_dir, 'train'), shuffle=True, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz)
dev_ds = feature_column.build_dataset('dev', data_dir=os.path.join(args.data_dir, 'dev'), shuffle=False, repeat=False, use_gz=False) \
.map(map_fn) \
.padded_batch(args.bsz)
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen], [-1])
types = ('int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
dev_ds.data_shapes = shapes
dev_ds.data_types = types
g_clip = F.clip.GradientClipByGlobalNorm(1.0) #experimental
opt = AdamW(learning_rate=LinearDecay(
args.lr,
int(args.warmup_proportion * args.max_steps), args.max_steps),
parameter_list=model.parameters(),
weight_decay=args.wd,
grad_clip=g_clip)
for epoch in range(args.epoch):
for step, d in enumerate(tqdm(train_ds.start(place), desc='training')):
ids, sids, label = d
loss, _ = model(ids, sids, labels=label)
loss.backward()
if step % 10 == 0:
log.debug('train loss %.5f lr %.3e' % (loss.numpy(), opt.current_step_lr()))
opt.minimize(loss)
model.clear_gradients()
if step % 100 == 0:
acc = []
with FD.base._switch_tracer_mode_guard_(is_train=False):
model.eval()
for step, d in enumerate(tqdm(dev_ds.start(place), desc='evaluating %d' % epoch)):
ids, sids, label = d
loss, logits = model(ids, sids, labels=label)
#print('\n'.join(map(str, logits.numpy().tolist())))
a = L.argmax(logits, -1) == label
acc.append(a.numpy())
model.train()
log.debug('acc %.5f' % np.concatenate(acc).mean())
if args.save_dir is not None:
F.save_dygraph(model.state_dict(), args.save_dir)
else:
feature_column = propeller.data.FeatureColumns([
propeller.data.TextColumn('seg_a', unk_id=tokenizer.unk_id, vocab_dict=tokenizer.vocab, tokenizer=tokenizer.tokenize),
])
assert args.save_dir is not None
sd, _ = FD.load_dygraph(args.save_dir)
model.set_dict(sd)
model.eval()
def map_fn(seg_a):
seg_a, _ = tokenizer.truncate(seg_a, [], seqlen=args.max_seqlen)
sentence, segments = tokenizer.build_for_ernie(seg_a, [])
return sentence, segments
predict_ds = feature_column.build_dataset_from_stdin('predict') \
.map(map_fn) \
.padded_batch(args.bsz)
shapes = ([-1, args.max_seqlen], [-1, args.max_seqlen])
types = ('int64', 'int64')
predict_ds.data_shapes = shapes
predict_ds.data_types = types
for step, (ids, sids) in enumerate(predict_ds.start(place)):
_, logits = model(ids, sids)
pred = logits.numpy().argmax(-1)
print('\n'.join(map(str, pred.tolist())))
......@@ -17,7 +17,6 @@ from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import sys
import re
import six
......@@ -29,7 +28,8 @@ import nltk
import unicodedata
from collections import namedtuple
RawResult = namedtuple("RawResult", ["unique_id", "start_logits", "end_logits"])
RawResult = namedtuple("RawResult",
["unique_id", "start_logits", "end_logits"])
log = logging.getLogger(__name__)
......@@ -384,7 +384,8 @@ def make_results(vocab, all_examples, all_features, all_results, n_best_size,
continue
if end_index not in feature.token_to_orig_map:
continue
if not feature.token_is_max_context.get(start_index, False):
if not feature.token_is_max_context.get(start_index,
False):
continue
if end_index < start_index:
continue
......@@ -414,8 +415,8 @@ def make_results(vocab, all_examples, all_features, all_results, n_best_size,
break
feature = features[pred.feature_index]
if pred.start_index > 0: # this is a non-null prediction
tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1
)]
tok_tokens = feature.tokens[pred.start_index:(pred.end_index +
1)]
orig_doc_start = feature.token_to_orig_map[pred.start_index]
orig_doc_end = feature.token_to_orig_map[pred.end_index]
orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end +
......@@ -483,9 +484,11 @@ def mixed_segmentation(in_str, rm_punc=False):
in_str = in_str.lower().strip()
segs_out = []
temp_str = ""
sp_char = ['-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=',
',', '。', ':', '?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、',
'「', '」', '(', ')', '-', '~', '『', '』']
sp_char = [
'-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', ',', '。', ':',
'?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、', '「', '」', '(',
')', '-', '~', '『', '』'
]
for char in in_str:
if rm_punc and char in sp_char:
continue
......@@ -510,9 +513,11 @@ def mixed_segmentation(in_str, rm_punc=False):
def remove_punctuation(in_str):
"""remove punctuation"""
in_str = in_str.lower().strip()
sp_char = ['-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=',
',', '。', ':', '?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、',
'「', '」', '(', ')', '-', '~', '『', '』']
sp_char = [
'-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', ',', '。', ':',
'?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、', '「', '」', '(',
')', '-', '~', '『', '』'
]
out_segs = []
for char in in_str:
if char in sp_char:
......@@ -525,7 +530,7 @@ def remove_punctuation(in_str):
# find longest common string
def find_lcs(s1, s2):
"""find_lcs"""
m = [[0 for i in range(len(s2)+1)] for j in range(len(s1)+1)]
m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)]
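    # DP table for the longest common substring; m[i+1][j+1] holds the match length ending at s1[i], s2[j]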
mmax = 0
p = 0
for i in range(len(s1)):
......@@ -535,7 +540,7 @@ def find_lcs(s1, s2):
if m[i + 1][j + 1] > mmax:
mmax = m[i + 1][j + 1]
p = i + 1
return s1[p - mmax: p], mmax
return s1[p - mmax:p], mmax
def calc_f1_score(answers, prediction):
......@@ -583,7 +588,8 @@ def evaluate(ground_truth_file, prediction_file):
answers = [ans["text"] for ans in qas["answers"]]
if query_id not in prediction_file:
sys.stderr.write('Unanswered question: {}\n'.format(query_id))
sys.stderr.write('Unanswered question: {}\n'.format(
query_id))
skip_count += 1
continue
......@@ -594,4 +600,3 @@ def evaluate(ground_truth_file, prediction_file):
f1_score = f1 / total_count
em_score = em / total_count
return [f1_score, em_score, total_count, skip_count]
......@@ -20,33 +20,26 @@ from __future__ import unicode_literals
import sys
import argparse
import logging
from functools import partial
from io import open
open = partial(open, encoding='utf-8')
import json
from collections import namedtuple
log = logging.getLogger(__name__)
Example = namedtuple('Example', [
'qas_id', 'question_text', 'doc_tokens', 'orig_answer_text',
'start_position', 'end_position'
])
Example = namedtuple('Example',
['qas_id',
'question_text',
'doc_tokens',
'orig_answer_text',
'start_position',
'end_position'])
Feature = namedtuple("Feature",
["unique_id",
"example_index",
"doc_span_index",
"tokens",
"token_to_orig_map",
"token_is_max_context",
"token_ids",
"position_ids",
"text_type_ids",
"start_position",
"end_position"])
Feature = namedtuple("Feature", [
"unique_id", "example_index", "doc_span_index", "tokens",
"token_to_orig_map", "token_is_max_context", "token_ids", "position_ids",
"text_type_ids", "start_position", "end_position"
])
def _tokenize_chinese_chars(text):
......@@ -113,7 +106,8 @@ def _check_is_max_context(doc_spans, cur_span_index, position):
return cur_span_index == best_span_index
def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, orig_answer_text):
def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer,
orig_answer_text):
"""improve answer span"""
tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text))
......@@ -151,14 +145,17 @@ def read_files(input_file, is_training):
orig_answer_text = answer["text"]
answer_offset = answer["answer_start"]
answer_length = len(orig_answer_text)
doc_tokens = [paragraph_text[:answer_offset],
paragraph_text[answer_offset: answer_offset + answer_length],
paragraph_text[answer_offset + answer_length:]]
doc_tokens = [
paragraph_text[:answer_offset], paragraph_text[
answer_offset:answer_offset + answer_length],
paragraph_text[answer_offset + answer_length:]
]
start_pos = 1
end_pos = 1
actual_text = " ".join(doc_tokens[start_pos:(end_pos + 1)])
actual_text = " ".join(doc_tokens[start_pos:(end_pos +
1)])
if actual_text.find(orig_answer_text) == -1:
log.info("Could not find answer: '%s' vs. '%s'",
actual_text, orig_answer_text)
......@@ -177,7 +174,13 @@ def read_files(input_file, is_training):
return examples
def convert_example_to_features(examples, max_seq_length, tokenizer, is_training, doc_stride=128, max_query_length=64):
def convert_example_to_features(examples,
max_seq_length,
tokenizer,
is_training,
doc_stride=128,
max_query_length=64):
"""convert example to feature"""
features = []
unique_id = 1000000000
......@@ -185,7 +188,7 @@ def convert_example_to_features(examples, max_seq_length, tokenizer, is_training
for (example_index, example) in enumerate(examples):
query_tokens = tokenizer.tokenize(example.question_text)
if len(query_tokens) > max_query_length:
query_tokens = query_tokens[0: max_query_length]
query_tokens = query_tokens[0:max_query_length]
tok_to_orig_index = []
orig_to_tok_index = []
all_doc_tokens = []
......@@ -202,7 +205,8 @@ def convert_example_to_features(examples, max_seq_length, tokenizer, is_training
if is_training:
tok_start_position = orig_to_tok_index[example.start_position]
if example.end_position < len(example.doc_tokens) - 1:
tok_end_position = orig_to_tok_index[example.end_position + 1] - 1
tok_end_position = orig_to_tok_index[example.end_position +
1] - 1
else:
tok_end_position = len(all_doc_tokens) - 1
(tok_start_position, tok_end_position) = _improve_answer_span(
......@@ -297,4 +301,3 @@ if __name__ == "__main__":
features = convert_example_to_features(examples, 512, tokenizer, True)
log.debug(len(examples))
log.debug(len(features))
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import absolute_import
import logging
import re
import numpy as np
import paddle as P
import paddle.distributed.fleet as fleet
from propeller.paddle.train.hooks import RunHook
log = logging.getLogger(__name__)
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
def optimization(
loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
scheduler='linear_warmup_decay',
use_fp16=False, ):
"""do backword for static"""
def exclude_from_weight_decay(param):
        # strip the AMP master-weight suffix; str.rstrip removes a trailing character set, not a suffix
        name = param[:-len('.master')] if param.endswith('.master') else param
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
g_clip = P.nn.ClipGradByGlobalNorm(1.0)
lr_scheduler = P.optimizer.lr.LambdaDecay(
learning_rate,
get_warmup_and_linear_decay(num_train_steps, warmup_steps))
optimizer = P.optimizer.AdamW(
learning_rate=lr_scheduler,
weight_decay=weight_decay,
grad_clip=g_clip,
        # AdamW applies weight decay only where this predicate returns True, so negate the exclusion test
        apply_decay_param_fun=lambda n: not exclude_from_weight_decay(n))
if use_fp16:
log.info('AMP activated')
if weight_decay > 0.:
raise ValueError(
'paddle amp will ignore `weight_decay`, see https://github.com/PaddlePaddle/Paddle/issues/29794'
)
#amp_list = P.fluid.contrib.mixed_precision.AutoMixedPrecisionLists(
# custom_white_list=['softmax', 'layer_norm', 'gelu'])
optimizer = P.fluid.contrib.mixed_precision.decorate(
optimizer, init_loss_scaling=3**15, use_dynamic_loss_scaling=True)
_, param_grads = optimizer.minimize(loss)
loss_scaling = P.static.default_main_program().global_block().var(
'loss_scaling_0')
else:
_, param_grads = optimizer.minimize(loss)
loss_scaling = None
class LRStepHook(RunHook):
def after_run(self, _, __):
lr_scheduler.step()
log.debug('lr step: %.5f' % lr_scheduler.get_lr())
return LRStepHook(), loss_scaling
......@@ -41,4 +41,3 @@ python3 -m paddle.distributed.launch \
--from_pretrained /path/to/ernie1.0_pretrain_dir/
```
......@@ -15,16 +15,21 @@ import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
def gen_segs(segment_piece):
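    # all positions in a piece share the piece's smallest segment id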
if len(segment_piece) == 0:
return []
else:
return [min(segment_piece)] * len(segment_piece)
whit_space_pat = re.compile(r'\S+')
def segment(inputs, inputs_segment):
ret = [r.span() for r in whit_space_pat.finditer(inputs)]
ret = [(inputs[s: e], gen_segs(inputs_segment[s: e])) for i, (s, e) in enumerate(ret)]
ret = [(inputs[s:e], gen_segs(inputs_segment[s:e]))
for i, (s, e) in enumerate(ret)]
return ret
......@@ -36,11 +41,13 @@ def tokenize(sen, seg_info):
sen = sen.lower()
res_word, res_segments = [], []
for match in pat.finditer(sen):
words, pos = _wordpiece(match.group(0), vocab=vocab_set, unk_token='[UNK]')
words, pos = _wordpiece(
match.group(0), vocab=vocab_set, unk_token='[UNK]')
start_of_word = match.span()[0]
for w, p in zip(words, pos):
res_word.append(w)
res_segments.append(gen_segs(seg_info[p[0] + start_of_word: p[1] + start_of_word]))
res_segments.append(
gen_segs(seg_info[p[0] + start_of_word:p[1] + start_of_word]))
return res_word, res_segments
......@@ -63,22 +70,32 @@ def parse_txt(line):
print('****', file=sys.stderr)
ret_line = [vocab.get(r, vocab['[UNK]']) for r in ret_line]
ret_seginfo = [[-1] if i == [] else i for i in ret_seginfo] #for sentence piece only
ret_seginfo = [[-1] if i == [] else i
for i in ret_seginfo] #for sentence piece only
ret_seginfo = [min(i) for i in ret_seginfo]
return ret_line, ret_seginfo
def build_example(slots):
txt, seginfo = slots
txt_fe_list = feature_pb2.FeatureList(feature=[feature_pb2.Feature(int64_list=feature_pb2.Int64List(value=t)) for t in txt])
segsinfo_fe_list = feature_pb2.FeatureList(feature=[feature_pb2.Feature(int64_list=feature_pb2.Int64List(value=s)) for s in seginfo])
assert len(txt_fe_list.feature) == len(segsinfo_fe_list.feature), 'txt[%d] and seginfo[%d] size not match' % (len(txt_fe_list.feature), len(segsinfo_fe_list.feature))
txt_fe_list = feature_pb2.FeatureList(feature=[
feature_pb2.Feature(int64_list=feature_pb2.Int64List(value=t))
for t in txt
])
segsinfo_fe_list = feature_pb2.FeatureList(feature=[
feature_pb2.Feature(int64_list=feature_pb2.Int64List(value=s))
for s in seginfo
])
assert len(txt_fe_list.feature) == len(
segsinfo_fe_list.feature), 'txt[%d] and seginfo[%d] size not match' % (
len(txt_fe_list.feature), len(segsinfo_fe_list.feature))
features = {
'txt': txt_fe_list,
'segs': segsinfo_fe_list,
}
ex = example_pb2.SequenceExample(feature_lists=feature_pb2.FeatureLists(feature_list=features))
ex = example_pb2.SequenceExample(feature_lists=feature_pb2.FeatureLists(
feature_list=features))
return ex
......@@ -122,15 +139,17 @@ if __name__ == '__main__':
args = parser.parse_args()
log.setLevel(logging.DEBUG)
from ernie.tokenizing_ernie import _wordpiece
pat = re.compile(r'([a-zA-Z0-9]+|\S)')
vocab = {j.strip().split(b'\t')[0].decode('utf8'): i for i, j in enumerate(open(args.vocab, 'rb'))}
vocab = {
j.strip().split(b'\t')[0].decode('utf8'): i
for i, j in enumerate(open(args.vocab, 'rb'))
}
vocab_set = set(vocab.keys())
with open(args.src, 'rb') as from_file, gzip.open(args.tgt, 'wb') as to_file:
with open(args.src, 'rb') as from_file, gzip.open(args.tgt,
'wb') as to_file:
log.info('making gz from bb %s ==> %s' % (from_file, to_file))
build_bb(from_file, to_file)
log.info('done: %s' % to_file)
......@@ -24,12 +24,11 @@ import re
import logging
import six
from glob import glob
from pathlib import Path
from functools import reduce, partial
import itertools
import paddle
import paddle.fluid as F
import paddle.fluid.layers as L
import paddle as P
import sentencepiece as spm
import json
......@@ -39,12 +38,13 @@ import random as r
from ernie.modeling_ernie import ErnieModelForPretraining
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.optimization import optimization
#from ernie.optimization import AdamW, LinearDecay
import propeller as propeller_base
import propeller.paddle as propeller
from propeller.paddle.data import Dataset
from propeller import log
from demo.utils import create_if_not_exists, get_warmup_and_linear_decay
log.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
......@@ -53,6 +53,7 @@ if six.PY3:
from itertools import accumulate
else:
import operator
def accumulate(iterable, func=operator.add, initial=None):
'Return running totals'
# accumulate([1,2,3,4,5]) --> 1 3 6 10 15
......@@ -71,43 +72,10 @@ else:
yield total
def ernie_pretrain_model_fn(features, mode, params, run_config):
"""propeller Model wraper for paddle-ERNIE """
src_ids, sent_ids, mlm_label, mask_pos, nsp_label = features
ernie = ErnieModelForPretraining(params, name='')
total_loss, mlm_loss, nsp_loss = ernie(src_ids, sent_ids, labels=mlm_label, mlm_pos=mask_pos, nsp_labels=nsp_label)
metrics = None
inf_spec = None
propeller.summary.scalar('loss', total_loss)
propeller.summary.scalar('nsp-loss', nsp_loss)
propeller.summary.scalar('mlm-loss', mlm_loss)
scheduled_lr, loss_scale_coef = optimization(
loss=total_loss,
warmup_steps=params['warmup_steps'],
num_train_steps=run_config.max_steps,
learning_rate=params['learning_rate'],
train_program=F.default_main_program(),
startup_prog=F.default_startup_program(),
weight_decay=params['weight_decay'],
scheduler="linear_warmup_decay",
use_fp16=params['use_fp16'],
)
propeller.summary.scalar('lr', scheduled_lr)
if params['use_fp16']:
propeller.summary.scalar('loss_scale', loss_scale_coef)
pred = [total_loss]
return propeller.ModelSpec(loss=total_loss, mode=mode, metrics=metrics, predictions=pred)
def truncate_sentence(seq, from_length, to_length):
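    # keep a randomly positioned window of at most `to_length` tokens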
random_begin = np.random.randint(0, np.maximum(0, from_length - to_length) + 1)
return seq[random_begin: random_begin + to_length]
random_begin = np.random.randint(
0, np.maximum(0, from_length - to_length) + 1)
return seq[random_begin:random_begin + to_length]
def build_pair(seg_a, seg_b, max_seqlen, vocab):
......@@ -119,9 +87,11 @@ def build_pair(seg_a, seg_b, max_seqlen, vocab):
ml = max_seqlen - 3
half_ml = ml // 2
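    # 3 slots are reserved for [CLS]/[SEP]/[SEP]; the longer segment takes whatever budget the shorter one leaves unused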
if a_len > b_len:
a_len_truncated, b_len_truncated = np.maximum(half_ml, ml - b_len), np.minimum(half_ml, b_len)
a_len_truncated, b_len_truncated = np.maximum(
half_ml, ml - b_len), np.minimum(half_ml, b_len)
else:
a_len_truncated, b_len_truncated = np.minimum(half_ml, a_len), np.maximum(half_ml, ml - a_len)
a_len_truncated, b_len_truncated = np.minimum(
half_ml, a_len), np.maximum(half_ml, ml - a_len)
seg_a = truncate_sentence(seg_a, a_len, a_len_truncated)
seg_b = truncate_sentence(seg_b, b_len, b_len_truncated)
......@@ -131,9 +101,11 @@ def build_pair(seg_a, seg_b, max_seqlen, vocab):
token_type_a = np.ones_like(seg_a_txt, dtype=np.int64) * 0
token_type_b = np.ones_like(seg_b_txt, dtype=np.int64) * 1
sen_emb = np.concatenate([[cls_id], seg_a_txt, [sep_id], seg_b_txt, [sep_id]], 0)
sen_emb = np.concatenate(
[[cls_id], seg_a_txt, [sep_id], seg_b_txt, [sep_id]], 0)
info_emb = np.concatenate([[-1], seg_a_info, [-1], seg_b_info, [-1]], 0)
token_type_emb = np.concatenate([[0], token_type_a, [0], token_type_b, [1]], 0)
token_type_emb = np.concatenate(
[[0], token_type_a, [0], token_type_b, [1]], 0)
return sen_emb, info_emb, token_type_emb
......@@ -148,13 +120,14 @@ def apply_mask(sentence, seg_info, mask_rate, vocab_size, vocab):
seg_info += 1 #no more =1
seg_info_flatten = seg_info.reshape([-1])
seg_info_incr = seg_info_flatten - np.roll(seg_info_flatten, shift=1)
seg_info = np.add.accumulate(np.array([0 if s == 0 else 1 for s in seg_info_incr])).reshape(shape)
seg_info = np.add.accumulate(
np.array([0 if s == 0 else 1 for s in seg_info_incr])).reshape(shape)
seg_info[invalid_pos] = -1
u_seginfo = np.array([i for i in np.unique(seg_info) if i != -1])
np.random.shuffle(u_seginfo)
sample_num = max(1, int(len(u_seginfo) * mask_rate))
u_seginfo = u_seginfo[: sample_num]
u_seginfo = u_seginfo[:sample_num]
mask = reduce(np.logical_or, [seg_info == i for i in u_seginfo])
mask[:, 0] = False # ignore CLS head
......@@ -177,25 +150,34 @@ def apply_mask(sentence, seg_info, mask_rate, vocab_size, vocab):
return sentence, np.stack(mask_pos, -1), mask_label
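The relabeling at the top of `apply_mask` is what turns the raw segment ids into whole-word mask units: the `np.roll` difference marks where a new word starts and the running sum gives every word its own integer, so drawing `mask_rate` of those integers masks whole words at once. A tiny standalone numpy illustration with made-up segment ids (toy values, not repo code):

```python
import numpy as np

seg_info = np.array([[3, 3, 7, 7, 7, 2]])  # toy row: tokens of one word share a segment id
flat = seg_info.reshape([-1])
incr = flat - np.roll(flat, shift=1)       # nonzero wherever the segment id changes
word_id = np.add.accumulate((incr != 0).astype(np.int64)).reshape(seg_info.shape)
print(word_id)                             # [[1 1 2 2 2 3]] -- one id per word, ready for whole-word sampling
```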
def make_pretrain_dataset(name, dir, vocab, hparams, args):
def make_pretrain_dataset(name, dir, vocab, args):
gz_files = glob(dir)
if not gz_files:
raise ValueError('train data not found in %s' % dir)
raise ValueError('train data not found in %s' % gz_files)
log.info('read from %s' % '\n'.join(gz_files))
max_input_seqlen = args.max_seqlen
max_pretrain_seqlen = lambda: max_input_seqlen if r.random() > 0.15 else r.randint(1, max_input_seqlen) # short sentence rate
def _parse_gz(record_str): # function that takes python_str as input
ex = propeller.data.example_pb2.SequenceExample()
ex = propeller_base.data.example_pb2.SequenceExample()
ex.ParseFromString(record_str)
doc = [np.array(f.int64_list.value, dtype=np.int64) for f in ex.feature_lists.feature_list['txt'].feature]
doc_seg = [np.array(f.int64_list.value, dtype=np.int64) for f in ex.feature_lists.feature_list['segs'].feature]
doc = [
np.array(
f.int64_list.value, dtype=np.int64)
for f in ex.feature_lists.feature_list['txt'].feature
]
doc_seg = [
np.array(
f.int64_list.value, dtype=np.int64)
for f in ex.feature_lists.feature_list['segs'].feature
]
return doc, doc_seg
def bb_to_segments(filename):
ds = Dataset.from_record_file(filename).map(_parse_gz)
iterable = iter(ds)
def gen():
buf, size = [], 0
iterator = iter(ds)
......@@ -205,7 +187,9 @@ def make_pretrain_dataset(name, dir, vocab, hparams, args):
#line = np.array(sp_model.SampleEncodeAsIds(line, -1, 0.1), dtype=np.int64) # 0.1 means large variance on sentence piece result
if len(line) == 0:
continue
line = np.array(line) # 0.1 means large variance on sentence piece result
line = np.array(
line
) # 0.1 means large variance on sentence piece result
line_seg = np.array(line_seg)
size += len(line)
buf.append(np.stack([line, line_seg]).transpose())
......@@ -215,6 +199,7 @@ def make_pretrain_dataset(name, dir, vocab, hparams, args):
if len(buf) != 0:
yield buf,
buf, size = [], 0
return Dataset.from_generator_func(gen)
def sample_negative(dataset):
......@@ -228,8 +213,11 @@ def make_pretrain_dataset(name, dir, vocab, hparams, args):
seqlen_a = r.randint(1, seqlen)
seqlen_b = seqlen - seqlen_a
len_a = list(accumulate([len(c) for c in chunk_a]))
buf_a = [c for c, l in zip(chunk_a, len_a) if l < seqlen_a] #always take the first one
buf_b = [c for c, l in zip(chunk_a, len_a) if seqlen_a <= l < seqlen]
buf_a = [c for c, l in zip(chunk_a, len_a)
if l < seqlen_a] #always take the first one
buf_b = [
c for c, l in zip(chunk_a, len_a) if seqlen_a <= l < seqlen
]
if r.random() < 0.5: #pos or neg
label = np.int64(1)
......@@ -243,7 +231,9 @@ def make_pretrain_dataset(name, dir, vocab, hparams, args):
b = np.concatenate(buf_b)
#log.debug(a)
#log.debug(b)
sample, seg_info, token_type = build_pair(a, b, args.max_seqlen, vocab) #negative sample might exceed max seqlen
sample, seg_info, token_type = build_pair(
a, b, args.max_seqlen,
vocab) #negative sample might exceed max seqlen
yield sample, seg_info, token_type, label
ds = propeller.data.Dataset.from_generator_func(gen)
......@@ -251,14 +241,20 @@ def make_pretrain_dataset(name, dir, vocab, hparams, args):
def after(sentence, seg_info, segments, label):
batch_size, seqlen = sentence.shape
sentence, mask_pos, mlm_label = apply_mask(sentence, seg_info, args.mask_rate, hparams.vocab_size, vocab)
sentence, mask_pos, mlm_label = apply_mask(sentence, seg_info,
args.mask_rate,
len(vocab), vocab)
ra = r.random()
if ra < args.check:
print('***')
print('\n'.join([str(j) + '\t' + '|'.join(map(str, i)) for i, j in zip(sentence.tolist(), label)]))
print('\n'.join([
str(j) + '\t' + '|'.join(map(str, i))
for i, j in zip(sentence.tolist(), label)
]))
print('***')
print('\n'.join(['|'.join(map(str, i)) for i in seg_info.tolist()]))
print('\n'.join(
['|'.join(map(str, i)) for i in seg_info.tolist()]))
print('***')
print('|'.join(map(str, mlm_label.tolist())))
print('***')
......@@ -269,13 +265,21 @@ def make_pretrain_dataset(name, dir, vocab, hparams, args):
dataset = Dataset.from_list(gz_files)
if propeller.train.distribution.status.mode == propeller.train.distribution.DistributionMode.NCCL:
log.info('Apply sharding in distribution env')
dataset = dataset.shard(propeller.train.distribution.status.num_replica, propeller.train.distribution.status.replica_id)
if len(gz_files) < propeller.train.distribution.status.num_replica:
raise ValueError(
'not enough train files to shard: # of train files: %d, # of workers %d'
% (len(gz_files),
propeller.train.distribution.status.num_replica))
dataset = dataset.shard(env.nranks, env.dev_id)
dataset = dataset.repeat().shuffle(buffer_size=len(gz_files))
dataset = dataset.interleave(map_fn=bb_to_segments, cycle_length=len(gz_files), block_length=1)
dataset = dataset.shuffle(buffer_size=1000) #must shuffle to ensure negative sample randomness
dataset = dataset.interleave(
map_fn=bb_to_segments, cycle_length=len(gz_files), block_length=1)
dataset = dataset.shuffle(
buffer_size=1000) #must shuffle to ensure negative sample randomness
dataset = sample_negative(dataset)
dataset = dataset.padded_batch(hparams.batch_size, (0, 0, 0, 0)).map(after)
dataset = dataset.padded_batch(args.bsz, (0, 0, 0, 0)).map(after)
dataset.name = name
return dataset
......@@ -287,68 +291,110 @@ if __name__ == '__main__':
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
parser = propeller.ArgumentParser('DAN model with Paddle')
parser.add_argument('--max_seqlen', type=int, default=256)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--from_pretrained', type=str, default=None)
parser.add_argument('--mask_rate', type=float, default=0.15)
parser.add_argument('--check', type=float, default=0.)
args = parser.parse_args()
if not os.path.exists(args.from_pretrained):
raise ValueError('--from_pretrained not found: %s' % args.from_pretrained)
cfg_file_path = os.path.join(args.from_pretrained, 'ernie_config.json')
param_path = os.path.join(args.from_pretrained, 'params')
vocab_path = os.path.join(args.from_pretrained, 'vocab.txt')
assert os.path.exists(cfg_file_path) and os.path.exists(param_path) and os.path.exists(vocab_path)
hparams_cli = propeller.parse_hparam(args)
hparams_config_file = json.loads(open(cfg_file_path).read())
default_hparams = propeller.HParams(
batch_size=50,
warmup_steps=10000,
learning_rate=1e-4,
weight_decay=0.01,
use_fp16=False,
parser.add_argument(
'--max_seqlen',
type=int,
default=256,
help='max sequence length; documents from pretrain data will be expanded to this length'
)
parser.add_argument(
'--data_dir',
type=str,
required=True,
help='protobuf pretrain data directory')
parser.add_argument(
'--mask_rate',
type=float,
default=0.15,
help='probability of an input token to be masked')
parser.add_argument(
'--check', type=float, default=0., help='probability of printing debug info for a batch')
parser.add_argument(
'--warmup_steps', type=int, default=10000, help='warmup steps')
parser.add_argument(
'--max_steps', type=int, default=1000000, help='max pretrain steps')
parser.add_argument('--lr', type=float, default=1e-4, help='learning_rate')
parser.add_argument(
'--from_pretrained',
type=Path,
required=True,
help='pretrained model dir')
parser.add_argument(
'--save_dir', type=Path, required=True, help='model output_dir')
parser.add_argument(
'--wd',
type=float,
default=0.01,
help='weight decay, aka L2 regularizer')
parser.add_argument('--bsz', type=int, default=50)
parser.add_argument(
'--use_amp',
action='store_true',
help='only activate AMP (auto mixed precision acceleration) on TensorCore-compatible devices'
)
hparams = default_hparams.join(propeller.HParams(**hparams_config_file)).join(hparams_cli)
default_run_config=dict(
max_steps=1000000,
save_steps=10000,
log_steps=10,
max_ckpt=3,
skip_steps=0,
eval_steps=-1)
args = parser.parse_args()
run_config = dict(default_run_config, **json.loads(args.run_config))
run_config = propeller.RunConfig(**run_config)
P.distributed.init_parallel_env()
env = P.distributed.ParallelEnv()
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
train_ds = make_pretrain_dataset('train', args.data_dir,
vocab=tokenizer.vocab, hparams=hparams, args=args)
seq_shape = [-1, args.max_seqlen]
ints_shape = [-1,]
shapes = (seq_shape, seq_shape, ints_shape, [-1, 2], ints_shape)
types = ('int64', 'int64', 'int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
ws = None
#varname_to_warmstart = re.compile(r'^encoder.*[wb]_0$|^.*embedding$|^.*bias$|^.*scale$|^pooled_fc.[wb]_0$')
varname_to_warmstart = re.compile(r'.*')
if args.from_pretrained is not None:
warm_start_dir = os.path.join(args.from_pretrained, 'params')
ws = propeller.WarmStartSetting(
predicate_fn=lambda v: varname_to_warmstart.match(v.name) and os.path.exists(os.path.join(warm_start_dir, v.name)),
from_dir=warm_start_dir
)
ernie_learner = propeller.Learner(ernie_pretrain_model_fn, run_config, params=hparams, warm_start_setting=ws)
ernie_learner.train(train_ds)
train_ds = make_pretrain_dataset(
'train', args.data_dir, vocab=tokenizer.vocab, args=args)
model = ErnieModelForPretraining.from_pretrained(args.from_pretrained)
param_name_to_exclue_from_weight_decay = re.compile(
r'.*layer_norm_scale|.*layer_norm_bias|.*b_0')
lr_scheduler = P.optimizer.lr.LambdaDecay(
args.lr,
get_warmup_and_linear_decay(args.max_steps, args.warmup_steps))
g_clip = P.nn.ClipGradByGlobalNorm(1.0) #experimental
opt = P.optimizer.AdamW(
learning_rate=lr_scheduler,
parameters=model.parameters(),
apply_decay_param_fun=lambda n: param_name_to_exclue_from_weight_decay.match(n),
weight_decay=args.wd,
grad_clip=g_clip)
model = P.DataParallel(model)
scaler = P.amp.GradScaler(enable=args.use_amp)
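# note: with --use_amp unset, GradScaler(enable=False) and auto_cast(False) are pass-throughs, so this single loop covers both FP32 and AMP training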
create_if_not_exists(args.save_dir)
with P.amp.auto_cast(args.use_amp):
for step, samples in enumerate(
P.io.DataLoader(
train_ds, places=P.CUDAPlace(env.dev_id), batch_size=0)):
(src_ids, sent_ids, mlm_label, mask_pos, nsp_label) = samples
loss, mlmloss, nsploss = model(
src_ids,
sent_ids,
labels=mlm_label,
mlm_pos=mask_pos,
nsp_labels=nsp_label)
loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
lr_scheduler.step()
if step % 10 == 0:
_lr = lr_scheduler.get_lr()
if args.use_amp:
_l = (loss / scaler._scale).numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e scaling %.3e' % (
env.dev_id, step, _l, _lr, scaler._scale.numpy())
else:
_l = loss.numpy()
msg = '[rank-%d][step-%d] train loss %.5f lr %.3e' % (
env.dev_id, step, _l, _lr)
log.debug(msg)
if step % 1000 == 0 and env.dev_id == 0:
log.debug('saving...')
P.save(model.state_dict(), args.save_dir / 'ckpt.bin')
if step > args.max_steps:
break
log.info('done')
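The loop above writes the full pretraining state dict to `ckpt.bin` under `--save_dir` whenever `step % 1000 == 0` on device 0. A minimal sketch of loading such a checkpoint back for further training, assuming the same `ErnieModelForPretraining` class; the `./output` path is a hypothetical `--save_dir`, not something fixed by the repo:

```python
import paddle as P
from ernie.modeling_ernie import ErnieModelForPretraining

# rebuild the model, then overwrite its weights with the saved pretraining state
model = ErnieModelForPretraining.from_pretrained('ernie-1.0')  # or a local pretrained dir, as in the script above
state_dict = P.load('./output/ckpt.bin')   # hypothetical --save_dir from the run above
model.set_state_dict(state_dict)
model.train()                              # ready to continue pretraining
```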
......@@ -24,14 +24,11 @@ import re
import logging
import six
from glob import glob
from pathlib import Path
from functools import reduce, partial
import itertools
import paddle
import paddle.fluid as F
import paddle.fluid.dygraph as D
import paddle.fluid.layers as L
import sentencepiece as spm
import paddle as P
import json
from tqdm import tqdm
......@@ -40,9 +37,10 @@ import random as r
from ernie.modeling_ernie import ErnieModelForPretraining
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.optimization import AdamW, LinearDecay
from demo.optimization import optimization
import propeller.paddle as propeller
import propeller as propeller_base
from propeller.paddle.data import Dataset
from propeller import log
......@@ -54,6 +52,7 @@ if six.PY3:
from itertools import accumulate
else:
import operator
def accumulate(iterable, func=operator.add, initial=None):
'Return running totals'
# accumulate([1,2,3,4,5]) --> 1 3 6 10 15
......@@ -72,9 +71,54 @@ else:
yield total
def ernie_pretrain_model_fn(features, mode, params, run_config):
"""propeller Model wraper for paddle-ERNIE """
src_ids, sent_ids, mlm_label, mask_pos, nsp_label = features
ernie = ErnieModelForPretraining(params, name='')
total_loss, mlm_loss, nsp_loss = ernie(
src_ids,
sent_ids,
labels=mlm_label,
mlm_pos=mask_pos,
nsp_labels=nsp_label)
metrics = None
inf_spec = None
propeller.summary.scalar('loss', total_loss)
propeller.summary.scalar('nsp-loss', nsp_loss)
propeller.summary.scalar('mlm-loss', mlm_loss)
lr_step_hook, loss_scale_coef = optimization(
loss=total_loss,
warmup_steps=params['warmup_steps'],
num_train_steps=run_config.max_steps,
learning_rate=params['learning_rate'],
train_program=P.static.default_main_program(),
startup_prog=P.static.default_startup_program(),
weight_decay=params['weight_decay'],
scheduler="linear_warmup_decay",
use_fp16=args.use_amp, )
scheduled_lr = P.static.default_main_program().global_block().var(
'learning_rate_0')
propeller.summary.scalar('lr', scheduled_lr)
if args.use_amp:
propeller.summary.scalar('loss_scaling', loss_scale_coef)
pred = [total_loss]
return propeller.ModelSpec(
loss=total_loss,
mode=mode,
metrics=metrics,
predictions=pred,
train_hooks=[lr_step_hook])
def truncate_sentence(seq, from_length, to_length):
random_begin = np.random.randint(0, np.maximum(0, from_length - to_length) + 1)
return seq[random_begin: random_begin + to_length]
random_begin = np.random.randint(
0, np.maximum(0, from_length - to_length) + 1)
return seq[random_begin:random_begin + to_length]
def build_pair(seg_a, seg_b, max_seqlen, vocab):
......@@ -86,9 +130,11 @@ def build_pair(seg_a, seg_b, max_seqlen, vocab):
ml = max_seqlen - 3
half_ml = ml // 2
if a_len > b_len:
a_len_truncated, b_len_truncated = np.maximum(half_ml, ml - b_len), np.minimum(half_ml, b_len)
a_len_truncated, b_len_truncated = np.maximum(
half_ml, ml - b_len), np.minimum(half_ml, b_len)
else:
a_len_truncated, b_len_truncated = np.minimum(half_ml, a_len), np.maximum(half_ml, ml - a_len)
a_len_truncated, b_len_truncated = np.minimum(
half_ml, a_len), np.maximum(half_ml, ml - a_len)
seg_a = truncate_sentence(seg_a, a_len, a_len_truncated)
seg_b = truncate_sentence(seg_b, b_len, b_len_truncated)
......@@ -98,9 +144,11 @@ def build_pair(seg_a, seg_b, max_seqlen, vocab):
token_type_a = np.ones_like(seg_a_txt, dtype=np.int64) * 0
token_type_b = np.ones_like(seg_b_txt, dtype=np.int64) * 1
sen_emb = np.concatenate([[cls_id], seg_a_txt, [sep_id], seg_b_txt, [sep_id]], 0)
sen_emb = np.concatenate(
[[cls_id], seg_a_txt, [sep_id], seg_b_txt, [sep_id]], 0)
info_emb = np.concatenate([[-1], seg_a_info, [-1], seg_b_info, [-1]], 0)
token_type_emb = np.concatenate([[0], token_type_a, [0], token_type_b, [1]], 0)
token_type_emb = np.concatenate(
[[0], token_type_a, [0], token_type_b, [1]], 0)
return sen_emb, info_emb, token_type_emb
......@@ -115,13 +163,14 @@ def apply_mask(sentence, seg_info, mask_rate, vocab_size, vocab):
seg_info += 1 #no more =1
seg_info_flatten = seg_info.reshape([-1])
seg_info_incr = seg_info_flatten - np.roll(seg_info_flatten, shift=1)
seg_info = np.add.accumulate(np.array([0 if s == 0 else 1 for s in seg_info_incr])).reshape(shape)
seg_info = np.add.accumulate(
np.array([0 if s == 0 else 1 for s in seg_info_incr])).reshape(shape)
seg_info[invalid_pos] = -1
u_seginfo = np.array([i for i in np.unique(seg_info) if i != -1])
np.random.shuffle(u_seginfo)
sample_num = max(1, int(len(u_seginfo) * mask_rate))
u_seginfo = u_seginfo[: sample_num]
u_seginfo = u_seginfo[:sample_num]
mask = reduce(np.logical_or, [seg_info == i for i in u_seginfo])
mask[:, 0] = False # ignore CLS head
......@@ -144,25 +193,34 @@ def apply_mask(sentence, seg_info, mask_rate, vocab_size, vocab):
return sentence, np.stack(mask_pos, -1), mask_label
def make_pretrain_dataset(name, dir, vocab, args):
def make_pretrain_dataset(name, dir, vocab, hparams, args):
gz_files = glob(dir)
if not gz_files:
raise ValueError('train data not found in %s' % gz_files)
raise ValueError('train data not found in %s' % dir)
log.info('read from %s' % '\n'.join(gz_files))
max_input_seqlen = args.max_seqlen
max_pretrain_seqlen = lambda: max_input_seqlen if r.random() > 0.15 else r.randint(1, max_input_seqlen) # short sentence rate
def _parse_gz(record_str): # function that takes python_str as input
ex = propeller.data.example_pb2.SequenceExample()
ex = propeller_base.data.example_pb2.SequenceExample()
ex.ParseFromString(record_str)
doc = [np.array(f.int64_list.value, dtype=np.int64) for f in ex.feature_lists.feature_list['txt'].feature]
doc_seg = [np.array(f.int64_list.value, dtype=np.int64) for f in ex.feature_lists.feature_list['segs'].feature]
doc = [
np.array(
f.int64_list.value, dtype=np.int64)
for f in ex.feature_lists.feature_list['txt'].feature
]
doc_seg = [
np.array(
f.int64_list.value, dtype=np.int64)
for f in ex.feature_lists.feature_list['segs'].feature
]
return doc, doc_seg
def bb_to_segments(filename):
ds = Dataset.from_record_file(filename).map(_parse_gz)
iterable = iter(ds)
def gen():
buf, size = [], 0
iterator = iter(ds)
......@@ -172,7 +230,9 @@ def make_pretrain_dataset(name, dir, vocab, args):
#line = np.array(sp_model.SampleEncodeAsIds(line, -1, 0.1), dtype=np.int64) # 0.1 means large variance on sentence piece result
if len(line) == 0:
continue
line = np.array(line) # 0.1 means large variance on sentence piece result
line = np.array(
line
) # 0.1 means large variance on sentence piece result
line_seg = np.array(line_seg)
size += len(line)
buf.append(np.stack([line, line_seg]).transpose())
......@@ -182,6 +242,7 @@ def make_pretrain_dataset(name, dir, vocab, args):
if len(buf) != 0:
yield buf,
buf, size = [], 0
return Dataset.from_generator_func(gen)
def sample_negative(dataset):
......@@ -195,8 +256,11 @@ def make_pretrain_dataset(name, dir, vocab, args):
seqlen_a = r.randint(1, seqlen)
seqlen_b = seqlen - seqlen_a
len_a = list(accumulate([len(c) for c in chunk_a]))
buf_a = [c for c, l in zip(chunk_a, len_a) if l < seqlen_a] #always take the first one
buf_b = [c for c, l in zip(chunk_a, len_a) if seqlen_a <= l < seqlen]
buf_a = [c for c, l in zip(chunk_a, len_a)
if l < seqlen_a] #always take the first one
buf_b = [
c for c, l in zip(chunk_a, len_a) if seqlen_a <= l < seqlen
]
if r.random() < 0.5: #pos or neg
label = np.int64(1)
......@@ -210,7 +274,9 @@ def make_pretrain_dataset(name, dir, vocab, args):
b = np.concatenate(buf_b)
#log.debug(a)
#log.debug(b)
sample, seg_info, token_type = build_pair(a, b, args.max_seqlen, vocab) #negative sample might exceed max seqlen
sample, seg_info, token_type = build_pair(
a, b, args.max_seqlen,
vocab) #negative sample might exceed max seqlen
yield sample, seg_info, token_type, label
ds = propeller.data.Dataset.from_generator_func(gen)
......@@ -218,14 +284,19 @@ def make_pretrain_dataset(name, dir, vocab, args):
def after(sentence, seg_info, segments, label):
batch_size, seqlen = sentence.shape
sentence, mask_pos, mlm_label = apply_mask(sentence, seg_info, args.mask_rate, len(vocab), vocab)
sentence, mask_pos, mlm_label = apply_mask(
sentence, seg_info, args.mask_rate, hparams.vocab_size, vocab)
ra = r.random()
if ra < args.check:
print('***')
print('\n'.join([str(j) + '\t' + '|'.join(map(str, i)) for i, j in zip(sentence.tolist(), label)]))
print('\n'.join([
str(j) + '\t' + '|'.join(map(str, i))
for i, j in zip(sentence.tolist(), label)
]))
print('***')
print('\n'.join(['|'.join(map(str, i)) for i in seg_info.tolist()]))
print('\n'.join(
['|'.join(map(str, i)) for i in seg_info.tolist()]))
print('***')
print('|'.join(map(str, mlm_label.tolist())))
print('***')
......@@ -236,15 +307,17 @@ def make_pretrain_dataset(name, dir, vocab, args):
dataset = Dataset.from_list(gz_files)
if propeller.train.distribution.status.mode == propeller.train.distribution.DistributionMode.NCCL:
log.info('Apply sharding in distribution env')
if len(gz_files) < propeller.train.distribution.status.num_replica:
raise ValueError('not enough train file to shard: # of train files: %d, # of workers %d' % (len(gz_files), propeller.train.distribution.status.num_replica))
dataset = dataset.shard(propeller.train.distribution.status.num_replica, propeller.train.distribution.status.replica_id)
dataset = dataset.shard(
propeller.train.distribution.status.num_replica,
propeller.train.distribution.status.replica_id)
dataset = dataset.repeat().shuffle(buffer_size=len(gz_files))
dataset = dataset.interleave(map_fn=bb_to_segments, cycle_length=len(gz_files), block_length=1)
dataset = dataset.shuffle(buffer_size=1000) #must shuffle to ensure negative sample randomness
dataset = dataset.interleave(
map_fn=bb_to_segments, cycle_length=len(gz_files), block_length=1)
dataset = dataset.shuffle(
buffer_size=1000) #must shuffle to ensure negative sample randomness
dataset = sample_negative(dataset)
dataset = dataset.padded_batch(args.bsz, (0, 0, 0, 0)).map(after)
dataset = dataset.padded_batch(hparams.batch_size, (0, 0, 0, 0)).map(after)
dataset.name = name
return dataset
......@@ -256,53 +329,77 @@ if __name__ == '__main__':
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
parser = propeller.ArgumentParser('DAN model with Paddle')
parser.add_argument('--max_seqlen', type=int, default=256, help='max sequence length, documents from pretrain data will expand to this length')
parser.add_argument('--data_dir', type=str, required=True, help='protobuf pretrain data directory')
parser.add_argument('--mask_rate', type=float, default=0.15, help='probability of input token tobe masked')
parser.add_argument('--check', type=float, default=0., help='probability of debug info')
parser.add_argument('--warmup_steps', type=int, default=10000, help='warmups steps')
parser.add_argument('--max_steps', type=int, default=1000000, help='max pretrian steps')
parser.add_argument('--lr', type=float, default=1e-4, help='learning_rate')
parser.add_argument('--from_pretrained', type=str, required=True, help='pretraind model dir')
parser.add_argument('--save_dir', type=str, default=None, help='model output_dir')
parser.add_argument('--bsz', type=int, default=50)
parser.add_argument('--max_seqlen', type=int, default=256)
parser.add_argument('--data_dir', type=str, required=True)
parser.add_argument('--from_pretrained', type=Path, default=None)
parser.add_argument('--use_amp', action='store_true')
parser.add_argument('--mask_rate', type=float, default=0.15)
parser.add_argument('--check', type=float, default=0.)
args = parser.parse_args()
P.enable_static()
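# paddle 2.x starts in dynamic-graph mode by default; this script keeps the static-graph propeller Learner flow, so static mode must be enabled explicitly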
if not os.path.exists(args.from_pretrained):
raise ValueError('--from_pretrained not found: %s' %
args.from_pretrained)
cfg_file_path = os.path.join(args.from_pretrained, 'ernie_config.json')
param_path = os.path.join(args.from_pretrained, 'params')
vocab_path = os.path.join(args.from_pretrained, 'vocab.txt')
assert os.path.exists(cfg_file_path) and os.path.exists(
param_path) and os.path.exists(vocab_path)
hparams_cli = propeller.parse_hparam(args)
hparams_config_file = json.loads(open(cfg_file_path).read())
default_hparams = propeller.HParams(
batch_size=50,
warmup_steps=10000,
learning_rate=1e-4,
weight_decay=0.01, )
hparams = default_hparams.join(propeller.HParams(
**hparams_config_file)).join(hparams_cli)
default_run_config = dict(
max_steps=1000000,
save_steps=10000,
log_steps=10,
max_ckpt=3,
skip_steps=0,
eval_steps=-1)
run_config = dict(default_run_config, **json.loads(args.run_config))
run_config = propeller.RunConfig(**run_config)
tokenizer = ErnieTokenizer.from_pretrained(args.from_pretrained)
train_ds = make_pretrain_dataset('train', args.data_dir,
vocab=tokenizer.vocab, args=args)
train_ds = make_pretrain_dataset(
'train',
args.data_dir,
vocab=tokenizer.vocab,
hparams=hparams,
args=args)
seq_shape = [-1, args.max_seqlen]
ints_shape = [-1,]
ints_shape = [-1, ]
shapes = (seq_shape, seq_shape, ints_shape, [-1, 2], ints_shape)
types = ('int64', 'int64', 'int64', 'int64', 'int64')
train_ds.data_shapes = shapes
train_ds.data_types = types
place = F.CUDAPlace(D.parallel.Env().dev_id)
with D.guard(place):
model = ErnieModelForPretraining.from_pretrained(args.from_pretrained)
opt = AdamW(learning_rate=LinearDecay(args.lr, args.warmup_steps, args.max_steps), parameter_list=model.parameters(), weight_decay=0.01)
ctx = D.parallel.prepare_context()
model = D.parallel.DataParallel(model, ctx)
for step, samples in enumerate(tqdm(train_ds.start(place))):
(src_ids, sent_ids, mlm_label, mask_pos, nsp_label) = samples
loss, mlmloss, nsploss = model(src_ids, sent_ids, labels=mlm_label, mlm_pos=mask_pos, nsp_labels=nsp_label)
scaled_loss = model.scale_loss(loss)
scaled_loss.backward()
model.apply_collective_grads()
opt.minimize(scaled_loss)
model.clear_gradients()
if step % 10 == 0:
log.debug('train loss %.5f scaled loss %.5f' % (loss.numpy(), scaled_loss.numpy()))
if step % 10000 == 0 and D.parallel.Env().dev_id == 0 and args.save_dir is not None:
F.save_dygraph(model.state_dict(), args.save_dir)
ws = None
#varname_to_warmstart = re.compile(r'^encoder.*[wb]_0$|^.*embedding$|^.*bias$|^.*scale$|^pooled_fc.[wb]_0$')
varname_to_warmstart = re.compile(r'.*')
if args.from_pretrained is not None:
warm_start_dir = os.path.join(args.from_pretrained, 'params')
ws = propeller.WarmStartSetting(
predicate_fn=lambda v: varname_to_warmstart.match(v.name) and os.path.exists(os.path.join(warm_start_dir, v.name)),
from_dir=warm_start_dir
)
ernie_learner = propeller.Learner(
ernie_pretrain_model_fn,
run_config,
params=hparams,
warm_start_setting=ws)
ernie_learner.train(train_ds)
......@@ -57,5 +57,3 @@ cat one_column_source_text| python3 demo/seq2seq/decode.py \
--save_dir ./model_cnndm \
--bsz 8
```
../../demo/seq2seq/README.md
\ No newline at end of file
......@@ -47,4 +47,3 @@ make
| ----- | ----- |
| CPU (Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz, 20 threads) | 29.8818 |
| GPU (P4) | 8.5 |
......@@ -28,4 +28,3 @@ LINK_DIRECTORIES(${MKLDNN_LIB_PATH})
ADD_EXECUTABLE(inference inference.cc)
TARGET_LINK_LIBRARIES(inference dl paddle_fluid glog gflags pthread)
......@@ -28,4 +28,3 @@ LINK_DIRECTORIES(${MKLDNN_LIB_PATH})
ADD_EXECUTABLE(inference inference.cc)
TARGET_LINK_LIBRARIES(inference dl paddle_fluid glog gflags pthread)