Unverified commit 80628bc6, authored by Guo Sheng, committed by GitHub

update transformer to be unified (#3119)

* Update transformer to be unified

* To support CE when unifying Transformer
Parent commit: e16b9cae
#!/bin/bash
DATA_PATH=./dataset/wmt16

train(){
python -u main.py \
  --do_train True \
  --src_vocab_fpath $DATA_PATH/en_10000.dict \
  --trg_vocab_fpath $DATA_PATH/de_10000.dict \
  --special_token '<s>' '<e>' '<unk>' \
  --training_file $DATA_PATH/wmt16/train \
  --use_token_batch True \
  --batch_size 2048 \
  --sort_type pool \
  --pool_size 10000 \
  --print_step 1 \
  --weight_sharing False \
  --epoch 20 \
  --enable_ce True \
  --random_seed 1000 \
  --save_checkpoint "" \
  --save_param ""
}

cudaid=${transformer:=0} # use 0-th card as default
export CUDA_VISIBLE_DEVICES=$cudaid
train | python _ce.py

cudaid=${transformer_m:=0,1,2,3} # use 0,1,2,3 card as default
export CUDA_VISIBLE_DEVICES=$cudaid
train | python _ce.py
## Transformer

The following is a brief overview of this example's directory structure:

```text
.
├── images               # figures used in this README
├── palm                 # toolkit package
├── desc.py              # input description file
├── dist_utils.py        # utilities for multi-process training
├── gen_data.sh          # data generation script
├── inference_model.py   # script for saving the inference model
├── main.py              # main entry point
├── predict.py           # prediction script
├── reader.py            # data reading interface
├── README.md            # this document
├── train.py             # training script
├── transformer.py       # model definition file
└── transformer.yaml     # configuration file
```

## Model Introduction

Machine translation (MT) is the process of using a computer to translate one natural language (the source language) into another (the target language): the input is a sentence in the source language and the output is the corresponding sentence in the target language.

This project is a PaddlePaddle implementation of Transformer, the mainstream model in machine translation. It covers model training, prediction and the use of custom data, and users can build their own translation models on top of the released code.
## Quick Start

1. Installing PaddlePaddle

   This project requires PaddlePaddle Fluid 1.5.0 or later. Please refer to the [installation guide](http://www.paddlepaddle.org/#quick-start) to install it.

2. Getting the code

   Clone the repository locally:

   ```shell
   git clone https://github.com/PaddlePaddle/models.git
   cd models/PaddleNLP/neural_machine_translation/transformer
   ```

3. Environment dependencies

   Please refer to the [installation notes](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/install/index_cn.html) in the PaddlePaddle documentation.
### Data Preparation

Public dataset: the WMT translation shared task is the most authoritative international evaluation campaign in machine translation. Its English-German task provides a medium-sized dataset that is used in many papers, including the Transformer paper itself. We provide the [WMT'16 EN-DE dataset](http://www.statmt.org/wmt16/translation-task.html) as the example here. Run the `gen_data.sh` script to download and preprocess the WMT'16 EN-DE dataset (this takes a while, so running it in the background is recommended). Preprocessing mainly consists of tokenization and [BPE (byte-pair encoding)](https://arxiv.org/pdf/1508.07909). On success, a `gen_data` directory is generated with the following structure:

```text
.
├── wmt16_ende_data       # WMT16 EN-DE translation data
├── wmt16_ende_data_bpe   # BPE-encoded WMT16 EN-DE translation data
├── mosesdecoder          # Moses MT toolkit, including tokenization and BLEU evaluation scripts
└── subword-nmt           # BPE encoding code
```

We also provide a preprocessed copy of the WMT'16 EN-DE data for [download](https://transformer-res.bj.bcebos.com/wmt16_ende_data_bpe_clean.tar.gz). It contains the vocabulary (the `vocab_all.bpe.32000` file), the BPE data needed for training (the `train.tok.clean.bpe.32000.en-de` file), the BPE data needed for prediction (files such as `newstest2016.tok.bpe.32000.en-de`) and the corresponding tokenized data needed for evaluating the predictions (files such as `newstest2016.tok.de`).
Custom data: the program in this project directly supports data consisting of source and target sentence pairs separated by a tab (`\t`), with the tokens in each sentence separated by spaces. Provide data files in this format (the data may be split into multiple parts; data reading supports file globbing) together with the corresponding vocabulary files, and the code can be run directly.
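The snippet below is a minimal sketch of how such a tab-separated pair file could be parsed; the file name `custom.train` and the helper itself are illustrative assumptions, not part of this project's reader.

```python
# Minimal sketch: parse a tab-separated parallel corpus.
# Assumption: each line is "<source tokens>\t<target tokens>", tokens separated by spaces.
def read_pairs(path):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                continue  # skip malformed lines
            yield fields[0].split(), fields[1].split()

# hypothetical usage:
# for src_tokens, trg_tokens in read_pairs("custom.train"):
#     print(src_tokens, trg_tokens)
```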
### Single-Machine Training

Taking the provided EN-DE translation data as an example, you can run the following command to train the model:
```sh
# enable garbage collection to save memory
export FLAGS_eager_delete_tensor_gb=0.0
# setting visible devices for training
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -u main.py \
--do_train True \
--epoch 30 \
--src_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \
--trg_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \
--special_token '<s>' '<e>' '<unk>' \
--training_file gen_data/wmt16_ende_data_bpe/train.tok.clean.bpe.32000.en-de \
--batch_size 4096
```
The command above passes in the flag that enables training (`do_train`), the number of training epochs (`epoch`) and the path of the training data files (make sure this is set correctly; wildcards are supported), among other parameters. The remaining parameters and the supported model hyperparameters are listed in the `transformer.yaml` configuration file, which provides the Transformer base model configuration by default. To adjust them, edit the configuration file or pass values on the command line (command-line values override the configuration file). For example, the following command trains the big model from the Transformer paper:
```sh
# enable garbage collection to save memory
export FLAGS_eager_delete_tensor_gb=0.0
# setting visible devices for training
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -u main.py \
--do_train True \
--epoch 30 \
--src_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \
--trg_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \
--special_token '<s>' '<e>' '<unk>' \
--training_file gen_data/wmt16_ende_data_bpe/train.tok.clean.bpe.32000.en-de \
--batch_size 4096 \
--n_head 16 \
--d_model 1024 \
--d_inner_hid 4096 \
--prepostprocess_dropout 0.3
```
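For reference, the following sketch summarizes how the big-model overrides passed above differ from the base-model defaults in this repository's configuration (the base values match `ModelHyperParams` shown further down this page; the dict layout itself is just for illustration):

```python
# Transformer base vs. big hyperparameters used in the training commands above.
base = {"n_head": 8, "d_model": 512, "d_inner_hid": 2048, "prepostprocess_dropout": 0.1}
big = {"n_head": 16, "d_model": 1024, "d_inner_hid": 4096, "prepostprocess_dropout": 0.3}

for key in base:
    print("%s: base=%s, big=%s" % (key, base[key], big[key]))
```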
Training uses all visible GPUs by default; the `CUDA_VISIBLE_DEVICES` environment variable controls which (and how many) GPUs are used. Training on CPU only is also possible (set `--use_cuda False`), though it is comparatively slow. If `save_param` and `save_checkpoint` are provided when training (they default to trained_params and trained_ckpts), the current parameter values and a checkpoint are saved to the corresponding directories every `save_step` iterations (10000 by default), and every `print_step` iterations (100 by default) a log like the following is printed to standard output:
```txt
[2019-08-02 15:30:51,656 INFO train.py:262] step_idx: 150100, epoch: 32, batch: 1364, avg loss: 2.880427, normalized loss: 1.504687, ppl: 17.821888, speed: 3.34 step/s
[2019-08-02 15:31:19,824 INFO train.py:262] step_idx: 150200, epoch: 32, batch: 1464, avg loss: 2.955965, normalized loss: 1.580225, ppl: 19.220257, speed: 3.55 step/s
[2019-08-02 15:31:48,151 INFO train.py:262] step_idx: 150300, epoch: 32, batch: 1564, avg loss: 2.951180, normalized loss: 1.575439, ppl: 19.128502, speed: 3.53 step/s
[2019-08-02 15:32:16,401 INFO train.py:262] step_idx: 150400, epoch: 32, batch: 1664, avg loss: 3.027281, normalized loss: 1.651540, ppl: 20.641024, speed: 3.54 step/s
[2019-08-02 15:32:44,764 INFO train.py:262] step_idx: 150500, epoch: 32, batch: 1764, avg loss: 3.069125, normalized loss: 1.693385, ppl: 21.523066, speed: 3.53 step/s
[2019-08-02 15:33:13,199 INFO train.py:262] step_idx: 150600, epoch: 32, batch: 1864, avg loss: 2.869379, normalized loss: 1.493639, ppl: 17.626074, speed: 3.52 step/s
[2019-08-02 15:33:41,601 INFO train.py:262] step_idx: 150700, epoch: 32, batch: 1964, avg loss: 2.980905, normalized loss: 1.605164, ppl: 19.705633, speed: 3.52 step/s
[2019-08-02 15:34:10,079 INFO train.py:262] step_idx: 150800, epoch: 32, batch: 2064, avg loss: 3.047716, normalized loss: 1.671976, ppl: 21.067181, speed: 3.51 step/s
[2019-08-02 15:34:38,598 INFO train.py:262] step_idx: 150900, epoch: 32, batch: 2164, avg loss: 2.956475, normalized loss: 1.580735, ppl: 19.230072, speed: 3.51 step/s
```
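As a quick sanity check on these numbers, the `ppl` column matches `exp(avg loss)` for the values in the log above; the short snippet below reproduces that relationship (an observation based on this log, not an official definition from the project):

```python
import math

# avg loss and ppl taken from the first log line above
avg_loss = 2.880427
logged_ppl = 17.821888

# perplexity appears to be the exponential of the average per-token loss
assert abs(math.exp(avg_loss) - logged_ppl) < 1e-2
print(math.exp(avg_loss))  # ~17.82
```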
### Inference

Taking the EN-DE translation data as an example, once training has finished you can run the following command to translate the text in the specified file:
```sh
# enable garbage collection to save memory
export FLAGS_eager_delete_tensor_gb=0.0
# setting visible devices for prediction
export CUDA_VISIBLE_DEVICES=0
python -u main.py \
--do_predict True \
--src_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \
--trg_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \
--special_token '<s>' '<e>' '<unk>' \
--predict_file gen_data/wmt16_ende_data_bpe/newstest2014.tok.bpe.32000.en-de \
--batch_size 32 \
--init_from_params trained_params/step_100000 \
--beam_size 5 \
--max_out_len 255 \
--output_file predict.txt
```
The translations of the text in the file given by `predict_file` are written to the file given by `output_file`. `init_from_params` must be set to the directory containing the trained model when running prediction; for the remaining parameters, see the commented descriptions in `transformer.yaml`, where they can also be changed. Note that any model hyperparameters set at prediction time must match the settings used during training: if the model was trained with the big-model configuration, prediction uses a command similar to the following:
```sh
# enable garbage collection to save memory
export FLAGS_eager_delete_tensor_gb=0.0
# setting visible devices for prediction
export CUDA_VISIBLE_DEVICES=0
python -u main.py \
--do_predict True \
--src_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \
--trg_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \
--special_token '<s>' '<e>' '<unk>' \
--predict_file gen_data/wmt16_ende_data_bpe/newstest2014.tok.bpe.32000.en-de \
--batch_size 32 \
--init_from_params trained_params/step_100000 \
--beam_size 5 \
--max_out_len 255 \
--output_file predict.txt \
--n_head 16 \
--d_model 1024 \
--d_inner_hid 4096 \
--prepostprocess_dropout 0.3
```
### Evaluation

Each line of the prediction output is the highest-scoring translation of the corresponding input line. For data that uses BPE, the predicted translations are also in BPE form and must be restored to the original (here, tokenized) form before they can be evaluated correctly. The evaluation procedure is as follows (BLEU is the automatic evaluation metric commonly used for translation tasks):
```sh
# restore the predictions in predict.txt to tokenized text
sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt
# if the BLEU evaluation tool is not available, download it first
# git clone https://github.com/moses-smt/mosesdecoder.git
# taking the EN-DE newstest2014 test data as an example
perl gen_data/mosesdecoder/scripts/generic/multi-bleu.perl gen_data/wmt16_ende_data/newstest2014.tok.de < predict.tok.txt
```
You should see results similar to the following:
```
BLEU = 26.35, 57.7/32.1/20.0/13.0 (BP=1.000, ratio=1.013, hyp_len=63903, ref_len=63078)
```
With the code provided in this project, the EN-DE base and big models trained on eight GPUs for 100K iterations achieve roughly the following BLEU scores:

| Test set | newstest2014 | newstest2015 | newstest2016 |
|-|-|-|-|
| Base | 26.35 | 29.07 | 33.30 |
| Big | 27.07 | 30.09 | 34.38 |
### Pretrained Models

We provide the [base model](https://transformer-res.bj.bcebos.com/base_model_params.tar.gz) and [big model](https://transformer-res.bj.bcebos.com/big_model_params.tar.gz) parameters corresponding to the BLEU scores above for download.
## Advanced Usage

### Background

Transformer is the network architecture proposed in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762) for sequence-to-sequence (Seq2Seq) learning tasks such as machine translation (MT); it models the sequence-to-sequence transformation entirely with attention [1].

Compared with the recurrent neural networks (RNNs) previously widespread in Seq2Seq models, using (self-)attention to transform the input sequence into the output sequence has the following main advantages (a small numeric illustration follows the list):

- Lower computational complexity
  - For a sequence with feature dimension d and length n, an RNN costs `O(n * d * d)` (n time steps, each computing a d-dimensional matrix-vector product), while self-attention costs `O(n * n * d)` (pairwise d-dimensional dot products, or other similarity functions, across the n time steps); n is usually smaller than d.
- Higher parallelism
  - In an RNN, the computation at the current time step depends on the result of the previous time step; in self-attention, each time step depends only on the input, not on the outputs of earlier time steps, so all time steps can be computed fully in parallel.
- Easier modelling of long-range dependencies
  - In an RNN, relating two positions that are n steps apart takes n steps, whereas in self-attention any two positions are directly connected; the shorter the path, the easier it is for the signal to propagate.
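As a rough illustration of the complexity comparison above, the snippet below plugs in example values (n = 30, d = 512; the numbers are chosen here for illustration and are not taken from the project):

```python
# Rough per-layer operation counts, following the formulas above.
n, d = 30, 512  # sequence length and feature (model) dimension, example values

rnn_ops = n * d * d             # n steps, each a d x d matrix-vector product
self_attention_ops = n * n * d  # pairwise d-dimensional dot products over n positions

print(rnn_ops)             # 7864320
print(self_attention_ops)  # 460800
```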
The self-attention-based sequence modelling modules introduced in Transformer have since been widely adopted in semantic representation models such as BERT [2], with notable success.

### Model Overview

Like other Seq2Seq models, Transformer uses the typical encoder-decoder framework; the overall network structure is shown in Figure 1.
<p align="center">
<img src="images/transformer_network.png" height=400 hspace='10'/> <br />
Figure 1. Transformer network architecture
</p>
As the figure shows, unlike earlier Seq2Seq models, Transformer no longer uses RNN structures in its Encoder and Decoder.

### Model Features

The Encoder in Transformer is a stack of identical layers, each consisting of two sub-layers: Multi-Head Attention and a fully connected Feed-Forward network.

- Multi-Head Attention is used here to implement self-attention. Compared with a plain attention mechanism, it applies several different linear projections to the input, computes attention for each projection separately, concatenates all the results and applies one more linear projection to obtain the output. See Figure 2; the attention used is dot-product attention, scaled after the dot product so that large values do not push the softmax into its saturated region.
- The Feed-Forward network performs the same computation at every position of the sequence (position-wise), using two linear transformations with a ReLU activation in between.

In addition, each sub-layer is followed by a residual connection [3] and layer normalization [4] to help gradient propagation and model convergence.
<p align="center">
<img src="images/multi_head_attention.png" height=300 hspace='10'/> <br />
The Decoder has a structure similar to the Encoder's, except that each Decoder layer contains an additional Multi-Head Attention sub-layer that attends over the Encoder output; this encoder-decoder attention is also present in other Seq2Seq models.
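To make the scaled dot-product attention described above concrete, here is a minimal NumPy sketch (illustrative only; it is not the implementation used in `transformer.py`):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_key)) V for single-head, unbatched inputs."""
    d_key = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_key)               # scaled dot products, [n_q, n_k]
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key positions
    return weights @ v                              # weighted sum of value vectors

# toy example: 4 positions, d_key = d_value = 8
q = np.random.rand(4, 8)
k = np.random.rand(4, 8)
v = np.random.rand(4, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # (4, 8)
```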
## FAQ

**Q:** Why does the prediction output contain fewer samples than the input?

**A:** If the longest sample exceeds the default `max_length` in `transformer.yaml`, increase `--max_length` at run time; otherwise over-long samples are filtered out.

**Q:** What if the maximum length at prediction time exceeds the maximum length used during training?

**A:** The `max_length` setting used during training determines the size of the saved position encoding table. If the length at prediction time exceeds `max_length`, increase this value and a larger position encoding table will be regenerated.
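The position encoding table mentioned in the answer above is the sinusoidal table from the Transformer paper; below is a minimal sketch of how such a table can be generated (for illustration only, loosely mirroring what `position_encoding_init` in `transformer.py` is used for in `predict.py`; the project's exact implementation may differ):

```python
import numpy as np

def sinusoid_position_encoding(max_length, d_model):
    """Sinusoidal position encoding table of shape [max_length, d_model]."""
    positions = np.arange(max_length)[:, None]                   # [max_length, 1]
    dims = np.arange(d_model // 2)[None, :]                       # [1, d_model // 2]
    angle_rates = 1.0 / np.power(10000.0, 2.0 * dims / d_model)   # per-dimension frequencies
    angles = positions * angle_rates                              # [max_length, d_model // 2]
    table = np.zeros((max_length, d_model), dtype="float32")
    table[:, 0::2] = np.sin(angles)  # even dimensions use sine
    table[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return table

# e.g. a table covering sequences of up to 256 tokens with d_model = 512
print(sinusoid_position_encoding(256 + 1, 512).shape)  # (257, 512)
```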
## References

1. Vaswani A, Shazeer N, Parmar N, et al. [Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)[C]//Advances in Neural Information Processing Systems. 2017: 6000-6010.
2. Devlin J, Chang M W, Lee K, et al. [BERT: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805)[J]. arXiv preprint arXiv:1810.04805, 2018.
3. He K, Zhang X, Ren S, et al. [Deep residual learning for image recognition](http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
4. Ba J L, Kiros J R, Hinton G E. [Layer normalization](https://arxiv.org/pdf/1607.06450.pdf)[J]. arXiv preprint arXiv:1607.06450, 2016.
5. Sennrich R, Haddow B, Birch A. [Neural machine translation of rare words with subword units](https://arxiv.org/pdf/1508.07909)[J]. arXiv preprint arXiv:1508.07909, 2015.
## Version Updates
2019/08/16: the code was normalized and the usage of Paddle APIs was updated.
## Author

- [guochengCS](https://github.com/guoshengCS)

## Contributing

If you can fix an issue or add a new feature, you are welcome to submit a PR. If the PR is accepted, we will score the contribution by quality and difficulty (0-5, the higher the better); once you have accumulated 10 points, you can contact us for an interview opportunity or a recommendation letter.
from kpi import CostKpi, DurationKpi, AccKpi

#### NOTE kpi.py should be shared in models in some way!!!!

train_cost_card1_kpi = CostKpi('train_cost_card1', 0.002, 0, actived=True)
# test_cost_card1_kpi = CostKpi('test_cost_card1', 0.008, 0, actived=True)
train_duration_card1_kpi = DurationKpi(
    'train_duration_card1', 0.006, 0, actived=True)
train_cost_card4_kpi = CostKpi('train_cost_card4', 0.001, 0, actived=True)
# test_cost_card4_kpi = CostKpi('test_cost_card4', 0.001, 0, actived=True)
train_duration_card4_kpi = DurationKpi(
    'train_duration_card4', 0.02, 0, actived=True)

tracking_kpis = [
    train_cost_card1_kpi,
    # test_cost_card1_kpi,
    train_duration_card1_kpi,
    train_cost_card4_kpi,
    # test_cost_card4_kpi,
    train_duration_card4_kpi,
]
class TrainTaskConfig(object):
# support both CPU and GPU now.
use_gpu = True
# the epoch number to train.
pass_num = 30
# the number of sequences contained in a mini-batch.
# deprecated, set batch_size in args.
batch_size = 32
# the hyper parameters for Adam optimizer.
# This static learning_rate will be multiplied to the LearningRateScheduler
# derived learning rate the to get the final learning rate.
learning_rate = 2.0
beta1 = 0.9
beta2 = 0.997
eps = 1e-9
# the parameters for learning rate scheduling.
warmup_steps = 8000
# the weight used to mix up the ground-truth distribution and the fixed
# uniform distribution in label smoothing when training.
# Set this as zero if label smoothing is not wanted.
label_smooth_eps = 0.1
# the directory for saving trained models.
model_dir = "trained_models"
# the directory for saving checkpoints.
ckpt_dir = "trained_ckpts"
# the directory for loading checkpoint.
# If provided, continue training from the checkpoint.
ckpt_path = None
# the parameter to initialize the learning rate scheduler.
# It should be provided if use checkpoints, since the checkpoint doesn't
# include the training step counter currently.
start_step = 0
# the frequency to save trained models.
save_freq = 10000
class InferTaskConfig(object):
use_gpu = True
# the number of examples in one run for sequence generation.
batch_size = 10
# the parameters for beam search.
beam_size = 5
max_out_len = 256
# the number of decoded sentences to output.
n_best = 1
# the flags indicating whether to output the special tokens.
output_bos = False
output_eos = False
output_unk = True
# the directory for loading the trained model.
model_path = "trained_models/pass_1.infer.model"
class ModelHyperParams(object):
# These following five vocabularies related configurations will be set
# automatically according to the passed vocabulary path and special tokens.
# size of source word dictionary.
src_vocab_size = 10000
# size of target word dictionary
trg_vocab_size = 10000
# index for <bos> token
bos_idx = 0
# index for <eos> token
eos_idx = 1
# index for <unk> token
unk_idx = 2
# max length of sequences deciding the size of position encoding table.
max_length = 256
# the dimension for word embeddings, which is also the last dimension of
# the input and output of multi-head attention, position-wise feed-forward
# networks, encoder and decoder.
d_model = 512
# size of the hidden layer in position-wise feed-forward networks.
d_inner_hid = 2048
# the dimension that keys are projected to for dot-product attention.
d_key = 64
# the dimension that values are projected to for dot-product attention.
d_value = 64
# number of head used in multi-head attention.
n_head = 8
# number of sub-layers to be stacked in the encoder and decoder.
n_layer = 6
# dropout rates of different modules.
prepostprocess_dropout = 0.1
attention_dropout = 0.1
relu_dropout = 0.1
# to process before each sub-layer
preprocess_cmd = "n" # layer normalization
# to process after each sub-layer
postprocess_cmd = "da" # dropout + residual connection
# random seed used in dropout for CE.
dropout_seed = None
# the flag indicating whether to share embedding and softmax weights.
# vocabularies in source and target should be same for weight sharing.
weight_sharing = True
def merge_cfg_from_list(cfg_list, g_cfgs):
"""
Set the above global configurations using the cfg_list.
"""
assert len(cfg_list) % 2 == 0
for key, value in zip(cfg_list[0::2], cfg_list[1::2]):
for g_cfg in g_cfgs:
if hasattr(g_cfg, key):
try:
value = eval(value)
except Exception: # for file path
pass
setattr(g_cfg, key, value)
break
# The placeholder for batch_size in compile time. Must be -1 currently to be
# consistent with some ops' infer-shape output in compile time, such as the
# sequence_expand op used in beamsearch decoder.
batch_size = -1
# The placeholder for squence length in compile time.
seq_len = 256
# The placeholder for head number in compile time.
n_head = 8
# The placeholder for model dim in compile time.
d_model = 512
# Here list the data shapes and data types of all inputs.
# The shapes here act as placeholder and are set to pass the infer-shape in
# compile time.
input_descs = {
# The actual data shape of src_word is:
# [batch_size, max_src_len_in_batch, 1]
"src_word": [(batch_size, seq_len, 1), "int64", 2],
# The actual data shape of src_pos is:
# [batch_size, max_src_len_in_batch, 1]
"src_pos": [(batch_size, seq_len, 1), "int64"],
# This input is used to remove attention weights on paddings in the
# encoder.
# The actual data shape of src_slf_attn_bias is:
# [batch_size, n_head, max_src_len_in_batch, max_src_len_in_batch]
"src_slf_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"],
# The actual data shape of trg_word is:
# [batch_size, max_trg_len_in_batch, 1]
"trg_word": [(batch_size, seq_len, 1), "int64",
2], # lod_level is only used in fast decoder.
# The actual data shape of trg_pos is:
# [batch_size, max_trg_len_in_batch, 1]
"trg_pos": [(batch_size, seq_len, 1), "int64"],
# This input is used to remove attention weights on paddings and
# subsequent words in the decoder.
# The actual data shape of trg_slf_attn_bias is:
# [batch_size, n_head, max_trg_len_in_batch, max_trg_len_in_batch]
"trg_slf_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"],
# This input is used to remove attention weights on paddings of the source
# input in the encoder-decoder attention.
# The actual data shape of trg_src_attn_bias is:
# [batch_size, n_head, max_trg_len_in_batch, max_src_len_in_batch]
"trg_src_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"],
# This input is used in independent decoder program for inference.
# The actual data shape of enc_output is:
# [batch_size, max_src_len_in_batch, d_model]
"enc_output": [(batch_size, seq_len, d_model), "float32"],
# The actual data shape of label_word is:
# [batch_size * max_trg_len_in_batch, 1]
"lbl_word": [(batch_size * seq_len, 1), "int64"],
# This input is used to mask out the loss of padding tokens.
# The actual data shape of label_weight is:
# [batch_size * max_trg_len_in_batch, 1]
"lbl_weight": [(batch_size * seq_len, 1), "float32"],
# This input is used in beam-search decoder.
"init_score": [(batch_size, 1), "float32", 2],
# This input is used in beam-search decoder for the first gather
# (cell states update)
"init_idx": [(batch_size, ), "int32"],
}
# Names of word embedding table which might be reused for weight sharing.
word_emb_param_names = (
"src_word_emb_table",
"trg_word_emb_table", )
# Names of position encoding table which will be initialized externally.
pos_enc_param_names = (
"src_pos_enc_table",
"trg_pos_enc_table", )
# separated inputs for different usages.
encoder_data_input_fields = (
"src_word",
"src_pos",
"src_slf_attn_bias", )
decoder_data_input_fields = (
"trg_word",
"trg_pos",
"trg_slf_attn_bias",
"trg_src_attn_bias",
"enc_output", )
label_data_input_fields = (
"lbl_word",
"lbl_weight", )
# In fast decoder, trg_pos (only containing the current time step) is generated
# by ops and trg_slf_attn_bias is not needed.
fast_decoder_data_input_fields = (
"trg_word",
"init_score",
"init_idx",
"trg_src_attn_bias", )
import argparse
import ast
import multiprocessing
import numpy as np
import os
import sys
sys.path.append("../../")
sys.path.append("../../models/neural_machine_translation/transformer/")
from functools import partial
import paddle
import paddle.fluid as fluid
from models.model_check import check_cuda
import reader
from config import *
from desc import *
from model import fast_decode as fast_decoder
from train import pad_batch_data, prepare_data_generator
def parse_args():
parser = argparse.ArgumentParser("Training for Transformer.")
parser.add_argument(
"--src_vocab_fpath",
type=str,
required=True,
help="The path of vocabulary file of source language.")
parser.add_argument(
"--trg_vocab_fpath",
type=str,
required=True,
help="The path of vocabulary file of target language.")
parser.add_argument(
"--test_file_pattern",
type=str,
required=True,
help="The pattern to match test data files.")
parser.add_argument(
"--batch_size",
type=int,
default=50,
help="The number of examples in one run for sequence generation.")
parser.add_argument(
"--pool_size",
type=int,
default=10000,
help="The buffer size to pool data.")
parser.add_argument(
"--special_token",
type=str,
default=["<s>", "<e>", "<unk>"],
nargs=3,
help="The <bos>, <eos> and <unk> tokens in the dictionary.")
parser.add_argument(
"--token_delimiter",
type=lambda x: str(x.encode().decode("unicode-escape")),
default=" ",
help="The delimiter used to split tokens in source or target sentences. "
"For EN-DE BPE data we provided, use spaces as token delimiter. ")
parser.add_argument(
"--use_mem_opt",
type=ast.literal_eval,
default=True,
help="The flag indicating whether to use memory optimization.")
parser.add_argument(
"--use_py_reader",
type=ast.literal_eval,
default=True,
help="The flag indicating whether to use py_reader.")
parser.add_argument(
"--use_parallel_exe",
type=ast.literal_eval,
default=False,
help="The flag indicating whether to use ParallelExecutor.")
parser.add_argument(
'opts',
help='See config.py for all options',
default=None,
nargs=argparse.REMAINDER)
args = parser.parse_args()
# Append args related to dict
src_dict = reader.DataReader.load_dict(args.src_vocab_fpath)
trg_dict = reader.DataReader.load_dict(args.trg_vocab_fpath)
dict_args = [
"src_vocab_size", str(len(src_dict)), "trg_vocab_size",
str(len(trg_dict)), "bos_idx", str(src_dict[args.special_token[0]]),
"eos_idx", str(src_dict[args.special_token[1]]), "unk_idx",
str(src_dict[args.special_token[2]])
]
merge_cfg_from_list(args.opts + dict_args,
[InferTaskConfig, ModelHyperParams])
return args
def post_process_seq(seq,
bos_idx=ModelHyperParams.bos_idx,
eos_idx=ModelHyperParams.eos_idx,
output_bos=InferTaskConfig.output_bos,
output_eos=InferTaskConfig.output_eos):
"""
Post-process the beam-search decoded sequence. Truncate from the first
<eos> and remove the <bos> and <eos> tokens currently.
"""
eos_pos = len(seq) - 1
for i, idx in enumerate(seq):
if idx == eos_idx:
eos_pos = i
break
seq = [
idx for idx in seq[:eos_pos + 1]
if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)
]
return seq
def prepare_batch_input(insts, data_input_names, src_pad_idx, bos_idx, n_head,
d_model, place):
"""
Put all padded data needed by beam search decoder into a dict.
"""
src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data(
[inst[0] for inst in insts], src_pad_idx, n_head, is_target=False)
# start tokens
trg_word = np.asarray([[bos_idx]] * len(insts), dtype="int64")
trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :],
[1, 1, 1, 1]).astype("float32")
trg_word = trg_word.reshape(-1, 1, 1)
src_word = src_word.reshape(-1, src_max_len, 1)
src_pos = src_pos.reshape(-1, src_max_len, 1)
def to_lodtensor(data, place, lod=None):
data_tensor = fluid.LoDTensor()
data_tensor.set(data, place)
if lod is not None:
data_tensor.set_lod(lod)
return data_tensor
# beamsearch_op must use tensors with lod
init_score = to_lodtensor(
np.zeros_like(
trg_word, dtype="float32").reshape(-1, 1),
place, [range(trg_word.shape[0] + 1)] * 2)
trg_word = to_lodtensor(trg_word, place, [range(trg_word.shape[0] + 1)] * 2)
init_idx = np.asarray(range(len(insts)), dtype="int32")
data_input_dict = dict(
zip(data_input_names, [
src_word, src_pos, src_slf_attn_bias, trg_word, init_score,
init_idx, trg_src_attn_bias
]))
return data_input_dict
def prepare_feed_dict_list(data_generator, count, place):
"""
Prepare the list of feed dict for multi-devices.
"""
feed_dict_list = []
if data_generator is not None: # use_py_reader == False
data_input_names = encoder_data_input_fields + fast_decoder_data_input_fields
data = next(data_generator)
for idx, data_buffer in enumerate(data):
data_input_dict = prepare_batch_input(
data_buffer, data_input_names, ModelHyperParams.eos_idx,
ModelHyperParams.bos_idx, ModelHyperParams.n_head,
ModelHyperParams.d_model, place)
feed_dict_list.append(data_input_dict)
return feed_dict_list if len(feed_dict_list) == count else None
def py_reader_provider_wrapper(data_reader, place):
"""
Data provider needed by fluid.layers.py_reader.
"""
def py_reader_provider():
data_input_names = encoder_data_input_fields + fast_decoder_data_input_fields
for batch_id, data in enumerate(data_reader()):
data_input_dict = prepare_batch_input(
data, data_input_names, ModelHyperParams.eos_idx,
ModelHyperParams.bos_idx, ModelHyperParams.n_head,
ModelHyperParams.d_model, place)
yield [data_input_dict[item] for item in data_input_names]
return py_reader_provider
def fast_infer(args):
"""
Inference by beam search decoder based solely on Fluid operators.
"""
out_ids, out_scores, pyreader = fast_decoder(
ModelHyperParams.src_vocab_size,
ModelHyperParams.trg_vocab_size,
ModelHyperParams.max_length + 1,
ModelHyperParams.n_layer,
ModelHyperParams.n_head,
ModelHyperParams.d_key,
ModelHyperParams.d_value,
ModelHyperParams.d_model,
ModelHyperParams.d_inner_hid,
ModelHyperParams.prepostprocess_dropout,
ModelHyperParams.attention_dropout,
ModelHyperParams.relu_dropout,
ModelHyperParams.preprocess_cmd,
ModelHyperParams.postprocess_cmd,
ModelHyperParams.weight_sharing,
InferTaskConfig.beam_size,
InferTaskConfig.max_out_len,
ModelHyperParams.bos_idx,
ModelHyperParams.eos_idx,
use_py_reader=args.use_py_reader)
# This is used here to set dropout to the test mode.
infer_program = fluid.default_main_program().clone(for_test=True)
if args.use_mem_opt:
fluid.memory_optimize(infer_program)
if InferTaskConfig.use_gpu:
check_cuda(InferTaskConfig.use_gpu)
place = fluid.CUDAPlace(0)
dev_count = fluid.core.get_cuda_device_count()
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
fluid.io.load_vars(
exe,
InferTaskConfig.model_path,
vars=[
var for var in infer_program.list_vars()
if isinstance(var, fluid.framework.Parameter)
])
exec_strategy = fluid.ExecutionStrategy()
# For faster executor
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = 1
build_strategy = fluid.BuildStrategy()
infer_exe = fluid.ParallelExecutor(
use_cuda=TrainTaskConfig.use_gpu,
main_program=infer_program,
build_strategy=build_strategy,
exec_strategy=exec_strategy)
# data reader settings for inference
args.train_file_pattern = args.test_file_pattern
args.use_token_batch = False
args.sort_type = reader.SortType.NONE
args.shuffle = False
args.shuffle_batch = False
test_data = prepare_data_generator(
args,
is_test=False,
count=dev_count,
pyreader=pyreader,
py_reader_provider_wrapper=py_reader_provider_wrapper,
place=place)
if args.use_py_reader:
pyreader.start()
data_generator = None
else:
data_generator = test_data()
trg_idx2word = reader.DataReader.load_dict(
dict_path=args.trg_vocab_fpath, reverse=True)
while True:
try:
feed_dict_list = prepare_feed_dict_list(data_generator, dev_count,
place)
if args.use_parallel_exe:
seq_ids, seq_scores = infer_exe.run(
fetch_list=[out_ids.name, out_scores.name],
feed=feed_dict_list,
return_numpy=False)
else:
seq_ids, seq_scores = exe.run(
program=infer_program,
fetch_list=[out_ids.name, out_scores.name],
feed=feed_dict_list[0]
if feed_dict_list is not None else None,
return_numpy=False,
use_program_cache=False)
seq_ids_list, seq_scores_list = [seq_ids], [
seq_scores
] if isinstance(seq_ids,
paddle.fluid.LoDTensor) else (seq_ids, seq_scores)
for seq_ids, seq_scores in zip(seq_ids_list, seq_scores_list):
# How to parse the results:
# Suppose the lod of seq_ids is:
# [[0, 3, 6], [0, 12, 24, 40, 54, 67, 82]]
# then from lod[0]:
# there are 2 source sentences, beam width is 3.
# from lod[1]:
# the first source sentence has 3 hyps; the lengths are 12, 12, 16
# the second source sentence has 3 hyps; the lengths are 14, 13, 15
hyps = [[] for i in range(len(seq_ids.lod()[0]) - 1)]
scores = [[] for i in range(len(seq_scores.lod()[0]) - 1)]
for i in range(len(seq_ids.lod()[0]) -
1): # for each source sentence
start = seq_ids.lod()[0][i]
end = seq_ids.lod()[0][i + 1]
for j in range(end - start): # for each candidate
sub_start = seq_ids.lod()[1][start + j]
sub_end = seq_ids.lod()[1][start + j + 1]
hyps[i].append(" ".join([
trg_idx2word[idx]
for idx in post_process_seq(
np.array(seq_ids)[sub_start:sub_end])
]))
scores[i].append(np.array(seq_scores)[sub_end - 1])
print(hyps[i][-1])
if len(hyps[i]) >= InferTaskConfig.n_best:
break
except (StopIteration, fluid.core.EOFException):
# The data pass is over.
if args.use_py_reader:
pyreader.reset()
break
if __name__ == "__main__":
args = parse_args()
fast_infer(args)
#encoding=utf8
import logging
import os
import six
import sys
import time
import numpy as np
import paddle
import paddle.fluid as fluid
#include palm for easier nlp coding
from palm.toolkit.input_field import InputField
from palm.toolkit.configure import PDConfig
# include task-specific libs
import desc
import reader
from transformer import create_net
def init_from_pretrain_model(args, exe, program):
assert isinstance(args.init_from_pretrain_model, str)
if not os.path.exists(args.init_from_pretrain_model):
raise Warning("The pretrained params do not exist.")
return False
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(
os.path.join(args.init_from_pretrain_model, var.name))
fluid.io.load_vars(
exe,
args.init_from_pretrain_model,
main_program=program,
predicate=existed_params)
print("finish initing model from pretrained params from %s" %
(args.init_from_pretrain_model))
return True
def init_from_params(args, exe, program):
assert isinstance(args.init_from_params, str)
if not os.path.exists(args.init_from_params):
raise Warning("the params path does not exist.")
return False
fluid.io.load_params(
executor=exe,
dirname=args.init_from_params,
main_program=program,
filename="params.pdparams")
print("finish init model from params from %s" % (args.init_from_params))
return True
def do_save_inference_model(args):
if args.use_cuda:
dev_count = fluid.core.get_cuda_device_count()
place = fluid.CUDAPlace(0)
else:
dev_count = int(os.environ.get('CPU_NUM', 1))
place = fluid.CPUPlace()
test_prog = fluid.default_main_program()
startup_prog = fluid.default_startup_program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
# define input and reader
input_field_names = desc.encoder_data_input_fields + desc.fast_decoder_data_input_fields
input_slots = [{
"name": name,
"shape": desc.input_descs[name][0],
"dtype": desc.input_descs[name][1]
} for name in input_field_names]
input_field = InputField(input_slots)
input_field.build(build_pyreader=True)
# define the network
predictions = create_net(
is_training=False, model_input=input_field, args=args)
out_ids, out_scores = predictions
# This is used here to set dropout to the test mode.
test_prog = test_prog.clone(for_test=True)
# prepare predicting
## define the executor and program for training
exe = fluid.Executor(place)
exe.run(startup_prog)
assert (args.init_from_params) or (args.init_from_pretrain_model)
if args.init_from_params:
init_from_params(args, exe, test_prog)
elif args.init_from_pretrain_model:
init_from_pretrain_model(args, exe, test_prog)
# saving inference model
fluid.io.save_inference_model(
args.inference_model_dir,
feeded_var_names=input_field_names,
target_vars=[out_ids, out_scores],
executor=exe,
main_program=test_prog,
model_filename="model.pdmodel",
params_filename="params.pdparams")
print("save inference model at %s" % (args.inference_model_dir))
if __name__ == "__main__":
args = PDConfig(yaml_file="./transformer.yaml")
args.build()
args.Print()
do_save_inference_model(args)
#encoding=utf8
import os
import sys
import logging
import numpy as np
import paddle
import paddle.fluid as fluid
#include palm for easier nlp coding
from palm.toolkit.configure import PDConfig
from train import do_train
from predict import do_predict
from inference_model import do_save_inference_model
if __name__ == "__main__":
LOG_FORMAT = "[%(asctime)s %(levelname)s %(filename)s:%(lineno)d] %(message)s"
logging.basicConfig(
stream=sys.stdout, level=logging.DEBUG, format=LOG_FORMAT)
logging.getLogger().setLevel(logging.INFO)
args = PDConfig(yaml_file="./transformer.yaml")
args.build()
args.Print()
if args.do_train:
do_train(args)
if args.do_predict:
do_predict(args)
if args.do_save_inference_model:
do_save_inference_model(args)
#encoding=utf8
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import argparse
import json
import yaml
import six
import logging
logging_only_message = "%(message)s"
logging_details = "%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s"
class JsonConfig(object):
"""
A high-level api for handling json configure file.
"""
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
except:
raise IOError("Error in parsing bert model config file '%s'" %
config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
class ArgumentGroup(object):
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, **kwargs):
type = str2bool if type == bool else type
self._group.add_argument(
"--" + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
class ArgConfig(object):
"""
A high-level api for handling argument configs.
"""
def __init__(self):
parser = argparse.ArgumentParser()
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5,
"Learning rate used to train with warmup.")
train_g.add_arg(
"lr_scheduler",
str,
"linear_warmup_decay",
"scheduler of learning rate.",
choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01,
"Weight decay rate for L2 regularizer.")
train_g.add_arg(
"warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for."
)
train_g.add_arg("save_steps", int, 1000,
"The steps interval to save checkpoints.")
train_g.add_arg("use_fp16", bool, False,
"Whether to use fp16 mixed precision training.")
train_g.add_arg(
"loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled."
)
train_g.add_arg("pred_dir", str, None,
"Path to save the prediction results")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10,
"The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True,
"If set, use GPU for training.")
run_type_g.add_arg(
"use_fast_executor", bool, False,
"If set, use fast parallel executor (in experiment).")
run_type_g.add_arg(
"num_iteration_per_drop_scope", int, 1,
"Ihe iteration intervals to clean up temporary variables.")
run_type_g.add_arg("do_train", bool, True,
"Whether to perform training.")
run_type_g.add_arg("do_predict", bool, True,
"Whether to perform prediction.")
custom_g = ArgumentGroup(parser, "customize", "customized options.")
self.custom_g = custom_g
self.parser = parser
def add_arg(self, name, dtype, default, descrip):
self.custom_g.add_arg(name, dtype, default, descrip)
def build_conf(self):
return self.parser.parse_args()
def str2bool(v):
# because argparse does not support to parse "true, False" as python
# boolean directly
return v.lower() in ("true", "t", "1")
def print_arguments(args, log=None):
if not log:
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
else:
log.info('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
log.info('%s: %s' % (arg, value))
log.info('------------------------------------------------')
class PDConfig(object):
"""
A high-level API for managing configuration files in PaddlePaddle.
Can jointly work with command-line arguments, json files and yaml files.
"""
def __init__(self, json_file="", yaml_file="", fuse_args=True):
"""
Init function for PDConfig.
json_file: the path to the json configure file.
yaml_file: the path to the yaml configure file.
fuse_args: if fuse the json/yaml configs with argparse.
"""
assert isinstance(json_file, str)
assert isinstance(yaml_file, str)
if json_file != "" and yaml_file != "":
raise Warning(
"json_file and yaml_file can not co-exist for now. please only use one configure file type."
)
return
self.args = None
self.arg_config = {}
self.json_config = {}
self.yaml_config = {}
parser = argparse.ArgumentParser()
self.default_g = ArgumentGroup(parser, "default", "default options.")
self.yaml_g = ArgumentGroup(parser, "yaml", "options from yaml.")
self.json_g = ArgumentGroup(parser, "json", "options from json.")
self.com_g = ArgumentGroup(parser, "custom", "customized options.")
self.default_g.add_arg("do_train", bool, False,
"Whether to perform training.")
self.default_g.add_arg("do_predict", bool, False,
"Whether to perform predicting.")
self.default_g.add_arg("do_eval", bool, False,
"Whether to perform evaluating.")
self.default_g.add_arg("do_save_inference_model", bool, False,
"Whether to perform model saving for inference.")
self.parser = parser
if json_file != "":
self.load_json(json_file, fuse_args=fuse_args)
if yaml_file:
self.load_yaml(yaml_file, fuse_args=fuse_args)
def load_json(self, file_path, fuse_args=True):
if not os.path.exists(file_path):
raise Warning("the json file %s does not exist." % file_path)
return
with open(file_path, "r") as fin:
self.json_config = json.loads(fin.read())
fin.close()
if fuse_args:
for name in self.json_config:
if isinstance(self.json_config[name], list):
self.json_g.add_arg(
name,
type(self.json_config[name][0]),
self.json_config[name],
"This is from %s" % file_path,
nargs=len(self.json_config[name]))
continue
if not isinstance(self.json_config[name], int) \
and not isinstance(self.json_config[name], float) \
and not isinstance(self.json_config[name], str) \
and not isinstance(self.json_config[name], bool):
continue
self.json_g.add_arg(name,
type(self.json_config[name]),
self.json_config[name],
"This is from %s" % file_path)
def load_yaml(self, file_path, fuse_args=True):
if not os.path.exists(file_path):
raise Warning("the yaml file %s does not exist." % file_path)
return
with open(file_path, "r") as fin:
self.yaml_config = yaml.load(fin, Loader=yaml.SafeLoader)
fin.close()
if fuse_args:
for name in self.yaml_config:
if isinstance(self.yaml_config[name], list):
self.yaml_g.add_arg(
name,
type(self.yaml_config[name][0]),
self.yaml_config[name],
"This is from %s" % file_path,
nargs=len(self.yaml_config[name]))
continue
if not isinstance(self.yaml_config[name], int) \
and not isinstance(self.yaml_config[name], float) \
and not isinstance(self.yaml_config[name], str) \
and not isinstance(self.yaml_config[name], bool):
continue
self.yaml_g.add_arg(name,
type(self.yaml_config[name]),
self.yaml_config[name],
"This is from %s" % file_path)
def build(self):
self.args = self.parser.parse_args()
self.arg_config = vars(self.args)
def __add__(self, new_arg):
assert isinstance(new_arg, list) or isinstance(new_arg, tuple)
assert len(new_arg) >= 3
assert self.args is None
name = new_arg[0]
dtype = new_arg[1]
dvalue = new_arg[2]
desc = new_arg[3] if len(
new_arg) == 4 else "Description is not provided."
self.com_g.add_arg(name, dtype, dvalue, desc)
return self
def __getattr__(self, name):
if name in self.arg_config:
return self.arg_config[name]
if name in self.json_config:
return self.json_config[name]
if name in self.yaml_config:
return self.yaml_config[name]
raise Warning("The argument %s is not defined." % name)
def Print(self):
print("-" * 70)
for name in self.arg_config:
print("%s:\t\t\t\t%s" % (str(name), str(self.arg_config[name])))
for name in self.json_config:
if name not in self.arg_config:
print("%s:\t\t\t\t%s" %
(str(name), str(self.json_config[name])))
for name in self.yaml_config:
if name not in self.arg_config:
print("%s:\t\t\t\t%s" %
(str(name), str(self.yaml_config[name])))
print("-" * 70)
if __name__ == "__main__":
"""
pd_config = PDConfig(json_file = "./test/bert_config.json")
pd_config.build()
print(pd_config.do_train)
print(pd_config.hidden_size)
pd_config = PDConfig(yaml_file = "./test/bert_config.yaml")
pd_config.build()
print(pd_config.do_train)
print(pd_config.hidden_size)
"""
pd_config = PDConfig(yaml_file="./test/bert_config.yaml")
pd_config += ("my_age", int, 18, "I am forever 18.")
pd_config.build()
print(pd_config.do_train)
print(pd_config.hidden_size)
print(pd_config.my_age)
#encoding=utf8
from __future__ import print_function
from __future__ import division
from __future__ import print_function
import os
import six
import ast
import copy
import numpy as np
import paddle.fluid as fluid
class Placeholder(object):
def __init__(self):
self.shapes = []
self.dtypes = []
self.lod_levels = []
self.names = []
def __init__(self, input_shapes):
self.shapes = []
self.dtypes = []
self.lod_levels = []
self.names = []
for new_holder in input_shapes:
shape = new_holder[0]
dtype = new_holder[1]
lod_level = new_holder[2] if len(new_holder) >= 3 else 0
name = new_holder[3] if len(new_holder) >= 4 else ""
self.append_placeholder(
shape, dtype, lod_level=lod_level, name=name)
def append_placeholder(self, shape, dtype, lod_level=0, name=""):
self.shapes.append(shape)
self.dtypes.append(dtype)
self.lod_levels.append(lod_level)
self.names.append(name)
def build(self, capacity, reader_name, use_double_buffer=False):
pyreader = fluid.layers.py_reader(
capacity=capacity,
shapes=self.shapes,
dtypes=self.dtypes,
lod_levels=self.lod_levels,
name=reader_name,
use_double_buffer=use_double_buffer)
return [pyreader, fluid.layers.read_file(pyreader)]
def __add__(self, new_holder):
assert isinstance(new_holder, tuple) or isinstance(new_holder, list)
assert len(new_holder) >= 2
shape = new_holder[0]
dtype = new_holder[1]
lod_level = new_holder[2] if len(new_holder) >= 3 else 0
name = new_holder[3] if len(new_holder) >= 4 else ""
self.append_placeholder(shape, dtype, lod_level=lod_level, name=name)
class InputField(object):
"""
A high-level API for handling inputs in PaddlePaddle.
"""
def __init__(self, input_slots=[]):
self.shapes = []
self.dtypes = []
self.names = []
self.lod_levels = []
self.input_slots = {}
self.feed_list_str = []
self.feed_list = []
self.reader = None
if input_slots:
for input_slot in input_slots:
self += input_slot
def __add__(self, input_slot):
if isinstance(input_slot, list) or isinstance(input_slot, tuple):
name = input_slot[0]
shape = input_slot[1]
dtype = input_slot[2]
lod_level = input_slot[3] if len(input_slot) == 4 else 0
if isinstance(input_slot, dict):
name = input_slot["name"]
shape = input_slot["shape"]
dtype = input_slot["dtype"]
lod_level = input_slot[
"lod_level"] if "lod_level" in input_slot else 0
self.shapes.append(shape)
self.dtypes.append(dtype)
self.names.append(name)
self.lod_levels.append(lod_level)
self.feed_list_str.append(name)
return self
def __getattr__(self, name):
if name not in self.input_slots:
raise Warning("the attr %s has not been defined yet." % name)
return None
return self.input_slots[name]
def build(self, build_pyreader=False, capacity=100, iterable=False):
for _name, _shape, _dtype, _lod_level in zip(
self.names, self.shapes, self.dtypes, self.lod_levels):
self.input_slots[_name] = fluid.layers.data(
name=_name, shape=_shape, dtype=_dtype, lod_level=_lod_level)
for name in self.feed_list_str:
self.feed_list.append(self.input_slots[name])
if build_pyreader:
self.reader = fluid.io.PyReader(
feed_list=self.feed_list, capacity=capacity, iterable=iterable)
def start(self, generator=None):
if generator is not None:
self.reader.decorate_batch_generator(generator)
self.reader.start()
if __name__ == "__main__":
mnist_input_slots = [{
"name": "image",
"shape": (-1, 32, 32, 1),
"dtype": "int32"
}, {
"name": "label",
"shape": [-1, 1],
"dtype": "int64"
}]
input_field = InputField(mnist_input_slots)
input_field += {
"name": "large_image",
"shape": (-1, 64, 64, 1),
"dtype": "int32"
}
input_field += {
"name": "large_color_image",
"shape": (-1, 64, 64, 3),
"dtype": "int32"
}
input_field.build()
print(input_field.feed_list)
print(input_field.image)
print(input_field.large_color_image)
#encoding=utf8
import logging
import os
import six
import sys
import time
import numpy as np
import paddle
import paddle.fluid as fluid
#include palm for easier nlp coding
from palm.toolkit.input_field import InputField
from palm.toolkit.configure import PDConfig
# include task-specific libs
import desc
import reader
from transformer import create_net, position_encoding_init
def init_from_pretrain_model(args, exe, program):
assert isinstance(args.init_from_pretrain_model, str)
if not os.path.exists(args.init_from_pretrain_model):
raise Warning("The pretrained params do not exist.")
return False
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(
os.path.join(args.init_from_pretrain_model, var.name))
fluid.io.load_vars(
exe,
args.init_from_pretrain_model,
main_program=program,
predicate=existed_params)
print("finish initing model from pretrained params from %s" %
(args.init_from_pretrain_model))
return True
def init_from_params(args, exe, program):
assert isinstance(args.init_from_params, str)
if not os.path.exists(args.init_from_params):
raise Warning("the params path does not exist.")
return False
fluid.io.load_params(
executor=exe,
dirname=args.init_from_params,
main_program=program,
filename="params.pdparams")
print("finish init model from params from %s" % (args.init_from_params))
return True
def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False):
"""
Post-process the beam-search decoded sequence. Truncate from the first
<eos> and remove the <bos> and <eos> tokens currently.
"""
eos_pos = len(seq) - 1
for i, idx in enumerate(seq):
if idx == eos_idx:
eos_pos = i
break
seq = [
idx for idx in seq[:eos_pos + 1]
if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)
]
return seq
def do_predict(args):
if args.use_cuda:
dev_count = fluid.core.get_cuda_device_count()
place = fluid.CUDAPlace(0)
else:
dev_count = int(os.environ.get('CPU_NUM', 1))
place = fluid.CPUPlace()
# define the data generator
processor = reader.DataProcessor(
fpattern=args.predict_file,
src_vocab_fpath=args.src_vocab_fpath,
trg_vocab_fpath=args.trg_vocab_fpath,
token_delimiter=args.token_delimiter,
use_token_batch=False,
batch_size=args.batch_size,
device_count=dev_count,
pool_size=args.pool_size,
sort_type=reader.SortType.NONE,
shuffle=False,
shuffle_batch=False,
start_mark=args.special_token[0],
end_mark=args.special_token[1],
unk_mark=args.special_token[2],
max_length=args.max_length,
n_head=args.n_head)
batch_generator = processor.data_generator(phase="predict", place=place)
args.src_vocab_size, args.trg_vocab_size, args.bos_idx, args.eos_idx, \
args.unk_idx = processor.get_vocab_summary()
trg_idx2word = reader.DataProcessor.load_dict(
dict_path=args.trg_vocab_fpath, reverse=True)
test_prog = fluid.default_main_program()
startup_prog = fluid.default_startup_program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
# define input and reader
input_field_names = desc.encoder_data_input_fields + desc.fast_decoder_data_input_fields
input_slots = [{
"name": name,
"shape": desc.input_descs[name][0],
"dtype": desc.input_descs[name][1]
} for name in input_field_names]
input_field = InputField(input_slots)
input_field.build(build_pyreader=True)
# define the network
predictions = create_net(
is_training=False, model_input=input_field, args=args)
out_ids, out_scores = predictions
out_ids.persistable = out_scores.persistable = True
# This is used here to set dropout to the test mode.
test_prog = test_prog.clone(for_test=True)
# prepare predicting
## define the executor and program for prediction
exe = fluid.Executor(place)
exe.run(startup_prog)
assert (args.init_from_params) or (args.init_from_pretrain_model)
if args.init_from_params:
init_from_params(args, exe, test_prog)
elif args.init_from_pretrain_model:
init_from_pretrain_model(args, exe, test_prog)
# reset the size of the position encoding table to max_length so that inference is not limited to the lengths seen in training
for pos_enc_param_name in desc.pos_enc_param_names:
pos_enc_param = fluid.global_scope().find_var(
pos_enc_param_name).get_tensor()
pos_enc_param.set(
position_encoding_init(args.max_length + 1, args.d_model), place)
compiled_test_prog = fluid.CompiledProgram(test_prog)
f = open(args.output_file, "wb")
# start predicting
## decorate the pyreader with batch_generator
input_field.reader.decorate_batch_generator(batch_generator)
input_field.reader.start()
while True:
try:
seq_ids, seq_scores = exe.run(
compiled_test_prog,
fetch_list=[out_ids.name, out_scores.name],
return_numpy=False)
# How to parse the results:
# Suppose the lod of seq_ids is:
# [[0, 3, 6], [0, 12, 24, 40, 54, 67, 82]]
# then from lod[0]:
# there are 2 source sentences, beam width is 3.
# from lod[1]:
# the first source sentence has 3 hyps; the lengths are 12, 12, 16
# the second source sentence has 3 hyps; the lengths are 14, 13, 15
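# So, for the hypothetical lod above, the first hypothesis of the first
# source sentence spans seq_ids[0:12] and its score is seq_scores[11].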
hyps = [[] for i in range(len(seq_ids.lod()[0]) - 1)]
scores = [[] for i in range(len(seq_scores.lod()[0]) - 1)]
for i in range(len(seq_ids.lod()[0]) -
1): # for each source sentence
start = seq_ids.lod()[0][i]
end = seq_ids.lod()[0][i + 1]
for j in range(end - start): # for each candidate
sub_start = seq_ids.lod()[1][start + j]
sub_end = seq_ids.lod()[1][start + j + 1]
hyps[i].append(b" ".join([
trg_idx2word[idx]
for idx in post_process_seq(
np.array(seq_ids)[sub_start:sub_end], args.bos_idx,
args.eos_idx)
]))
scores[i].append(np.array(seq_scores)[sub_end - 1])
f.write(hyps[i][-1] + b"\n")
if len(hyps[i]) >= args.n_best:
break
except fluid.core.EOFException:
break
f.close()
if __name__ == "__main__":
args = PDConfig(yaml_file="./transformer.yaml")
args.build()
args.Print()
do_predict(args)
...@@ -4,6 +4,134 @@ import os ...@@ -4,6 +4,134 @@ import os
import tarfile import tarfile
import numpy as np import numpy as np
import paddle.fluid as fluid
def pad_batch_data(insts,
pad_idx,
n_head,
is_target=False,
is_label=False,
return_attn_bias=True,
return_max_len=True,
return_num_token=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias.
"""
return_list = []
max_len = max(len(inst) for inst in insts)
# Any token included in dict can be used to pad, since the paddings' loss
# will be masked out by weights and have no effect on parameter gradients.
inst_data = np.array(
[inst + [pad_idx] * (max_len - len(inst)) for inst in insts])
return_list += [inst_data.astype("int64").reshape([-1, 1])]
if is_label: # label weight
inst_weight = np.array([[1.] * len(inst) + [0.] * (max_len - len(inst))
for inst in insts])
return_list += [inst_weight.astype("float32").reshape([-1, 1])]
else: # position data
inst_pos = np.array([
list(range(0, len(inst))) + [0] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, 1])]
if return_attn_bias:
if is_target:
# This is used to avoid attention on paddings and subsequent
# words.
slf_attn_bias_data = np.ones((inst_data.shape[0], max_len, max_len))
slf_attn_bias_data = np.triu(slf_attn_bias_data,
1).reshape([-1, 1, max_len, max_len])
slf_attn_bias_data = np.tile(slf_attn_bias_data,
[1, n_head, 1, 1]) * [-1e9]
else:
# This is used to avoid attention on paddings.
slf_attn_bias_data = np.array([[0] * len(inst) + [-1e9] *
(max_len - len(inst))
for inst in insts])
slf_attn_bias_data = np.tile(
slf_attn_bias_data.reshape([-1, 1, 1, max_len]),
[1, n_head, max_len, 1])
return_list += [slf_attn_bias_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
return return_list if len(return_list) > 1 else return_list[0]
def prepare_train_input(insts, src_pad_idx, trg_pad_idx, n_head):
"""
Put all padded data needed by training into a list.
"""
src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data(
[inst[0] for inst in insts], src_pad_idx, n_head, is_target=False)
src_word = src_word.reshape(-1, src_max_len, 1)
src_pos = src_pos.reshape(-1, src_max_len, 1)
trg_word, trg_pos, trg_slf_attn_bias, trg_max_len = pad_batch_data(
[inst[1] for inst in insts], trg_pad_idx, n_head, is_target=True)
trg_word = trg_word.reshape(-1, trg_max_len, 1)
trg_pos = trg_pos.reshape(-1, trg_max_len, 1)
trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :],
[1, 1, trg_max_len, 1]).astype("float32")
lbl_word, lbl_weight, num_token = pad_batch_data(
[inst[2] for inst in insts],
trg_pad_idx,
n_head,
is_target=False,
is_label=True,
return_attn_bias=False,
return_max_len=False,
return_num_token=True)
data_inputs = [
src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos,
trg_slf_attn_bias, trg_src_attn_bias, lbl_word, lbl_weight
]
return data_inputs
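# A rough shape sketch for prepare_train_input with batch size B, source
# length S and target length T: src_word/src_pos are [B, S, 1], the
# self-attention biases are [B, n_head, S, S] and [B, n_head, T, T],
# trg_src_attn_bias is [B, n_head, T, S], and lbl_word/lbl_weight are
# [B * T, 1].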
def prepare_infer_input(insts, src_pad_idx, bos_idx, n_head, place):
"""
Put all padded data needed by beam search decoder into a list.
"""
src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data(
[inst[0] for inst in insts], src_pad_idx, n_head, is_target=False)
# start tokens
trg_word = np.asarray([[bos_idx]] * len(insts), dtype="int64")
trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :],
[1, 1, 1, 1]).astype("float32")
trg_word = trg_word.reshape(-1, 1, 1)
src_word = src_word.reshape(-1, src_max_len, 1)
src_pos = src_pos.reshape(-1, src_max_len, 1)
def to_lodtensor(data, place, lod=None):
data_tensor = fluid.LoDTensor()
data_tensor.set(data, place)
if lod is not None:
data_tensor.set_lod(lod)
return data_tensor
# beamsearch_op must use tensors with lod
init_score = to_lodtensor(
np.zeros_like(
trg_word, dtype="float32").reshape(-1, 1),
place, [range(trg_word.shape[0] + 1)] * 2)
trg_word = to_lodtensor(trg_word, place, [range(trg_word.shape[0] + 1)] * 2)
init_idx = np.asarray(range(len(insts)), dtype="int32")
data_inputs = [
src_word, src_pos, src_slf_attn_bias, trg_word, init_score, init_idx,
trg_src_attn_bias
]
return data_inputs
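# Note: trg_word and init_score are LoDTensors with a 2-level LoD of
# [0, 1, ..., B] at both levels, i.e. one single-token hypothesis per source
# sentence, which is what the beam-search decoder expects as its start state.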
class SortType(object): class SortType(object):
...@@ -95,7 +223,7 @@ class MinMaxFilter(object): ...@@ -95,7 +223,7 @@ class MinMaxFilter(object):
return self._creator.batch return self._creator.batch
class DataReader(object): class DataProcessor(object):
""" """
The data reader loads all data from files and produces batches of data The data reader loads all data from files and produces batches of data
in the way corresponding to settings. in the way corresponding to settings.
...@@ -104,12 +232,14 @@ class DataReader(object): ...@@ -104,12 +232,14 @@ class DataReader(object):
is shuffled in each pass and sorted in each pool: is shuffled in each pass and sorted in each pool:
``` ```
train_data = DataReader( train_data = DataProcessor(
src_vocab_fpath='data/src_vocab_file', src_vocab_fpath='data/src_vocab_file',
trg_vocab_fpath='data/trg_vocab_file', trg_vocab_fpath='data/trg_vocab_file',
fpattern='data/part-*', fpattern='data/part-*',
use_token_batch=True, use_token_batch=True,
batch_size=2000, batch_size=2000,
device_count=8,
n_head=8,
pool_size=10000, pool_size=10000,
sort_type=SortType.POOL, sort_type=SortType.POOL,
shuffle=True, shuffle=True,
...@@ -117,7 +247,7 @@ class DataReader(object): ...@@ -117,7 +247,7 @@ class DataReader(object):
start_mark='<s>', start_mark='<s>',
end_mark='<e>', end_mark='<e>',
unk_mark='<unk>', unk_mark='<unk>',
clip_last_batch=False).batch_generator clip_last_batch=False).data_generator(phase='train')
``` ```
:param src_vocab_fpath: The path of vocabulary file of source language. :param src_vocab_fpath: The path of vocabulary file of source language.
...@@ -131,6 +261,12 @@ class DataReader(object): ...@@ -131,6 +261,12 @@ class DataReader(object):
mini-batch. mini-batch.
:type batch_size: int :type batch_size: int
:param pool_size: The size of pool buffer. :param pool_size: The size of pool buffer.
:type device_count: int
:param device_count: The number of devices. The actual batch size is
determined by both batch_size and device_count.
:type n_head: int
:param n_head: The number of head used in multi-head attention. Actually,
this is not a reader related argument, but is used for input data.
:type pool_size: int :type pool_size: int
:param sort_type: The grain to sort by length: 'global' for all :param sort_type: The grain to sort by length: 'global' for all
instances; 'pool' for instances in pool; 'none' for no sort. instances; 'pool' for instances in pool; 'none' for no sort.
...@@ -164,6 +300,9 @@ class DataReader(object): ...@@ -164,6 +300,9 @@ class DataReader(object):
:type end_mark: basestring :type end_mark: basestring
:param unk_mark: The token representing for unknown word in dictionary. :param unk_mark: The token representing for unknown word in dictionary.
:type unk_mark: basestring :type unk_mark: basestring
:param only_src: Whether each line is a source and target sentence
pair or only has the source sentence.
:type only_src: bool
:param seed: The seed for random. :param seed: The seed for random.
:type seed: int :type seed: int
""" """
...@@ -173,14 +312,15 @@ class DataReader(object): ...@@ -173,14 +312,15 @@ class DataReader(object):
trg_vocab_fpath, trg_vocab_fpath,
fpattern, fpattern,
batch_size, batch_size,
device_count,
n_head,
pool_size, pool_size,
sort_type=SortType.GLOBAL, sort_type=SortType.GLOBAL,
clip_last_batch=True, clip_last_batch=False,
tar_fname=None, tar_fname=None,
min_length=0, min_length=0,
max_length=100, max_length=100,
shuffle=True, shuffle=True,
shuffle_seed=None,
shuffle_batch=False, shuffle_batch=False,
use_token_batch=False, use_token_batch=False,
field_delimiter="\t", field_delimiter="\t",
...@@ -188,37 +328,44 @@ class DataReader(object): ...@@ -188,37 +328,44 @@ class DataReader(object):
start_mark="<s>", start_mark="<s>",
end_mark="<e>", end_mark="<e>",
unk_mark="<unk>", unk_mark="<unk>",
only_src=False,
seed=0): seed=0):
# convert str to bytes, and use byte data
field_delimiter = field_delimiter.encode("utf8")
token_delimiter = token_delimiter.encode("utf8")
start_mark = start_mark.encode("utf8")
end_mark = end_mark.encode("utf8")
unk_mark = unk_mark.encode("utf8")
self._src_vocab = self.load_dict(src_vocab_fpath) self._src_vocab = self.load_dict(src_vocab_fpath)
self._only_src = True self._trg_vocab = self.load_dict(trg_vocab_fpath)
if trg_vocab_fpath is not None: self._bos_idx = self._src_vocab[start_mark]
self._trg_vocab = self.load_dict(trg_vocab_fpath) self._eos_idx = self._src_vocab[end_mark]
self._only_src = False self._unk_idx = self._src_vocab[unk_mark]
self._only_src = only_src
self._pool_size = pool_size self._pool_size = pool_size
self._batch_size = batch_size self._batch_size = batch_size
self._device_count = device_count
self._n_head = n_head
self._use_token_batch = use_token_batch self._use_token_batch = use_token_batch
self._sort_type = sort_type self._sort_type = sort_type
self._clip_last_batch = clip_last_batch self._clip_last_batch = clip_last_batch
self._shuffle = shuffle self._shuffle = shuffle
self._shuffle_seed = shuffle_seed
self._shuffle_batch = shuffle_batch self._shuffle_batch = shuffle_batch
self._min_length = min_length self._min_length = min_length
self._max_length = max_length self._max_length = max_length
self._field_delimiter = field_delimiter self._field_delimiter = field_delimiter
self._token_delimiter = token_delimiter self._token_delimiter = token_delimiter
self.load_src_trg_ids(end_mark, fpattern, start_mark, tar_fname, self.load_src_trg_ids(fpattern, tar_fname)
unk_mark)
self._random = np.random self._random = np.random
self._random.seed(seed) self._random.seed(seed)
def load_src_trg_ids(self, end_mark, fpattern, start_mark, tar_fname, def load_src_trg_ids(self, fpattern, tar_fname):
unk_mark):
converters = [ converters = [
Converter( Converter(
vocab=self._src_vocab, vocab=self._src_vocab,
beg=self._src_vocab[start_mark], beg=self._bos_idx,
end=self._src_vocab[end_mark], end=self._eos_idx,
unk=self._src_vocab[unk_mark], unk=self._unk_idx,
delimiter=self._token_delimiter, delimiter=self._token_delimiter,
add_beg=False) add_beg=False)
] ]
...@@ -226,9 +373,9 @@ class DataReader(object): ...@@ -226,9 +373,9 @@ class DataReader(object):
converters.append( converters.append(
Converter( Converter(
vocab=self._trg_vocab, vocab=self._trg_vocab,
beg=self._trg_vocab[start_mark], beg=self._bos_idx,
end=self._trg_vocab[end_mark], end=self._eos_idx,
unk=self._trg_vocab[unk_mark], unk=self._unk_idx,
delimiter=self._token_delimiter, delimiter=self._token_delimiter,
add_beg=True)) add_beg=True))
...@@ -254,9 +401,9 @@ class DataReader(object): ...@@ -254,9 +401,9 @@ class DataReader(object):
if tar_fname is None: if tar_fname is None:
raise Exception("If tar file provided, please set tar_fname.") raise Exception("If tar file provided, please set tar_fname.")
f = tarfile.open(fpaths[0], "r") f = tarfile.open(fpaths[0], "rb")
for line in f.extractfile(tar_fname): for line in f.extractfile(tar_fname):
fields = line.strip("\n").split(self._field_delimiter) fields = line.strip(b"\n").split(self._field_delimiter)
if (not self._only_src and len(fields) == 2) or ( if (not self._only_src and len(fields) == 2) or (
self._only_src and len(fields) == 1): self._only_src and len(fields) == 1):
yield fields yield fields
...@@ -267,9 +414,7 @@ class DataReader(object): ...@@ -267,9 +414,7 @@ class DataReader(object):
with open(fpath, "rb") as f: with open(fpath, "rb") as f:
for line in f: for line in f:
if six.PY3: fields = line.strip(b"\n").split(self._field_delimiter)
line = line.decode("utf8", errors="ignore")
fields = line.strip("\n").split(self._field_delimiter)
if (not self._only_src and len(fields) == 2) or ( if (not self._only_src and len(fields) == 2) or (
self._only_src and len(fields) == 1): self._only_src and len(fields) == 1):
yield fields yield fields
...@@ -279,61 +424,127 @@ class DataReader(object): ...@@ -279,61 +424,127 @@ class DataReader(object):
word_dict = {} word_dict = {}
with open(dict_path, "rb") as fdict: with open(dict_path, "rb") as fdict:
for idx, line in enumerate(fdict): for idx, line in enumerate(fdict):
if six.PY3:
line = line.decode("utf8", errors="ignore")
if reverse: if reverse:
word_dict[idx] = line.strip("\n") word_dict[idx] = line.strip(b"\n")
else: else:
word_dict[line.strip("\n")] = idx word_dict[line.strip(b"\n")] = idx
return word_dict return word_dict
def batch_generator(self): def batch_generator(self, batch_size, use_token_batch):
# global sort or global shuffle def __impl__():
if self._sort_type == SortType.GLOBAL: # global sort or global shuffle
infos = sorted(self._sample_infos, key=lambda x: x.max_len) if self._sort_type == SortType.GLOBAL:
else: infos = sorted(self._sample_infos, key=lambda x: x.max_len)
if self._shuffle:
infos = self._sample_infos
if self._shuffle_seed is not None:
self._random.seed(self._shuffle_seed)
self._random.shuffle(infos)
else:
infos = self._sample_infos
if self._sort_type == SortType.POOL:
reverse = True
for i in range(0, len(infos), self._pool_size):
# to avoid placing short next to long sentences
reverse = not reverse
infos[i:i + self._pool_size] = sorted(
infos[i:i + self._pool_size],
key=lambda x: x.max_len,
reverse=reverse)
# concat batch
batches = []
batch_creator = TokenBatchCreator(
self._batch_size
) if self._use_token_batch else SentenceBatchCreator(self._batch_size)
batch_creator = MinMaxFilter(self._max_length, self._min_length,
batch_creator)
for info in infos:
batch = batch_creator.append(info)
if batch is not None:
batches.append(batch)
if not self._clip_last_batch and len(batch_creator.batch) != 0:
batches.append(batch_creator.batch)
if self._shuffle_batch:
self._random.shuffle(batches)
for batch in batches:
batch_ids = [info.i for info in batch]
if self._only_src:
yield [[self._src_seq_ids[idx]] for idx in batch_ids]
else: else:
yield [(self._src_seq_ids[idx], self._trg_seq_ids[idx][:-1], if self._shuffle:
self._trg_seq_ids[idx][1:]) for idx in batch_ids] infos = self._sample_infos
self._random.shuffle(infos)
else:
infos = self._sample_infos
if self._sort_type == SortType.POOL:
reverse = True
for i in range(0, len(infos), self._pool_size):
# to avoid placing short next to long sentences
reverse = not reverse
infos[i:i + self._pool_size] = sorted(
infos[i:i + self._pool_size],
key=lambda x: x.max_len,
reverse=reverse)
# concat batch
batches = []
batch_creator = TokenBatchCreator(
batch_size) if use_token_batch else SentenceBatchCreator(
batch_size)
batch_creator = MinMaxFilter(self._max_length, self._min_length,
batch_creator)
for info in infos:
batch = batch_creator.append(info)
if batch is not None:
batches.append(batch)
if not self._clip_last_batch and len(batch_creator.batch) != 0:
batches.append(batch_creator.batch)
if self._shuffle_batch:
self._random.shuffle(batches)
for batch in batches:
batch_ids = [info.i for info in batch]
if self._only_src:
yield [[self._src_seq_ids[idx]] for idx in batch_ids]
else:
yield [(self._src_seq_ids[idx], self._trg_seq_ids[idx][:-1],
self._trg_seq_ids[idx][1:]) for idx in batch_ids]
return __impl__
@staticmethod
def stack(data_reader, count, clip_last=True):
def __impl__():
res = []
for item in data_reader():
res.append(item)
if len(res) == count:
yield res
res = []
if len(res) == count:
yield res
elif not clip_last:
data = []
for item in res:
data += item
if len(data) > count:
inst_num_per_part = len(data) // count
yield [
data[inst_num_per_part * i:inst_num_per_part * (i + 1)]
for i in range(count)
]
return __impl__
@staticmethod
def split(data_reader, count):
def __impl__():
for item in data_reader():
inst_num_per_part = len(item) // count
for i in range(count):
yield item[inst_num_per_part * i:inst_num_per_part * (i + 1
)]
return __impl__
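# A hypothetical illustration: with count=2 (two devices), split turns a
# batch of 6 instances into two sub-batches of 3, while stack groups `count`
# consecutive batches into one list so that each device gets its own batch.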
def data_generator(self, phase, place=None):
# Any token included in dict can be used to pad, since the paddings' loss
# will be masked out by weights and have no effect on parameter gradients.
src_pad_idx = trg_pad_idx = self._eos_idx
bos_idx = self._bos_idx
n_head = self._n_head
data_reader = self.batch_generator(
self._batch_size *
(1 if self._use_token_batch else self._device_count),
self._use_token_batch)
if not self._use_token_batch:
# to make data on each device have similar token number
data_reader = self.split(data_reader, self._device_count)
def __for_train__():
for data in data_reader():
data_inputs = prepare_train_input(data, src_pad_idx,
trg_pad_idx, n_head)
yield data_inputs
def __for_predict__():
for data in data_reader():
data_inputs = prepare_infer_input(data, src_pad_idx, bos_idx,
n_head, place)
yield data_inputs
return __for_train__ if phase == "train" else __for_predict__
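# Usage sketch: data_generator(phase="train") returns a generator factory;
# calling it yields the lists built by prepare_train_input, while
# phase="predict" (with a place) yields prepare_infer_input lists instead.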
def get_vocab_summary(self):
return len(self._src_vocab), len(
self._trg_vocab), self._bos_idx, self._eos_idx, self._unk_idx
import argparse #encoding=utf8
import ast
import copy
import logging import logging
import multiprocessing
import os import os
import subprocess
if os.environ.get('FLAGS_eager_delete_tensor_gb', None) is None:
os.environ['FLAGS_eager_delete_tensor_gb'] = '0'
import six import six
import sys import sys
sys.path.append("../../")
sys.path.append("../../models/neural_machine_translation/transformer/")
import time import time
import numpy as np import numpy as np
import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
from models.model_check import check_cuda #include palm for easier nlp coding
import reader from palm.toolkit.input_field import InputField
from config import * from palm.toolkit.configure import PDConfig
from desc import *
from model import transformer, position_encoding_init # include task-specific libs
import desc
import dist_utils import dist_utils
import reader
from transformer import create_net, position_encoding_init
if os.environ.get('FLAGS_eager_delete_tensor_gb', None) is None:
os.environ['FLAGS_eager_delete_tensor_gb'] = '0'
# num_trainers is used for multi-process gpu training
num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
def parse_args(): def init_from_pretrain_model(args, exe, program):
parser = argparse.ArgumentParser("Training for Transformer.")
parser.add_argument( assert isinstance(args.init_from_pretrain_model, str)
"--src_vocab_fpath",
type=str, if not os.path.exists(args.init_from_pretrain_model):
required=True, raise Warning("The pretrained params do not exist.")
help="The path of vocabulary file of source language.") return False
parser.add_argument(
"--trg_vocab_fpath", def existed_params(var):
type=str, if not isinstance(var, fluid.framework.Parameter):
required=True, return False
help="The path of vocabulary file of target language.") return os.path.exists(
parser.add_argument( os.path.join(args.init_from_pretrain_model, var.name))
"--train_file_pattern",
type=str, fluid.io.load_vars(
required=True, exe,
help="The pattern to match training data files.") args.init_from_pretrain_model,
parser.add_argument( main_program=program,
"--val_file_pattern", predicate=existed_params)
type=str,
help="The pattern to match validation data files.") print("finish initing model from pretrained params from %s" %
parser.add_argument( (args.init_from_pretrain_model))
"--use_token_batch",
type=ast.literal_eval, return True
default=True,
help="The flag indicating whether to "
"produce batch data according to token number.") def init_from_checkpoint(args, exe, program):
parser.add_argument(
"--batch_size", assert isinstance(args.init_from_checkpoint, str)
type=int,
default=4096, if not os.path.exists(args.init_from_checkpoint):
help="The number of sequences contained in a mini-batch, or the maximum " raise Warning("the checkpoint path does not exist.")
"number of tokens (include paddings) contained in a mini-batch. Note " return False
"that this represents the number on single device and the actual batch "
"size for multi-devices will multiply the device number.") fluid.io.load_persistables(
parser.add_argument( executor=exe,
"--pool_size", dirname=args.init_from_checkpoint,
type=int, main_program=program,
default=200000, filename="checkpoint.pdckpt")
help="The buffer size to pool data.")
parser.add_argument( print("finish initing model from checkpoint from %s" %
"--sort_type", (args.init_from_checkpoint))
default="pool",
choices=("global", "pool", "none"), return True
help="The grain to sort by length: global for all instances; pool for "
"instances in pool; none for no sort.")
parser.add_argument( def save_checkpoint(args, exe, program, dirname):
"--shuffle",
type=ast.literal_eval, assert isinstance(args.save_model_path, str)
default=True,
help="The flag indicating whether to shuffle instances in each pass.") checkpoint_dir = os.path.join(args.save_model_path, args.save_checkpoint)
parser.add_argument(
"--shuffle_batch", if not os.path.exists(checkpoint_dir):
type=ast.literal_eval, os.mkdir(checkpoint_dir)
default=True,
help="The flag indicating whether to shuffle the data batches.") fluid.io.save_persistables(
parser.add_argument( exe,
"--special_token", os.path.join(checkpoint_dir, dirname),
type=str, main_program=program,
default=["<s>", "<e>", "<unk>"], filename="checkpoint.pdparams")
nargs=3,
help="The <bos>, <eos> and <unk> tokens in the dictionary.") print("save checkpoint at %s" % (os.path.join(checkpoint_dir, dirname)))
parser.add_argument(
"--token_delimiter", return True
type=lambda x: str(x.encode().decode("unicode-escape")),
default=" ",
help="The delimiter used to split tokens in source or target sentences. " def save_param(args, exe, program, dirname):
"For EN-DE BPE data we provided, use spaces as token delimiter. ")
parser.add_argument( assert isinstance(args.save_model_path, str)
'opts',
help='See config.py for all options', param_dir = os.path.join(args.save_model_path, args.save_param)
default=None,
nargs=argparse.REMAINDER) if not os.path.exists(param_dir):
parser.add_argument( os.mkdir(param_dir)
'--local',
type=ast.literal_eval, fluid.io.save_params(
default=True, exe,
help='Whether to run as local mode.') os.path.join(param_dir, dirname),
parser.add_argument( main_program=program,
'--device', filename="params.pdparams")
type=str, print("save parameters at %s" % (os.path.join(param_dir, dirname)))
default='GPU',
choices=['CPU', 'GPU'], return True
help="The device type.")
parser.add_argument(
'--update_method', def do_train(args):
choices=("pserver", "nccl2"), if args.use_cuda:
default="pserver", if num_trainers > 1: # for multi-process gpu training
help='Update method.') dev_count = 1
parser.add_argument(
'--sync', type=ast.literal_eval, default=True, help="sync mode.")
parser.add_argument(
"--enable_ce",
type=ast.literal_eval,
default=False,
help="The flag indicating whether to run the task "
"for continuous evaluation.")
parser.add_argument(
"--use_mem_opt",
type=ast.literal_eval,
default=True,
help="The flag indicating whether to use memory optimization.")
parser.add_argument(
"--use_py_reader",
type=ast.literal_eval,
default=True,
help="The flag indicating whether to use py_reader.")
parser.add_argument(
"--fetch_steps",
type=int,
default=100,
help="The frequency to fetch and print output.")
args = parser.parse_args()
# Append args related to dict
src_dict = reader.DataReader.load_dict(args.src_vocab_fpath)
trg_dict = reader.DataReader.load_dict(args.trg_vocab_fpath)
dict_args = [
"src_vocab_size", str(len(src_dict)), "trg_vocab_size",
str(len(trg_dict)), "bos_idx", str(src_dict[args.special_token[0]]),
"eos_idx", str(src_dict[args.special_token[1]]), "unk_idx",
str(src_dict[args.special_token[2]])
]
merge_cfg_from_list(args.opts + dict_args,
[TrainTaskConfig, ModelHyperParams])
return args
def get_device_num():
# NOTE(zcd): for multi-processe training, each process use one GPU card.
if num_trainers > 1:
return 1
return fluid.core.get_cuda_device_count()
def append_nccl2_prepare(startup_prog, trainer_id, worker_endpoints,
current_endpoint):
assert (trainer_id >= 0 and len(worker_endpoints) > 1 and
current_endpoint in worker_endpoints)
eps = copy.deepcopy(worker_endpoints)
eps.remove(current_endpoint)
nccl_id_var = startup_prog.global_block().create_var(
name="NCCLID", persistable=True, type=fluid.core.VarDesc.VarType.RAW)
startup_prog.global_block().append_op(
type="gen_nccl_id",
inputs={},
outputs={"NCCLID": nccl_id_var},
attrs={
"endpoint": current_endpoint,
"endpoint_list": eps,
"trainer_id": trainer_id
})
return nccl_id_var
def pad_batch_data(insts,
pad_idx,
n_head,
is_target=False,
is_label=False,
return_attn_bias=True,
return_max_len=True,
return_num_token=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias.
"""
return_list = []
max_len = max(len(inst) for inst in insts)
# Any token included in dict can be used to pad, since the paddings' loss
# will be masked out by weights and make no effect on parameter gradients.
inst_data = np.array(
[inst + [pad_idx] * (max_len - len(inst)) for inst in insts])
return_list += [inst_data.astype("int64").reshape([-1, 1])]
if is_label: # label weight
inst_weight = np.array(
[[1.] * len(inst) + [0.] * (max_len - len(inst)) for inst in insts])
return_list += [inst_weight.astype("float32").reshape([-1, 1])]
else: # position data
inst_pos = np.array([
list(range(0, len(inst))) + [0] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, 1])]
if return_attn_bias:
if is_target:
# This is used to avoid attention on paddings and subsequent
# words.
slf_attn_bias_data = np.ones((inst_data.shape[0], max_len, max_len))
slf_attn_bias_data = np.triu(slf_attn_bias_data,
1).reshape([-1, 1, max_len, max_len])
slf_attn_bias_data = np.tile(slf_attn_bias_data,
[1, n_head, 1, 1]) * [-1e9]
else: else:
# This is used to avoid attention on paddings. dev_count = fluid.core.get_cuda_device_count()
slf_attn_bias_data = np.array([[0] * len(inst) + [-1e9] * gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
(max_len - len(inst)) place = fluid.CUDAPlace(gpu_id)
for inst in insts]) else:
slf_attn_bias_data = np.tile( dev_count = int(os.environ.get('CPU_NUM', 1))
slf_attn_bias_data.reshape([-1, 1, 1, max_len]), place = fluid.CPUPlace()
[1, n_head, max_len, 1])
return_list += [slf_attn_bias_data.astype("float32")] # define the data generator
if return_max_len: processor = reader.DataProcessor(
return_list += [max_len] fpattern=args.training_file,
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
return return_list if len(return_list) > 1 else return_list[0]
def prepare_batch_input(insts, data_input_names, src_pad_idx, trg_pad_idx,
n_head, d_model):
"""
Put all padded data needed by training into a dict.
"""
src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data(
[inst[0] for inst in insts], src_pad_idx, n_head, is_target=False)
src_word = src_word.reshape(-1, src_max_len, 1)
src_pos = src_pos.reshape(-1, src_max_len, 1)
trg_word, trg_pos, trg_slf_attn_bias, trg_max_len = pad_batch_data(
[inst[1] for inst in insts], trg_pad_idx, n_head, is_target=True)
trg_word = trg_word.reshape(-1, trg_max_len, 1)
trg_pos = trg_pos.reshape(-1, trg_max_len, 1)
trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :],
[1, 1, trg_max_len, 1]).astype("float32")
lbl_word, lbl_weight, num_token = pad_batch_data(
[inst[2] for inst in insts],
trg_pad_idx,
n_head,
is_target=False,
is_label=True,
return_attn_bias=False,
return_max_len=False,
return_num_token=True)
data_input_dict = dict(
zip(data_input_names, [
src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos,
trg_slf_attn_bias, trg_src_attn_bias, lbl_word, lbl_weight
]))
return data_input_dict, np.asarray([num_token], dtype="float32")
def prepare_data_generator(args,
is_test,
count,
pyreader,
py_reader_provider_wrapper,
place=None):
"""
Data generator wrapper for DataReader. If use py_reader, set the data
provider for py_reader
"""
# NOTE: If num_trainers > 1, the shuffle_seed must be set, because
# the order of batch data generated by reader
# must be the same in the respective processes.
shuffle_seed = 1 if num_trainers > 1 else None
data_reader = reader.DataReader(
fpattern=args.val_file_pattern if is_test else args.train_file_pattern,
src_vocab_fpath=args.src_vocab_fpath, src_vocab_fpath=args.src_vocab_fpath,
trg_vocab_fpath=args.trg_vocab_fpath, trg_vocab_fpath=args.trg_vocab_fpath,
token_delimiter=args.token_delimiter, token_delimiter=args.token_delimiter,
use_token_batch=args.use_token_batch, use_token_batch=args.use_token_batch,
batch_size=args.batch_size * (1 if args.use_token_batch else count), batch_size=args.batch_size,
device_count=dev_count,
pool_size=args.pool_size, pool_size=args.pool_size,
sort_type=args.sort_type, sort_type=args.sort_type,
shuffle=args.shuffle, shuffle=args.shuffle,
shuffle_seed=shuffle_seed,
shuffle_batch=args.shuffle_batch, shuffle_batch=args.shuffle_batch,
start_mark=args.special_token[0], start_mark=args.special_token[0],
end_mark=args.special_token[1], end_mark=args.special_token[1],
unk_mark=args.special_token[2], unk_mark=args.special_token[2],
# count start and end tokens out max_length=args.max_length,
max_length=ModelHyperParams.max_length - 2, n_head=args.n_head)
clip_last_batch=False).batch_generator batch_generator = processor.data_generator(phase="train")
if num_trainers > 1: # for multi-process gpu training
def stack(data_reader, count, clip_last=True): batch_generator = fluid.contrib.reader.distributed_batch_reader(
def __impl__(): batch_generator)
res = [] args.src_vocab_size, args.trg_vocab_size, args.bos_idx, args.eos_idx, \
for item in data_reader(): args.unk_idx = processor.get_vocab_summary()
res.append(item)
if len(res) == count: train_prog = fluid.default_main_program()
yield res startup_prog = fluid.default_startup_program()
res = [] random_seed = eval(str(args.random_seed))
if len(res) == count: if random_seed is not None:
yield res train_prog.random_seed = random_seed
elif not clip_last: startup_prog.random_seed = random_seed
data = []
for item in res: with fluid.program_guard(train_prog, startup_prog):
data += item
if len(data) > count:
inst_num_per_part = len(data) // count
yield [
data[inst_num_per_part * i:inst_num_per_part * (i + 1)]
for i in range(count)
]
return __impl__
def split(data_reader, count):
def __impl__():
for item in data_reader():
inst_num_per_part = len(item) // count
for i in range(count):
yield item[inst_num_per_part * i:inst_num_per_part * (i + 1
)]
return __impl__
if not args.use_token_batch:
# to make data on each device have similar token number
data_reader = split(data_reader, count)
if args.use_py_reader:
train_reader = py_reader_provider_wrapper(data_reader, place)
if num_trainers > 1:
assert shuffle_seed is not None
train_reader = fluid.contrib.reader.distributed_batch_reader(
train_reader)
pyreader.decorate_tensor_provider(train_reader)
data_reader = None
else: # Data generator for multi-devices
data_reader = stack(data_reader, count)
return data_reader
def prepare_feed_dict_list(data_generator, init_flag, count):
"""
Prepare the list of feed dict for multi-devices.
"""
feed_dict_list = []
if data_generator is not None: # use_py_reader == False
data_input_names = encoder_data_input_fields + \
decoder_data_input_fields[:-1] + label_data_input_fields
data = next(data_generator)
for idx, data_buffer in enumerate(data):
data_input_dict, num_token = prepare_batch_input(
data_buffer, data_input_names, ModelHyperParams.eos_idx,
ModelHyperParams.eos_idx, ModelHyperParams.n_head,
ModelHyperParams.d_model)
feed_dict_list.append(data_input_dict)
if init_flag:
for idx in range(count):
pos_enc_tables = dict()
for pos_enc_param_name in pos_enc_param_names:
pos_enc_tables[pos_enc_param_name] = position_encoding_init(
ModelHyperParams.max_length + 1, ModelHyperParams.d_model)
if len(feed_dict_list) <= idx:
feed_dict_list.append(pos_enc_tables)
else:
feed_dict_list[idx] = dict(
list(pos_enc_tables.items()) + list(feed_dict_list[idx]
.items()))
return feed_dict_list if len(feed_dict_list) == count else None
def py_reader_provider_wrapper(data_reader, place):
"""
Data provider needed by fluid.layers.py_reader.
"""
def py_reader_provider():
data_input_names = encoder_data_input_fields + \
decoder_data_input_fields[:-1] + label_data_input_fields
for batch_id, data in enumerate(data_reader()):
data_input_dict, num_token = prepare_batch_input(
data, data_input_names, ModelHyperParams.eos_idx,
ModelHyperParams.eos_idx, ModelHyperParams.n_head,
ModelHyperParams.d_model)
total_dict = dict(data_input_dict.items())
yield [total_dict[item] for item in data_input_names]
return py_reader_provider
def test_context(exe, train_exe, dev_count):
# Context to do validation.
test_prog = fluid.Program()
startup_prog = fluid.Program()
if args.enable_ce:
test_prog.random_seed = 1000
startup_prog.random_seed = 1000
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard(): with fluid.unique_name.guard():
sum_cost, avg_cost, predict, token_num, pyreader = transformer(
ModelHyperParams.src_vocab_size,
ModelHyperParams.trg_vocab_size,
ModelHyperParams.max_length + 1,
ModelHyperParams.n_layer,
ModelHyperParams.n_head,
ModelHyperParams.d_key,
ModelHyperParams.d_value,
ModelHyperParams.d_model,
ModelHyperParams.d_inner_hid,
ModelHyperParams.prepostprocess_dropout,
ModelHyperParams.attention_dropout,
ModelHyperParams.relu_dropout,
ModelHyperParams.preprocess_cmd,
ModelHyperParams.postprocess_cmd,
ModelHyperParams.weight_sharing,
TrainTaskConfig.label_smooth_eps,
use_py_reader=args.use_py_reader,
is_test=True)
test_prog = test_prog.clone(for_test=True)
test_data = prepare_data_generator(
args,
is_test=True,
count=dev_count,
pyreader=pyreader,
py_reader_provider_wrapper=py_reader_provider_wrapper)
exe.run(startup_prog) # to init pyreader for testing
if TrainTaskConfig.ckpt_path:
fluid.io.load_persistables(
exe, TrainTaskConfig.ckpt_path, main_program=test_prog)
build_strategy = fluid.BuildStrategy()
test_exe = fluid.ParallelExecutor(
use_cuda=TrainTaskConfig.use_gpu,
main_program=test_prog,
build_strategy=build_strategy,
share_vars_from=train_exe)
def test(exe=test_exe, pyreader=pyreader): # define input and reader
test_total_cost = 0
test_total_token = 0
if args.use_py_reader: input_field_names = desc.encoder_data_input_fields + \
pyreader.start() desc.decoder_data_input_fields[:-1] + desc.label_data_input_fields
data_generator = None input_slots = [{
else: "name": name,
data_generator = test_data() "shape": desc.input_descs[name][0],
while True: "dtype": desc.input_descs[name][1]
try: } for name in input_field_names]
feed_dict_list = prepare_feed_dict_list(data_generator, False,
dev_count) input_field = InputField(input_slots)
outs = test_exe.run(fetch_list=[sum_cost.name, token_num.name], input_field.build(build_pyreader=True)
feed=feed_dict_list)
except (StopIteration, fluid.core.EOFException): # define the network
# The current pass is over.
if args.use_py_reader: sum_cost, avg_cost, token_num = create_net(
pyreader.reset() is_training=True, model_input=input_field, args=args)
break
sum_cost_val, token_num_val = np.array(outs[0]), np.array(outs[1]) sum_cost.persistable = avg_cost.persistable = token_num.persistable = True
test_total_cost += sum_cost_val.sum()
test_total_token += token_num_val.sum() # define the optimizer
test_avg_cost = test_total_cost / test_total_token
test_ppl = np.exp([min(test_avg_cost, 100)])
return test_avg_cost, test_ppl
return test
def train_loop(exe,
train_prog,
startup_prog,
dev_count,
sum_cost,
avg_cost,
token_num,
predict,
pyreader,
nccl2_num_trainers=1,
nccl2_trainer_id=0):
# Initialize the parameters.
if TrainTaskConfig.ckpt_path:
exe.run(startup_prog) # to init pyreader for training
logging.info("load checkpoint from {}".format(
TrainTaskConfig.ckpt_path))
fluid.io.load_persistables(
exe, TrainTaskConfig.ckpt_path, main_program=train_prog)
else:
logging.info("init fluid.framework.default_startup_program")
exe.run(startup_prog)
logging.info("begin reader")
train_data = prepare_data_generator(
args,
is_test=False,
count=dev_count,
pyreader=pyreader,
py_reader_provider_wrapper=py_reader_provider_wrapper)
# For faster executor
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_iteration_per_drop_scope = int(args.fetch_steps)
build_strategy = fluid.BuildStrategy()
build_strategy.memory_optimize = False
build_strategy.enable_inplace = True
sum_cost.persistable = True with fluid.default_main_program()._lr_schedule_guard():
token_num.persistable = True learning_rate = fluid.layers.learning_rate_scheduler.noam_decay(
# Since the token number differs among devices, customize gradient scale to args.d_model, args.warmup_steps) * args.learning_rate
# use token average cost among multi-devices. and the gradient scale is
# `1 / token_number` for average cost. optimizer = fluid.optimizer.Adam(
# build_strategy.gradient_scale_strategy = fluid.BuildStrategy.GradientScaleStrategy.Customized learning_rate=learning_rate,
build_strategy.fuse_all_optimizer_ops = True beta1=args.beta1,
beta2=args.beta2,
epsilon=float(args.eps))
optimizer.minimize(avg_cost)
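# A sketch of the schedule used above (Noam decay): the effective learning
# rate is learning_rate * d_model**-0.5 * min(step**-0.5,
# step * warmup_steps**-1.5).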
# prepare training
## decorate the pyreader with batch_generator
input_field.reader.decorate_batch_generator(batch_generator)
## define the executor and program for training
exe = fluid.Executor(place)
if num_trainers > 1 and args.use_py_reader and TrainTaskConfig.use_gpu: exe.run(startup_prog)
# init position_encoding
for pos_enc_param_name in desc.pos_enc_param_names:
pos_enc_param = fluid.global_scope().find_var(
pos_enc_param_name).get_tensor()
pos_enc_param.set(
position_encoding_init(args.max_length + 1, args.d_model), place)
assert (args.init_from_checkpoint == "") or (
args.init_from_pretrain_model == "")
## init from some checkpoint, to resume the previous training
if args.init_from_checkpoint:
init_from_checkpoint(args, exe, train_prog)
## init from some pretrain models, to better solve the current task
if args.init_from_pretrain_model:
init_from_pretrain_model(args, exe, train_prog)
build_strategy = fluid.compiler.BuildStrategy()
build_strategy.enable_inplace = True
exec_strategy = fluid.ExecutionStrategy()
if num_trainers > 1:
dist_utils.prepare_for_multi_process(exe, build_strategy, train_prog) dist_utils.prepare_for_multi_process(exe, build_strategy, train_prog)
exec_strategy.num_threads = 1 exec_strategy.num_threads = 1
logging.info("begin executor") compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(
train_exe = fluid.ParallelExecutor(
use_cuda=TrainTaskConfig.use_gpu,
loss_name=avg_cost.name, loss_name=avg_cost.name,
main_program=train_prog,
build_strategy=build_strategy, build_strategy=build_strategy,
exec_strategy=exec_strategy, exec_strategy=exec_strategy)
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
if args.val_file_pattern is not None:
test = test_context(exe, train_exe, dev_count)
# the best cross-entropy value with label smoothing # the best cross-entropy value with label smoothing
loss_normalizer = -((1. - TrainTaskConfig.label_smooth_eps) * np.log( loss_normalizer = -(
(1. - TrainTaskConfig.label_smooth_eps (1. - args.label_smooth_eps) * np.log(
)) + TrainTaskConfig.label_smooth_eps * (1. - args.label_smooth_eps)) + args.label_smooth_eps *
np.log(TrainTaskConfig.label_smooth_eps / ( np.log(args.label_smooth_eps / (args.trg_vocab_size - 1) + 1e-20))
ModelHyperParams.trg_vocab_size - 1) + 1e-20)) # start training
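# loss_normalizer is the entropy of the smoothed label distribution, i.e. the
# minimum achievable cross-entropy with label smoothing:
# -((1 - eps) * log(1 - eps) + eps * log(eps / (vocab_size - 1))).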
step_idx = 0 step_idx = 0
init_flag = True for pass_id in range(args.epoch):
logging.info("begin train")
for pass_id in six.moves.xrange(TrainTaskConfig.pass_num):
pass_start_time = time.time() pass_start_time = time.time()
input_field.reader.start()
if args.use_py_reader:
pyreader.start()
data_generator = None
else:
data_generator = train_data()
batch_id = 0 batch_id = 0
while True: while True:
try: try:
feed_dict_list = prepare_feed_dict_list(data_generator, outs = exe.run(compiled_train_prog,
init_flag, dev_count) fetch_list=[sum_cost.name, token_num.name]
outs = train_exe.run( if step_idx % args.print_step == 0 else [])
fetch_list=[sum_cost.name, token_num.name]
if step_idx % args.fetch_steps == 0 else [], if step_idx % args.print_step == 0:
feed=feed_dict_list)
if step_idx % args.fetch_steps == 0:
sum_cost_val, token_num_val = np.array(outs[0]), np.array( sum_cost_val, token_num_val = np.array(outs[0]), np.array(
outs[1]) outs[1])
# sum the cost from multi-devices # sum the cost from multi-devices
...@@ -600,208 +273,42 @@ def train_loop(exe, ...@@ -600,208 +273,42 @@ def train_loop(exe,
(step_idx, pass_id, batch_id, total_avg_cost, (step_idx, pass_id, batch_id, total_avg_cost,
total_avg_cost - loss_normalizer, total_avg_cost - loss_normalizer,
np.exp([min(total_avg_cost, 100)]), np.exp([min(total_avg_cost, 100)]),
args.fetch_steps / (time.time() - avg_batch_time))) args.print_step / (time.time() - avg_batch_time)))
avg_batch_time = time.time() avg_batch_time = time.time()
if step_idx % TrainTaskConfig.save_freq == 0 and step_idx > 0: if step_idx % args.save_step == 0 and step_idx != 0:
fluid.io.save_persistables(
exe, if args.save_checkpoint:
os.path.join(TrainTaskConfig.ckpt_dir, save_checkpoint(args, exe, train_prog,
"latest.checkpoint"), train_prog) "step_" + str(step_idx))
fluid.io.save_params(
exe, if args.save_param:
os.path.join(TrainTaskConfig.model_dir, save_param(args, exe, train_prog,
"iter_" + str(step_idx) + ".infer.model"), "step_" + str(step_idx))
train_prog)
init_flag = False
batch_id += 1 batch_id += 1
step_idx += 1 step_idx += 1
except (StopIteration, fluid.core.EOFException):
# The current pass is over. except fluid.core.EOFException:
if args.use_py_reader: input_field.reader.reset()
pyreader.reset()
break break
time_consumed = time.time() - pass_start_time time_consumed = time.time() - pass_start_time
# Validate and save the persistable.
if args.val_file_pattern is not None:
val_avg_cost, val_ppl = test()
logging.info(
"epoch: %d, val avg loss: %f, val normalized loss: %f, val ppl: %f,"
" consumed %fs" % (pass_id, val_avg_cost,
val_avg_cost - loss_normalizer, val_ppl,
time_consumed))
else:
logging.info("epoch: %d, consumed %fs" % (pass_id, time_consumed))
if not args.enable_ce: if args.save_checkpoint:
fluid.io.save_persistables( save_checkpoint(args, exe, train_prog, "step_final")
exe,
os.path.join(TrainTaskConfig.ckpt_dir, if args.save_param:
"pass_" + str(pass_id) + ".checkpoint"), save_param(args, exe, train_prog, "step_final")
train_prog)
if args.enable_ce: # For CE if args.enable_ce: # For CE
print("kpis\ttrain_cost_card%d\t%f" % (dev_count, total_avg_cost)) print("kpis\ttrain_cost_card%d\t%f" % (dev_count, total_avg_cost))
if args.val_file_pattern is not None:
print("kpis\ttest_cost_card%d\t%f" % (dev_count, val_avg_cost))
print("kpis\ttrain_duration_card%d\t%f" % (dev_count, time_consumed)) print("kpis\ttrain_duration_card%d\t%f" % (dev_count, time_consumed))
def train(args):
# priority: ENV > args > config
is_local = os.getenv("PADDLE_IS_LOCAL", "1")
if is_local == '0':
args.local = False
logging.info(args)
if args.device == 'CPU':
TrainTaskConfig.use_gpu = False
training_role = os.getenv("TRAINING_ROLE", "TRAINER")
if training_role == "PSERVER" or (not TrainTaskConfig.use_gpu):
place = fluid.CPUPlace()
# the default setting of CPU_NUM in paddle framework is 1
dev_count = int(os.environ.get('CPU_NUM', 1))
else:
check_cuda(TrainTaskConfig.use_gpu)
gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
place = fluid.CUDAPlace(gpu_id)
dev_count = get_device_num()
exe = fluid.Executor(place)
train_prog = fluid.Program()
startup_prog = fluid.Program()
if args.enable_ce:
train_prog.random_seed = 1000
startup_prog.random_seed = 1000
with fluid.program_guard(train_prog, startup_prog):
with fluid.unique_name.guard():
sum_cost, avg_cost, predict, token_num, pyreader = transformer(
ModelHyperParams.src_vocab_size,
ModelHyperParams.trg_vocab_size,
ModelHyperParams.max_length + 1,
ModelHyperParams.n_layer,
ModelHyperParams.n_head,
ModelHyperParams.d_key,
ModelHyperParams.d_value,
ModelHyperParams.d_model,
ModelHyperParams.d_inner_hid,
ModelHyperParams.prepostprocess_dropout,
ModelHyperParams.attention_dropout,
ModelHyperParams.relu_dropout,
ModelHyperParams.preprocess_cmd,
ModelHyperParams.postprocess_cmd,
ModelHyperParams.weight_sharing,
TrainTaskConfig.label_smooth_eps,
ModelHyperParams.bos_idx,
use_py_reader=args.use_py_reader,
is_test=False)
optimizer = None
if args.sync:
lr_decay = fluid.layers.learning_rate_scheduler.noam_decay(
ModelHyperParams.d_model, TrainTaskConfig.warmup_steps)
logging.info("before adam")
with fluid.default_main_program()._lr_schedule_guard():
learning_rate = lr_decay * TrainTaskConfig.learning_rate
optimizer = fluid.optimizer.Adam(
learning_rate=learning_rate,
beta1=TrainTaskConfig.beta1,
beta2=TrainTaskConfig.beta2,
epsilon=TrainTaskConfig.eps)
else:
optimizer = fluid.optimizer.SGD(0.003)
optimizer.minimize(avg_cost)
if args.local:
logging.info("local start_up:")
train_loop(exe, train_prog, startup_prog, dev_count, sum_cost, avg_cost,
token_num, predict, pyreader)
else:
if args.update_method == "nccl2":
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
port = os.getenv("PADDLE_PORT")
worker_ips = os.getenv("PADDLE_TRAINERS")
worker_endpoints = []
for ip in worker_ips.split(","):
worker_endpoints.append(':'.join([ip, port]))
trainers_num = len(worker_endpoints)
current_endpoint = os.getenv("POD_IP") + ":" + port
if trainer_id == 0:
logging.info("train_id == 0, sleep 60s")
time.sleep(60)
logging.info("trainers_num:{}".format(trainers_num))
logging.info("worker_endpoints:{}".format(worker_endpoints))
logging.info("current_endpoint:{}".format(current_endpoint))
append_nccl2_prepare(startup_prog, trainer_id, worker_endpoints,
current_endpoint)
train_loop(exe, train_prog, startup_prog, dev_count, sum_cost,
avg_cost, token_num, predict, pyreader, trainers_num,
trainer_id)
return
port = os.getenv("PADDLE_PORT", "6174")
pserver_ips = os.getenv("PADDLE_PSERVERS") # ip,ip...
eplist = []
for ip in pserver_ips.split(","):
eplist.append(':'.join([ip, port]))
pserver_endpoints = ",".join(eplist) # ip:port,ip:port...
trainers = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
current_endpoint = os.getenv("POD_IP") + ":" + port
trainer_id = int(os.getenv("PADDLE_TRAINER_ID"))
logging.info("pserver_endpoints:{}".format(pserver_endpoints))
logging.info("current_endpoint:{}".format(current_endpoint))
logging.info("trainer_id:{}".format(trainer_id))
logging.info("pserver_ips:{}".format(pserver_ips))
logging.info("port:{}".format(port))
t = fluid.DistributeTranspiler()
t.transpile(
trainer_id,
pservers=pserver_endpoints,
trainers=trainers,
program=train_prog,
startup_program=startup_prog)
if training_role == "PSERVER":
logging.info("distributed: pserver started")
current_endpoint = os.getenv("POD_IP") + ":" + os.getenv(
"PADDLE_PORT")
if not current_endpoint:
logging.critical("need env SERVER_ENDPOINT")
exit(1)
pserver_prog = t.get_pserver_program(current_endpoint)
pserver_startup = t.get_startup_program(current_endpoint,
pserver_prog)
exe.run(pserver_startup)
exe.run(pserver_prog)
elif training_role == "TRAINER":
logging.info("distributed: trainer started")
trainer_prog = t.get_trainer_program()
train_loop(exe, train_prog, startup_prog, dev_count, sum_cost,
avg_cost, token_num, predict, pyreader)
else:
logging.critical(
"environment var TRAINER_ROLE should be TRAINER os PSERVER")
exit(1)
if __name__ == "__main__": if __name__ == "__main__":
LOG_FORMAT = "[%(asctime)s %(levelname)s %(filename)s:%(lineno)d] %(message)s" args = PDConfig(yaml_file="./transformer.yaml")
logging.basicConfig( args.build()
stream=sys.stdout, level=logging.DEBUG, format=LOG_FORMAT) args.Print()
logging.getLogger().setLevel(logging.INFO)
args = parse_args() do_train(args)
train(args)
from functools import partial
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from desc import *
# Set seed for CE or debug
dropout_seed = None
def wrap_layer_with_block(layer, block_idx):
"""
Make a layer definition support specifying a target block, so that layers
can be added to blocks other than the current one. This makes it easy to
maintain the cache across the beam-search while loop.
"""
class BlockGuard(object):
"""
BlockGuard class.
BlockGuard class is used to switch to the given block in a program by
using the Python `with` keyword.
"""
def __init__(self, block_idx=None, main_program=None):
self.main_program = fluid.default_main_program(
) if main_program is None else main_program
self.old_block_idx = self.main_program.current_block().idx
self.new_block_idx = block_idx
def __enter__(self):
self.main_program.current_block_idx = self.new_block_idx
def __exit__(self, exc_type, exc_val, exc_tb):
self.main_program.current_block_idx = self.old_block_idx
if exc_type is not None:
return False # re-raise exception
return True
def layer_wrapper(*args, **kwargs):
with BlockGuard(block_idx):
return layer(*args, **kwargs)
return layer_wrapper
def position_encoding_init(n_position, d_pos_vec):
"""
Generate the initial values for the sinusoid position encoding table.
"""
channels = d_pos_vec
position = np.arange(n_position)
num_timescales = channels // 2
log_timescale_increment = (np.log(float(1e4) / float(1)) /
(num_timescales - 1))
inv_timescales = np.exp(np.arange(
num_timescales)) * -log_timescale_increment
scaled_time = np.expand_dims(position, 1) * np.expand_dims(inv_timescales,
0)
signal = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)
signal = np.pad(signal, [[0, 0], [0, np.mod(channels, 2)]], 'constant')
position_enc = signal
return position_enc.astype("float32")
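# For example, position_encoding_init(args.max_length + 1, args.d_model) is
# used in training and prediction to (re)initialize the position-encoding
# tables; the returned array has shape [n_position, d_pos_vec], with sine
# values in the first half of the channels and cosine values in the second.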
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
gather_idx=None,
static_kv=False):
"""
Multi-Head Attention. Note that attn_bias is added to the logits before
computing the softmax activation, to mask out selected positions so that
they are not considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
bias_attr=False,
num_flatten_dims=2)
# For encoder-decoder attention in inference, insert the ops and vars
# into the global block so they can be used as a cache across beam-search steps.
fc_layer = wrap_layer_with_block(
layers.fc, fluid.default_main_program().current_block(
).parent_idx) if cache is not None and static_kv else layers.fc
k = fc_layer(
input=keys,
size=d_key * n_head,
bias_attr=False,
num_flatten_dims=2)
v = fc_layer(
input=values,
size=d_value * n_head,
bias_attr=False,
num_flatten_dims=2)
return q, k, v
def __split_heads_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Reshape input tensors at the last dimension to split multi-heads
and then transpose. Specifically, transform the input tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] to the output tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped_q = layers.reshape(
x=queries, shape=[0, 0, n_head, d_key], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
q = layers.transpose(x=reshaped_q, perm=[0, 2, 1, 3])
# For encoder-decoder attention in inference, insert the ops and vars
# into the global block so they can be used as a cache across beam-search steps.
reshape_layer = wrap_layer_with_block(
layers.reshape,
fluid.default_main_program().current_block(
).parent_idx) if cache is not None and static_kv else layers.reshape
transpose_layer = wrap_layer_with_block(
layers.transpose,
fluid.default_main_program().current_block().
parent_idx) if cache is not None and static_kv else layers.transpose
reshaped_k = reshape_layer(
x=keys, shape=[0, 0, n_head, d_key], inplace=True)
k = transpose_layer(x=reshaped_k, perm=[0, 2, 1, 3])
reshaped_v = reshape_layer(
x=values, shape=[0, 0, n_head, d_value], inplace=True)
v = transpose_layer(x=reshaped_v, perm=[0, 2, 1, 3])
if cache is not None: # only for faster inference
if static_kv: # For encoder-decoder attention in inference
cache_k, cache_v = cache["static_k"], cache["static_v"]
# To initialize the static_k and static_v in the cache.
# Alternatively, cond ops (if_else) could do this at the first step of
# the while loop, but that would likely be less efficient.
static_cache_init = wrap_layer_with_block(
layers.assign,
fluid.default_main_program().current_block().parent_idx)
static_cache_init(k, cache_k)
static_cache_init(v, cache_v)
else: # For decoder self-attention in inference
cache_k, cache_v = cache["k"], cache["v"]
# gather cell states corresponding to selected parent
select_k = layers.gather(cache_k, index=gather_idx)
select_v = layers.gather(cache_v, index=gather_idx)
if not static_kv:
# For self attention in inference, use cache and concat time steps.
select_k = layers.concat([select_k, k], axis=2)
select_v = layers.concat([select_v, v], axis=2)
# update cell states(caches) cached in global block
layers.assign(select_k, cache_k)
layers.assign(select_v, cache_v)
return q, select_k, select_v
return q, k, v
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that they become one dimension, which is the reverse of __split_heads_qkv.
"""
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x=trans_x,
shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace=True)
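# A shape sketch for __split_heads_qkv / __combine_heads with batch size B,
# sequence length L, n_head=8 and d_key=d_value=64: q goes from [B, L, 512]
# to [B, 8, L, 64] after splitting, and __combine_heads maps it back to
# [B, L, 512].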
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
product = layers.matmul(x=q, y=k, transpose_y=True, alpha=d_key**-0.5)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
seed=dropout_seed,
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
q, k, v = __split_heads_qkv(q, k, v, n_head, d_key, d_value)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_model,
dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
bias_attr=False,
num_flatten_dims=2)
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act="relu")
if dropout_rate:
hidden = layers.dropout(
hidden, dropout_prob=dropout_rate, seed=dropout_seed, is_test=False)
out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2)
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.initializer.Constant(1.),
bias_attr=fluid.initializer.Constant(0.))
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out,
dropout_prob=dropout_rate,
seed=dropout_seed,
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
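# Note on the two aliases above: pre_process_layer binds prev_out=None, so its
# "a" (residual) command is a no-op and it only applies layer normalization /
# dropout to its input; post_process_layer keeps prev_out and thus adds the
# residual connection.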
def prepare_encoder_decoder(src_word,
src_pos,
src_vocab_size,
src_emb_dim,
src_max_len,
dropout_rate=0.,
bos_idx=0,
word_emb_param_name=None,
pos_enc_param_name=None):
"""Add word embeddings and position encodings.
The output tensor has a shape of:
[batch_size, max_src_length_in_batch, d_model].
This module is used at the bottom of the encoder and decoder stacks.
"""
src_word_emb = layers.embedding(
src_word,
size=[src_vocab_size, src_emb_dim],
padding_idx=bos_idx, # set embedding of bos to 0
param_attr=fluid.ParamAttr(
name=word_emb_param_name,
initializer=fluid.initializer.Normal(0., src_emb_dim**-0.5)))
src_word_emb = layers.scale(x=src_word_emb, scale=src_emb_dim**0.5)
src_pos_enc = layers.embedding(
src_pos,
size=[src_max_len, src_emb_dim],
param_attr=fluid.ParamAttr(
name=pos_enc_param_name, trainable=False))
src_pos_enc.stop_gradient = True
enc_input = src_word_emb + src_pos_enc
return layers.dropout(
enc_input, dropout_prob=dropout_rate, seed=dropout_seed,
is_test=False) if dropout_rate else enc_input
prepare_encoder = partial(
prepare_encoder_decoder, pos_enc_param_name=pos_enc_param_names[0])
prepare_decoder = partial(
prepare_encoder_decoder, pos_enc_param_name=pos_enc_param_names[1])
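# The two partials above differ only in which (non-trainable) position-encoding
# table they look up by name, so the encoder and decoder use separate position
# encodings while sharing the embedding logic of prepare_encoder_decoder.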
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd="n",
postprocess_cmd="da"):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention followed by
position-wise feed-forward networks; both components are accompanied by
the post_process_layer to add residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(
pre_process_layer(enc_input, preprocess_cmd,
prepostprocess_dropout), None, None, attn_bias, d_key,
d_value, d_model, n_head, attention_dropout)
attn_output = post_process_layer(enc_input, attn_output, postprocess_cmd,
prepostprocess_dropout)
ffd_output = positionwise_feed_forward(
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout),
d_inner_hid, d_model, relu_dropout)
return post_process_layer(attn_output, ffd_output, postprocess_cmd,
prepostprocess_dropout)
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd="n",
postprocess_cmd="da"):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd, )
enc_input = enc_output
enc_output = pre_process_layer(enc_output, preprocess_cmd,
prepostprocess_dropout)
return enc_output
def decoder_layer(dec_input,
enc_output,
slf_attn_bias,
dec_enc_attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd,
cache=None,
gather_idx=None):
""" The layer to be stacked in decoder part.
The structure of this module is similar to that in the encoder part except
that an additional multi-head attention is added to implement encoder-decoder
attention.
"""
slf_attn_output = multi_head_attention(
pre_process_layer(dec_input, preprocess_cmd, prepostprocess_dropout),
None,
None,
slf_attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
cache=cache,
gather_idx=gather_idx)
slf_attn_output = post_process_layer(
dec_input,
slf_attn_output,
postprocess_cmd,
prepostprocess_dropout, )
enc_attn_output = multi_head_attention(
pre_process_layer(slf_attn_output, preprocess_cmd,
prepostprocess_dropout),
enc_output,
enc_output,
dec_enc_attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
cache=cache,
gather_idx=gather_idx,
static_kv=True)
enc_attn_output = post_process_layer(
slf_attn_output,
enc_attn_output,
postprocess_cmd,
prepostprocess_dropout, )
ffd_output = positionwise_feed_forward(
pre_process_layer(enc_attn_output, preprocess_cmd,
prepostprocess_dropout),
d_inner_hid,
d_model,
relu_dropout, )
dec_output = post_process_layer(
enc_attn_output,
ffd_output,
postprocess_cmd,
prepostprocess_dropout, )
return dec_output
def decoder(dec_input,
enc_output,
dec_slf_attn_bias,
dec_enc_attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd,
caches=None,
gather_idx=None):
"""
The decoder is composed of a stack of identical decoder_layer layers.
"""
for i in range(n_layer):
dec_output = decoder_layer(
dec_input,
enc_output,
dec_slf_attn_bias,
dec_enc_attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd,
cache=None if caches is None else caches[i],
gather_idx=gather_idx)
dec_input = dec_output
dec_output = pre_process_layer(dec_output, preprocess_cmd,
prepostprocess_dropout)
return dec_output
def make_all_inputs(input_fields):
"""
Define the input data layers for the transformer model.
"""
inputs = []
for input_field in input_fields:
input_var = layers.data(
name=input_field,
shape=input_descs[input_field][0],
dtype=input_descs[input_field][1],
lod_level=input_descs[input_field][2]
if len(input_descs[input_field]) == 3 else 0,
append_batch_size=False)
inputs.append(input_var)
return inputs
def make_all_py_reader_inputs(input_fields, is_test=False):
reader = layers.py_reader(
capacity=20,
name="test_reader" if is_test else "train_reader",
shapes=[input_descs[input_field][0] for input_field in input_fields],
dtypes=[input_descs[input_field][1] for input_field in input_fields],
lod_levels=[
input_descs[input_field][2]
if len(input_descs[input_field]) == 3 else 0
for input_field in input_fields
])
return layers.read_file(reader), reader
def transformer(src_vocab_size,
trg_vocab_size,
max_length,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd,
weight_sharing,
label_smooth_eps,
bos_idx=0,
use_py_reader=False,
is_test=False,
model_input=None):
if weight_sharing:
assert src_vocab_size == trg_vocab_size, (
"Vocabularies in source and target should be same for weight sharing."
)
if model_input:
enc_inputs = (model_input.src_word, model_input.src_pos,
model_input.src_slf_attn_bias)
dec_inputs = (model_input.trg_word, model_input.trg_pos,
model_input.trg_slf_attn_bias,
model_input.trg_src_attn_bias)
label = model_input.lbl_word
weights = model_input.lbl_weight
else:
data_input_names = encoder_data_input_fields + \
decoder_data_input_fields[:-1] + label_data_input_fields
if use_py_reader:
all_inputs, reader = make_all_py_reader_inputs(data_input_names,
is_test)
else:
all_inputs = make_all_inputs(data_input_names)
enc_inputs_len = len(encoder_data_input_fields)
dec_inputs_len = len(decoder_data_input_fields[:-1])
enc_inputs = all_inputs[0:enc_inputs_len]
dec_inputs = all_inputs[enc_inputs_len:enc_inputs_len + dec_inputs_len]
label = all_inputs[-2]
weights = all_inputs[-1]
enc_output = wrap_encoder(
src_vocab_size,
max_length,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd,
weight_sharing,
enc_inputs,
bos_idx=bos_idx)
predict = wrap_decoder(
trg_vocab_size,
max_length,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd,
weight_sharing,
dec_inputs,
enc_output, )
# Padding indices do not contribute to the total loss. The weights are used
# to mask out padding positions when calculating the loss.
if label_smooth_eps:
label = layers.label_smooth(
label=layers.one_hot(
input=label, depth=trg_vocab_size),
epsilon=label_smooth_eps)
cost = layers.softmax_with_cross_entropy(
logits=predict,
label=label,
soft_label=True if label_smooth_eps else False)
weighted_cost = cost * weights
sum_cost = layers.reduce_sum(weighted_cost)
token_num = layers.reduce_sum(weights)
token_num.stop_gradient = True
avg_cost = sum_cost / token_num
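# avg_cost is the weighted token loss summed over the batch and divided by the
# number of non-padding target tokens, i.e. a per-token cross entropy.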
return sum_cost, avg_cost, predict, token_num, reader if use_py_reader else None
def wrap_encoder(src_vocab_size,
max_length,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd,
weight_sharing,
enc_inputs=None,
bos_idx=0):
"""
The wrapper assembles together all needed layers for the encoder.
"""
if enc_inputs is None:
# This is used to implement independent encoder program in inference.
src_word, src_pos, src_slf_attn_bias = make_all_inputs(
encoder_data_input_fields)
else:
src_word, src_pos, src_slf_attn_bias = enc_inputs
enc_input = prepare_encoder(
src_word,
src_pos,
src_vocab_size,
d_model,
max_length,
prepostprocess_dropout,
bos_idx=bos_idx,
word_emb_param_name=word_emb_param_names[0])
enc_output = encoder(
enc_input,
src_slf_attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd, )
return enc_output
def wrap_decoder(trg_vocab_size,
max_length,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd,
weight_sharing,
dec_inputs=None,
enc_output=None,
caches=None,
gather_idx=None,
bos_idx=0):
"""
The wrapper assembles together all needed layers for the decoder.
"""
if dec_inputs is None:
# This is used to implement independent decoder program in inference.
trg_word, trg_pos, trg_slf_attn_bias, trg_src_attn_bias, enc_output = \
make_all_inputs(decoder_data_input_fields)
else:
trg_word, trg_pos, trg_slf_attn_bias, trg_src_attn_bias = dec_inputs
dec_input = prepare_decoder(
trg_word,
trg_pos,
trg_vocab_size,
d_model,
max_length,
prepostprocess_dropout,
bos_idx=bos_idx,
word_emb_param_name=word_emb_param_names[0]
if weight_sharing else word_emb_param_names[1])
dec_output = decoder(
dec_input,
enc_output,
trg_slf_attn_bias,
trg_src_attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd,
caches=caches,
gather_idx=gather_idx)
# Reshape to 2D tensor to use GEMM instead of BatchedGEMM
dec_output = layers.reshape(
dec_output, shape=[-1, dec_output.shape[-1]], inplace=True)
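# Descriptive note: with weight_sharing enabled, the output projection reuses
# the shared word embedding matrix (transposed) instead of a separate fc layer,
# as in the original Transformer paper.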
if weight_sharing:
predict = layers.matmul(
x=dec_output,
y=fluid.default_main_program().global_block().var(
word_emb_param_names[0]),
transpose_y=True)
else:
predict = layers.fc(input=dec_output,
size=trg_vocab_size,
bias_attr=False)
if dec_inputs is None:
# Return probs for independent decoder program.
predict = layers.softmax(predict)
return predict
def fast_decode(src_vocab_size,
trg_vocab_size,
max_in_len,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd,
weight_sharing,
beam_size,
max_out_len,
bos_idx,
eos_idx,
use_py_reader=False,
model_input=None):
"""
Use beam search to decode. Caches will be used to store states of history
steps, which makes decoding faster.
"""
if model_input:
enc_inputs = (model_input.src_word, model_input.src_pos,
model_input.src_slf_attn_bias)
dec_inputs = (model_input.trg_word, model_input.init_score,
model_input.init_idx, model_input.trg_src_attn_bias)
else:
data_input_names = encoder_data_input_fields + fast_decoder_data_input_fields
if use_py_reader:
all_inputs, reader = make_all_py_reader_inputs(data_input_names)
else:
all_inputs = make_all_inputs(data_input_names)
enc_inputs_len = len(encoder_data_input_fields)
dec_inputs_len = len(fast_decoder_data_input_fields)
enc_inputs = all_inputs[0:enc_inputs_len]
dec_inputs = all_inputs[enc_inputs_len:enc_inputs_len + dec_inputs_len]
enc_output = wrap_encoder(
src_vocab_size,
max_in_len,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd,
weight_sharing,
enc_inputs,
bos_idx=bos_idx)
start_tokens, init_scores, parent_idx, trg_src_attn_bias = dec_inputs
def beam_search():
max_len = layers.fill_constant(
shape=[1],
dtype=start_tokens.dtype,
value=max_out_len,
force_cpu=True)
step_idx = layers.fill_constant(
shape=[1], dtype=start_tokens.dtype, value=0, force_cpu=True)
cond = layers.less_than(x=step_idx, y=max_len) # default force_cpu=True
while_op = layers.While(cond)
# array states will be stored for each step.
ids = layers.array_write(
layers.reshape(start_tokens, (-1, 1)), step_idx)
scores = layers.array_write(init_scores, step_idx)
# cell states will be overwritten at each step.
# caches contains states of history steps in decoder self-attention
# and static encoder output projections in encoder-decoder attention
# to reduce redundant computation.
caches = [
{
"k": # for self attention
layers.fill_constant_batch_size_like(
input=start_tokens,
shape=[-1, n_head, 0, d_key],
dtype=enc_output.dtype,
value=0),
"v": # for self attention
layers.fill_constant_batch_size_like(
input=start_tokens,
shape=[-1, n_head, 0, d_value],
dtype=enc_output.dtype,
value=0),
"static_k": # for encoder-decoder attention
layers.create_tensor(dtype=enc_output.dtype),
"static_v": # for encoder-decoder attention
layers.create_tensor(dtype=enc_output.dtype)
} for i in range(n_layer)
]
with while_op.block():
pre_ids = layers.array_read(array=ids, i=step_idx)
# Since beam_search_op doesn't enforce pre_ids' shape, we can do an
# inplace reshape here, which actually changes the shape of pre_ids.
pre_ids = layers.reshape(pre_ids, (-1, 1, 1), inplace=True)
pre_scores = layers.array_read(array=scores, i=step_idx)
# gather cell states corresponding to selected parent
pre_src_attn_bias = layers.gather(
trg_src_attn_bias, index=parent_idx)
pre_pos = layers.elementwise_mul(
x=layers.fill_constant_batch_size_like(
input=pre_src_attn_bias,  # can't use a lod tensor here
value=1,
shape=[-1, 1, 1],
dtype=pre_ids.dtype),
y=step_idx,
axis=0)
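# pre_pos broadcasts the current step index to every beam entry; it is the
# position fed to the decoder's position encoding at this step.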
logits = wrap_decoder(
trg_vocab_size,
max_in_len,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
preprocess_cmd,
postprocess_cmd,
weight_sharing,
dec_inputs=(pre_ids, pre_pos, None, pre_src_attn_bias),
enc_output=enc_output,
caches=caches,
gather_idx=parent_idx,
bos_idx=bos_idx)
# intra-beam topK
topk_scores, topk_indices = layers.topk(
input=layers.softmax(logits), k=beam_size)
accu_scores = layers.elementwise_add(
x=layers.log(topk_scores), y=pre_scores, axis=0)
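# accu_scores accumulates log-probabilities: the log of this step's top-k
# probabilities is added to each candidate's previous beam score.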
# beam_search op uses lod to differentiate branches.
accu_scores = layers.lod_reset(accu_scores, pre_ids)
# topK reduction across beams, which also contains special handling of
# finished beams and finished sentences (batch reduction)
selected_ids, selected_scores, gather_idx = layers.beam_search(
pre_ids=pre_ids,
pre_scores=pre_scores,
ids=topk_indices,
scores=accu_scores,
beam_size=beam_size,
end_id=eos_idx,
return_parent_idx=True)
layers.increment(x=step_idx, value=1.0, in_place=True)
# cell states (caches) have been updated in wrap_decoder,
# so only the beam search states need to be updated here.
layers.array_write(selected_ids, i=step_idx, array=ids)
layers.array_write(selected_scores, i=step_idx, array=scores)
layers.assign(gather_idx, parent_idx)
layers.assign(pre_src_attn_bias, trg_src_attn_bias)
length_cond = layers.less_than(x=step_idx, y=max_len)
finish_cond = layers.logical_not(layers.is_empty(x=selected_ids))
layers.logical_and(x=length_cond, y=finish_cond, out=cond)
finished_ids, finished_scores = layers.beam_search_decode(
ids, scores, beam_size=beam_size, end_id=eos_idx)
return finished_ids, finished_scores
finished_ids, finished_scores = beam_search()
return finished_ids, finished_scores, reader if use_py_reader else None
def create_net(is_training, model_input, args):
if is_training:
sum_cost, avg_cost, _, token_num, _ = transformer(
args.src_vocab_size,
args.trg_vocab_size,
args.max_length + 1,
args.n_layer,
args.n_head,
args.d_key,
args.d_value,
args.d_model,
args.d_inner_hid,
args.prepostprocess_dropout,
args.attention_dropout,
args.relu_dropout,
args.preprocess_cmd,
args.postprocess_cmd,
args.weight_sharing,
args.label_smooth_eps,
args.bos_idx,
model_input=model_input)
return sum_cost, avg_cost, token_num
else:
out_ids, out_scores, _ = fast_decode(
args.src_vocab_size,
args.trg_vocab_size,
args.max_length + 1,
args.n_layer,
args.n_head,
args.d_key,
args.d_value,
args.d_model,
args.d_inner_hid,
args.prepostprocess_dropout,
args.attention_dropout,
args.relu_dropout,
args.preprocess_cmd,
args.postprocess_cmd,
args.weight_sharing,
args.beam_size,
args.max_out_len,
args.bos_idx,
args.eos_idx,
model_input=model_input)
return out_ids, out_scores
# used for continuous evaluation
enable_ce: False
# The frequency to save trained models when training.
save_step: 10000
# The frequency to fetch and print output when training.
print_step: 100
# path of the checkpoint, to resume the previous training
init_from_checkpoint: ""
# path of the pretrained model, to better solve the current task
init_from_pretrain_model: ""
# path of trained parameters, to make predictions
init_from_params: "trained_params/step_100000"
save_model_path: ""
# the directory for saving checkpoints.
save_checkpoint: "trained_ckpts"
# the directory for saving trained parameters.
save_param: "trained_params"
# the directory for saving inference model.
inference_model_dir: "infer_model"
# Set seed for CE or debug
random_seed: None
# The pattern to match training data files.
training_file: "wmt16_ende_data_bpe/train.tok.clean.bpe.32000.en-de"
# The pattern to match test data files.
predict_file: "wmt16_ende_data_bpe/newstest2016.tok.bpe.32000.en-de"
# The file to output the translation results of predict_file to.
output_file: "predict.txt"
# The path of vocabulary file of source language.
src_vocab_fpath: "wmt16_ende_data_bpe/vocab_all.bpe.32000"
# The path of vocabulary file of target language.
trg_vocab_fpath: "wmt16_ende_data_bpe/vocab_all.bpe.32000"
# The <bos>, <eos> and <unk> tokens in the dictionary.
special_token: ["<s>", "<e>", "<unk>"]
# whether to use cuda
use_cuda: True
# args for reader, see reader.py for details
token_delimiter: " "
use_token_batch: True
pool_size: 200000
sort_type: "pool"
shuffle: True
shuffle_batch: True
batch_size: 4096
# Hyperparams for training:
# the number of epochs for training
epoch: 30
# the hyper parameters for Adam optimizer.
# This static learning_rate will be multiplied by the learning rate derived
# from the LearningRateScheduler to get the final learning rate.
learning_rate: 2.0
beta1: 0.9
beta2: 0.997
eps: 1e-9
# the parameters for learning rate scheduling.
warmup_steps: 8000
# the weight used to mix up the ground-truth distribution and the fixed
# uniform distribution in label smoothing when training.
# Set this as zero if label smoothing is not wanted.
label_smooth_eps: 0.1
# Hyperparams for generation:
# the parameters for beam search.
beam_size: 5
max_out_len: 256
# the number of decoded sentences to output.
n_best: 1
# Hyperparams for model:
# The following five vocabulary-related configurations will be set
# automatically according to the passed vocabulary path and special tokens.
# size of source word dictionary.
src_vocab_size: 10000
# size of target word dictionary.
trg_vocab_size: 10000
# index for <bos> token
bos_idx: 0
# index for <eos> token
eos_idx: 1
# index for <unk> token
unk_idx: 2
# max length of sequences deciding the size of position encoding table.
max_length: 256
# the dimension for word embeddings, which is also the last dimension of
# the input and output of multi-head attention, position-wise feed-forward
# networks, encoder and decoder.
d_model: 512
# size of the hidden layer in position-wise feed-forward networks.
d_inner_hid: 2048
# the dimension that keys are projected to for dot-product attention.
d_key: 64
# the dimension that values are projected to for dot-product attention.
d_value: 64
# number of head used in multi-head attention.
n_head: 8
# number of sub-layers to be stacked in the encoder and decoder.
n_layer: 6
# dropout rates of different modules.
prepostprocess_dropout: 0.1
attention_dropout: 0.1
relu_dropout: 0.1
# to process before each sub-layer
preprocess_cmd: "n" # layer normalization
# to process after each sub-layer
postprocess_cmd: "da" # dropout + residual connection
# the flag indicating whether to share embedding and softmax weights.
# vocabularies in source and target should be same for weight sharing.
weight_sharing: True
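# Note: with the base configuration above, d_key and d_value equal
# d_model / n_head (512 / 8 = 64), matching the original Transformer base model.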