diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/README.md b/PaddleNLP/examples/machine_translation/transformer-rc0/README.md deleted file mode 100644 index 9ba9a49253ee8daad0acf70f844a12f0fe92a0cf..0000000000000000000000000000000000000000 --- a/PaddleNLP/examples/machine_translation/transformer-rc0/README.md +++ /dev/null @@ -1,161 +0,0 @@ -## Transformer - -以下是本例的简要目录结构及说明: - -```text -. -├── images # README 文档中的图片 -├── predict.py # 预测脚本 -├── reader.py # 数据读取接口 -├── README.md # 文档 -├── train.py # 训练脚本 -├── transformer.py # 模型定义文件 -└── transformer.yaml # 配置文件 -``` - -## 模型简介 - -机器翻译(machine translation, MT)是利用计算机将一种自然语言(源语言)转换为另一种自然语言(目标语言)的过程,输入为源语言句子,输出为相应的目标语言的句子。 - -本项目是机器翻译领域主流模型 Transformer 的 PaddlePaddle 实现, 包含模型训练,预测以及使用自定义数据等内容。用户可以基于发布的内容搭建自己的翻译模型。 - - -## 快速开始 - -### 安装说明 - -1. paddle安装 - - 本项目依赖于 PaddlePaddle 2.0rc及以上版本或适当的develop版本,请参考 [安装指南](https://www.paddlepaddle.org.cn/install/quick) 进行安装 - -2. 下载代码 - - 克隆代码库到本地 - -3. 环境依赖 - - 该模型使用PaddlePaddle,关于环境依赖部分,请先参考PaddlePaddle[安装说明](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/index_cn.html)关于环境依赖部分的内容。 - 此外,需要另外涉及: - * attrdict - * pyyaml - - -### 数据准备 - -公开数据集:WMT 翻译大赛是机器翻译领域最具权威的国际评测大赛,其中英德翻译任务提供了一个中等规模的数据集,这个数据集是较多论文中使用的数据集,也是 Transformer 论文中用到的一个数据集。我们也将[WMT'16 EN-DE 数据集](http://www.statmt.org/wmt16/translation-task.html)作为示例提供。 - -另外我们也整理提供了一份处理好的 WMT'16 EN-DE 数据以供下载使用,其中包含词典(`vocab_all.bpe.32000`文件)、训练所需的 BPE 数据(`train.tok.clean.bpe.32000.en-de`文件)、预测所需的 BPE 数据(`newstest2016.tok.bpe.32000.en-de`等文件)和相应的评估预测结果所需的 tokenize 数据(`newstest2016.tok.de`等文件)。 - -### 单机训练 - -### 单机单卡 - -以提供的英德翻译数据为例,首先需要设置 `transformer.yaml` 中 `world_size` 的值为 1,即卡数,后可以执行以下命令进行模型训练: - -```sh -# setting visible devices for training -CUDA_VISIBLE_DEVICES=0 python train.py -``` - -可以在 transformer.yaml 文件中设置相应的参数。 - -### 单机多卡 - -同样,需要设置 `transformer.yaml` 中 `world_size` 的值为 8,即卡数,后执行如下命令可以实现八卡训练: - -```sh -export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -python train.py -``` - -### 模型推断 - -以英德翻译数据为例,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: - -```sh -# setting visible devices for prediction -export CUDA_VISIBLE_DEVICES=0 -python predict.py -``` - - 由 `predict_file` 指定的文件中文本的翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,同样可以使用 `infer_batch_size` 来指定预测的 batch_size 的大小,更多参数的使用可以在 `transformer.yaml` 文件中查阅注释说明并进行更改设置。 - - 因模型评估与 `predict_file` 写入的顺序相关,故仅有单卡预测。 - - -### 模型评估 - -预测结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标): - -```sh -# 还原 predict.txt 中的预测结果为 tokenize 后的数据 -sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt -# 若无 BLEU 评估工具,需先进行下载 -# git clone https://github.com/moses-smt/mosesdecoder.git -# 以英德翻译 newstest2014 测试数据为例 -perl gen_data/mosesdecoder/scripts/generic/multi-bleu.perl gen_data/wmt16_ende_data/newstest2014.tok.de < predict.tok.txt -``` -可以看到类似如下的结果: -``` -BLEU = 26.36, 57.7/32.1/20.0/13.0 (BP=1.000, ratio=1.013, hyp_len=63903, ref_len=63078) -``` - -## 进阶使用 - -### 背景介绍 - -Transformer 是论文 [Attention Is All You Need](https://arxiv.org/abs/1706.03762) 中提出的用以完成机器翻译(machine translation, MT)等序列到序列(sequence to sequence, Seq2Seq)学习任务的一种全新网络结构,其完全使用注意力(Attention)机制来实现序列到序列的建模[1]。 - -相较于此前 Seq2Seq 模型中广泛使用的循环神经网络(Recurrent Neural Network, RNN),使用(Self)Attention 进行输入序列到输出序列的变换主要具有以下优势: - -- 计算复杂度小 - - 特征维度为 d 、长度为 n 的序列,在 RNN 中计算复杂度为 `O(n * d * d)` (n 个时间步,每个时间步计算 d 维的矩阵向量乘法),在 Self-Attention 中计算复杂度为 `O(n * n * d)` (n 个时间步两两计算 d 维的向量点积或其他相关度函数),n 通常要小于 d 。 -- 计算并行度高 - - RNN 中当前时间步的计算要依赖前一个时间步的计算结果;Self-Attention 
中各时间步的计算只依赖输入不依赖之前时间步输出,各时间步可以完全并行。 -- 容易学习长程依赖(long-range dependencies) - - RNN 中相距为 n 的两个位置间的关联需要 n 步才能建立;Self-Attention 中任何两个位置都直接相连;路径越短信号传播越容易。 - -Transformer 中引入使用的基于 Self-Attention 的序列建模模块结构,已被广泛应用在 Bert [2]等语义表示模型中,取得了显著效果。 - - -### 模型概览 - -Transformer 同样使用了 Seq2Seq 模型中典型的编码器-解码器(Encoder-Decoder)的框架结构,整体网络结构如图1所示。 - -

-图 1. Transformer 网络结构图
-</p>
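图 1 所示的 Encoder-Decoder 整体结构,在 PaddlePaddle 2.0 中可以直接用 `paddle.nn.Transformer` 搭建(本例 `transformer.py` 也基于该接口)。下面是一个仅作示意的最小用法,并非本项目的训练代码:超参数取自 `transformer.yaml` 的默认配置,输入用随机张量代替真实的 embedding:

```python
import paddle
import paddle.nn as nn

# 与 transformer.yaml 默认配置一致的 base 模型结构
model = nn.Transformer(
    d_model=512,             # 词向量 / 隐层维度
    nhead=8,                 # 多头注意力的头数
    num_encoder_layers=6,    # Encoder 堆叠的层数
    num_decoder_layers=6,    # Decoder 堆叠的层数
    dim_feedforward=2048,    # Feed-Forward 隐层大小
    dropout=0.1,
    normalize_before=True)   # pre-norm 结构,与本例 transformer.py 中的设置一致

# 随机构造一批已完成 embedding 的输入:batch=2,源序列长 5,目标序列长 4
enc_input = paddle.rand([2, 5, 512])
dec_input = paddle.rand([2, 4, 512])
dec_output = model(enc_input, dec_input)
print(dec_output.shape)  # [2, 4, 512]
```

实际训练时还需把 `reader.py` 构造的 attention bias 作为 mask 传入,具体见本例 `transformer.py` 中 `TransformerModel.forward` 的用法。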

- -可以看到,和以往 Seq2Seq 模型不同,Transformer 的 Encoder 和 Decoder 中不再使用 RNN 的结构。 - -### 模型特点 - -Transformer 中的 Encoder 由若干相同的 layer 堆叠组成,每个 layer 主要由多头注意力(Multi-Head Attention)和全连接的前馈(Feed-Forward)网络这两个 sub-layer 构成。 -- Multi-Head Attention 在这里用于实现 Self-Attention,相比于简单的 Attention 机制,其将输入进行多路线性变换后分别计算 Attention 的结果,并将所有结果拼接后再次进行线性变换作为输出。参见图2,其中 Attention 使用的是点积(Dot-Product),并在点积后进行了 scale 的处理以避免因点积结果过大进入 softmax 的饱和区域。 -- Feed-Forward 网络会对序列中的每个位置进行相同的计算(Position-wise),其采用的是两次线性变换中间加以 ReLU 激活的结构。 - -此外,每个 sub-layer 后还施以 Residual Connection [3]和 Layer Normalization [4]来促进梯度传播和模型收敛。 - -

-图 2. Multi-Head Attention
-</p>
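下面用 NumPy 给出 Scaled Dot-Product Attention 的一个最小示意实现(非本项目代码,仅帮助理解上文"点积后进行 scale"的处理;Multi-Head Attention 即对输入做多路线性变换后并行执行该计算再拼接):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q, k, v 形状均为 [seq_len, d_k],返回 [seq_len, d_k]。"""
    d_k = q.shape[-1]
    # 点积后除以 sqrt(d_k),避免数值过大进入 softmax 的饱和区域
    scores = q @ k.T / np.sqrt(d_k)                        # [seq_len, seq_len]
    scores = scores - scores.max(axis=-1, keepdims=True)   # 数值稳定的 softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

q = k = v = np.random.rand(4, 64).astype("float32")
print(scaled_dot_product_attention(q, k, v).shape)  # (4, 64)
```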

- -Decoder 具有和 Encoder 类似的结构,只是相比于组成 Encoder 的 layer ,在组成 Decoder 的 layer 中还多了一个 Multi-Head Attention 的 sub-layer 来实现对 Encoder 输出的 Attention,这个 Encoder-Decoder Attention 在其他 Seq2Seq 模型中也是存在的。 - -## FAQ - -**Q:** 预测结果中样本数少于输入的样本数是什么原因 -**A:** 若样本中最大长度超过 `transformer.yaml` 中 `max_length` 的默认设置,请注意运行时增大 `--max_length` 的设置,否则超长样本将被过滤。 - -**Q:** 预测时最大长度超过了训练时的最大长度怎么办 -**A:** 由于训练时 `max_length` 的设置决定了保存模型 position encoding 的大小,若预测时长度超过 `max_length`,请调大该值,会重新生成更大的 position encoding 表。 - - -## 参考文献 -1. Vaswani A, Shazeer N, Parmar N, et al. [Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)[C]//Advances in Neural Information Processing Systems. 2017: 6000-6010. -2. Devlin J, Chang M W, Lee K, et al. [Bert: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805)[J]. arXiv preprint arXiv:1810.04805, 2018. -3. He K, Zhang X, Ren S, et al. [Deep residual learning for image recognition](http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778. -4. Ba J L, Kiros J R, Hinton G E. [Layer normalization](https://arxiv.org/pdf/1607.06450.pdf)[J]. arXiv preprint arXiv:1607.06450, 2016. -5. Sennrich R, Haddow B, Birch A. [Neural machine translation of rare words with subword units](https://arxiv.org/pdf/1508.07909)[J]. arXiv preprint arXiv:1508.07909, 2015. diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/gen_data.sh b/PaddleNLP/examples/machine_translation/transformer-rc0/gen_data.sh deleted file mode 100755 index b2be343e5719a651c4197f4133f5f7652cb3f44b..0000000000000000000000000000000000000000 --- a/PaddleNLP/examples/machine_translation/transformer-rc0/gen_data.sh +++ /dev/null @@ -1,217 +0,0 @@ -#! 
/usr/bin/env bash - -set -e - -OUTPUT_DIR=$PWD/gen_data - -############################################################################### -# change these variables for other WMT data -############################################################################### -OUTPUT_DIR_DATA="${OUTPUT_DIR}/wmt16_ende_data" -OUTPUT_DIR_BPE_DATA="${OUTPUT_DIR}/wmt16_ende_data_bpe" -LANG1="en" -LANG2="de" -# each of TRAIN_DATA: data_url data_file_lang1 data_file_lang2 -TRAIN_DATA=( -'http://www.statmt.org/europarl/v7/de-en.tgz' -'europarl-v7.de-en.en' 'europarl-v7.de-en.de' -'http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz' -'commoncrawl.de-en.en' 'commoncrawl.de-en.de' -'http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz' -'news-commentary-v11.de-en.en' 'news-commentary-v11.de-en.de' -) -# each of DEV_TEST_DATA: data_url data_file_lang1 data_file_lang2 -DEV_TEST_DATA=( -'http://data.statmt.org/wmt16/translation-task/dev.tgz' -'newstest201[45]-deen-ref.en.sgm' 'newstest201[45]-deen-src.de.sgm' -'http://data.statmt.org/wmt16/translation-task/test.tgz' -'newstest2016-deen-ref.en.sgm' 'newstest2016-deen-src.de.sgm' -) -############################################################################### - -############################################################################### -# change these variables for other WMT data -############################################################################### -# OUTPUT_DIR_DATA="${OUTPUT_DIR}/wmt14_enfr_data" -# OUTPUT_DIR_BPE_DATA="${OUTPUT_DIR}/wmt14_enfr_data_bpe" -# LANG1="en" -# LANG2="fr" -# # each of TRAIN_DATA: ata_url data_tgz data_file -# TRAIN_DATA=( -# 'http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz' -# 'commoncrawl.fr-en.en' 'commoncrawl.fr-en.fr' -# 'http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz' -# 'training/europarl-v7.fr-en.en' 'training/europarl-v7.fr-en.fr' -# 'http://www.statmt.org/wmt14/training-parallel-nc-v9.tgz' -# 'training/news-commentary-v9.fr-en.en' 'training/news-commentary-v9.fr-en.fr' -# 'http://www.statmt.org/wmt10/training-giga-fren.tar' -# 'giga-fren.release2.fixed.en.*' 'giga-fren.release2.fixed.fr.*' -# 'http://www.statmt.org/wmt13/training-parallel-un.tgz' -# 'un/undoc.2000.fr-en.en' 'un/undoc.2000.fr-en.fr' -# ) -# # each of DEV_TEST_DATA: data_url data_tgz data_file_lang1 data_file_lang2 -# DEV_TEST_DATA=( -# 'http://data.statmt.org/wmt16/translation-task/dev.tgz' -# '.*/newstest201[45]-fren-ref.en.sgm' '.*/newstest201[45]-fren-src.fr.sgm' -# 'http://data.statmt.org/wmt16/translation-task/test.tgz' -# '.*/newstest2016-fren-ref.en.sgm' '.*/newstest2016-fren-src.fr.sgm' -# ) -############################################################################### - -mkdir -p $OUTPUT_DIR_DATA $OUTPUT_DIR_BPE_DATA - -# Extract training data -for ((i=0;i<${#TRAIN_DATA[@]};i+=3)); do - data_url=${TRAIN_DATA[i]} - data_tgz=${data_url##*/} # training-parallel-commoncrawl.tgz - data=${data_tgz%.*} # training-parallel-commoncrawl - data_lang1=${TRAIN_DATA[i+1]} - data_lang2=${TRAIN_DATA[i+2]} - if [ ! -e ${OUTPUT_DIR_DATA}/${data_tgz} ]; then - echo "Download "${data_url} - wget -O ${OUTPUT_DIR_DATA}/${data_tgz} ${data_url} - fi - - if [ ! 
-d ${OUTPUT_DIR_DATA}/${data} ]; then - echo "Extract "${data_tgz} - mkdir -p ${OUTPUT_DIR_DATA}/${data} - tar_type=${data_tgz:0-3} - if [ ${tar_type} == "tar" ]; then - tar -xvf ${OUTPUT_DIR_DATA}/${data_tgz} -C ${OUTPUT_DIR_DATA}/${data} - else - tar -xvzf ${OUTPUT_DIR_DATA}/${data_tgz} -C ${OUTPUT_DIR_DATA}/${data} - fi - fi - # concatenate all training data - for data_lang in $data_lang1 $data_lang2; do - for f in `find ${OUTPUT_DIR_DATA}/${data} -regex ".*/${data_lang}"`; do - data_dir=`dirname $f` - data_file=`basename $f` - f_base=${f%.*} - f_ext=${f##*.} - if [ $f_ext == "gz" ]; then - gunzip $f - l=${f_base##*.} - f_base=${f_base%.*} - else - l=${f_ext} - fi - - if [ $i -eq 0 ]; then - cat ${f_base}.$l > ${OUTPUT_DIR_DATA}/train.$l - else - cat ${f_base}.$l >> ${OUTPUT_DIR_DATA}/train.$l - fi - done - done -done - -# Clone mosesdecoder -if [ ! -d ${OUTPUT_DIR}/mosesdecoder ]; then - echo "Cloning moses for data processing" - git clone https://github.com/moses-smt/mosesdecoder.git ${OUTPUT_DIR}/mosesdecoder -fi - -# Extract develop and test data -dev_test_data="" -for ((i=0;i<${#DEV_TEST_DATA[@]};i+=3)); do - data_url=${DEV_TEST_DATA[i]} - data_tgz=${data_url##*/} # training-parallel-commoncrawl.tgz - data=${data_tgz%.*} # training-parallel-commoncrawl - data_lang1=${DEV_TEST_DATA[i+1]} - data_lang2=${DEV_TEST_DATA[i+2]} - if [ ! -e ${OUTPUT_DIR_DATA}/${data_tgz} ]; then - echo "Download "${data_url} - wget -O ${OUTPUT_DIR_DATA}/${data_tgz} ${data_url} - fi - - if [ ! -d ${OUTPUT_DIR_DATA}/${data} ]; then - echo "Extract "${data_tgz} - mkdir -p ${OUTPUT_DIR_DATA}/${data} - tar_type=${data_tgz:0-3} - if [ ${tar_type} == "tar" ]; then - tar -xvf ${OUTPUT_DIR_DATA}/${data_tgz} -C ${OUTPUT_DIR_DATA}/${data} - else - tar -xvzf ${OUTPUT_DIR_DATA}/${data_tgz} -C ${OUTPUT_DIR_DATA}/${data} - fi - fi - - for data_lang in $data_lang1 $data_lang2; do - for f in `find ${OUTPUT_DIR_DATA}/${data} -regex ".*/${data_lang}"`; do - data_dir=`dirname $f` - data_file=`basename $f` - data_out=`echo ${data_file} | cut -d '-' -f 1` # newstest2016 - l=`echo ${data_file} | cut -d '.' -f 2` # en - dev_test_data="${dev_test_data}\|${data_out}" # to make regexp - if [ ! -e ${OUTPUT_DIR_DATA}/${data_out}.$l ]; then - ${OUTPUT_DIR}/mosesdecoder/scripts/ems/support/input-from-sgm.perl \ - < $f > ${OUTPUT_DIR_DATA}/${data_out}.$l - fi - done - done -done - -# Tokenize data -for l in ${LANG1} ${LANG2}; do - for f in `ls ${OUTPUT_DIR_DATA}/*.$l | grep "\(train${dev_test_data}\)\.$l$"`; do - f_base=${f%.*} # dir/train dir/newstest2016 - f_out=$f_base.tok.$l - if [ ! -e $f_out ]; then - echo "Tokenize "$f - ${OUTPUT_DIR}/mosesdecoder/scripts/tokenizer/tokenizer.perl -q -l $l -threads 8 < $f > $f_out - fi - done -done - -# Clean data -for f in ${OUTPUT_DIR_DATA}/train.${LANG1} ${OUTPUT_DIR_DATA}/train.tok.${LANG1}; do - f_base=${f%.*} # dir/train dir/train.tok - f_out=${f_base}.clean - if [ ! -e $f_out.${LANG1} ] && [ ! -e $f_out.${LANG2} ]; then - echo "Clean "${f_base} - ${OUTPUT_DIR}/mosesdecoder/scripts/training/clean-corpus-n.perl $f_base ${LANG1} ${LANG2} ${f_out} 1 80 - fi -done - -python -m pip install subword-nmt - -# Generate BPE data and vocabulary -for num_operations in 32000; do - if [ ! 
-e ${OUTPUT_DIR_BPE_DATA}/bpe.${num_operations} ]; then - echo "Learn BPE with ${num_operations} merge operations" - cat ${OUTPUT_DIR_DATA}/train.tok.clean.${LANG1} ${OUTPUT_DIR_DATA}/train.tok.clean.${LANG2} | \ - subword-nmt learn-bpe -s $num_operations > ${OUTPUT_DIR_BPE_DATA}/bpe.${num_operations} - fi - - for l in ${LANG1} ${LANG2}; do - for f in `ls ${OUTPUT_DIR_DATA}/*.$l | grep "\(train${dev_test_data}\)\.tok\(\.clean\)\?\.$l$"`; do - f_base=${f%.*} # dir/train.tok dir/train.tok.clean dir/newstest2016.tok - f_base=${f_base##*/} # train.tok train.tok.clean newstest2016.tok - f_out=${OUTPUT_DIR_BPE_DATA}/${f_base}.bpe.${num_operations}.$l - if [ ! -e $f_out ]; then - echo "Apply BPE to "$f - subword-nmt apply-bpe -c ${OUTPUT_DIR_BPE_DATA}/bpe.${num_operations} < $f > $f_out - fi - done - done - - if [ ! -e ${OUTPUT_DIR_BPE_DATA}/vocab.bpe.${num_operations} ]; then - echo "Create vocabulary for BPE data" - cat ${OUTPUT_DIR_BPE_DATA}/train.tok.clean.bpe.${num_operations}.${LANG1} ${OUTPUT_DIR_BPE_DATA}/train.tok.clean.bpe.${num_operations}.${LANG2} | \ - subword-nmt get-vocab | cut -f1 -d ' ' > ${OUTPUT_DIR_BPE_DATA}/vocab.bpe.${num_operations} - fi -done - -# Adapt to the reader -for f in ${OUTPUT_DIR_BPE_DATA}/*.bpe.${num_operations}.${LANG1}; do - f_base=${f%.*} # dir/train.tok.clean.bpe.32000 dir/newstest2016.tok.bpe.32000 - f_out=${f_base}.${LANG1}-${LANG2} - if [ ! -e $f_out ]; then - paste -d '\t' $f_base.${LANG1} $f_base.${LANG2} > $f_out - fi -done -if [ ! -e ${OUTPUT_DIR_BPE_DATA}/vocab_all.bpe.${num_operations} ]; then - sed '1i\\n\n' ${OUTPUT_DIR_BPE_DATA}/vocab.bpe.${num_operations} > ${OUTPUT_DIR_BPE_DATA}/vocab_all.bpe.${num_operations} -fi - -echo "All done." diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/predict.py b/PaddleNLP/examples/machine_translation/transformer-rc0/predict.py deleted file mode 100644 index db7121d246346e2fef123a813aef639892e89309..0000000000000000000000000000000000000000 --- a/PaddleNLP/examples/machine_translation/transformer-rc0/predict.py +++ /dev/null @@ -1,109 +0,0 @@ -import logging -import os -import six -import sys -import time - -import numpy as np -import paddle - -import yaml -from attrdict import AttrDict -from pprint import pprint - -# Include task-specific libs -import reader -from transformer import InferTransformerModel, position_encoding_init - - -def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): - """ - Post-process the decoded sequence. 
- """ - eos_pos = len(seq) - 1 - for i, idx in enumerate(seq): - if idx == eos_idx: - eos_pos = i - break - seq = [ - idx for idx in seq[:eos_pos + 1] - if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx) - ] - return seq - - -@paddle.fluid.dygraph.no_grad() -def do_predict(args): - if args.use_gpu: - place = "gpu:0" - else: - place = "cpu" - - paddle.set_device(place) - - # Define data loader - (test_loader, - test_steps_fn), trg_idx2word = reader.create_infer_loader(args) - - # Define model - transformer = InferTransformerModel( - src_vocab_size=args.src_vocab_size, - trg_vocab_size=args.trg_vocab_size, - max_length=args.max_length + 1, - n_layer=args.n_layer, - n_head=args.n_head, - d_model=args.d_model, - d_inner_hid=args.d_inner_hid, - dropout=args.dropout, - weight_sharing=args.weight_sharing, - bos_id=args.bos_idx, - eos_id=args.eos_idx, - beam_size=args.beam_size, - max_out_len=args.max_out_len) - - # Load the trained model - assert args.init_from_params, ( - "Please set init_from_params to load the infer model.") - - model_dict = paddle.load( - os.path.join(args.init_from_params, "transformer.pdparams")) - - # To avoid a longer length than training, reset the size of position - # encoding to max_length - model_dict["encoder.pos_encoder.weight"] = position_encoding_init( - args.max_length + 1, args.d_model) - model_dict["decoder.pos_encoder.weight"] = position_encoding_init( - args.max_length + 1, args.d_model) - transformer.load_dict(model_dict) - - # Set evaluate mode - transformer.eval() - - f = open(args.output_file, "w") - for input_data in test_loader: - (src_word, src_pos, src_slf_attn_bias, trg_word, - trg_src_attn_bias) = input_data - finished_seq = transformer( - src_word=src_word, - src_pos=src_pos, - src_slf_attn_bias=src_slf_attn_bias, - trg_word=trg_word, - trg_src_attn_bias=trg_src_attn_bias) - finished_seq = finished_seq.numpy().transpose([0, 2, 1]) - for ins in finished_seq: - for beam_idx, beam in enumerate(ins): - if beam_idx >= args.n_best: - break - id_list = post_process_seq(beam, args.bos_idx, args.eos_idx) - word_list = [trg_idx2word[id] for id in id_list] - sequence = " ".join(word_list) + "\n" - f.write(sequence) - - -if __name__ == "__main__": - yaml_file = "./transformer.yaml" - with open(yaml_file, 'rt') as f: - args = AttrDict(yaml.safe_load(f)) - pprint(args) - - do_predict(args) diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/reader.py b/PaddleNLP/examples/machine_translation/transformer-rc0/reader.py deleted file mode 100644 index 7fc0c5d70593af9819542b4012211079c240037f..0000000000000000000000000000000000000000 --- a/PaddleNLP/examples/machine_translation/transformer-rc0/reader.py +++ /dev/null @@ -1,509 +0,0 @@ -import glob -import sys -import os -import io -import itertools -from functools import partial -import numpy as np -from paddle.io import BatchSampler, DataLoader, Dataset - - -def create_infer_loader(args): - dataset = TransformerDataset( - fpattern=args.predict_file, - src_vocab_fpath=args.src_vocab_fpath, - trg_vocab_fpath=args.trg_vocab_fpath, - token_delimiter=args.token_delimiter, - start_mark=args.special_token[0], - end_mark=args.special_token[1], - unk_mark=args.special_token[2]) - args.src_vocab_size, args.trg_vocab_size, args.bos_idx, args.eos_idx, \ - args.unk_idx = dataset.get_vocab_summary() - trg_idx2word = TransformerDataset.load_dict( - dict_path=args.trg_vocab_fpath, reverse=True) - batch_sampler = TransformerBatchSampler( - dataset=dataset, - use_token_batch=False, - 
batch_size=args.infer_batch_size, - max_length=args.max_length) - data_loader = DataLoader( - dataset=dataset, - batch_sampler=batch_sampler, - collate_fn=partial( - prepare_infer_input, - bos_idx=args.bos_idx, - eos_idx=args.eos_idx, - src_pad_idx=args.eos_idx, - n_head=args.n_head), - num_workers=0, - return_list=True) - data_loaders = (data_loader, batch_sampler.__len__) - return data_loaders, trg_idx2word - - -def create_data_loader(args, world_size, rank): - data_loaders = [(None, None)] * 2 - data_files = [args.training_file, args.validation_file - ] if args.validation_file else [args.training_file] - for i, data_file in enumerate(data_files): - dataset = TransformerDataset( - fpattern=data_file, - src_vocab_fpath=args.src_vocab_fpath, - trg_vocab_fpath=args.trg_vocab_fpath, - token_delimiter=args.token_delimiter, - start_mark=args.special_token[0], - end_mark=args.special_token[1], - unk_mark=args.special_token[2]) - args.src_vocab_size, args.trg_vocab_size, args.bos_idx, args.eos_idx, \ - args.unk_idx = dataset.get_vocab_summary() - batch_sampler = TransformerBatchSampler( - dataset=dataset, - batch_size=args.batch_size, - pool_size=args.pool_size, - sort_type=args.sort_type, - shuffle=args.shuffle, - shuffle_batch=args.shuffle_batch, - use_token_batch=args.use_token_batch, - max_length=args.max_length, - distribute_mode=True if i == 0 else False, - world_size=world_size, - rank=rank) - data_loader = DataLoader( - dataset=dataset, - batch_sampler=batch_sampler, - collate_fn=partial( - prepare_train_input, - bos_idx=args.bos_idx, - eos_idx=args.eos_idx, - src_pad_idx=args.eos_idx, - trg_pad_idx=args.eos_idx, - n_head=args.n_head), - num_workers=0, - return_list=True) - data_loaders[i] = (data_loader, batch_sampler.__len__) - return data_loaders - - -def prepare_train_input(insts, bos_idx, eos_idx, src_pad_idx, trg_pad_idx, - n_head): - """ - Put all padded data needed by training into a list. - """ - src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data( - [inst[0] + [eos_idx] for inst in insts], - src_pad_idx, - n_head, - is_target=False) - src_word = src_word.reshape(-1, src_max_len) - src_pos = src_pos.reshape(-1, src_max_len) - trg_word, trg_pos, trg_slf_attn_bias, trg_max_len = pad_batch_data( - [[bos_idx] + inst[1] for inst in insts], - trg_pad_idx, - n_head, - is_target=True) - trg_word = trg_word.reshape(-1, trg_max_len) - trg_pos = trg_pos.reshape(-1, trg_max_len) - - trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :], - [1, 1, trg_max_len, 1]).astype("float32") - - lbl_word, lbl_weight, lbl_max_len, num_token = pad_batch_data( - [inst[1] + [eos_idx] for inst in insts], - trg_pad_idx, - n_head, - is_target=False, - is_label=True, - return_attn_bias=False, - return_max_len=True, - return_num_token=True) - lbl_word = lbl_word.reshape(-1, lbl_max_len, 1) - lbl_weight = lbl_weight.reshape(-1, lbl_max_len, 1) - - data_inputs = [ - src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos, - trg_slf_attn_bias, trg_src_attn_bias, lbl_word, lbl_weight - ] - - return data_inputs - - -def prepare_infer_input(insts, bos_idx, eos_idx, src_pad_idx, n_head): - """ - Put all padded data needed by beam search decoder into a list. 
- """ - src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data( - [inst[0] + [eos_idx] for inst in insts], - src_pad_idx, - n_head, - is_target=False) - trg_word = np.asarray([[bos_idx]] * len(insts), dtype="int64") - trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :], - [1, 1, 1, 1]).astype("float32") - trg_word = trg_word.reshape(-1, 1) - src_word = src_word.reshape(-1, src_max_len) - src_pos = src_pos.reshape(-1, src_max_len) - - data_inputs = [ - src_word, src_pos, src_slf_attn_bias, trg_word, trg_src_attn_bias - ] - - return data_inputs - - -def pad_batch_data(insts, - pad_idx, - n_head, - is_target=False, - is_label=False, - return_attn_bias=True, - return_max_len=True, - return_num_token=False): - """ - Pad the instances to the max sequence length in batch, and generate the - corresponding position data and attention bias. - """ - return_list = [] - max_len = max(len(inst) for inst in insts) - # Any token included in dict can be used to pad, since the paddings' loss - # will be masked out by weights and make no effect on parameter gradients. - inst_data = np.array( - [inst + [pad_idx] * (max_len - len(inst)) for inst in insts]) - return_list += [inst_data.astype("int64").reshape([-1, 1])] - if is_label: # label weight - inst_weight = np.array([[1.] * len(inst) + [0.] * (max_len - len(inst)) - for inst in insts]) - return_list += [inst_weight.astype("float32").reshape([-1, 1])] - else: # position data - inst_pos = np.array([ - list(range(0, len(inst))) + [0] * (max_len - len(inst)) - for inst in insts - ]) - return_list += [inst_pos.astype("int64").reshape([-1, 1])] - if return_attn_bias: - if is_target: - # This is used to avoid attention on paddings and subsequent - # words. - slf_attn_bias_data = np.ones((inst_data.shape[0], max_len, max_len)) - slf_attn_bias_data = np.triu(slf_attn_bias_data, - 1).reshape([-1, 1, max_len, max_len]) - slf_attn_bias_data = np.tile(slf_attn_bias_data, - [1, n_head, 1, 1]) * [-1e9] - else: - # This is used to avoid attention on paddings. 
- slf_attn_bias_data = np.array([[0] * len(inst) + [-1e9] * - (max_len - len(inst)) - for inst in insts]) - slf_attn_bias_data = np.tile( - slf_attn_bias_data.reshape([-1, 1, 1, max_len]), - [1, n_head, max_len, 1]) - return_list += [slf_attn_bias_data.astype("float32")] - if return_max_len: - return_list += [max_len] - if return_num_token: - num_token = 0 - for inst in insts: - num_token += len(inst) - return_list += [num_token] - return return_list if len(return_list) > 1 else return_list[0] - - -class SortType(object): - GLOBAL = 'global' - POOL = 'pool' - NONE = "none" - - -class Converter(object): - def __init__(self, vocab, beg, end, unk, delimiter, add_beg, add_end): - self._vocab = vocab - self._beg = beg - self._end = end - self._unk = unk - self._delimiter = delimiter - self._add_beg = add_beg - self._add_end = add_end - - def __call__(self, sentence): - return ([self._beg] if self._add_beg else []) + [ - self._vocab.get(w, self._unk) - for w in sentence.split(self._delimiter) - ] + ([self._end] if self._add_end else []) - - -class ComposedConverter(object): - def __init__(self, converters): - self._converters = converters - - def __call__(self, fields): - return [ - converter(field) - for field, converter in zip(fields, self._converters) - ] - - -class SentenceBatchCreator(object): - def __init__(self, batch_size): - self.batch = [] - self._batch_size = batch_size - - def append(self, info): - self.batch.append(info) - if len(self.batch) == self._batch_size: - tmp = self.batch - self.batch = [] - return tmp - - -class TokenBatchCreator(object): - def __init__(self, batch_size): - self.batch = [] - self.max_len = -1 - self._batch_size = batch_size - - def append(self, info): - cur_len = info.max_len - max_len = max(self.max_len, cur_len) - if max_len * (len(self.batch) + 1) > self._batch_size: - result = self.batch - self.batch = [info] - self.max_len = cur_len - return result - else: - self.max_len = max_len - self.batch.append(info) - - -class SampleInfo(object): - def __init__(self, i, lens): - self.i = i - # Take bos and eos into account - self.min_len = min(lens[0] + 1, lens[1] + 2) - self.max_len = max(lens[0] + 1, lens[1] + 2) - - -class MinMaxFilter(object): - def __init__(self, max_len, min_len, underlying_creator): - self._min_len = min_len - self._max_len = max_len - self._creator = underlying_creator - - def append(self, info): - if info.max_len > self._max_len or info.min_len < self._min_len: - return - else: - return self._creator.append(info) - - @property - def batch(self): - return self._creator.batch - - -class TransformerDataset(Dataset): - def __init__(self, - src_vocab_fpath, - trg_vocab_fpath, - fpattern, - field_delimiter="\t", - token_delimiter=" ", - start_mark="", - end_mark="", - unk_mark="", - trg_fpattern=None): - self._src_vocab = self.load_dict(src_vocab_fpath) - self._trg_vocab = self.load_dict(trg_vocab_fpath) - self._bos_idx = self._src_vocab[start_mark] - self._eos_idx = self._src_vocab[end_mark] - self._unk_idx = self._src_vocab[unk_mark] - self._field_delimiter = field_delimiter - self._token_delimiter = token_delimiter - self.load_src_trg_ids(fpattern, trg_fpattern) - - def load_src_trg_ids(self, fpattern, trg_fpattern=None): - src_converter = Converter( - vocab=self._src_vocab, - beg=self._bos_idx, - end=self._eos_idx, - unk=self._unk_idx, - delimiter=self._token_delimiter, - add_beg=False, - add_end=False) - - trg_converter = Converter( - vocab=self._trg_vocab, - beg=self._bos_idx, - end=self._eos_idx, - unk=self._unk_idx, - 
delimiter=self._token_delimiter, - add_beg=False, - add_end=False) - - converters = ComposedConverter([src_converter, trg_converter]) - - self._src_seq_ids = [] - self._trg_seq_ids = [] - self._sample_infos = [] - - slots = [self._src_seq_ids, self._trg_seq_ids] - for i, line in enumerate(self._load_lines(fpattern, trg_fpattern)): - lens = [] - for field, slot in zip(converters(line), slots): - slot.append(field) - lens.append(len(field)) - self._sample_infos.append(SampleInfo(i, lens)) - - def _load_lines(self, fpattern, trg_fpattern=None): - fpaths = glob.glob(fpattern) - fpaths = sorted(fpaths) # TODO: Add custum sort - assert len(fpaths) > 0, "no matching file to the provided data path" - - (f_mode, f_encoding, endl) = ("r", "utf8", "\n") - - if trg_fpattern is None: - for fpath in fpaths: - with io.open(fpath, f_mode, encoding=f_encoding) as f: - for line in f: - fields = line.strip(endl).split(self._field_delimiter) - yield fields - else: - # Separated source and target language data files - # assume we can get aligned data by sort the two language files - # TODO: Need more rigorous check - trg_fpaths = glob.glob(trg_fpattern) - trg_fpaths = sorted(trg_fpaths) - assert len(fpaths) == len( - trg_fpaths - ), "the number of source language data files must equal \ - with that of source language" - - for fpath, trg_fpath in zip(fpaths, trg_fpaths): - with io.open(fpath, f_mode, encoding=f_encoding) as f: - with io.open( - trg_fpath, f_mode, encoding=f_encoding) as trg_f: - for line in zip(f, trg_f): - fields = [field.strip(endl) for field in line] - yield fields - - @staticmethod - def load_dict(dict_path, reverse=False): - word_dict = {} - (f_mode, f_encoding, endl) = ("r", "utf8", "\n") - with io.open(dict_path, f_mode, encoding=f_encoding) as fdict: - for idx, line in enumerate(fdict): - if reverse: - word_dict[idx] = line.strip(endl) - else: - word_dict[line.strip(endl)] = idx - return word_dict - - def get_vocab_summary(self): - return len(self._src_vocab), len( - self._trg_vocab), self._bos_idx, self._eos_idx, self._unk_idx - - def __getitem__(self, idx): - return (self._src_seq_ids[idx], self._trg_seq_ids[idx] - ) if self._trg_seq_ids else self._src_seq_ids[idx] - - def __len__(self): - return len(self._sample_infos) - - -class TransformerBatchSampler(BatchSampler): - def __init__(self, - dataset, - batch_size, - pool_size=10000, - sort_type=SortType.NONE, - min_length=0, - max_length=100, - shuffle=False, - shuffle_batch=False, - use_token_batch=False, - clip_last_batch=False, - distribute_mode=True, - seed=0, - world_size=1, - rank=0): - for arg, value in locals().items(): - if arg != "self": - setattr(self, "_" + arg, value) - self._random = np.random - self._random.seed(seed) - # for multi-devices - self._distribute_mode = distribute_mode - self._nranks = world_size - self._local_rank = rank - - def __iter__(self): - # global sort or global shuffle - if self._sort_type == SortType.GLOBAL: - infos = sorted(self._dataset._sample_infos, key=lambda x: x.max_len) - else: - if self._shuffle: - infos = self._dataset._sample_infos - self._random.shuffle(infos) - else: - infos = self._dataset._sample_infos - - if self._sort_type == SortType.POOL: - reverse = True - for i in range(0, len(infos), self._pool_size): - # To avoid placing short next to long sentences - reverse = not reverse - infos[i:i + self._pool_size] = sorted( - infos[i:i + self._pool_size], - key=lambda x: x.max_len, - reverse=reverse) - - batches = [] - batch_creator = TokenBatchCreator( - self. 
- _batch_size) if self._use_token_batch else SentenceBatchCreator( - self._batch_size * self._nranks) - batch_creator = MinMaxFilter(self._max_length, self._min_length, - batch_creator) - - for info in infos: - batch = batch_creator.append(info) - if batch is not None: - batches.append(batch) - - if not self._clip_last_batch and len(batch_creator.batch) != 0: - batches.append(batch_creator.batch) - - if self._shuffle_batch: - self._random.shuffle(batches) - - if not self._use_token_batch: - # When producing batches according to sequence number, to confirm - # neighbor batches which would be feed and run parallel have similar - # length (thus similar computational cost) after shuffle, we as take - # them as a whole when shuffling and split here - batches = [[ - batch[self._batch_size * i:self._batch_size * (i + 1)] - for i in range(self._nranks) - ] for batch in batches] - batches = list(itertools.chain.from_iterable(batches)) - self.batch_number = (len(batches) + self._nranks - 1) // self._nranks - - # for multi-device - for batch_id, batch in enumerate(batches): - if not self._distribute_mode or ( - batch_id % self._nranks == self._local_rank): - batch_indices = [info.i for info in batch] - yield batch_indices - if self._distribute_mode and len(batches) % self._nranks != 0: - if self._local_rank >= len(batches) % self._nranks: - # use previous data to pad - yield batch_indices - - def __len__(self): - if hasattr(self, "batch_number"): # - return self.batch_number - if not self._use_token_batch: - batch_number = ( - len(self._dataset) + self._batch_size * self._nranks - 1) // ( - self._batch_size * self._nranks) - else: - # For uncertain batch number, the actual value is self.batch_number - batch_number = sys.maxsize - return batch_number diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/train.py b/PaddleNLP/examples/machine_translation/transformer-rc0/train.py deleted file mode 100644 index e12588dba54eac17f824035178374037067081fb..0000000000000000000000000000000000000000 --- a/PaddleNLP/examples/machine_translation/transformer-rc0/train.py +++ /dev/null @@ -1,225 +0,0 @@ -import logging -import os -import six -import sys -import time - -import numpy as np -import yaml -from attrdict import AttrDict -from pprint import pprint - -import paddle -import paddle.distributed as dist - -import reader -from transformer import TransformerModel, CrossEntropyCriterion - -FORMAT = '%(asctime)s-%(levelname)s: %(message)s' -logging.basicConfig(level=logging.INFO, format=FORMAT) -logger = logging.getLogger(__name__) - - -def do_train(args): - if args.use_gpu: - rank = dist.get_rank() - trainer_count = dist.get_world_size() - else: - rank = 0 - trainer_count = 1 - paddle.set_device("cpu") - - if trainer_count > 1: - dist.init_parallel_env() - - # Set seed for CE - random_seed = eval(str(args.random_seed)) - if random_seed is not None: - paddle.seed(random_seed) - - # Define data loader - (train_loader, train_steps_fn), ( - eval_loader, - eval_steps_fn) = reader.create_data_loader(args, trainer_count, rank) - - # Define model - transformer = TransformerModel( - src_vocab_size=args.src_vocab_size, - trg_vocab_size=args.trg_vocab_size, - max_length=args.max_length + 1, - n_layer=args.n_layer, - n_head=args.n_head, - d_model=args.d_model, - d_inner_hid=args.d_inner_hid, - dropout=args.dropout, - weight_sharing=args.weight_sharing, - bos_id=args.bos_idx, - eos_id=args.eos_idx) - - # Define loss - criterion = CrossEntropyCriterion(args.label_smooth_eps) - - scheduler = 
paddle.optimizer.lr.NoamDecay( - args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0) - # Define optimizer - optimizer = paddle.optimizer.Adam( - learning_rate=scheduler, - beta1=args.beta1, - beta2=args.beta2, - epsilon=float(args.eps), - parameters=transformer.parameters()) - - # Init from some checkpoint, to resume the previous training - if args.init_from_checkpoint: - model_dict = paddle.load( - os.path.join(args.init_from_checkpoint, "transformer.pdparams")) - opt_dict = paddle.load( - os.path.join(args.init_from_checkpoint, "transformer.pdopt")) - transformer.set_state_dict(model_dict) - optimizer.set_state_dict(opt_dict) - print("loaded from checkpoint.") - # Init from some pretrain models, to better solve the current task - if args.init_from_pretrain_model: - model_dict = paddle.load( - os.path.join(args.init_from_pretrain_model, "transformer.pdparams")) - transformer.set_state_dict(model_dict) - print("loaded from pre-trained model.") - - if trainer_count > 1: - transformer = paddle.DataParallel(transformer) - - # The best cross-entropy value with label smoothing - loss_normalizer = -( - (1. - args.label_smooth_eps) * np.log( - (1. - args.label_smooth_eps)) + args.label_smooth_eps * - np.log(args.label_smooth_eps / (args.trg_vocab_size - 1) + 1e-20)) - - ce_time = [] - ce_ppl = [] - step_idx = 0 - - # Train loop - for pass_id in range(args.epoch): - epoch_start = time.time() - - batch_id = 0 - batch_start = time.time() - for input_data in train_loader: - batch_reader_end = time.time() - (src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos, - trg_slf_attn_bias, trg_src_attn_bias, lbl_word, - lbl_weight) = input_data - - logits = transformer( - src_word=src_word, - src_pos=src_pos, - src_slf_attn_bias=src_slf_attn_bias, - trg_word=trg_word, - trg_pos=trg_pos, - trg_slf_attn_bias=trg_slf_attn_bias, - trg_src_attn_bias=trg_src_attn_bias) - - sum_cost, avg_cost, token_num = criterion(logits, lbl_word, - lbl_weight) - - avg_cost.backward() - - optimizer.step() - optimizer.clear_grad() - if step_idx % args.print_step == 0 and (trainer_count == 1 or - dist.get_rank() == 0): - total_avg_cost = avg_cost.numpy() - - if step_idx == 0: - logger.info( - "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " - "normalized loss: %f, ppl: %f" % - (step_idx, pass_id, batch_id, total_avg_cost, - total_avg_cost - loss_normalizer, - np.exp([min(total_avg_cost, 100)]))) - else: - train_avg_batch_cost = args.print_step / ( - time.time() - batch_start) - logger.info( - "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " - "normalized loss: %f, ppl: %f, avg_speed: %.2f step/sec, " - % ( - step_idx, - pass_id, - batch_id, - total_avg_cost, - total_avg_cost - loss_normalizer, - np.exp([min(total_avg_cost, 100)]), - train_avg_batch_cost, )) - - batch_start = time.time() - - if step_idx % args.save_step == 0 and step_idx != 0: - # Validation - if args.validation_file: - transformer.eval() - total_sum_cost = 0 - total_token_num = 0 - for input_data in eval_loader: - (src_word, src_pos, src_slf_attn_bias, trg_word, - trg_pos, trg_slf_attn_bias, trg_src_attn_bias, - lbl_word, lbl_weight) = input_data - logits = transformer( - src_word, src_pos, src_slf_attn_bias, trg_word, - trg_pos, trg_slf_attn_bias, trg_src_attn_bias) - sum_cost, avg_cost, token_num = criterion( - logits, lbl_word, lbl_weight) - total_sum_cost += sum_cost.numpy() - total_token_num += token_num.numpy() - total_avg_cost = total_sum_cost / total_token_num - logger.info("validation, step_idx: %d, avg loss: %f, " - "normalized 
loss: %f, ppl: %f" % - (step_idx, total_avg_cost, - total_avg_cost - loss_normalizer, - np.exp([min(total_avg_cost, 100)]))) - transformer.train() - - if args.save_model and (trainer_count == 1 or - dist.get_rank() == 0): - model_dir = os.path.join(args.save_model, - "step_" + str(step_idx)) - if not os.path.exists(model_dir): - os.makedirs(model_dir) - paddle.save(transformer.state_dict(), - os.path.join(model_dir, "transformer.pdparams")) - paddle.save(optimizer.state_dict(), - os.path.join(model_dir, "transformer.pdopt")) - - batch_id += 1 - step_idx += 1 - scheduler.step() - - train_epoch_cost = time.time() - epoch_start - ce_time.append(train_epoch_cost) - logger.info("train epoch: %d, epoch_cost: %.5f s" % - (pass_id, train_epoch_cost)) - - if args.save_model and (trainer_count == 1 or dist.get_rank() == 0): - model_dir = os.path.join(args.save_model, "step_final") - if not os.path.exists(model_dir): - os.makedirs(model_dir) - paddle.save(transformer.state_dict(), - os.path.join(model_dir, "transformer.pdparams")) - paddle.save(optimizer.state_dict(), - os.path.join(model_dir, "transformer.pdopt")) - - -def train(args, world_size=1): - if world_size > 1 and args.use_gpu: - dist.spawn(do_train, nprocs=world_size, args=(args, )) - else: - do_train(args) - - -if __name__ == "__main__": - yaml_file = "./transformer.yaml" - with open(yaml_file, 'rt') as f: - args = AttrDict(yaml.safe_load(f)) - pprint(args) - - train(args, eval(str(args.world_size))) diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/transformer.py b/PaddleNLP/examples/machine_translation/transformer-rc0/transformer.py deleted file mode 100644 index c59a8c7cff5a76169076a38b874a2cd51acf9061..0000000000000000000000000000000000000000 --- a/PaddleNLP/examples/machine_translation/transformer-rc0/transformer.py +++ /dev/null @@ -1,351 +0,0 @@ -from __future__ import print_function - -import numpy as np - -import paddle -import paddle.nn as nn -import paddle.nn.functional as F -from paddle.fluid.layers.utils import map_structure - - -def position_encoding_init(n_position, d_pos_vec): - """ - Generate the initial values for the sinusoid position encoding table. 
- """ - channels = d_pos_vec - position = np.arange(n_position) - num_timescales = channels // 2 - log_timescale_increment = (np.log(float(1e4) / float(1)) / - (num_timescales - 1)) - inv_timescales = np.exp( - np.arange(num_timescales) * -log_timescale_increment) - scaled_time = np.expand_dims(position, 1) * np.expand_dims(inv_timescales, - 0) - signal = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1) - signal = np.pad(signal, [[0, 0], [0, np.mod(channels, 2)]], 'constant') - position_enc = signal - return position_enc.astype("float32") - - -class WordEmbedding(nn.Layer): - """ - Word Embedding + Scale - """ - - def __init__(self, vocab_size, emb_dim, bos_idx=0): - super(WordEmbedding, self).__init__() - self.emb_dim = emb_dim - - self.word_embedding = nn.Embedding( - num_embeddings=vocab_size, - embedding_dim=emb_dim, - padding_idx=bos_idx, - weight_attr=paddle.ParamAttr( - initializer=nn.initializer.Normal(0., emb_dim**-0.5))) - - def forward(self, word): - word_emb = self.emb_dim**0.5 * self.word_embedding(word) - return word_emb - - -class PositionalEmbedding(nn.Layer): - """ - Positional Embedding - """ - - def __init__(self, emb_dim, max_length, bos_idx=0): - super(PositionalEmbedding, self).__init__() - self.emb_dim = emb_dim - - self.pos_encoder = nn.Embedding( - num_embeddings=max_length, embedding_dim=self.emb_dim) - self.pos_encoder.weight.set_value( - position_encoding_init(max_length, self.emb_dim)) - - def forward(self, pos): - pos_emb = self.pos_encoder(pos) - pos_emb.stop_gradient = True - return pos_emb - - -class CrossEntropyCriterion(nn.Layer): - def __init__(self, label_smooth_eps): - super(CrossEntropyCriterion, self).__init__() - self.label_smooth_eps = label_smooth_eps - - def forward(self, predict, label, weights): - if self.label_smooth_eps: - label = paddle.squeeze(label, axis=[2]) - label = F.label_smooth( - label=F.one_hot( - x=label, num_classes=predict.shape[-1]), - epsilon=self.label_smooth_eps) - - cost = F.softmax_with_cross_entropy( - logits=predict, - label=label, - soft_label=True if self.label_smooth_eps else False) - weighted_cost = cost * weights - sum_cost = paddle.sum(weighted_cost) - token_num = paddle.sum(weights) - token_num.stop_gradient = True - avg_cost = sum_cost / token_num - return sum_cost, avg_cost, token_num - - -class TransformerDecodeCell(nn.Layer): - def __init__(self, - decoder, - word_embedding=None, - pos_embedding=None, - linear=None, - dropout=0.1): - super(TransformerDecodeCell, self).__init__() - self.decoder = decoder - self.word_embedding = word_embedding - self.pos_embedding = pos_embedding - self.linear = linear - self.dropout = dropout - - def forward(self, inputs, states, static_cache, trg_src_attn_bias, memory): - if states and static_cache: - states = list(zip(states, static_cache)) - - if self.word_embedding: - if not isinstance(inputs, (list, tuple)): - inputs = (inputs) - - word_emb = self.word_embedding(inputs[0]) - pos_emb = self.pos_embedding(inputs[1]) - word_emb = word_emb + pos_emb - inputs = F.dropout( - word_emb, p=self.dropout, - training=False) if self.dropout else word_emb - - cell_outputs, new_states = self.decoder(inputs, memory, None, - trg_src_attn_bias, states) - else: - cell_outputs, new_states = self.decoder(inputs, memory, None, - trg_src_attn_bias, states) - - if self.linear: - cell_outputs = self.linear(cell_outputs) - - new_states = [cache[0] for cache in new_states] - - return cell_outputs, new_states - - -class TransformerBeamSearchDecoder(nn.decode.BeamSearchDecoder): - 
def __init__(self, cell, start_token, end_token, beam_size, - var_dim_in_state): - super(TransformerBeamSearchDecoder, - self).__init__(cell, start_token, end_token, beam_size) - self.cell = cell - self.var_dim_in_state = var_dim_in_state - - def _merge_batch_beams_with_var_dim(self, c): - # Init length of cache is 0, and it increases with decoding carrying on, - # thus need to reshape elaborately - var_dim_in_state = self.var_dim_in_state + 1 # count in beam dim - c = paddle.transpose(c, - list(range(var_dim_in_state, len(c.shape))) + - list(range(0, var_dim_in_state))) - c = paddle.reshape( - c, [0] * (len(c.shape) - var_dim_in_state - ) + [self.batch_size * self.beam_size] + - [int(size) for size in c.shape[-var_dim_in_state + 2:]]) - c = paddle.transpose( - c, - list(range((len(c.shape) + 1 - var_dim_in_state), len(c.shape))) + - list(range(0, (len(c.shape) + 1 - var_dim_in_state)))) - return c - - def _split_batch_beams_with_var_dim(self, c): - var_dim_size = c.shape[self.var_dim_in_state] - c = paddle.reshape( - c, [-1, self.beam_size] + - [int(size) - for size in c.shape[1:self.var_dim_in_state]] + [var_dim_size] + - [int(size) for size in c.shape[self.var_dim_in_state + 1:]]) - return c - - @staticmethod - def tile_beam_merge_with_batch(t, beam_size): - return map_structure( - lambda x: nn.decode.BeamSearchDecoder.tile_beam_merge_with_batch(x, beam_size), - t) - - def step(self, time, inputs, states, **kwargs): - # Steps for decoding. - # Compared to RNN, Transformer has 3D data at every decoding step - inputs = paddle.reshape(inputs, [-1, 1]) # token - pos = paddle.ones_like(inputs) * time # pos - - cell_states = map_structure(self._merge_batch_beams_with_var_dim, - states.cell_states) - - cell_outputs, next_cell_states = self.cell((inputs, pos), cell_states, - **kwargs) - - # Squeeze to adapt to BeamSearchDecoder which use 2D logits - cell_outputs = map_structure( - lambda x: paddle.squeeze(x, [1]) if len(x.shape) == 3 else x, - cell_outputs) - cell_outputs = map_structure(self._split_batch_beams, cell_outputs) - next_cell_states = map_structure(self._split_batch_beams_with_var_dim, - next_cell_states) - - beam_search_output, beam_search_state = self._beam_search_step( - time=time, - logits=cell_outputs, - next_cell_states=next_cell_states, - beam_state=states) - next_inputs, finished = (beam_search_output.predicted_ids, - beam_search_state.finished) - - return (beam_search_output, beam_search_state, next_inputs, finished) - - -class TransformerModel(nn.Layer): - """ - model - """ - - def __init__(self, - src_vocab_size, - trg_vocab_size, - max_length, - n_layer, - n_head, - d_model, - d_inner_hid, - dropout, - weight_sharing, - bos_id=0, - eos_id=1): - super(TransformerModel, self).__init__() - self.trg_vocab_size = trg_vocab_size - self.emb_dim = d_model - self.bos_id = bos_id - self.eos_id = eos_id - self.dropout = dropout - - self.src_word_embedding = WordEmbedding( - vocab_size=src_vocab_size, emb_dim=d_model, bos_idx=self.bos_id) - self.src_pos_embedding = PositionalEmbedding( - emb_dim=d_model, max_length=max_length, bos_idx=self.bos_id) - if weight_sharing: - assert src_vocab_size == trg_vocab_size, ( - "Vocabularies in source and target should be same for weight sharing." 
- ) - self.trg_word_embedding = self.src_word_embedding - self.trg_pos_embedding = self.src_pos_embedding - else: - self.trg_word_embedding = WordEmbedding( - vocab_size=trg_vocab_size, emb_dim=d_model, bos_idx=self.bos_id) - self.trg_pos_embedding = PositionalEmbedding( - emb_dim=d_model, max_length=max_length, bos_idx=self.bos_id) - - self.transformer = nn.Transformer( - d_model=d_model, - nhead=n_head, - num_encoder_layers=n_layer, - num_decoder_layers=n_layer, - dim_feedforward=d_inner_hid, - dropout=dropout, - normalize_before=True) - - if weight_sharing: - self.linear = lambda x: paddle.matmul(x=x, - y=self.trg_word_embedding.word_embedding.weight, - transpose_y=True) - else: - self.linear = nn.Linear( - input_dim=d_model, output_dim=trg_vocab_size, bias_attr=False) - - def forward(self, src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos, - trg_slf_attn_bias, trg_src_attn_bias): - src_emb = self.src_word_embedding(src_word) - src_pos_emb = self.src_pos_embedding(src_pos) - src_emb = src_emb + src_pos_emb - enc_input = F.dropout( - src_emb, p=self.dropout, - training=self.training) if self.dropout else src_emb - - trg_emb = self.trg_word_embedding(trg_word) - trg_pos_emb = self.trg_pos_embedding(trg_pos) - trg_emb = trg_emb + trg_pos_emb - dec_input = F.dropout( - trg_emb, p=self.dropout, - training=self.training) if self.dropout else trg_emb - - dec_output = self.transformer( - enc_input, - dec_input, - src_mask=src_slf_attn_bias, - tgt_mask=trg_slf_attn_bias, - memory_mask=trg_src_attn_bias) - - predict = self.linear(dec_output) - - return predict - - -class InferTransformerModel(TransformerModel): - def __init__(self, - src_vocab_size, - trg_vocab_size, - max_length, - n_layer, - n_head, - d_model, - d_inner_hid, - dropout, - weight_sharing, - bos_id=0, - eos_id=1, - beam_size=4, - max_out_len=256): - args = dict(locals()) - args.pop("self") - args.pop("__class__", None) - self.beam_size = args.pop("beam_size") - self.max_out_len = args.pop("max_out_len") - self.dropout = dropout - super(InferTransformerModel, self).__init__(**args) - - cell = TransformerDecodeCell( - self.transformer.decoder, self.trg_word_embedding, - self.trg_pos_embedding, self.linear, self.dropout) - - self.decode = TransformerBeamSearchDecoder( - cell, bos_id, eos_id, beam_size, var_dim_in_state=2) - - def forward(self, src_word, src_pos, src_slf_attn_bias, trg_word, - trg_src_attn_bias): - # Run encoder - src_emb = self.src_word_embedding(src_word) - src_pos_emb = self.src_pos_embedding(src_pos) - src_emb = src_emb + src_pos_emb - enc_input = F.dropout( - src_emb, p=self.dropout, - training=False) if self.dropout else src_emb - enc_output = self.transformer.encoder(enc_input, src_slf_attn_bias) - - # Init states (caches) for transformer, need to be updated according to selected beam - incremental_cache, static_cache = self.transformer.decoder.gen_cache( - enc_output, do_zip=True) - - static_cache, enc_output, trg_src_attn_bias = TransformerBeamSearchDecoder.tile_beam_merge_with_batch( - (static_cache, enc_output, trg_src_attn_bias), self.beam_size) - - rs, _ = nn.decode.dynamic_decode( - decoder=self.decode, - inits=incremental_cache, - max_step_num=self.max_out_len, - memory=enc_output, - trg_src_attn_bias=trg_src_attn_bias, - static_cache=static_cache) - - return rs diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/transformer.yaml b/PaddleNLP/examples/machine_translation/transformer-rc0/transformer.yaml deleted file mode 100644 index 
18701016e434353d46c822638dacf00c276505e2..0000000000000000000000000000000000000000 --- a/PaddleNLP/examples/machine_translation/transformer-rc0/transformer.yaml +++ /dev/null @@ -1,100 +0,0 @@ -# The frequency to save trained models when training. -save_step: 10000 -# The frequency to fetch and print output when training. -print_step: 100 -# path of the checkpoint, to resume the previous training -init_from_checkpoint: "" -# path of the pretrained model, to better solve the current task -init_from_pretrain_model: "" # "base_model_dygraph/step_100000/" -# path of trained parameters, to make prediction -init_from_params: "./base_model_dygraph/api_saved/" # "base_model_dygraph/step_100000/" -# the directory for saving the trained model -save_model: "trained_models" -# the directory for saving the inference model. -inference_model_dir: "infer_model" -# Set seed for CE or debug -random_seed: None -# The pattern to match training data files. -training_file: "gen_data/wmt16_ende_data_bpe/train.tok.clean.bpe.32000.en-de" -# The pattern to match validation data files. -validation_file: "gen_data/wmt16_ende_data_bpe/newstest2014.tok.bpe.32000.en-de" -# The pattern to match test data files. -predict_file: "gen_data/wmt16_ende_data_bpe/newstest2014.tok.bpe.32000.en-de" -# The file to output the translation results of predict_file to. -output_file: "predict.txt" -# The path of the vocabulary file of the source language. -src_vocab_fpath: "gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000" -# The path of the vocabulary file of the target language. -trg_vocab_fpath: "gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000" -# The <bos>, <eos> and <unk> tokens in the dictionary. -special_token: ["<s>", "<e>", "<unk>"] - -# whether to use cuda -use_gpu: True - -# args for reader, see reader.py for details -token_delimiter: " " -use_token_batch: True -pool_size: 200000 -sort_type: "pool" -shuffle: True -shuffle_batch: True -batch_size: 4096 -infer_batch_size: 64 - -# Hyperparams for training: -# the number of epochs for training -epoch: 30 -# the number of cards (GPUs) to use -world_size: 8 -# the hyper parameters for Adam optimizer. -# This static learning_rate will be multiplied by the LearningRateScheduler -# derived learning rate to get the final learning rate. -learning_rate: 2.0 -beta1: 0.9 -beta2: 0.997 -eps: 1e-9 -# the parameters for learning rate scheduling. -warmup_steps: 8000 -# the weight used to mix up the ground-truth distribution and the fixed -# uniform distribution in label smoothing when training. -# Set this as zero if label smoothing is not wanted. -label_smooth_eps: 0.1 - -# Hyperparams for generation: -# the parameters for beam search. -beam_size: 5 -max_out_len: 256 -# the number of decoded sentences to output. -n_best: 1 - -# Hyperparams for model: -# The following five vocabulary-related configurations will be set -# automatically according to the passed vocabulary path and special tokens. -# size of source word dictionary. -src_vocab_size: 10000 -# size of target word dictionary. -trg_vocab_size: 10000 -# index for <s> token -bos_idx: 0 -# index for <e> token -eos_idx: 1 -# index for <unk> token -unk_idx: 2 -# max length of sequences, which decides the size of the position encoding table. -max_length: 256 -# the dimension for word embeddings, which is also the last dimension of -# the input and output of multi-head attention, position-wise feed-forward -# networks, encoder and decoder. -d_model: 512 -# size of the hidden layer in position-wise feed-forward networks. -d_inner_hid: 2048 -# number of heads used in multi-head attention.
-n_head: 8 -# number of sub-layers to be stacked in the encoder and decoder. -n_layer: 6 -# dropout rates. -dropout: 0.1 -# the flag indicating whether to share embedding and softmax weights. -# vocabularies in source and target should be same for weight sharing. -weight_sharing: True