diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/README.md b/PaddleNLP/examples/machine_translation/transformer-rc0/README.md
deleted file mode 100644
index 9ba9a49253ee8daad0acf70f844a12f0fe92a0cf..0000000000000000000000000000000000000000
--- a/PaddleNLP/examples/machine_translation/transformer-rc0/README.md
+++ /dev/null
@@ -1,161 +0,0 @@
-## Transformer
-
-The following is a brief overview of this example's directory layout:
-
-```text
-.
-├── gen_data.sh       # script to download and preprocess the WMT'16 EN-DE data
-├── images            # figures used in this README
-├── predict.py        # prediction script
-├── reader.py         # data reading interface
-├── README.md         # this document
-├── train.py          # training script
-├── transformer.py    # model definition
-└── transformer.yaml  # configuration file
-```
-
-## Model Introduction
-
-Machine translation (MT) is the process of using a computer to convert text in one natural language (the source language) into another natural language (the target language): the input is a sentence in the source language, and the output is the corresponding sentence in the target language.
-
-This project is a PaddlePaddle implementation of Transformer, the mainstream model for machine translation. It covers model training, prediction, and the use of custom data, so that you can build your own translation model on top of the released code.
-
-
-## Quick Start
-
-### Installation
-
-1. Install PaddlePaddle
-
-    This project requires PaddlePaddle 2.0rc or later (or a suitable develop build). Please follow the [installation guide](https://www.paddlepaddle.org.cn/install/quick) to install it.
-
-2. Download the code
-
-    Clone this repository to your local machine.
-
-3. Dependencies
-
-    For the general environment requirements, please refer to the PaddlePaddle [installation notes](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/index_cn.html).
-    In addition, the following packages are required:
-    * attrdict
-    * pyyaml
-
-
-### Data Preparation
-
-Public dataset: WMT is the most authoritative international evaluation campaign in machine translation. Its English-German task provides a medium-sized dataset that is used in many papers, including the original Transformer paper. We use the [WMT'16 EN-DE dataset](http://www.statmt.org/wmt16/translation-task.html) as the example here; the provided `gen_data.sh` script downloads and preprocesses it.
-
-We also provide a ready-made, preprocessed copy of the WMT'16 EN-DE data for download. It contains the vocabulary (the `vocab_all.bpe.32000` file), the BPE data needed for training (the `train.tok.clean.bpe.32000.en-de` file), the BPE data needed for prediction (`newstest2016.tok.bpe.32000.en-de` and similar files), and the corresponding tokenized data needed to evaluate the predictions (`newstest2016.tok.de` and similar files).
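-
-Each line of the `*.en-de` files holds one source sentence and its target separated by a tab, with BPE tokens separated by spaces; the vocabulary file lists one token per line, so the line number is the token id (with `<s>`, `<e>` and `<unk>` occupying ids 0-2). Below is a minimal sketch of how such a line maps to id sequences, mirroring the logic in `reader.py`; the path is the default from `transformer.yaml`, and the sentence pair is made up purely for illustration:
-
-```python
-# Minimal sketch (not part of the example scripts): convert one tab-separated
-# BPE line into token-id lists the way reader.py does.
-vocab_path = "gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000"
-with open(vocab_path, encoding="utf8") as f:
-    vocab = {tok.rstrip("\n"): idx for idx, tok in enumerate(f)}
-unk_idx = vocab["<unk>"]
-
-pair = "Gut@@ ach : Incre@@ ased safety\tGut@@ ach : Mehr Sicherheit"  # illustrative
-src, trg = pair.split("\t")
-src_ids = [vocab.get(tok, unk_idx) for tok in src.split(" ")]
-trg_ids = [vocab.get(tok, unk_idx) for tok in trg.split(" ")]
-```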
-
-### Single-Machine Training
-
-#### Single GPU
-
-Taking the provided English-German data as an example, first set `world_size` in `transformer.yaml` to 1 (the number of GPUs), then run the following command to train the model:
-
-```sh
-# setting visible devices for training
-CUDA_VISIBLE_DEVICES=0 python train.py
-```
-
-The corresponding parameters can be set in the `transformer.yaml` file.
-
-#### Multiple GPUs
-
-Similarly, set `world_size` in `transformer.yaml` to 8 (the number of GPUs), then run the following command to train on eight GPUs:
-
-```sh
-export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-python train.py
-```
-
-### Inference
-
-Taking the English-German data as an example, once the model has been trained, run the following command to translate the text in the specified file:
-
-```sh
-# setting visible devices for prediction
-export CUDA_VISIBLE_DEVICES=0
-python predict.py
-```
-
-The translations of the text in the file specified by `predict_file` are written to the file specified by `output_file`. When running prediction, `init_from_params` must point to the directory containing the trained model, and `infer_batch_size` can be used to set the prediction batch size. The remaining parameters are documented in the comments in `transformer.yaml` and can be changed there.
-
-Because evaluation depends on the order in which results are written to the output for `predict_file`, prediction is single-GPU only.
-
-
-### Evaluation
-
-Each line of the prediction output is the highest-scoring translation of the corresponding input line. For BPE data, the predicted translations are also in BPE form and must be restored to the original (here, tokenized) form before they can be evaluated correctly. The evaluation procedure is as follows (BLEU is the standard automatic metric for translation tasks):
-
-```sh
-# restore the predictions in predict.txt to tokenized form
-sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt
-# if the BLEU evaluation tool is not available yet, download it first
-# git clone https://github.com/moses-smt/mosesdecoder.git
-# using the English-German newstest2014 test set as an example
-perl gen_data/mosesdecoder/scripts/generic/multi-bleu.perl gen_data/wmt16_ende_data/newstest2014.tok.de < predict.tok.txt
-```
-You should see output similar to the following:
-```
-BLEU = 26.36, 57.7/32.1/20.0/13.0 (BP=1.000, ratio=1.013, hyp_len=63903, ref_len=63078)
-```
-
-## Advanced Usage
-
-### Background
-
-Transformer is the network architecture proposed in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) for sequence-to-sequence (Seq2Seq) learning tasks such as machine translation (MT). It models sequence-to-sequence transformations entirely with attention [1].
-
-Compared with the recurrent neural networks (RNNs) widely used in earlier Seq2Seq models, using (self-)attention to transform the input sequence into the output sequence has the following advantages:
-
-- Lower computational complexity
-  - For a sequence of length n with feature dimension d, the cost is `O(n * d * d)` for an RNN (n time steps, each computing a d-dimensional matrix-vector product) and `O(n * n * d)` for self-attention (d-dimensional dot products, or other similarity functions, between every pair of the n positions); n is usually smaller than d (see the short sketch after this list).
-- Higher parallelism
-  - In an RNN, the computation at each time step depends on the result of the previous step; in self-attention, every position depends only on the input rather than on earlier outputs, so all positions can be computed fully in parallel.
-- Easier learning of long-range dependencies
-  - In an RNN, relating two positions that are n steps apart takes n steps; in self-attention, any two positions are directly connected, and the shorter the path, the easier it is for signals to propagate.
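-
-As a rough, illustrative arithmetic check of the first point (d here is the `d_model` value from `transformer.yaml`; the sequence length is just an example):
-
-```python
-# Illustrative per-layer cost comparison only, not a benchmark.
-n, d = 64, 512            # example sequence length and model width (d_model)
-rnn_cost = n * d * d      # n sequential steps, each a d x d matrix-vector product
-attn_cost = n * n * d     # n * n pairwise dot products of d-dimensional vectors
-print(rnn_cost, attn_cost)  # 16777216 vs. 2097152
-```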
-
-The self-attention based sequence-modeling blocks introduced in Transformer have since been widely adopted in semantic representation models such as BERT [2], with remarkable results.
-
-
-### Model Overview
-
-Transformer also uses the typical encoder-decoder framework of Seq2Seq models; the overall network structure is shown in Figure 1.
-
-
-
-Figure 1. Transformer network architecture
-
-
-As the figure shows, unlike previous Seq2Seq models, the Encoder and Decoder of Transformer no longer contain any RNN structure.
-
-### Model Details
-
-The Encoder in Transformer is a stack of identical layers, each consisting of two sub-layers: multi-head attention and a fully connected feed-forward network.
-- Multi-Head Attention implements self-attention here. Compared with a single attention mechanism, it applies several linear projections to the input, computes attention for each projection separately, concatenates all the results, and applies one more linear transform to produce the output. See Figure 2: the attention itself is dot-product attention, and the dot products are scaled so that large values do not push softmax into its saturated region (a small sketch follows below).
-- The feed-forward network applies the same computation to every position in the sequence (position-wise): two linear transforms with a ReLU activation in between.
-
-In addition, each sub-layer is followed by a residual connection [3] and layer normalization [4] to help gradients propagate and the model converge.
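-
-For reference, here is a minimal NumPy sketch of the scaled dot-product attention described above (single head, no masking; the shapes are illustrative, and this is not the code used by the example, which relies on `paddle.nn.Transformer`):
-
-```python
-import numpy as np
-
-def scaled_dot_product_attention(q, k, v):
-    # q, k, v: [seq_len, d_k]. Scaling by sqrt(d_k) keeps the logits out of
-    # softmax's saturated region, as described above.
-    d_k = q.shape[-1]
-    scores = q @ k.T / np.sqrt(d_k)
-    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
-    weights /= weights.sum(axis=-1, keepdims=True)
-    return weights @ v
-
-q = k = v = np.random.rand(5, 64).astype("float32")
-out = scaled_dot_product_attention(q, k, v)  # shape (5, 64)
-```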
-
-
-
-Figure 2. Multi-Head Attention
-
-
-The Decoder has a structure similar to the Encoder, except that each Decoder layer contains one additional multi-head attention sub-layer that attends over the Encoder output; this encoder-decoder attention also exists in other Seq2Seq models.
-
-## FAQ
-
-**Q:** Why does the prediction output contain fewer samples than the input?
-**A:** If the longest sample exceeds the default `max_length` in `transformer.yaml`, increase `max_length` before running; otherwise over-long samples are filtered out.
-
-**Q:** What if the maximum length at prediction time exceeds the maximum length used during training?
-**A:** The `max_length` used during training determines the size of the saved model's position encoding table. If sequences at prediction time are longer than that, increase `max_length` and a larger position encoding table will be regenerated.
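-
-This is what `predict.py` in this example already does when loading a checkpoint (excerpt from `predict.py` in this directory):
-
-```python
-# Excerpt from predict.py: rebuild the position encoding tables for the
-# (possibly larger) prediction max_length before loading the weights.
-model_dict["encoder.pos_encoder.weight"] = position_encoding_init(
-    args.max_length + 1, args.d_model)
-model_dict["decoder.pos_encoder.weight"] = position_encoding_init(
-    args.max_length + 1, args.d_model)
-transformer.load_dict(model_dict)
-```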
-
-
-## References
-1. Vaswani A, Shazeer N, Parmar N, et al. [Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)[C]//Advances in Neural Information Processing Systems. 2017: 6000-6010.
-2. Devlin J, Chang M W, Lee K, et al. [Bert: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805)[J]. arXiv preprint arXiv:1810.04805, 2018.
-3. He K, Zhang X, Ren S, et al. [Deep residual learning for image recognition](http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
-4. Ba J L, Kiros J R, Hinton G E. [Layer normalization](https://arxiv.org/pdf/1607.06450.pdf)[J]. arXiv preprint arXiv:1607.06450, 2016.
-5. Sennrich R, Haddow B, Birch A. [Neural machine translation of rare words with subword units](https://arxiv.org/pdf/1508.07909)[J]. arXiv preprint arXiv:1508.07909, 2015.
diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/gen_data.sh b/PaddleNLP/examples/machine_translation/transformer-rc0/gen_data.sh
deleted file mode 100755
index b2be343e5719a651c4197f4133f5f7652cb3f44b..0000000000000000000000000000000000000000
--- a/PaddleNLP/examples/machine_translation/transformer-rc0/gen_data.sh
+++ /dev/null
@@ -1,217 +0,0 @@
-#! /usr/bin/env bash
-
-set -e
-
-OUTPUT_DIR=$PWD/gen_data
-
-###############################################################################
-# change these variables for other WMT data
-###############################################################################
-OUTPUT_DIR_DATA="${OUTPUT_DIR}/wmt16_ende_data"
-OUTPUT_DIR_BPE_DATA="${OUTPUT_DIR}/wmt16_ende_data_bpe"
-LANG1="en"
-LANG2="de"
-# each of TRAIN_DATA: data_url data_file_lang1 data_file_lang2
-TRAIN_DATA=(
-'http://www.statmt.org/europarl/v7/de-en.tgz'
-'europarl-v7.de-en.en' 'europarl-v7.de-en.de'
-'http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz'
-'commoncrawl.de-en.en' 'commoncrawl.de-en.de'
-'http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz'
-'news-commentary-v11.de-en.en' 'news-commentary-v11.de-en.de'
-)
-# each of DEV_TEST_DATA: data_url data_file_lang1 data_file_lang2
-DEV_TEST_DATA=(
-'http://data.statmt.org/wmt16/translation-task/dev.tgz'
-'newstest201[45]-deen-ref.en.sgm' 'newstest201[45]-deen-src.de.sgm'
-'http://data.statmt.org/wmt16/translation-task/test.tgz'
-'newstest2016-deen-ref.en.sgm' 'newstest2016-deen-src.de.sgm'
-)
-###############################################################################
-
-###############################################################################
-# change these variables for other WMT data
-###############################################################################
-# OUTPUT_DIR_DATA="${OUTPUT_DIR}/wmt14_enfr_data"
-# OUTPUT_DIR_BPE_DATA="${OUTPUT_DIR}/wmt14_enfr_data_bpe"
-# LANG1="en"
-# LANG2="fr"
-# # each of TRAIN_DATA: data_url data_file_lang1 data_file_lang2
-# TRAIN_DATA=(
-# 'http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz'
-# 'commoncrawl.fr-en.en' 'commoncrawl.fr-en.fr'
-# 'http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz'
-# 'training/europarl-v7.fr-en.en' 'training/europarl-v7.fr-en.fr'
-# 'http://www.statmt.org/wmt14/training-parallel-nc-v9.tgz'
-# 'training/news-commentary-v9.fr-en.en' 'training/news-commentary-v9.fr-en.fr'
-# 'http://www.statmt.org/wmt10/training-giga-fren.tar'
-# 'giga-fren.release2.fixed.en.*' 'giga-fren.release2.fixed.fr.*'
-# 'http://www.statmt.org/wmt13/training-parallel-un.tgz'
-# 'un/undoc.2000.fr-en.en' 'un/undoc.2000.fr-en.fr'
-# )
-# # each of DEV_TEST_DATA: data_url data_tgz data_file_lang1 data_file_lang2
-# DEV_TEST_DATA=(
-# 'http://data.statmt.org/wmt16/translation-task/dev.tgz'
-# '.*/newstest201[45]-fren-ref.en.sgm' '.*/newstest201[45]-fren-src.fr.sgm'
-# 'http://data.statmt.org/wmt16/translation-task/test.tgz'
-# '.*/newstest2016-fren-ref.en.sgm' '.*/newstest2016-fren-src.fr.sgm'
-# )
-###############################################################################
-
-mkdir -p $OUTPUT_DIR_DATA $OUTPUT_DIR_BPE_DATA
-
-# Extract training data
-for ((i=0;i<${#TRAIN_DATA[@]};i+=3)); do
- data_url=${TRAIN_DATA[i]}
- data_tgz=${data_url##*/} # training-parallel-commoncrawl.tgz
- data=${data_tgz%.*} # training-parallel-commoncrawl
- data_lang1=${TRAIN_DATA[i+1]}
- data_lang2=${TRAIN_DATA[i+2]}
- if [ ! -e ${OUTPUT_DIR_DATA}/${data_tgz} ]; then
- echo "Download "${data_url}
- wget -O ${OUTPUT_DIR_DATA}/${data_tgz} ${data_url}
- fi
-
- if [ ! -d ${OUTPUT_DIR_DATA}/${data} ]; then
- echo "Extract "${data_tgz}
- mkdir -p ${OUTPUT_DIR_DATA}/${data}
- tar_type=${data_tgz:0-3}
- if [ ${tar_type} == "tar" ]; then
- tar -xvf ${OUTPUT_DIR_DATA}/${data_tgz} -C ${OUTPUT_DIR_DATA}/${data}
- else
- tar -xvzf ${OUTPUT_DIR_DATA}/${data_tgz} -C ${OUTPUT_DIR_DATA}/${data}
- fi
- fi
- # concatenate all training data
- for data_lang in $data_lang1 $data_lang2; do
- for f in `find ${OUTPUT_DIR_DATA}/${data} -regex ".*/${data_lang}"`; do
- data_dir=`dirname $f`
- data_file=`basename $f`
- f_base=${f%.*}
- f_ext=${f##*.}
- if [ $f_ext == "gz" ]; then
- gunzip $f
- l=${f_base##*.}
- f_base=${f_base%.*}
- else
- l=${f_ext}
- fi
-
- if [ $i -eq 0 ]; then
- cat ${f_base}.$l > ${OUTPUT_DIR_DATA}/train.$l
- else
- cat ${f_base}.$l >> ${OUTPUT_DIR_DATA}/train.$l
- fi
- done
- done
-done
-
-# Clone mosesdecoder
-if [ ! -d ${OUTPUT_DIR}/mosesdecoder ]; then
- echo "Cloning moses for data processing"
- git clone https://github.com/moses-smt/mosesdecoder.git ${OUTPUT_DIR}/mosesdecoder
-fi
-
-# Extract develop and test data
-dev_test_data=""
-for ((i=0;i<${#DEV_TEST_DATA[@]};i+=3)); do
- data_url=${DEV_TEST_DATA[i]}
- data_tgz=${data_url##*/} # training-parallel-commoncrawl.tgz
- data=${data_tgz%.*} # training-parallel-commoncrawl
- data_lang1=${DEV_TEST_DATA[i+1]}
- data_lang2=${DEV_TEST_DATA[i+2]}
- if [ ! -e ${OUTPUT_DIR_DATA}/${data_tgz} ]; then
- echo "Download "${data_url}
- wget -O ${OUTPUT_DIR_DATA}/${data_tgz} ${data_url}
- fi
-
- if [ ! -d ${OUTPUT_DIR_DATA}/${data} ]; then
- echo "Extract "${data_tgz}
- mkdir -p ${OUTPUT_DIR_DATA}/${data}
- tar_type=${data_tgz:0-3}
- if [ ${tar_type} == "tar" ]; then
- tar -xvf ${OUTPUT_DIR_DATA}/${data_tgz} -C ${OUTPUT_DIR_DATA}/${data}
- else
- tar -xvzf ${OUTPUT_DIR_DATA}/${data_tgz} -C ${OUTPUT_DIR_DATA}/${data}
- fi
- fi
-
- for data_lang in $data_lang1 $data_lang2; do
- for f in `find ${OUTPUT_DIR_DATA}/${data} -regex ".*/${data_lang}"`; do
- data_dir=`dirname $f`
- data_file=`basename $f`
- data_out=`echo ${data_file} | cut -d '-' -f 1` # newstest2016
- l=`echo ${data_file} | cut -d '.' -f 2` # en
- dev_test_data="${dev_test_data}\|${data_out}" # to make regexp
- if [ ! -e ${OUTPUT_DIR_DATA}/${data_out}.$l ]; then
- ${OUTPUT_DIR}/mosesdecoder/scripts/ems/support/input-from-sgm.perl \
- < $f > ${OUTPUT_DIR_DATA}/${data_out}.$l
- fi
- done
- done
-done
-
-# Tokenize data
-for l in ${LANG1} ${LANG2}; do
- for f in `ls ${OUTPUT_DIR_DATA}/*.$l | grep "\(train${dev_test_data}\)\.$l$"`; do
- f_base=${f%.*} # dir/train dir/newstest2016
- f_out=$f_base.tok.$l
- if [ ! -e $f_out ]; then
- echo "Tokenize "$f
- ${OUTPUT_DIR}/mosesdecoder/scripts/tokenizer/tokenizer.perl -q -l $l -threads 8 < $f > $f_out
- fi
- done
-done
-
-# Clean data
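-# clean-corpus-n.perl keeps only sentence pairs whose lengths fall within the
-# given range (1 to 80 tokens here) and drops badly length-mismatched pairs.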
-for f in ${OUTPUT_DIR_DATA}/train.${LANG1} ${OUTPUT_DIR_DATA}/train.tok.${LANG1}; do
- f_base=${f%.*} # dir/train dir/train.tok
- f_out=${f_base}.clean
- if [ ! -e $f_out.${LANG1} ] && [ ! -e $f_out.${LANG2} ]; then
- echo "Clean "${f_base}
- ${OUTPUT_DIR}/mosesdecoder/scripts/training/clean-corpus-n.perl $f_base ${LANG1} ${LANG2} ${f_out} 1 80
- fi
-done
-
-python -m pip install subword-nmt
-
-# Generate BPE data and vocabulary
-for num_operations in 32000; do
- if [ ! -e ${OUTPUT_DIR_BPE_DATA}/bpe.${num_operations} ]; then
- echo "Learn BPE with ${num_operations} merge operations"
- cat ${OUTPUT_DIR_DATA}/train.tok.clean.${LANG1} ${OUTPUT_DIR_DATA}/train.tok.clean.${LANG2} | \
- subword-nmt learn-bpe -s $num_operations > ${OUTPUT_DIR_BPE_DATA}/bpe.${num_operations}
- fi
-
- for l in ${LANG1} ${LANG2}; do
- for f in `ls ${OUTPUT_DIR_DATA}/*.$l | grep "\(train${dev_test_data}\)\.tok\(\.clean\)\?\.$l$"`; do
- f_base=${f%.*} # dir/train.tok dir/train.tok.clean dir/newstest2016.tok
- f_base=${f_base##*/} # train.tok train.tok.clean newstest2016.tok
- f_out=${OUTPUT_DIR_BPE_DATA}/${f_base}.bpe.${num_operations}.$l
- if [ ! -e $f_out ]; then
- echo "Apply BPE to "$f
- subword-nmt apply-bpe -c ${OUTPUT_DIR_BPE_DATA}/bpe.${num_operations} < $f > $f_out
- fi
- done
- done
-
- if [ ! -e ${OUTPUT_DIR_BPE_DATA}/vocab.bpe.${num_operations} ]; then
- echo "Create vocabulary for BPE data"
- cat ${OUTPUT_DIR_BPE_DATA}/train.tok.clean.bpe.${num_operations}.${LANG1} ${OUTPUT_DIR_BPE_DATA}/train.tok.clean.bpe.${num_operations}.${LANG2} | \
- subword-nmt get-vocab | cut -f1 -d ' ' > ${OUTPUT_DIR_BPE_DATA}/vocab.bpe.${num_operations}
- fi
-done
-
-# Adapt to the reader
-for f in ${OUTPUT_DIR_BPE_DATA}/*.bpe.${num_operations}.${LANG1}; do
- f_base=${f%.*} # dir/train.tok.clean.bpe.32000 dir/newstest2016.tok.bpe.32000
- f_out=${f_base}.${LANG1}-${LANG2}
- if [ ! -e $f_out ]; then
- paste -d '\t' $f_base.${LANG1} $f_base.${LANG2} > $f_out
- fi
-done
-if [ ! -e ${OUTPUT_DIR_BPE_DATA}/vocab_all.bpe.${num_operations} ]; then
-    sed '1i\<s>\n<e>\n<unk>' ${OUTPUT_DIR_BPE_DATA}/vocab.bpe.${num_operations} > ${OUTPUT_DIR_BPE_DATA}/vocab_all.bpe.${num_operations}
-fi
-
-echo "All done."
diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/predict.py b/PaddleNLP/examples/machine_translation/transformer-rc0/predict.py
deleted file mode 100644
index db7121d246346e2fef123a813aef639892e89309..0000000000000000000000000000000000000000
--- a/PaddleNLP/examples/machine_translation/transformer-rc0/predict.py
+++ /dev/null
@@ -1,109 +0,0 @@
-import logging
-import os
-import six
-import sys
-import time
-
-import numpy as np
-import paddle
-
-import yaml
-from attrdict import AttrDict
-from pprint import pprint
-
-# Include task-specific libs
-import reader
-from transformer import InferTransformerModel, position_encoding_init
-
-
-def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False):
- """
- Post-process the decoded sequence.
- """
- eos_pos = len(seq) - 1
- for i, idx in enumerate(seq):
- if idx == eos_idx:
- eos_pos = i
- break
- seq = [
- idx for idx in seq[:eos_pos + 1]
- if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)
- ]
- return seq
-
-
-@paddle.fluid.dygraph.no_grad()
-def do_predict(args):
- if args.use_gpu:
- place = "gpu:0"
- else:
- place = "cpu"
-
- paddle.set_device(place)
-
- # Define data loader
- (test_loader,
- test_steps_fn), trg_idx2word = reader.create_infer_loader(args)
-
- # Define model
- transformer = InferTransformerModel(
- src_vocab_size=args.src_vocab_size,
- trg_vocab_size=args.trg_vocab_size,
- max_length=args.max_length + 1,
- n_layer=args.n_layer,
- n_head=args.n_head,
- d_model=args.d_model,
- d_inner_hid=args.d_inner_hid,
- dropout=args.dropout,
- weight_sharing=args.weight_sharing,
- bos_id=args.bos_idx,
- eos_id=args.eos_idx,
- beam_size=args.beam_size,
- max_out_len=args.max_out_len)
-
- # Load the trained model
- assert args.init_from_params, (
- "Please set init_from_params to load the infer model.")
-
- model_dict = paddle.load(
- os.path.join(args.init_from_params, "transformer.pdparams"))
-
- # To avoid a longer length than training, reset the size of position
- # encoding to max_length
- model_dict["encoder.pos_encoder.weight"] = position_encoding_init(
- args.max_length + 1, args.d_model)
- model_dict["decoder.pos_encoder.weight"] = position_encoding_init(
- args.max_length + 1, args.d_model)
- transformer.load_dict(model_dict)
-
- # Set evaluate mode
- transformer.eval()
-
- f = open(args.output_file, "w")
- for input_data in test_loader:
- (src_word, src_pos, src_slf_attn_bias, trg_word,
- trg_src_attn_bias) = input_data
- finished_seq = transformer(
- src_word=src_word,
- src_pos=src_pos,
- src_slf_attn_bias=src_slf_attn_bias,
- trg_word=trg_word,
- trg_src_attn_bias=trg_src_attn_bias)
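-        # dynamic_decode returns ids shaped [batch_size, seq_len, beam_size];
-        # transpose to [batch_size, beam_size, seq_len] so that each beam can
-        # be post-processed as a row below.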
- finished_seq = finished_seq.numpy().transpose([0, 2, 1])
- for ins in finished_seq:
- for beam_idx, beam in enumerate(ins):
- if beam_idx >= args.n_best:
- break
- id_list = post_process_seq(beam, args.bos_idx, args.eos_idx)
- word_list = [trg_idx2word[id] for id in id_list]
- sequence = " ".join(word_list) + "\n"
-                f.write(sequence)
-    f.close()
-
-
-if __name__ == "__main__":
- yaml_file = "./transformer.yaml"
- with open(yaml_file, 'rt') as f:
- args = AttrDict(yaml.safe_load(f))
- pprint(args)
-
- do_predict(args)
diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/reader.py b/PaddleNLP/examples/machine_translation/transformer-rc0/reader.py
deleted file mode 100644
index 7fc0c5d70593af9819542b4012211079c240037f..0000000000000000000000000000000000000000
--- a/PaddleNLP/examples/machine_translation/transformer-rc0/reader.py
+++ /dev/null
@@ -1,509 +0,0 @@
-import glob
-import sys
-import os
-import io
-import itertools
-from functools import partial
-import numpy as np
-from paddle.io import BatchSampler, DataLoader, Dataset
-
-
-def create_infer_loader(args):
- dataset = TransformerDataset(
- fpattern=args.predict_file,
- src_vocab_fpath=args.src_vocab_fpath,
- trg_vocab_fpath=args.trg_vocab_fpath,
- token_delimiter=args.token_delimiter,
- start_mark=args.special_token[0],
- end_mark=args.special_token[1],
- unk_mark=args.special_token[2])
- args.src_vocab_size, args.trg_vocab_size, args.bos_idx, args.eos_idx, \
- args.unk_idx = dataset.get_vocab_summary()
- trg_idx2word = TransformerDataset.load_dict(
- dict_path=args.trg_vocab_fpath, reverse=True)
- batch_sampler = TransformerBatchSampler(
- dataset=dataset,
- use_token_batch=False,
- batch_size=args.infer_batch_size,
- max_length=args.max_length)
- data_loader = DataLoader(
- dataset=dataset,
- batch_sampler=batch_sampler,
- collate_fn=partial(
- prepare_infer_input,
- bos_idx=args.bos_idx,
- eos_idx=args.eos_idx,
- src_pad_idx=args.eos_idx,
- n_head=args.n_head),
- num_workers=0,
- return_list=True)
- data_loaders = (data_loader, batch_sampler.__len__)
- return data_loaders, trg_idx2word
-
-
-def create_data_loader(args, world_size, rank):
- data_loaders = [(None, None)] * 2
- data_files = [args.training_file, args.validation_file
- ] if args.validation_file else [args.training_file]
- for i, data_file in enumerate(data_files):
- dataset = TransformerDataset(
- fpattern=data_file,
- src_vocab_fpath=args.src_vocab_fpath,
- trg_vocab_fpath=args.trg_vocab_fpath,
- token_delimiter=args.token_delimiter,
- start_mark=args.special_token[0],
- end_mark=args.special_token[1],
- unk_mark=args.special_token[2])
- args.src_vocab_size, args.trg_vocab_size, args.bos_idx, args.eos_idx, \
- args.unk_idx = dataset.get_vocab_summary()
- batch_sampler = TransformerBatchSampler(
- dataset=dataset,
- batch_size=args.batch_size,
- pool_size=args.pool_size,
- sort_type=args.sort_type,
- shuffle=args.shuffle,
- shuffle_batch=args.shuffle_batch,
- use_token_batch=args.use_token_batch,
- max_length=args.max_length,
- distribute_mode=True if i == 0 else False,
- world_size=world_size,
- rank=rank)
- data_loader = DataLoader(
- dataset=dataset,
- batch_sampler=batch_sampler,
- collate_fn=partial(
- prepare_train_input,
- bos_idx=args.bos_idx,
- eos_idx=args.eos_idx,
- src_pad_idx=args.eos_idx,
- trg_pad_idx=args.eos_idx,
- n_head=args.n_head),
- num_workers=0,
- return_list=True)
- data_loaders[i] = (data_loader, batch_sampler.__len__)
- return data_loaders
-
-
-def prepare_train_input(insts, bos_idx, eos_idx, src_pad_idx, trg_pad_idx,
- n_head):
- """
- Put all padded data needed by training into a list.
- """
- src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data(
- [inst[0] + [eos_idx] for inst in insts],
- src_pad_idx,
- n_head,
- is_target=False)
- src_word = src_word.reshape(-1, src_max_len)
- src_pos = src_pos.reshape(-1, src_max_len)
- trg_word, trg_pos, trg_slf_attn_bias, trg_max_len = pad_batch_data(
- [[bos_idx] + inst[1] for inst in insts],
- trg_pad_idx,
- n_head,
- is_target=True)
- trg_word = trg_word.reshape(-1, trg_max_len)
- trg_pos = trg_pos.reshape(-1, trg_max_len)
-
- trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :],
- [1, 1, trg_max_len, 1]).astype("float32")
-
- lbl_word, lbl_weight, lbl_max_len, num_token = pad_batch_data(
- [inst[1] + [eos_idx] for inst in insts],
- trg_pad_idx,
- n_head,
- is_target=False,
- is_label=True,
- return_attn_bias=False,
- return_max_len=True,
- return_num_token=True)
- lbl_word = lbl_word.reshape(-1, lbl_max_len, 1)
- lbl_weight = lbl_weight.reshape(-1, lbl_max_len, 1)
-
- data_inputs = [
- src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos,
- trg_slf_attn_bias, trg_src_attn_bias, lbl_word, lbl_weight
- ]
-
- return data_inputs
-
-
-def prepare_infer_input(insts, bos_idx, eos_idx, src_pad_idx, n_head):
- """
- Put all padded data needed by beam search decoder into a list.
- """
- src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data(
- [inst[0] + [eos_idx] for inst in insts],
- src_pad_idx,
- n_head,
- is_target=False)
- trg_word = np.asarray([[bos_idx]] * len(insts), dtype="int64")
- trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :],
- [1, 1, 1, 1]).astype("float32")
- trg_word = trg_word.reshape(-1, 1)
- src_word = src_word.reshape(-1, src_max_len)
- src_pos = src_pos.reshape(-1, src_max_len)
-
- data_inputs = [
- src_word, src_pos, src_slf_attn_bias, trg_word, trg_src_attn_bias
- ]
-
- return data_inputs
-
-
-def pad_batch_data(insts,
- pad_idx,
- n_head,
- is_target=False,
- is_label=False,
- return_attn_bias=True,
- return_max_len=True,
- return_num_token=False):
- """
- Pad the instances to the max sequence length in batch, and generate the
- corresponding position data and attention bias.
- """
- return_list = []
- max_len = max(len(inst) for inst in insts)
- # Any token included in dict can be used to pad, since the paddings' loss
- # will be masked out by weights and make no effect on parameter gradients.
- inst_data = np.array(
- [inst + [pad_idx] * (max_len - len(inst)) for inst in insts])
- return_list += [inst_data.astype("int64").reshape([-1, 1])]
- if is_label: # label weight
- inst_weight = np.array([[1.] * len(inst) + [0.] * (max_len - len(inst))
- for inst in insts])
- return_list += [inst_weight.astype("float32").reshape([-1, 1])]
- else: # position data
- inst_pos = np.array([
- list(range(0, len(inst))) + [0] * (max_len - len(inst))
- for inst in insts
- ])
- return_list += [inst_pos.astype("int64").reshape([-1, 1])]
- if return_attn_bias:
- if is_target:
- # This is used to avoid attention on paddings and subsequent
- # words.
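-            # The resulting bias has shape [batch_size, n_head, max_len, max_len];
-            # masked positions hold -1e9 so they vanish after softmax.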
- slf_attn_bias_data = np.ones((inst_data.shape[0], max_len, max_len))
- slf_attn_bias_data = np.triu(slf_attn_bias_data,
- 1).reshape([-1, 1, max_len, max_len])
- slf_attn_bias_data = np.tile(slf_attn_bias_data,
- [1, n_head, 1, 1]) * [-1e9]
- else:
- # This is used to avoid attention on paddings.
- slf_attn_bias_data = np.array([[0] * len(inst) + [-1e9] *
- (max_len - len(inst))
- for inst in insts])
- slf_attn_bias_data = np.tile(
- slf_attn_bias_data.reshape([-1, 1, 1, max_len]),
- [1, n_head, max_len, 1])
- return_list += [slf_attn_bias_data.astype("float32")]
- if return_max_len:
- return_list += [max_len]
- if return_num_token:
- num_token = 0
- for inst in insts:
- num_token += len(inst)
- return_list += [num_token]
- return return_list if len(return_list) > 1 else return_list[0]
-
-
-class SortType(object):
- GLOBAL = 'global'
- POOL = 'pool'
- NONE = "none"
-
-
-class Converter(object):
- def __init__(self, vocab, beg, end, unk, delimiter, add_beg, add_end):
- self._vocab = vocab
- self._beg = beg
- self._end = end
- self._unk = unk
- self._delimiter = delimiter
- self._add_beg = add_beg
- self._add_end = add_end
-
- def __call__(self, sentence):
- return ([self._beg] if self._add_beg else []) + [
- self._vocab.get(w, self._unk)
- for w in sentence.split(self._delimiter)
- ] + ([self._end] if self._add_end else [])
-
-
-class ComposedConverter(object):
- def __init__(self, converters):
- self._converters = converters
-
- def __call__(self, fields):
- return [
- converter(field)
- for field, converter in zip(fields, self._converters)
- ]
-
-
-class SentenceBatchCreator(object):
- def __init__(self, batch_size):
- self.batch = []
- self._batch_size = batch_size
-
- def append(self, info):
- self.batch.append(info)
- if len(self.batch) == self._batch_size:
- tmp = self.batch
- self.batch = []
- return tmp
-
-
-class TokenBatchCreator(object):
- def __init__(self, batch_size):
- self.batch = []
- self.max_len = -1
- self._batch_size = batch_size
-
- def append(self, info):
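-        # Batch by token count: keep (longest length in the batch) * (number of
-        # sentences in the batch) within the budget given by batch_size.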
- cur_len = info.max_len
- max_len = max(self.max_len, cur_len)
- if max_len * (len(self.batch) + 1) > self._batch_size:
- result = self.batch
- self.batch = [info]
- self.max_len = cur_len
- return result
- else:
- self.max_len = max_len
- self.batch.append(info)
-
-
-class SampleInfo(object):
- def __init__(self, i, lens):
- self.i = i
- # Take bos and eos into account
- self.min_len = min(lens[0] + 1, lens[1] + 2)
- self.max_len = max(lens[0] + 1, lens[1] + 2)
-
-
-class MinMaxFilter(object):
- def __init__(self, max_len, min_len, underlying_creator):
- self._min_len = min_len
- self._max_len = max_len
- self._creator = underlying_creator
-
- def append(self, info):
- if info.max_len > self._max_len or info.min_len < self._min_len:
- return
- else:
- return self._creator.append(info)
-
- @property
- def batch(self):
- return self._creator.batch
-
-
-class TransformerDataset(Dataset):
- def __init__(self,
- src_vocab_fpath,
- trg_vocab_fpath,
- fpattern,
- field_delimiter="\t",
- token_delimiter=" ",
-                 start_mark="<s>",
-                 end_mark="<e>",
-                 unk_mark="<unk>",
- trg_fpattern=None):
- self._src_vocab = self.load_dict(src_vocab_fpath)
- self._trg_vocab = self.load_dict(trg_vocab_fpath)
- self._bos_idx = self._src_vocab[start_mark]
- self._eos_idx = self._src_vocab[end_mark]
- self._unk_idx = self._src_vocab[unk_mark]
- self._field_delimiter = field_delimiter
- self._token_delimiter = token_delimiter
- self.load_src_trg_ids(fpattern, trg_fpattern)
-
- def load_src_trg_ids(self, fpattern, trg_fpattern=None):
- src_converter = Converter(
- vocab=self._src_vocab,
- beg=self._bos_idx,
- end=self._eos_idx,
- unk=self._unk_idx,
- delimiter=self._token_delimiter,
- add_beg=False,
- add_end=False)
-
- trg_converter = Converter(
- vocab=self._trg_vocab,
- beg=self._bos_idx,
- end=self._eos_idx,
- unk=self._unk_idx,
- delimiter=self._token_delimiter,
- add_beg=False,
- add_end=False)
-
- converters = ComposedConverter([src_converter, trg_converter])
-
- self._src_seq_ids = []
- self._trg_seq_ids = []
- self._sample_infos = []
-
- slots = [self._src_seq_ids, self._trg_seq_ids]
- for i, line in enumerate(self._load_lines(fpattern, trg_fpattern)):
- lens = []
- for field, slot in zip(converters(line), slots):
- slot.append(field)
- lens.append(len(field))
- self._sample_infos.append(SampleInfo(i, lens))
-
- def _load_lines(self, fpattern, trg_fpattern=None):
- fpaths = glob.glob(fpattern)
-        fpaths = sorted(fpaths) # TODO: Add custom sort
- assert len(fpaths) > 0, "no matching file to the provided data path"
-
- (f_mode, f_encoding, endl) = ("r", "utf8", "\n")
-
- if trg_fpattern is None:
- for fpath in fpaths:
- with io.open(fpath, f_mode, encoding=f_encoding) as f:
- for line in f:
- fields = line.strip(endl).split(self._field_delimiter)
- yield fields
- else:
- # Separated source and target language data files
- # assume we can get aligned data by sort the two language files
- # TODO: Need more rigorous check
- trg_fpaths = glob.glob(trg_fpattern)
- trg_fpaths = sorted(trg_fpaths)
-            assert len(fpaths) == len(trg_fpaths), (
-                "the number of source language data files must match the "
-                "number of target language data files")
-
- for fpath, trg_fpath in zip(fpaths, trg_fpaths):
- with io.open(fpath, f_mode, encoding=f_encoding) as f:
- with io.open(
- trg_fpath, f_mode, encoding=f_encoding) as trg_f:
- for line in zip(f, trg_f):
- fields = [field.strip(endl) for field in line]
- yield fields
-
- @staticmethod
- def load_dict(dict_path, reverse=False):
- word_dict = {}
- (f_mode, f_encoding, endl) = ("r", "utf8", "\n")
- with io.open(dict_path, f_mode, encoding=f_encoding) as fdict:
- for idx, line in enumerate(fdict):
- if reverse:
- word_dict[idx] = line.strip(endl)
- else:
- word_dict[line.strip(endl)] = idx
- return word_dict
-
- def get_vocab_summary(self):
- return len(self._src_vocab), len(
- self._trg_vocab), self._bos_idx, self._eos_idx, self._unk_idx
-
- def __getitem__(self, idx):
- return (self._src_seq_ids[idx], self._trg_seq_ids[idx]
- ) if self._trg_seq_ids else self._src_seq_ids[idx]
-
- def __len__(self):
- return len(self._sample_infos)
-
-
-class TransformerBatchSampler(BatchSampler):
- def __init__(self,
- dataset,
- batch_size,
- pool_size=10000,
- sort_type=SortType.NONE,
- min_length=0,
- max_length=100,
- shuffle=False,
- shuffle_batch=False,
- use_token_batch=False,
- clip_last_batch=False,
- distribute_mode=True,
- seed=0,
- world_size=1,
- rank=0):
- for arg, value in locals().items():
- if arg != "self":
- setattr(self, "_" + arg, value)
- self._random = np.random
- self._random.seed(seed)
- # for multi-devices
- self._distribute_mode = distribute_mode
- self._nranks = world_size
- self._local_rank = rank
-
- def __iter__(self):
- # global sort or global shuffle
- if self._sort_type == SortType.GLOBAL:
- infos = sorted(self._dataset._sample_infos, key=lambda x: x.max_len)
- else:
- if self._shuffle:
- infos = self._dataset._sample_infos
- self._random.shuffle(infos)
- else:
- infos = self._dataset._sample_infos
-
- if self._sort_type == SortType.POOL:
- reverse = True
- for i in range(0, len(infos), self._pool_size):
- # To avoid placing short next to long sentences
- reverse = not reverse
- infos[i:i + self._pool_size] = sorted(
- infos[i:i + self._pool_size],
- key=lambda x: x.max_len,
- reverse=reverse)
-
- batches = []
-        if self._use_token_batch:
-            batch_creator = TokenBatchCreator(self._batch_size)
-        else:
-            batch_creator = SentenceBatchCreator(
-                self._batch_size * self._nranks)
- batch_creator = MinMaxFilter(self._max_length, self._min_length,
- batch_creator)
-
- for info in infos:
- batch = batch_creator.append(info)
- if batch is not None:
- batches.append(batch)
-
- if not self._clip_last_batch and len(batch_creator.batch) != 0:
- batches.append(batch_creator.batch)
-
- if self._shuffle_batch:
- self._random.shuffle(batches)
-
- if not self._use_token_batch:
-            # When batching by sentence count, neighboring batches that will be
-            # fed and run in parallel should have similar lengths (and thus
-            # similar computational cost) after shuffling, so they are shuffled
-            # as a whole above and split per device here.
- batches = [[
- batch[self._batch_size * i:self._batch_size * (i + 1)]
- for i in range(self._nranks)
- ] for batch in batches]
- batches = list(itertools.chain.from_iterable(batches))
- self.batch_number = (len(batches) + self._nranks - 1) // self._nranks
-
- # for multi-device
- for batch_id, batch in enumerate(batches):
- if not self._distribute_mode or (
- batch_id % self._nranks == self._local_rank):
- batch_indices = [info.i for info in batch]
- yield batch_indices
- if self._distribute_mode and len(batches) % self._nranks != 0:
- if self._local_rank >= len(batches) % self._nranks:
- # use previous data to pad
- yield batch_indices
-
- def __len__(self):
-        if hasattr(self, "batch_number"):
- return self.batch_number
- if not self._use_token_batch:
- batch_number = (
- len(self._dataset) + self._batch_size * self._nranks - 1) // (
- self._batch_size * self._nranks)
- else:
- # For uncertain batch number, the actual value is self.batch_number
- batch_number = sys.maxsize
- return batch_number
diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/train.py b/PaddleNLP/examples/machine_translation/transformer-rc0/train.py
deleted file mode 100644
index e12588dba54eac17f824035178374037067081fb..0000000000000000000000000000000000000000
--- a/PaddleNLP/examples/machine_translation/transformer-rc0/train.py
+++ /dev/null
@@ -1,225 +0,0 @@
-import logging
-import os
-import six
-import sys
-import time
-
-import numpy as np
-import yaml
-from attrdict import AttrDict
-from pprint import pprint
-
-import paddle
-import paddle.distributed as dist
-
-import reader
-from transformer import TransformerModel, CrossEntropyCriterion
-
-FORMAT = '%(asctime)s-%(levelname)s: %(message)s'
-logging.basicConfig(level=logging.INFO, format=FORMAT)
-logger = logging.getLogger(__name__)
-
-
-def do_train(args):
- if args.use_gpu:
- rank = dist.get_rank()
- trainer_count = dist.get_world_size()
- else:
- rank = 0
- trainer_count = 1
- paddle.set_device("cpu")
-
- if trainer_count > 1:
- dist.init_parallel_env()
-
- # Set seed for CE
- random_seed = eval(str(args.random_seed))
- if random_seed is not None:
- paddle.seed(random_seed)
-
- # Define data loader
- (train_loader, train_steps_fn), (
- eval_loader,
- eval_steps_fn) = reader.create_data_loader(args, trainer_count, rank)
-
- # Define model
- transformer = TransformerModel(
- src_vocab_size=args.src_vocab_size,
- trg_vocab_size=args.trg_vocab_size,
- max_length=args.max_length + 1,
- n_layer=args.n_layer,
- n_head=args.n_head,
- d_model=args.d_model,
- d_inner_hid=args.d_inner_hid,
- dropout=args.dropout,
- weight_sharing=args.weight_sharing,
- bos_id=args.bos_idx,
- eos_id=args.eos_idx)
-
- # Define loss
- criterion = CrossEntropyCriterion(args.label_smooth_eps)
-
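-    # NoamDecay implements the schedule from the Transformer paper:
-    # lr = learning_rate * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)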
- scheduler = paddle.optimizer.lr.NoamDecay(
- args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0)
- # Define optimizer
- optimizer = paddle.optimizer.Adam(
- learning_rate=scheduler,
- beta1=args.beta1,
- beta2=args.beta2,
- epsilon=float(args.eps),
- parameters=transformer.parameters())
-
- # Init from some checkpoint, to resume the previous training
- if args.init_from_checkpoint:
- model_dict = paddle.load(
- os.path.join(args.init_from_checkpoint, "transformer.pdparams"))
- opt_dict = paddle.load(
- os.path.join(args.init_from_checkpoint, "transformer.pdopt"))
- transformer.set_state_dict(model_dict)
- optimizer.set_state_dict(opt_dict)
- print("loaded from checkpoint.")
- # Init from some pretrain models, to better solve the current task
- if args.init_from_pretrain_model:
- model_dict = paddle.load(
- os.path.join(args.init_from_pretrain_model, "transformer.pdparams"))
- transformer.set_state_dict(model_dict)
- print("loaded from pre-trained model.")
-
- if trainer_count > 1:
- transformer = paddle.DataParallel(transformer)
-
- # The best cross-entropy value with label smoothing
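-    # (i.e. the entropy of the smoothed target distribution); it is subtracted
-    # from the reported loss below for logging only, not for optimization.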
- loss_normalizer = -(
- (1. - args.label_smooth_eps) * np.log(
- (1. - args.label_smooth_eps)) + args.label_smooth_eps *
- np.log(args.label_smooth_eps / (args.trg_vocab_size - 1) + 1e-20))
-
- ce_time = []
- ce_ppl = []
- step_idx = 0
-
- # Train loop
- for pass_id in range(args.epoch):
- epoch_start = time.time()
-
- batch_id = 0
- batch_start = time.time()
- for input_data in train_loader:
- batch_reader_end = time.time()
- (src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos,
- trg_slf_attn_bias, trg_src_attn_bias, lbl_word,
- lbl_weight) = input_data
-
- logits = transformer(
- src_word=src_word,
- src_pos=src_pos,
- src_slf_attn_bias=src_slf_attn_bias,
- trg_word=trg_word,
- trg_pos=trg_pos,
- trg_slf_attn_bias=trg_slf_attn_bias,
- trg_src_attn_bias=trg_src_attn_bias)
-
- sum_cost, avg_cost, token_num = criterion(logits, lbl_word,
- lbl_weight)
-
- avg_cost.backward()
-
- optimizer.step()
- optimizer.clear_grad()
- if step_idx % args.print_step == 0 and (trainer_count == 1 or
- dist.get_rank() == 0):
- total_avg_cost = avg_cost.numpy()
-
- if step_idx == 0:
- logger.info(
- "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, "
- "normalized loss: %f, ppl: %f" %
- (step_idx, pass_id, batch_id, total_avg_cost,
- total_avg_cost - loss_normalizer,
- np.exp([min(total_avg_cost, 100)])))
- else:
- train_avg_batch_cost = args.print_step / (
- time.time() - batch_start)
- logger.info(
- "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, "
- "normalized loss: %f, ppl: %f, avg_speed: %.2f step/sec, "
- % (
- step_idx,
- pass_id,
- batch_id,
- total_avg_cost,
- total_avg_cost - loss_normalizer,
- np.exp([min(total_avg_cost, 100)]),
- train_avg_batch_cost, ))
-
- batch_start = time.time()
-
- if step_idx % args.save_step == 0 and step_idx != 0:
- # Validation
- if args.validation_file:
- transformer.eval()
- total_sum_cost = 0
- total_token_num = 0
- for input_data in eval_loader:
- (src_word, src_pos, src_slf_attn_bias, trg_word,
- trg_pos, trg_slf_attn_bias, trg_src_attn_bias,
- lbl_word, lbl_weight) = input_data
- logits = transformer(
- src_word, src_pos, src_slf_attn_bias, trg_word,
- trg_pos, trg_slf_attn_bias, trg_src_attn_bias)
- sum_cost, avg_cost, token_num = criterion(
- logits, lbl_word, lbl_weight)
- total_sum_cost += sum_cost.numpy()
- total_token_num += token_num.numpy()
- total_avg_cost = total_sum_cost / total_token_num
- logger.info("validation, step_idx: %d, avg loss: %f, "
- "normalized loss: %f, ppl: %f" %
- (step_idx, total_avg_cost,
- total_avg_cost - loss_normalizer,
- np.exp([min(total_avg_cost, 100)])))
- transformer.train()
-
- if args.save_model and (trainer_count == 1 or
- dist.get_rank() == 0):
- model_dir = os.path.join(args.save_model,
- "step_" + str(step_idx))
- if not os.path.exists(model_dir):
- os.makedirs(model_dir)
- paddle.save(transformer.state_dict(),
- os.path.join(model_dir, "transformer.pdparams"))
- paddle.save(optimizer.state_dict(),
- os.path.join(model_dir, "transformer.pdopt"))
-
- batch_id += 1
- step_idx += 1
- scheduler.step()
-
- train_epoch_cost = time.time() - epoch_start
- ce_time.append(train_epoch_cost)
- logger.info("train epoch: %d, epoch_cost: %.5f s" %
- (pass_id, train_epoch_cost))
-
- if args.save_model and (trainer_count == 1 or dist.get_rank() == 0):
- model_dir = os.path.join(args.save_model, "step_final")
- if not os.path.exists(model_dir):
- os.makedirs(model_dir)
- paddle.save(transformer.state_dict(),
- os.path.join(model_dir, "transformer.pdparams"))
- paddle.save(optimizer.state_dict(),
- os.path.join(model_dir, "transformer.pdopt"))
-
-
-def train(args, world_size=1):
- if world_size > 1 and args.use_gpu:
- dist.spawn(do_train, nprocs=world_size, args=(args, ))
- else:
- do_train(args)
-
-
-if __name__ == "__main__":
- yaml_file = "./transformer.yaml"
- with open(yaml_file, 'rt') as f:
- args = AttrDict(yaml.safe_load(f))
- pprint(args)
-
- train(args, eval(str(args.world_size)))
diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/transformer.py b/PaddleNLP/examples/machine_translation/transformer-rc0/transformer.py
deleted file mode 100644
index c59a8c7cff5a76169076a38b874a2cd51acf9061..0000000000000000000000000000000000000000
--- a/PaddleNLP/examples/machine_translation/transformer-rc0/transformer.py
+++ /dev/null
@@ -1,351 +0,0 @@
-from __future__ import print_function
-
-import numpy as np
-
-import paddle
-import paddle.nn as nn
-import paddle.nn.functional as F
-from paddle.fluid.layers.utils import map_structure
-
-
-def position_encoding_init(n_position, d_pos_vec):
- """
- Generate the initial values for the sinusoid position encoding table.
- """
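-    # Same sinusoid scheme as in "Attention Is All You Need", except that the
-    # sin values fill the first half of the channels and the cos values fill
-    # the second half (concatenated rather than interleaved).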
- channels = d_pos_vec
- position = np.arange(n_position)
- num_timescales = channels // 2
- log_timescale_increment = (np.log(float(1e4) / float(1)) /
- (num_timescales - 1))
- inv_timescales = np.exp(
- np.arange(num_timescales) * -log_timescale_increment)
- scaled_time = np.expand_dims(position, 1) * np.expand_dims(inv_timescales,
- 0)
- signal = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)
- signal = np.pad(signal, [[0, 0], [0, np.mod(channels, 2)]], 'constant')
- position_enc = signal
- return position_enc.astype("float32")
-
-
-class WordEmbedding(nn.Layer):
- """
- Word Embedding + Scale
- """
-
- def __init__(self, vocab_size, emb_dim, bos_idx=0):
- super(WordEmbedding, self).__init__()
- self.emb_dim = emb_dim
-
- self.word_embedding = nn.Embedding(
- num_embeddings=vocab_size,
- embedding_dim=emb_dim,
- padding_idx=bos_idx,
- weight_attr=paddle.ParamAttr(
- initializer=nn.initializer.Normal(0., emb_dim**-0.5)))
-
- def forward(self, word):
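-        # Scale by sqrt(emb_dim) so the word embeddings are comparable in
-        # magnitude to the positional encodings added to them later.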
- word_emb = self.emb_dim**0.5 * self.word_embedding(word)
- return word_emb
-
-
-class PositionalEmbedding(nn.Layer):
- """
- Positional Embedding
- """
-
- def __init__(self, emb_dim, max_length, bos_idx=0):
- super(PositionalEmbedding, self).__init__()
- self.emb_dim = emb_dim
-
- self.pos_encoder = nn.Embedding(
- num_embeddings=max_length, embedding_dim=self.emb_dim)
- self.pos_encoder.weight.set_value(
- position_encoding_init(max_length, self.emb_dim))
-
- def forward(self, pos):
- pos_emb = self.pos_encoder(pos)
- pos_emb.stop_gradient = True
- return pos_emb
-
-
-class CrossEntropyCriterion(nn.Layer):
- def __init__(self, label_smooth_eps):
- super(CrossEntropyCriterion, self).__init__()
- self.label_smooth_eps = label_smooth_eps
-
- def forward(self, predict, label, weights):
- if self.label_smooth_eps:
- label = paddle.squeeze(label, axis=[2])
- label = F.label_smooth(
- label=F.one_hot(
- x=label, num_classes=predict.shape[-1]),
- epsilon=self.label_smooth_eps)
-
- cost = F.softmax_with_cross_entropy(
- logits=predict,
- label=label,
- soft_label=True if self.label_smooth_eps else False)
- weighted_cost = cost * weights
- sum_cost = paddle.sum(weighted_cost)
- token_num = paddle.sum(weights)
- token_num.stop_gradient = True
- avg_cost = sum_cost / token_num
- return sum_cost, avg_cost, token_num
-
-
-class TransformerDecodeCell(nn.Layer):
- def __init__(self,
- decoder,
- word_embedding=None,
- pos_embedding=None,
- linear=None,
- dropout=0.1):
- super(TransformerDecodeCell, self).__init__()
- self.decoder = decoder
- self.word_embedding = word_embedding
- self.pos_embedding = pos_embedding
- self.linear = linear
- self.dropout = dropout
-
- def forward(self, inputs, states, static_cache, trg_src_attn_bias, memory):
- if states and static_cache:
- states = list(zip(states, static_cache))
-
- if self.word_embedding:
- if not isinstance(inputs, (list, tuple)):
-                inputs = (inputs, )
-
- word_emb = self.word_embedding(inputs[0])
- pos_emb = self.pos_embedding(inputs[1])
- word_emb = word_emb + pos_emb
- inputs = F.dropout(
- word_emb, p=self.dropout,
- training=False) if self.dropout else word_emb
-
- cell_outputs, new_states = self.decoder(inputs, memory, None,
- trg_src_attn_bias, states)
- else:
- cell_outputs, new_states = self.decoder(inputs, memory, None,
- trg_src_attn_bias, states)
-
- if self.linear:
- cell_outputs = self.linear(cell_outputs)
-
- new_states = [cache[0] for cache in new_states]
-
- return cell_outputs, new_states
-
-
-class TransformerBeamSearchDecoder(nn.decode.BeamSearchDecoder):
- def __init__(self, cell, start_token, end_token, beam_size,
- var_dim_in_state):
- super(TransformerBeamSearchDecoder,
- self).__init__(cell, start_token, end_token, beam_size)
- self.cell = cell
- self.var_dim_in_state = var_dim_in_state
-
- def _merge_batch_beams_with_var_dim(self, c):
- # Init length of cache is 0, and it increases with decoding carrying on,
- # thus need to reshape elaborately
- var_dim_in_state = self.var_dim_in_state + 1 # count in beam dim
- c = paddle.transpose(c,
- list(range(var_dim_in_state, len(c.shape))) +
- list(range(0, var_dim_in_state)))
- c = paddle.reshape(
- c, [0] * (len(c.shape) - var_dim_in_state
- ) + [self.batch_size * self.beam_size] +
- [int(size) for size in c.shape[-var_dim_in_state + 2:]])
- c = paddle.transpose(
- c,
- list(range((len(c.shape) + 1 - var_dim_in_state), len(c.shape))) +
- list(range(0, (len(c.shape) + 1 - var_dim_in_state))))
- return c
-
- def _split_batch_beams_with_var_dim(self, c):
- var_dim_size = c.shape[self.var_dim_in_state]
- c = paddle.reshape(
- c, [-1, self.beam_size] +
- [int(size)
- for size in c.shape[1:self.var_dim_in_state]] + [var_dim_size] +
- [int(size) for size in c.shape[self.var_dim_in_state + 1:]])
- return c
-
- @staticmethod
- def tile_beam_merge_with_batch(t, beam_size):
- return map_structure(
- lambda x: nn.decode.BeamSearchDecoder.tile_beam_merge_with_batch(x, beam_size),
- t)
-
- def step(self, time, inputs, states, **kwargs):
- # Steps for decoding.
- # Compared to RNN, Transformer has 3D data at every decoding step
- inputs = paddle.reshape(inputs, [-1, 1]) # token
- pos = paddle.ones_like(inputs) * time # pos
-
- cell_states = map_structure(self._merge_batch_beams_with_var_dim,
- states.cell_states)
-
- cell_outputs, next_cell_states = self.cell((inputs, pos), cell_states,
- **kwargs)
-
- # Squeeze to adapt to BeamSearchDecoder which use 2D logits
- cell_outputs = map_structure(
- lambda x: paddle.squeeze(x, [1]) if len(x.shape) == 3 else x,
- cell_outputs)
- cell_outputs = map_structure(self._split_batch_beams, cell_outputs)
- next_cell_states = map_structure(self._split_batch_beams_with_var_dim,
- next_cell_states)
-
- beam_search_output, beam_search_state = self._beam_search_step(
- time=time,
- logits=cell_outputs,
- next_cell_states=next_cell_states,
- beam_state=states)
- next_inputs, finished = (beam_search_output.predicted_ids,
- beam_search_state.finished)
-
- return (beam_search_output, beam_search_state, next_inputs, finished)
-
-
-class TransformerModel(nn.Layer):
- """
- model
- """
-
- def __init__(self,
- src_vocab_size,
- trg_vocab_size,
- max_length,
- n_layer,
- n_head,
- d_model,
- d_inner_hid,
- dropout,
- weight_sharing,
- bos_id=0,
- eos_id=1):
- super(TransformerModel, self).__init__()
- self.trg_vocab_size = trg_vocab_size
- self.emb_dim = d_model
- self.bos_id = bos_id
- self.eos_id = eos_id
- self.dropout = dropout
-
- self.src_word_embedding = WordEmbedding(
- vocab_size=src_vocab_size, emb_dim=d_model, bos_idx=self.bos_id)
- self.src_pos_embedding = PositionalEmbedding(
- emb_dim=d_model, max_length=max_length, bos_idx=self.bos_id)
- if weight_sharing:
- assert src_vocab_size == trg_vocab_size, (
- "Vocabularies in source and target should be same for weight sharing."
- )
- self.trg_word_embedding = self.src_word_embedding
- self.trg_pos_embedding = self.src_pos_embedding
- else:
- self.trg_word_embedding = WordEmbedding(
- vocab_size=trg_vocab_size, emb_dim=d_model, bos_idx=self.bos_id)
- self.trg_pos_embedding = PositionalEmbedding(
- emb_dim=d_model, max_length=max_length, bos_idx=self.bos_id)
-
- self.transformer = nn.Transformer(
- d_model=d_model,
- nhead=n_head,
- num_encoder_layers=n_layer,
- num_decoder_layers=n_layer,
- dim_feedforward=d_inner_hid,
- dropout=dropout,
- normalize_before=True)
-
- if weight_sharing:
- self.linear = lambda x: paddle.matmul(x=x,
- y=self.trg_word_embedding.word_embedding.weight,
- transpose_y=True)
- else:
- self.linear = nn.Linear(
- input_dim=d_model, output_dim=trg_vocab_size, bias_attr=False)
-
- def forward(self, src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos,
- trg_slf_attn_bias, trg_src_attn_bias):
- src_emb = self.src_word_embedding(src_word)
- src_pos_emb = self.src_pos_embedding(src_pos)
- src_emb = src_emb + src_pos_emb
- enc_input = F.dropout(
- src_emb, p=self.dropout,
- training=self.training) if self.dropout else src_emb
-
- trg_emb = self.trg_word_embedding(trg_word)
- trg_pos_emb = self.trg_pos_embedding(trg_pos)
- trg_emb = trg_emb + trg_pos_emb
- dec_input = F.dropout(
- trg_emb, p=self.dropout,
- training=self.training) if self.dropout else trg_emb
-
- dec_output = self.transformer(
- enc_input,
- dec_input,
- src_mask=src_slf_attn_bias,
- tgt_mask=trg_slf_attn_bias,
- memory_mask=trg_src_attn_bias)
-
- predict = self.linear(dec_output)
-
- return predict
-
-
-class InferTransformerModel(TransformerModel):
- def __init__(self,
- src_vocab_size,
- trg_vocab_size,
- max_length,
- n_layer,
- n_head,
- d_model,
- d_inner_hid,
- dropout,
- weight_sharing,
- bos_id=0,
- eos_id=1,
- beam_size=4,
- max_out_len=256):
- args = dict(locals())
- args.pop("self")
- args.pop("__class__", None)
- self.beam_size = args.pop("beam_size")
- self.max_out_len = args.pop("max_out_len")
- self.dropout = dropout
- super(InferTransformerModel, self).__init__(**args)
-
- cell = TransformerDecodeCell(
- self.transformer.decoder, self.trg_word_embedding,
- self.trg_pos_embedding, self.linear, self.dropout)
-
- self.decode = TransformerBeamSearchDecoder(
- cell, bos_id, eos_id, beam_size, var_dim_in_state=2)
-
- def forward(self, src_word, src_pos, src_slf_attn_bias, trg_word,
- trg_src_attn_bias):
- # Run encoder
- src_emb = self.src_word_embedding(src_word)
- src_pos_emb = self.src_pos_embedding(src_pos)
- src_emb = src_emb + src_pos_emb
- enc_input = F.dropout(
- src_emb, p=self.dropout,
- training=False) if self.dropout else src_emb
- enc_output = self.transformer.encoder(enc_input, src_slf_attn_bias)
-
- # Init states (caches) for transformer, need to be updated according to selected beam
- incremental_cache, static_cache = self.transformer.decoder.gen_cache(
- enc_output, do_zip=True)
-
- static_cache, enc_output, trg_src_attn_bias = TransformerBeamSearchDecoder.tile_beam_merge_with_batch(
- (static_cache, enc_output, trg_src_attn_bias), self.beam_size)
-
- rs, _ = nn.decode.dynamic_decode(
- decoder=self.decode,
- inits=incremental_cache,
- max_step_num=self.max_out_len,
- memory=enc_output,
- trg_src_attn_bias=trg_src_attn_bias,
- static_cache=static_cache)
-
- return rs
diff --git a/PaddleNLP/examples/machine_translation/transformer-rc0/transformer.yaml b/PaddleNLP/examples/machine_translation/transformer-rc0/transformer.yaml
deleted file mode 100644
index 18701016e434353d46c822638dacf00c276505e2..0000000000000000000000000000000000000000
--- a/PaddleNLP/examples/machine_translation/transformer-rc0/transformer.yaml
+++ /dev/null
@@ -1,100 +0,0 @@
-# The frequency to save trained models when training.
-save_step: 10000
-# The frequency to fetch and print output when training.
-print_step: 100
-# path of the checkpoint, to resume the previous training
-init_from_checkpoint: ""
-# path of the pretrain model, to better solve the current task
-init_from_pretrain_model: "" # "base_model_dygraph/step_100000/"
-# path of trained parameter, to make prediction
-init_from_params: "./base_model_dygraph/api_saved/" # "base_model_dygraph/step_100000/"
-# the directory for saving model
-save_model: "trained_models"
-# the directory for saving inference model.
-inference_model_dir: "infer_model"
-# Set seed for CE or debug
-random_seed: None
-# The pattern to match training data files.
-training_file: "gen_data/wmt16_ende_data_bpe/train.tok.clean.bpe.32000.en-de"
-# The pattern to match validation data files.
-validation_file: "gen_data/wmt16_ende_data_bpe/newstest2014.tok.bpe.32000.en-de"
-# The pattern to match test data files.
-predict_file: "gen_data/wmt16_ende_data_bpe/newstest2014.tok.bpe.32000.en-de"
-# The file to output the translation results of predict_file to.
-output_file: "predict.txt"
-# The path of vocabulary file of source language.
-src_vocab_fpath: "gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000"
-# The path of vocabulary file of target language.
-trg_vocab_fpath: "gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000"
-# The <s>, <e> and <unk> tokens in the dictionary.
-special_token: ["<s>", "<e>", "<unk>"]
-
-# whether to use cuda
-use_gpu: True
-
-# args for reader, see reader.py for details
-token_delimiter: " "
-use_token_batch: True
-pool_size: 200000
-sort_type: "pool"
-shuffle: True
-shuffle_batch: True
-batch_size: 4096
-infer_batch_size: 64
-
-# Hyperparameters for training:
-# the number of epochs for training
-epoch: 30
-# cards
-world_size: 8
-# the hyper parameters for Adam optimizer.
-# This static learning_rate will be multiplied by the scheduler-derived
-# learning rate to get the final learning rate.
-learning_rate: 2.0
-beta1: 0.9
-beta2: 0.997
-eps: 1e-9
-# the parameters for learning rate scheduling.
-warmup_steps: 8000
-# the weight used to mix up the ground-truth distribution and the fixed
-# uniform distribution in label smoothing when training.
-# Set this as zero if label smoothing is not wanted.
-label_smooth_eps: 0.1
-
-# Hyperparameters for generation:
-# the parameters for beam search.
-beam_size: 5
-max_out_len: 256
-# the number of decoded sentences to output.
-n_best: 1
-
-# Hyperparameters for the model:
-# The following five vocabulary-related configurations will be set
-# automatically according to the given vocabulary path and special tokens.
-# size of source word dictionary.
-src_vocab_size: 10000
-# size of target word dictionary
-trg_vocab_size: 10000
-# index for <s> token
-bos_idx: 0
-# index for <e> token
-eos_idx: 1
-# index for <unk> token
-unk_idx: 2
-# max length of sequences deciding the size of position encoding table.
-max_length: 256
-# the dimension for word embeddings, which is also the last dimension of
-# the input and output of multi-head attention, position-wise feed-forward
-# networks, encoder and decoder.
-d_model: 512
-# size of the hidden layer in position-wise feed-forward networks.
-d_inner_hid: 2048
-# number of head used in multi-head attention.
-n_head: 8
-# number of sub-layers to be stacked in the encoder and decoder.
-n_layer: 6
-# dropout rates.
-dropout: 0.1
-# the flag indicating whether to share embedding and softmax weights.
-# vocabularies in source and target should be same for weight sharing.
-weight_sharing: True