diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/.run_ce.sh b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/.run_ce.sh
deleted file mode 100755
index 6be159cb5268ae215998e7a19045f7aa0d620f63..0000000000000000000000000000000000000000
--- a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/.run_ce.sh
+++ /dev/null
@@ -1,5 +0,0 @@
-###!/bin/bash
-####This file is only used for continuous evaluation.
-
-model_file='train.py'
-python $model_file --pass_num 1 --learning_rate 0.001 --save_interval 10 --enable_ce | python _ce.py
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/README.md b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/README.md
index 556ea6f5fc481a120bcca67e4c1c7b9c28856b7f..991ee9cc0ad11672967e7dafb8b63914e6dd5935 100644
--- a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/README.md
+++ b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/README.md
@@ -10,8 +10,11 @@
 ├── args.py               # 训练、预测以及模型参数
 ├── train.py              # 训练主程序
 ├── infer.py              # 预测主程序
+├── run.sh                # 默认配置的启动脚本
+├── infer.sh              # 默认配置的解码脚本
 ├── attention_model.py    # 带注意力机制的翻译模型配置
-└── no_attention_model.py # 无注意力机制的翻译模型配置
+└── base_model.py         # 无注意力机制的翻译模型配置
+
 ```

 ## 简介

 近年来,深度学习技术的发展不断为机器翻译任务带来新的突破。直接用神经网络将源语言映射到目标语言,即端到端的神经网络机器翻译(End-to-End Neural Machine Translation, End-to-End NMT)模型逐渐成为主流,此类模型一般简称为NMT模型。

-本目录包含一个经典的机器翻译模型[RNN Search](https://arxiv.org/pdf/1409.0473.pdf)的Paddle Fluid实现。事实上,RNN search是一个较为传统的NMT模型,在现阶段,其表现已被很多新模型(如[Transformer](https://arxiv.org/abs/1706.03762))超越。但除机器翻译外,该模型是许多序列到序列(sequence to sequence, 以下简称Seq2Seq)类模型的基础,很多解决其他NLP问题的模型均以此模型为基础;因此其在NLP领域具有重要意义,并被广泛用作Baseline.
+本目录包含两个经典的机器翻译模型:一个不带 attention 机制的 base model,以及一个带 attention 机制的翻译模型。在现阶段,其表现已被很多新模型(如[Transformer](https://arxiv.org/abs/1706.03762))超越。但除机器翻译外,该模型是许多序列到序列(sequence to sequence, 以下简称Seq2Seq)类模型的基础,很多解决其他NLP问题的模型均以此模型为基础;因此其在NLP领域具有重要意义,并被广泛用作Baseline.

 本目录下此范例模型的实现,旨在展示如何用Paddle Fluid实现一个带有注意力机制(Attention)的RNN模型来解决Seq2Seq类问题,以及如何使用带有Beam Search算法的解码器。如果您仅仅只是需要在机器翻译方面有着较好翻译效果的模型,则建议您参考[Transformer的Paddle Fluid实现](https://github.com/PaddlePaddle/models/tree/develop/fluid/neural_machine_translation/transformer)。

 ## 模型概览

 RNN Search模型使用了经典的编码器-解码器(Encoder-Decoder)的框架结构来解决Seq2Seq类问题。这种方法先用编码器将源序列编码成vector,再用解码器将该vector解码为目标序列。这其实模拟了人类在进行翻译类任务时的行为:先解析源语言,理解其含义,再根据该含义来写出目标语言的语句。编码器和解码器往往都使用RNN来实现。关于此方法的具体原理和数学表达式,可以参考[深度学习101](http://paddlepaddle.org/documentation/docs/zh/1.2/beginners_guide/basics/machine_translation/index.html).

-本模型中,在编码器方面,我们的实现使用了双向循环神经网络(Bi-directional Recurrent Neural Network);在解码器方面,我们使用了带注意力(Attention)机制的RNN解码器,并同时提供了一个不带注意力机制的解码器实现作为对比;而在预测方面我们使用柱搜索(beam search)算法来生成翻译的目标语句。以下将分别介绍用到的这些方法。
-
-### 双向循环神经网络
-这里介绍Bengio团队在论文\[[2](#参考文献),[4](#参考文献)\]中提出的一种双向循环网络结构。该结构的目的是输入一个序列,得到其在每个时刻的特征表示,即输出的每个时刻都用定长向量表示到该时刻的上下文语义信息。
-具体来说,该双向循环神经网络分别在时间维以顺序和逆序——即前向(forward)和后向(backward)——依次处理输入序列,并将每个时间步RNN的输出拼接成为最终的输出层。这样每个时间步的输出节点,都包含了输入序列中当前时刻完整的过去和未来的上下文信息。下图展示的是一个按时间步展开的双向循环神经网络。该网络包含一个前向和一个后向RNN,其中有六个权重矩阵:输入到前向隐层和后向隐层的权重矩阵($W_1, W_3$),隐层到隐层自己的权重矩阵($W_2,W_5$),前向隐层和后向隐层到输出层的权重矩阵($W_4, W_6$)。注意,该网络的前向隐层和后向隐层之间没有连接。

-图1. 按时间步展开的双向循环神经网络
-
-图2. 使用双向LSTM的编码器
-
-### 注意力机制
-如果编码阶段的输出是一个固定维度的向量,会带来以下两个问题:1)不论源语言序列的长度是5个词还是50个词,如果都用固定维度的向量去编码其中的语义和句法结构信息,对模型来说是一个非常高的要求,特别是对长句子序列而言;2)直觉上,当人类翻译一句话时,会对与当前译文更相关的源语言片段上给予更多关注,且关注点会随着翻译的进行而改变。而固定维度的向量则相当于,任何时刻都对源语言所有信息给予了同等程度的关注,这是不合理的。因此,Bahdanau等人\[[4](#参考文献)\]引入注意力(attention)机制,可以对编码后的上下文片段进行解码,以此来解决长句子的特征学习问题。下面介绍在注意力机制下的解码器结构。
-
-与简单的解码器不同,这里$z_i$的计算公式为 (由于Github原生不支持LaTeX公式,请您移步[这里](http://www.paddlepaddle.org/documentation/docs/zh/1.2/beginners_guide/basics/machine_translation/index.html)查看):
-
-$$z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$$
-
-可见,源语言句子的编码向量表示为第$i$个词的上下文片段$c_i$,即针对每一个目标语言中的词$u_i$,都有一个特定的$c_i$与之对应。$c_i$的计算公式如下:
-
-$$c_i=\sum _{j=1}^{T}a_{ij}h_j, a_i=\left[ a_{i1},a_{i2},...,a_{iT}\right ]$$
+本模型中,在编码器方面,我们采用了基于 LSTM 的多层 encoder;在解码器方面,我们使用了带注意力(Attention)机制的 RNN decoder,并同时提供了一个不带注意力机制的解码器实现作为对比;而在预测方面我们使用柱搜索(beam search)算法来生成翻译的目标语句。以下将分别介绍用到的这些方法。

-从公式中可以看出,注意力机制是通过对编码器中各时刻的RNN状态$h_j$进行加权平均实现的。权重$a_{ij}$表示目标语言中第$i$个词对源语言中第$j$个词的注意力大小,$a_{ij}$的计算公式如下:
-
-$$a_{ij} = {exp(e_{ij}) \over {\sum_{k=1}^T exp(e_{ik})}}$$
-$$e_{ij} = {align(z_i, h_j)}$$
-
-其中,$align$可以看作是一个对齐模型,用来衡量目标语言中第$i$个词和源语言中第$j$个词的匹配程度。具体而言,这个程度是通过解码RNN的第$i$个隐层状态$z_i$和源语言句子的第$j$个上下文片段$h_j$计算得到的。传统的对齐模型中,目标语言的每个词明确对应源语言的一个或多个词(hard alignment);而在注意力模型中采用的是soft alignment,即任何两个目标语言和源语言词间均存在一定的关联,且这个关联强度是由模型计算得到的实数,因此可以融入整个NMT框架,并通过反向传播算法进行训练。
-
-图3. 基于注意力机制的解码器
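+`attention_model.py` 中的 `dot_attention` 实现的是点积注意力:用当前解码器隐状态对编码器各时刻的输出做点积打分,softmax 归一化后加权求和得到上下文向量。下面用 numpy 给出一个等价的极简示意(仅为说明性草图,变量名为示意所设,并非本仓库的接口):
+
+```python
+import numpy as np
+
+def dot_attention(query, memory):
+    # query: [hidden],当前解码器隐状态;memory: [src_len, hidden],编码器各时刻输出
+    scores = memory @ query                 # 点积打分
+    weights = np.exp(scores - scores.max())
+    weights = weights / weights.sum()       # softmax 归一化得到注意力权重
+    context = weights @ memory              # 加权求和得到上下文向量
+    return context, weights
+```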
+## 数据介绍

-### 柱搜索算法
+本教程使用[IWSLT'15 English-Vietnamese data](https://nlp.stanford.edu/projects/nmt/)数据集中的英语到越南语的数据作为训练语料,tst2012的数据作为开发集,tst2013的数据作为测试集。

-柱搜索([beam search](http://en.wikipedia.org/wiki/Beam_search))是一种启发式图搜索算法,用于在图或树中搜索有限集合中的最优扩展节点,通常用在解空间非常大的系统(如机器翻译、语音识别)中,原因是内存无法装下图或树中所有展开的解。如在机器翻译任务中希望翻译“你好”,就算目标语言字典中只有3个词(`<s>`, `<e>`, `hello`),也可能生成无限句话(`hello`循环出现的次数不定),为了找到其中较好的翻译结果,我们可采用柱搜索算法。
+### 数据获取
+```sh
+cd data && sh download_en-vi.sh
+```

-柱搜索算法使用广度优先策略建立搜索树,在树的每一层,按照启发代价(heuristic cost)(本教程中,为生成词的log概率之和)对节点进行排序,然后仅留下预先确定的个数(文献中通常称为beam width、beam size、柱宽度等)的节点。只有这些节点会在下一层继续扩展,其他节点就被剪掉了,也就是说保留了质量较高的节点,剪枝了质量较差的节点。因此,搜索所占用的空间和时间大幅减少,但缺点是无法保证一定获得最优解。
-
-使用柱搜索算法的解码阶段,目标是最大化生成序列的概率。思路是:
+## 训练模型

-1. 每一个时刻,根据源语言句子的编码信息$c$、生成的第$i$个目标语言序列单词$u_i$和$i$时刻RNN的隐层状态$z_i$,计算出下一个隐层状态$z_{i+1}$。
-2. 将$z_{i+1}$通过`softmax`归一化,得到目标语言序列的第$i+1$个单词的概率分布$p_{i+1}$。
-3. 根据$p_{i+1}$采样出单词$u_{i+1}$。
-4. 重复步骤1~3,直到获得句子结束标记`<e>`或超过句子的最大生成长度为止。
+`train.py`包含训练程序的主函数,`run.sh`则以默认配置启动训练。要使用默认参数开始训练,只需要简单地执行:
+```sh
+sh run.sh
+```
+也可以直接调用`train.py`,并通过命令行参数显式指定各项配置:

-注意:$z_{i+1}$和$p_{i+1}$的计算公式同解码器中的一样。且由于生成时的每一步都是通过贪心法实现的,因此并不能保证得到全局最优解。
+```sh
+python train.py \
+    --src_lang en --tar_lang vi \
+    --attention True \
+    --num_layers 2 \
+    --hidden_size 512 \
+    --src_vocab_size 17191 \
+    --tar_vocab_size 7709 \
+    --batch_size 128 \
+    --dropout 0.2 \
+    --init_scale  0.1 \
+    --max_grad_norm 5.0 \
+    --train_data_prefix data/en-vi/train \
+    --eval_data_prefix data/en-vi/tst2012 \
+    --test_data_prefix data/en-vi/tst2013 \
+    --vocab_prefix data/en-vi/vocab \
+    --use_gpu True
+```

-## 数据介绍
-
-本教程使用[WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/)数据集中的[bitexts(after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz)作为训练集,[dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz)作为测试集和生成集。
-
-### 数据预处理
-
-我们的预处理流程包括两步:
-- 将每个源语言到目标语言的平行语料库文件合并为一个文件:
-  - 合并每个`XXX.src`和`XXX.trg`文件为`XXX`。
-  - `XXX`中的第$i$行内容为`XXX.src`中的第$i$行和`XXX.trg`中的第$i$行连接,用'\t'分隔。
-- 创建训练数据的“源字典”和“目标字典”。每个字典都有**DICTSIZE**个单词,包括:语料中词频最高的(DICTSIZE - 3)个单词,和3个特殊符号`<s>`(序列的开始)、`<e>`(序列的结束)和`<unk>`(未登录词)。
+训练程序会在每个 epoch 训练结束之后保存一次模型。

-### 示例数据
-
-因为完整的数据集数据量较大,为了验证训练流程,PaddlePaddle接口paddle.dataset.wmt14中默认提供了一个经过预处理的[较小规模的数据集](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz)。
-
-该数据集有193319条训练数据,6003条测试数据,词典长度为30000。因为数据规模限制,使用该数据集训练出来的模型效果无法保证。
+当模型训练完成之后,可以利用`infer.py`脚本进行预测:默认使用 beam search 方法解码,加载第10个 epoch 保存的模型,对测试集数据进行解码,只需执行:
+```sh
+sh infer.sh
+```
+如果想预测别的数据文件,只需要修改`--infer_file`参数。完整的预测命令如下:

-## 训练模型
-
-`train.py`包含训练程序的主函数,要使用默认参数开始训练,只需要简单地执行:
-```sh
-python train.py
-```
-您可以使用命令行参数来设置模型训练时的参数。要显示所有可用的命令行参数,执行:
-```sh
-python train.py -h
-```
-这样会显示所有的命令行参数的描述,以及其默认值。默认的模型是带有注意力机制的。您也可以尝试运行无注意力机制的模型,命令如下:
-```sh
-python train.py --no_attention
-```
-训练好的模型默认会被保存到`./models`路径下。您可以用命令行参数`--save_dir`来指定模型的保存路径。默认每个pass结束时会保存一个模型。
+```sh
+python infer.py \
+    --src_lang en --tar_lang vi \
+    --num_layers 2 \
+    --hidden_size 512 \
+    --src_vocab_size 17191 \
+    --tar_vocab_size 7709 \
+    --batch_size 128 \
+    --dropout 0.2 \
+    --init_scale  0.1 \
+    --max_grad_norm 5.0 \
+    --vocab_prefix data/en-vi/vocab \
+    --infer_file data/en-vi/tst2013.en \
+    --reload_model model_new/epoch_10/ \
+    --use_gpu True
+```

-## 生成预测结果
-
-在模型训练好后,可以用`infer.py`来生成预测结果。同样的,使用默认参数,只需要执行:
-```sh
-python infer.py
-```
-您也可以同样用命令行来指定各参数。注意,预测时的参数设置必须与训练时完全一致,否则载入模型会失败。您可以用`--pass_num`参数来选择读取哪个pass结束时保存的模型。同时您可以使用`--beam_width`参数来选择beam search宽度。
+## 效果
+
+单个模型 beam_size = 10:
+
+```sh
+no attention
+
+tst2012 BLEU: 11.58
+tst2013 BLEU: 12.20
+
+with attention
+
+tst2012 BLEU: 22.21
+tst2013 BLEU: 25.30
+```

-## 参考文献
-1. Koehn P.
[Statistical machine translation](https://books.google.com.hk/books?id=4v_Cx1wIMLkC&printsec=frontcover&hl=zh-CN&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false)[M]. Cambridge University Press, 2009. -2. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](http://www.aclweb.org/anthology/D/D14/D14-1179.pdf)[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1724-1734. -3. Chung J, Gulcehre C, Cho K H, et al. [Empirical evaluation of gated recurrent neural networks on sequence modeling](https://arxiv.org/abs/1412.3555)[J]. arXiv preprint arXiv:1412.3555, 2014. -4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[C]//Proceedings of ICLR 2015, 2015. -5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318. +with attention -
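+本实现的 beam search(见 `attention_model.py` 与 `base_model.py` 中的 `grow_top_k`)在比较候选序列时使用了 GNMT 风格的长度惩罚:将候选的累计 log 概率除以 $lp(i)=\left(\frac{5+i}{6}\right)^{\alpha}$(代码中 $\alpha=0.6$)后再取 top-k,以避免搜索偏向过短的译文。一个最小的 Python 示意如下(仅为说明,并非本仓库的接口):
+
+```python
+def length_penalty(length, alpha=0.6):
+    # GNMT 式长度惩罚,与 grow_top_k 中 ((step_idx + 6) / 6) ** alpha 的计算一致
+    return ((5.0 + length) / 6.0) ** alpha
+
+# 候选序列按 归一化得分 = 累计log概率 / length_penalty(length) 排序并取 top-k;
+# 选中的候选再乘回 length_penalty,以还原真实的累计 log 概率。
+```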
-本教程由 PaddlePaddle 创作,采用 知识共享 署名-相同方式共享 4.0 国际 许可协议 进行许可。
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/_ce.py b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/_ce.py
deleted file mode 100644
index e00ac49273ba4bf489e9b837d65d448eaa2aea43..0000000000000000000000000000000000000000
--- a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/_ce.py
+++ /dev/null
@@ -1,63 +0,0 @@
-####this file is only used for continuous evaluation test!
-
-import os
-import sys
-sys.path.append(os.environ['ceroot'])
-from kpi import CostKpi, DurationKpi, AccKpi
-
-#### NOTE kpi.py should shared in models in some way!!!!
-
-train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=False)
-test_cost_kpi = CostKpi('test_cost', 0.005, 0, actived=False)
-train_duration_kpi = DurationKpi('train_duration', 0.06, 0, actived=False)
-
-tracking_kpis = [
-    train_cost_kpi,
-    test_cost_kpi,
-    train_duration_kpi,
-]
-
-
-def parse_log(log):
-    '''
-    This method should be implemented by model developers.
-
-    The suggestion:
-
-    each line in the log should be key, value, for example:
-
-    "
-    train_cost\t1.0
-    test_cost\t1.0
-    train_cost\t1.0
-    train_cost\t1.0
-    train_acc\t1.2
-    "
-    '''
-    for line in log.split('\n'):
-        fs = line.strip().split('\t')
-        print(fs)
-        if len(fs) == 3 and fs[0] == 'kpis':
-            print("-----%s" % fs)
-            kpi_name = fs[1]
-            kpi_value = float(fs[2])
-            yield kpi_name, kpi_value
-
-
-def log_to_ce(log):
-    kpi_tracker = {}
-    for kpi in tracking_kpis:
-        kpi_tracker[kpi.name] = kpi
-
-    for (kpi_name, kpi_value) in parse_log(log):
-        print(kpi_name, kpi_value)
-        kpi_tracker[kpi_name].add_record(kpi_value)
-        kpi_tracker[kpi_name].persist()
-
-
-if __name__ == '__main__':
-    log = sys.stdin.read()
-    print("*****")
-    print(log)
-    print("****")
-    log_to_ce(log)
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/args.py b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/args.py
index 16f97488d8b976a6eff7dfa38ccddf93fadcbf18..494289a7ace2506d52e4e6a7ff050ceff0fdf4d9 100644
--- a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/args.py
+++ b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/args.py
@@ -23,76 +23,95 @@ import distutils.util

 def parse_args():
     parser = argparse.ArgumentParser(description=__doc__)
     parser.add_argument(
-        "--embedding_dim",
-        type=int,
-        default=512,
-        help="The dimension of embedding table. (default: %(default)d)")
+        "--train_data_prefix", type=str, help="file prefix for train data")
     parser.add_argument(
-        "--encoder_size",
-        type=int,
-        default=512,
-        help="The size of encoder bi-rnn unit. (default: %(default)d)")
+        "--eval_data_prefix", type=str, help="file prefix for eval data")
     parser.add_argument(
-        "--decoder_size",
-        type=int,
-        default=512,
-        help="The size of decoder rnn unit. (default: %(default)d)")
+        "--test_data_prefix", type=str, help="file prefix for test data")
     parser.add_argument(
-        "--batch_size",
-        type=int,
-        default=32,
-        help="The sequence number of a mini-batch data. (default: %(default)d)")
+        "--vocab_prefix", type=str, help="file prefix for vocab")
+    parser.add_argument("--src_lang", type=str, help="source language suffix")
+    parser.add_argument("--tar_lang", type=str, help="target language suffix")
+
     parser.add_argument(
-        "--dict_size",
-        type=int,
-        default=30000,
-        help="The dictionary capacity. Dictionaries of source sequence and "
-        "target dictionary have same capacity. (default: %(default)d)")
+        "--attention",
+        type=distutils.util.strtobool,
+        default=False,
+        help="Whether to use the attention model")
+
     parser.add_argument(
-        "--pass_num",
-        type=int,
-        default=5,
-        help="The pass number to train. In inference mode, load the saved model"
-        " at the end of given pass.(default: %(default)d)")
+        "--optimizer",
+        type=str,
+        default='adam',
+        help="optimizer to use, only support [sgd|adam]")
+
     parser.add_argument(
         "--learning_rate",
         type=float,
-        default=0.01,
-        help="Learning rate used to train the model. (default: %(default)f)")
+        default=0.001,
+        help="learning rate for optimizer")
+
     parser.add_argument(
-        "--no_attention",
-        action='store_true',
-        help="If set, run no attention model instead of attention model.")
+        "--num_layers",
+        type=int,
+        default=1,
+        help="layers number of encoder and decoder")
     parser.add_argument(
-        "--beam_size",
+        "--hidden_size",
         type=int,
-        default=3,
-        help="The width for beam search. (default: %(default)d)")
+        default=100,
+        help="hidden size of encoder and decoder")
+    parser.add_argument("--src_vocab_size", type=int, help="source vocab size")
+    parser.add_argument("--tar_vocab_size", type=int, help="target vocab size")
+
+    parser.add_argument(
+        "--batch_size", type=int, help="batch size of each step")
+
     parser.add_argument(
-        "--use_gpu",
-        type=distutils.util.strtobool,
-        default=True,
-        help="Whether to use gpu or not. (default: %(default)d)")
+        "--max_epoch", type=int, default=12, help="max epoch for the training")
+
     parser.add_argument(
-        "--max_length",
+        "--max_len",
         type=int,
         default=50,
-        help="The maximum sequence length for translation result."
-        "(default: %(default)d)")
+        help="max length for source and target sentence")
     parser.add_argument(
-        "--save_dir",
+        "--dropout", type=float, default=0.0, help="drop probability")
+    parser.add_argument(
+        "--init_scale",
+        type=float,
+        default=0.0,
+        help="init scale for parameter")
+    parser.add_argument(
+        "--max_grad_norm",
+        type=float,
+        default=5.0,
+        help="max grad norm for global norm clip")
+
+    parser.add_argument(
+        "--model_path",
         type=str,
-        default="model",
-        help="Specify the path to save trained models.")
+        default='./model',
+        help="model path for model to save")
+
     parser.add_argument(
-        "--save_interval",
-        type=int,
-        default=1,
-        help="Save the trained model every n passes."
-        "(default: %(default)d)")
+        "--reload_model", type=str, help="reload model to inference")
+
+    parser.add_argument(
+        "--infer_file", type=str, help="file name for inference")
+    parser.add_argument(
+        "--infer_output_file",
+        type=str,
+        default='./infer_output',
+        help="file name for inference output")
+    parser.add_argument(
+        "--beam_size", type=int, default=10, help="beam size for beam search")
+
     parser.add_argument(
-        "--enable_ce",
-        action='store_true',
-        help="If set, run the task with continuous evaluation logs.")
+        '--use_gpu',
+        type=distutils.util.strtobool,
+        default=False,
+        help='Whether using gpu [True|False]')
+
     args = parser.parse_args()
     return args
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/attention_model.py b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/attention_model.py
index 0c72697786819179dabce477a9c8d1be760dca28..eba1d5f36c09d1314de716902234ae41c9536a15 100644
--- a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/attention_model.py
+++ b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/attention_model.py
@@ -1,220 +1,471 @@
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - from __future__ import absolute_import from __future__ import division from __future__ import print_function +import paddle.fluid.layers as layers import paddle.fluid as fluid -from paddle.fluid.contrib.decoder.beam_search_decoder import * - - -def lstm_step(x_t, hidden_t_prev, cell_t_prev, size): - def linear(inputs): - return fluid.layers.fc(input=inputs, size=size, bias_attr=True) - - forget_gate = fluid.layers.sigmoid(x=linear([hidden_t_prev, x_t])) - input_gate = fluid.layers.sigmoid(x=linear([hidden_t_prev, x_t])) - output_gate = fluid.layers.sigmoid(x=linear([hidden_t_prev, x_t])) - cell_tilde = fluid.layers.tanh(x=linear([hidden_t_prev, x_t])) - - cell_t = fluid.layers.sums(input=[ - fluid.layers.elementwise_mul( - x=forget_gate, y=cell_t_prev), fluid.layers.elementwise_mul( - x=input_gate, y=cell_tilde) - ]) - - hidden_t = fluid.layers.elementwise_mul( - x=output_gate, y=fluid.layers.tanh(x=cell_t)) - - return hidden_t, cell_t - - -def seq_to_seq_net(embedding_dim, encoder_size, decoder_size, source_dict_dim, - target_dict_dim, is_generating, beam_size, max_length): - """Construct a seq2seq network.""" - - def bi_lstm_encoder(input_seq, gate_size): - # A bi-directional lstm encoder implementation. - # Linear transformation part for input gate, output gate, forget gate - # and cell activation vectors need be done outside of dynamic_lstm. - # So the output size is 4 times of gate_size. - input_forward_proj = fluid.layers.fc(input=input_seq, - size=gate_size * 4, - act='tanh', - bias_attr=False) - forward, _ = fluid.layers.dynamic_lstm( - input=input_forward_proj, size=gate_size * 4, use_peepholes=False) - input_reversed_proj = fluid.layers.fc(input=input_seq, - size=gate_size * 4, - act='tanh', - bias_attr=False) - reversed, _ = fluid.layers.dynamic_lstm( - input=input_reversed_proj, - size=gate_size * 4, - is_reverse=True, - use_peepholes=False) - return forward, reversed - - # The encoding process. Encodes the input words into tensors. - src_word_idx = fluid.layers.data( - name='source_sequence', shape=[1], dtype='int64', lod_level=1) - - src_embedding = fluid.layers.embedding( - input=src_word_idx, - size=[source_dict_dim, embedding_dim], - dtype='float32') - - src_forward, src_reversed = bi_lstm_encoder( - input_seq=src_embedding, gate_size=encoder_size) - - encoded_vector = fluid.layers.concat( - input=[src_forward, src_reversed], axis=1) - - encoded_proj = fluid.layers.fc(input=encoded_vector, - size=decoder_size, - bias_attr=False) - - backward_first = fluid.layers.sequence_pool( - input=src_reversed, pool_type='first') - - decoder_boot = fluid.layers.fc(input=backward_first, - size=decoder_size, - bias_attr=False, - act='tanh') - - cell_init = fluid.layers.fill_constant_batch_size_like( - input=decoder_boot, - value=0.0, - shape=[-1, decoder_size], - dtype='float32') - cell_init.stop_gradient = False - - # Create a RNN state cell by providing the input and hidden states, and - # specifies the hidden state as output. 
- h = InitState(init=decoder_boot, need_reorder=True) - c = InitState(init=cell_init) - - state_cell = StateCell( - inputs={'x': None, - 'encoder_vec': None, - 'encoder_proj': None}, - states={'h': h, - 'c': c}, - out_state='h') - - def simple_attention(encoder_vec, encoder_proj, decoder_state): - # The implementation of simple attention model - decoder_state_proj = fluid.layers.fc(input=decoder_state, - size=decoder_size, - bias_attr=False) - decoder_state_expand = fluid.layers.sequence_expand( - x=decoder_state_proj, y=encoder_proj) - # concated lod should inherit from encoder_proj - mixed_state = encoder_proj + decoder_state_expand - attention_weights = fluid.layers.fc(input=mixed_state, - size=1, - bias_attr=False) - attention_weights = fluid.layers.sequence_softmax( - input=attention_weights) - weigths_reshape = fluid.layers.reshape(x=attention_weights, shape=[-1]) - scaled = fluid.layers.elementwise_mul( - x=encoder_vec, y=weigths_reshape, axis=0) - context = fluid.layers.sequence_pool(input=scaled, pool_type='sum') - return context - - @state_cell.state_updater - def state_updater(state_cell): - # Define the updater of RNN state cell - current_word = state_cell.get_input('x') - encoder_vec = state_cell.get_input('encoder_vec') - encoder_proj = state_cell.get_input('encoder_proj') - prev_h = state_cell.get_state('h') - prev_c = state_cell.get_state('c') - context = simple_attention(encoder_vec, encoder_proj, prev_h) - decoder_inputs = fluid.layers.concat( - input=[context, current_word], axis=1) - h, c = lstm_step(decoder_inputs, prev_h, prev_c, decoder_size) - state_cell.set_state('h', h) - state_cell.set_state('c', c) - - # Define the decoding process - if not is_generating: - # Training process - trg_word_idx = fluid.layers.data( - name='target_sequence', shape=[1], dtype='int64', lod_level=1) - - trg_embedding = fluid.layers.embedding( - input=trg_word_idx, - size=[target_dict_dim, embedding_dim], - dtype='float32') - - # A decoder for training - decoder = TrainingDecoder(state_cell) - - with decoder.block(): - current_word = decoder.step_input(trg_embedding) - encoder_vec = decoder.static_input(encoded_vector) - encoder_proj = decoder.static_input(encoded_proj) - decoder.state_cell.compute_state(inputs={ - 'x': current_word, - 'encoder_vec': encoder_vec, - 'encoder_proj': encoder_proj - }) - h = decoder.state_cell.get_state('h') - decoder.state_cell.update_states() - out = fluid.layers.fc(input=h, - size=target_dict_dim, - bias_attr=True, - act='softmax') - decoder.output(out) - - label = fluid.layers.data( - name='label_sequence', shape=[1], dtype='int64', lod_level=1) - cost = fluid.layers.cross_entropy(input=decoder(), label=label) - avg_cost = fluid.layers.mean(x=cost) - feeding_list = ["source_sequence", "target_sequence", "label_sequence"] - return avg_cost, feeding_list - - else: - # Inference - init_ids = fluid.layers.data( - name="init_ids", shape=[1], dtype="int64", lod_level=2) - init_scores = fluid.layers.data( - name="init_scores", shape=[1], dtype="float32", lod_level=2) - - # A beam search decoder - decoder = BeamSearchDecoder( - state_cell=state_cell, - init_ids=init_ids, - init_scores=init_scores, - target_dict_dim=target_dict_dim, - word_dim=embedding_dim, - input_var_dict={ - 'encoder_vec': encoded_vector, - 'encoder_proj': encoded_proj - }, - topk_size=50, - sparse_emb=True, - max_len=max_length, - beam_size=beam_size, - end_id=1, - name=None) - - decoder.decode() - - translation_ids, translation_scores = decoder() - feeding_list = ["source_sequence"] - - 
return translation_ids, translation_scores, feeding_list +from paddle.fluid.layers.control_flow import StaticRNN +import numpy as np +from paddle.fluid import ParamAttr +from paddle.fluid.contrib.layers import basic_lstm, BasicLSTMUnit +from base_model import BaseModel + +INF = 1. * 1e5 +alpha = 0.6 + + +class AttentionModel(BaseModel): + def __init__(self, + hidden_size, + src_vocab_size, + tar_vocab_size, + batch_size, + num_layers=1, + init_scale=0.1, + dropout=None, + batch_first=True): + super(AttentionModel, self).__init__( + hidden_size, + src_vocab_size, + tar_vocab_size, + batch_size, + num_layers=num_layers, + init_scale=init_scale, + dropout=dropout, + batch_first=batch_first) + + def _build_decoder(self, + enc_last_hidden, + enc_last_cell, + mode='train', + beam_size=10): + + dec_input = layers.transpose(self.tar_emb, [1, 0, 2]) + dec_unit_list = [] + for i in range(self.num_layers): + new_name = "dec_layers_" + str(i) + dec_unit_list.append( + BasicLSTMUnit( + new_name, + self.hidden_size, + ParamAttr(initializer=fluid.initializer.UniformInitializer( + low=-self.init_scale, high=self.init_scale)), + ParamAttr(initializer=fluid.initializer.Constant(0.0)), )) + + + attention_weight = layers.create_parameter([self.hidden_size * 2, self.hidden_size], dtype="float32", name="attention_weight", \ + default_initializer=fluid.initializer.UniformInitializer(low=-self.init_scale, high=self.init_scale)) + + memory_weight = layers.create_parameter([self.hidden_size, self.hidden_size], dtype="float32", name="memory_weight", \ + default_initializer=fluid.initializer.UniformInitializer(low=-self.init_scale, high=self.init_scale)) + + def dot_attention(query, memory, mask=None): + attn = layers.matmul(query, memory, transpose_y=True) + + if mask: + attn = layers.transpose(attn, [1, 0, 2]) + attn = layers.elementwise_add(attn, mask * 1000000000, -1) + attn = layers.transpose(attn, [1, 0, 2]) + weight = layers.softmax(attn) + weight_memory = layers.matmul(weight, memory) + + return weight_memory, weight + + max_src_seq_len = layers.shape(self.src)[1] + src_mask = layers.sequence_mask( + self.src_sequence_length, maxlen=max_src_seq_len, dtype='float32') + + softmax_weight = layers.create_parameter([self.hidden_size, self.tar_vocab_size], dtype="float32", name="softmax_weight", \ + default_initializer=fluid.initializer.UniformInitializer(low=-self.init_scale, high=self.init_scale)) + + def decoder_step(currrent_in, pre_feed, pre_hidden_array, + pre_cell_array, enc_memory): + new_hidden_array = [] + new_cell_array = [] + + step_input = layers.concat([currrent_in, pre_feed], 1) + + for i in range(self.num_layers): + pre_hidden = pre_hidden_array[i] + pre_cell = pre_cell_array[i] + + new_hidden, new_cell = dec_unit_list[i](step_input, pre_hidden, + pre_cell) + + new_hidden_array.append(new_hidden) + new_cell_array.append(new_cell) + + step_input = new_hidden + + memory_mask = src_mask - 1.0 + enc_memory = layers.matmul(enc_memory, memory_weight) + att_in = layers.unsqueeze(step_input, [1]) + dec_att, _ = dot_attention(att_in, enc_memory) + dec_att = layers.squeeze(dec_att, [1]) + concat_att_out = layers.concat([dec_att, step_input], 1) + concat_att_out = layers.matmul(concat_att_out, attention_weight) + + return concat_att_out, new_hidden_array, new_cell_array + + if mode == "train": + dec_rnn = StaticRNN() + with dec_rnn.step(): + step_input = dec_rnn.step_input(dec_input) + input_feed = dec_rnn.memory( + batch_ref=dec_input, shape=[-1, self.hidden_size]) + step_input = layers.concat([step_input, 
input_feed], 1) + + for i in range(self.num_layers): + pre_hidden = dec_rnn.memory(init=enc_last_hidden[i]) + pre_cell = dec_rnn.memory(init=enc_last_cell[i]) + + new_hidden, new_cell = dec_unit_list[i]( + step_input, pre_hidden, pre_cell) + + dec_rnn.update_memory(pre_hidden, new_hidden) + dec_rnn.update_memory(pre_cell, new_cell) + + step_input = new_hidden + + if self.dropout != None and self.dropout > 0.0: + print("using dropout", self.dropout) + step_input = fluid.layers.dropout( + step_input, + dropout_prob=self.dropout, + dropout_implementation='upscale_in_train') + memory_mask = src_mask - 1.0 + enc_memory = layers.matmul(self.enc_output, memory_weight) + att_in = layers.unsqueeze(step_input, [1]) + dec_att, _ = dot_attention(att_in, enc_memory, memory_mask) + dec_att = layers.squeeze(dec_att, [1]) + concat_att_out = layers.concat([dec_att, step_input], 1) + concat_att_out = layers.matmul(concat_att_out, attention_weight) + #concat_att_out = layers.tanh( concat_att_out ) + + dec_rnn.update_memory(input_feed, concat_att_out) + + dec_rnn.step_output(concat_att_out) + + dec_rnn_out = dec_rnn() + dec_output = layers.transpose(dec_rnn_out, [1, 0, 2]) + + dec_output = layers.matmul(dec_output, softmax_weight) + + return dec_output + elif mode == 'beam_search': + + max_length = max_src_seq_len * 2 + #max_length = layers.fill_constant( [1], dtype='int32', value = 10) + pre_ids = layers.fill_constant([1, 1], dtype='int64', value=1) + full_ids = layers.fill_constant([1, 1], dtype='int64', value=1) + + score = layers.fill_constant([1], dtype='float32', value=0.0) + + #eos_ids = layers.fill_constant( [1, 1], dtype='int64', value=2) + + pre_hidden_array = [] + pre_cell_array = [] + pre_feed = layers.fill_constant( + [beam_size, self.hidden_size], dtype='float32', value=0) + for i in range(self.num_layers): + pre_hidden_array.append( + layers.expand(enc_last_hidden[i], [beam_size, 1])) + pre_cell_array.append( + layers.expand(enc_last_cell[i], [beam_size, 1])) + + eos_ids = layers.fill_constant([beam_size], dtype='int64', value=2) + init_score = np.zeros((beam_size)).astype('float32') + init_score[1:] = -INF + pre_score = layers.assign(init_score) + #pre_score = layers.fill_constant( [1,], dtype='float32', value= 0.0) + tokens = layers.fill_constant( + [beam_size, 1], dtype='int64', value=1) + + enc_memory = layers.expand(self.enc_output, [beam_size, 1, 1]) + + pre_tokens = layers.fill_constant( + [beam_size, 1], dtype='int64', value=1) + + finished_seq = layers.fill_constant( + [beam_size, 1], dtype='int64', value=0) + finished_scores = layers.fill_constant( + [beam_size], dtype='float32', value=-INF) + finished_flag = layers.fill_constant( + [beam_size], dtype='float32', value=0.0) + + step_idx = layers.fill_constant(shape=[1], dtype='int32', value=0) + cond = layers.less_than( + x=step_idx, y=max_length) # default force_cpu=True + + parent_idx = layers.fill_constant([1], dtype='int32', value=0) + while_op = layers.While(cond) + + def compute_topk_scores_and_seq(sequences, + scores, + scores_to_gather, + flags, + beam_size, + select_beam=None, + generate_id=None): + scores = layers.reshape(scores, shape=[1, -1]) + _, topk_indexs = layers.topk(scores, k=beam_size) + + topk_indexs = layers.reshape(topk_indexs, shape=[-1]) + + # gather result + + top_seq = layers.gather(sequences, topk_indexs) + topk_flags = layers.gather(flags, topk_indexs) + topk_gather_scores = layers.gather(scores_to_gather, + topk_indexs) + + if select_beam: + topk_beam = layers.gather(select_beam, topk_indexs) + else: + 
topk_beam = select_beam + + if generate_id: + topk_id = layers.gather(generate_id, topk_indexs) + else: + topk_id = generate_id + return top_seq, topk_gather_scores, topk_flags, topk_beam, topk_id + + def grow_alive(curr_seq, curr_scores, curr_log_probs, curr_finished, + select_beam, generate_id): + curr_scores += curr_finished * -INF + return compute_topk_scores_and_seq( + curr_seq, + curr_scores, + curr_log_probs, + curr_finished, + beam_size, + select_beam, + generate_id=generate_id) + + def grow_finished(finished_seq, finished_scores, finished_flag, + curr_seq, curr_scores, curr_finished): + finished_seq = layers.concat( + [ + finished_seq, layers.fill_constant( + [beam_size, 1], dtype='int64', value=1) + ], + axis=1) + curr_scores += (1.0 - curr_finished) * -INF + #layers.Print( curr_scores, message="curr scores") + curr_finished_seq = layers.concat( + [finished_seq, curr_seq], axis=0) + curr_finished_scores = layers.concat( + [finished_scores, curr_scores], axis=0) + curr_finished_flags = layers.concat( + [finished_flag, curr_finished], axis=0) + + return compute_topk_scores_and_seq( + curr_finished_seq, curr_finished_scores, + curr_finished_scores, curr_finished_flags, beam_size) + + def is_finished(alive_log_prob, finished_scores, + finished_in_finished): + + max_out_len = 200 + max_length_penalty = layers.pow(layers.fill_constant( + [1], dtype='float32', value=((5.0 + max_out_len) / 6.0)), + alpha) + + lower_bound_alive_score = layers.slice( + alive_log_prob, starts=[0], ends=[1], + axes=[0]) / max_length_penalty + + lowest_score_of_fininshed_in_finished = finished_scores * finished_in_finished + lowest_score_of_fininshed_in_finished += ( + 1.0 - finished_in_finished) * -INF + lowest_score_of_fininshed_in_finished = layers.reduce_min( + lowest_score_of_fininshed_in_finished) + + met = layers.less_than(lower_bound_alive_score, + lowest_score_of_fininshed_in_finished) + met = layers.cast(met, 'float32') + bound_is_met = layers.reduce_sum(met) + + finished_eos_num = layers.reduce_sum(finished_in_finished) + + finish_cond = layers.less_than( + finished_eos_num, + layers.fill_constant( + [1], dtype='float32', value=beam_size)) + + return finish_cond + + def grow_top_k(step_idx, alive_seq, alive_log_prob, parant_idx): + pre_ids = alive_seq + + dec_step_emb = layers.embedding( + input=pre_ids, + size=[self.tar_vocab_size, self.hidden_size], + dtype='float32', + is_sparse=False, + param_attr=fluid.ParamAttr( + name='target_embedding', + initializer=fluid.initializer.UniformInitializer( + low=-self.init_scale, high=self.init_scale))) + + dec_att_out, new_hidden_array, new_cell_array = decoder_step( + dec_step_emb, pre_feed, pre_hidden_array, pre_cell_array, + enc_memory) + + projection = layers.matmul(dec_att_out, softmax_weight) + + logits = layers.softmax(projection) + current_log = layers.elementwise_add( + x=layers.log(logits), y=alive_log_prob, axis=0) + base_1 = layers.cast(step_idx, 'float32') + 6.0 + base_1 /= 6.0 + length_penalty = layers.pow(base_1, alpha) + + len_pen = layers.pow(( + (5. 
+ layers.cast(step_idx + 1, 'float32')) / 6.), alpha) + + current_log = layers.reshape(current_log, shape=[1, -1]) + + current_log = current_log / length_penalty + topk_scores, topk_indices = layers.topk( + input=current_log, k=beam_size) + + topk_scores = layers.reshape(topk_scores, shape=[-1]) + + topk_log_probs = topk_scores * length_penalty + + generate_id = layers.reshape( + topk_indices, shape=[-1]) % self.tar_vocab_size + + selected_beam = layers.reshape( + topk_indices, shape=[-1]) // self.tar_vocab_size + + topk_finished = layers.equal(generate_id, eos_ids) + + topk_finished = layers.cast(topk_finished, 'float32') + + generate_id = layers.reshape(generate_id, shape=[-1, 1]) + + pre_tokens_list = layers.gather(tokens, selected_beam) + + full_tokens_list = layers.concat( + [pre_tokens_list, generate_id], axis=1) + + + return full_tokens_list, topk_log_probs, topk_scores, topk_finished, selected_beam, generate_id, \ + dec_att_out, new_hidden_array, new_cell_array + + with while_op.block(): + topk_seq, topk_log_probs, topk_scores, topk_finished, topk_beam, topk_generate_id, attention_out, new_hidden_array, new_cell_array = \ + grow_top_k( step_idx, pre_tokens, pre_score, parent_idx) + alive_seq, alive_log_prob, _, alive_beam, alive_id = grow_alive( + topk_seq, topk_scores, topk_log_probs, topk_finished, + topk_beam, topk_generate_id) + + finished_seq_2, finished_scores_2, finished_flags_2, _, _ = grow_finished( + finished_seq, finished_scores, finished_flag, topk_seq, + topk_scores, topk_finished) + + finished_cond = is_finished(alive_log_prob, finished_scores_2, + finished_flags_2) + + layers.increment(x=step_idx, value=1.0, in_place=True) + + layers.assign(alive_beam, parent_idx) + layers.assign(alive_id, pre_tokens) + layers.assign(alive_log_prob, pre_score) + layers.assign(alive_seq, tokens) + layers.assign(finished_seq_2, finished_seq) + layers.assign(finished_scores_2, finished_scores) + layers.assign(finished_flags_2, finished_flag) + + # update init_hidden, init_cell, input_feed + new_feed = layers.gather(attention_out, parent_idx) + layers.assign(new_feed, pre_feed) + for i in range(self.num_layers): + new_hidden_var = layers.gather(new_hidden_array[i], + parent_idx) + layers.assign(new_hidden_var, pre_hidden_array[i]) + new_cell_var = layers.gather(new_cell_array[i], parent_idx) + layers.assign(new_cell_var, pre_cell_array[i]) + + length_cond = layers.less_than(x=step_idx, y=max_length) + layers.logical_and(x=length_cond, y=finished_cond, out=cond) + + tokens_with_eos = tokens + + all_seq = layers.concat([tokens_with_eos, finished_seq], axis=0) + all_score = layers.concat([pre_score, finished_scores], axis=0) + _, topk_index = layers.topk(all_score, k=beam_size) + topk_index = layers.reshape(topk_index, shape=[-1]) + final_seq = layers.gather(all_seq, topk_index) + final_score = layers.gather(all_score, topk_index) + + return final_seq + elif mode == 'greedy_search': + max_length = max_src_seq_len * 2 + #max_length = layers.fill_constant( [1], dtype='int32', value = 10) + pre_ids = layers.fill_constant([1, 1], dtype='int64', value=1) + full_ids = layers.fill_constant([1, 1], dtype='int64', value=1) + + score = layers.fill_constant([1], dtype='float32', value=0.0) + + eos_ids = layers.fill_constant([1, 1], dtype='int64', value=2) + + pre_hidden_array = [] + pre_cell_array = [] + pre_feed = layers.fill_constant( + [1, self.hidden_size], dtype='float32', value=0) + for i in range(self.num_layers): + pre_hidden_array.append(enc_last_hidden[i]) + 
pre_cell_array.append(enc_last_cell[i]) + #pre_hidden_array.append( layers.fill_constant( [1, hidden_size], dtype='float32', value=0) ) + #pre_cell_array.append( layers.fill_constant( [1, hidden_size], dtype='float32', value=0) ) + + step_idx = layers.fill_constant(shape=[1], dtype='int32', value=0) + cond = layers.less_than( + x=step_idx, y=max_length) # default force_cpu=True + while_op = layers.While(cond) + + with while_op.block(): + + dec_step_emb = layers.embedding( + input=pre_ids, + size=[self.tar_vocab_size, self.hidden_size], + dtype='float32', + is_sparse=False, + param_attr=fluid.ParamAttr( + name='target_embedding', + initializer=fluid.initializer.UniformInitializer( + low=-self.init_scale, high=self.init_scale))) + + dec_att_out, new_hidden_array, new_cell_array = decoder_step( + dec_step_emb, pre_feed, pre_hidden_array, pre_cell_array, + self.enc_output) + + projection = layers.matmul(dec_att_out, softmax_weight) + + logits = layers.softmax(projection) + logits = layers.log(logits) + + current_log = layers.elementwise_add(logits, score, axis=0) + + topk_score, topk_indices = layers.topk(input=current_log, k=1) + + new_ids = layers.concat([full_ids, topk_indices]) + layers.assign(new_ids, full_ids) + #layers.Print( full_ids, message="ful ids") + layers.assign(topk_score, score) + layers.assign(topk_indices, pre_ids) + layers.assign(dec_att_out, pre_feed) + for i in range(self.num_layers): + layers.assign(new_hidden_array[i], pre_hidden_array[i]) + layers.assign(new_cell_array[i], pre_cell_array[i]) + + layers.increment(x=step_idx, value=1.0, in_place=True) + + eos_met = layers.not_equal(topk_indices, eos_ids) + length_cond = layers.less_than(x=step_idx, y=max_length) + layers.logical_and(x=length_cond, y=eos_met, out=cond) + + return full_ids diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/base_model.py b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..bebfc2f86ccf46e61639ac4bb723a9de8b08a0eb --- /dev/null +++ b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/base_model.py @@ -0,0 +1,502 @@ +# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import paddle.fluid.layers as layers +import paddle.fluid as fluid +from paddle.fluid.layers.control_flow import StaticRNN as PaddingRNN +import numpy as np +from paddle.fluid import ParamAttr +from paddle.fluid.contrib.layers import basic_lstm, BasicLSTMUnit + +INF = 1. 
* 1e5 +alpha = 0.6 + + +class BaseModel(object): + def __init__(self, + hidden_size, + src_vocab_size, + tar_vocab_size, + batch_size, + num_layers=1, + init_scale=0.1, + dropout=None, + batch_first=True): + + self.hidden_size = hidden_size + self.src_vocab_size = src_vocab_size + self.tar_vocab_size = tar_vocab_size + self.batch_size = batch_size + self.num_layers = num_layers + self.init_scale = init_scale + self.dropout = dropout + self.batch_first = batch_first + + def _build_data(self): + self.src = layers.data(name="src", shape=[-1, 1, 1], dtype='int64') + self.src_sequence_length = layers.data( + name="src_sequence_length", shape=[-1], dtype='int32') + + self.tar = layers.data(name="tar", shape=[-1, 1, 1], dtype='int64') + self.tar_sequence_length = layers.data( + name="tar_sequence_length", shape=[-1], dtype='int32') + self.label = layers.data(name="label", shape=[-1, 1, 1], dtype='int64') + + def _emebdding(self): + self.src_emb = layers.embedding( + input=self.src, + size=[self.src_vocab_size, self.hidden_size], + dtype='float32', + is_sparse=False, + param_attr=fluid.ParamAttr( + name='source_embedding', + initializer=fluid.initializer.UniformInitializer( + low=-self.init_scale, high=self.init_scale))) + self.tar_emb = layers.embedding( + input=self.tar, + size=[self.tar_vocab_size, self.hidden_size], + dtype='float32', + is_sparse=False, + param_attr=fluid.ParamAttr( + name='target_embedding', + initializer=fluid.initializer.UniformInitializer( + low=-self.init_scale, high=self.init_scale))) + + def _build_encoder(self): + self.enc_output, enc_last_hidden, enc_last_cell = basic_lstm( self.src_emb, None, None, self.hidden_size, num_layers=self.num_layers, batch_first=self.batch_first, \ + dropout_prob=self.dropout, \ + param_attr = ParamAttr( initializer=fluid.initializer.UniformInitializer(low=-self.init_scale, high=self.init_scale) ), \ + bias_attr = ParamAttr( initializer = fluid.initializer.Constant(0.0) ), \ + sequence_length=self.src_sequence_length) + + return self.enc_output, enc_last_hidden, enc_last_cell + + def _build_decoder(self, + enc_last_hidden, + enc_last_cell, + mode='train', + beam_size=10): + softmax_weight = layers.create_parameter([self.hidden_size, self.tar_vocab_size], dtype="float32", name="softmax_weight", \ + default_initializer=fluid.initializer.UniformInitializer(low=-self.init_scale, high=self.init_scale)) + if mode == 'train': + dec_output, dec_last_hidden, dec_last_cell = basic_lstm( self.tar_emb, enc_last_hidden, enc_last_cell, \ + self.hidden_size, num_layers=self.num_layers, \ + batch_first=self.batch_first, \ + dropout_prob=self.dropout, \ + param_attr = ParamAttr( initializer=fluid.initializer.UniformInitializer(low=-self.init_scale, high=self.init_scale) ), \ + bias_attr = ParamAttr( initializer = fluid.initializer.Constant(0.0) )) + + dec_output = layers.matmul(dec_output, softmax_weight) + + return dec_output + elif mode == 'beam_search' or mode == 'greedy_search': + dec_unit_list = [] + name = 'basic_lstm' + for i in range(self.num_layers): + new_name = name + "_layers_" + str(i) + dec_unit_list.append( + BasicLSTMUnit( + new_name, self.hidden_size, dtype='float32')) + + def decoder_step(current_in, pre_hidden_array, pre_cell_array): + new_hidden_array = [] + new_cell_array = [] + + step_in = current_in + for i in range(self.num_layers): + pre_hidden = pre_hidden_array[i] + pre_cell = pre_cell_array[i] + + new_hidden, new_cell = dec_unit_list[i](step_in, pre_hidden, + pre_cell) + + new_hidden_array.append(new_hidden) + 
new_cell_array.append(new_cell) + + step_in = new_hidden + + return step_in, new_hidden_array, new_cell_array + + if mode == 'beam_search': + max_src_seq_len = layers.shape(self.src)[1] + max_length = max_src_seq_len * 2 + #max_length = layers.fill_constant( [1], dtype='int32', value = 10) + pre_ids = layers.fill_constant([1, 1], dtype='int64', value=1) + full_ids = layers.fill_constant([1, 1], dtype='int64', value=1) + + score = layers.fill_constant([1], dtype='float32', value=0.0) + + #eos_ids = layers.fill_constant( [1, 1], dtype='int64', value=2) + + pre_hidden_array = [] + pre_cell_array = [] + pre_feed = layers.fill_constant( + [beam_size, self.hidden_size], dtype='float32', value=0) + for i in range(self.num_layers): + pre_hidden_array.append( + layers.expand(enc_last_hidden[i], [beam_size, 1])) + pre_cell_array.append( + layers.expand(enc_last_cell[i], [beam_size, 1])) + + eos_ids = layers.fill_constant( + [beam_size], dtype='int64', value=2) + init_score = np.zeros((beam_size)).astype('float32') + init_score[1:] = -INF + pre_score = layers.assign(init_score) + #pre_score = layers.fill_constant( [1,], dtype='float32', value= 0.0) + tokens = layers.fill_constant( + [beam_size, 1], dtype='int64', value=1) + + enc_memory = layers.expand(self.enc_output, [beam_size, 1, 1]) + + pre_tokens = layers.fill_constant( + [beam_size, 1], dtype='int64', value=1) + + finished_seq = layers.fill_constant( + [beam_size, 1], dtype='int64', value=0) + finished_scores = layers.fill_constant( + [beam_size], dtype='float32', value=-INF) + finished_flag = layers.fill_constant( + [beam_size], dtype='float32', value=0.0) + + step_idx = layers.fill_constant( + shape=[1], dtype='int32', value=0) + cond = layers.less_than( + x=step_idx, y=max_length) # default force_cpu=True + + parent_idx = layers.fill_constant([1], dtype='int32', value=0) + while_op = layers.While(cond) + + def compute_topk_scores_and_seq(sequences, + scores, + scores_to_gather, + flags, + beam_size, + select_beam=None, + generate_id=None): + scores = layers.reshape(scores, shape=[1, -1]) + _, topk_indexs = layers.topk(scores, k=beam_size) + + topk_indexs = layers.reshape(topk_indexs, shape=[-1]) + + # gather result + + top_seq = layers.gather(sequences, topk_indexs) + topk_flags = layers.gather(flags, topk_indexs) + topk_gather_scores = layers.gather(scores_to_gather, + topk_indexs) + + if select_beam: + topk_beam = layers.gather(select_beam, topk_indexs) + else: + topk_beam = select_beam + + if generate_id: + topk_id = layers.gather(generate_id, topk_indexs) + else: + topk_id = generate_id + return top_seq, topk_gather_scores, topk_flags, topk_beam, topk_id + + def grow_alive(curr_seq, curr_scores, curr_log_probs, + curr_finished, select_beam, generate_id): + curr_scores += curr_finished * -INF + return compute_topk_scores_and_seq( + curr_seq, + curr_scores, + curr_log_probs, + curr_finished, + beam_size, + select_beam, + generate_id=generate_id) + + def grow_finished(finished_seq, finished_scores, finished_flag, + curr_seq, curr_scores, curr_finished): + finished_seq = layers.concat( + [ + finished_seq, layers.fill_constant( + [beam_size, 1], dtype='int64', value=1) + ], + axis=1) + curr_scores += (1.0 - curr_finished) * -INF + #layers.Print( curr_scores, message="curr scores") + curr_finished_seq = layers.concat( + [finished_seq, curr_seq], axis=0) + curr_finished_scores = layers.concat( + [finished_scores, curr_scores], axis=0) + curr_finished_flags = layers.concat( + [finished_flag, curr_finished], axis=0) + + return 
compute_topk_scores_and_seq( + curr_finished_seq, curr_finished_scores, + curr_finished_scores, curr_finished_flags, beam_size) + + def is_finished(alive_log_prob, finished_scores, + finished_in_finished): + + max_out_len = 200 + max_length_penalty = layers.pow(layers.fill_constant( + [1], dtype='float32', value=( + (5.0 + max_out_len) / 6.0)), + alpha) + + lower_bound_alive_score = layers.slice( + alive_log_prob, starts=[0], ends=[1], + axes=[0]) / max_length_penalty + + lowest_score_of_fininshed_in_finished = finished_scores * finished_in_finished + lowest_score_of_fininshed_in_finished += ( + 1.0 - finished_in_finished) * -INF + lowest_score_of_fininshed_in_finished = layers.reduce_min( + lowest_score_of_fininshed_in_finished) + + met = layers.less_than( + lower_bound_alive_score, + lowest_score_of_fininshed_in_finished) + met = layers.cast(met, 'float32') + bound_is_met = layers.reduce_sum(met) + + finished_eos_num = layers.reduce_sum(finished_in_finished) + + finish_cond = layers.less_than( + finished_eos_num, + layers.fill_constant( + [1], dtype='float32', value=beam_size)) + + return finish_cond + + def grow_top_k(step_idx, alive_seq, alive_log_prob, parant_idx): + pre_ids = alive_seq + + dec_step_emb = layers.embedding( + input=pre_ids, + size=[self.tar_vocab_size, self.hidden_size], + dtype='float32', + is_sparse=False, + param_attr=fluid.ParamAttr( + name='target_embedding', + initializer=fluid.initializer.UniformInitializer( + low=-self.init_scale, high=self.init_scale))) + + dec_att_out, new_hidden_array, new_cell_array = decoder_step( + dec_step_emb, pre_hidden_array, pre_cell_array) + + projection = layers.matmul(dec_att_out, softmax_weight) + + logits = layers.softmax(projection) + current_log = layers.elementwise_add( + x=layers.log(logits), y=alive_log_prob, axis=0) + base_1 = layers.cast(step_idx, 'float32') + 6.0 + base_1 /= 6.0 + length_penalty = layers.pow(base_1, alpha) + + len_pen = layers.pow((( + 5. 
+ layers.cast(step_idx + 1, 'float32')) / 6.), alpha) + + current_log = layers.reshape(current_log, shape=[1, -1]) + + current_log = current_log / length_penalty + topk_scores, topk_indices = layers.topk( + input=current_log, k=beam_size) + + topk_scores = layers.reshape(topk_scores, shape=[-1]) + + topk_log_probs = topk_scores * length_penalty + + generate_id = layers.reshape( + topk_indices, shape=[-1]) % self.tar_vocab_size + + selected_beam = layers.reshape( + topk_indices, shape=[-1]) // self.tar_vocab_size + + topk_finished = layers.equal(generate_id, eos_ids) + + topk_finished = layers.cast(topk_finished, 'float32') + + generate_id = layers.reshape(generate_id, shape=[-1, 1]) + + pre_tokens_list = layers.gather(tokens, selected_beam) + + full_tokens_list = layers.concat( + [pre_tokens_list, generate_id], axis=1) + + + return full_tokens_list, topk_log_probs, topk_scores, topk_finished, selected_beam, generate_id, \ + dec_att_out, new_hidden_array, new_cell_array + + with while_op.block(): + topk_seq, topk_log_probs, topk_scores, topk_finished, topk_beam, topk_generate_id, attention_out, new_hidden_array, new_cell_array = \ + grow_top_k( step_idx, pre_tokens, pre_score, parent_idx) + alive_seq, alive_log_prob, _, alive_beam, alive_id = grow_alive( + topk_seq, topk_scores, topk_log_probs, topk_finished, + topk_beam, topk_generate_id) + + finished_seq_2, finished_scores_2, finished_flags_2, _, _ = grow_finished( + finished_seq, finished_scores, finished_flag, topk_seq, + topk_scores, topk_finished) + + finished_cond = is_finished( + alive_log_prob, finished_scores_2, finished_flags_2) + + layers.increment(x=step_idx, value=1.0, in_place=True) + + layers.assign(alive_beam, parent_idx) + layers.assign(alive_id, pre_tokens) + layers.assign(alive_log_prob, pre_score) + layers.assign(alive_seq, tokens) + layers.assign(finished_seq_2, finished_seq) + layers.assign(finished_scores_2, finished_scores) + layers.assign(finished_flags_2, finished_flag) + + # update init_hidden, init_cell, input_feed + new_feed = layers.gather(attention_out, parent_idx) + layers.assign(new_feed, pre_feed) + for i in range(self.num_layers): + new_hidden_var = layers.gather(new_hidden_array[i], + parent_idx) + layers.assign(new_hidden_var, pre_hidden_array[i]) + new_cell_var = layers.gather(new_cell_array[i], + parent_idx) + layers.assign(new_cell_var, pre_cell_array[i]) + + length_cond = layers.less_than(x=step_idx, y=max_length) + layers.logical_and(x=length_cond, y=finished_cond, out=cond) + + tokens_with_eos = tokens + + all_seq = layers.concat([tokens_with_eos, finished_seq], axis=0) + all_score = layers.concat([pre_score, finished_scores], axis=0) + _, topk_index = layers.topk(all_score, k=beam_size) + topk_index = layers.reshape(topk_index, shape=[-1]) + final_seq = layers.gather(all_seq, topk_index) + final_score = layers.gather(all_score, topk_index) + + return final_seq + elif mode == 'greedy_search': + max_src_seq_len = layers.shape(self.src)[1] + max_length = max_src_seq_len * 2 + #max_length = layers.fill_constant( [1], dtype='int32', value = 10) + pre_ids = layers.fill_constant([1, 1], dtype='int64', value=1) + full_ids = layers.fill_constant([1, 1], dtype='int64', value=1) + + score = layers.fill_constant([1], dtype='float32', value=0.0) + + eos_ids = layers.fill_constant([1, 1], dtype='int64', value=2) + + pre_hidden_array = [] + pre_cell_array = [] + pre_feed = layers.fill_constant( + [1, self.hidden_size], dtype='float32', value=0) + for i in range(self.num_layers): + 
pre_hidden_array.append(enc_last_hidden[i])
+                    pre_cell_array.append(enc_last_cell[i])
+                    #pre_hidden_array.append( layers.fill_constant( [1, hidden_size], dtype='float32', value=0) )
+                    #pre_cell_array.append( layers.fill_constant( [1, hidden_size], dtype='float32', value=0) )
+
+                step_idx = layers.fill_constant(
+                    shape=[1], dtype='int32', value=0)
+                cond = layers.less_than(
+                    x=step_idx, y=max_length)  # default force_cpu=True
+                while_op = layers.While(cond)
+
+                with while_op.block():
+
+                    dec_step_emb = layers.embedding(
+                        input=pre_ids,
+                        size=[self.tar_vocab_size, self.hidden_size],
+                        dtype='float32',
+                        is_sparse=False,
+                        param_attr=fluid.ParamAttr(
+                            name='target_embedding',
+                            initializer=fluid.initializer.UniformInitializer(
+                                low=-self.init_scale, high=self.init_scale)))
+
+                    dec_att_out, new_hidden_array, new_cell_array = decoder_step(
+                        dec_step_emb, pre_hidden_array, pre_cell_array)
+
+                    projection = layers.matmul(dec_att_out, softmax_weight)
+
+                    logits = layers.softmax(projection)
+                    logits = layers.log(logits)
+
+                    current_log = layers.elementwise_add(logits, score, axis=0)
+
+                    topk_score, topk_indices = layers.topk(
+                        input=current_log, k=1)
+
+                    new_ids = layers.concat([full_ids, topk_indices])
+                    layers.assign(new_ids, full_ids)
+                    #layers.Print( full_ids, message="ful ids")
+                    layers.assign(topk_score, score)
+                    layers.assign(topk_indices, pre_ids)
+                    layers.assign(dec_att_out, pre_feed)
+                    for i in range(self.num_layers):
+                        layers.assign(new_hidden_array[i], pre_hidden_array[i])
+                        layers.assign(new_cell_array[i], pre_cell_array[i])
+
+                    layers.increment(x=step_idx, value=1.0, in_place=True)
+
+                    eos_met = layers.not_equal(topk_indices, eos_ids)
+                    length_cond = layers.less_than(x=step_idx, y=max_length)
+                    layers.logical_and(x=length_cond, y=eos_met, out=cond)
+
+                return full_ids
+
+            raise Exception("error")
+        else:
+            print("mode not support", mode)
+
+    def _compute_loss(self, dec_output):
+        loss = layers.softmax_with_cross_entropy(
+            logits=dec_output, label=self.label, soft_label=False)
+
+        loss = layers.reshape(loss, shape=[self.batch_size, -1])
+
+        max_tar_seq_len = layers.shape(self.tar)[1]
+        tar_mask = layers.sequence_mask(
+            self.tar_sequence_length, maxlen=max_tar_seq_len, dtype='float32')
+        loss = loss * tar_mask
+        loss = layers.reduce_mean(loss, dim=[0])
+        loss = layers.reduce_sum(loss)
+
+        loss.permissions = True
+
+        return loss
+
+    def _beam_search(self, enc_last_hidden, enc_last_cell):
+        pass
+
+    def build_graph(self, mode='train', beam_size=10):
+        if mode == 'train' or mode == 'eval':
+            self._build_data()
+            self._emebdding()
+            enc_output, enc_last_hidden, enc_last_cell = self._build_encoder()
+            dec_output = self._build_decoder(enc_last_hidden, enc_last_cell)
+
+            loss = self._compute_loss(dec_output)
+            return loss
+        elif mode == "beam_search" or mode == 'greedy_search':
+            self._build_data()
+            self._emebdding()
+            enc_output, enc_last_hidden, enc_last_cell = self._build_encoder()
+            dec_output = self._build_decoder(
+                enc_last_hidden, enc_last_cell, mode=mode, beam_size=beam_size)
+
+            return dec_output
+        else:
+            print("not support mode ", mode)
+            raise Exception("not support mode: " + mode)
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/data/download_en-vi.sh b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/data/download_en-vi.sh
new file mode 100755
index 0000000000000000000000000000000000000000..ae61044bcd34b84c35cf252871535be2fecb7a2e
--- /dev/null
+++ b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/data/download_en-vi.sh
@@ -0,0 +1,33 @@
+#!/bin/sh
+# IWSLT15 English-Vietnamese is a small dataset containing 133k parallel sentence pairs.
+# This script downloads the data from the Stanford NMT website.
+#
+# Usage:
+#   ./download_en-vi.sh output_path
+#
+# If output_path is not specified, a dir named "./en-vi" will be created and used as
+# the output path.
+
+set -ex
+OUTPUT_PATH="${1:-en-vi}"
+SITE_PATH="https://nlp.stanford.edu/projects/nmt/data"
+
+mkdir -v -p $OUTPUT_PATH
+
+# Download the IWSLT15 small dataset from the Stanford website.
+echo "Begin to download training dataset train.en and train.vi."
+wget "$SITE_PATH/iwslt15.en-vi/train.en" -O "$OUTPUT_PATH/train.en"
+wget "$SITE_PATH/iwslt15.en-vi/train.vi" -O "$OUTPUT_PATH/train.vi"
+
+echo "Begin to download dev dataset tst2012.en and tst2012.vi."
+wget "$SITE_PATH/iwslt15.en-vi/tst2012.en" -O "$OUTPUT_PATH/tst2012.en"
+wget "$SITE_PATH/iwslt15.en-vi/tst2012.vi" -O "$OUTPUT_PATH/tst2012.vi"
+
+echo "Begin to download test dataset tst2013.en and tst2013.vi."
+wget "$SITE_PATH/iwslt15.en-vi/tst2013.en" -O "$OUTPUT_PATH/tst2013.en"
+wget "$SITE_PATH/iwslt15.en-vi/tst2013.vi" -O "$OUTPUT_PATH/tst2013.vi"
+
+echo "Begin to download vocab file vocab.en and vocab.vi."
+wget "$SITE_PATH/iwslt15.en-vi/vocab.en" -O "$OUTPUT_PATH/vocab.en"
+wget "$SITE_PATH/iwslt15.en-vi/vocab.vi" -O "$OUTPUT_PATH/vocab.vi"
+
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/images/bi_rnn.png b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/images/bi_rnn.png
deleted file mode 100644
index 9d8efd50a49d0305586f550344472ab94c93bed3..0000000000000000000000000000000000000000
Binary files a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/images/bi_rnn.png and /dev/null differ
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/images/decoder_attention.png b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/images/decoder_attention.png
deleted file mode 100644
index 1b355e7786d25487a3f564af758c2c52c43b4690..0000000000000000000000000000000000000000
Binary files a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/images/decoder_attention.png and /dev/null differ
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/images/encoder_attention.png b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/images/encoder_attention.png
deleted file mode 100644
index 28d7a15a3bd65262bde22a3f41b5aa78b46b368a..0000000000000000000000000000000000000000
Binary files a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/images/encoder_attention.png and /dev/null differ
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/infer.py b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/infer.py
index f042d5ef63f602ab1d892790c2826c459ef83e5e..9ac4c73e0a289a99267e2c6166b1bb06ff8430db 100644
--- a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/infer.py
+++ b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/infer.py
@@ -17,120 +17,146 @@ from __future__ import division
 from __future__ import print_function

 import numpy as np
+import time
 import os
-import six
+import random
+
+import math

 import paddle
 import paddle.fluid as fluid
 import paddle.fluid.framework as framework
 from paddle.fluid.executor import Executor
-from paddle.fluid.contrib.decoder.beam_search_decoder import *
+
+import reader
+
+import sys
+if sys.version[0] == '2':
+    reload(sys)
+    sys.setdefaultencoding("utf-8")
+import os
 from args import *
-#from . import lm_model
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/infer.py b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/infer.py
index f042d5ef63f602ab1d892790c2826c459ef83e5e..9ac4c73e0a289a99267e2c6166b1bb06ff8430db 100644
--- a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/infer.py
+++ b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/infer.py
@@ -17,120 +17,146 @@ from __future__ import division
 from __future__ import print_function
 
 import numpy as np
+import time
 import os
-import six
+import random
+
+import math
 
 import paddle
 import paddle.fluid as fluid
 import paddle.fluid.framework as framework
 from paddle.fluid.executor import Executor
-from paddle.fluid.contrib.decoder.beam_search_decoder import *
+
+import reader
+
+import sys
+if sys.version[0] == '2':
+    reload(sys)
+    sys.setdefaultencoding("utf-8")
+import os
 
 from args import *
-import attention_model
-import no_attention_model
+#from . import lm_model
+import logging
+import pickle
+
+from attention_model import AttentionModel
+
+from base_model import BaseModel
+
+SEED = 123
 
-def infer():
+
+def train():
     args = parse_args()
+
+    num_layers = args.num_layers
+    src_vocab_size = args.src_vocab_size
+    tar_vocab_size = args.tar_vocab_size
+    batch_size = args.batch_size
+    dropout = args.dropout
+    init_scale = args.init_scale
+    max_grad_norm = args.max_grad_norm
+    hidden_size = args.hidden_size
+    # inference process
+
+    print("src vocab size", src_vocab_size)
 
-    # Inference
-    if args.no_attention:
-        translation_ids, translation_scores, feed_order = \
-            no_attention_model.seq_to_seq_net(
-                args.embedding_dim,
-                args.encoder_size,
-                args.decoder_size,
-                args.dict_size,
-                args.dict_size,
-                True,
-                beam_size=args.beam_size,
-                max_length=args.max_length)
+    # Dropout here uses the upscale_in_train implementation, so it can be
+    # disabled at inference time simply by setting the dropout rate to 0.
+    if args.attention:
+        model = AttentionModel(
+            hidden_size,
+            src_vocab_size,
+            tar_vocab_size,
+            batch_size,
+            num_layers=num_layers,
+            init_scale=init_scale,
+            dropout=0.0)
     else:
-        translation_ids, translation_scores, feed_order = \
-            attention_model.seq_to_seq_net(
-                args.embedding_dim,
-                args.encoder_size,
-                args.decoder_size,
-                args.dict_size,
-                args.dict_size,
-                True,
-                beam_size=args.beam_size,
-                max_length=args.max_length)
-
-    test_batch_generator = paddle.batch(
-        paddle.reader.shuffle(
-            paddle.dataset.wmt14.test(args.dict_size), buf_size=1000),
-        batch_size=args.batch_size,
-        drop_last=False)
+        model = BaseModel(
+            hidden_size,
+            src_vocab_size,
+            tar_vocab_size,
+            batch_size,
+            num_layers=num_layers,
+            init_scale=init_scale,
+            dropout=0.0)
+
+    beam_size = args.beam_size
+    trans_res = model.build_graph(mode='beam_search', beam_size=beam_size)
+    # clone from default main program and use it as the validation program
+    main_program = fluid.default_main_program()
 
     place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
     exe = Executor(place)
     exe.run(framework.default_startup_program())
 
-    model_path = os.path.join(args.save_dir, str(args.pass_num))
-    fluid.io.load_persistables(
-        executor=exe,
-        dirname=model_path,
-        main_program=framework.default_main_program())
-
-    src_dict, trg_dict = paddle.dataset.wmt14.get_dict(args.dict_size)
-
-    feed_list = [
-        framework.default_main_program().global_block().var(var_name)
-        for var_name in feed_order[0:1]
-    ]
-    feeder = fluid.DataFeeder(feed_list, place)
-
-    for batch_id, data in enumerate(test_batch_generator()):
-        # The value of batch_size may vary in the last batch
-        batch_size = len(data)
-
-        # Setup initial ids and scores lod tensor
-        init_ids_data = np.array([0 for _ in range(batch_size)], dtype='int64')
-        init_scores_data = np.array(
-            [1. for _ in range(batch_size)], dtype='float32')
-        init_ids_data = init_ids_data.reshape((batch_size, 1))
-        init_scores_data = init_scores_data.reshape((batch_size, 1))
-        init_recursive_seq_lens = [1] * batch_size
-        init_recursive_seq_lens = [
-            init_recursive_seq_lens, init_recursive_seq_lens
-        ]
-        init_ids = fluid.create_lod_tensor(init_ids_data,
-                                           init_recursive_seq_lens, place)
-        init_scores = fluid.create_lod_tensor(init_scores_data,
-                                              init_recursive_seq_lens, place)
-
-        # Feed dict for inference
-        feed_dict = feeder.feed([[x[0]] for x in data])
-        feed_dict['init_ids'] = init_ids
-        feed_dict['init_scores'] = init_scores
-
-        fetch_outs = exe.run(framework.default_main_program(),
-                             feed=feed_dict,
-                             fetch_list=[translation_ids, translation_scores],
-                             return_numpy=False)
-
-        # Split the output words by lod levels
-        lod_level_1 = fetch_outs[0].lod()[1]
-        token_array = np.array(fetch_outs[0])
-        result = []
-        for i in six.moves.xrange(len(lod_level_1) - 1):
-            sentence_list = [
-                trg_dict[token]
-                for token in token_array[lod_level_1[i]:lod_level_1[i + 1]]
-            ]
-            sentence = " ".join(sentence_list[1:-1])
-            result.append(sentence)
-        lod_level_0 = fetch_outs[0].lod()[0]
-        paragraphs = [
-            result[lod_level_0[i]:lod_level_0[i + 1]]
-            for i in six.moves.xrange(len(lod_level_0) - 1)
-        ]
-
-        for paragraph in paragraphs:
-            print(paragraph)
+    source_vocab_file = args.vocab_prefix + "." + args.src_lang
+    infer_file = args.infer_file
+
+    infer_data = reader.raw_mono_data(source_vocab_file, infer_file)
+
+    def prepare_input(batch, epoch_id=0, with_lr=True):
+        src_ids, src_mask, tar_ids, tar_mask = batch
+        res = {}
+        src_ids = src_ids.reshape((src_ids.shape[0], src_ids.shape[1], 1))
+        in_tar = tar_ids[:, :-1]
+        label_tar = tar_ids[:, 1:]
+
+        in_tar = in_tar.reshape((in_tar.shape[0], in_tar.shape[1], 1))
+        in_tar = np.zeros_like(in_tar, dtype='int64')
+        label_tar = label_tar.reshape(
+            (label_tar.shape[0], label_tar.shape[1], 1))
+        label_tar = np.zeros_like(label_tar, dtype='int64')
+
+        res['src'] = src_ids
+        res['tar'] = in_tar
+        res['label'] = label_tar
+        res['src_sequence_length'] = src_mask
+        res['tar_sequence_length'] = tar_mask
+
+        return res, np.sum(tar_mask)
+
+    dir_name = args.reload_model
+    print("dir name", dir_name)
+    fluid.io.load_params(exe, dir_name)
+
+    train_data_iter = reader.get_data_iter(infer_data, 1, mode='eval')
+
+    tar_id2vocab = []
+    tar_vocab_file = args.vocab_prefix + "." + args.tar_lang
+    with open(tar_vocab_file, "r") as f:
+        for line in f.readlines():
+            tar_id2vocab.append(line.strip())
+
+    infer_output_file = args.infer_output_file
+
+    out_file = open(infer_output_file, 'w')
+
+    for batch_id, batch in enumerate(train_data_iter):
+        input_data_feed, word_num = prepare_input(batch, epoch_id=0)
+
+        fetch_outs = exe.run(feed=input_data_feed,
+                             fetch_list=[trans_res.name],
+                             use_program_cache=False)
+
+        res = [tar_id2vocab[e] for e in fetch_outs[0].reshape(-1)]
+
+        res = res[1:]
+
+        new_res = []
+        for ele in res:
+            if ele == "</s>":
+                break
+            new_res.append(ele)
+
+        out_file.write(' '.join(new_res))
+        out_file.write('\n')
+
+    out_file.close()
 
 if __name__ == '__main__':
-    infer()
+    train()
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/infer.sh b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/infer.sh
new file mode 100644
index 0000000000000000000000000000000000000000..ffd48da6adc4ecc080f504d3b6a6a244fa6eedd9
--- /dev/null
+++ b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/infer.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+
+set -ex
+export CUDA_VISIBLE_DEVICES=0
+
+python infer.py \
+    --src_lang en --tar_lang vi \
+    --attention True \
+    --num_layers 2 \
+    --hidden_size 512 \
+    --src_vocab_size 17191 \
+    --tar_vocab_size 7709 \
+    --batch_size 128 \
+    --dropout 0.2 \
+    --init_scale  0.1 \
+    --max_grad_norm 5.0 \
+    --vocab_prefix data/en-vi/vocab \
+    --infer_file data/en-vi/tst2013.en \
+    --reload_model ./model/epoch_10 \
+    --use_gpu True
+
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/no_attention_model.py b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/no_attention_model.py
deleted file mode 100644
index 57e7dbe42ad37bbd5d4c85ab4d58b2e1dd3d961b..0000000000000000000000000000000000000000
--- a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/no_attention_model.py
+++ /dev/null
@@ -1,127 +0,0 @@
-# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import paddle.fluid.layers as layers
-from paddle.fluid.contrib.decoder.beam_search_decoder import *
-
-
-def seq_to_seq_net(embedding_dim, encoder_size, decoder_size, source_dict_dim,
-                   target_dict_dim, is_generating, beam_size, max_length):
-    def encoder():
-        # Encoder implementation of RNN translation
-        src_word = layers.data(
-            name="src_word", shape=[1], dtype='int64', lod_level=1)
-        src_embedding = layers.embedding(
-            input=src_word,
-            size=[source_dict_dim, embedding_dim],
-            dtype='float32',
-            is_sparse=True)
-
-        fc1 = layers.fc(input=src_embedding, size=encoder_size * 4, act='tanh')
-        lstm_hidden0, lstm_0 = layers.dynamic_lstm(
-            input=fc1, size=encoder_size * 4)
-        encoder_out = layers.sequence_last_step(input=lstm_hidden0)
-        return encoder_out
-
-    def decoder_state_cell(context):
-        # Decoder state cell, specifies the hidden state variable and its updater
-        h = InitState(init=context, need_reorder=True)
-        state_cell = StateCell(
-            inputs={'x': None}, states={'h': h}, out_state='h')
-
-        @state_cell.state_updater
-        def updater(state_cell):
-            current_word = state_cell.get_input('x')
-            prev_h = state_cell.get_state('h')
-            # make sure lod of h heritted from prev_h
-            h = layers.fc(input=[prev_h, current_word],
-                          size=decoder_size,
-                          act='tanh')
-            state_cell.set_state('h', h)
-
-        return state_cell
-
-    def decoder_train(state_cell):
-        # Decoder for training implementation of RNN translation
-        trg_word = layers.data(
-            name="target_word", shape=[1], dtype='int64', lod_level=1)
-        trg_embedding = layers.embedding(
-            input=trg_word,
-            size=[target_dict_dim, embedding_dim],
-            dtype='float32',
-            is_sparse=True)
-
-        # A training decoder
-        decoder = TrainingDecoder(state_cell)
-
-        # Define the computation in each RNN step done by decoder
-        with decoder.block():
-            current_word = decoder.step_input(trg_embedding)
-            decoder.state_cell.compute_state(inputs={'x': current_word})
-            current_score = layers.fc(input=decoder.state_cell.get_state('h'),
-                                      size=target_dict_dim,
-                                      act='softmax')
-            decoder.state_cell.update_states()
-            decoder.output(current_score)
-
-        return decoder()
-
-    def decoder_infer(state_cell):
-        # Decoder for inference implementation
-        init_ids = layers.data(
-            name="init_ids", shape=[1], dtype="int64", lod_level=2)
-        init_scores = layers.data(
-            name="init_scores", shape=[1], dtype="float32", lod_level=2)
-
-        # A beam search decoder for inference
-        decoder = BeamSearchDecoder(
-            state_cell=state_cell,
-            init_ids=init_ids,
-            init_scores=init_scores,
-            target_dict_dim=target_dict_dim,
-            word_dim=embedding_dim,
-            input_var_dict={},
-            topk_size=50,
-            sparse_emb=True,
-            max_len=max_length,
-            beam_size=beam_size,
-            end_id=1,
-            name=None)
-        decoder.decode()
-        translation_ids, translation_scores = decoder()
-
-        return translation_ids, translation_scores
-
-    context = encoder()
-    state_cell = decoder_state_cell(context)
-
-    if not is_generating:
-        label = layers.data(
-            name="target_next_word", shape=[1], dtype='int64', lod_level=1)
-
-        rnn_out = decoder_train(state_cell)
-
-        cost = layers.cross_entropy(input=rnn_out, label=label)
-        avg_cost = layers.mean(x=cost)
-
-        feeding_list = ['src_word', 'target_word', 'target_next_word']
-        return avg_cost, feeding_list
-    else:
-        translation_ids, translation_scores = decoder_infer(state_cell)
-        feeding_list = ['src_word']
-        return translation_ids, translation_scores, feeding_list
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/reader.py b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/reader.py
new file mode 100644
index 0000000000000000000000000000000000000000..258d042021f2c82e1043d1281fd73d13bcda4aac
--- /dev/null
+++ b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/reader.py
@@ -0,0 +1,210 @@
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Utilities for reading and batching the NMT data files."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import collections
+import os
+import sys
+import numpy as np
+
+Py3 = sys.version_info[0] == 3
+
+UNK_ID = 0
+
+
+def _read_words(filename):
+    with open(filename, "r") as f:
+        if Py3:
+            return f.read().replace("\n", "").split()
+        else:
+            return f.read().decode("utf-8").replace("\n", "").split()
+
+
+def read_all_line(filename):
+    data = []
+    with open(filename, "r") as f:
+        for line in f.readlines():
+            data.append(line.strip())
+    return data
+
+
+def _build_vocab(filename):
+
+    vocab_dict = {}
+    ids = 0
+    with open(filename, "r") as f:
+        for line in f.readlines():
+            vocab_dict[line.strip()] = ids
+            ids += 1
+
+    print("vocab word num", ids)
+
+    return vocab_dict
+
+
+def _para_file_to_ids(src_file, tar_file, src_vocab, tar_vocab):
+
+    src_data = []
+    with open(src_file, "r") as f_src:
+        for line in f_src.readlines():
+            arra = line.strip().split()
+            ids = [src_vocab[w] if w in src_vocab else UNK_ID for w in arra]
+
+            src_data.append(ids)
+
+    tar_data = []
+    with open(tar_file, "r") as f_tar:
+        for line in f_tar.readlines():
+            arra = line.strip().split()
+            ids = [tar_vocab[w] if w in tar_vocab else UNK_ID for w in arra]
+
+            ids = [1] + ids + [2]
+
+            tar_data.append(ids)
+
+    return src_data, tar_data
+
+
+def filter_len(src, tar, max_sequence_len=50):
+    new_src = []
+    new_tar = []
+
+    for id1, id2 in zip(src, tar):
+        if len(id1) > max_sequence_len:
+            id1 = id1[:max_sequence_len]
+        if len(id2) > max_sequence_len + 2:
+            id2 = id2[:max_sequence_len + 2]
+
+        new_src.append(id1)
+        new_tar.append(id2)
+
+    return new_src, new_tar
+
+
+def raw_data(src_lang,
+             tar_lang,
+             vocab_prefix,
+             train_prefix,
+             eval_prefix,
+             test_prefix,
+             max_sequence_len=50):
+
+    src_vocab_file = vocab_prefix + "." + src_lang
+    tar_vocab_file = vocab_prefix + "." + tar_lang
+
+    src_train_file = train_prefix + "." + src_lang
+    tar_train_file = train_prefix + "." + tar_lang
+
+    src_eval_file = eval_prefix + "." + src_lang
+    tar_eval_file = eval_prefix + "." + tar_lang
+
+    src_test_file = test_prefix + "." + src_lang
+    tar_test_file = test_prefix + "." + tar_lang
+
+    src_vocab = _build_vocab(src_vocab_file)
+    tar_vocab = _build_vocab(tar_vocab_file)
+
+    train_src, train_tar = _para_file_to_ids( src_train_file, tar_train_file, \
+                                              src_vocab, tar_vocab )
+    train_src, train_tar = filter_len(
+        train_src, train_tar, max_sequence_len=max_sequence_len)
+    eval_src, eval_tar = _para_file_to_ids( src_eval_file, tar_eval_file, \
+                                            src_vocab, tar_vocab )
+
+    test_src, test_tar = _para_file_to_ids( src_test_file, tar_test_file, \
+                                            src_vocab, tar_vocab )
+
+    return ( train_src, train_tar), (eval_src, eval_tar), (test_src, test_tar),\
+           (src_vocab, tar_vocab)
+
+
+def raw_mono_data(vocab_file, file_path):
+
+    src_vocab = _build_vocab(vocab_file)
+
+    test_src, test_tar = _para_file_to_ids( file_path, file_path, \
+                                            src_vocab, src_vocab )
+
+    return (test_src, test_tar)
+
+
+def get_data_iter(raw_data, batch_size, mode='train'):
+
+    src_data, tar_data = raw_data
+
+    data_len = len(src_data)
+
+    index = np.arange(data_len)
+    if mode == "train":
+        np.random.shuffle(index)
+
+    def to_pad_np(data, source=False):
+        max_len = 0
+        for ele in data:
+            if len(ele) > max_len:
+                max_len = len(ele)
+
+        ids = np.ones((batch_size, max_len), dtype='int64') * 2
+        mask = np.zeros((batch_size), dtype='int32')
+
+        for i, ele in enumerate(data):
+            ids[i, :len(ele)] = ele
+            if not source:
+                mask[i] = len(ele) - 1
+            else:
+                mask[i] = len(ele)
+
+        return ids, mask
+
+    b_src = []
+
+    cache_num = 20
+    if mode != "train":
+        cache_num = 1
+    for j in range(data_len):
+        if len(b_src) == batch_size * cache_num:
+            # build batches from the cache: sort by source length, then split
+            new_cache = sorted(b_src, key=lambda k: len(k[0]))
+
+            for i in range(cache_num):
+                batch_data = new_cache[i * batch_size:(i + 1) * batch_size]
+                src_cache = [w[0] for w in batch_data]
+                tar_cache = [w[1] for w in batch_data]
+                src_ids, src_mask = to_pad_np(src_cache, source=True)
+                tar_ids, tar_mask = to_pad_np(tar_cache)
+
+                yield (src_ids, src_mask, tar_ids, tar_mask)
+
+            b_src = []
+
+        b_src.append((src_data[index[j]], tar_data[index[j]]))
+    if len(b_src) == batch_size * cache_num:
+        new_cache = sorted(b_src, key=lambda k: len(k[0]))
+
+        for i in range(cache_num):
+            batch_data = new_cache[i * batch_size:(i + 1) * batch_size]
+            src_cache = [w[0] for w in batch_data]
+            tar_cache = [w[1] for w in batch_data]
+            src_ids, src_mask = to_pad_np(src_cache, source=True)
+            tar_ids, tar_mask = to_pad_np(tar_cache)
+
+            yield (src_ids, src_mask, tar_ids, tar_mask)
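get_data_iter above buffers cache_num batches worth of sentence pairs, sorts the buffer by source length so that each batch pads to similar lengths, and pads with id 2 (the end token). A condensed sketch of that bucketing idea; `bucket_batches` is a made-up helper, not the module's API:

```
import numpy as np

def bucket_batches(seqs, batch_size=2, cache_num=2, pad_id=2):
    # Collect batch_size * cache_num sequences, sort by length, then pad
    # each batch to its own max length instead of the global max.
    cache = sorted(seqs[:batch_size * cache_num], key=len)
    for i in range(cache_num):
        batch = cache[i * batch_size:(i + 1) * batch_size]
        max_len = max(len(s) for s in batch)
        ids = np.full((batch_size, max_len), pad_id, dtype='int64')
        for r, s in enumerate(batch):
            ids[r, :len(s)] = s
        yield ids

toy = [[1, 7, 2], [1, 5, 6, 7, 2], [1, 4, 2], [1, 9, 9, 9, 9, 2]]
for b in bucket_batches(toy):
    print(b.shape)  # (2, 3) then (2, 6): short sentences padded together
```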
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/run.sh b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/run.sh
new file mode 100644
index 0000000000000000000000000000000000000000..cf48282deea544f5ca2d233a4af7336fb664cc65
--- /dev/null
+++ b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/run.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+set -ex
+export CUDA_VISIBLE_DEVICES=0
+
+python train.py \
+    --src_lang en --tar_lang vi \
+    --attention True \
+    --num_layers 2 \
+    --hidden_size 512 \
+    --src_vocab_size 17191 \
+    --tar_vocab_size 7709 \
+    --batch_size 128 \
+    --dropout 0.2 \
+    --init_scale  0.1 \
+    --max_grad_norm 5.0 \
+    --train_data_prefix data/en-vi/train \
+    --eval_data_prefix data/en-vi/tst2012 \
+    --test_data_prefix data/en-vi/tst2013 \
+    --vocab_prefix data/en-vi/vocab \
+    --use_gpu True
+
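The train.py changes that follow report perplexity rather than raw cost: per-batch losses (batch-averaged, time-summed cross-entropy) are accumulated weighted by batch size, normalized by the number of target words, and exponentiated. A toy illustration of that bookkeeping, with made-up numbers:

```
import numpy as np

# Per-batch tuples of (loss, batch_size, target word count), mirroring how
# total_loss and word_count accumulate in train() and eval() below.
batches = [(120.5, 32, 900), (118.0, 32, 870)]

total_loss, word_count = 0.0, 0.0
for cost, batch_size, word_num in batches:
    total_loss += cost * batch_size
    word_count += word_num

ppl = np.exp(total_loss / word_count)
print("ppl %.2f" % ppl)  # e raised to the average per-word loss
```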
diff --git a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/train.py b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/train.py
index fbb93eab1f0eba05d30b75957b7fb18e42592375..60053c4f3c05a2e7cc6b80d69d1efa87b40cff5d 100644
--- a/PaddleNLP/unarchived/neural_machine_translation/rnn_search/train.py
+++ b/PaddleNLP/unarchived/neural_machine_translation/rnn_search/train.py
@@ -19,150 +19,175 @@ from __future__ import print_function
 
 import numpy as np
 import time
 import os
+import random
+
+import math
 
 import paddle
 import paddle.fluid as fluid
 import paddle.fluid.framework as framework
 from paddle.fluid.executor import Executor
-from paddle.fluid.contrib.decoder.beam_search_decoder import *
+
+import reader
+
+import sys
+if sys.version[0] == '2':
+    reload(sys)
+    sys.setdefaultencoding("utf-8")
+import os
 
 from args import *
-import attention_model
-import no_attention_model
+from base_model import BaseModel
+from attention_model import AttentionModel
+import logging
+import pickle
+
+SEED = 123
 
 
 def train():
     args = parse_args()
-    if args.enable_ce:
-        framework.default_startup_program().random_seed = 111
-
+    num_layers = args.num_layers
+    src_vocab_size = args.src_vocab_size
+    tar_vocab_size = args.tar_vocab_size
+    batch_size = args.batch_size
+    dropout = args.dropout
+    init_scale = args.init_scale
+    max_grad_norm = args.max_grad_norm
+    hidden_size = args.hidden_size
     # Training process
-    if args.no_attention:
-        avg_cost, feed_order = no_attention_model.seq_to_seq_net(
-            args.embedding_dim,
-            args.encoder_size,
-            args.decoder_size,
-            args.dict_size,
-            args.dict_size,
-            False,
-            beam_size=args.beam_size,
-            max_length=args.max_length)
+    if args.attention:
+        model = AttentionModel(
+            hidden_size,
+            src_vocab_size,
+            tar_vocab_size,
+            batch_size,
+            num_layers=num_layers,
+            init_scale=init_scale,
+            dropout=dropout)
     else:
-        avg_cost, feed_order = attention_model.seq_to_seq_net(
-            args.embedding_dim,
-            args.encoder_size,
-            args.decoder_size,
-            args.dict_size,
-            args.dict_size,
-            False,
-            beam_size=args.beam_size,
-            max_length=args.max_length)
+        model = BaseModel(
+            hidden_size,
+            src_vocab_size,
+            tar_vocab_size,
+            batch_size,
+            num_layers=num_layers,
+            init_scale=init_scale,
+            dropout=dropout)
+
+    loss = model.build_graph()
 
     # clone from default main program and use it as the validation program
     main_program = fluid.default_main_program()
-    inference_program = fluid.default_main_program().clone()
-
-    optimizer = fluid.optimizer.Adam(
-        learning_rate=args.learning_rate,
-        regularization=fluid.regularizer.L2DecayRegularizer(
-            regularization_coeff=1e-5))
-
-    optimizer.minimize(avg_cost)
-
-    # Disable shuffle for Continuous Evaluation only
-    if not args.enable_ce:
-        train_batch_generator = paddle.batch(
-            paddle.reader.shuffle(
-                paddle.dataset.wmt14.train(args.dict_size), buf_size=1000),
-            batch_size=args.batch_size,
-            drop_last=False)
-
-        test_batch_generator = paddle.batch(
-            paddle.reader.shuffle(
-                paddle.dataset.wmt14.test(args.dict_size), buf_size=1000),
-            batch_size=args.batch_size,
-            drop_last=False)
+    inference_program = fluid.default_main_program().clone(for_test=True)
+
+    fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByGlobalNorm(
+        clip_norm=max_grad_norm))
+
+    lr = args.learning_rate
+    opt_type = args.optimizer
+    if opt_type == "sgd":
+        optimizer = fluid.optimizer.SGD(lr)
+    elif opt_type == "adam":
+        optimizer = fluid.optimizer.Adam(lr)
     else:
-        train_batch_generator = paddle.batch(
-            paddle.dataset.wmt14.train(args.dict_size),
-            batch_size=args.batch_size,
-            drop_last=False)
+        print("only sgd and adam optimizers are supported")
+        raise Exception("optimizer type not supported")
 
-        test_batch_generator = paddle.batch(
-            paddle.dataset.wmt14.test(args.dict_size),
-            batch_size=args.batch_size,
-            drop_last=False)
+    optimizer.minimize(loss)
 
     place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
     exe = Executor(place)
     exe.run(framework.default_startup_program())
 
-    feed_list = [
-        main_program.global_block().var(var_name) for var_name in feed_order
-    ]
-    feeder = fluid.DataFeeder(feed_list, place)
-
-    def validation():
-        # Use test set as validation each pass
+    train_data_prefix = args.train_data_prefix
+    eval_data_prefix = args.eval_data_prefix
+    test_data_prefix = args.test_data_prefix
+    vocab_prefix = args.vocab_prefix
+    src_lang = args.src_lang
+    tar_lang = args.tar_lang
+    print("begin to load data")
+    raw_data = reader.raw_data(src_lang, tar_lang, vocab_prefix,
+                               train_data_prefix, eval_data_prefix,
+                               test_data_prefix, args.max_len)
+    print("finished load data")
+    train_data, valid_data, test_data, _ = raw_data
+
+    def prepare_input(batch, epoch_id=0, with_lr=True):
+        src_ids, src_mask, tar_ids, tar_mask = batch
+        res = {}
+        src_ids = src_ids.reshape((src_ids.shape[0], src_ids.shape[1], 1))
+        in_tar = tar_ids[:, :-1]
+        label_tar = tar_ids[:, 1:]
+
+        in_tar = in_tar.reshape((in_tar.shape[0], in_tar.shape[1], 1))
+        label_tar = label_tar.reshape(
+            (label_tar.shape[0], label_tar.shape[1], 1))
+
+        res['src'] = src_ids
+        res['tar'] = in_tar
+        res['label'] = label_tar
+        res['src_sequence_length'] = src_mask
+        res['tar_sequence_length'] = tar_mask
+
+        return res, np.sum(tar_mask)
+
+    # get train epoch size
+    def eval(data, epoch_id=0):
+        eval_data_iter = reader.get_data_iter(data, batch_size, mode='eval')
         total_loss = 0.0
-        count = 0
-        val_feed_list = [
-            inference_program.global_block().var(var_name)
-            for var_name in feed_order
-        ]
-        val_feeder = fluid.DataFeeder(val_feed_list, place)
-
-        for batch_id, data in enumerate(test_batch_generator()):
-            val_fetch_outs = exe.run(inference_program,
-                                     feed=val_feeder.feed(data),
-                                     fetch_list=[avg_cost],
-                                     return_numpy=False)
-
-            total_loss += np.array(val_fetch_outs[0])[0]
-            count += 1
-
-        return total_loss / count
-
-    for pass_id in range(1, args.pass_num + 1):
-        pass_start_time = time.time()
-        words_seen = 0
-        for batch_id, data in enumerate(train_batch_generator()):
-            words_seen += len(data) * 2
-
-            fetch_outs = exe.run(framework.default_main_program(),
-                                 feed=feeder.feed(data),
-                                 fetch_list=[avg_cost])
-
-            avg_cost_train = np.array(fetch_outs[0])
-            print('pass_id=%d, batch_id=%d, train_loss: %f' %
-                  (pass_id, batch_id, avg_cost_train))
-            # This is for continuous evaluation only
-            if args.enable_ce and batch_id >= 100:
-                break
-
-        pass_end_time = time.time()
-        test_loss = validation()
-        time_consumed = pass_end_time - pass_start_time
-        words_per_sec = words_seen / time_consumed
-        print("pass_id=%d, test_loss: %f, words/s: %f, sec/pass: %f" %
-              (pass_id, test_loss, words_per_sec, time_consumed))
-
-        # This log is for continuous evaluation only
-        if args.enable_ce:
-            print("kpis\ttrain_cost\t%f" % avg_cost_train)
-            print("kpis\ttest_cost\t%f" % test_loss)
-            print("kpis\ttrain_duration\t%f" % time_consumed)
-
-        if pass_id % args.save_interval == 0:
-            model_path = os.path.join(args.save_dir, str(pass_id))
-            if not os.path.isdir(model_path):
-                os.makedirs(model_path)
-
-            fluid.io.save_persistables(
-                executor=exe,
-                dirname=model_path,
-                main_program=framework.default_main_program())
+        word_count = 0.0
+        for batch_id, batch in enumerate(eval_data_iter):
+            input_data_feed, word_num = prepare_input(
+                batch, epoch_id, with_lr=False)
+            fetch_outs = exe.run(inference_program,
+                                 feed=input_data_feed,
+                                 fetch_list=[loss.name],
+                                 use_program_cache=False)
+
+            cost_train = np.array(fetch_outs[0])
+
+            total_loss += cost_train * batch_size
+            word_count += word_num
+
+        ppl = np.exp(total_loss / word_count)
+
+        return ppl
+
+    max_epoch = args.max_epoch
+    for epoch_id in range(max_epoch):
+        start_time = time.time()
+        print("epoch id", epoch_id)
+        train_data_iter = reader.get_data_iter(train_data, batch_size)
+
+        total_loss = 0
+        word_count = 0.0
+        for batch_id, batch in enumerate(train_data_iter):
+
+            input_data_feed, word_num = prepare_input(batch, epoch_id=epoch_id)
+            fetch_outs = exe.run(feed=input_data_feed,
+                                 fetch_list=[loss.name],
+                                 use_program_cache=True)
+
+            cost_train = np.array(fetch_outs[0])
+
+            total_loss += cost_train * batch_size
+            word_count += word_num
+
+            if batch_id > 0 and batch_id % 100 == 0:
+                print("ppl", batch_id, np.exp(total_loss / word_count))
+                total_loss = 0.0
+                word_count = 0.0
+
+        dir_name = args.model_path + "/epoch_" + str(epoch_id)
+        print("begin to save", dir_name)
+        fluid.io.save_params(exe, dir_name)
+        print("save finished")
+        dev_ppl = eval(valid_data)
+        print("dev ppl", dev_ppl)
+        test_ppl = eval(test_data)
+        print("test ppl", test_ppl)
 
 
 if __name__ == '__main__':