Commit 15013709 authored by: T Tao Luo, committed by: GitHub

Merge pull request #284 from luotao1/mt

add generation in seq2seq
...@@ -41,9 +41,9 @@ Let's consider an example of Chinese-to-English translation. The model is given
```
After training and with a beam-search size of 3, the generated translations are as follows:
```text
0 -5.36816 These are signs of hope and relief . <e>
1 -6.23177 These are the light of hope and relief . <e>
2 -7.7914 These are the light of hope and the relief of hope . <e>
```
- The first column corresponds to the id of the generated sentence; the second column corresponds to the score of the generated sentence (in descending order), where a larger value indicates better quality; the last column corresponds to the generated sentence.
- There are two special tokens: `<e>` denotes the end of a sentence while `<unk>` denotes an unknown word, i.e., a word not in the training dictionary.
...@@ -94,7 +94,7 @@ Figure 4. Encoder-Decoder Framework
There are three steps for encoding a sentence:
1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\epsilon \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$, where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the location corresponding to the location of the word in the dictionary and zero elsewhere.
2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation
......
...@@ -26,9 +26,9 @@
```
If the number of displayed translations (i.e., the width of the [beam search algorithm](#柱搜索算法)) is set to 3, the generated English sentences are as follows:
```text
0 -5.36816 These are signs of hope and relief . <e>
1 -6.23177 These are the light of hope and relief . <e>
2 -7.7914 These are the light of hope and the relief of hope . <e>
```
- The first column is the index of the generated sentence; the second column is its score (in descending order; a higher score is better); the third column is the generated English sentence.
- There are two special tokens: `<e>` marks the end of a sentence, and `<unk>` marks an unknown word, i.e., a word that does not appear in the training dictionary.
...@@ -74,7 +74,7 @@ GRU\[[2](#参考文献)\] is a simplified version of LSTM proposed by Cho et al., and is also an RNN
The encoding stage consists of three steps:
1. One-hot vector representation: each word $x_i$ of the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a column vector $w_i\epsilon \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$. The dimension of $w_i$ equals the vocabulary size $\left | V \right |$; it has a single 1 at the position of the word in the vocabulary and 0 everywhere else.
2. Word embedding in a low-dimensional semantic space: the one-hot representation has two problems: 1) the generated vectors are very high-dimensional, which easily leads to the curse of dimensionality; 2) it is hard to capture relations between words (such as semantic similarity), i.e., it cannot express meaning well. Therefore, the one-hot vector is mapped into a low-dimensional semantic space and represented by a dense fixed-size vector called the word embedding. Let $C\epsilon R^{K\times \left | V \right |}$ denote the projection matrix; then $s_i=Cw_i$ is the word embedding of the $i$-th word, where $K$ is the embedding dimension (see the short sketch after this list).
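The following is a minimal numpy sketch of the one-hot and embedding-lookup step described in items 1 and 2; the vocabulary size, embedding dimension, and matrix values are made up for illustration and are not part of the tutorial's configuration.

```python
import numpy as np

# Toy sizes: |V| = 5 words in the vocabulary, K = 3 embedding dimensions.
vocab_size, emb_dim = 5, 3
word_index = 2                      # position of the word x_i in the dictionary

# One-hot column vector w_i in {0,1}^{|V|}.
w_i = np.zeros(vocab_size)
w_i[word_index] = 1.0

# Projection matrix C in R^{K x |V|} (random here; learned in practice).
C = np.random.rand(emb_dim, vocab_size)

# s_i = C w_i is simply the word_index-th column of C.
s_i = C.dot(w_i)
assert np.allclose(s_i, C[:, word_index])
```

In practice the one-hot product is never materialized: an embedding layer simply looks up the corresponding column of the learned table, which is effectively what `paddle.layer.embedding` does later in this tutorial.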
...@@ -167,7 +167,7 @@ e_{ij}&=align(z_i,h_j)\\\\
The dataset contains 193319 training samples and 6003 test samples, and the dictionary size is 30000. Because of the limited data size, the quality of models trained on this dataset is not guaranteed.
## Workflow

### Initialize PaddlePaddle

```python
import paddle.v2 as paddle

# Use CPU only, with a single trainer.
paddle.init(use_gpu=False, trainer_count=1)

# False for training mode, True for generating mode.
is_generating = False
```
### Model Architecture

1. First, define some global variables.

```python
dict_size = 30000              # dictionary size
source_dict_dim = dict_size    # source-language dictionary size
target_dict_dim = dict_size    # target-language dictionary size
word_vector_dim = 512          # dimension of the word embedding
encoder_size = 512             # hidden size of the GRU in the encoder
decoder_size = 512             # hidden size of the GRU in the decoder
beam_size = 3                  # beam width
max_length = 250               # maximum length of a generated sentence
```
2. Next, implement the encoder. This takes three steps:

- The input is a sequence of words represented as a sequence of integers; each element is the index of a word in the dictionary. The data layer's type is therefore defined as `integer_value_sequence` (a sequence of integers), and every element lies in the range `[0, source_dict_dim)`.

```python
src_word_id = paddle.layer.data(
    name='source_language_word',
    type=paddle.data_type.integer_value_sequence(source_dict_dim))
```

- Map the above encoding to the word-embedding vector $\mathbf{s}$ in a low-dimensional semantic space.

```python
src_embedding = paddle.layer.embedding(
...@@ -227,7 +213,7 @@
    size=word_vector_dim,
    param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
```

- Encode the source sequence with a bidirectional GRU and concatenate the outputs of the two GRUs to obtain $\mathbf{h}$ (a quick shape check follows this item's code block).

```python
src_forward = paddle.networks.simple_gru(
...@@ -237,9 +223,9 @@
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
```
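As a quick shape check for the concatenation step above, the sketch below mimics joining the forward and backward GRU outputs with numpy; the sequence length is arbitrary, and the only point is that every position of $\mathbf{h}$ ends up with dimension 2 * encoder_size (1024 with the settings above).

```python
import numpy as np

T, encoder_size = 7, 512
src_forward_out = np.random.rand(T, encoder_size)   # forward GRU states
src_backward_out = np.random.rand(T, encoder_size)  # backward GRU states

# h_j is the concatenation of the two directions at position j
encoded = np.concatenate([src_forward_out, src_backward_out], axis=1)
print encoded.shape   # (7, 1024)
```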
3. Then, define the attention-based decoder. This also takes three steps:

- Pass the encoded source sequence (see the last step of 2) through a feed-forward neural network to obtain its projection.

```python
with paddle.layer.mixed(size=decoder_size) as encoded_proj:
...@@ -247,7 +233,7 @@
        input=encoded_vector)
```

- Construct the initial state of the decoder RNN. The decoder has to predict the target sequence over time but has no value at step 0, so it is initialized with a nonlinear projection of the last state of the backward encoding of the source sequence, i.e., $c_0=h_T$.

```python
backward_first = paddle.layer.first_seq(input=src_backward)
...@@ -257,7 +243,7 @@
    input=backward_first)
```

- Define the behavior of the decoder RNN at each time step, i.e., predict the probability $p_{i+1}$ of the $(i+1)$-th target word from the current source context vector $c_i$, the decoder hidden state $z_i$, and the $i$-th target word $u_i$.
  - decoder_mem records the hidden state $z_i$ of the previous time step; its initial state is decoder_boot.
  - context is obtained by calling the `simple_attention` function, which implements the formula $c_i=\sum_{j=1}^{T}a_{ij}h_j$. Here enc_vec is $h_j$ and enc_proj is its projection (see step 3.1); the computation of the weights $a_{ij}$ is encapsulated inside `simple_attention`. A standalone numerical sketch of this weighted sum is given right after the code block below.
  - decoder_inputs fuses $c_i$ with the representation of the current target word current_word (i.e., $u_i$).
...@@ -294,7 +280,7 @@
```python
        return out
```
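The numpy sketch below is a standalone illustration of the $c_i=\sum_{j=1}^{T}a_{ij}h_j$ step that `simple_attention` encapsulates: it scores each encoder annotation against the decoder state, normalizes the scores with a softmax, and forms the weighted sum. The dot-product scoring is a simplifying stand-in, not the exact alignment network used by `simple_attention`.

```python
import numpy as np

T, hidden = 6, 4                  # 6 encoder annotations h_j, each of size 4
h = np.random.rand(T, hidden)     # encoder annotations h_1..h_T
z = np.random.rand(hidden)        # current decoder state z_i

# e_ij = align(z_i, h_j); a plain dot product stands in for the learned network
e = h.dot(z)                      # shape (T,)

# a_ij = softmax over j of e_ij
a = np.exp(e - e.max())
a /= a.sum()

# c_i = sum_j a_ij * h_j  -- the context vector fed to the decoder step
c = a.dot(h)                      # shape (hidden,)
print c.shape
```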
4. Define the name of the decoder group and the first two inputs of the `gru_decoder_with_attention` function. Note: these two inputs use `StaticInput`; see the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.

```python
decoder_group_name = "decoder_group"
...@@ -303,7 +289,7 @@
group_inputs = [group_input1, group_input2]
```
5. Decoder call in training mode:

- First, the word embeddings of the target-language sequence, trg_embedding, are passed directly to `gru_decoder_with_attention` as current_word.
- Then, `recurrent_group` is used to call `gru_decoder_with_attention` in a loop.
...@@ -311,6 +297,7 @@
- Finally, the multi-class cross-entropy loss `classification_cost` is used to compute the cost (a one-step numerical illustration follows the code block below).

```python
if not is_generating:
    trg_embedding = paddle.layer.embedding(
        input=paddle.layer.data(
            name='target_language_word',
...@@ -335,30 +322,71 @@
    cost = paddle.layer.classification_cost(input=decoder, label=lbl)
```
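To spell out what the multi-class cross-entropy cost computes at a single time step, here is a tiny numerical illustration; the vocabulary size and probabilities are invented, and in the actual layer the distribution comes from the decoder's softmax output.

```python
import numpy as np

probs = np.array([0.05, 0.7, 0.1, 0.1, 0.05])  # softmax output p_{i+1} over a toy vocabulary
label = 1                                      # id of the true next word

# multi-class cross-entropy for one step: -log p(true word)
loss = -np.log(probs[label])
print loss   # ~0.357
```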
6. Decoder call in generating mode:

- First, in sequence generation the decoder RNN always takes the word embedding of the word generated at the previous time step as the input of the current time step, so `GeneratedInput` is used to do this automatically; see the [GeneratedInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
- Then, the `beam_search` function calls `gru_decoder_with_attention` in a loop to generate sequences of word ids (a toy sketch of the beam-search loop follows the note below).

```python
if is_generating:
    # In generation, the decoder predicts a next target word based on
    # the encoded source sequence and the last generated target word.

    # The encoded source sequence (encoder's output) must be specified by
    # StaticInput, which is a read-only memory.

    # Embedding of the last generated word is automatically gotten by
    # GeneratedInputs, which is initialized by a start mark, such as <s>,
    # and must be included in generation.

    trg_embedding = paddle.layer.GeneratedInputV2(
        size=target_dict_dim,
        embedding_name='_target_language_embedding',
        embedding_size=word_vector_dim)
    group_inputs.append(trg_embedding)

    beam_gen = paddle.layer.beam_search(
        name=decoder_group_name,
        step=gru_decoder_with_attention,
        input=group_inputs,
        bos_id=0,
        eos_id=1,
        beam_size=beam_size,
        max_length=max_length)
```

Note: the configuration we provide makes some simplifications relative to Bahdanau's paper \[[4](#参考文献)\]; see [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133).
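To make the role of `beam_search` more concrete, the following is a toy, framework-free sketch of the decoding loop it performs: keep the `beam_size` best partial sequences, extend each with every candidate next word, re-rank by accumulated log-probability, and retire a hypothesis once it emits the end token. The scoring function is a stand-in; in the real layer the scores come from the decoder network defined above.

```python
import math
import random

def step_log_probs(prefix, vocab_size=6):
    # Stand-in for one decoder step: a log-probability for every word id.
    # In the real model this comes from gru_decoder_with_attention.
    random.seed(len(prefix))             # deterministic toy scores
    scores = [random.random() for _ in range(vocab_size)]
    total = sum(scores)
    return [math.log(s / total) for s in scores]

def toy_beam_search(bos_id=0, eos_id=1, beam_size=3, max_length=10):
    beams = [([bos_id], 0.0)]            # (partial sequence, accumulated log-prob)
    finished = []
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            for word_id, logp in enumerate(step_log_probs(seq)):
                candidates.append((seq + [word_id], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            if seq[-1] == eos_id:
                finished.append((seq, score))   # hypothesis ended with the end token
            elif len(beams) < beam_size:
                beams.append((seq, score))
        if not beams:
            break
    return sorted(finished + beams, key=lambda c: c[1], reverse=True)[:beam_size]

print toy_beam_search()
```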
### Train the Model

1. Parameter definition

Define the model parameters from the `cost` of the model configuration. The parameter names can be printed; if no name was specified in the network configuration, a default one is generated.

```python
if not is_generating:
    parameters = paddle.parameters.create(cost)
    for param in parameters.keys():
        print param
```

2. Data definition

Obtain the dataset reader for wmt14 (a short sketch of what one training sample looks like follows this code block).

```python
if not is_generating:
    wmt14_reader = paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
        batch_size=5)
```
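As a quick sanity check one can peek at a single sample from the reader. Based on the feeding order used by this tutorial (source_language_word, target_language_word, target_language_next_word), each sample is assumed to be a tuple of three integer id lists; this snippet is illustrative and not part of the original configuration.

```python
# Peek at one raw training sample (assumed structure: three id lists).
if not is_generating:
    sample = next(paddle.dataset.wmt14.train(dict_size=dict_size)())
    src_ids, trg_ids, trg_next_ids = sample
    print len(src_ids), len(trg_ids), len(trg_next_ids)
```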
3. Construct the trainer

The trainer is built from the optimization target cost, the network topology, and the model parameters. An optimization method must also be specified; here the Adam optimizer is used.

```python
if not is_generating:
    optimizer = paddle.optimizer.Adam(
        learning_rate=5e-5,
        regularization=paddle.optimizer.L2Regularization(rate=8e-4))
...@@ -367,73 +395,108 @@
        update_equation=optimizer)
```

4. Construct the event_handler

A custom callback can be used to inspect various training states, such as the error rate. The code below uses `event.batch_id % 2 == 0` to print a log line, including the cost, every 2 batches.

```python
if not is_generating:
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 2 == 0:
                print "\nPass %d, Batch %d, Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics)
```
5. Start training

```python
if not is_generating:
    trainer.train(
        reader=wmt14_reader, event_handler=event_handler, num_passes=2)
```

After training starts, the log printed by event_handler looks like:

```text
Pass 0, Batch 0, Cost 148.444983, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 335.896802, {'classification_error_evaluator': 0.9325153231620789}
.........
```

When the value of `classification_error_evaluator` drops below 0.35, training can be considered successful.
### Generate with the Model

1. Load the pretrained model

Because training an NMT model is very time-consuming, we trained a model on a cluster of 50 physical nodes (each with two 6-core CPUs) over 5 days and provide it for direct download. The model is 205MB in size and its [BLEU](#BLEU评估) score is 26.92.

```python
if is_generating:
    parameters = paddle.dataset.wmt14.model()
```

2. Data definition

Read the first 3 samples of the wmt14 generation set as the source-language sentences (the assumed layout of the resulting `gen_data` is sketched after this code block).

```python
if is_generating:
    gen_creator = paddle.dataset.wmt14.gen(dict_size)
    gen_data = []
    gen_num = 3
    for item in gen_creator():
        gen_data.append((item[0], ))
        if len(gen_data) == gen_num:
            break
```
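For orientation, the `gen_data` built above is assumed, from the `(item[0], )` tuple construction, to be a list of one-element tuples, each holding the id list of one source sentence; the snippet and the example ids below are illustrative only.

```python
# Illustrative only: gen_data is assumed to look like
#   [([2, 981, 3117, ...],), ([2, 57, ...],), ([2, 4, ...],)]
if is_generating:
    for (src_ids, ) in gen_data:
        print len(src_ids), src_ids[:5]
```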
3. Construct the inferer

The inferer used for generation is built from the network topology and the model parameters. At prediction time the output fields `field` must also be specified; here we use `prob`, the probability of each generated sentence, and `id`, the id of every word in it.

```python
if is_generating:
    beam_result = paddle.infer(
        output_layer=beam_gen,
        parameters=parameters,
        input=gen_data,
        field=['prob', 'id'])
```

4. Print the generated results
Using the source/target dictionaries, print each source sentence together with its `beam_size` generated sentences.

```python
if is_generating:
    # get the dictionary
    src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)

    # the delimiting element of generated sequences is -1,
    # the first element of each generated sequence is the sequence length
    seq_list = []
    seq = []
    for w in beam_result[1]:
        if w != -1:
            seq.append(w)
        else:
            seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
            seq = []

    prob = beam_result[0]
    for i in xrange(gen_num):
        print "\n*******************************************************\n"
        print "src:", ' '.join(
            [src_dict.get(w) for w in gen_data[i][0]]), "\n"
        for j in xrange(beam_size):
            print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
```

After generation starts, the output log looks like:

```text
src: <s> Les <unk> se <unk> au sujet de la largeur des sièges alors que de grosses commandes sont en jeu <e>

prob = -19.019573: The <unk> will be rotated about the width of the seats , while large orders are at stake . <e>
prob = -19.113066: The <unk> will be rotated about the width of the seats , while large commands are at stake . <e>
prob = -19.512890: The <unk> will be rotated about the width of the seats , while large commands are at play . <e>
```

### Download the Pretrained Model

The pretrained model can also be downloaded and extracted manually. It was trained for 16 passes (about 7 hours per pass); pass-00012 has the highest [BLEU](#BLEU评估) score of the 16 saved models, 26.92, and is the one provided.

```bash
cd pretrained
./wmt14_model.sh
```

### BLEU Evaluation

BLEU (Bilingual Evaluation Understudy) is a widely used automatic metric for machine translation, proposed by IBM's Watson Research Center in 2002 \[[5](#参考文献)\]. Its premise is that the closer a machine translation is to that of a professional human translator, the better the translation system. Closeness between the machine translation and the human reference is measured with sentence-level precision, i.e., by counting matching n-grams between the two: the more matches, the higher the BLEU score (a tiny illustrative n-gram precision computation follows this subsection).

[Moses](http://www.statmt.org/moses/) is an open-source statistical machine translation system; we use its [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) script for BLEU evaluation. Download the script with:

```bash
./moses_bleu.sh
```

BLEU evaluation can then be run with the `eval_bleu` script, where FILE is the file to evaluate and BEAMSIZE is the beam width; `data/wmt14/gen/ntst14.trg` is used by default as the reference translation.

```bash
./eval_bleu.sh FILE BEAMSIZE
```

The concrete command for this tutorial is:

```bash
./eval_bleu.sh gen_result 3
```

and the screen output is:

```text
BLEU = 26.92
```
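The sketch below is a minimal, self-contained illustration of the n-gram matching idea behind BLEU (plain modified n-gram precision, without the brevity penalty or geometric averaging that multi-bleu.perl applies), using two of the example sentences from earlier in this tutorial.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # clip each candidate n-gram count by its count in the reference
    matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return float(matches) / max(sum(cand_counts.values()), 1)

candidate = "these are the light of hope and relief".split()
reference = "these are signs of hope and relief".split()
print modified_precision(candidate, reference, 1)   # 0.75  (6 of 8 unigrams match)
print modified_precision(candidate, reference, 2)   # ~0.57 (4 of 7 bigrams match)
```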
## Summary
......
#!/bin/bash
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e
set -x
mkdir wmt14
cd wmt14
# download the dataset
wget http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz
wget http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz
# untar the dataset
tar -zxvf bitexts.tgz
tar -zxvf dev+test.tgz
gunzip bitexts.selected/*
mv bitexts.selected train
rm bitexts.tgz
rm dev+test.tgz
# separate the dev and test dataset
mkdir test gen
mv dev/ntst1213.* test
mv dev/ntst14.* gen
rm -rf dev
set +x
# rename the suffix, .fr->.src, .en->.trg
for dir in train test gen
do
filelist=`ls $dir`
cd $dir
for file in $filelist
do
if [ ${file##*.} = "fr" ]; then
mv $file ${file/%fr/src}
elif [ ${file##*.} = 'en' ]; then
mv $file ${file/%en/trg}
fi
done
cd ..
done
#!/bin/bash
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e
gen_file=$1
beam_size=$2
# find top1 generating result
top1=$(printf '%s_top1.txt' `basename $gen_file .txt`)
if [ $beam_size -eq 1 ]; then
awk -F "\t" '{sub(" <e>","",$2);sub(" ","",$2);print $2}' $gen_file >$top1
else
awk 'BEGIN{
FS="\t";
OFS="\t";
read_pos = 2} {
if (NR == read_pos){
sub(" <e>","",$3);
sub(" ","",$3);
print $3;
read_pos += (2 + res_num);
}}' res_num=$beam_size $gen_file >$top1
fi
# evaluate bleu value
bleu_script=multi-bleu.perl
standard_res=data/wmt14/gen/ntst14.trg
bleu_res=`perl $bleu_script $standard_res <$top1`
echo $bleu_res | cut -d, -f 1
rm $top1
......
#!/bin/bash
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e
set -x
echo "Downloading multi-bleu.perl"
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl --no-check-certificate
#!/bin/bash
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e
set -x
# download the pretrained model
wget http://paddlepaddle.bj.bcebos.com/model_zoo/wmt14_model.tar.gz
# untar the model
tar -zxvf wmt14_model.tar.gz
rm wmt14_model.tar.gz
import sys
import paddle.v2 as paddle


def seqToseq_net(source_dict_dim, target_dict_dim, is_generating=False):
    ### Network Architecture
    word_vector_dim = 512  # dimension of word vector
    decoder_size = 512  # dimension of hidden unit in GRU Decoder network
    encoder_size = 512  # dimension of hidden unit in GRU Encoder network

    beam_size = 3
    max_length = 250

    #### Encoder
    src_word_id = paddle.layer.data(
        name='source_language_word',
...@@ -67,6 +71,7 @@
    group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
    group_inputs = [group_input1, group_input2]

    if not is_generating:
        trg_embedding = paddle.layer.embedding(
            input=paddle.layer.data(
                name='target_language_word',
...@@ -91,16 +96,44 @@
        cost = paddle.layer.classification_cost(input=decoder, label=lbl)
        return cost
    else:
        # In generation, the decoder predicts a next target word based on
        # the encoded source sequence and the last generated target word.

        # The encoded source sequence (encoder's output) must be specified by
        # StaticInput, which is a read-only memory.

        # Embedding of the last generated word is automatically gotten by
        # GeneratedInputs, which is initialized by a start mark, such as <s>,
        # and must be included in generation.

        trg_embedding = paddle.layer.GeneratedInputV2(
            size=target_dict_dim,
            embedding_name='_target_language_embedding',
            embedding_size=word_vector_dim)
        group_inputs.append(trg_embedding)

        beam_gen = paddle.layer.beam_search(
            name=decoder_group_name,
            step=gru_decoder_with_attention,
            input=group_inputs,
            bos_id=0,
            eos_id=1,
            beam_size=beam_size,
            max_length=max_length)

        return beam_gen
def main():
    paddle.init(use_gpu=False, trainer_count=1)
    is_generating = False

    # source and target dict dim.
    dict_size = 30000
    source_dict_dim = target_dict_dim = dict_size

    # train the network
    if not is_generating:
        cost = seqToseq_net(source_dict_dim, target_dict_dim)
        parameters = paddle.parameters.create(cost)
...@@ -110,17 +143,10 @@
            regularization=paddle.optimizer.L2Regularization(rate=8e-4))
        trainer = paddle.trainer.SGD(
            cost=cost, parameters=parameters, update_equation=optimizer)

        # define data reader
        wmt14_reader = paddle.batch(
            paddle.reader.shuffle(
                paddle.dataset.wmt14.train(dict_size), buf_size=8192),
            batch_size=5)

        # define event_handler callback
...@@ -128,17 +154,59 @@
            if isinstance(event, paddle.event.EndIteration):
                if event.batch_id % 10 == 0:
                    print "\nPass %d, Batch %d, Cost %f, %s" % (
                        event.pass_id, event.batch_id, event.cost,
                        event.metrics)
                else:
                    sys.stdout.write('.')
                    sys.stdout.flush()

        # start to train
        trainer.train(
            reader=wmt14_reader, event_handler=event_handler, num_passes=2)

    # generate an English sequence from the French source
    else:
        # use the first 3 samples for generation
        gen_creator = paddle.dataset.wmt14.gen(dict_size)
        gen_data = []
        gen_num = 3
        for item in gen_creator():
            gen_data.append((item[0], ))
            if len(gen_data) == gen_num:
                break

        beam_gen = seqToseq_net(source_dict_dim, target_dict_dim, is_generating)
        # get the pretrained model, whose bleu = 26.92
        parameters = paddle.dataset.wmt14.model()
        # prob is the prediction probabilities, and id is the prediction word.
        beam_result = paddle.infer(
            output_layer=beam_gen,
            parameters=parameters,
            input=gen_data,
            field=['prob', 'id'])

        # get the dictionary
        src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)

        # the delimiting element of generated sequences is -1,
        # the first element of each generated sequence is the sequence length
        seq_list = []
        seq = []
        for w in beam_result[1]:
            if w != -1:
                seq.append(w)
            else:
                seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
                seq = []

        prob = beam_result[0]
        beam_size = 3
        for i in xrange(gen_num):
            print "\n*******************************************************\n"
            print "src:", ' '.join(
                [src_dict.get(w) for w in gen_data[i][0]]), "\n"
            for j in xrange(beam_size):
                print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]


if __name__ == '__main__':
......