Modifying the objective function (cost function) of seqToseq (#1104) · Issue · PaddlePaddle / Paddle

Created by: coollip

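Hi, I previously tried the seqToseq example in the demo and successfully trained a neural machine translation (NMT) model; it works fine.

Recently I read a paper that introduces alignment information into the cost function: Wenhu Chen, "Guided Alignment Training for Topic-Aware Neural Machine Translation".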

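The idea of the paper is fairly simple. The attention in seqToseq can be viewed as alignment information, but this alignment is not as good as the hard alignment produced by fast align. The authors want the model to refer to the fast align result during training and adjust its attention accordingly. Following this idea, they first align the corpus offline with fast align, then define a cost between the fast align result and the network's attention, and add it to the cross-entropy cost.

In the resulting cost function, HD is the cross entropy currently used in the seqToseq demo, and G is the align cost defined by the authors. The final cost function is the weighted sum of HD and G, with w1 and w2 controlling the proportion of each.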
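The weighted sum described above can be written roughly as follows. This is a reconstruction from the definitions in the text only; which of w1 and w2 multiplies which term is an assumption here and should be checked against the paper.

```latex
% Assumption: w_1 weights the align cost G, w_2 weights the cross entropy H_D.
\mathrm{cost} = w_1 \cdot G + w_2 \cdot H_D
```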

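As the align cost, G is simply the mean squared error. The A in G is a two-dimensional matrix holding the fast align result: Ati describes the alignment between the t-th token of the target sentence and the i-th token of the src sentence. If fast align considers the two aligned, Ati is set to 1 (in the actual computation A is normalized so that each row sums to 1). alpha is the attention in the network.

OK, that's the background. Now for how I am trying to implement this in paddle.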
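To make the definition of G concrete, here is a minimal NumPy sketch, not Paddle code. It assumes the mean is taken over all entries of the (target len, src len) matrix, and that w1 weights G while w2 weights HD; the paper's exact normalization constant and weight assignment may differ. The function names are hypothetical.

```python
import numpy as np


def normalize_rows(a):
    """Scale each row of A to sum to 1; all-zero rows stay all zero."""
    a = np.asarray(a, dtype=np.float64)
    row_sums = a.sum(axis=1, keepdims=True)
    return np.divide(a, row_sums, out=np.zeros_like(a), where=row_sums > 0)


def align_cost(a, alpha):
    """Mean squared error between alignment matrix A and attention alpha.

    Both arrays have shape (target_len, src_len). Averaging over every
    entry is an assumption about the normalization.
    """
    a = np.asarray(a, dtype=np.float64)
    alpha = np.asarray(alpha, dtype=np.float64)
    return float(np.mean((a - alpha) ** 2))


def total_cost(cross_entropy, a, alpha, w1=0.5, w2=0.5):
    """Weighted sum w1 * G + w2 * HD (the weight assignment is an assumption)."""
    return w1 * align_cost(a, alpha) + w2 * cross_entropy


# 2 target tokens x 3 src tokens, hard 0/1 alignments before normalization.
A = normalize_rows([[1, 1, 0], [0, 1, 1]])
# Made-up attention values, one row per decoding step.
alpha = np.array([[0.5, 0.5, 0.0], [0.2, 0.3, 0.5]])
G = align_cost(A, alpha)
cost = total_cost(2.0, A, alpha, w1=0.5, w2=0.5)
```

The first row of alpha matches A exactly, so only the second decoding step contributes to G.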

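1. First I aligned the corpus with fast align and generated the corresponding A for each sentence pair. The number of rows of A is the number of tokens in the pair's target sentence, and the number of columns is the number of tokens in the src sentence. Based on the fast align result, I set the corresponding positions in A to 1, and finally normalized each row.
2. With step 1 done, A has to be passed into paddle through the dataprovider as training information. Since the align cost between A and the attention has to be computed later, I first looked at the attention in the demo. Its code is in simple_attention:

       attention_weight = fc_layer(input=m,
                                   size=1,
                                   act=SequenceSoftmaxActivation(),
                                   param_attr=softmax_param_attr,
                                   name="%s_softmax" % name,
                                   bias_attr=False)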

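My understanding is that the simple_attention method runs once at every time step on the decoder side. At the current time step t, attention_weight is a sequence whose length is the src len of the current sentence pair. Each element is a one-dimensional float, and the i-th element is the attention value between the target token at decoding step t and the i-th src token.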
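As a rough illustration of that reading, the NumPy sketch below applies a softmax across one sequence of scalar scores, one per src token, which is what a sequence-level softmax activation is understood to do here. It is not Paddle's implementation, and the scores are made up.

```python
import numpy as np


def sequence_softmax(scores):
    """Softmax across one sequence of scalar scores (one score per src token)."""
    scores = np.asarray(scores, dtype=np.float64)
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Hypothetical unnormalized scores for a 3-token src sentence at decoding step t.
scores_t = [1.0, 2.0, 0.5]
attention_t = sequence_softmax(scores_t)
```

The result has one weight per src token and the weights sum to 1, so one decoding step yields one row of the attention matrix alpha.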

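Accordingly, I put the fast align result A into a similar format, using dense_vector_sub_sequence(1). Suppose the src of a training sentence pair contains 3 tokens and the target contains 2 tokens; then A looks like this, for example:

    [
      [ [0.5], [0.5], [0] ],  // alignment of target token 1 with each src token
      [ [0], [0.5], [0.5] ],  // alignment of target token 2 with each src token
    ]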
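Building A in this nested form could look like the sketch below. It assumes fast align emits one line per sentence pair with space-separated `i-j` pairs, where i is the src token index and j is the target token index (both zero-based). The function name is hypothetical, and leaving unaligned target tokens as all-zero rows is an assumption.

```python
def alignment_to_sub_sequence(pairs_line, src_len, trg_len):
    """Turn one alignment line into a row-normalized nested list.

    pairs_line: e.g. "0-0 1-0 1-1 2-1", each pair read as src_index-trg_index.
    Returns trg_len rows, each a list of src_len one-element lists.
    """
    matrix = [[0.0] * src_len for _ in range(trg_len)]
    for pair in pairs_line.split():
        src_idx, trg_idx = (int(x) for x in pair.split("-"))
        matrix[trg_idx][src_idx] = 1.0

    rows = []
    for row in matrix:
        total = sum(row)
        # Assumption: a target token with no aligned src token keeps a zero row.
        normalized = [v / total if total > 0 else 0.0 for v in row]
        rows.append([[v] for v in normalized])
    return rows


# The 3-token src / 2-token target example from the text.
example = alignment_to_sub_sequence("0-0 1-0 1-1 2-1", src_len=3, trg_len=2)
```

Each inner one-element list corresponds to one dense vector of size 1, and each row is one sub-sequence of length src len.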

I passed A into paddle in this format, as follows:

    a = data_layer(name='target_source_align', size=1)

    decoder = recurrent_group(name=decoder_group_name,
                              step=gru_decoder_with_attention,
                              input=[
                                  StaticInput(input=encoded_vector,
                                              is_seq=True),
                                  StaticInput(input=encoded_proj,
                                              is_seq=True),
                                  trg_embedding
                              ])

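I now have a few questions:

1. Are my understanding and the processing flow described above correct?
2. How do I pass a sub_sequence such as align_info into recurrent_group->gru_decoder_with_attention?
3. If attention_weight is a sequence of length src len, how do I compute the align cost defined in the formula above (i.e. G) between it and a?
4. How do I combine the two costs with weights for training?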
