Merge pull request #206 from astonzhang/seq

add beam search

8c8cfa93 · Aston Zhang · GitHub · 681c30f4 · db733518 · 8c8cfa93

Showing 1 changed file with 31 additions and 2 deletions

chapter_natural-language-processing/nmt.md +31 -2

--- a/chapter_natural-language-processing/nmt.md
+++ b/chapter_natural-language-processing/nmt.md
@@ -263,7 +263,7 @@ def translate(encoder, decoder, decoder_init_state, fr_ens, ctx, max_seq_len):
        print('Expect:', fr_en[1], '\n')
 ```

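-Next we define the model training function. To initialize the decoder's hidden state, we pass the encoder's output hidden state at the earliest time step through a fully connected layer.
+Next we define the model training function. To initialize the decoder's hidden state, we pass the encoder's output hidden state at the earliest time step through a fully connected layer. Here the decoder uses its prediction at the current time step as the input at the next time step.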

 ```{.python .input}
 def train(encoder, decoder, decoder_init_state, max_seq_len, ctx, eval_fr_ens):
@@ -302,7 +302,7 @@ def train(encoder, decoder, decoder_init_state, max_seq_len, ctx, eval_fr_ens):
                for i in range(max_seq_len):
                    decoder_output, decoder_state = decoder(
                        decoder_input, decoder_state, encoder_outputs)
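-                    # Use the prediction at the current time step as the encoder input at the next time step.
+                    # The decoder uses its prediction at the current time step as the input at the next time step.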
                    decoder_input = nd.array(
                        [decoder_output.argmax(axis=1).asscalar()], ctx=ctx)
                    loss = loss + softmax_cross_entropy(decoder_output, y[0][i])
@@ -349,6 +349,32 @@ eval_fr_ens =[['elle est japonaise .', 'she is japanese .'],
 train(encoder, decoder, decoder_init_state, max_seq_len, ctx, eval_fr_ens)
 ```

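+## Beam Search
+
+In the previous section, we mentioned that the encoder ultimately outputs a context vector $\mathbf{c}$ that encodes the information of the input sequence $x_1, x_2, \ldots, x_T$. Suppose the output sequence in the training data is $y_1, y_2, \ldots, y_{T^\prime}$. The probability of generating this output sequence is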
+
+$$\mathbb{P}(y_1, \ldots, y_{T^\prime}) = \prod_{t^\prime=1}^{T^\prime} \mathbb{P}(y_{t^\prime} \mid y_1, \ldots, y_{t^\prime-1}, \mathbf{c})$$
+
+
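+For machine translation output, if the output-language vocabulary $\mathcal{Y}$ has size $|\mathcal{Y}|$ and the output sequence has length $T^\prime$, then the number of possible output sequences is $\mathcal{O}(|\mathcal{Y}|^{T^\prime})$. To find the output sequence with the highest generation probability, one approach is to compute the generation probability of all $\mathcal{O}(|\mathcal{Y}|^{T^\prime})$ possible sequences and output the most probable one. We call this sequence the optimal sequence. However, the computational cost of this approach is too high (for example, $10000^{10} = 1 \times 10^{40}$).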
+
+
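+The decoder we have introduced so far outputs only the single word with the highest generation probability at each time step. For any time step $t^\prime$, we search the $|\mathcal{Y}|$ words for the output word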
+
+$$y_{t^\prime} = \text{argmax}_{y_{t^\prime} \in \mathcal{Y}} \mathbb{P}(y_{t^\prime} \mid y_1, \ldots, y_{t^\prime-1}, \mathbf{c})$$
+
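+As a result, the search cost ($\mathcal{O}(|\mathcal{Y}| \times {T^\prime})$) drops significantly (for example, $10000 \times 10 = 1 \times 10^5$), but this does not guarantee that the optimal sequence is found.
+
+Beam search lies between the two approaches above. Let us look at an example.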
+
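+Suppose the vocabulary of the output sequence contains only five words: $\mathcal{Y} = \{A, B, C, D, E\}$. Beam search has a hyperparameter called the beam width. Take a beam width of 2 and an output sequence length of 3 as an example. Suppose the two words with the highest generation probability $\mathbb{P}(y_1 \mid \mathbf{c})$ at time step 1 are $A$ and $C$. At time step 2, for every $y_2 \in \mathcal{Y}$ we compute both $\mathbb{P}(y_2 \mid A, \mathbf{c})$ and $\mathbb{P}(y_2 \mid C, \mathbf{c})$, and keep the two largest of the resulting 10 probabilities, say $\mathbb{P}(B \mid A, \mathbf{c})$ and $\mathbb{P}(E \mid C, \mathbf{c})$. Then, at time step 3, for every $y_3 \in \mathcal{Y}$ we compute both $\mathbb{P}(y_3 \mid A, B, \mathbf{c})$ and $\mathbb{P}(y_3 \mid C, E, \mathbf{c})$, and again keep the two largest of the resulting 10 probabilities, say $\mathbb{P}(D \mid A, B, \mathbf{c})$ and $\mathbb{P}(D \mid C, E, \mathbf{c})$.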
+
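+Next, from the output sequences $A$, $C$, $AB$, $CE$, $ABD$, and $CED$, we can select the candidate sequences that end with the special symbol EOS. Among these candidates, we then take the sequence with the highest score below as the final candidate sequence: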
+
+$$ \frac{1}{L^\alpha} \log \mathbb{P}(y_1, \ldots, y_{L})$$
+
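+where $L$ is the length of the candidate sequence and $\alpha$ is typically set to 0.75. The $L^\alpha$ in the denominator penalizes the larger number of summed log terms in the score of a longer sequence.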
+
 ## Conclusion

 * We can apply the encoder-decoder and the attention mechanism to neural machine translation.
@@ -357,6 +383,9 @@ train(encoder, decoder, decoder_init_state, max_seq_len, ctx, eval_fr_ens)
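 ## Exercises

 * Try training the model on a larger translation dataset, such as [WMT](http://www.statmt.org/wmt14/translation-task.html) and [Tatoeba Project](http://www.manythings.org/anki/). Tune different parameters and observe the experimental results.
+* Teacher forcing: during model training, try having the decoder use the correct result (rather than the predicted result) at the current time step as the input at the next time step. What happens?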
+
+


 **For comments and discussion, click** [here](https://discuss.gluon.ai/t/topic/4689)
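
As a reading aid for the beam search section added in this diff, the procedure can be sketched in plain Python. This is not part of the pull request and does not use the `encoder`/`decoder` defined in `nmt.md`. The names `beam_search`, `toy_log_probs`, `VOCAB`, and `EOS` are hypothetical, and the probability table is made up for illustration. The sketch also makes two assumptions that go beyond the text: candidates are ranked by the accumulated log-probability of the whole prefix (the common formulation) rather than by the last conditional probability alone, and if no candidate ends with EOS it falls back to the unfinished beams.

```python
import math

EOS = "<eos>"
VOCAB = ["A", "B", EOS]


def toy_log_probs(prefix):
    """Hypothetical stand-in for a decoder: log P(token | prefix, c)."""
    table = {
        (): {"A": 0.5, "B": 0.4, EOS: 0.1},
        ("A",): {"A": 0.3, "B": 0.3, EOS: 0.4},
        ("B",): {"A": 0.05, "B": 0.05, EOS: 0.9},
    }
    probs = table.get(prefix, {"A": 0.1, "B": 0.1, EOS: 0.8})
    return {token: math.log(p) for token, p in probs.items()}


def beam_search(log_prob_fn, vocab, beam_width=2, max_len=3, alpha=0.75):
    """Return (sequence, score) with the best length-normalized score."""
    beams = [((), 0.0)]  # (prefix, accumulated log-probability)
    finished = []        # candidates that end with EOS
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            step = log_prob_fn(prefix)
            for token in vocab:
                candidates.append((prefix + (token,), logp + step[token]))
        # Keep only the beam_width most probable extensions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_width]:
            if seq[-1] == EOS:
                finished.append((seq, logp))
            else:
                beams.append((seq, logp))
        if not beams:
            break
    # Assumption: fall back to unfinished beams if nothing ended with EOS.
    pool = finished or beams
    scored = [(seq, logp / max(len(seq), 1) ** alpha) for seq, logp in pool]
    return max(scored, key=lambda item: item[1])


greedy_seq, greedy_score = beam_search(toy_log_probs, VOCAB, beam_width=1)
beam_seq, beam_score = beam_search(toy_log_probs, VOCAB, beam_width=2)
print(greedy_seq, greedy_score)
print(beam_seq, beam_score)
```

With this toy table, a beam width of 1 behaves like the greedy decoder and returns `A <eos>` (joint probability 0.2), while a beam width of 2 also keeps `B` at the first step and returns `B <eos>` (joint probability 0.36). This illustrates why greedy search is not guaranteed to find the optimal sequence.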