+# Accelerating Word Embedding Training with Noise-Contrastive Estimation
+
+Word embeddings are the foundation of many natural language processing tasks; for a detailed introduction see the [Word Embedding](https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/README.cn.md) chapter of PaddleBook, where embeddings are obtained by training a neural probabilistic language model (NPLM), a popular approach. However, the final layer of an NPLM computes a probability distribution over the whole dictionary, and the larger the dictionary, the more expensive this layer becomes, often dominating training time. In another chapter of the models repository we introduced [accelerating word embedding training with Hsigmoid](https://github.com/PaddlePaddle/models/tree/develop/hsigmoid); here we introduce another way to speed up word embedding training: the noise-contrastive estimation (NCE) loss function \[[1](#references)\].
+
+## NCE
+The `softmax` at the final layer of an NPLM must evaluate the exponential term of every class, i.e. of every word in the dictionary, and dictionaries built from common corpora are very large \[[3](#references)\], which makes the whole training process very time-consuming. NCE is a fast method for estimating a discrete distribution. Compared with the commonly used hierarchical sigmoid \[[2](#references)\], NCE does not build its objective from a complex binary tree; instead it uses relatively simple random negative sampling, which greatly improves computational efficiency.
+
+
+Assume the context $h$ is given and the data distribution over the next word is $P^h(w)$: words sampled from it are treated as positive examples, while words sampled from a noise distribution $P_n(w)$ are treated as negative examples. Any suitable noise distribution may be chosen; the default is a uniform distribution. Assuming further that $k$ noise samples are drawn for every data sample, the probability that a given sample was drawn from the data is \[[1](#references)\]:
+
+$$P^h(D=1|w,\theta)=\frac { P_\theta^h(w) }{ P^h_\theta(w)+kP_n(w) } =\sigma (\Delta s_\theta(w,h))$$
+
+where $\Delta s_\theta(w,h)=s_\theta(w,h)-\log (kP_n(w))$, and $s_\theta(w,h)$ is the score the model assigns to word $w$ in context $h$. The overall objective is to raise the probability of positive examples while lowering that of negative examples. The objective function is \[[1](#references)\]:
+
+$$
+\begin{aligned}
+J^h(\theta) &= E_{P_d^h}\left[\log P^h(D=1|w,\theta)\right] + kE_{P_n}\left[\log P^h(D=0|w,\theta)\right] \\
+&= E_{P_d^h}\left[\log \sigma(\Delta s_\theta(w,h))\right] + kE_{P_n}\left[\log\left(1-\sigma(\Delta s_\theta(w,h))\right)\right]
+\end{aligned}
+$$
+
+In short, NCE sets up a logistic regression that classifies positive against negative examples. For each training sample, the true next-word label is the positive example and $k$ other word labels are sampled as negative examples, so probabilities only have to be computed for these $k+1$ labels. Compared with the original `softmax` classification, which computes a score for every class and then normalizes, this saves a large amount of computation.
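+
+To make the objective concrete, the following NumPy sketch (purely illustrative and not part of this example's code) evaluates $J^h(\theta)$ for a single context, given one positive word and $k$ sampled noise words:
+
+```python
+import numpy as np
+
+
+def nce_objective(s_pos, s_neg, log_kpn_pos, log_kpn_neg):
+    """Illustrative per-sample NCE objective (to be maximized).
+
+    s_pos: model score s_theta(w, h) of the positive word.
+    s_neg: array of k model scores of the sampled noise words.
+    log_kpn_*: log(k * P_n(w)) of the corresponding word(s).
+    """
+    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
+    delta_pos = s_pos - log_kpn_pos  # Delta s for the data sample
+    delta_neg = s_neg - log_kpn_neg  # Delta s for the k noise samples
+    # J = log sigma(delta_pos) + sum_k log(1 - sigma(delta_neg))
+    return np.log(sigmoid(delta_pos)) + np.sum(np.log(1.0 - sigmoid(delta_neg)))
+```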
+
+## Experimental Data
+This example trains the language model on the Penn Treebank (PTB) dataset ([Tomas Mikolov's preprocessed version](http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz)). PaddlePaddle provides the [paddle.dataset.imikolov](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/imikolov.py) interface for convenient access to this data; if the dataset is not found locally it is downloaded automatically and its integrity is verified. The interface also preprocesses the data with a sliding window of size 5 for later use. The corpus is English and contains 42068 training sentences and 3761 test sentences.
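+
+As a quick sanity check of this data interface (a minimal sketch; the ids printed in the comment are made up), the 5-gram samples can be iterated as follows:
+
+```python
+import paddle.v2 as paddle
+
+# Build the vocabulary (word -> id) and iterate over the 5-gram training samples.
+# Each sample is a tuple of 5 word ids: 4 context words followed by the next word.
+word_dict = paddle.dataset.imikolov.build_dict()
+for sample in paddle.dataset.imikolov.train(word_dict, 5)():
+    print(sample)  # e.g. (9, 5, 1, 12, 7)
+    break
+```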
+
+## Network Structure
+The detailed structure of the N-gram neural probabilistic language model is shown in Figure 1:
+
+
+
+Figure 1. Network configuration
+
+As the figure shows, the model consists of the following parts:
+
+1. **Input layer**: each PTB sample consists of raw English words; every word is converted to its id in the dictionary, and this unique id distinguishes one word from another.
+
+2. **Word embedding layer**: compared with the raw id representation, word embeddings capture the semantic relationships between words much better. A trainable embedding matrix maps each id to a fixed-dimensional embedding vector. After training, the semantic similarity of two words can be measured by the distance between their embeddings: the closer the meaning, the smaller the distance.
+
+3. **Concatenation layer**: the context word embeddings are concatenated end to end into one long vector, which is convenient for the fully connected layer that follows.
+
+4. **Fully connected hidden layer**: the long vector is fed through a single fully connected hidden layer, which outputs a feature vector and strengthens the network's learning capacity.
+
+5. **NCE layer**: during training, the NCE layer provided by PaddlePaddle (`paddle.layer.nce`) is used directly; a condensed sketch of the whole topology is given below.
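+
+The sketch below mirrors `network_conf.py` in this example (the helper name `ngram_lm_sketch` and variable names are illustrative) and shows how the layers above are stacked with the PaddlePaddle v2 API:
+
+```python
+import math
+
+import paddle.v2 as paddle
+
+
+def ngram_lm_sketch(hidden_size, emb_size, dict_size, gram_num=4):
+    emb_param = paddle.attr.Param(name="_proj", initial_std=0.001)
+    emb_layers = []
+    for i in range(gram_num):
+        # input layer: the id of the i-th context word
+        word = paddle.layer.data(
+            name="__word%02d__" % i,
+            type=paddle.data_type.integer_value(dict_size))
+        # word embedding layer: a shared, trainable embedding matrix
+        emb_layers.append(
+            paddle.layer.embedding(
+                input=word, size=emb_size, param_attr=emb_param))
+    # concatenation layer: stitch the context embeddings into one long vector
+    context = paddle.layer.concat(input=emb_layers)
+    # fully connected hidden layer producing the feature vector
+    hidden = paddle.layer.fc(
+        input=context,
+        size=hidden_size,
+        act=paddle.activation.Tanh(),
+        param_attr=paddle.attr.Param(
+            initial_std=1. / math.sqrt(emb_size * 8)))
+    return hidden  # fed into paddle.layer.nce during training
+```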
+
+
+## Training
+Run `python train.py` in a terminal window to start the training task directly.
+
+- On the first run, the script checks whether the PTB dataset is present in the user's cache directory and downloads it automatically if it is not.
+- During training, the cost on the training set is printed every 10 batches.
+- At the end of every pass, the loss on the test set is computed and the latest model snapshot is saved; this behaviour is implemented by the event handler sketched below.
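+
+A condensed, self-contained sketch of `train.py` (hyper-parameter values follow that script; the snapshot directory `models` is created if missing) looks like this:
+
+```python
+import gzip
+import logging
+import os
+
+import paddle.v2 as paddle
+from network_conf import ngram_lm
+
+logger = logging.getLogger("paddle")
+logger.setLevel(logging.INFO)
+
+if not os.path.isdir("models"):
+    os.mkdir("models")
+
+paddle.init(use_gpu=False, trainer_count=1)
+word_dict = paddle.dataset.imikolov.build_dict()
+cost = ngram_lm(hidden_size=128, emb_size=512, dict_size=len(word_dict))
+parameters = paddle.parameters.create(cost)
+trainer = paddle.trainer.SGD(cost, parameters,
+                             paddle.optimizer.Adam(learning_rate=1e-4))
+
+
+def event_handler(event):
+    if isinstance(event, paddle.event.EndIteration):
+        if event.batch_id and not event.batch_id % 10:
+            # print the training-set cost every 10 batches
+            logger.info("Pass %d, Batch %d, Cost %f" %
+                        (event.pass_id, event.batch_id, event.cost))
+    if isinstance(event, paddle.event.EndPass):
+        # evaluate on the test set and save a snapshot after every pass
+        result = trainer.test(
+            paddle.batch(paddle.dataset.imikolov.test(word_dict, 5), 64))
+        logger.info("Test Pass %d, Cost %f" % (event.pass_id, result.cost))
+        save_path = "models/model_pass_%05d.tar.gz" % event.pass_id
+        with gzip.open(save_path, "w") as f:
+            parameters.to_tar(f)
+
+
+trainer.train(
+    paddle.batch(paddle.dataset.imikolov.train(word_dict, 5), 64),
+    num_passes=1000,
+    event_handler=event_handler)
+```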
+
+The NCE layer is invoked in the model file `network_conf.py` as follows:
+
+```python
+cost = paddle.layer.nce(
+ input=hidden_layer,
+ label=next_word,
+ num_classes=dict_size,
+ param_attr=paddle.attr.Param(name="nce_w"),
+ bias_attr=paddle.attr.Param(name="nce_b"),
+ act=paddle.activation.Sigmoid(),
+ num_neg_samples=25,
+ neg_distribution=None)
+```
+
+Important parameters of the NCE layer are explained below:
+
+| Parameter | Role | Notes |
+|:------ |:-------| :--------|
+| param\_attr / bias\_attr | Sets the parameter names | Makes it easy to load the parameters at inference time; see the Inference section for details. |
+| num\_neg\_samples | Number of sampled negative examples | Controls the ratio of positive to negative examples. Valid values lie in [1, dictionary size - 1]; more negative samples slow training down but generally improve model accuracy. |
+| neg\_distribution | Distribution used to sample negative labels, uniform by default | Lets you control the per-class sampling weights of negative examples. For example, if the positive label is "sunny" and the negative label "flood" should be distinguished more strongly during training, the sampling weight of the "flood" class can be increased, as sketched below. |
+| act | Activation function | Following the derivation of NCE, the sigmoid function must be used here. |
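+
+For instance, to up-weight particular classes during negative sampling, an explicit distribution whose length equals `num_classes` can be passed (a sketch only: `FLOOD_ID` is a hypothetical word id, and `hidden_layer`, `next_word` and `dict_size` are assumed to be defined as in the snippet above):
+
+```python
+import numpy as np
+
+# start from a uniform distribution over all dictionary entries ...
+sampling_weights = np.ones(dict_size) / dict_size
+# ... make the hypothetical class FLOOD_ID five times more likely to be drawn
+# as a negative sample, then re-normalize so the weights sum to one.
+sampling_weights[FLOOD_ID] *= 5.0
+sampling_weights /= sampling_weights.sum()
+
+cost = paddle.layer.nce(
+    input=hidden_layer,
+    label=next_word,
+    num_classes=dict_size,
+    param_attr=paddle.attr.Param(name="nce_w"),
+    bias_attr=paddle.attr.Param(name="nce_b"),
+    act=paddle.activation.Sigmoid(),
+    num_neg_samples=25,
+    neg_distribution=sampling_weights.tolist())
+```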
+
+## Inference
+1. First, modify the `main` function of the `infer.py` script to specify the model to be tested.
+2. Note that **inference uses different computation logic from training**: the `paddle.layer.nce` layer used for training has to be replaced by a fully connected layer, `paddle.layer.fc`, which directly loads the parameters learned by NCE, as in the code below:
+
+ ```python
+ prediction = paddle.layer.fc(
+ size=dict_size,
+ act=paddle.activation.Softmax(),
+ bias_attr=paddle.attr.Param(name="nce_b"),
+ input=hidden_layer,
+ param_attr=paddle.attr.Param(name="nce_w"))
+ ```
+3. Run `python infer.py`. The script first loads the specified model, then predicts batch by batch and prints the results. The output format is as follows:
+
+ ```text
+ 0.6734 their may want to move
+
+ ```
+
+    Each line is one prediction with three tab-separated ("\t") columns:
+    - Column 1: the probability of the predicted next word.
+    - Column 2: the next word predicted by the model.
+    - Column 3: the $n$ input words, separated by spaces.
+
+
+## References
+1. Mnih A, Kavukcuoglu K. [Learning word embeddings efficiently with noise-contrastive estimation](https://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf)[C]//Advances in neural information processing systems. 2013: 2265-2273.
+
+2. Morin, F., & Bengio, Y. (2005, January). [Hierarchical Probabilistic Neural Network Language Model](http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf). In Aistats (Vol. 5, pp. 246-252).
+
+3. Mnih A, Teh Y W. [A Fast and Simple Algorithm for Training Neural Probabilistic Language Models](http://xueshu.baidu.com/s?wd=paperuri%3A%280735b97df93976efb333ac8c266a1eb2%29&filter=sc_long_sign&tn=SE_xueshusource_2kduw22v&sc_vurl=http%3A%2F%2Farxiv.org%2Fabs%2F1206.6426&ie=utf-8&sc_us=5770715420073315630)[J]. Computer Science, 2012:1751-1758.
+
+
+
+
+
+
diff --git a/nce_cost/infer.py b/nce_cost/infer.py
index 53e3aef45fc02ac008caa7102836ac47915be1fc..89d80792c85d68ee76234d5558b8f363b8768f92 100644
--- a/nce_cost/infer.py
+++ b/nce_cost/infer.py
@@ -1,70 +1,49 @@
+#!/usr/bin/env python
# -*- encoding:utf-8 -*-
-import numpy as np
-import glob
+import os
import gzip
-import paddle.v2 as paddle
-from nce_conf import network_conf
-
-
-def main():
- paddle.init(use_gpu=False, trainer_count=1)
- word_dict = paddle.dataset.imikolov.build_dict()
- dict_size = len(word_dict)
-
- prediction_layer = network_conf(
- is_train=False,
- hidden_size=128,
- embedding_size=512,
- dict_size=dict_size)
-
- models_list = glob.glob('./models/*')
- models_list = sorted(models_list)
-
- with gzip.open(models_list[-1], 'r') as f:
- parameters = paddle.parameters.Parameters.from_tar(f)
+import numpy as np
- idx_word_dict = dict((v, k) for k, v in word_dict.items())
- batch_size = 64
- batch_ins = []
- ins_iter = paddle.dataset.imikolov.test(word_dict, 5)
+import paddle.v2 as paddle
+from network_conf import ngram_lm
- infer_data = []
- infer_data_label = []
- for item in paddle.dataset.imikolov.test(word_dict, 5)():
- infer_data.append((item[:4]))
- infer_data_label.append(item[4])
- # Choose 100 samples from the test set to show how to infer.
- if len(infer_data_label) == 100:
- break
- feeding = {
- 'firstw': 0,
- 'secondw': 1,
- 'thirdw': 2,
- 'fourthw': 3,
- 'fifthw': 4
- }
+def infer_a_batch(inferer, test_batch, id_to_word):
+ probs = inferer.infer(input=test_batch)
+ for i, res in enumerate(zip(test_batch, probs)):
+ maxid = res[1].argsort()[-1]
+ print("%.4f\t%s\t%s" % (res[1][maxid], id_to_word[maxid],
+ " ".join([id_to_word[w] for w in res[0]])))
- predictions = paddle.infer(
- output_layer=prediction_layer,
- parameters=parameters,
- input=infer_data,
- feeding=feeding,
- field=['value'])
- for i, (prob, data,
- label) in enumerate(zip(predictions, infer_data, infer_data_label)):
- print '--------------------------'
- print "No.%d Input: " % (i+1) + \
- idx_word_dict[data[0]] + ' ' + \
- idx_word_dict[data[1]] + ' ' + \
- idx_word_dict[data[2]] + ' ' + \
- idx_word_dict[data[3]]
- print 'Ground Truth Output: ' + idx_word_dict[label]
- print 'Predict Output: ' + idx_word_dict[prob.argsort(
- kind='heapsort', axis=0)[-1]]
- print
+def infer(model_path, batch_size):
+ assert os.path.exists(model_path), "the trained model does not exist."
+ word_to_id = paddle.dataset.imikolov.build_dict()
+ id_to_word = dict((v, k) for k, v in word_to_id.items())
+ dict_size = len(word_to_id)
+ paddle.init(use_gpu=False, trainer_count=1)
-if __name__ == '__main__':
- main()
+ # load the trained model.
+ with gzip.open(model_path) as f:
+ parameters = paddle.parameters.Parameters.from_tar(f)
+ prediction_layer = ngram_lm(
+ is_train=False, hidden_size=128, emb_size=512, dict_size=dict_size)
+ inferer = paddle.inference.Inference(
+ output_layer=prediction_layer, parameters=parameters)
+
+ test_batch = []
+ for idx, item in enumerate(paddle.dataset.imikolov.test(word_to_id, 5)()):
+ test_batch.append((item[:4]))
+ if len(test_batch) == batch_size:
+ infer_a_batch(inferer, test_batch, id_to_word)
+            test_batch = []
+
+ if len(test_batch):
+ infer_a_batch(inferer, test_batch, id_to_word)
+
+
+if __name__ == "__main__":
+ infer("models/model_pass_00000_00020.tar.gz", 10)
diff --git a/nce_cost/nce_conf.py b/nce_cost/nce_conf.py
deleted file mode 100644
index 962a9ccc80906bc2272245d0e297142397ffb024..0000000000000000000000000000000000000000
--- a/nce_cost/nce_conf.py
+++ /dev/null
@@ -1,61 +0,0 @@
-# -*- encoding:utf-8 -*-
-import math
-import paddle.v2 as paddle
-
-
-def network_conf(hidden_size, embedding_size, dict_size, is_train):
-
- first_word = paddle.layer.data(
- name="firstw", type=paddle.data_type.integer_value(dict_size))
- second_word = paddle.layer.data(
- name="secondw", type=paddle.data_type.integer_value(dict_size))
- third_word = paddle.layer.data(
- name="thirdw", type=paddle.data_type.integer_value(dict_size))
- fourth_word = paddle.layer.data(
- name="fourthw", type=paddle.data_type.integer_value(dict_size))
- next_word = paddle.layer.data(
- name="fifthw", type=paddle.data_type.integer_value(dict_size))
-
- embed_param_attr = paddle.attr.Param(
- name="_proj", initial_std=0.001, learning_rate=1, l2_rate=0)
- first_embedding = paddle.layer.embedding(
- input=first_word, size=embedding_size, param_attr=embed_param_attr)
- second_embedding = paddle.layer.embedding(
- input=second_word, size=embedding_size, param_attr=embed_param_attr)
- third_embedding = paddle.layer.embedding(
- input=third_word, size=embedding_size, param_attr=embed_param_attr)
- fourth_embedding = paddle.layer.embedding(
- input=fourth_word, size=embedding_size, param_attr=embed_param_attr)
-
- context_embedding = paddle.layer.concat(input=[
- first_embedding, second_embedding, third_embedding, fourth_embedding
- ])
-
- hidden_layer = paddle.layer.fc(
- input=context_embedding,
- size=hidden_size,
- act=paddle.activation.Tanh(),
- bias_attr=paddle.attr.Param(learning_rate=1),
- param_attr=paddle.attr.Param(
- initial_std=1. / math.sqrt(embedding_size * 8), learning_rate=1))
-
- if is_train == True:
- cost = paddle.layer.nce(
- input=hidden_layer,
- label=next_word,
- num_classes=dict_size,
- param_attr=paddle.attr.Param(name='nce_w'),
- bias_attr=paddle.attr.Param(name='nce_b'),
- act=paddle.activation.Sigmoid(),
- num_neg_samples=25,
- neg_distribution=None)
- return cost
- else:
- with paddle.layer.mixed(
- size=dict_size,
- act=paddle.activation.Softmax(),
- bias_attr=paddle.attr.Param(name='nce_b')) as prediction:
- prediction += paddle.layer.trans_full_matrix_projection(
- input=hidden_layer, param_attr=paddle.attr.Param(name='nce_w'))
-
- return prediction
diff --git a/nce_cost/network_conf.py b/nce_cost/network_conf.py
new file mode 100644
index 0000000000000000000000000000000000000000..a9e33e1b2d143c9662a34ea6c7fd3690b5d49e4e
--- /dev/null
+++ b/nce_cost/network_conf.py
@@ -0,0 +1,49 @@
+#!/usr/bin/env python
+# -*- encoding:utf-8 -*-
+import math
+import paddle.v2 as paddle
+
+
+def ngram_lm(hidden_size, emb_size, dict_size, gram_num=4, is_train=True):
+ emb_layers = []
+ embed_param_attr = paddle.attr.Param(
+ name="_proj", initial_std=0.001, learning_rate=1, l2_rate=0)
+ for i in range(gram_num):
+ word = paddle.layer.data(
+ name="__word%02d__" % (i),
+ type=paddle.data_type.integer_value(dict_size))
+ emb_layers.append(
+ paddle.layer.embedding(
+ input=word, size=emb_size, param_attr=embed_param_attr))
+
+ next_word = paddle.layer.data(
+ name="__target_word__", type=paddle.data_type.integer_value(dict_size))
+
+ context_embedding = paddle.layer.concat(input=emb_layers)
+
+ hidden_layer = paddle.layer.fc(
+ input=context_embedding,
+ size=hidden_size,
+ act=paddle.activation.Tanh(),
+ param_attr=paddle.attr.Param(initial_std=1. / math.sqrt(emb_size * 8)))
+
+ if is_train:
+ cost = paddle.layer.nce(
+ input=hidden_layer,
+ label=next_word,
+ num_classes=dict_size,
+ param_attr=paddle.attr.Param(name="nce_w"),
+ bias_attr=paddle.attr.Param(name="nce_b"),
+ act=paddle.activation.Sigmoid(),
+ num_neg_samples=25,
+ neg_distribution=None)
+ return cost
+ else:
+ prediction = paddle.layer.fc(
+ size=dict_size,
+ act=paddle.activation.Softmax(),
+ bias_attr=paddle.attr.Param(name="nce_b"),
+ input=hidden_layer,
+ param_attr=paddle.attr.Param(name="nce_w"))
+
+ return prediction
diff --git a/nce_cost/train.py b/nce_cost/train.py
index a8b437c1dd9bfc89fd03598b9a4201693c3074d7..4ab5043725805003cf151c6d0c8af8dbbc8c199f 100644
--- a/nce_cost/train.py
+++ b/nce_cost/train.py
@@ -1,52 +1,52 @@
+#!/usr/bin/env python
# -*- encoding:utf-8 -*-
-import paddle.v2 as paddle
+import os
+import logging
import gzip
-from nce_conf import network_conf
+import paddle.v2 as paddle
+from network_conf import ngram_lm
+
+logger = logging.getLogger("paddle")
+logger.setLevel(logging.INFO)
-def main():
+def train(model_save_dir):
+ if not os.path.exists(model_save_dir):
+ os.mkdir(model_save_dir)
+
paddle.init(use_gpu=False, trainer_count=1)
word_dict = paddle.dataset.imikolov.build_dict()
dict_size = len(word_dict)
- cost = network_conf(
- is_train=True, hidden_size=128, embedding_size=512, dict_size=dict_size)
+ optimizer = paddle.optimizer.Adam(learning_rate=1e-4)
+ cost = ngram_lm(hidden_size=128, emb_size=512, dict_size=dict_size)
parameters = paddle.parameters.create(cost)
- adagrad = paddle.optimizer.Adam(learning_rate=1e-4)
- trainer = paddle.trainer.SGD(cost, parameters, adagrad)
+ trainer = paddle.trainer.SGD(cost, parameters, optimizer)
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
- if event.batch_id % 1000 == 0:
- print "Pass %d, Batch %d, Cost %f" % (
- event.pass_id, event.batch_id, event.cost)
+ if event.batch_id and not event.batch_id % 10:
+ logger.info("Pass %d, Batch %d, Cost %f" %
+ (event.pass_id, event.batch_id, event.cost))
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
paddle.batch(paddle.dataset.imikolov.test(word_dict, 5), 64))
- print "Test here.. Pass %d, Cost %f" % (event.pass_id, result.cost)
+ logger.info("Test Pass %d, Cost %f" % (event.pass_id, result.cost))
- model_name = "./models/model_pass_%05d.tar.gz" % event.pass_id
- print "Save model into %s ..." % model_name
- with gzip.open(model_name, 'w') as f:
+ save_path = os.path.join(model_save_dir,
+ "model_pass_%05d.tar.gz" % event.pass_id)
+ logger.info("Save model into %s ..." % save_path)
+ with gzip.open(save_path, "w") as f:
parameters.to_tar(f)
- feeding = {
- 'firstw': 0,
- 'secondw': 1,
- 'thirdw': 2,
- 'fourthw': 3,
- 'fifthw': 4
- }
-
trainer.train(
paddle.batch(paddle.dataset.imikolov.train(word_dict, 5), 64),
num_passes=1000,
- event_handler=event_handler,
- feeding=feeding)
+ event_handler=event_handler)
-if __name__ == '__main__':
- main()
+if __name__ == "__main__":
+ train(model_save_dir="models")
diff --git a/nested_sequence/README.md b/nested_sequence/README.md
deleted file mode 100644
index a0990367ef8b03c70c29d285e22ef85907e1d0b7..0000000000000000000000000000000000000000
--- a/nested_sequence/README.md
+++ /dev/null
@@ -1 +0,0 @@
-TBD
diff --git a/nmt_without_attention/README.md b/nmt_without_attention/README.md
index a54b715102574dae1b619997a1ed7a2bfc14131c..2fd43bbdda53091506ca574d8c8b894870471c4f 100644
--- a/nmt_without_attention/README.md
+++ b/nmt_without_attention/README.md
@@ -51,14 +51,15 @@ RNN 的原始结构用一个向量来存储隐状态,然而这种结构的 RNN
在 PaddlePaddle 中,双向编码器可以很方便地调用相关 APIs 实现:
```python
-#### Encoder
src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
+
# source embedding
src_embedding = paddle.layer.embedding(
input=src_word_id, size=word_vector_dim)
-# use bidirectional_gru
+
+# bidirectional GRU as encoder
encoded_vector = paddle.networks.bidirectional_gru(
input=src_embedding,
size=encoder_size,
@@ -84,19 +85,17 @@ encoded_vector = paddle.networks.bidirectional_gru(
### 无注意力机制的解码器
-PaddleBook中[机器翻译](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.cn.md)的相关章节中,已介绍了带注意力机制(Attention Mechanism)的 Encoder-Decoder 结构,本例则介绍的是不带注意力机制的 Encoder-Decoder 结构。关于注意力机制,读者可进一步参考 PaddleBook 和参考文献\[[3](#参考文献)]。
+- PaddleBook中[机器翻译](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.cn.md)的相关章节中,已介绍了带注意力机制(Attention Mechanism)的 Encoder-Decoder 结构,本例介绍的则是不带注意力机制的 Encoder-Decoder 结构。关于注意力机制,读者可进一步参考 PaddleBook 和参考文献\[[3](#参考文献)]。
对于流行的RNN单元,PaddlePaddle 已有很好的实现均可直接调用。如果希望在 RNN 每一个时间步实现某些自定义操作,可使用 PaddlePaddle 中的`recurrent_layer_group`。首先,自定义单步逻辑函数,再利用函数 `recurrent_group()` 循环调用单步逻辑函数处理整个序列。本例中的无注意力机制的解码器便是使用`recurrent_layer_group`来实现,其中,单步逻辑函数`gru_decoder_without_attention()`相关代码如下:
```python
-#### Decoder
+# the initialization state for decoder GRU
encoder_last = paddle.layer.last_seq(input=encoded_vector)
-encoder_last_projected = paddle.layer.mixed(
- size=decoder_size,
- act=paddle.activation.Tanh(),
- input=paddle.layer.full_matrix_projection(input=encoder_last))
+encoder_last_projected = paddle.layer.fc(
+ size=decoder_size, act=paddle.activation.Tanh(), input=encoder_last)
-# gru step
+# the step function for decoder GRU
def gru_decoder_without_attention(enc_vec, current_word):
'''
Step function for gru decoder
@@ -106,33 +105,29 @@ def gru_decoder_without_attention(enc_vec, current_word):
:type current_word: layer object
'''
decoder_mem = paddle.layer.memory(
- name='gru_decoder',
- size=decoder_size,
- boot_layer=encoder_last_projected)
+ name="gru_decoder",
+ size=decoder_size,
+ boot_layer=encoder_last_projected)
context = paddle.layer.last_seq(input=enc_vec)
- decoder_inputs = paddle.layer.mixed(
- size=decoder_size * 3,
- input=[
- paddle.layer.full_matrix_projection(input=context),
- paddle.layer.full_matrix_projection(input=current_word)
- ])
+ decoder_inputs = paddle.layer.fc(
+ size=decoder_size * 3, input=[context, current_word])
gru_step = paddle.layer.gru_step(
- name='gru_decoder',
+ name="gru_decoder",
act=paddle.activation.Tanh(),
gate_act=paddle.activation.Sigmoid(),
input=decoder_inputs,
output_mem=decoder_mem,
size=decoder_size)
- out = paddle.layer.mixed(
+ out = paddle.layer.fc(
size=target_dict_dim,
bias_attr=True,
act=paddle.activation.Softmax(),
- input=paddle.layer.full_matrix_projection(input=gru_step))
- return out
+ input=gru_step)
+ return out
```
在模型训练和测试阶段,解码器的行为有很大的不同:
@@ -143,34 +138,14 @@ def gru_decoder_without_attention(enc_vec, current_word):
训练和生成的逻辑分别实现在如下的`if-else`条件分支中:
```python
-decoder_group_name = "decoder_group"
-group_input1 = paddle.layer.StaticInput(input=encoded_vector, is_seq=True)
+group_input1 = paddle.layer.StaticInput(input=encoded_vector)
group_inputs = [group_input1]
-if not generating:
- trg_embedding = paddle.layer.embedding(
- input=paddle.layer.data(
- name='target_language_word',
- type=paddle.data_type.integer_value_sequence(target_dict_dim)),
- size=word_vector_dim,
- param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
- group_inputs.append(trg_embedding)
-
- decoder = paddle.layer.recurrent_group(
- name=decoder_group_name,
- step=gru_decoder_without_attention,
- input=group_inputs)
-
- lbl = paddle.layer.data(
- name='target_language_next_word',
- type=paddle.data_type.integer_value_sequence(target_dict_dim))
- cost = paddle.layer.classification_cost(input=decoder, label=lbl)
-
- return cost
-else:
+decoder_group_name = "decoder_group"
+if is_generating:
trg_embedding = paddle.layer.GeneratedInput(
size=target_dict_dim,
- embedding_name='_target_language_embedding',
+ embedding_name="_target_language_embedding",
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
@@ -184,6 +159,26 @@ else:
max_length=max_length)
return beam_gen
+else:
+ trg_embedding = paddle.layer.embedding(
+ input=paddle.layer.data(
+ name="target_language_word",
+ type=paddle.data_type.integer_value_sequence(target_dict_dim)),
+ size=word_vector_dim,
+ param_attr=paddle.attr.ParamAttr(name="_target_language_embedding"))
+ group_inputs.append(trg_embedding)
+
+ decoder = paddle.layer.recurrent_group(
+ name=decoder_group_name,
+ step=gru_decoder_without_attention,
+ input=group_inputs)
+
+ lbl = paddle.layer.data(
+ name="target_language_next_word",
+ type=paddle.data_type.integer_value_sequence(target_dict_dim))
+ cost = paddle.layer.classification_cost(input=decoder, label=lbl)
+
+ return cost
```
## 数据准备
@@ -191,29 +186,31 @@ else:
## 模型的训练与测试
-在定义好网络结构后,就可以进行模型训练与测试了。根据用户运行时传递的参数是`--train` 还是 `--generate`,Python 脚本的 `main()` 函数分别调用函数`train()`和`generate()`来完成模型的训练与测试。
-
### 模型训练
-模型训练阶段,函数 `train()` 依次完成了如下的逻辑:
+
+启动模型训练十分简单,只需在命令行窗口中执行`python train.py`。模型训练阶段,`train.py` 脚本中的 `train()` 函数依次完成了如下的逻辑:
**a) 由网络定义,解析网络结构,初始化模型参数**
-```
-# initialize model
+```python
+# define the network topology.
cost = seq2seq_net(source_dict_dim, target_dict_dim)
parameters = paddle.parameters.create(cost)
```
**b) 设定训练过程中的优化策略、定义训练数据读取 `reader`**
-```
-# define optimize method and trainer
+```python
+# define optimization method
optimizer = paddle.optimizer.RMSProp(
learning_rate=1e-3,
gradient_clipping_threshold=10.0,
regularization=paddle.optimizer.L2Regularization(rate=8e-4))
+
+# define the trainer instance
trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=optimizer)
+
# define data reader
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
@@ -223,40 +220,33 @@ wmt14_reader = paddle.batch(
**c) 定义事件句柄,打印训练中间结果、保存模型快照**
-```
-# define event_handler callback
+```python
+# define the event_handler callback
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
- if event.batch_id % 100 == 0 and event.batch_id > 0:
- with gzip.open('models/nmt_without_att_params_batch_%d.tar.gz' %
- event.batch_id, 'w') as f:
+ if not event.batch_id % 100 and event.batch_id:
+ with gzip.open(
+ os.path.join(save_path,
+ "nmt_without_att_%05d_batch_%05d.tar.gz" %
+                                 (event.pass_id, event.batch_id)), "w") as f:
parameters.to_tar(f)
- if event.batch_id % 10 == 0:
- print "\nPass %d, Batch %d, Cost%f, %s" % (
- event.pass_id, event.batch_id, event.cost, event.metrics)
- else:
- sys.stdout.write('.')
- sys.stdout.flush()
+ if event.batch_id and not event.batch_id % 10:
+ logger.info("Pass %d, Batch %d, Cost %f, %s" % (
+ event.pass_id, event.batch_id, event.cost, event.metrics))
```
**d) 开始训练**
-```
-# start to train
+```python
+# start training
trainer.train(
reader=wmt14_reader, event_handler=event_handler, num_passes=2)
```
-启动模型训练的十分简单,只需在命令行窗口中执行
-
-```
-python nmt_without_attention_v2.py --train
-```
-
输出样例为
-```
+```text
Pass 0, Batch 0, Cost 267.674663, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 172.892294, {'classification_error_evaluator': 0.953895092010498}
@@ -268,81 +258,80 @@ Pass 0, Batch 30, Cost 153.633665, {'classification_error_evaluator': 0.86438035
Pass 0, Batch 40, Cost 168.170543, {'classification_error_evaluator': 0.8348183631896973}
```
+### 生成翻译结果
+利用训练好的模型生成翻译文本也十分简单。
+
+1. 首先请修改`generate.py`脚本中`main`中传递给`generate`函数的参数,以选择使用哪一个保存的模型来生成。默认参数如下所示:
+
+ ```python
+ generate(
+ source_dict_dim=30000,
+ target_dict_dim=30000,
+ batch_size=20,
+ beam_size=3,
+ model_path="models/nmt_without_att_params_batch_00100.tar.gz")
+ ```
+
+2. 在终端执行命令 `python generate.py`,脚本中的`generate()`依次执行如下逻辑:
+
+ **a) 加载测试样本**
+
+ ```python
+ # load data samples for generation
+ gen_creator = paddle.dataset.wmt14.gen(source_dict_dim)
+ gen_data = []
+ for item in gen_creator():
+ gen_data.append((item[0], ))
+ ```
+
+ **b) 初始化模型,执行`infer()`为每个输入样本生成`beam search`的翻译结果**
+
+ ```python
+ beam_gen = seq2seq_net(source_dict_dim, target_dict_dim, True)
+ with gzip.open(init_models_path) as f:
+ parameters = paddle.parameters.Parameters.from_tar(f)
+ # prob is the prediction probabilities, and id is the prediction word.
+ beam_result = paddle.infer(
+ output_layer=beam_gen,
+ parameters=parameters,
+ input=gen_data,
+ field=['prob', 'id'])
+ ```
+
+ **c) 加载源语言和目标语言词典,将`id`序列表示的句子转化成原语言并输出结果**
+
+ ```python
+ beam_result = inferer.infer(input=test_batch, field=["prob", "id"])
+
+ gen_sen_idx = np.where(beam_result[1] == -1)[0]
+ assert len(gen_sen_idx) == len(test_batch) * beam_size
+
+ start_pos, end_pos = 1, 0
+ for i, sample in enumerate(test_batch):
+ print(" ".join([
+ src_dict[w] for w in sample[0][1:-1]
+ ])) # skip the start and ending mark when print the source sentence
+ for j in xrange(beam_size):
+ end_pos = gen_sen_idx[i * beam_size + j]
+ print("%.4f\t%s" % (beam_result[0][i][j], " ".join(
+ trg_dict[w] for w in beam_result[1][start_pos:end_pos])))
+ start_pos = end_pos + 2
+ print("\n")
+ ```
+
+设置beam search的宽度为3,输入为一个法文句子,则自动为测试数据生成对应的翻译结果,输出格式如下:
+
+```text
+Elles connaissent leur entreprise mieux que personne .
+-3.754819 They know their business better than anyone .