Commit c59d32be, authored by liaogang

Merge branch 'develop' of https://github.com/PaddlePaddle/book into readme

......@@ -48,7 +48,16 @@ FROM ${paddle_image}:${paddle_tag}
MAINTAINER PaddlePaddle Authors <paddle-dev@baidu.com>
COPY . /book
EOF
if [ -n "${http_proxy}" ]; then
cat >> Dockerfile <<EOF
ENV http_proxy ${http_proxy}
ENV https_proxy ${http_proxy}
EOF
fi
cat >> Dockerfile <<EOF
RUN pip install -U nltk \
&& python /book/.tools/cache_dataset.py
......@@ -58,7 +67,7 @@ RUN ${update_mirror_cmd}
apt-get -y install gcc && \
apt-get -y clean && \
localedef -f UTF-8 -i en_US en_US.UTF-8 && \
pip install -U matplotlib jupyter numpy requests scipy
pip install -U pillow matplotlib jupyter numpy requests scipy
#convert md to ipynb
RUN /book/.tools/notedown.sh
......
......@@ -141,7 +141,8 @@ import paddle.v2 as paddle
def convolution_net(input_dim,
class_dim=2,
emb_dim=128,
hid_dim=128):
hid_dim=128,
is_predict=False):
data = paddle.layer.data("word",
paddle.data_type.integer_value_sequence(input_dim))
emb = paddle.layer.embedding(input=data, size=emb_dim)
......@@ -152,9 +153,12 @@ def convolution_net(input_dim,
output = paddle.layer.fc(input=[conv_3, conv_4],
size=class_dim,
act=paddle.activation.Softmax())
if not is_predict:
lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost
else:
return output
```
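The network input `input_dim` is the size of the dictionary, and `class_dim` is the number of classes. Here we use the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py) API to implement the convolution and pooling operations.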
......@@ -165,7 +169,8 @@ def stacked_lstm_net(input_dim,
class_dim=2,
emb_dim=128,
hid_dim=512,
stacked_num=3):
stacked_num=3,
is_predict=False):
"""
A Wrapper for sentiment classification task.
This network uses bi-directional recurrent network,
......@@ -223,9 +228,12 @@ def stacked_lstm_net(input_dim,
bias_attr=bias_attr,
param_attr=para_attr)
if not is_predict:
lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost
else:
return output
```
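The parameter `stacked_num` is the number of LSTM layers; it must be odd so that the topmost LSTM runs in the forward direction. In Paddle, an LSTM-based recurrent network is implemented with one fc layer and one lstmemory layer.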
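The odd-depth requirement can be checked with a small sketch. It assumes that layers are numbered from 1, that the first layer runs forward, and that every even-numbered layer is reversed. `lstm_directions` is a hypothetical helper written for illustration, not part of the PaddlePaddle API.

```python
def lstm_directions(stacked_num):
    """Return the direction of each stacked LSTM layer, bottom to top.

    Assumption: layer 1 is forward and directions alternate from there,
    so even-numbered layers are reversed.
    """
    if stacked_num < 1 or stacked_num % 2 == 0:
        raise ValueError("stacked_num must be a positive odd number")
    return ["reverse" if i % 2 == 0 else "forward"
            for i in range(1, stacked_num + 1)]


print(lstm_directions(3))  # ['forward', 'reverse', 'forward']
```

With an even depth the last entry would be `reverse`, which is why the sketch rejects it.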
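......@@ -294,7 +302,7 @@ Paddle provides a series of optimization algorithm APIs; here we use the Adam optimizer.
### Training
You can construct an SGD trainer with `paddle.trainer.SGD` and call `trainer.train` to train the model.
You can construct an SGD trainer with `paddle.trainer.SGD` and call `trainer.train` to train the model. In addition, passing an `event_handler` to the train function reports the state at the end of each batch and each pass.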
```python
# End batch and end pass event handler
def event_handler(event):
......@@ -309,7 +317,21 @@ Paddle provides a series of optimization algorithm APIs; here we use the Adam optimizer.
result = trainer.test(reader=test_reader, feeding=feeding)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
```
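Passing an `event_handler` to the train function reports the state at the end of each batch and each pass. For example, the following `event_handler` prints the cost and error after every 100 batches; at the end of each pass it calls `trainer.test` to run the test set and obtain the current model's error on it.
For example, the following `event_handler` prints the cost and error after every 100 batches; at the end of each pass it calls `trainer.test` to run the test set and obtain the current model's error on it.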
```python
from paddle.v2.plot import Ploter
train_title = "Train cost"
cost_ploter = Ploter(train_title)
step = 0
def event_handler_plot(event):
global step
if isinstance(event, paddle.event.EndIteration):
cost_ploter.append(train_title, step, event.cost)
cost_ploter.plot()
step += 1
```
Alternatively, build an `event_handler_plot` to plot the cost curve.
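The event-handler pattern can be tried without PaddlePaddle installed. The sketch below uses two hypothetical stand-in classes in place of `paddle.event.EndIteration` and `paddle.event.EndPass`, and a made-up cost sequence; it only shows how a handler dispatches on the event type and throttles its output.

```python
class EndIteration(object):
    """Hypothetical stand-in for paddle.event.EndIteration."""
    def __init__(self, pass_id, batch_id, cost):
        self.pass_id = pass_id
        self.batch_id = batch_id
        self.cost = cost


class EndPass(object):
    """Hypothetical stand-in for paddle.event.EndPass."""
    def __init__(self, pass_id):
        self.pass_id = pass_id


log = []


def event_handler(event):
    # Record the cost every 100 batches and mark the end of each pass.
    if isinstance(event, EndIteration):
        if event.batch_id % 100 == 0:
            log.append("Pass %d, Batch %d, Cost %f" %
                       (event.pass_id, event.batch_id, event.cost))
    elif isinstance(event, EndPass):
        log.append("End of pass %d" % event.pass_id)


# Simulate one pass of 250 batches with a made-up, decreasing cost.
for batch_id in range(250):
    event_handler(EndIteration(0, batch_id, 1.0 / (batch_id + 1)))
event_handler(EndPass(0))

for line in log:
    print(line)
```

In this simulation the handler reports batches 0, 100, and 200, followed by the end-of-pass line. A real trainer would invoke the handler itself, so the driving loop is only there to make the sketch self-contained.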
```python
# create trainer
trainer = paddle.trainer.SGD(cost=cost,
......@@ -331,6 +353,36 @@ Pass 0, Batch 100, Cost 0.294321, {'classification_error_evaluator': 0.1015625}
Test with Pass 0, {'classification_error_evaluator': 0.11432000249624252}
```
## Applying the Model
The trained model can be used to classify movie reviews. The program below shows how to run inference with the `paddle.infer` interface.
```python
import numpy as np
# Movie Reviews, from imdb test
reviews = [
'Read the book, forget the movie!',
'This is a great movie.'
]
reviews = [c.split() for c in reviews]
UNK = word_dict['<unk>']
input = []
for c in reviews:
input.append([[word_dict.get(words, UNK) for words in c]])
# 0 stands for positive sample, 1 stands for negative sample
label = {0:'pos', 1:'neg'}
# Use the network used by trainer
out = convolution_net(dict_dim, class_dim=class_dim, is_predict=True)
# out = stacked_lstm_net(dict_dim, class_dim=class_dim, stacked_num=3, is_predict=True)
probs = paddle.infer(output_layer=out, parameters=parameters, input=input)
labs = np.argsort(-probs)
for idx, lab in enumerate(labs):
print idx, "predicting probability is", probs[idx], "label is", label[lab[0]]
```
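The two data-handling steps of the inference program, mapping words to ids with an unknown-word fallback and turning class probabilities into a label, can be sketched in plain Python. The dictionary and probabilities below are made up for illustration, and `to_word_ids` and `pick_label` are hypothetical helpers rather than PaddlePaddle functions.

```python
def to_word_ids(sentence, word_dict, unk_id):
    """Split on whitespace and look up each token, falling back to unk_id."""
    return [word_dict.get(word, unk_id) for word in sentence.split()]


def pick_label(probs, names):
    """Return the name of the class with the highest probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return names[best]


# Toy dictionary, made up for illustration only.
word_dict = {'<unk>': 0, 'This': 1, 'is': 2, 'a': 3, 'great': 4}
ids = to_word_ids('This is a great movie.', word_dict, word_dict['<unk>'])
print(ids)  # [1, 2, 3, 4, 0]

# Made-up probabilities for classes 0 (pos) and 1 (neg).
label = {0: 'pos', 1: 'neg'}
print(pick_label([0.9, 0.1], label))  # pos
```

Note that whitespace splitting keeps punctuation attached, so `movie.` is looked up as a single token and falls back to the unknown-word id in this toy dictionary.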
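## Summary
In this chapter we used sentiment analysis as an example to introduce end-to-end short-text classification with deep learning, and ran all of the related experiments with PaddlePaddle. We also briefly introduced two text-processing models: convolutional neural networks and recurrent neural networks. Later chapters show how these two basic deep learning models apply to other tasks.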
......
......@@ -183,7 +183,8 @@ import paddle.v2 as paddle
def convolution_net(input_dim,
class_dim=2,
emb_dim=128,
hid_dim=128):
hid_dim=128,
is_predict=False):
data = paddle.layer.data("word",
paddle.data_type.integer_value_sequence(input_dim))
emb = paddle.layer.embedding(input=data, size=emb_dim)
......@@ -194,9 +195,12 @@ def convolution_net(input_dim,
output = paddle.layer.fc(input=[conv_3, conv_4],
size=class_dim,
act=paddle.activation.Softmax())
if not is_predict:
lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost
else:
return output
```
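The network input `input_dim` is the size of the dictionary, and `class_dim` is the number of classes. Here we use the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py) API to implement the convolution and pooling operations.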
......@@ -207,7 +211,8 @@ def stacked_lstm_net(input_dim,
class_dim=2,
emb_dim=128,
hid_dim=512,
stacked_num=3):
stacked_num=3,
is_predict=False):
"""
A Wrapper for sentiment classification task.
This network uses bi-directional recurrent network,
......@@ -265,9 +270,12 @@ def stacked_lstm_net(input_dim,
bias_attr=bias_attr,
param_attr=para_attr)
if not is_predict:
lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost
else:
return output
```
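The parameter `stacked_num` is the number of LSTM layers; it must be odd so that the topmost LSTM runs in the forward direction. In Paddle, an LSTM-based recurrent network is implemented with one fc layer and one lstmemory layer.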
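......@@ -336,7 +344,7 @@ Paddle provides a series of optimization algorithm APIs; here we use the Adam optimizer.
### Training
You can construct an SGD trainer with `paddle.trainer.SGD` and call `trainer.train` to train the model.
You can construct an SGD trainer with `paddle.trainer.SGD` and call `trainer.train` to train the model. In addition, passing an `event_handler` to the train function reports the state at the end of each batch and each pass.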
```python
# End batch and end pass event handler
def event_handler(event):
......@@ -351,7 +359,21 @@ Paddle provides a series of optimization algorithm APIs; here we use the Adam optimizer.
result = trainer.test(reader=test_reader, feeding=feeding)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
```
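Passing an `event_handler` to the train function reports the state at the end of each batch and each pass. For example, the following `event_handler` prints the cost and error after every 100 batches; at the end of each pass it calls `trainer.test` to run the test set and obtain the current model's error on it.
For example, the following `event_handler` prints the cost and error after every 100 batches; at the end of each pass it calls `trainer.test` to run the test set and obtain the current model's error on it.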
```python
from paddle.v2.plot import Ploter
train_title = "Train cost"
cost_ploter = Ploter(train_title)
step = 0
def event_handler_plot(event):
global step
if isinstance(event, paddle.event.EndIteration):
cost_ploter.append(train_title, step, event.cost)
cost_ploter.plot()
step += 1
```
Alternatively, build an `event_handler_plot` to plot the cost curve.
```python
# create trainer
trainer = paddle.trainer.SGD(cost=cost,
......@@ -373,6 +395,36 @@ Pass 0, Batch 100, Cost 0.294321, {'classification_error_evaluator': 0.1015625}
Test with Pass 0, {'classification_error_evaluator': 0.11432000249624252}
```
## Applying the Model
The trained model can be used to classify movie reviews. The program below shows how to run inference with the `paddle.infer` interface.
```python
import numpy as np
# Movie Reviews, from imdb test
reviews = [
'Read the book, forget the movie!',
'This is a great movie.'
]
reviews = [c.split() for c in reviews]
UNK = word_dict['<unk>']
input = []
for c in reviews:
input.append([[word_dict.get(words, UNK) for words in c]])
# 0 stands for positive sample, 1 stands for negative sample
label = {0:'pos', 1:'neg'}
# Use the network used by trainer
out = convolution_net(dict_dim, class_dim=class_dim, is_predict=True)
# out = stacked_lstm_net(dict_dim, class_dim=class_dim, stacked_num=3, is_predict=True)
probs = paddle.infer(output_layer=out, parameters=parameters, input=input)
labs = np.argsort(-probs)
for idx, lab in enumerate(labs):
print idx, "predicting probability is", probs[idx], "label is", label[lab[0]]
```
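## Summary
In this chapter we used sentiment analysis as an example to introduce end-to-end short-text classification with deep learning, and ran all of the related experiments with PaddlePaddle. We also briefly introduced two text-processing models: convolutional neural networks and recurrent neural networks. Later chapters show how these two basic deep learning models apply to other tasks.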
......
......@@ -214,6 +214,7 @@ import numpy as np
import gzip
import paddle.v2 as paddle
import paddle.v2.dataset.conll05 as conll05
import paddle.v2.evaluator as evaluator
paddle.init(use_gpu=False, trainer_count=1)
......@@ -343,23 +344,23 @@ for i in range(1, depth):
input_tmp = [mix_hidden, lstm]
```
- We will concatenate the output of the top LSTM unit with its input, and project the result into a hidden layer. Then, we put a fully connected layer on top to get the final feature vector representation.
- In PaddlePaddle, the state features and transition features of a CRF are learned by a fully connected layer and a CRF layer, respectively. The fully connected layer with linear activation learns the state features; here we use `paddle.layer.mixed` (`paddle.layer.fc` can be used as well). The CRF layer, `paddle.layer.crf`, learns only the transition features. It is a cost layer and the last layer of the network: given the input sequence, it outputs the log probability of the true tag sequence as the cost, and it requires the true tag sequence as the target during training.
```python
feature_out = paddle.layer.mixed(
```python
# The output of the top LSTM unit and its input are fed into a fully connected layer,
# whose size equals the number of tag labels.
# The fully connected layer learns the state features
feature_out = paddle.layer.mixed(
size=label_dict_len,
bias_attr=std_default,
input=[
paddle.layer.full_matrix_projection(
input=input_tmp[0], param_attr=hidden_para_attr),
paddle.layer.full_matrix_projection(
input=input_tmp[1], param_attr=lstm_para_attr)
], )
```
- At the end of the network, we use CRF as the cost function; the parameter of CRF cost will be named `crfw`.
input=input_tmp[1], param_attr=lstm_para_attr)], )
```python
crf_cost = paddle.layer.crf(
size=label_dict_len,
input=feature_out,
......@@ -370,7 +371,7 @@ crf_cost = paddle.layer.crf(
learning_rate=mix_hidden_lr))
```
- The CRF decoding layer is used for evaluation and inference. It shares weights with CRF layer. The sharing of parameters among multiple layers is specified by using the same parameter name in these layers.
- The CRF decoding layer is used for evaluation and inference. It shares weights with the CRF layer; parameter sharing among layers is specified by giving them the same parameter name. If the true tag sequence is provided during training, `paddle.layer.crf_decoding` calculates the labelling error for each input token and `evaluator.sum` sums the errors over the entire sequence. Otherwise, `paddle.layer.crf_decoding` generates the labelling tags.
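As a framework-independent illustration of the decoding step, the sketch below finds the highest-scoring tag sequence from per-position state scores and a tag-to-tag transition matrix. It is the textbook Viterbi algorithm, not PaddlePaddle's implementation of `paddle.layer.crf_decoding`; the function name and all scores are made up, and start and end transition weights are left out for brevity.

```python
def viterbi_decode(state_scores, transition):
    """Return the highest-scoring tag sequence and its score.

    state_scores[t][k]: score of tag k at position t (state features).
    transition[j][k]:   score of moving from tag j to tag k.
    """
    num_tags = len(transition)
    best = list(state_scores[0])   # best score of a path ending in each tag
    backptr = []                   # backptr[t-1][k]: best previous tag
    for scores in state_scores[1:]:
        new_best, pointers = [], []
        for k in range(num_tags):
            prev = max(range(num_tags),
                       key=lambda j: best[j] + transition[j][k])
            pointers.append(prev)
            new_best.append(best[prev] + transition[prev][k] + scores[k])
        best = new_best
        backptr.append(pointers)
    # Trace back from the best final tag.
    last = max(range(num_tags), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Made-up scores for 3 positions and 2 tags; switching tags is penalized.
state_scores = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
transition = [[0.0, -3.0], [-3.0, 0.0]]
print(viterbi_decode(state_scores, transition))  # ([0, 0, 0], 3.0)
```

Choosing the best tag at each position independently would give `[0, 1, 0]`, but the transition penalties make that sequence score -2.0, so the decoder prefers `[0, 0, 0]`. This is the sense in which the state features and transition features are combined.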
```python
crf_dec = paddle.layer.crf_decoding(
......@@ -414,7 +415,7 @@ We will create trainer given model topology, parameters, and optimization method
```python
optimizer = paddle.optimizer.Momentum(
momentum=0,
learning_rate=2e-2,
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=8e-4),
model_average=paddle.optimizer.ModelAverage(
average_window=0.5, max_average_window=10000), )
......@@ -432,7 +433,7 @@ As mentioned in data preparation section, we will use CoNLL 2005 test corpus as
```python
reader = paddle.batch(
paddle.reader.shuffle(
conll05.test(), buf_size=8192), batch_size=20)
conll05.test(), buf_size=8192), batch_size=2)
```
`feeding` specifies the correspondence between the columns of a data instance and the data layers. For example, according to the following `feeding`, the 0th column of each data instance produced by `conll05.test()` is matched to the data layer named `word_data`.
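The mapping can be pictured as a lookup from layer name to column index. The sketch below is a plain-Python illustration under that assumption; `feed` is a hypothetical helper, and the two-column instance is made up rather than taken from the CoNLL 2005 data.

```python
def feed(instance, feeding):
    """Map a data instance (tuple of columns) to {data layer name: column}."""
    return dict((name, instance[index]) for name, index in feeding.items())


# Hypothetical two-column instance and feeding, for illustration only.
feeding = {'word_data': 0, 'target': 1}
instance = ([3, 7, 7, 2], [0, 1, 1, 0])
print(feed(instance, feeding)['word_data'])  # [3, 7, 7, 2]
```

Because the lookup is by name, the order of the entries in `feeding` does not matter; only the column indices do.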
......@@ -456,10 +457,10 @@ feeding = {
```python
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
if event.batch_id and event.batch_id % 10 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
if event.batch_id % 1000 == 0:
if event.batch_id % 400 == 0:
result = trainer.test(reader=reader, feeding=feeding)
print "\nTest with Pass %d, Batch %d, %s" % (event.pass_id, event.batch_id, result.metrics)
......
......@@ -192,6 +192,7 @@ import numpy as np
import gzip
import paddle.v2 as paddle
import paddle.v2.dataset.conll05 as conll05
import paddle.v2.evaluator as evaluator
paddle.init(use_gpu=False, trainer_count=1)
......@@ -274,12 +275,12 @@ emb_layers.append(mark_embedding)
```python
hidden_0 = paddle.layer.mixed(
size=hidden_dim,
bias_attr=std_default,
input=[
size=hidden_dim,
bias_attr=std_default,
input=[
paddle.layer.full_matrix_projection(
input=emb, param_attr=std_default) for emb in emb_layers
])
])
mix_hidden_lr = 1e-3
lstm_para_attr = paddle.attr.Param(initial_std=0.0, learning_rate=1.0)
......@@ -320,23 +321,24 @@ for i in range(1, depth):
input_tmp = [mix_hidden, lstm]
```
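- Take the output of the last stacked LSTM together with that LSTM unit's input-to-hidden projection, and map them through a fully connected layer to the dimension of the tag dictionary to obtain the final feature vector representation.
- In PaddlePaddle, the state features and transition features of a CRF are learned by a fully connected layer and a CRF layer, respectively. In this example we use `paddle.layer.mixed` with linear activation to learn the CRF state features (`paddle.layer.fc` can be used as well), while `paddle.layer.crf` learns only the transition features. The `paddle.layer.crf` layer is a cost layer at the end of the network; given the input sequence, it outputs the log probability of the tag sequence as the cost. During training, this layer takes the correct tag sequence as the learning target.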
```python
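# Take the output of the last stacked LSTM and that LSTM unit's input-to-hidden projection,
# and map them through a fully connected layer to the tag dictionary dimension to learn the CRF state features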
feature_out = paddle.layer.mixed(
size=label_dict_len,
bias_attr=std_default,
input=[
size=label_dict_len,
bias_attr=std_default,
input=[
paddle.layer.full_matrix_projection(
input=input_tmp[0], param_attr=hidden_para_attr),
paddle.layer.full_matrix_projection(
input=input_tmp[1], param_attr=lstm_para_attr)
], )
```
- At the end of the network, define a CRF layer to compute the cost, with its parameter named `crfw`; this layer requires the correct labels (target) as input.
], )
```python
# Learn the transition features of the CRF
crf_cost = paddle.layer.crf(
size=label_dict_len,
input=feature_out,
......@@ -347,7 +349,7 @@ crf_cost = paddle.layer.crf(
learning_rate=mix_hidden_lr))
```
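- The CRF decoding layer uses the same parameter name as the CRF layer, so the two share weights. If the correct labels (target) are provided, it counts the wrongly labeled tags, which can be used to evaluate the model. If no correct labels are provided, the layer infers the optimal solution, which can be used for prediction.
- The CRF decoding layer uses the same parameter name as the CRF layer, i.e., it loads the parameters learned by the `paddle.layer.crf` layer. During training, `paddle.layer.crf_decoding` is given the correct tag sequence (target) and outputs whether each tag is correct; `evaluator.sum` computes the labelling error rate over the sequence, which can be used to evaluate the model. During decoding, no correct labels are given, and the layer decodes the result by finding the tag sequence with the highest probability.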
```python
crf_dec = paddle.layer.crf_decoding(
......@@ -394,7 +396,7 @@ parameters.set('emb', load_parameter(conll05.get_embedding(), 44068, 32))
# create optimizer
optimizer = paddle.optimizer.Momentum(
momentum=0,
learning_rate=2e-2,
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=8e-4),
model_average=paddle.optimizer.ModelAverage(
average_window=0.5, max_average_window=10000), )
......@@ -412,7 +414,7 @@ trainer = paddle.trainer.SGD(cost=crf_cost,
```python
reader = paddle.batch(
paddle.reader.shuffle(
conll05.test(), buf_size=8192), batch_size=20)
conll05.test(), buf_size=8192), batch_size=2)
```
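`feeding` specifies the correspondence between each data column and a data layer. For example, the `feeding` below means that column 0 of the data produced by `conll05.test()` corresponds to the features of the `word_data` layer.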
......@@ -437,10 +439,10 @@ feeding = {
```python
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
if event.batch_id and event.batch_id % 10 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
if event.batch_id % 1000 == 0:
if event.batch_id % 400 == 0:
result = trainer.test(reader=reader, feeding=feeding)
print "\nTest with Pass %d, Batch %d, %s" % (event.pass_id, event.batch_id, result.metrics)
......@@ -459,7 +461,7 @@ def event_handler(event):
trainer.train(
reader=reader,
event_handler=event_handler,
num_passes=10000,
num_passes=1,
feeding=feeding)
```
......
......@@ -256,6 +256,7 @@ import numpy as np
import gzip
import paddle.v2 as paddle
import paddle.v2.dataset.conll05 as conll05
import paddle.v2.evaluator as evaluator
paddle.init(use_gpu=False, trainer_count=1)
......@@ -385,23 +386,23 @@ for i in range(1, depth):
input_tmp = [mix_hidden, lstm]
```
- We will concatenate the output of the top LSTM unit with its input, and project the result into a hidden layer. Then, we put a fully connected layer on top to get the final feature vector representation.
- In PaddlePaddle, the state features and transition features of a CRF are learned by a fully connected layer and a CRF layer, respectively. The fully connected layer with linear activation learns the state features; here we use `paddle.layer.mixed` (`paddle.layer.fc` can be used as well). The CRF layer, `paddle.layer.crf`, learns only the transition features. It is a cost layer and the last layer of the network: given the input sequence, it outputs the log probability of the true tag sequence as the cost, and it requires the true tag sequence as the target during training.
```python
feature_out = paddle.layer.mixed(
```python
# The output of the top LSTM unit and its input are fed into a fully connected layer,
# whose size equals the number of tag labels.
# The fully connected layer learns the state features
feature_out = paddle.layer.mixed(
size=label_dict_len,
bias_attr=std_default,
input=[
paddle.layer.full_matrix_projection(
input=input_tmp[0], param_attr=hidden_para_attr),
paddle.layer.full_matrix_projection(
input=input_tmp[1], param_attr=lstm_para_attr)
], )
```
- At the end of the network, we use CRF as the cost function; the parameter of CRF cost will be named `crfw`.
input=input_tmp[1], param_attr=lstm_para_attr)], )
```python
crf_cost = paddle.layer.crf(
size=label_dict_len,
input=feature_out,
......@@ -412,7 +413,7 @@ crf_cost = paddle.layer.crf(
learning_rate=mix_hidden_lr))
```
- The CRF decoding layer is used for evaluation and inference. It shares weights with CRF layer. The sharing of parameters among multiple layers is specified by using the same parameter name in these layers.
- The CRF decoding layer is used for evaluation and inference. It shares weights with the CRF layer; parameter sharing among layers is specified by giving them the same parameter name. If the true tag sequence is provided during training, `paddle.layer.crf_decoding` calculates the labelling error for each input token and `evaluator.sum` sums the errors over the entire sequence. Otherwise, `paddle.layer.crf_decoding` generates the labelling tags.
```python
crf_dec = paddle.layer.crf_decoding(
......@@ -456,7 +457,7 @@ We will create trainer given model topology, parameters, and optimization method
```python
optimizer = paddle.optimizer.Momentum(
momentum=0,
learning_rate=2e-2,
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=8e-4),
model_average=paddle.optimizer.ModelAverage(
average_window=0.5, max_average_window=10000), )
......@@ -474,7 +475,7 @@ As mentioned in data preparation section, we will use CoNLL 2005 test corpus as
```python
reader = paddle.batch(
paddle.reader.shuffle(
conll05.test(), buf_size=8192), batch_size=20)
conll05.test(), buf_size=8192), batch_size=2)
```
`feeding` specifies the correspondence between the columns of a data instance and the data layers. For example, according to the following `feeding`, the 0th column of each data instance produced by `conll05.test()` is matched to the data layer named `word_data`.
......@@ -498,10 +499,10 @@ feeding = {
```python
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
if event.batch_id and event.batch_id % 10 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
if event.batch_id % 1000 == 0:
if event.batch_id % 400 == 0:
result = trainer.test(reader=reader, feeding=feeding)
print "\nTest with Pass %d, Batch %d, %s" % (event.pass_id, event.batch_id, result.metrics)
......
......@@ -234,6 +234,7 @@ import numpy as np
import gzip
import paddle.v2 as paddle
import paddle.v2.dataset.conll05 as conll05
import paddle.v2.evaluator as evaluator
paddle.init(use_gpu=False, trainer_count=1)
......@@ -316,12 +317,12 @@ emb_layers.append(mark_embedding)
```python
hidden_0 = paddle.layer.mixed(
size=hidden_dim,
bias_attr=std_default,
input=[
size=hidden_dim,
bias_attr=std_default,
input=[
paddle.layer.full_matrix_projection(
input=emb, param_attr=std_default) for emb in emb_layers
])
])
mix_hidden_lr = 1e-3
lstm_para_attr = paddle.attr.Param(initial_std=0.0, learning_rate=1.0)
......@@ -362,23 +363,24 @@ for i in range(1, depth):
input_tmp = [mix_hidden, lstm]
```
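- Take the output of the last stacked LSTM together with that LSTM unit's input-to-hidden projection, and map them through a fully connected layer to the dimension of the tag dictionary to obtain the final feature vector representation.
- In PaddlePaddle, the state features and transition features of a CRF are learned by a fully connected layer and a CRF layer, respectively. In this example we use `paddle.layer.mixed` with linear activation to learn the CRF state features (`paddle.layer.fc` can be used as well), while `paddle.layer.crf` learns only the transition features. The `paddle.layer.crf` layer is a cost layer at the end of the network; given the input sequence, it outputs the log probability of the tag sequence as the cost. During training, this layer takes the correct tag sequence as the learning target.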
```python
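# Take the output of the last stacked LSTM and that LSTM unit's input-to-hidden projection,
# and map them through a fully connected layer to the tag dictionary dimension to learn the CRF state features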
feature_out = paddle.layer.mixed(
size=label_dict_len,
bias_attr=std_default,
input=[
size=label_dict_len,
bias_attr=std_default,
input=[
paddle.layer.full_matrix_projection(
input=input_tmp[0], param_attr=hidden_para_attr),
paddle.layer.full_matrix_projection(
input=input_tmp[1], param_attr=lstm_para_attr)
], )
```
- At the end of the network, define a CRF layer to compute the cost, with its parameter named `crfw`; this layer requires the correct labels (target) as input.
], )
```python
# Learn the transition features of the CRF
crf_cost = paddle.layer.crf(
size=label_dict_len,
input=feature_out,
......@@ -389,7 +391,7 @@ crf_cost = paddle.layer.crf(
learning_rate=mix_hidden_lr))
```
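- The CRF decoding layer uses the same parameter name as the CRF layer, so the two share weights. If the correct labels (target) are provided, it counts the wrongly labeled tags, which can be used to evaluate the model. If no correct labels are provided, the layer infers the optimal solution, which can be used for prediction.
- The CRF decoding layer uses the same parameter name as the CRF layer, i.e., it loads the parameters learned by the `paddle.layer.crf` layer. During training, `paddle.layer.crf_decoding` is given the correct tag sequence (target) and outputs whether each tag is correct; `evaluator.sum` computes the labelling error rate over the sequence, which can be used to evaluate the model. During decoding, no correct labels are given, and the layer decodes the result by finding the tag sequence with the highest probability.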
```python
crf_dec = paddle.layer.crf_decoding(
......@@ -436,7 +438,7 @@ parameters.set('emb', load_parameter(conll05.get_embedding(), 44068, 32))
# create optimizer
optimizer = paddle.optimizer.Momentum(
momentum=0,
learning_rate=2e-2,
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=8e-4),
model_average=paddle.optimizer.ModelAverage(
average_window=0.5, max_average_window=10000), )
......@@ -454,7 +456,7 @@ trainer = paddle.trainer.SGD(cost=crf_cost,
```python
reader = paddle.batch(
paddle.reader.shuffle(
conll05.test(), buf_size=8192), batch_size=20)
conll05.test(), buf_size=8192), batch_size=2)
```
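`feeding` specifies the correspondence between each data column and a data layer. For example, the `feeding` below means that column 0 of the data produced by `conll05.test()` corresponds to the features of the `word_data` layer.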
......@@ -479,10 +481,10 @@ feeding = {
```python
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
if event.batch_id and event.batch_id % 10 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
if event.batch_id % 1000 == 0:
if event.batch_id % 400 == 0:
result = trainer.test(reader=reader, feeding=feeding)
print "\nTest with Pass %d, Batch %d, %s" % (event.pass_id, event.batch_id, result.metrics)
......@@ -501,7 +503,7 @@ def event_handler(event):
trainer.train(
reader=reader,
event_handler=event_handler,
num_passes=10000,
num_passes=1,
feeding=feeding)
```
......
......@@ -41,9 +41,9 @@ Let's consider an example of Chinese-to-English translation. The model is given
```
After training and with a beam-search size of 3, the generated translations are as follows:
```text
0 -5.36816 these are signs of hope and relief . <e>
1 -6.23177 these are the light of hope and relief . <e>
2 -7.7914 these are the light of hope and the relief of hope . <e>
0 -5.36816 These are signs of hope and relief . <e>
1 -6.23177 These are the light of hope and relief . <e>
2 -7.7914 These are the light of hope and the relief of hope . <e>
```
- The first column corresponds to the id of the generated sentence; the second column corresponds to the score of the generated sentence (in descending order), where a larger value indicates better quality; the last column corresponds to the generated sentence.
- There are two special tokens: `<e>` denotes the end of a sentence while `<unk>` denotes unknown word, i.e., a word not in the training dictionary.
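To make the score column concrete, here is a minimal beam search sketch in plain Python. It assumes the score of a sentence is the sum of its token log probabilities, which is why scores are negative and larger is better. The `toy_model` distribution is made up, and the function is a simplified illustration rather than the `beam_search` layer used later in this chapter.

```python
import math


def beam_search(next_log_probs, bos, eos, beam_size, max_length):
    """Return up to beam_size finished sequences as (score, tokens) pairs.

    next_log_probs(tokens) -> {token: log probability of that next token}.
    The score of a sequence is the sum of its token log probabilities.
    """
    beams = [(0.0, [bos])]
    finished = []
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((score + logp, tokens + [token]))
        # Keep only the beam_size best partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:beam_size]


def toy_model(tokens):
    # Made-up distribution that depends only on the last token.
    table = {
        '<s>': {'a': 0.6, 'b': 0.4},
        'a': {'<e>': 0.7, 'b': 0.3},
        'b': {'<e>': 0.9, 'a': 0.1},
    }
    return dict((t, math.log(p)) for t, p in table[tokens[-1]].items())


results = beam_search(toy_model, '<s>', '<e>', beam_size=3, max_length=4)
for i, (score, tokens) in enumerate(results):
    print("%d\t%.5f\t%s" % (i, score, ' '.join(tokens[1:])))
```

The three printed rows follow the same layout as the listing above: sentence id, score in descending order, and the generated tokens ending in `<e>`.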
......@@ -94,7 +94,7 @@ Figure 4. Encoder-Decoder Framework
There are three steps for encoding a sentence:
1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\in R^{\left | V \right |},i=1,2,...,T$ where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the location corresponding to the location of the word in the dictionary and zero elsewhere.
1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\in \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$ where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the location corresponding to the location of the word in the dictionary and zero elsewhere.
2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation
......@@ -213,25 +213,8 @@ import paddle.v2 as paddle
# train with a single CPU
paddle.init(use_gpu=False, trainer_count=1)
```
### Define DataSet
We will define dictionary size, and create [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for WMT-14 dataset.
```python
# source and target dict dim.
dict_size = 30000
feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
# False: training, True: generating
is_generating = False
```
### Model Configuration
......@@ -239,15 +222,18 @@ wmt14_reader = paddle.batch(
1. Define some global variables
```python
dict_size = 30000 # dict dim
source_dict_dim = dict_size # source language dictionary size
target_dict_dim = dict_size # destination language dictionary size
word_vector_dim = 512 # word embedding dimension
encoder_size = 512 # hidden layer size of GRU in encoder
decoder_size = 512 # hidden layer size of GRU in decoder
beam_size = 3 # expand width in beam search
max_length = 250 # a stop condition of sequence generation
```
1. Implement Encoder as follows:
1. Input is a sequence of words represented by an integer word index sequence. So we define data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`
2. Implement Encoder as follows:
- Input is a sequence of words represented by an integer word index sequence. So we define data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`
```python
src_word_id = paddle.layer.data(
......@@ -255,7 +241,7 @@ wmt14_reader = paddle.batch(
type=paddle.data_type.integer_value_sequence(source_dict_dim))
```
1. Map the one-hot vector (represented by word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space
- Map the one-hot vector (represented by word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space
```python
src_embedding = paddle.layer.embedding(
......@@ -264,7 +250,7 @@ wmt14_reader = paddle.batch(
param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
```
1. Use a bi-directional GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
- Use a bi-directional GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
```python
src_forward = paddle.networks.simple_gru(
......@@ -274,9 +260,9 @@ wmt14_reader = paddle.batch(
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
```
1. Implement Attention-based Decoder as follows:
3. Implement Attention-based Decoder as follows:
1. Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network
- Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network
```python
with paddle.layer.mixed(size=decoder_size) as encoded_proj:
......@@ -284,7 +270,7 @@ wmt14_reader = paddle.batch(
input=encoded_vector)
```
1. Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
- Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
```python
backward_first = paddle.layer.first_seq(input=src_backward)
......@@ -294,7 +280,7 @@ wmt14_reader = paddle.batch(
input=backward_first)
```
1. Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word.
- Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word.
- decoder_mem records the hidden state $z_i$ from the previous time step, with an initial state as decoder_boot.
- context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`.
......@@ -332,7 +318,7 @@ wmt14_reader = paddle.batch(
return out
```
1. Define the name for the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
4. Define the name for the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
```python
decoder_group_name = "decoder_group"
......@@ -341,7 +327,7 @@ wmt14_reader = paddle.batch(
group_inputs = [group_input1, group_input2]
```
1. Training mode:
5. Training mode:
- word embedding from the target language trg_embedding is passed to `gru_decoder_with_attention` as current_word.
- `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way
......@@ -349,6 +335,7 @@ wmt14_reader = paddle.batch(
- multi-class cross-entropy (`classification_cost`) is used to calculate the cost
```python
if not is_generating:
trg_embedding = paddle.layer.embedding(
input=paddle.layer.data(
name='target_language_word',
......@@ -373,30 +360,70 @@ wmt14_reader = paddle.batch(
cost = paddle.layer.classification_cost(input=decoder, label=lbl)
```
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
6. Generating mode:
### Create Parameters
- The decoder predicts the next target word based on the last generated target word. The embedding of the last generated word is obtained automatically by GeneratedInputs.
- `beam_search` calls `gru_decoder_with_attention` in a recurrent way, to predict sequence id.
Create every parameter that `cost` layer needs.
```python
if is_generating:
# In generation, the decoder predicts a next target word based on
# the encoded source sequence and the last generated target word.
```python
parameters = paddle.parameters.create(cost)
```
# The encoded source sequence (encoder's output) must be specified by
# StaticInput, which is a read-only memory.
# Embedding of the last generated word is automatically gotten by
# GeneratedInputs, which is initialized by a start mark, such as <s>,
# and must be included in generation.
The parameter names can then be listed. A parameter whose name is not specified during model configuration gets a generated name.
trg_embedding = paddle.layer.GeneratedInputV2(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
```python
for param in parameters.keys():
print param
```
beam_gen = paddle.layer.beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
```
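To show what a beam search loop does conceptually, here is a toy sketch in plain Python. It is not the `paddle.layer.beam_search` implementation: the step function `toy_step`, its fixed probability table, and the vocabulary ids (0 for the start mark, 1 for the end mark) are assumptions made for illustration.

```python
import math


def toy_step(prefix):
    """Hypothetical decoder step: log-probabilities of the next word id."""
    if len(prefix) >= 3:
        probs = {1: 0.9, 2: 0.05, 3: 0.05}   # strongly prefer the end mark
    else:
        probs = {1: 0.1, 2: 0.6, 3: 0.3}
    return {w: math.log(p) for w, p in probs.items()}


def beam_search(step, bos_id, eos_id, beam_size, max_length):
    """Keep the beam_size best partial sequences at every time step."""
    beams = [([bos_id], 0.0)]
    finished = []
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            for word, logp in step(seq).items():
                candidates.append((seq + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos_id:
                finished.append((seq, score))   # sequence ended with the end mark
            else:
                beams.append((seq, score))      # sequence is expanded further
        if not beams:
            break
    finished.extend(beams)  # sequences cut off by max_length
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]


results = beam_search(toy_step, bos_id=0, eos_id=1, beam_size=3, max_length=5)
for seq, score in results:
    print("prob = %f: %s" % (score, seq))
```

As in the generation log of this tutorial, the printed scores are log-probabilities, so they are negative and a value closer to zero is better.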
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
## Model Training
1. Create trainer
1. Create Parameters
Create every parameter that the `cost` layer needs, then print the parameter names. A parameter whose name is not specified during model configuration gets a generated name.
```python
if not is_generating:
parameters = paddle.parameters.create(cost)
for param in parameters.keys():
print param
```
2. Define DataSet
Create [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for WMT-14 dataset.
```python
if not is_generating:
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
```
3. Create trainer
We need to tell the trainer what to optimize and how to optimize it. Here the trainer optimizes the `cost` layer with the Adam optimizer, a variant of stochastic gradient descent (SGD).
```python
if not is_generating:
optimizer = paddle.optimizer.Adam(
learning_rate=5e-5,
regularization=paddle.optimizer.L2Regularization(rate=8e-4))
......@@ -405,68 +432,108 @@ for param in parameters.keys():
update_equation=optimizer)
```
1. Define event handler
4. Define event handler
The event handler is a callback function that the trainer invokes when an event happens. Here the handler prints a log entry.
```python
if not is_generating:
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0:
if event.batch_id % 2 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
```
1. Start training
5. Start training
```python
if not is_generating:
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=2,
feeding=feeding)
reader=wmt14_reader, event_handler=event_handler, num_passes=2)
```
The training log is as follows:
```text
Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
...
```
The model training is successful when the `classification_error_evaluator` is lower than 0.35.
## Model Usage
### Download Pre-trained Model
1. Download Pre-trained Model
As training an NMT model is very time-consuming, we provide a pre-trained model (pass-00012, ~205 MB). The model was trained on a cluster of 50 physical nodes (each node has two 6-core CPUs). We trained 16 passes (taking about 5 days), with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score), 26.92. Run the following command to download the model:
As training an NMT model is very time-consuming, we provide a pre-trained model. The model was trained on a cluster of 50 physical nodes (each node has two 6-core CPUs) over 5 days. The provided model has a [BLEU Score](#BLEU Score) of 26.92 and a size of 205 MB.
```bash
cd pretrained
./wmt14_model.sh
```
```python
if is_generating:
parameters = paddle.dataset.wmt14.model()
```
2. Define DataSet
### BLEU Evaluation
Take the first 3 samples of the WMT-14 generation set as the source language sequences.
BLEU (Bilingual Evaluation Understudy) is a widely used metric for automatic evaluation of machine translation, proposed by the IBM Watson Research Center in 2002\[[5](#References)\]. The closer the translation produced by a machine is to the translation produced by a human expert, the better the performance of the translation system.
To measure the closeness between a machine translation and a human translation, sentence precision is used: it counts the n-grams that the two translations have in common. More matches lead to a higher BLEU score.
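The n-gram matching described here can be sketched as follows. This is a minimal illustration of clipped n-gram precision for a single sentence pair, not the full BLEU computation in `multi-bleu.perl`, which also combines several n-gram orders and applies a brevity penalty. The function name `ngram_precision` and the choice of reference sentence are assumptions made for illustration.

```python
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / float(total)


reference = "these are signs of hope and relief ."
candidate = "these are the light of hope and relief ."
p1 = ngram_precision(candidate, reference, 1)
p2 = ngram_precision(candidate, reference, 2)
print("1-gram precision: %.3f" % p1)
print("2-gram precision: %.3f" % p2)
```

Here 7 of the 9 candidate words and 5 of the 8 candidate bigrams also occur in the reference, so the longer n-grams give the stricter score.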
```python
if is_generating:
gen_creator = paddle.dataset.wmt14.gen(dict_size)
gen_data = []
gen_num = 3
for item in gen_creator():
gen_data.append((item[0], ))
if len(gen_data) == gen_num:
break
```
[Moses](http://www.statmt.org/moses/) is an open-source machine translation system; we use its [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) script for BLEU evaluation. Run the following command to download it:
```bash
./moses_bleu.sh
```
BLEU evaluation can be performed using the `eval_bleu` script as follows, where FILE is the name of the file to be evaluated and BEAMSIZE is the beam size; `data/wmt14/gen/ntst14.trg` is used as the reference translation by default.
```bash
./eval_bleu.sh FILE BEAMSIZE
```
Specifically, the script is run as follows:
```bash
./eval_bleu.sh gen_result 3
```
You will see the following message as output:
```text
BLEU = 26.92
```
3. Create infer
Use the inference interface `paddle.infer` to return the prediction probability (field `prob`) and the labels (field `id`) of each generated sequence.
```python
if is_generating:
beam_result = paddle.infer(
output_layer=beam_gen,
parameters=parameters,
input=gen_data,
field=['prob', 'id'])
```
4. Print generated translation
Print each source sequence and its `beam_size` generated translations, using the dictionaries to map ids back to words.
```python
if is_generating:
# get the dictionary
src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
# the delimited element of generated sequences is -1,
# the first element of each generated sequence is the sequence length
seq_list = []
seq = []
for w in beam_result[1]:
if w != -1:
seq.append(w)
else:
seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
seq = []
prob = beam_result[0]
for i in xrange(gen_num):
print "\n*******************************************************\n"
print "src:", ' '.join(
[src_dict.get(w) for w in gen_data[i][0]]), "\n"
for j in xrange(beam_size):
print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
```
The generating log is as follows:
```text
src: <s> Les <unk> se <unk> au sujet de la largeur des sièges alors que de grosses commandes sont en jeu <e>
prob = -19.019573: The <unk> will be rotated about the width of the seats , while large orders are at stake . <e>
prob = -19.113066: The <unk> will be rotated about the width of the seats , while large commands are at stake . <e>
prob = -19.512890: The <unk> will be rotated about the width of the seats , while large commands are at play . <e>
```
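The post-processing in step 4 can be tried without a trained model. The sketch below uses a made-up flat id list and a toy dictionary (both are assumptions, as is the helper name `split_beam_ids`). It follows the layout described in the code comments: sequences are delimited by -1, and the first element of each sequence is its length.

```python
def split_beam_ids(flat_ids, id_to_word):
    """Split a flat id list on the -1 delimiter and map ids to words."""
    sentences = []
    seq = []
    for w in flat_ids:
        if w != -1:
            seq.append(w)
        else:
            # seq[0] holds the sequence length, so it is skipped
            sentences.append(' '.join(id_to_word.get(i, '<unk>') for i in seq[1:]))
            seq = []
    return sentences


toy_dict = {0: '<s>', 1: '<e>', 2: 'hello', 3: 'world'}   # hypothetical dictionary
toy_ids = [4, 0, 2, 3, 1, -1, 3, 0, 2, 1, -1]              # hypothetical beam output
sentences = split_beam_ids(toy_ids, toy_dict)
for s in sentences:
    print(s)
```

Unlike the tutorial code, this sketch falls back to `<unk>` for ids missing from the dictionary, which is a small addition to keep the toy example robust.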
## Summary
......
......@@ -26,9 +26,9 @@
```
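If the number of translations to display (i.e., the width of the [beam search algorithm](#柱搜索算法)) is set to 3, the generated English sentences are as follows: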
```text
0 -5.36816 these are signs of hope and relief . <e>
1 -6.23177 these are the light of hope and relief . <e>
2 -7.7914 these are the light of hope and the relief of hope . <e>
0 -5.36816 These are signs of hope and relief . <e>
1 -6.23177 These are the light of hope and relief . <e>
2 -7.7914 These are the light of hope and the relief of hope . <e>
```
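- The first column from the left is the index of the generated sentence; the second column is the score of that sentence (in descending order), where a higher score is better; the third column is the generated English sentence.
- There are also two special tokens: `<e>` marks the end of a sentence, and `<unk>` marks an unknown word, i.e., a word that does not appear in the training dictionary.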
......@@ -74,7 +74,7 @@ GRU\[[2](#参考文献)\]是Cho等人在LSTM上提出的简化版本,也是RNN
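The encoding stage has three steps:
1. One-hot vector representation: each word $x_i$ of the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a column vector $w_i\in R^{\left | V \right |},i=1,2,...,T$. The dimensionality of $w_i$ equals the vocabulary size $\left | V \right |$; exactly one dimension has the value 1 (the position of the word in the vocabulary) and all others are 0.
1. One-hot vector representation: each word $x_i$ of the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a column vector $w_i\in \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$. The dimensionality of $w_i$ equals the vocabulary size $\left | V \right |$; exactly one dimension has the value 1 (the position of the word in the vocabulary) and all others are 0.
2. Word embedding in a low-dimensional semantic space: the one-hot representation has two problems: 1) the resulting vectors are usually very high-dimensional, which easily leads to the curse of dimensionality; 2) it is hard to capture relationships between words (such as semantic similarity), so it cannot express semantics well. The one-hot vector is therefore mapped into a low-dimensional semantic space and represented by a dense vector of fixed dimensionality (called a word embedding). Let the mapping matrix be $C\in R^{K\times \left | V \right |}$; then $s_i=Cw_i$ is the word embedding of the $i$-th word, and $K$ is the embedding dimensionality.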
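As a small worked illustration of $s_i=Cw_i$: multiplying the mapping matrix by a one-hot vector simply selects one column of $C$. The sketch below uses plain Python lists and made-up sizes ($|V|=4$, $K=2$); the matrix values and helper names are assumptions chosen for illustration.

```python
def one_hot(index, vocab_size):
    """Column vector w_i with a single 1 at the word's dictionary position."""
    return [1 if k == index else 0 for k in range(vocab_size)]


def mat_vec(matrix, vector):
    """Plain matrix-vector product for a K x |V| matrix."""
    return [sum(c * w for c, w in zip(row, vector)) for row in matrix]


# hypothetical K x |V| mapping matrix C with K = 2 and |V| = 4
C = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8]]

w = one_hot(2, 4)                  # the word at dictionary position 2
s = mat_vec(C, w)                  # word embedding s = C w
column = [row[2] for row in C]     # column 2 of C
print(s == column)
```

This is why an embedding layer can be implemented as a table lookup by word index instead of an explicit matrix product.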
......@@ -167,7 +167,7 @@ e_{ij}&=align(z_i,h_j)\\\\
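This dataset contains 193319 training samples and 6003 test samples, with a dictionary size of 30000. Because of the limited data size, the quality of models trained on this dataset cannot be guaranteed.
## Training Process Overview
## Process Overview
### Initialize Paddle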
......@@ -178,48 +178,34 @@ import paddle.v2 as paddle
# use CPU only, and train with a single CPU
paddle.init(use_gpu=False, trainer_count=1)
```
### Data Definition
First define the dictionary size, which both data generation and network configuration need. Then obtain the dataset reader for wmt14.
```python
# source and target dict dim.
dict_size = 30000
feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
# False: training mode, True: generating mode
is_generating = False
```
### Model Structure
1. First, define some global variables.
```python
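dict_size = 30000 # dictionary size
source_dict_dim = dict_size # source language dictionary size
target_dict_dim = dict_size # target language dictionary size
word_vector_dim = 512 # word embedding dimension
encoder_size = 512 # hidden layer size of the GRU in the encoder
decoder_size = 512 # hidden layer size of the GRU in the decoder
beam_size = 3 # beam width
max_length = 250 # maximum length of a generated sentence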
```
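1. Next, implement the encoder framework, in three steps:
2. Next, implement the encoder framework, in three steps:
1 The input is a word sequence represented as a sequence of integers, where each element is the index of a word in the dictionary. We therefore define the data layer with data type `integer_value_sequence` (integer sequence); each element of the sequence lies in the range `[0, source_dict_dim)`.
- The input is a word sequence represented as a sequence of integers, where each element is the index of a word in the dictionary. We therefore define the data layer with data type `integer_value_sequence` (integer sequence); each element of the sequence lies in the range `[0, source_dict_dim)`.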
```python
src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
```
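1. Map the encoding above to a word embedding $\mathbf{s}$ in a low-dimensional semantic space.
- Map the encoding above to a word embedding $\mathbf{s}$ in a low-dimensional semantic space.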
```python
src_embedding = paddle.layer.embedding(
......@@ -227,7 +213,7 @@ wmt14_reader = paddle.batch(
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
```
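1. Encode the source language sequence with a bidirectional GRU, and concatenate the outputs of the two GRUs to obtain $\mathbf{h}$.
- Encode the source language sequence with a bidirectional GRU, and concatenate the outputs of the two GRUs to obtain $\mathbf{h}$.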
```python
src_forward = paddle.networks.simple_gru(
......@@ -237,9 +223,9 @@ wmt14_reader = paddle.batch(
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
```
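1. Then define the attention-based decoder framework, in three steps:
3. Then define the attention-based decoder framework, in three steps:
1. Pass the encoded source language sequence (see 2.3) through a feed-forward neural network to obtain its projection.
- Pass the encoded source language sequence (see the last step of 2) through a feed-forward neural network to obtain its projection.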
```python
with paddle.layer.mixed(size=decoder_size) as encoded_proj:
......@@ -247,7 +233,7 @@ wmt14_reader = paddle.batch(
input=encoded_vector)
```
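1. Construct the initial state of the decoder RNN. The decoder has to predict the target sequence step by step, but there is no initial value at time step 0, so it needs to be initialized. Here the last state of the backward encoding of the source language sequence is passed through a non-linear mapping and used as the initial value, i.e., $c_0=h_T$.
- Construct the initial state of the decoder RNN. The decoder has to predict the target sequence step by step, but there is no initial value at time step 0, so it needs to be initialized. Here the last state of the backward encoding of the source language sequence is passed through a non-linear mapping and used as the initial value, i.e., $c_0=h_T$.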
```python
backward_first = paddle.layer.first_seq(input=src_backward)
......@@ -257,7 +243,7 @@ wmt14_reader = paddle.batch(
input=backward_first)
```
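1. Define the RNN behavior at each decoding time step: predict the probability $p_{i+1}$ of the $(i+1)$-th word from the current source language context vector $c_i$, the decoder hidden state $z_i$, and the $i$-th target language word $u_i$.
- Define the RNN behavior at each decoding time step: predict the probability $p_{i+1}$ of the $(i+1)$-th word from the current source language context vector $c_i$, the decoder hidden state $z_i$, and the $i$-th target language word $u_i$.
- decoder_mem records the hidden state $z_i$ of the previous time step; its initial state is decoder_boot.
- context calls `simple_attention` to implement $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (see 3.1); the computation of the weights $a_{ij}$ is encapsulated in `simple_attention`.
- decoder_inputs combines $c_i$ with the representation of the current target word current_word (i.e., $u_i$).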
......@@ -294,7 +280,7 @@ wmt14_reader = paddle.batch(
return out
```
1. Define the name of the decoder framework and the first two inputs of `gru_decoder_with_attention`. Note: both inputs use `StaticInput`; see the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
4. Define the name of the decoder framework and the first two inputs of `gru_decoder_with_attention`. Note: both inputs use `StaticInput`; see the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
```python
decoder_group_name = "decoder_group"
......@@ -303,7 +289,7 @@ wmt14_reader = paddle.batch(
group_inputs = [group_input1, group_input2]
```
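1. Decoder invocation in training mode:
5. Decoder invocation in training mode:
- First, the word embedding of the target language sequence, trg_embedding, is passed directly to `gru_decoder_with_attention` as current_word in training mode.
- Second, `recurrent_group` calls `gru_decoder_with_attention` recurrently.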
......@@ -311,6 +297,7 @@ wmt14_reader = paddle.batch(
- Finally, the multi-class cross-entropy loss `classification_cost` computes the cost.
```python
if not is_generating:
trg_embedding = paddle.layer.embedding(
input=paddle.layer.data(
name='target_language_word',
......@@ -335,30 +322,71 @@ wmt14_reader = paddle.batch(
cost = paddle.layer.classification_cost(input=decoder, label=lbl)
```
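Note: the configuration we provide simplifies the one in the paper by Bahdanau et al. \[[4](#参考文献)\]; see [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133).
6. Decoder invocation in generating mode:
### Parameter Definition
- First, in a sequence generation task the decoder RNN always takes the embedding of the word generated at the previous time step as the input of the current time step, so `GeneratedInput` is used to do this automatically; see the [GeneratedInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
- Second, `beam_search` calls `gru_decoder_with_attention` recurrently to generate the sequence ids.
First, define the model parameters according to the `cost` of the model configuration.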
```python
if is_generating:
# In generation, the decoder predicts a next target word based on
# the encoded source sequence and the last generated target word.
```python
parameters = paddle.parameters.create(cost)
```
# The encoded source sequence (encoder's output) must be specified by
# StaticInput, which is a read-only memory.
# Embedding of the last generated word is automatically gotten by
# GeneratedInputs, which is initialized by a start mark, such as <s>,
# and must be included in generation.
The parameter names can be printed; a name that is not specified in the network configuration is generated by default.
trg_embedding = paddle.layer.GeneratedInputV2(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
```python
for param in parameters.keys():
print param
```
beam_gen = paddle.layer.beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
```
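Note: the configuration we provide simplifies the one in the paper by Bahdanau et al. \[[4](#参考文献)\]; see [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133).
### Train the Model
1. Create the trainer
1. Parameter definition
Define the model parameters according to the `cost` of the model configuration. The parameter names can be printed; a name that is not specified in the network configuration is generated by default.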
```python
if not is_generating:
parameters = paddle.parameters.create(cost)
for param in parameters.keys():
print param
```
2. Data definition
Obtain the dataset reader for wmt14.
```python
if not is_generating:
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
```
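3. Create the trainer
Create the trainer from the optimization target cost, the network topology, and the model parameters. An optimization method must also be specified at construction time; the code here uses Adam, a variant of SGD.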
```python
if not is_generating:
optimizer = paddle.optimizer.Adam(
learning_rate=5e-5,
regularization=paddle.optimizer.L2Regularization(rate=8e-4))
......@@ -367,73 +395,109 @@ for param in parameters.keys():
update_equation=optimizer)
```
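1. Define the event_handler
4. Define the event_handler
A custom callback function can be used to monitor various states during training, such as the error rate. The code below uses event.batch_id % 10 == 0 to print a log entry, including the cost and other information, every 10 batches.
A custom callback function can be used to monitor various states during training, such as the error rate. The code below uses event.batch_id % 2 == 0 to print a log entry, including the cost and other information, every 2 batches.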
```python
if not is_generating:
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0:
if event.batch_id % 2 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
```
1. Start training:
5. Start training
```python
if not is_generating:
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=2,
feeding=feeding)
reader=wmt14_reader, event_handler=event_handler, num_passes=2)
```
After training starts, the event_handler prints a log like the following:
After training starts, the event_handler prints a log like the following:
```text
Pass 0, Batch 0, Cost 148.444983, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 335.896802, {'classification_error_evaluator': 0.9325153231620789}
.........
```
```text
Pass 0, Batch 0, Cost 148.444983, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 335.896802, {'classification_error_evaluator': 0.9325153231620789}
.........
```
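### Generate with the Model
Training is considered successful when `classification_error_evaluator` drops below 0.35.
1. Load the pre-trained model
## Apply the Model
Because training an NMT model is very time-consuming, we trained a model for 5 days on a cluster of 50 physical nodes (each with two 6-core CPUs) and provide it for direct download. The model is 205 MB in size and has a [BLEU score](#BLEU评估) of 26.92.
### Download the Pre-trained Model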
```python
if is_generating:
parameters = paddle.dataset.wmt14.model()
```
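2. Data definition
Because training an NMT model is very time-consuming, we trained 16 passes over 5 days on a cluster of 50 physical nodes (each with two 6-core CPUs), with each pass taking 7 hours. We therefore provide a pre-trained model (pass-00012) for direct download. The model is 205 MB in size and has the highest [BLEU score](#BLEU评估) among all 16 models, 26.92. Run the following commands to download and extract the model:
Read the first 3 samples of the wmt14 generation set as the source language sentences.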
```bash
cd pretrained
./wmt14_model.sh
```
```python
if is_generating:
gen_creator = paddle.dataset.wmt14.gen(dict_size)
gen_data = []
gen_num = 3
for item in gen_creator():
gen_data.append((item[0], ))
if len(gen_data) == gen_num:
break
```
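3. Create infer
Create infer from the network topology and the model parameters for generation. The output fields `field` must also be specified at prediction time; here we use the probability `prob` of each generated sentence and the `id` of each word in the sentence.
### BLEU Evaluation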
```python
if is_generating:
beam_result = paddle.infer(
output_layer=beam_gen,
parameters=parameters,
input=gen_data,
field=['prob', 'id'])
```
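BLEU (Bilingual Evaluation Understudy) is a widely used automatic evaluation metric for machine translation, proposed by the IBM Watson Research Center in 2002\[[5](#参考文献)\]. The basic idea is that the closer a machine translation is to the result of a professional human translator, the better the translation system performs. The closeness between the machine translation and the human reference translation is measured by sentence precision, i.e., by counting the matching n-grams between the two; the more matches, the higher the BLEU score.
4. Print the generated results
[Moses](http://www.statmt.org/moses/) is an open-source statistical machine translation system; we use its [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) script for BLEU evaluation. Run the following command to download the script: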
```bash
./moses_bleu.sh
```
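BLEU evaluation can be run with the `eval_bleu` script as follows, where FILE is the name of the file to evaluate and BEAMSIZE is the beam width; `data/wmt14/gen/ntst14.trg` is used as the reference translation by default.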
```bash
./eval_bleu.sh FILE BEAMSIZE
```
The concrete command for this tutorial is:
```bash
./eval_bleu.sh gen_result 3
```
You will see the following on the screen:
```text
BLEU = 26.92
```
Using the source and target language dictionaries, print each source sentence and its `beam_size` generated sentences.
```python
if is_generating:
# get the dictionary
src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
# the delimited element of generated sequences is -1,
# the first element of each generated sequence is the sequence length
seq_list = []
seq = []
for w in beam_result[1]:
if w != -1:
seq.append(w)
else:
seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
seq = []
prob = beam_result[0]
for i in xrange(gen_num):
print "\n*******************************************************\n"
print "src:", ' '.join(
[src_dict.get(w) for w in gen_data[i][0]]), "\n"
for j in xrange(beam_size):
print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
```
After generation starts, the output log looks like the following:
```text
src: <s> Les <unk> se <unk> au sujet de la largeur des sièges alors que de grosses commandes sont en jeu <e>
prob = -19.019573: The <unk> will be rotated about the width of the seats , while large orders are at stake . <e>
prob = -19.113066: The <unk> will be rotated about the width of the seats , while large commands are at stake . <e>
prob = -19.512890: The <unk> will be rotated about the width of the seats , while large commands are at play . <e>
```
## Summary
......
#!/bin/bash
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e
set -x
mkdir wmt14
cd wmt14
# download the dataset
wget http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz
wget http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz
# untar the dataset
tar -zxvf bitexts.tgz
tar -zxvf dev+test.tgz
gunzip bitexts.selected/*
mv bitexts.selected train
rm bitexts.tgz
rm dev+test.tgz
# separate the dev and test dataset
mkdir test gen
mv dev/ntst1213.* test
mv dev/ntst14.* gen
rm -rf dev
set +x
# rename the suffix, .fr->.src, .en->.trg
for dir in train test gen
do
filelist=`ls $dir`
cd $dir
for file in $filelist
do
if [ ${file##*.} = "fr" ]; then
mv $file ${file/%fr/src}
elif [ ${file##*.} = 'en' ]; then
mv $file ${file/%en/trg}
fi
done
cd ..
done
#!/bin/bash
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e
gen_file=$1
beam_size=$2
# find top1 generating result
top1=$(printf '%s_top1.txt' `basename $gen_file .txt`)
if [ $beam_size -eq 1 ]; then
awk -F "\t" '{sub(" <e>","",$2);sub(" ","",$2);print $2}' $gen_file >$top1
else
awk 'BEGIN{
FS="\t";
OFS="\t";
read_pos = 2} {
if (NR == read_pos){
sub(" <e>","",$3);
sub(" ","",$3);
print $3;
read_pos += (2 + res_num);
}}' res_num=$beam_size $gen_file >$top1
fi
# evaluate the BLEU value
bleu_script=multi-bleu.perl
standard_res=data/wmt14/gen/ntst14.trg
bleu_res=`perl $bleu_script $standard_res <$top1`
echo $bleu_res | cut -d, -f 1
rm $top1
......@@ -83,9 +83,9 @@ Let's consider an example of Chinese-to-English translation. The model is given
```
After training and with a beam-search size of 3, the generated translations are as follows:
```text
0 -5.36816 these are signs of hope and relief . <e>
1 -6.23177 these are the light of hope and relief . <e>
2 -7.7914 these are the light of hope and the relief of hope . <e>
0 -5.36816 These are signs of hope and relief . <e>
1 -6.23177 These are the light of hope and relief . <e>
2 -7.7914 These are the light of hope and the relief of hope . <e>
```
- The first column corresponds to the id of the generated sentence; the second column corresponds to the score of the generated sentence (in descending order), where a larger value indicates better quality; the last column corresponds to the generated sentence.
- There are two special tokens: `<e>` denotes the end of a sentence while `<unk>` denotes unknown word, i.e., a word not in the training dictionary.
......@@ -136,7 +136,7 @@ Figure 4. Encoder-Decoder Framework
There are three steps for encoding a sentence:
1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\in R^{\left | V \right |},i=1,2,...,T$ where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the location corresponding to the location of the word in the dictionary and zero elsewhere.
1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\in \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$ where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the location corresponding to the location of the word in the dictionary and zero elsewhere.
2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation
......@@ -255,25 +255,8 @@ import paddle.v2 as paddle
# train with a single CPU
paddle.init(use_gpu=False, trainer_count=1)
```
### Define DataSet
We will define dictionary size, and create [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for WMT-14 dataset.
```python
# source and target dict dim.
dict_size = 30000
feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
# False: training, True: generating
is_generating = False
```
### Model Configuration
......@@ -281,15 +264,18 @@ wmt14_reader = paddle.batch(
1. Define some global variables
```python
dict_size = 30000 # dict dim
source_dict_dim = dict_size # source language dictionary size
target_dict_dim = dict_size # destination language dictionary size
word_vector_dim = 512 # word embedding dimension
encoder_size = 512 # hidden layer size of GRU in encoder
decoder_size = 512 # hidden layer size of GRU in decoder
beam_size = 3 # expand width in beam search
max_length = 250 # a stop condition of sequence generation
```
1. Implement Encoder as follows:
1. Input is a sequence of words represented by an integer word index sequence. So we define data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`
2. Implement Encoder as follows:
- Input is a sequence of words represented by an integer word index sequence. So we define data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`
```python
src_word_id = paddle.layer.data(
......@@ -297,7 +283,7 @@ wmt14_reader = paddle.batch(
type=paddle.data_type.integer_value_sequence(source_dict_dim))
```
1. Map the one-hot vector (represented by word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space
- Map the one-hot vector (represented by word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space
```python
src_embedding = paddle.layer.embedding(
......@@ -306,7 +292,7 @@ wmt14_reader = paddle.batch(
param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
```
1. Use a bidirectional GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
- Use a bidirectional GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
```python
src_forward = paddle.networks.simple_gru(
......@@ -316,9 +302,9 @@ wmt14_reader = paddle.batch(
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
```
1. Implement Attention-based Decoder as follows:
3. Implement Attention-based Decoder as follows:
1. Get a projection of the encoding (cf. 2.3) of the source language sequence by passing it through a feed-forward neural network
- Get a projection of the encoding (cf. the last step of 2) of the source language sequence by passing it through a feed-forward neural network
```python
with paddle.layer.mixed(size=decoder_size) as encoded_proj:
......@@ -326,7 +312,7 @@ wmt14_reader = paddle.batch(
input=encoded_vector)
```
1. Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
- Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
```python
backward_first = paddle.layer.first_seq(input=src_backward)
......@@ -336,7 +322,7 @@ wmt14_reader = paddle.batch(
input=backward_first)
```
1. Define the computation at each time step of the decoder RNN, i.e., predict the probability $p_{i+1}$ of the $(i+1)$-th word from the current context vector $c_i$, the decoder hidden state $z_i$, and the $i$-th word $u_i$ of the target language.
- Define the computation at each time step of the decoder RNN, i.e., predict the probability $p_{i+1}$ of the $(i+1)$-th word from the current context vector $c_i$, the decoder hidden state $z_i$, and the $i$-th word $u_i$ of the target language.
- decoder_mem records the hidden state $z_i$ from the previous time step, with decoder_boot as its initial state.
- context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (cf. 3.1). $a_{ij}$ is calculated within `simple_attention`.
......@@ -374,7 +360,7 @@ wmt14_reader = paddle.batch(
return out
```
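To make the context computation concrete, the following sketch evaluates $c_i=\sum_{j=1}^{T}a_{ij}h_j$ in plain Python. It assumes the alignment scores $e_{ij}$ are already given and normalizes them with a softmax; how `simple_attention` computes the scores internally is not shown, and all names and numbers are made up for illustration.

```python
import math


def softmax(scores):
    """Normalize alignment scores e_ij into attention weights a_ij."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores, encoder_states):
    """Weighted sum of encoder states h_j with weights a_ij."""
    weights = softmax(scores)
    dim = len(encoder_states[0])
    return [sum(a * h[d] for a, h in zip(weights, encoder_states))
            for d in range(dim)]


# hypothetical encoder states h_1..h_3 (dimension 2) and alignment scores
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e = [0.0, 0.0, 0.0]          # equal scores give equal weights of 1/3
c = context_vector(e, h)
print(c)
```

With equal scores the context is the plain average of the encoder states; raising one score shifts the context toward the corresponding source position.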
1. Define the name of the decoder and the first two inputs of `gru_decoder_with_attention`. Note that both inputs use `StaticInput`. Please refer to the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
4. Define the name of the decoder and the first two inputs of `gru_decoder_with_attention`. Note that both inputs use `StaticInput`. Please refer to the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
```python
decoder_group_name = "decoder_group"
......@@ -383,7 +369,7 @@ wmt14_reader = paddle.batch(
group_inputs = [group_input1, group_input2]
```
1. Training mode:
5. Training mode:
- word embedding from the target language trg_embedding is passed to `gru_decoder_with_attention` as current_word.
- `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way
......@@ -391,6 +377,7 @@ wmt14_reader = paddle.batch(
- multi-class cross-entropy (`classification_cost`) is used to calculate the cost
```python
if not is_generating:
trg_embedding = paddle.layer.embedding(
input=paddle.layer.data(
name='target_language_word',
......@@ -415,30 +402,70 @@ wmt14_reader = paddle.batch(
cost = paddle.layer.classification_cost(input=decoder, label=lbl)
```
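As a small numeric illustration of the multi-class cross-entropy used as the training cost: for a single time step, the cost is the negative log-probability that the decoder assigns to the correct next word. The sketch below is plain Python with a made-up probability vector; it is not the `classification_cost` layer itself, and the helper name is an assumption.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability of the correct class."""
    return -math.log(probs[label])


# hypothetical softmax output over a 4-word vocabulary
probs = [0.1, 0.7, 0.1, 0.1]
good = cross_entropy(probs, 1)   # the correct word has high probability
bad = cross_entropy(probs, 0)    # the correct word has low probability
print("cost when likely: %.4f, cost when unlikely: %.4f" % (good, bad))
```

The cost shrinks toward zero as the probability of the correct word approaches one, which is what the optimizer drives the decoder toward.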
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
6. Generating mode:
### Create Parameters
- The decoder predicts the next target word based on the last generated target word. The embedding of the last generated word is obtained automatically through `GeneratedInput`.
- `beam_search` calls `gru_decoder_with_attention` recurrently to predict the sequence ids.
Create every parameter that the `cost` layer needs.
```python
if is_generating:
# In generation, the decoder predicts a next target word based on
# the encoded source sequence and the last generated target word.
```python
parameters = paddle.parameters.create(cost)
```
# The encoded source sequence (encoder's output) must be specified by
# StaticInput, which is a read-only memory.
# Embedding of the last generated word is automatically gotten by
# GeneratedInputs, which is initialized by a start mark, such as <s>,
# and must be included in generation.
The parameter names can then be listed. A parameter whose name is not specified during model configuration gets a generated name.
trg_embedding = paddle.layer.GeneratedInputV2(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
```python
for param in parameters.keys():
print param
```
beam_gen = paddle.layer.beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
```
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
## Model Training
1. Create trainer
1. Create Parameters
Create every parameter that the `cost` layer needs, then print the parameter names. A parameter whose name is not specified during model configuration gets a generated name.
```python
if not is_generating:
parameters = paddle.parameters.create(cost)
for param in parameters.keys():
print param
```
2. Define DataSet
Create [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for WMT-14 dataset.
```python
if not is_generating:
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
```
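To make the composition of `paddle.reader.shuffle` and `paddle.batch` concrete, here is a simplified sketch of the reader-decorator pattern in plain Python. The functions `shuffle_reader`, `batch_reader`, and `toy_reader` are hypothetical re-implementations written for illustration only; the real PaddlePaddle functions may differ in details such as how a final partial batch is handled.

```python
import random


def shuffle_reader(reader, buf_size, seed=None):
    """Return a reader that shuffles samples within a buffer of buf_size."""
    def shuffled():
        rng = random.Random(seed)
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the buffer.
        rng.shuffle(buf)
        for item in buf:
            yield item
    return shuffled


def batch_reader(reader, batch_size):
    """Return a reader that groups samples into lists of batch_size."""
    def batched():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # assumption: keep the final partial batch
            yield current
    return batched


def toy_reader():
    # Hypothetical dataset of 12 one-field samples.
    for i in range(12):
        yield (i,)


toy_batches = list(
    batch_reader(shuffle_reader(toy_reader, buf_size=8, seed=0), batch_size=5)())
```

A reader here is just a function returning an iterator over samples, so decorators can be stacked in any order without loading the whole dataset into memory.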
3. Create trainer
We need to tell the trainer what to optimize and how to optimize it. Here the trainer optimizes the `cost` layer using stochastic gradient descent (SGD), with `paddle.optimizer.Adam` supplying the update rule.
```python
if not is_generating:
optimizer = paddle.optimizer.Adam(
learning_rate=5e-5,
regularization=paddle.optimizer.L2Regularization(rate=8e-4))
......@@ -447,68 +474,108 @@ for param in parameters.keys():
update_equation=optimizer)
```
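For readers unfamiliar with the update rule, the following sketch shows a single Adam step on one scalar parameter in plain Python. It is an illustration under stated assumptions, not PaddlePaddle's code: `adam_step` is a hypothetical helper, the `beta1`, `beta2`, and `epsilon` values are the commonly used defaults from the Adam paper, and L2 regularization is assumed to enter as an extra gradient term `l2_rate * param`. Only `learning_rate=5e-5` and the rate `8e-4` come from the configuration above.

```python
import math


def adam_step(param, grad, m, v, t, learning_rate=5e-5, l2_rate=8e-4,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    # Assumption: L2 regularization is added to the gradient.
    grad = grad + l2_rate * param
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v


new_param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one `learning_rate` regardless of the gradient's magnitude.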
1. Define event handler
4. Define event handler
The event handler is a callback function that the trainer invokes whenever an event happens. Here the event handler prints a log message.
```python
if not is_generating:
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0:
if event.batch_id % 2 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
```
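The callback mechanism itself is independent of PaddlePaddle. The sketch below imitates it in plain Python under clearly hypothetical names: `EndIteration` is a stand-in class for `paddle.event.EndIteration`, and `make_event_handler` and the loop that fires events are invented for illustration.

```python
class EndIteration(object):
    """Hypothetical stand-in for paddle.event.EndIteration."""

    def __init__(self, pass_id, batch_id, cost):
        self.pass_id = pass_id
        self.batch_id = batch_id
        self.cost = cost


def make_event_handler(log_every, log):
    """Build a handler that records a line every log_every batches."""
    def event_handler(event):
        if isinstance(event, EndIteration) and event.batch_id % log_every == 0:
            log.append("Pass %d, Batch %d, Cost %f" % (
                event.pass_id, event.batch_id, event.cost))
    return event_handler


log_lines = []
handler = make_event_handler(log_every=2, log=log_lines)
# A fake training loop that fires one event per batch.
for batch_id in range(5):
    handler(EndIteration(pass_id=0, batch_id=batch_id, cost=10.0 - batch_id))
```

Collecting lines in a list instead of printing them keeps the handler easy to test; a real handler would typically print or write to a log file.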
1. Start training
5. Start training
```python
if not is_generating:
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=2,
feeding=feeding)
reader=wmt14_reader, event_handler=event_handler, num_passes=2)
```
The training log is as follows:
```text
Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
...
```
Training is considered successful once `classification_error_evaluator` drops below 0.35.
## Model Usage
### Download Pre-trained Model
1. Download Pre-trained Model
Training an NMT model is very time-consuming, so we provide a pre-trained model (pass-00012, ~205M). The model was trained on a cluster of 50 physical nodes (each node has two 6-core CPUs). We trained 16 passes (taking about 5 days), with each pass taking about 7 hours. Among these 16 passes, the provided model (pass-00012) has the highest [BLEU Score](#BLEU Score), 26.92. Run the following command to download the model:
Training an NMT model is very time-consuming, so we provide a pre-trained model. The model was trained on a cluster of 50 physical nodes (each node has two 6-core CPUs) over 5 days. The provided model has a [BLEU Score](#BLEU Score) of 26.92 and a size of 205M.
```bash
cd pretrained
./wmt14_model.sh
```
```python
if is_generating:
parameters = paddle.dataset.wmt14.model()
```
2. Define DataSet
### BLEU Evaluation
Take the first 3 samples of the WMT-14 generation set as the source language sequences.
BLEU (Bilingual Evaluation Understudy) is a widely used metric for the automatic evaluation of machine translation, proposed by the IBM Watson Research Center in 2002\[[5](#References)\]. The closer a machine translation is to the translation produced by a human expert, the better the translation system performs.
To measure the closeness between a machine translation and a human translation, sentence precision is used: the n-grams that match between the two are counted, and more matches lead to a higher BLEU score.
```python
if is_generating:
gen_creator = paddle.dataset.wmt14.gen(dict_size)
gen_data = []
gen_num = 3
for item in gen_creator():
gen_data.append((item[0], ))
if len(gen_data) == gen_num:
break
```
[Moses](http://www.statmt.org/moses/) is an open-source machine translation system. We use its [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) script for BLEU evaluation. Run the following command to download it:
```bash
./moses_bleu.sh
```
BLEU evaluation can be performed with the `eval_bleu` script as follows, where FILE is the name of the file to be evaluated and BEAMSIZE is the beam size. By default, `data/wmt14/gen/ntst14.trg` is used as the reference translation.
```bash
./eval_bleu.sh FILE BEAMSIZE
```
Specifically, the script is run as follows:
```bash
./eval_bleu.sh gen_result 3
```
You will see the following message as output:
```text
BLEU = 26.92
```
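The n-gram matching that BLEU relies on can be sketched in a few lines of Python. This is only the clipped n-gram precision for a single reference, not the full metric computed by `multi-bleu.perl`: the complete BLEU score also combines several n-gram orders and applies a brevity penalty, both omitted here. The function names and the reference sentence are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return float(matched) / total


candidate = "these are signs of hope and relief .".split()
# Hypothetical human reference, invented for this example.
reference = "these are signs of hope and of relief .".split()
p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
```

Here every candidate word appears in the reference, so the unigram precision is 1.0, while the bigram `and relief` has no match and lowers the bigram precision to 6/7.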
3. Create infer
Use the inference interface `paddle.infer` to return the prediction probability (see field `prob`) and the predicted labels (see field `id`) of each generated sequence.
```python
if is_generating:
beam_result = paddle.infer(
output_layer=beam_gen,
parameters=parameters,
input=gen_data,
field=['prob', 'id'])
```
4. Print generated translation
Print each source sequence and its `beam_size` generated translations, using the dictionaries to map ids back to words.
```python
if is_generating:
# get the dictionary
src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
# the delimited element of generated sequences is -1,
# the first element of each generated sequence is the sequence length
seq_list = []
seq = []
for w in beam_result[1]:
if w != -1:
seq.append(w)
else:
seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
seq = []
prob = beam_result[0]
for i in xrange(gen_num):
print "\n*******************************************************\n"
print "src:", ' '.join(
[src_dict.get(w) for w in gen_data[i][0]]), "\n"
for j in xrange(beam_size):
print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
```
The generation log is as follows:
```text
src: <s> Les <unk> se <unk> au sujet de la largeur des sièges alors que de grosses commandes sont en jeu <e>
prob = -19.019573: The <unk> will be rotated about the width of the seats , while large orders are at stake . <e>
prob = -19.113066: The <unk> will be rotated about the width of the seats , while large commands are at stake . <e>
prob = -19.512890: The <unk> will be rotated about the width of the seats , while large commands are at play . <e>
```
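The parsing step in the printing code can be tried in isolation. The sketch below follows the layout described in its comments, where sequences are separated by -1 and each one starts with a length field; the helper `split_sequences`, the tiny dictionary, and the flat id list are hypothetical toy data rather than real `paddle.infer` output.

```python
def split_sequences(flat_ids, delimiter=-1):
    """Split a flat id list into sequences, dropping each length field."""
    sequences = []
    current = []
    for w in flat_ids:
        if w != delimiter:
            current.append(w)
        else:
            sequences.append(current[1:])  # skip the leading length element
            current = []
    return sequences


# Hypothetical target dictionary and flat output for two sequences.
toy_dict = {0: '<s>', 1: '<e>', 2: 'hello', 3: 'world'}
flat = [4, 0, 2, 3, 1, -1, 3, 0, 3, 1, -1]

id_sequences = split_sequences(flat)
sentences = [' '.join(toy_dict.get(w) for w in seq) for seq in id_sequences]
```

With `beam_size` results per source sentence, the translation for source `i` and beam `j` sits at index `i * beam_size + j` of the resulting list, as in the printing loop above.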
## Summary
......
......@@ -68,9 +68,9 @@
```
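If the number of translation results to display (that is, the width of the [beam search algorithm](#柱搜索算法)) is set to 3, the generated English sentences are as follows: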
```text
0 -5.36816 these are signs of hope and relief . <e>
1 -6.23177 these are the light of hope and relief . <e>
2 -7.7914 these are the light of hope and the relief of hope . <e>
0 -5.36816 These are signs of hope and relief . <e>
1 -6.23177 These are the light of hope and relief . <e>
2 -7.7914 These are the light of hope and the relief of hope . <e>
```
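- The first column from the left is the index of the generated sentence; the second column is the score of the sentence (in descending order), where a higher score is better; the third column is the generated English sentence.
- There are also two special tokens: `<e>` marks the end of a sentence, and `<unk>` denotes an unknown word, that is, a word that does not appear in the training dictionary.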
......@@ -116,7 +116,7 @@ GRU\[[2](#参考文献)\]是Cho等人在LSTM上提出的简化版本,也是RNN
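The encoding stage has three steps:
1. One-hot vector representation: each word $x_i$ of the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a column vector $w_i\epsilon R^{\left | V \right |},i=1,2,...,T$. The dimension of the vector $w_i$ equals the vocabulary size $\left | V \right |$; exactly one dimension has the value 1 (at the position of the word in the vocabulary), and all others are 0.
1. One-hot vector representation: each word $x_i$ of the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a column vector $w_i\epsilon \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$. The dimension of the vector $w_i$ equals the vocabulary size $\left | V \right |$; exactly one dimension has the value 1 (at the position of the word in the vocabulary), and all others are 0.
2. Word vectors mapped to a low-dimensional semantic space: the one-hot representation has two problems: 1) the resulting vectors are usually very high-dimensional, which easily leads to the curse of dimensionality; 2) it is hard to capture relationships between words (such as semantic similarity, so semantics are not expressed well). The one-hot vector is therefore mapped to a low-dimensional semantic space and represented by a dense vector of fixed dimension (called a word vector). Let the mapping matrix be $C\epsilon R^{K\times \left | V \right |}$ and let $s_i=Cw_i$ denote the word vector of the $i$-th word, where $K$ is the vector dimension.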
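The two steps above amount to selecting one column of the mapping matrix. A minimal sketch in plain Python, assuming a hypothetical vocabulary of size 4 and a made-up matrix $C$ with $K=2$ rows:

```python
def one_hot(index, vocab_size):
    """Return the one-hot vector w for a word index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical mapping matrix C with K = 2 rows and |V| = 4 columns.
C = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8]]

w = one_hot(2, 4)   # the word at position 2 of the vocabulary
s = mat_vec(C, w)   # s = C w selects column 2 of C
```

Because $w_i$ has a single non-zero entry, an embedding layer can implement $s_i=Cw_i$ as a table lookup instead of a full matrix product.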
......@@ -209,7 +209,7 @@ e_{ij}&=align(z_i,h_j)\\\\
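This dataset contains 193319 training samples and 6003 test samples, with a dictionary size of 30000. Because of the limited data size, the quality of a model trained on this dataset cannot be guaranteed.
## Training Workflow
## Workflow
### Paddle Initialization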
......@@ -220,48 +220,34 @@ import paddle.v2 as paddle
# Use the CPU only, and train with a single CPU
paddle.init(use_gpu=False, trainer_count=1)
```
### Data Definition
First define the dictionary size, which both data generation and the network configuration need. Then get the wmt14 dataset reader.
```python
# source and target dict dim.
dict_size = 30000
feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
# False for training mode, True for generation mode
is_generating = False
```
### Model Structure
1. First, define some global variables.
```python
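dict_size = 30000 # dictionary size
source_dict_dim = dict_size # source language dictionary size
target_dict_dim = dict_size # target language dictionary size
word_vector_dim = 512 # word vector dimension
encoder_size = 512 # hidden size of the GRU in the encoder
decoder_size = 512 # hidden size of the GRU in the decoder
beam_size = 3 # beam width
max_length = 250 # maximum length of a generated sentence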
```
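1. Next, implement the encoder. It has three steps:
2. Next, implement the encoder. It has three steps:
1 The input is a word sequence, represented as a sequence of integers. Each element of the sequence is the index of a word in the dictionary. We therefore define the data type of the data layer as `integer_value_sequence` (integer sequence), where each element lies in the range `[0, source_dict_dim)`.
- The input is a word sequence, represented as a sequence of integers. Each element of the sequence is the index of a word in the dictionary. We therefore define the data type of the data layer as `integer_value_sequence` (integer sequence), where each element lies in the range `[0, source_dict_dim)`.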
```python
src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
```
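1. Map the encoding above to word vectors $\mathbf{s}$ in a low-dimensional language space.
- Map the encoding above to word vectors $\mathbf{s}$ in a low-dimensional language space.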
```python
src_embedding = paddle.layer.embedding(
......@@ -269,7 +255,7 @@ wmt14_reader = paddle.batch(
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
```
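1. Encode the source language sequence with a bidirectional GRU, and concatenate the outputs of the two GRUs to obtain $\mathbf{h}$.
- Encode the source language sequence with a bidirectional GRU, and concatenate the outputs of the two GRUs to obtain $\mathbf{h}$.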
```python
src_forward = paddle.networks.simple_gru(
......@@ -279,9 +265,9 @@ wmt14_reader = paddle.batch(
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
```
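1. Then, define the attention-based decoder. It has three steps:
3. Then, define the attention-based decoder. It has three steps:
1. Pass the encoded source language sequence (see 2.3) through a feed-forward neural network to obtain its projection.
- Pass the encoded source language sequence (see the last step of 2) through a feed-forward neural network to obtain its projection.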
```python
with paddle.layer.mixed(size=decoder_size) as encoded_proj:
......@@ -289,7 +275,7 @@ wmt14_reader = paddle.batch(
input=encoded_vector)
```
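1. Construct the initial state of the decoder RNN. The decoder has to predict the target sequence step by step, but there is no initial value at time step 0, so we initialize it. Here, the last state of the backward encoding of the source language sequence is passed through a non-linear mapping and used as the initial value, i.e. $c_0=h_T$.
- Construct the initial state of the decoder RNN. The decoder has to predict the target sequence step by step, but there is no initial value at time step 0, so we initialize it. Here, the last state of the backward encoding of the source language sequence is passed through a non-linear mapping and used as the initial value, i.e. $c_0=h_T$.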
```python
backward_first = paddle.layer.first_seq(input=src_backward)
......@@ -299,7 +285,7 @@ wmt14_reader = paddle.batch(
input=backward_first)
```
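1. Define the RNN behavior at each time step of decoding: given the current source language context vector $c_i$, the decoder hidden state $z_i$, and the $i$-th word $u_i$ of the target language, predict the probability $p_{i+1}$ of the $i+1$-th word.
- Define the RNN behavior at each time step of decoding: given the current source language context vector $c_i$, the decoder hidden state $z_i$, and the $i$-th word $u_i$ of the target language, predict the probability $p_{i+1}$ of the $i+1$-th word.
- decoder_mem records the hidden state $z_i$ of the previous time step; its initial state is decoder_boot.
- context implements the formula $c_i=\sum_{j=1}^{T}a_{ij}h_j$ by calling the `simple_attention` function. Here, enc_vec is $h_j$, enc_proj is the projection of $h_j$ (see 3.1), and the computation of the weights $a_{ij}$ is encapsulated in `simple_attention`.
- decoder_inputs combines $c_i$ with the representation of the current target word current_word (that is, $u_i$).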
......@@ -336,7 +322,7 @@ wmt14_reader = paddle.batch(
return out
```
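The context vector computed inside the decoder step is a weighted sum of the encoder states. The sketch below reproduces $c_i=\sum_{j=1}^{T}a_{ij}h_j$ in plain Python, assuming, as in Bahdanau et al., that the weights $a_{ij}$ are a softmax over alignment scores $e_{ij}$. It is not the code of `simple_attention`; the function names and the toy scores and states are hypothetical.

```python
import math


def softmax(scores):
    """Normalize alignment scores e_ij into attention weights a_ij."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(scores, encoder_states):
    """Return the weights a_ij and the context c_i = sum_j a_ij * h_j."""
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


# Toy encoder states h_1..h_3 (T = 3) and equal alignment scores.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e = [0.0, 0.0, 0.0]
weights, context = attention_context(e, h)
```

With equal scores every source position gets weight 1/3, so the context is simply the average of the encoder states; unequal scores shift the context toward the positions the decoder attends to.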
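1. Define the name of the decoder group and the first two inputs of the `gru_decoder_with_attention` function. Note: these two inputs use `StaticInput`; see the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
4. Define the name of the decoder group and the first two inputs of the `gru_decoder_with_attention` function. Note: these two inputs use `StaticInput`; see the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.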
```python
decoder_group_name = "decoder_group"
......@@ -345,7 +331,7 @@ wmt14_reader = paddle.batch(
group_inputs = [group_input1, group_input2]
```
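1. Calling the decoder in training mode:
5. Calling the decoder in training mode:
- First, the word vectors of the target language sequence, trg_embedding, are passed directly to the `gru_decoder_with_attention` function as current_word in training mode.
- Second, the `recurrent_group` function is used to call `gru_decoder_with_attention` recurrently.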
......@@ -353,6 +339,7 @@ wmt14_reader = paddle.batch(
- Finally, the multi-class cross-entropy loss function `classification_cost` is used to compute the loss.
```python
if not is_generating:
trg_embedding = paddle.layer.embedding(
input=paddle.layer.data(
name='target_language_word',
......@@ -377,30 +364,71 @@ wmt14_reader = paddle.batch(
cost = paddle.layer.classification_cost(input=decoder, label=lbl)
```
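Note: the configuration we provide simplifies the one in Bahdanau's paper \[[4](#参考文献)\]; see [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for details.
6. Calling the decoder in generation mode:
- First, in a sequence generation task, the RNN in the decoding stage always uses the word vector of the word generated at the previous time step as the input of the current time step, so `GeneratedInput` is used to do this automatically. See the [GeneratedInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
- Second, the `beam_search` function is used to call `gru_decoder_with_attention` recurrently and generate the sequence ids.
### Parameter Definition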
```python
if is_generating:
# In generation, the decoder predicts a next target word based on
# the encoded source sequence and the last generated target word.
First, define the model parameters according to the `cost` of the model configuration.
# The encoded source sequence (encoder's output) must be specified by
# StaticInput, which is a read-only memory.
# Embedding of the last generated word is automatically gotten by
# GeneratedInputs, which is initialized by a start mark, such as <s>,
# and must be included in generation.
```python
parameters = paddle.parameters.create(cost)
```
trg_embedding = paddle.layer.GeneratedInputV2(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
The parameter names can be printed; if a name is not specified in the network configuration, a default one is generated.
beam_gen = paddle.layer.beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
```
```python
for param in parameters.keys():
print param
```
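Note: the configuration we provide simplifies the one in Bahdanau's paper \[[4](#参考文献)\]; see [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for details.
### Train the Model
1. Create the trainer
1. Parameter definition
Define the model parameters according to the `cost` of the model configuration. The parameter names can be printed; if a name is not specified in the network configuration, a default one is generated.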
```python
if not is_generating:
parameters = paddle.parameters.create(cost)
for param in parameters.keys():
print param
```
2. Data definition
Get the wmt14 dataset reader.
```python
if not is_generating:
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
```
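3. Create the trainer
Create the trainer from the optimization target cost, the network topology, and the model parameters. The optimization method must also be specified at construction time; here the most basic SGD method is used.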
```python
if not is_generating:
optimizer = paddle.optimizer.Adam(
learning_rate=5e-5,
regularization=paddle.optimizer.L2Regularization(rate=8e-4))
......@@ -409,73 +437,109 @@ for param in parameters.keys():
update_equation=optimizer)
```
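1. Create the event_handler
4. Create the event_handler
A custom callback function can be used to evaluate various states during training, such as the error rate. The code below uses event.batch_id % 10 == 0 to print a log, including the cost and other information, every 10 batches.
A custom callback function can be used to evaluate various states during training, such as the error rate. The code below uses event.batch_id % 2 == 0 to print a log, including the cost and other information, every 2 batches.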
```python
if not is_generating:
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0:
if event.batch_id % 2 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
```
1. Start training:
5. Start training
```python
if not is_generating:
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=2,
feeding=feeding)
reader=wmt14_reader, event_handler=event_handler, num_passes=2)
```
After training starts, the log printed by event_handler looks like this:
After training starts, the log printed by event_handler looks like this:
```text
Pass 0, Batch 0, Cost 148.444983, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 335.896802, {'classification_error_evaluator': 0.9325153231620789}
.........
```
```text
Pass 0, Batch 0, Cost 148.444983, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 335.896802, {'classification_error_evaluator': 0.9325153231620789}
.........
```
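### Generation
Training is considered successful once the value of `classification_error_evaluator` falls below 0.35.
1. Load the pre-trained model
## Model Usage
Training an NMT model is very time-consuming, so we spent 5 days training a model on a cluster of 50 physical nodes (each with two 6-core CPUs) and provide it for direct download. The model is 205MB in size and has a [BLEU evaluation](#BLEU评估) score of 26.92.
### Download the Pre-trained Model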
```python
if is_generating:
parameters = paddle.dataset.wmt14.model()
```
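2. Data definition
Training an NMT model is very time-consuming: on a cluster of 50 physical nodes (each with two 6-core CPUs), we spent 5 days training 16 passes, with each pass taking 7 hours. We therefore provide a pre-trained model (pass-00012) for direct download. The model is 205MB in size and has the highest [BLEU evaluation](#BLEU评估) score among all 16 models, 26.92. The commands to download and unpack the model are as follows:
Read the first 3 samples from the wmt14 generation set as the source language sentences.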
```bash
cd pretrained
./wmt14_model.sh
```
```python
if is_generating:
gen_creator = paddle.dataset.wmt14.gen(dict_size)
gen_data = []
gen_num = 3
for item in gen_creator():
gen_data.append((item[0], ))
if len(gen_data) == gen_num:
break
```
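3. Create infer
### BLEU Evaluation
Create infer for generation from the network topology and the model parameters. At prediction time, the output field `field` must also be specified; here we use the probability `prob` of the generated sentence and the `id` of each word in the sentence.
BLEU (Bilingual Evaluation Understudy) is a widely used metric for the automatic evaluation of machine translation, proposed by IBM's Watson Research Center in 2002\[[5](#参考文献)\]. Its basic idea is that the closer a machine translation is to the result of a professional human translator, the better the translation system performs. The closeness between the machine translation and the human reference translation is measured with sentence precision, that is, by counting the n-grams that match between the two; the more matches, the better the BLEU score.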
```python
if is_generating:
beam_result = paddle.infer(
output_layer=beam_gen,
parameters=parameters,
input=gen_data,
field=['prob', 'id'])
```
[Moses](http://www.statmt.org/moses/) is an open-source statistical machine translation system. We use its [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) script for BLEU evaluation. The command to download the script is:
```bash
./moses_bleu.sh
```
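BLEU evaluation can be run with the `eval_bleu` script as follows, where FILE is the name of the file to be evaluated and BEAMSIZE is the beam width. By default, `data/wmt14/gen/ntst14.trg` is used as the reference translation.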
```bash
./eval_bleu.sh FILE BEAMSIZE
```
The concrete command for this tutorial is:
```bash
./eval_bleu.sh gen_result 3
```
You will see the following on the screen:
```text
BLEU = 26.92
```
4. Print the generated results
Using the source and target language dictionaries, print the source sentence and its `beam_size` generated sentences.
```python
if is_generating:
# get the dictionary
src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
# the delimited element of generated sequences is -1,
# the first element of each generated sequence is the sequence length
seq_list = []
seq = []
for w in beam_result[1]:
if w != -1:
seq.append(w)
else:
seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
seq = []
prob = beam_result[0]
for i in xrange(gen_num):
print "\n*******************************************************\n"
print "src:", ' '.join(
[src_dict.get(w) for w in gen_data[i][0]]), "\n"
for j in xrange(beam_size):
print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
```
After generation starts, the log output looks like this:
```text
src: <s> Les <unk> se <unk> au sujet de la largeur des sièges alors que de grosses commandes sont en jeu <e>
prob = -19.019573: The <unk> will be rotated about the width of the seats , while large orders are at stake . <e>
prob = -19.113066: The <unk> will be rotated about the width of the seats , while large commands are at stake . <e>
prob = -19.512890: The <unk> will be rotated about the width of the seats , while large commands are at play . <e>
```
## Summary
......
#!/bin/bash
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e
set -x
echo "Downloading multi-bleu.perl"
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl --no-check-certificate
#!/bin/bash
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e
set -x
# download the pretrained model
wget http://paddlepaddle.bj.bcebos.com/model_zoo/wmt14_model.tar.gz
# untar the model
tar -zxvf wmt14_model.tar.gz
rm wmt14_model.tar.gz
import sys
import paddle.v2 as paddle
def seqToseq_net(source_dict_dim, target_dict_dim):
def seqToseq_net(source_dict_dim, target_dict_dim, is_generating=False):
### Network Architecture
word_vector_dim = 512 # dimension of word vector
decoder_size = 512 # dimension of hidden unit in GRU Decoder network
encoder_size = 512 # dimension of hidden unit in GRU Encoder network
beam_size = 3
max_length = 250
#### Encoder
src_word_id = paddle.layer.data(
name='source_language_word',
......@@ -67,6 +71,7 @@ def seqToseq_net(source_dict_dim, target_dict_dim):
group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2]
if not is_generating:
trg_embedding = paddle.layer.embedding(
input=paddle.layer.data(
name='target_language_word',
......@@ -91,16 +96,44 @@ def seqToseq_net(source_dict_dim, target_dict_dim):
cost = paddle.layer.classification_cost(input=decoder, label=lbl)
return cost
else:
# In generation, the decoder predicts a next target word based on
# the encoded source sequence and the last generated target word.
# The encoded source sequence (encoder's output) must be specified by
# StaticInput, which is a read-only memory.
# Embedding of the last generated word is automatically gotten by
# GeneratedInputs, which is initialized by a start mark, such as <s>,
# and must be included in generation.
trg_embedding = paddle.layer.GeneratedInputV2(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
beam_gen = paddle.layer.beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
return beam_gen
def main():
paddle.init(use_gpu=False, trainer_count=1)
is_generating = False
# source and target dict dim.
dict_size = 30000
source_dict_dim = target_dict_dim = dict_size
# define network topology
# train the network
if not is_generating:
cost = seqToseq_net(source_dict_dim, target_dict_dim)
parameters = paddle.parameters.create(cost)
......@@ -110,17 +143,10 @@ def main():
regularization=paddle.optimizer.L2Regularization(rate=8e-4))
trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=optimizer)
# define data reader
feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
paddle.dataset.wmt14.train(dict_size), buf_size=8192),
batch_size=5)
# define event_handler callback
......@@ -128,17 +154,59 @@ def main():
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
event.pass_id, event.batch_id, event.cost,
event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
# start to train
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=2,
feeding=feeding)
reader=wmt14_reader, event_handler=event_handler, num_passes=2)
# generate an English sequence from a French source sentence
else:
# use the first 3 samples for generation
gen_creator = paddle.dataset.wmt14.gen(dict_size)
gen_data = []
gen_num = 3
for item in gen_creator():
gen_data.append((item[0], ))
if len(gen_data) == gen_num:
break
beam_gen = seqToseq_net(source_dict_dim, target_dict_dim, is_generating)
# get the pretrained model, whose bleu = 26.92
parameters = paddle.dataset.wmt14.model()
# prob is the prediction probabilities, and id is the prediction word.
beam_result = paddle.infer(
output_layer=beam_gen,
parameters=parameters,
input=gen_data,
field=['prob', 'id'])
# get the dictionary
src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
# the delimited element of generated sequences is -1,
# the first element of each generated sequence is the sequence length
seq_list = []
seq = []
for w in beam_result[1]:
if w != -1:
seq.append(w)
else:
seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
seq = []
prob = beam_result[0]
beam_size = 3
for i in xrange(gen_num):
print "\n*******************************************************\n"
print "src:", ' '.join(
[src_dict.get(w) for w in gen_data[i][0]]), "\n"
for j in xrange(beam_size):
print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
if __name__ == '__main__':
......