Commit 03c00de6 authored by Tao Luo, committed by GitHub

Merge pull request #293 from luotao1/mt

update seq2seq english md file
```python
import paddle.v2 as paddle

# train with a single CPU
paddle.init(use_gpu=False, trainer_count=1)

# False: training, True: generating
is_generating = False
```
### Model Configuration

1. Define some global variables

   ```python
   dict_size = 30000            # dict dim
   source_dict_dim = dict_size  # source language dictionary size
   target_dict_dim = dict_size  # destination language dictionary size
   word_vector_dim = 512        # word embedding dimension
   encoder_size = 512           # hidden layer size of GRU in encoder
   decoder_size = 512           # hidden layer size of GRU in decoder
   beam_size = 3                # expand width in beam search
   max_length = 250             # a stop condition of sequence generation
   ```
2. Implement Encoder as follows:
   - Input is a sequence of words represented by an integer word index sequence. So we define a data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`.

     ```python
     src_word_id = paddle.layer.data(
         name='source_language_word',
         type=paddle.data_type.integer_value_sequence(source_dict_dim))
     ```

   - Map the one-hot vector (represented by the word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space.

     ```python
     src_embedding = paddle.layer.embedding(
         input=src_word_id,
         size=word_vector_dim,
         param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
     ```

   - Use a bi-directional GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$.

     ```python
     src_forward = paddle.networks.simple_gru(
         input=src_embedding, size=encoder_size)
     src_backward = paddle.networks.simple_gru(
         input=src_embedding, size=encoder_size, reverse=True)
     encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
     ```
3. Implement Attention-based Decoder as follows:
   - Get a projection of the encoding (cf. 2.3) of the source language sequence by passing it into a feed forward neural network.

     ```python
     with paddle.layer.mixed(size=decoder_size) as encoded_proj:
         encoded_proj += paddle.layer.full_matrix_projection(
             input=encoded_vector)
     ```

   - Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$.

     ```python
     backward_first = paddle.layer.first_seq(input=src_backward)
     with paddle.layer.mixed(
             size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
         decoder_boot += paddle.layer.full_matrix_projection(
             input=backward_first)
     ```

   - Define the computation in each time step of the decoder RNN, i.e., given the current context vector $c_i$, the decoder hidden state $z_i$, and the $i$-th word $u_i$ in the target language, predict the probability $p_{i+1}$ of the $(i+1)$-th word (a sketch of this step function follows below).
     - decoder_mem records the hidden state $z_i$ from the previous time step, with decoder_boot as its initial state.
     - context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is the encoder output $h_j$ and enc_proj is its projection (cf. 3.1). $a_{ij}$ is calculated within `simple_attention`.
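   The body of `gru_decoder_with_attention` is folded in this hunk. As a reference, a minimal sketch of such a step function, wired with `simple_attention` and `gru_step` as described above, could look like the following; treat the exact layer sizes and wiring as assumptions rather than the file's verbatim code. Inside `simple_attention`, the weights take the usual Bahdanau form $a_{ij}=\frac{\exp(e_{ij})}{\sum_{k=1}^{T}\exp(e_{ik})}$, where $e_{ij}$ scores the match between the previous decoder state and the annotation $h_j$.

   ```python
   def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
       # memory of the decoder hidden state z_i, bootstrapped by decoder_boot
       decoder_mem = paddle.layer.memory(
           name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)

       # context vector c_i = sum_j a_ij * h_j, computed by simple_attention
       context = paddle.networks.simple_attention(
           encoded_sequence=enc_vec,
           encoded_proj=enc_proj,
           decoder_state=decoder_mem)

       # fuse the context and the current target word into the GRU input
       with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
           decoder_inputs += paddle.layer.full_matrix_projection(input=context)
           decoder_inputs += paddle.layer.full_matrix_projection(
               input=current_word)

       # one GRU step produces the new hidden state
       gru_step = paddle.layer.gru_step(
           name='gru_decoder',
           input=decoder_inputs,
           output_mem=decoder_mem,
           size=decoder_size)

       # softmax over the target vocabulary gives the next-word probability
       with paddle.layer.mixed(
               size=target_dict_dim,
               bias_attr=True,
               act=paddle.activation.Softmax()) as out:
           out += paddle.layer.full_matrix_projection(input=gru_step)
       return out
   ```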
4. Define the name of the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for these two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.

   ```python
   decoder_group_name = "decoder_group"
   group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
   group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
   group_inputs = [group_input1, group_input2]
   ```
5. Training mode:
   - word embedding from the target language trg_embedding is passed to `gru_decoder_with_attention` as current_word.
   - `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way.
   - the sequence of next words from the target language is used as the label (lbl).
   - multi-class cross-entropy (`classification_cost`) is used to calculate the cost.

   ```python
   if not is_generating:
       trg_embedding = paddle.layer.embedding(
           input=paddle.layer.data(
               name='target_language_word',
               type=paddle.data_type.integer_value_sequence(target_dict_dim)),
           size=word_vector_dim,
           param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
       group_inputs.append(trg_embedding)

       # For a decoder equipped with an attention mechanism, in training,
       # the target embedding (the ground truth) is the data input,
       # while the encoded source sequence is accessed as an unbounded memory.
       # Here, the StaticInput defines a read-only memory
       # for the recurrent_group.
       decoder = paddle.layer.recurrent_group(
           name=decoder_group_name,
           step=gru_decoder_with_attention,
           input=group_inputs)

       lbl = paddle.layer.data(
           name='target_language_next_word',
           type=paddle.data_type.integer_value_sequence(target_dict_dim))
       cost = paddle.layer.classification_cost(input=decoder, label=lbl)
   ```
6. Generating mode:
   - the decoder predicts the next target word based on the last generated target word; the embedding of the last generated word is automatically fetched by `GeneratedInput`.
   - `beam_search` calls `gru_decoder_with_attention` in a recurrent way to predict sequence ids.

   ```python
   if is_generating:
       # In generation, the decoder predicts the next target word based on
       # the encoded source sequence and the last generated target word.
       # The encoded source sequence (encoder's output) must be specified by
       # StaticInput, which is a read-only memory.
       # The embedding of the last generated word is automatically retrieved
       # by GeneratedInput, which is initialized by a start mark, such as <s>,
       # and must be included in generation.
       trg_embedding = paddle.layer.GeneratedInputV2(
           size=target_dict_dim,
           embedding_name='_target_language_embedding',
           embedding_size=word_vector_dim)
       group_inputs.append(trg_embedding)

       beam_gen = paddle.layer.beam_search(
           name=decoder_group_name,
           step=gru_decoder_with_attention,
           input=group_inputs,
           bos_id=0,
           eos_id=1,
           beam_size=beam_size,
           max_length=max_length)
   ```
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
## Model Training

1. Create Parameters

   Create every parameter that the `cost` layer needs, and list the parameter names: if a parameter name is not specified during model configuration, it is generated automatically.

   ```python
   if not is_generating:
       parameters = paddle.parameters.create(cost)
       for param in parameters.keys():
           print param
   ```
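   Individual parameters can also be inspected as numpy arrays, which is handy for a quick shape check. This is an illustrative extra, assuming `parameters.get` behaves as in other PaddlePaddle v2 examples:

   ```python
   if not is_generating:
       # the source embedding should have shape (dict_size, word_vector_dim)
       emb = parameters.get('_source_language_embedding')
       print emb.shape
   ```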
2. Define DataSet

   Create a [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for the WMT-14 dataset.

   ```python
   if not is_generating:
       wmt14_reader = paddle.batch(
           paddle.reader.shuffle(
               paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
           batch_size=5)
   ```
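   Each sample yielded by the underlying reader is a tuple of (source word ids, target word ids, next-target word ids), matching the `source_language_word`, `target_language_word` and `target_language_next_word` data layers defined above. A quick, purely illustrative sanity check:

   ```python
   if not is_generating:
       # peek at one raw (un-batched) training sample
       sample = next(paddle.dataset.wmt14.train(dict_size=dict_size)())
       print len(sample), [len(part) for part in sample]
   ```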
3. Create trainer

   We need to tell the trainer what to optimize and how to optimize it. Here the trainer optimizes the `cost` layer using stochastic gradient descent (SGD).

   ```python
   if not is_generating:
       optimizer = paddle.optimizer.Adam(
           learning_rate=5e-5,
           regularization=paddle.optimizer.L2Regularization(rate=8e-4))
       trainer = paddle.trainer.SGD(cost=cost,
                                    parameters=parameters,
                                    update_equation=optimizer)
   ```
4. Define event handler

   The event handler is a callback function invoked by the trainer when an event happens. Here we print a log line in the event handler.

   ```python
   if not is_generating:
       def event_handler(event):
           if isinstance(event, paddle.event.EndIteration):
               if event.batch_id % 2 == 0:
                   print "\nPass %d, Batch %d, Cost %f, %s" % (
                       event.pass_id, event.batch_id, event.cost, event.metrics)
   ```
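   The same callback is also the usual place for checkpointing. A sketch (not part of the original tutorial) that additionally saves the parameters at the end of every pass, assuming `paddle.event.EndPass` and `parameters.to_tar` as in other PaddlePaddle v2 examples:

   ```python
   import gzip

   def event_handler_with_saving(event):
       if isinstance(event, paddle.event.EndIteration):
           if event.batch_id % 2 == 0:
               print "\nPass %d, Batch %d, Cost %f, %s" % (
                   event.pass_id, event.batch_id, event.cost, event.metrics)
       if isinstance(event, paddle.event.EndPass):
           # persist the parameters so they can be reloaded for generation
           with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
               parameters.to_tar(f)
   ```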
5. Start training

   ```python
   if not is_generating:
       trainer.train(
           reader=wmt14_reader, event_handler=event_handler, num_passes=2)
   ```

   The training log is as follows:

   ```text
   Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
   Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
   ...
   ```
## Model Usage

1. Download Pre-trained Model

   As training an NMT model is very time consuming, we provide a pre-trained model. The model was trained on a cluster of 50 physical nodes (each node has two 6-core CPUs) for about 5 days. The provided model achieves a [BLEU Score](#BLEU Score) of 26.92 and is about 205 MB in size.

   ```python
   if is_generating:
       parameters = paddle.dataset.wmt14.model()
   ```

2. Define DataSet

   Get the first 3 samples of the WMT-14 generating set as the source language sequences.

   ```python
   if is_generating:
       gen_creator = paddle.dataset.wmt14.gen(dict_size)
       gen_data = []
       gen_num = 3
       for item in gen_creator():
           gen_data.append((item[0], ))
           if len(gen_data) == gen_num:
               break
   ```

3. Create infer

   Use the inference interface `paddle.infer` to return the prediction probability (see field `prob`) and labels (see field `id`) of each generated sequence.

   ```python
   if is_generating:
       beam_result = paddle.infer(
           output_layer=beam_gen,
           parameters=parameters,
           input=gen_data,
           field=['prob', 'id'])
   ```
4. Print generated translation

   Print each source sequence and its `beam_size` translation results, looked up in the dictionary.

   ```python
   if is_generating:
       # get the dictionary
       src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)

       # the delimiting element of generated sequences is -1,
       # the first element of each generated sequence is the sequence length
       seq_list = []
       seq = []
       for w in beam_result[1]:
           if w != -1:
               seq.append(w)
           else:
               seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
               seq = []

       prob = beam_result[0]
       for i in xrange(gen_num):
           print "\n*******************************************************\n"
           print "src:", ' '.join(
               [src_dict.get(w) for w in gen_data[i][0]]), "\n"
           for j in xrange(beam_size):
               print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
   ```

   The generating log is as follows:

   ```text
   src: <s> Les <unk> se <unk> au sujet de la largeur des sièges alors que de grosses commandes sont en jeu <e>

   prob = -19.019573: The <unk> will be rotated about the width of the seats , while large orders are at stake . <e>
   prob = -19.113066: The <unk> will be rotated about the width of the seats , while large commands are at stake . <e>
   prob = -19.512890: The <unk> will be rotated about the width of the seats , while large commands are at play . <e>
   ```
## Summary