Commit 0c132546, authored by Luo Tao

update seq2seq english md file

Parent commit: 469c1eaa
```python
import paddle.v2 as paddle

# train with a single CPU
paddle.init(use_gpu=False, trainer_count=1)
# False: training, True: generating
is_generating = False
```
### Model Configuration

1. Define some global variables.

   ```python
   dict_size = 30000            # dict dim
   source_dict_dim = dict_size  # source language dictionary size
   target_dict_dim = dict_size  # destination language dictionary size
   word_vector_dim = 512        # word embedding dimension
   encoder_size = 512           # hidden layer size of GRU in encoder
   decoder_size = 512           # hidden layer size of GRU in decoder
   beam_size = 3                # expand width in beam search
   max_length = 250             # a stop condition of sequence generation
   ```
2. Implement Encoder as follows:

   - Input is a sequence of words represented by an integer word index sequence. So we define a data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`.

   ```python
   src_word_id = paddle.layer.data(
       name='source_language_word',
       type=paddle.data_type.integer_value_sequence(source_dict_dim))
   ```

   - Map the one-hot vector (represented by the word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space.

   ```python
   src_embedding = paddle.layer.embedding(
       input=src_word_id,
       size=word_vector_dim,
       param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
   ```

   - Use a bi-directional GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$.

   ```python
   src_forward = paddle.networks.simple_gru(
       input=src_embedding, size=encoder_size)
   src_backward = paddle.networks.simple_gru(
       input=src_embedding, size=encoder_size, reverse=True)
   encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
   ```
3. Implement Attention-based Decoder as follows:

   - Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed-forward neural network.

   ```python
   with paddle.layer.mixed(size=decoder_size) as encoded_proj:
       encoded_proj += paddle.layer.full_matrix_projection(
           input=encoded_vector)
   ```

   - Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$.

   ```python
   backward_first = paddle.layer.first_seq(input=src_backward)
   with paddle.layer.mixed(
           size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
       decoder_boot += paddle.layer.full_matrix_projection(
           input=backward_first)
   ```

   - Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, the hidden state of the decoder $z_i$, and the $i$-th word $u_i$ in the target language, predict the probability $p_{i+1}$ of the $i+1$-th word.

     - decoder_mem records the hidden state $z_i$ from the previous time step, with an initial state of decoder_boot.
     - context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`; a NumPy sketch of this attention step is given after this section.

   ```python
   def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
       ...
       return out
   ```
4. Define the name of the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for these two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.

   ```python
   decoder_group_name = "decoder_group"
   group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
   group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
   group_inputs = [group_input1, group_input2]
   ```
5. Training mode:

   - The word embedding of the target language, trg_embedding, is passed to `gru_decoder_with_attention` as current_word.
   - `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way.
   - The sequence of next words from the target language is used as the label lbl.
   - Multi-class cross-entropy (`classification_cost`) is used to calculate the cost.

   ```python
   if not is_generating:
       trg_embedding = paddle.layer.embedding(
           input=paddle.layer.data(
               name='target_language_word',
               type=paddle.data_type.integer_value_sequence(target_dict_dim)),
           size=word_vector_dim,
           param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
       group_inputs.append(trg_embedding)
       decoder = paddle.layer.recurrent_group(
           name=decoder_group_name,
           step=gru_decoder_with_attention,
           input=group_inputs)
       lbl = paddle.layer.data(
           name='target_language_next_word',
           type=paddle.data_type.integer_value_sequence(target_dict_dim))
       cost = paddle.layer.classification_cost(input=decoder, label=lbl)
   ```
6. Generating mode:

   - The decoder predicts the next target word based on the last generated target word. The embedding of the last generated word is automatically retrieved via `GeneratedInput`.
   - `beam_search` calls `gru_decoder_with_attention` in a recurrent way to predict sequence ids; a toy beam-search sketch is given after this section.

   ```python
   if is_generating:
       # In generation, the decoder predicts a next target word based on
       # the encoded source sequence and the last generated target word.
       # The encoded source sequence (encoder's output) must be specified by
       # StaticInput, which is a read-only memory.
       # Embedding of the last generated word is automatically retrieved
       # by GeneratedInputs, which is initialized by a start mark, such as
       # <s>, and must be included in generation.

       trg_embedding = paddle.layer.GeneratedInputV2(
           size=target_dict_dim,
           embedding_name='_target_language_embedding',
           embedding_size=word_vector_dim)
       group_inputs.append(trg_embedding)

       beam_gen = paddle.layer.beam_search(
           name=decoder_group_name,
           step=gru_decoder_with_attention,
           input=group_inputs,
           bos_id=0,
           eos_id=1,
           beam_size=beam_size,
           max_length=max_length)
   ```

Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
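To make the attention computation in step 3 concrete, here is a minimal NumPy sketch of additive attention computing the weights $a_{ij}$ and the context $c_i$. It is illustrative only: `additive_attention`, its parameter shapes, and the exact scoring function are assumptions for exposition, not the internals of PaddlePaddle's `simple_attention`.

```python
# A minimal sketch of additive attention: score each encoder state against the
# previous decoder state, softmax-normalize the scores into a_ij, and return
# the context c_i = sum_j a_ij * h_j. Names and scoring are illustrative only.
import numpy as np

def additive_attention(h, enc_proj, dec_state, v):
    # h:         (T, D)  encoder states h_j
    # enc_proj:  (T, D)  projections of h_j (c.f. 3.1)
    # dec_state: (D,)    previous decoder hidden state z_{i-1}, already projected
    # v:         (D,)    attention parameter vector
    scores = np.tanh(enc_proj + dec_state).dot(v)  # e_ij, one per source position
    a = np.exp(scores - scores.max())
    a /= a.sum()                                   # softmax -> attention weights a_ij
    return a.dot(h), a                             # context c_i and the weights

T, D = 5, 8
rng = np.random.RandomState(0)
context, weights = additive_attention(
    rng.randn(T, D), rng.randn(T, D), rng.randn(D), rng.randn(D))
print(weights.sum())  # 1.0: the attention weights form a probability distribution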
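Similarly, the following self-contained toy shows what `beam_search` in step 6 does with `bos_id`, `eos_id`, `beam_size`, and `max_length`: expand each partial hypothesis with next-word probabilities, keep the `beam_size` best by accumulated log-probability, and move hypotheses that emit the end mark to the finished pool. The `step_prob` callback is a hypothetical stand-in for one decoder step; this is not the PaddlePaddle API.

```python
# A toy beam search: not the PaddlePaddle implementation, just the idea.
import math

def toy_beam_search(step_prob, bos_id, eos_id, beam_size, max_length):
    beams = [([bos_id], 0.0)]  # (prefix, accumulated log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for prefix, logp in beams:
            for w, p in step_prob(prefix).items():
                candidates.append((prefix + [w], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, logp in candidates[:beam_size]:
            # hypotheses ending with <e> are finished; others stay in the beam
            (finished if prefix[-1] == eos_id else beams).append((prefix, logp))
        if not beams:
            break
    return sorted(finished + beams, key=lambda c: c[1], reverse=True)[:beam_size]

# fake 3-word vocabulary: 0 = <s>, 1 = <e>, 2 = some ordinary word
fake_step = lambda prefix: {1: 0.4, 2: 0.6}
for seq, logp in toy_beam_search(fake_step, bos_id=0, eos_id=1,
                                 beam_size=3, max_length=4):
    print(seq, logp)
```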
## Model Training

1. Create Parameters

   Create every parameter that the `cost` layer needs, and print the parameter names. If a parameter name is not specified during model configuration, it will be generated automatically.

   ```python
   if not is_generating:
       parameters = paddle.parameters.create(cost)
       for param in parameters.keys():
           print param
   ```

2. Define DataSet

   Create a [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for the WMT-14 dataset. A toy reader with the same sample format is sketched after this section.

   ```python
   if not is_generating:
       wmt14_reader = paddle.batch(
           paddle.reader.shuffle(
               paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
           batch_size=5)
   ```

3. Create trainer

   We need to tell the trainer what to optimize, and how to optimize. Here the trainer will optimize the `cost` layer using stochastic gradient descent (SGD).

   ```python
   if not is_generating:
       optimizer = paddle.optimizer.Adam(
           learning_rate=5e-5,
           regularization=paddle.optimizer.L2Regularization(rate=8e-4))
       trainer = paddle.trainer.SGD(cost=cost,
                                    parameters=parameters,
                                    update_equation=optimizer)
   ```
4. Define event handler

   The event handler is a callback function invoked by the trainer when an event happens. Here we print the training log in the event handler; a variant that also saves checkpoints is sketched after this section.

   ```python
   if not is_generating:
       def event_handler(event):
           if isinstance(event, paddle.event.EndIteration):
               if event.batch_id % 2 == 0:
                   print "\nPass %d, Batch %d, Cost %f, %s" % (
                       event.pass_id, event.batch_id, event.cost, event.metrics)
   ```

5. Start training

   ```python
   if not is_generating:
       trainer.train(
           reader=wmt14_reader, event_handler=event_handler, num_passes=2)
   ```

   The training log is as follows:

   ```text
   Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
   Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
   ...
   ```

   The model training is successful when the `classification_error_evaluator` is lower than 0.35.
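For reference, `paddle.dataset.wmt14.train` yields one tuple per sample: the source word-id sequence, the target word-id sequence, and the target sequence shifted by one (the next-word labels). A toy reader with the same shape, using made-up ids, can be swapped in to smoke-test the pipeline:

```python
# A toy reader in the same sample format as paddle.dataset.wmt14.train:
# (source ids, target ids, target next-word ids). The ids below are made up.
import paddle.v2 as paddle

def toy_reader():
    samples = [
        ([0, 4, 7, 1], [0, 5, 9], [5, 9, 1]),  # <s> ... <e> index sequences
        ([0, 8, 1], [0, 6], [6, 1]),
    ]
    def reader():
        for src, trg, trg_next in samples:
            yield src, trg, trg_next
    return reader

toy_batch_reader = paddle.batch(
    paddle.reader.shuffle(toy_reader(), buf_size=16), batch_size=2)
```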
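If you also want to checkpoint the model, the handler can react to `paddle.event.EndPass` as well. The sketch below assumes the `Parameters.to_tar` serialization interface of paddle.v2; adjust it to the interface of your version.

```python
# A sketch of an event handler that also saves parameters after each pass.
# Assumes paddle.v2's Parameters.to_tar interface.
def event_handler_with_checkpoint(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 2 == 0:
            print "\nPass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
    if isinstance(event, paddle.event.EndPass):
        with open('params_pass_%d.tar' % event.pass_id, 'w') as f:
            parameters.to_tar(f)
```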
## Model Usage

1. Download Pre-trained Model

   As the training of an NMT model is very time consuming, we provide a pre-trained model. The model is trained with a cluster of 50 physical nodes (each node has two 6-core CPUs) over 5 days. The provided model has a [BLEU Score](#BLEU Score) of 26.92 and a size of 205M. A sketch of the n-gram precision underlying BLEU is given at the end of this section.

   ```python
   if is_generating:
       parameters = paddle.dataset.wmt14.model()
   ```

2. Define DataSet

   Get the first 3 samples of the WMT-14 generating set as the source language sequences.

   ```python
   if is_generating:
       gen_creator = paddle.dataset.wmt14.gen(dict_size)
       gen_data = []
       gen_num = 3
       for item in gen_creator():
           gen_data.append((item[0], ))
           if len(gen_data) == gen_num:
               break
   ```

3. Create infer

   Use the inference interface `paddle.infer` to return the prediction probability (see field `prob`) and labels (see field `id`) of each generated sequence.

   ```python
   if is_generating:
       beam_result = paddle.infer(
           output_layer=beam_gen,
           parameters=parameters,
           input=gen_data,
           field=['prob', 'id'])
   ```

4. Print generated translation

   Print each source sequence and its `beam_size` generated translations based on the dictionary.

   ```python
   if is_generating:
       # get the dictionary
       src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)

       # the delimiting element of generated sequences is -1,
       # the first element of each generated sequence is the sequence length
       seq_list = []
       seq = []
       for w in beam_result[1]:
           if w != -1:
               seq.append(w)
           else:
               seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
               seq = []

       prob = beam_result[0]
       for i in xrange(gen_num):
           print "\n*******************************************************\n"
           print "src:", ' '.join(
               [src_dict.get(w) for w in gen_data[i][0]]), "\n"
           for j in xrange(beam_size):
               print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
   ```

   The generating log is as follows:

   ```text
   src: <s> Les <unk> se <unk> au sujet de la largeur des sièges alors que de grosses commandes sont en jeu <e>

   prob = -19.019573: The <unk> will be rotated about the width of the seats , while large orders are at stake . <e>
   prob = -19.113066: The <unk> will be rotated about the width of the seats , while large commands are at stake . <e>
   prob = -19.512890: The <unk> will be rotated about the width of the seats , while large commands are at play . <e>
   ```
## Summary

......