We will define the dictionary size, and create a [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for the WMT-14 dataset.

1. Define some global variables
```python
source_dict_dim = dict_size # source language dictionary size
target_dict_dim = dict_size # destination language dictionary size
word_vector_dim = 512 # word embedding dimension
encoder_size = 512 # hidden layer size of GRU in encoder
decoder_size = 512 # hidden layer size of GRU in decoder
beam_size = 3 # expand width in beam search
max_length = 250 # a stop condition of sequence generation
```
2. Implement Encoder as follows:
- Input is a sequence of words represented by an integer word index sequence. So we define a data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)` (a sketch of this layer follows below).
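For instance, such a data layer might be declared as follows; the layer name `source_language_word` is an illustrative assumption:

```python
# an integer-index sequence over the source dictionary
src_word_id = paddle.layer.data(
    name='source_language_word',
    type=paddle.data_type.integer_value_sequence(source_dict_dim))
```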
3. Implement Attention-based Decoder as follows:
- Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network

```python
with paddle.layer.mixed(size=decoder_size) as encoded_proj:
    encoded_proj += paddle.layer.full_matrix_projection(
        input=encoded_vector)
```
- Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$.
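A minimal sketch of this initialization, assuming `src_backward` holds the output sequence of the encoder's backward GRU:

```python
# take the last state of the backward GRU and pass it through a Tanh projection
backward_first = paddle.layer.first_seq(input=src_backward)
with paddle.layer.mixed(
        size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
    decoder_boot += paddle.layer.full_matrix_projection(input=backward_first)
```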
- Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, the decoder hidden state $z_i$ and the $i$-th word $u_i$ in the target language, predict the probability $p_{i+1}$ of the $(i+1)$-th word.
- decoder_mem records the hidden state $z_i$ from the previous time step, with decoder_boot as its initial state.
- context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`.
```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
    ...
    return out
```
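For reference, the elided part of the step function computes the decoder memory and the attention context roughly as follows; a sketch assuming the `paddle.layer.memory` and `paddle.networks.simple_attention` APIs:

```python
# decoder_mem carries z_i across time steps, booted by decoder_boot
decoder_mem = paddle.layer.memory(
    name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)

# context c_i is a weighted sum of the encoder states h_j
context = paddle.networks.simple_attention(
    encoded_sequence=enc_vec,
    encoded_proj=enc_proj,
    decoder_state=decoder_mem)
```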
4. Define the name for the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
```python
decoder_group_name = "decoder_group"
...
group_inputs = [group_input1, group_input2]
```
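The elided lines wrap the encoder outputs as read-only inputs of the recurrent group; a sketch assuming the `StaticInputV2` wrapper that parallels the `GeneratedInputV2` used below:

```python
# read-only memories: the encoder output and its projection
group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
```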
5. Training mode:
- word embedding from the target language trg_embedding is passed to `gru_decoder_with_attention` as current_word.
- `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way.
- multi-class cross-entropy (`classification_cost`) is used to calculate the cost (see the sketch below).
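A sketch of the training-mode wiring described above; the data layer names `target_language_word` and `target_language_next_word` are assumptions, while the embedding parameter name matches the `_target_language_embedding` used in generating mode below:

```python
if not is_generating:
    trg_embedding = paddle.layer.embedding(
        input=paddle.layer.data(
            name='target_language_word',
            type=paddle.data_type.integer_value_sequence(target_dict_dim)),
        size=word_vector_dim,
        param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
    group_inputs.append(trg_embedding)

    # recurrent_group unrolls gru_decoder_with_attention over the target sequence
    decoder = paddle.layer.recurrent_group(
        name=decoder_group_name,
        step=gru_decoder_with_attention,
        input=group_inputs)

    # the sequence of next words from the target language serves as the label
    lbl = paddle.layer.data(
        name='target_language_next_word',
        type=paddle.data_type.integer_value_sequence(target_dict_dim))
    cost = paddle.layer.classification_cost(input=decoder, label=lbl)
```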
6. Generating mode:
- the decoder predicts a next target word based on the last generated target word. Embedding of the last generated word is automatically retrieved by GeneratedInputs.
- `beam_search` calls `gru_decoder_with_attention` in a recurrent way, to predict sequence ids.

```python
if is_generating:
    # In generation, the decoder predicts a next target word based on
    # the encoded source sequence and the last generated target word.
    # The encoded source sequence (encoder's output) must be specified by
    # StaticInput, which is a read-only memory.
    # Embedding of the last generated word is automatically gotten by
    # GeneratedInputs, which is initialized by a start mark, such as <s>,
    # and must be included in generation.

    trg_embedding = paddle.layer.GeneratedInputV2(
        size=target_dict_dim,
        embedding_name='_target_language_embedding',
        embedding_size=word_vector_dim)
    group_inputs.append(trg_embedding)

    beam_gen = paddle.layer.beam_search(
        name=decoder_group_name,
        step=gru_decoder_with_attention,
        input=group_inputs,
        bos_id=0,
        eos_id=1,
        beam_size=beam_size,
        max_length=max_length)
```
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
## Model Training

1. Create Parameters
Create every parameter that the `cost` layer needs. We can also get the parameter names; if a parameter's name is not specified during model configuration, it will be generated automatically.
```python
if not is_generating:
    parameters = paddle.parameters.create(cost)
    for param in parameters.keys():
        print param
```
2. Define DataSet
Create a [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for the WMT-14 dataset, as sketched below.
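A minimal sketch of the reader construction, assuming the `paddle.dataset.wmt14.train` reader together with the `paddle.batch` and `paddle.reader.shuffle` helpers of the v2 API; the buffer and batch sizes are illustrative:

```python
if not is_generating:
    # shuffle within a buffer of samples, then group them into mini-batches
    wmt14_reader = paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.wmt14.train(dict_size), buf_size=8192),
        batch_size=5)
```

3. Create trainer

The trainer construction is not shown in this section; a sketch under the assumption that the Adam optimizer of the v2 API is used:

```python
if not is_generating:
    optimizer = paddle.optimizer.Adam(
        learning_rate=5e-5,
        regularization=paddle.optimizer.L2Regularization(rate=8e-4))
    trainer = paddle.trainer.SGD(
        cost=cost, parameters=parameters, update_equation=optimizer)
```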
The model training is successful when the `classification_error_evaluator` is lower than 0.35.
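For completeness, training is launched by passing the reader and an event handler to `trainer.train`; a sketch assuming the v2 event API, where `event.metrics` includes the `classification_error_evaluator` value (a feeding spec mapping data layer names to reader columns may also be required):

```python
if not is_generating:
    def event_handler(event):
        # report cost and metrics (including classification_error_evaluator)
        # every 10 batches
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 10 == 0:
                print "Pass %d, Batch %d, Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics)

    trainer.train(
        reader=wmt14_reader, event_handler=event_handler, num_passes=2)
```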
## Model Usage

1. Download Pre-trained Model

As the training of an NMT model is very time consuming, we provide a pre-trained model. The model was trained on a cluster of 50 physical nodes (each node has two 6-core CPUs) for 16 passes over about 5 days, with each pass taking about 7 hours. The provided model (pass-00012, ~205M) has the highest [BLEU Score](#BLEU Score) of 26.92.
```python
if is_generating:
    parameters = paddle.dataset.wmt14.model()
```

The pre-trained model can also be downloaded with the provided script:

```bash
cd pretrained
./wmt14_model.sh
```
2. Define DataSet

Get the first 3 samples of the wmt14 generating set as the source language sequences.

```python
if is_generating:
    gen_creator = paddle.dataset.wmt14.gen(dict_size)
    gen_data = []
    gen_num = 3
    for item in gen_creator():
        gen_data.append((item[0], ))
        if len(gen_data) == gen_num:
            break
```

3. Create infer

Use the inference interface `paddle.infer` to return the prediction probability (see field `prob`) and labels (see field `id`) of each generated sequence.

```python
if is_generating:
    beam_result = paddle.infer(
        output_layer=beam_gen,
        parameters=parameters,
        input=gen_data,
        field=['prob', 'id'])
```

4. Print generated translation

Print each source sequence and its `beam_size` generated translation results based on the dictionary.

### BLEU Evaluation

BLEU (Bilingual Evaluation Understudy) is a metric widely used for automatic machine translation, proposed by IBM Watson Research Center in 2002\[[5](#References)\]. The closer the translation produced by a machine is to the translation produced by a human expert, the better the performance of the translation system.

To measure the closeness between machine translation and human translation, sentence precision is used. It compares the number of matched n-grams; more matches lead to a higher BLEU score.

[Moses](http://www.statmt.org/moses/) is an open-source machine translation system, and we use its [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) script for BLEU evaluation. Run the following command to download it:

```bash
./moses_bleu.sh
```

BLEU evaluation can be performed using the `eval_bleu` script as follows, where FILE is the name of the file to be evaluated, BEAMSIZE is the beam size value, and `data/wmt14/gen/ntst14.trg` is used as the reference translation by default:

```bash
./eval_bleu.sh FILE BEAMSIZE
```

Specifically, the script is run as follows:

```bash
./eval_bleu.sh gen_result 3
```

You will see the following message as output:

```text
BLEU = 26.92
```