- We will concatenate the output of the top LSTM unit with its input, and project the result into a hidden layer. Then, we put a fully connected layer on top to get the final feature vector representation.
- In PaddlePaddle, the state features and transition features of a CRF are implemented by a fully connected layer and a CRF layer respectively. The fully connected layer with linear activation learns the state features; here we use `paddle.layer.mixed` (`paddle.layer.fc` can be used as well). The CRF layer in PaddlePaddle, `paddle.layer.crf`, only learns the transition features; it is a cost layer and is the last layer of the network. Given the input sequence, `paddle.layer.crf` outputs the log probability of the true tag sequence as the cost, and it requires the true tag sequence as the target during learning.
```python
# The output of the top LSTM unit and its input are fed into a fully connected layer,
# whose size equals the number of tag labels.
# The fully connected layer learns the state features.
feature_out = paddle.layer.mixed(
    size=label_dict_len,
    bias_attr=std_default,
    input=[
        paddle.layer.full_matrix_projection(
            input=input_tmp[0], param_attr=hidden_para_attr),
        paddle.layer.full_matrix_projection(
            input=input_tmp[1], param_attr=lstm_para_attr)
    ], )
```
- At the end of the network, we use CRF as the cost function; the parameter of CRF cost will be named `crfw`.
```python
crf_cost = paddle.layer.crf(
    size=label_dict_len,
    input=feature_out,
    ...
        learning_rate=mix_hidden_lr))
```
- The CRF decoding layer is used for evaluation and inference. It shares weights with the CRF layer; the sharing of parameters among multiple layers is specified by using the same parameter name in these layers. If the true tag sequence is provided during training, `paddle.layer.crf_decoding` calculates the labelling error for each input token and `evaluator.sum` sums the errors over the entire sequence. Otherwise, `paddle.layer.crf_decoding` generates the labelling tags.
```python
crf_dec = paddle.layer.crf_decoding(
    ...
```
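For reference, a hedged sketch of what the elided arguments may look like, built only from what the bullets above state (the shared parameter name `crfw`, the `feature_out` input, and `evaluator.sum`); the variable name `target` for the true tag sequence and the exact argument list are assumptions, not the verbatim configuration:

```python
# Hedged sketch. `target` (the true tag sequence) is an assumed variable name;
# sharing the parameter name 'crfw' ties these weights to the CRF cost layer above.
# Assumes `import paddle.v2.evaluator as evaluator`.
crf_dec = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=feature_out,
    label=target,
    param_attr=paddle.attr.Param(name='crfw'))
evaluator.sum(input=crf_dec)
```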
As mentioned in the data preparation section, we will use the CoNLL 2005 test corpus as the test set.
```python
reader = paddle.batch(
    paddle.reader.shuffle(
        conll05.test(), buf_size=8192), batch_size=2)
```
`feeding` is used to specify the correspondence between data instances and data layers. For example, according to the following `feeding`, the 0th column of a data instance produced by `conll05.test()` is matched to the data layer named `word_data`.
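For concreteness, a minimal sketch of such a `feeding` dict is given below. Only the `word_data` entry at column 0 is stated in the text above; the remaining layer names and column indices are assumptions based on the input layers of this SRL model.

```python
# Hedged sketch: only 'word_data': 0 is given in the text; the other entries
# are assumed to follow the order of the SRL input layers.
feeding = {
    'word_data': 0,
    'ctx_n2_data': 1,
    'ctx_n1_data': 2,
    'ctx_0_data': 3,
    'ctx_p1_data': 4,
    'ctx_p2_data': 5,
    'verb_data': 6,
    'mark_data': 7,
    'target_value': 8
}
```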
After training and with a beam-search size of 3, the generated translations are as follows:
```text
0 -5.36816 These are signs of hope and relief . <e>
1 -6.23177 These are the light of hope and relief . <e>
2 -7.7914 These are the light of hope and the relief of hope . <e>
```
- The first column corresponds to the id of the generated sentence; the second column corresponds to the score of the generated sentence (in descending order), where a larger value indicates better quality; the last column corresponds to the generated sentence.
- There are two special tokens: `<e>` denotes the end of a sentence while `<unk>` denotes unknown word, i.e., a word not in the training dictionary.
1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\in \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$ where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the location corresponding to the location of the word in the dictionary and zero elsewhere.
2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation
...
...
```python
import paddle.v2 as paddle

# train with a single CPU
paddle.init(use_gpu=False, trainer_count=1)
```
### Define DataSet
We will define the dictionary size, and create a [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for the WMT-14 dataset.
```python
source_dict_dim = dict_size # source language dictionary size
target_dict_dim = dict_size # destination language dictionary size
word_vector_dim = 512 # word embedding dimension
encoder_size = 512 # hidden layer size of GRU in encoder
decoder_size = 512 # hidden layer size of GRU in decoder
beam_size = 3 # expand width in beam search
max_length = 250 # a stop condition of sequence generation
```
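Note that `dict_size` is used above but not defined in this excerpt; a minimal, assumption-labeled definition consistent with the WMT-14 reader used in this chapter might look like:

```python
# Assumed value: the vocabulary size passed to the WMT-14 data reader.
dict_size = 30000
```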
2. Implement Encoder as follows:
- Input is a sequence of words represented by an integer word index sequence. So we define a data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`.
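A hedged sketch of such a data layer (the layer name `source_language_word` is an assumption; the data type and value range follow the bullet above):

```python
src_word_id = paddle.layer.data(
    name='source_language_word',  # assumed layer name
    type=paddle.data_type.integer_value_sequence(source_dict_dim))
```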
- Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network
```python
with paddle.layer.mixed(size=decoder_size) as encoded_proj:
    ...
        input=encoded_vector)
```
- Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
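A hedged sketch of this step, assuming the backward GRU output of the encoder is available as `src_backward`:

```python
# The backward GRU reads the sentence in reverse, so its final state h_T
# corresponds to the first element of its output sequence, hence first_seq.
# A tanh projection of it gives the initial decoder state; names are assumptions.
backward_first = paddle.layer.first_seq(input=src_backward)
with paddle.layer.mixed(
        size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
    decoder_boot += paddle.layer.full_matrix_projection(input=backward_first)
```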
- Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word.
- `decoder_mem` records the hidden state $z_i$ from the previous time step, with its initial state set to `decoder_boot`.
- context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec and enc_proj are both projections of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`. A sketch of this step is given after the snippet below.
```python
    ...
    return out
```
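For the two bullets above, a hedged sketch of how the memory and the attention context are typically set up inside `gru_decoder_with_attention` (the memory name `gru_decoder` is an assumption; `paddle.networks.simple_attention` is the v2 helper referred to as `simple_attention`):

```python
# decoder_mem reads z_{i-1}; at the first step it is bootstrapped with decoder_boot.
decoder_mem = paddle.layer.memory(
    name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)

# context c_i = sum_j a_ij * h_j, computed from the (projected) encoder
# outputs and the previous decoder state.
context = paddle.networks.simple_attention(
    encoded_sequence=enc_vec,
    encoded_proj=enc_proj,
    decoder_state=decoder_mem)
```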
4. Define the name for the decoder and the first two input for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
```python
decoder_group_name = "decoder_group"
...
group_inputs = [group_input1, group_input2]
```
5. Training mode:
- the word embedding from the target language, `trg_embedding`, is passed to `gru_decoder_with_attention` as `current_word`.
- `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way
...
- multi-class cross-entropy (`classification_cost`) is used to calculate the cost
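A hedged sketch of the training branch described by these bullets (data layer names such as `target_language_word` and `target_language_next_word` are assumptions; the embedding parameter name `_target_language_embedding` follows the generation code below):

```python
if not is_generating:
    # Embedding of the current target-language word, fed to the decoder step.
    trg_embedding = paddle.layer.embedding(
        input=paddle.layer.data(
            name='target_language_word',  # assumed layer name
            type=paddle.data_type.integer_value_sequence(target_dict_dim)),
        size=word_vector_dim,
        param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
    group_inputs.append(trg_embedding)

    # recurrent_group unrolls gru_decoder_with_attention over the target sequence.
    decoder = paddle.layer.recurrent_group(
        name=decoder_group_name,
        step=gru_decoder_with_attention,
        input=group_inputs)

    # The next word is the label; classification_cost is the multi-class cross-entropy.
    lbl = paddle.layer.data(
        name='target_language_next_word',  # assumed layer name
        type=paddle.data_type.integer_value_sequence(target_dict_dim))
    cost = paddle.layer.classification_cost(input=decoder, label=lbl)
```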
6. Generating mode:
- the decoder predicts a next target word based on the last generated target word; the embedding of the last generated word is automatically obtained by `GeneratedInput`.
- `beam_search` calls `gru_decoder_with_attention` in a recurrent way, to predict the sequence ids.
```python
if is_generating:
    # In generation, the decoder predicts a next target word based on
    # the encoded source sequence and the last generated target word.

    # The encoded source sequence (encoder's output) must be specified by
    # StaticInput, which is a read-only memory.
    # Embedding of the last generated word is automatically gotten by
    # GeneratedInputs, which is initialized by a start mark, such as <s>,
    # and must be included in generation.

    trg_embedding = paddle.layer.GeneratedInputV2(
        size=target_dict_dim,
        embedding_name='_target_language_embedding',
        embedding_size=word_vector_dim)
    group_inputs.append(trg_embedding)

    beam_gen = paddle.layer.beam_search(
        name=decoder_group_name,
        step=gru_decoder_with_attention,
        input=group_inputs,
        bos_id=0,
        eos_id=1,
        beam_size=beam_size,
        max_length=max_length)
```
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
## Model Training
1. Create Parameters
Create every parameter that the `cost` layer needs. We can also get the parameter names: if a parameter name is not specified during model configuration, it will be generated.
```python
if not is_generating:
    parameters = paddle.parameters.create(cost)
    for param in parameters.keys():
        print param
```
2. Define DataSet
Create the [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for the WMT-14 dataset, as sketched below.
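A hedged sketch of this reader, following the `paddle.batch`/`paddle.reader.shuffle` pattern used for the CoNLL test reader earlier (the `batch_size` value is an assumption):

```python
if not is_generating:
    wmt14_reader = paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.wmt14.train(dict_size), buf_size=8192),
        batch_size=5)  # assumed batch size
```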
The model training is successful when the `classification_error_evaluator` is lower than 0.35.
## Model Usage
1. Download Pre-trained Model
As the training of an NMT model is very time consuming, we provide a pre-trained model. The model was trained on a cluster of 50 physical nodes (each node has two 6-core CPUs) for about 5 days. The provided model achieves a [BLEU Score](#BLEU Score) of 26.92 and is about 205M in size. Run the following command to download the model:
```bash
cd pretrained
./wmt14_model.sh
```
```python
if is_generating:
    parameters = paddle.dataset.wmt14.model()
```
2. Define DataSet
Get the first 3 samples of the wmt14 generating set as the source language sequences.
```python
if is_generating:
    gen_creator = paddle.dataset.wmt14.gen(dict_size)
    gen_data = []
    gen_num = 3
    for item in gen_creator():
        gen_data.append((item[0], ))
        if len(gen_data) == gen_num:
            break
```
3. Create infer
Use the inference interface `paddle.infer` to return the prediction probability (see field `prob`) and labels (see field `id`) of each generated sequence.
```python
if is_generating:
    beam_result = paddle.infer(
        output_layer=beam_gen,
        parameters=parameters,
        input=gen_data,
        field=['prob', 'id'])
```
4. Print generated translation
Print each source sequence and its `beam_size` generated translation results based on the dictionary.
### BLEU Evaluation
BLEU (Bilingual Evaluation Understudy) is a metric widely used for automatic machine translation, proposed by IBM Watson Research Center in 2002\[[5](#References)\]. The closer the translation produced by a machine is to the translation produced by a human expert, the better the performance of the translation system.
To measure the closeness between machine translation and human translation, sentence precision is used. It compares the number of matched n-grams; more matches lead to higher BLEU scores.
[Moses](http://www.statmt.org/moses/) is an open-source machine translation system, and we use its [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) script for BLEU evaluation. Run the following command to download it:
```bash
./moses_bleu.sh
```
BLEU evaluation can be performed using the `eval_bleu` script as follows, where FILE is the name of the file to be evaluated, BEAMSIZE is the beam size value, and `data/wmt14/gen/ntst14.trg` is used as the reference translation by default.
```bash
./eval_bleu.sh FILE BEAMSIZE
```
Specifically, the script is run as follows:
```bash
./eval_bleu.sh gen_result 3
```
You will see the following message as output:
```text
BLEU = 26.92
```
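To make the n-gram matching idea concrete, here is a small, self-contained toy example of clipped n-gram precision (illustrative only; it is not the Moses `multi-bleu.perl` implementation, which combines several n-gram orders and a brevity penalty):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate sentence against one reference."""
    cand_grams = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand_counts = Counter(cand_grams)
    # Count each candidate n-gram at most as often as it appears in the reference.
    matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return float(matched) / max(len(cand_grams), 1)

candidate = "these are the light of hope and relief".split()
reference = "these are signs of hope and relief".split()
print(ngram_precision(candidate, reference, 2))  # ~0.57: 4 of 7 candidate bigrams match
```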