From 0c13254605b87ac14ecf8e3833d768db54ac4ef1 Mon Sep 17 00:00:00 2001
From: Luo Tao
Date: Thu, 20 Apr 2017 19:10:09 +0800
Subject: [PATCH] update seq2seq english md file

---
 07.machine_translation/README.en.md  | 303 ++++++++++++++++-----------
 07.machine_translation/README.md     |  27 +--
 07.machine_translation/index.en.html | 303 ++++++++++++++++-----------
 07.machine_translation/index.html    |  27 +--
 4 files changed, 398 insertions(+), 262 deletions(-)

diff --git a/07.machine_translation/README.en.md b/07.machine_translation/README.en.md
index 1314018..951cd1e 100644
--- a/07.machine_translation/README.en.md
+++ b/07.machine_translation/README.en.md
@@ -213,25 +213,8 @@ import paddle.v2 as paddle
 # train with a single CPU
 paddle.init(use_gpu=False, trainer_count=1)
-```
-
-### Define DataSet
-
-We will define dictionary size, and create [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for WMT-14 dataset.
-
-```python
-# source and target dict dim.
-dict_size = 30000
-
-feeding = {
-    'source_language_word': 0,
-    'target_language_word': 1,
-    'target_language_next_word': 2
-}
-wmt14_reader = paddle.batch(
-    paddle.reader.shuffle(
-        paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
-    batch_size=5)
+# False: training, True: generating
+is_generating = False
 ```

### Model Configuration

1. Define some global variables

    ```python
+   dict_size = 30000 # source and target dictionary size
    source_dict_dim = dict_size # source language dictionary size
    target_dict_dim = dict_size # target language dictionary size
    word_vector_dim = 512 # word embedding dimension
    encoder_size = 512 # hidden layer size of GRU in encoder
    decoder_size = 512 # hidden layer size of GRU in decoder
+   beam_size = 3 # beam width in beam search
+   max_length = 250 # maximum length of generated sequences, a stop condition
    ```

-1. Implement Encoder as follows:
-   1. Input is a sequence of words represented by an integer word index sequence. So we define data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`
+2. Implement Encoder as follows:
+   - The input is a sequence of words represented by an integer word-index sequence, so we define a data layer of type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`.

    ```python
    src_word_id = paddle.layer.data(
        name='source_language_word',
        type=paddle.data_type.integer_value_sequence(source_dict_dim))
    ```

-   1. Map the one-hot vector (represented by word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space
+   - Map the one-hot vector (represented by a word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space.

    ```python
    src_embedding = paddle.layer.embedding(
        input=src_word_id,
        size=word_vector_dim,
        param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
    ```

-   1. Use bi-direcitonal GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
+   - Use a bi-directional GRU to encode the source language sequence, and concatenate the encoding outputs of the two GRUs to get $\mathbf{h}$.

    ```python
    src_forward = paddle.networks.simple_gru(
        input=src_embedding, size=encoder_size)
    src_backward = paddle.networks.simple_gru(
        input=src_embedding, size=encoder_size, reverse=True)
    encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
    ```
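    The three encoder steps compose naturally into one function. The sketch below is an illustration only, not part of the tutorial code: `build_encoder` is a hypothetical helper name, but every layer call in it is exactly the one shown above.

    ```python
    # A minimal sketch collecting the encoder steps above into one helper.
    def build_encoder(source_dict_dim, word_vector_dim=512, encoder_size=512):
        src_word_id = paddle.layer.data(
            name='source_language_word',
            type=paddle.data_type.integer_value_sequence(source_dict_dim))
        src_embedding = paddle.layer.embedding(
            input=src_word_id,
            size=word_vector_dim,
            param_attr=paddle.attr.ParamAttr(
                name='_source_language_embedding'))
        # run a GRU in each direction and concatenate, so every source
        # position is represented by a 2 * encoder_size dimensional vector
        src_forward = paddle.networks.simple_gru(
            input=src_embedding, size=encoder_size)
        src_backward = paddle.networks.simple_gru(
            input=src_embedding, size=encoder_size, reverse=True)
        return paddle.layer.concat(input=[src_forward, src_backward])
    ```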
-1. Implement Attention-based Decoder as follows:
+3. Implement Attention-based Decoder as follows:

-   1. Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network
+   - Get a projection of the encoding (cf. 2.3) of the source language sequence by passing it into a feed-forward neural network.

    ```python
    with paddle.layer.mixed(size=decoder_size) as encoded_proj:
        encoded_proj += paddle.layer.full_matrix_projection(
            input=encoded_vector)
    ```

-   1. Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
+   - Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN, $c_0=h_T$.

    ```python
    backward_first = paddle.layer.first_seq(input=src_backward)
    with paddle.layer.mixed(
            size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
        decoder_boot += paddle.layer.full_matrix_projection(
            input=backward_first)
    ```

-   1. Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word.
+   - Define the computation in each time step for the decoder RNN, i.e., from the current context vector $c_i$, the decoder hidden state $z_i$ and the $i$-th word $u_i$ of the target language, predict the probability $p_{i+1}$ of the $(i+1)$-th word.

     - decoder_mem records the hidden state $z_i$ from the previous time step, with decoder_boot as its initial state.
     - context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (cf. 3.1). $a_{ij}$ is calculated within `simple_attention`.

        return out
    ```
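    To see what the attention step computes, here is a minimal NumPy sketch of additive (Bahdanau-style) attention. It is an illustration only: the weights are random placeholders and the exact parameterization inside `simple_attention` may differ, but the flow mirrors $c_i=\sum_{j=1}^{T}a_{ij}h_j$.

    ```python
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    T, H = 4, 8                    # toy source length and hidden size
    rng = np.random.RandomState(0)
    h = rng.randn(T, H)            # encoder states h_j (enc_vec)
    h_proj = rng.randn(T, H)       # projections of h_j (enc_proj)
    z = rng.randn(H)               # current decoder state z_i
    W = rng.randn(H, H)            # placeholder projection of z_i
    v = rng.randn(H)               # placeholder scoring vector

    scores = np.tanh(h_proj + z.dot(W)).dot(v)  # one score per source word
    a = softmax(scores)                         # weights a_ij, summing to 1
    c = a.dot(h)                                # context c_i = sum_j a_ij * h_j
    ```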
-1. Define the name for the decoder and the first two input for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
+4. Define the name for the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.

    ```python
    decoder_group_name = "decoder_group"
    group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
    group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
    group_inputs = [group_input1, group_input2]
    ```

-1. Training mode:
+5. Training mode:

+   - The word embedding of the target language, trg_embedding, is passed to `gru_decoder_with_attention` as current_word.
+   - `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way.
+   - The sequence of next words from the target language is used as the label (lbl).
+   - Multi-class cross-entropy (`classification_cost`) is used to calculate the cost.

    ```python
    if not is_generating:
        trg_embedding = paddle.layer.embedding(
            input=paddle.layer.data(
                name='target_language_word',
                type=paddle.data_type.integer_value_sequence(target_dict_dim)),
            size=word_vector_dim,
            param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
        group_inputs.append(trg_embedding)

        # For a decoder equipped with an attention mechanism, in training,
        # the target embedding (the ground truth) is the data input,
        # while the encoded source sequence is accessed as an unbounded memory.
        # Here, the StaticInput defines a read-only memory
        # for the recurrent_group.
        decoder = paddle.layer.recurrent_group(
            name=decoder_group_name,
            step=gru_decoder_with_attention,
            input=group_inputs)

        lbl = paddle.layer.data(
            name='target_language_next_word',
            type=paddle.data_type.integer_value_sequence(target_dict_dim))
        cost = paddle.layer.classification_cost(input=decoder, label=lbl)
    ```

+6. Generating mode:
+
+   - The decoder predicts the next target word based on the last generated target word; the embedding of the last generated word is retrieved automatically by GeneratedInput.
+   - `beam_search` calls `gru_decoder_with_attention` in a recurrent way to predict the word ids of the output sequence.

    ```python
    if is_generating:
        # In generation, the decoder predicts the next target word based on
        # the encoded source sequence and the last generated target word.

        # The encoded source sequence (the encoder's output) must be specified
        # by StaticInput, which is a read-only memory.
        # The embedding of the last generated word is retrieved automatically
        # by GeneratedInput, which is initialized by a start mark, such as <s>,
        # and must be included in generation.

        trg_embedding = paddle.layer.GeneratedInputV2(
            size=target_dict_dim,
            embedding_name='_target_language_embedding',
            embedding_size=word_vector_dim)
        group_inputs.append(trg_embedding)

        beam_gen = paddle.layer.beam_search(
            name=decoder_group_name,
            step=gru_decoder_with_attention,
            input=group_inputs,
            bos_id=0,
            eos_id=1,
            beam_size=beam_size,
            max_length=max_length)
    ```

Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
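To make the behavior of `beam_search` concrete, the toy sketch below runs the same expand-and-prune loop on a synthetic four-word vocabulary. Everything here is made up for illustration: `next_log_probs` stands in for one decoder step, while bos_id, eos_id, beam_size and max_length play the same roles as in the layer call above.

```python
import heapq

def next_log_probs(seq):
    # stand-in for the model: log-probabilities of the next word (4 words)
    table = {0: [-9.0, -2.0, -0.5, -1.5]}
    return table.get(seq[-1], [-9.0, -0.7, -1.2, -2.3])

def toy_beam_search(bos_id=0, eos_id=1, beam_size=3, max_length=5):
    beams = [(0.0, [bos_id])]        # (cumulative log-prob, sequence)
    finished = []
    for _ in range(max_length):
        candidates = []
        for score, seq in beams:
            for w, lp in enumerate(next_log_probs(seq)):
                candidates.append((score + lp, seq + [w]))
        # prune: keep only the best beam_size partial sequences
        beams = heapq.nlargest(beam_size, candidates)
        still_open = []
        for score, seq in beams:
            (finished if seq[-1] == eos_id else still_open).append((score, seq))
        beams = still_open
        if not beams:
            break
    finished.extend(beams)           # sequences cut off by max_length
    return heapq.nlargest(beam_size, finished)

print toy_beam_search()
```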
-### Create Parameters
+## Model Training

-Create every parameter that `cost` layer needs.
+1. Create Parameters

-```python
-parameters = paddle.parameters.create(cost)
-```

+    Create every parameter that the `cost` layer needs, and retrieve the parameter names. If a parameter name is not specified during model configuration, it will be generated automatically.

-We can get parameter names. If the parameter name is not specified during model configuration, it will be generated.

+    ```python
+    if not is_generating:
+        parameters = paddle.parameters.create(cost)
+        for param in parameters.keys():
+            print param
+    ```

-```python
-for param in parameters.keys():
-    print param
-```

-## Model Training

+2. Define DataSet
+
+    Create the [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for the WMT-14 dataset.

-1. Create trainer

+    ```python
+    if not is_generating:
+        wmt14_reader = paddle.batch(
+            paddle.reader.shuffle(
+                paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
+            batch_size=5)
+    ```
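    As a quick sanity check of the data format (illustration only, not part of the tutorial code): judging from the feeding order used by the original configuration, each sample yielded by the reader should be a tuple of three integer id sequences, namely the source words, the target words, and the target words shifted by one as next-word labels.

    ```python
    # Peek at one raw training sample. The reader creator returns a function
    # whose call yields samples; the field order is assumed from the original
    # feeding dict: source words, target words, target next words.
    src_ids, trg_ids, trg_next_ids = next(
        paddle.dataset.wmt14.train(dict_size=dict_size)())
    print len(src_ids), trg_ids[:5], trg_next_ids[:5]
    ```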
-## Model Usage + ```python + if is_generating: + gen_creator = paddle.dataset.wmt14.gen(dict_size) + gen_data = [] + gen_num = 3 + for item in gen_creator(): + gen_data.append((item[0], )) + if len(gen_data) == gen_num: + break + ``` -### Download Pre-trained Model +3. Create infer -As the training of an NMT model is very time consuming, we provide a pre-trained model (pass-00012, ~205M). The model is trained with a cluster of 50 physical nodes (each node has two 6-core CPU). We trained 16 passes (taking about 5 days) with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. Run the following command to download the model: + Use inference interface `paddle.infer` return the prediction probability (see field `prob`) and labels (see field `id`) of each generated sequence. -```bash -cd pretrained -./wmt14_model.sh -``` + ```python + if is_generating: + beam_result = paddle.infer( + output_layer=beam_gen, + parameters=parameters, + input=gen_data, + field=['prob', 'id']) + ``` +4. Print generated translation -### BLEU Evaluation + Print sequence and its `beam_size` generated translation results based on the dictionary. -BLEU (Bilingual Evaluation understudy) is a metric widely used for automatic machine translation proposed by IBM Watson Research Center in 2002\[[5](#References)\]. The closer the translation produced by a machine is to the translation produced by a human expert, the better the performance of the translation system. -To measure the closeness between machine translation and human translation, sentence precision is used. It compares the number of matched n-grams. More matches will lead to higher BLEU scores. + ```python + if is_generating: + # get the dictionary + src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size) + + # the delimited element of generated sequences is -1, + # the first element of each generated sequence is the sequence length + seq_list = [] + seq = [] + for w in beam_result[1]: + if w != -1: + seq.append(w) + else: + seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]])) + seq = [] + + prob = beam_result[0] + for i in xrange(gen_num): + print "\n*******************************************************\n" + print "src:", ' '.join( + [src_dict.get(w) for w in gen_data[i][0]]), "\n" + for j in xrange(beam_size): + print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j] + ``` -[Moses](http://www.statmt.org/moses/) is an open-source machine translation system, we used [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) for BLEU evaluation. Run the following command for downloading: -```bash -./moses_bleu.sh -``` -BLEU evaluation can be performed using the `eval_bleu` script as follows, where FILE is the name of the file to be evaluated, BEAMSIZE is the beam size value, and `data/wmt14/gen/ntst14.trg` is used as the standard translation in default. -```bash -./eval_bleu.sh FILE BEAMSIZE -``` -Specificaly, the script is run as follows: -```bash -./eval_bleu.sh gen_result 3 -``` -You will see the following message as output: -```text -BLEU = 26.92 -``` + The generating log is as follows: + ```text + src: Les se au sujet de la largeur des sièges alors que de grosses commandes sont en jeu + + prob = -19.019573: The will be rotated about the width of the seats , while large orders are at stake . + prob = -19.113066: The will be rotated about the width of the seats , while large commands are at stake . 
-### BLEU Evaluation

+4. Print generated translation
+
+    Print each source sequence and its `beam_size` translation candidates, converted back to words with the dictionary.

-BLEU (Bilingual Evaluation understudy) is a metric widely used for automatic machine translation proposed by IBM Watson Research Center in 2002\[[5](#References)\]. The closer the translation produced by a machine is to the translation produced by a human expert, the better the performance of the translation system.
-To measure the closeness between machine translation and human translation, sentence precision is used. It compares the number of matched n-grams. More matches will lead to higher BLEU scores.

-[Moses](http://www.statmt.org/moses/) is an open-source machine translation system, we used [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) for BLEU evaluation. Run the following command for downloading:
-```bash
-./moses_bleu.sh
-```
-BLEU evaluation can be performed using the `eval_bleu` script as follows, where FILE is the name of the file to be evaluated, BEAMSIZE is the beam size value, and `data/wmt14/gen/ntst14.trg` is used as the standard translation in default.
-```bash
-./eval_bleu.sh FILE BEAMSIZE
-```
-Specificaly, the script is run as follows:
-```bash
-./eval_bleu.sh gen_result 3
-```
-You will see the following message as output:
-```text
-BLEU = 26.92
-```

+    ```python
+    if is_generating:
+        # get the dictionary
+        src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
+
+        # sequences in the id field are delimited by -1;
+        # the first element of each generated sequence is the sequence length
+        seq_list = []
+        seq = []
+        for w in beam_result[1]:
+            if w != -1:
+                seq.append(w)
+            else:
+                seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
+                seq = []
+
+        prob = beam_result[0]
+        for i in xrange(gen_num):
+            print "\n*******************************************************\n"
+            print "src:", ' '.join(
+                [src_dict.get(w) for w in gen_data[i][0]]), "\n"
+            for j in xrange(beam_size):
+                print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
+    ```
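    The id-to-word conversion above relies on the WMT-14 dictionaries. A small sketch of that round trip (the bos and eos ids follow the `beam_search` call above; treating id 2 as the unknown-word mark is an assumption about the dataset's convention):

    ```python
    # Illustration only: look up a few special ids in the target dictionary.
    src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
    print [trg_dict.get(i) for i in [0, 1, 2]]  # start, end, unknown marks
    ```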
+    The generating log is as follows:
+    ```text
+    src: Les <unk> se <unk> au sujet de la largeur des sièges alors que de grosses commandes sont en jeu
+
+    prob = -19.019573: The <unk> will be rotated about the width of the seats , while large orders are at stake .
+    prob = -19.113066: The <unk> will be rotated about the width of the seats , while large commands are at stake .
+    prob = -19.512890: The <unk> will be rotated about the width of the seats , while large commands are at play .
+    ```

 ## Summary

diff --git a/07.machine_translation/README.md b/07.machine_translation/README.md
index 0b63425..58f613b 100644
--- a/07.machine_translation/README.md
+++ b/07.machine_translation/README.md
@@ -416,13 +416,13 @@ is_generating = False
         reader=wmt14_reader, event_handler=event_handler, num_passes=2)
     ```

-    After training starts, the logs printed by event_handler look as follows:
-
-    Pass 0, Batch 0, Cost 148.444983, {'classification_error_evaluator': 1.0}
-    .........
-    Pass 0, Batch 10, Cost 335.896802, {'classification_error_evaluator': 0.9325153231620789}
-    .........
-
+    After training starts, the logs printed by event_handler look as follows:
+    ```text
+    Pass 0, Batch 0, Cost 148.444983, {'classification_error_evaluator': 1.0}
+    .........
+    Pass 0, Batch 10, Cost 335.896802, {'classification_error_evaluator': 0.9325153231620789}
+    .........
+    ```

 ### Generating Model

@@ -490,13 +490,14 @@ is_generating = False
         print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
     ```

-    After generation starts, the output logs look as follows:
+    After generation starts, the output logs look as follows:
+    ```text
+    src: Les <unk> se <unk> au sujet de la largeur des sièges alors que de grosses commandes sont en jeu

-    src: Les <unk> se <unk> au sujet de la largeur des sièges alors que de grosses commandes sont en jeu
-
-    prob = -19.019573: The <unk> will be rotated about the width of the seats , while large orders are at stake .
-    prob = -19.113066: The <unk> will be rotated about the width of the seats , while large commands are at stake .
-    prob = -19.512890: The <unk> will be rotated about the width of the seats , while large commands are at play .
+    prob = -19.019573: The <unk> will be rotated about the width of the seats , while large orders are at stake .
+    prob = -19.113066: The <unk> will be rotated about the width of the seats , while large commands are at stake .
+    prob = -19.512890: The <unk> will be rotated about the width of the seats , while large commands are at play .
+    ```

 ## Summary

diff --git a/07.machine_translation/index.en.html b/07.machine_translation/index.en.html
index 4c5e2c3..03a47d1 100644
--- a/07.machine_translation/index.en.html
+++ b/07.machine_translation/index.en.html
@@ -255,25 +255,8 @@ import paddle.v2 as paddle
 # train with a single CPU
 paddle.init(use_gpu=False, trainer_count=1)
-```
-
-### Define DataSet
-
-We will define dictionary size, and create [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for WMT-14 dataset.
-
-```python
-# source and target dict dim.
-dict_size = 30000
-
-feeding = {
-    'source_language_word': 0,
-    'target_language_word': 1,
-    'target_language_next_word': 2
-}
-wmt14_reader = paddle.batch(
-    paddle.reader.shuffle(
-        paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
-    batch_size=5)
+# False: training, True: generating
+is_generating = False
 ```

### Model Configuration

1. Define some global variables

    ```python
+   dict_size = 30000 # source and target dictionary size
    source_dict_dim = dict_size # source language dictionary size
    target_dict_dim = dict_size # target language dictionary size
    word_vector_dim = 512 # word embedding dimension
    encoder_size = 512 # hidden layer size of GRU in encoder
    decoder_size = 512 # hidden layer size of GRU in decoder
+   beam_size = 3 # beam width in beam search
+   max_length = 250 # maximum length of generated sequences, a stop condition
    ```

-1. Implement Encoder as follows:
-   1. Input is a sequence of words represented by an integer word index sequence. So we define data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`
+2. Implement Encoder as follows:
+   - The input is a sequence of words represented by an integer word-index sequence, so we define a data layer of type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`.

    ```python
    src_word_id = paddle.layer.data(
        name='source_language_word',
        type=paddle.data_type.integer_value_sequence(source_dict_dim))
    ```

-   1. Map the one-hot vector (represented by word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space
+   - Map the one-hot vector (represented by a word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space.

    ```python
    src_embedding = paddle.layer.embedding(
        input=src_word_id,
        size=word_vector_dim,
        param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
    ```

-   1. Use bi-direcitonal GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
+   - Use a bi-directional GRU to encode the source language sequence, and concatenate the encoding outputs of the two GRUs to get $\mathbf{h}$.

    ```python
    src_forward = paddle.networks.simple_gru(
        input=src_embedding, size=encoder_size)
    src_backward = paddle.networks.simple_gru(
        input=src_embedding, size=encoder_size, reverse=True)
    encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
    ```

-1. Implement Attention-based Decoder as follows:
+3. Implement Attention-based Decoder as follows:

-   1. Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network
+   - Get a projection of the encoding (cf. 2.3) of the source language sequence by passing it into a feed-forward neural network.

    ```python
    with paddle.layer.mixed(size=decoder_size) as encoded_proj:
        encoded_proj += paddle.layer.full_matrix_projection(
            input=encoded_vector)
    ```

-   1. Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
+   - Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN, $c_0=h_T$.

    ```python
    backward_first = paddle.layer.first_seq(input=src_backward)
    with paddle.layer.mixed(
            size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
        decoder_boot += paddle.layer.full_matrix_projection(
            input=backward_first)
    ```

-   1. Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word.
+   - Define the computation in each time step for the decoder RNN, i.e., from the current context vector $c_i$, the decoder hidden state $z_i$ and the $i$-th word $u_i$ of the target language, predict the probability $p_{i+1}$ of the $(i+1)$-th word.

     - decoder_mem records the hidden state $z_i$ from the previous time step, with decoder_boot as its initial state.
     - context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (cf. 3.1). $a_{ij}$ is calculated within `simple_attention`.

        return out
    ```
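    Since `gru_step` is used here as a black box, a compact NumPy sketch of one GRU step $z_{i+1}=\phi(c_i, u_i, z_i)$ may help. Illustration only: the weights are random placeholders rather than the layer's actual parameterization, and the input x plays the role of decoder_inputs (context plus current word embedding).

    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    H, D = 8, 16                   # toy hidden and input sizes
    rng = np.random.RandomState(1)
    Wu, Wr, Wc = (rng.randn(D + H, H) for _ in range(3))  # placeholders

    def gru_step(x, z_prev):
        u = sigmoid(np.concatenate([x, z_prev]).dot(Wu))  # update gate
        r = sigmoid(np.concatenate([x, z_prev]).dot(Wr))  # reset gate
        z_cand = np.tanh(np.concatenate([x, r * z_prev]).dot(Wc))
        return (1 - u) * z_prev + u * z_cand              # next state z_{i+1}

    z = np.zeros(H)                # analogue of decoder_boot
    z = gru_step(rng.randn(D), z)
    ```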
-1. Define the name for the decoder and the first two input for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
+4. Define the name for the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.

    ```python
    decoder_group_name = "decoder_group"
    group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
    group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
    group_inputs = [group_input1, group_input2]
    ```

-1. Training mode:
+5. Training mode:

+   - The word embedding of the target language, trg_embedding, is passed to `gru_decoder_with_attention` as current_word.
+   - `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way.
+   - The sequence of next words from the target language is used as the label (lbl).
+   - Multi-class cross-entropy (`classification_cost`) is used to calculate the cost.

    ```python
    if not is_generating:
        trg_embedding = paddle.layer.embedding(
            input=paddle.layer.data(
                name='target_language_word',
                type=paddle.data_type.integer_value_sequence(target_dict_dim)),
            size=word_vector_dim,
            param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
        group_inputs.append(trg_embedding)

        # For a decoder equipped with an attention mechanism, in training,
        # the target embedding (the ground truth) is the data input,
        # while the encoded source sequence is accessed as an unbounded memory.
        # Here, the StaticInput defines a read-only memory
        # for the recurrent_group.
        decoder = paddle.layer.recurrent_group(
            name=decoder_group_name,
            step=gru_decoder_with_attention,
            input=group_inputs)

        lbl = paddle.layer.data(
            name='target_language_next_word',
            type=paddle.data_type.integer_value_sequence(target_dict_dim))
        cost = paddle.layer.classification_cost(input=decoder, label=lbl)
    ```
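    What `classification_cost` minimizes is the per-time-step multi-class cross-entropy over the target dictionary. A tiny numeric illustration with toy probabilities over a four-word vocabulary (not the layer's implementation):

    ```python
    import numpy as np

    # predicted next-word distributions, one row per time step
    probs = np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.2, 0.5, 0.2, 0.1]])
    labels = [0, 1]  # observed next words (lbl)

    # average negative log-likelihood of the observed words
    cost = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
    print cost       # -(log 0.7 + log 0.5) / 2, roughly 0.525
    ```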
+6. Generating mode:
+
+   - The decoder predicts the next target word based on the last generated target word; the embedding of the last generated word is retrieved automatically by GeneratedInput.
+   - `beam_search` calls `gru_decoder_with_attention` in a recurrent way to predict the word ids of the output sequence.

    ```python
    if is_generating:
        # In generation, the decoder predicts the next target word based on
        # the encoded source sequence and the last generated target word.

        # The encoded source sequence (the encoder's output) must be specified
        # by StaticInput, which is a read-only memory.
        # The embedding of the last generated word is retrieved automatically
        # by GeneratedInput, which is initialized by a start mark, such as <s>,
        # and must be included in generation.

        trg_embedding = paddle.layer.GeneratedInputV2(
            size=target_dict_dim,
            embedding_name='_target_language_embedding',
            embedding_size=word_vector_dim)
        group_inputs.append(trg_embedding)

        beam_gen = paddle.layer.beam_search(
            name=decoder_group_name,
            step=gru_decoder_with_attention,
            input=group_inputs,
            bos_id=0,
            eos_id=1,
            beam_size=beam_size,
            max_length=max_length)
    ```

Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.

-### Create Parameters
+## Model Training

-Create every parameter that `cost` layer needs.
+1. Create Parameters

-```python
-parameters = paddle.parameters.create(cost)
-```

+    Create every parameter that the `cost` layer needs, and retrieve the parameter names. If a parameter name is not specified during model configuration, it will be generated automatically.

-We can get parameter names. If the parameter name is not specified during model configuration, it will be generated.

+    ```python
+    if not is_generating:
+        parameters = paddle.parameters.create(cost)
+        for param in parameters.keys():
+            print param
+    ```

-```python
-for param in parameters.keys():
-    print param
-```

-## Model Training

+2. Define DataSet
+
+    Create the [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for the WMT-14 dataset.

-1. Create trainer

+    ```python
+    if not is_generating:
+        wmt14_reader = paddle.batch(
+            paddle.reader.shuffle(
+                paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
+            batch_size=5)
+    ```
+3. Create trainer

    We need to tell the trainer what to optimize, and how. Here the trainer will optimize the `cost` layer using stochastic gradient descent (SGD).

    ```python
-   optimizer = paddle.optimizer.Adam(
-       learning_rate=5e-5,
-       regularization=paddle.optimizer.L2Regularization(rate=8e-4))
-   trainer = paddle.trainer.SGD(cost=cost,
-                                parameters=parameters,
-                                update_equation=optimizer)
+   if not is_generating:
+       optimizer = paddle.optimizer.Adam(
+           learning_rate=5e-5,
+           regularization=paddle.optimizer.L2Regularization(rate=8e-4))
+       trainer = paddle.trainer.SGD(cost=cost,
+                                    parameters=parameters,
+                                    update_equation=optimizer)
    ```
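    Despite the class name `SGD`, the update rule supplied here is Adam's. One parameter step looks roughly like the sketch below (standard Adam equations; the beta and epsilon defaults are assumptions, not values read from PaddlePaddle):

    ```python
    import numpy as np

    def adam_step(w, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
                  eps=1e-8, l2=8e-4):
        grad = grad + l2 * w                     # L2 regularization
        m = beta1 * m + (1 - beta1) * grad       # first-moment average
        v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment average
        m_hat = m / (1 - beta1 ** t)             # bias corrections
        v_hat = v / (1 - beta2 ** t)
        return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

    w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
    w, m, v = adam_step(w, np.array([0.1, -0.2, 0.3]), m, v, t=1)
    ```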
-1. Define event handler
+4. Define event handler

    The event handler is a callback function invoked by the trainer when an event happens. Here we print training logs in the event handler.

    ```python
-   def event_handler(event):
-       if isinstance(event, paddle.event.EndIteration):
-           if event.batch_id % 10 == 0:
-               print "\nPass %d, Batch %d, Cost %f, %s" % (
-                   event.pass_id, event.batch_id, event.cost, event.metrics)
+   if not is_generating:
+       def event_handler(event):
+           if isinstance(event, paddle.event.EndIteration):
+               if event.batch_id % 2 == 0:
+                   print "\nPass %d, Batch %d, Cost %f, %s" % (
+                       event.pass_id, event.batch_id, event.cost, event.metrics)
    ```

-1. Start training
+5. Start training

    ```python
-   trainer.train(
-       reader=wmt14_reader,
-       event_handler=event_handler,
-       num_passes=2,
-       feeding=feeding)
+   if not is_generating:
+       trainer.train(
+           reader=wmt14_reader, event_handler=event_handler, num_passes=2)
    ```

-   ```text
-   Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
-   Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
-   ...
+   The training log is as follows:
+   ```text
+   Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
+   Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
+   ...
+   ```
+
+## Model Usage
+
+1. Download Pre-trained Model
+
+    As the training of an NMT model is very time consuming, we provide a pre-trained model. The model was trained on a cluster of 50 physical nodes (each with two 6-core CPUs) over 5 days. The provided model has a [BLEU Score](#BLEU Score) of 26.92 and a size of 205 MB.
+
+    ```python
+    if is_generating:
+        parameters = paddle.dataset.wmt14.model()
+    ```
+2. Define DataSet

-The model training is successful when the `classification_error_evaluator` is lower than 0.35.

+    Take the first 3 samples of the WMT-14 generation set as the source language sequences.

-## Model Usage

+    ```python
+    if is_generating:
+        gen_creator = paddle.dataset.wmt14.gen(dict_size)
+        gen_data = []
+        gen_num = 3
+        for item in gen_creator():
+            gen_data.append((item[0], ))
+            if len(gen_data) == gen_num:
+                break
+    ```

-### Download Pre-trained Model

+3. Run inference

-As the training of an NMT model is very time consuming, we provide a pre-trained model (pass-00012, ~205M). The model is trained with a cluster of 50 physical nodes (each node has two 6-core CPU). We trained 16 passes (taking about 5 days) with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. Run the following command to download the model:

+    Use the inference interface `paddle.infer` to return the prediction probability (see field `prob`) and the labels (see field `id`) of each generated sequence.

-```bash
-cd pretrained
-./wmt14_model.sh
-```

+    ```python
+    if is_generating:
+        beam_result = paddle.infer(
+            output_layer=beam_gen,
+            parameters=parameters,
+            input=gen_data,
+            field=['prob', 'id'])
+    ```

-### BLEU Evaluation

-BLEU (Bilingual Evaluation understudy) is a metric widely used for automatic machine translation proposed by IBM Watson Research Center in 2002\[[5](#References)\]. The closer the translation produced by a machine is to the translation produced by a human expert, the better the performance of the translation system.
-To measure the closeness between machine translation and human translation, sentence precision is used. It compares the number of matched n-grams. More matches will lead to higher BLEU scores.
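Although the shell-based BLEU evaluation is removed by this patch, the n-gram matching idea described in the deleted paragraph is easy to make concrete. A much simplified sketch (clipped unigram precision only; real BLEU, as computed by multi-bleu.perl, combines 1-4-gram precisions with a brevity penalty):

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    # count n-grams on both sides, then clip candidate counts by reference
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return float(overlap) / max(1, sum(cand.values()))

cand = "the cat is on the mat".split()
ref = "there is a cat on the mat".split()
print ngram_precision(cand, ref, n=1)   # 5 of 6 words matched, about 0.83
```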
+ ```python + if is_generating: + # get the dictionary + src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size) + + # the delimited element of generated sequences is -1, + # the first element of each generated sequence is the sequence length + seq_list = [] + seq = [] + for w in beam_result[1]: + if w != -1: + seq.append(w) + else: + seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]])) + seq = [] + + prob = beam_result[0] + for i in xrange(gen_num): + print "\n*******************************************************\n" + print "src:", ' '.join( + [src_dict.get(w) for w in gen_data[i][0]]), "\n" + for j in xrange(beam_size): + print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j] + ``` -[Moses](http://www.statmt.org/moses/) is an open-source machine translation system, we used [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) for BLEU evaluation. Run the following command for downloading: -```bash -./moses_bleu.sh -``` -BLEU evaluation can be performed using the `eval_bleu` script as follows, where FILE is the name of the file to be evaluated, BEAMSIZE is the beam size value, and `data/wmt14/gen/ntst14.trg` is used as the standard translation in default. -```bash -./eval_bleu.sh FILE BEAMSIZE -``` -Specificaly, the script is run as follows: -```bash -./eval_bleu.sh gen_result 3 -``` -You will see the following message as output: -```text -BLEU = 26.92 -``` + The generating log is as follows: + ```text + src: Les se au sujet de la largeur des sièges alors que de grosses commandes sont en jeu + + prob = -19.019573: The will be rotated about the width of the seats , while large orders are at stake . + prob = -19.113066: The will be rotated about the width of the seats , while large commands are at stake . + prob = -19.512890: The will be rotated about the width of the seats , while large commands are at play . + ``` ## Summary diff --git a/07.machine_translation/index.html b/07.machine_translation/index.html index 840a5e3..c624fbd 100644 --- a/07.machine_translation/index.html +++ b/07.machine_translation/index.html @@ -458,13 +458,13 @@ is_generating = False reader=wmt14_reader, event_handler=event_handler, num_passes=2) ``` - 训练开始后,可以观察到event_handler输出的日志如下: - - Pass 0, Batch 0, Cost 148.444983, {'classification_error_evaluator': 1.0} - ......... - Pass 0, Batch 10, Cost 335.896802, {'classification_error_evaluator': 0.9325153231620789} - ......... - + 训练开始后,可以观察到event_handler输出的日志如下: + ```text + Pass 0, Batch 0, Cost 148.444983, {'classification_error_evaluator': 1.0} + ......... + Pass 0, Batch 10, Cost 335.896802, {'classification_error_evaluator': 0.9325153231620789} + ......... + ``` ### 生成模型 @@ -532,13 +532,14 @@ is_generating = False print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j] ``` - 生成开始后,可以观察到输出的日志如下: + 生成开始后,可以观察到输出的日志如下: + ```text + src: Les se au sujet de la largeur des sièges alors que de grosses commandes sont en jeu - src: Les se au sujet de la largeur des sièges alors que de grosses commandes sont en jeu - - prob = -19.019573: The will be rotated about the width of the seats , while large orders are at stake . - prob = -19.113066: The will be rotated about the width of the seats , while large commands are at stake . - prob = -19.512890: The will be rotated about the width of the seats , while large commands are at play . + prob = -19.019573: The will be rotated about the width of the seats , while large orders are at stake . 
-    prob = -19.113066: The <unk> will be rotated about the width of the seats , while large commands are at stake .
-    prob = -19.512890: The <unk> will be rotated about the width of the seats , while large commands are at play .
+    prob = -19.019573: The <unk> will be rotated about the width of the seats , while large orders are at stake .
+    prob = -19.113066: The <unk> will be rotated about the width of the seats , while large commands are at stake .
+    prob = -19.512890: The <unk> will be rotated about the width of the seats , while large commands are at play .
+    ```

 ## Summary
-- 
GitLab