# Neural Machine Translation Model

## Background Introduction
Neural Machine Translation (NMT) is a new architecture that lets machines learn to translate end to end. Traditional machine translation methods are mainly phrase-based statistical approaches built from separately engineered subcomponents (rules or statistical models), whereas NMT models rely on deep learning and representation learning. This example describes how to construct an end-to-end NMT model using a recurrent neural network (RNN) in PaddlePaddle.

## Model Overview
Neural machine translation typically follows the encoder-decoder architecture, and the recurrent neural network (RNN) is the most common choice for both the encoder and the decoder. Figure 1 shows a typical encoder-decoder setup for NMT.

<p align="center">
Figure 1. Encoder-Decoder framework
</p>

The input and output of a neural machine translation model can be characters, words, or phrases. This example illustrates word-based NMT.

- **Encoder**: Encodes the source language sentence into a vector that serves as input to the decoder. The original input to the encoder is the word `id` sequence $w = \{w_1, w_2, ..., w_T\}$, where each `id` corresponds to a one-hot encoded word. To reduce the input dimension and to capture semantic associations between words, each one-hot encoded word is mapped to a word embedding (word vector). For more information about word vectors, please refer to the PaddleBook [word vector](https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/README.cn.md) chapter. Finally, the RNN unit processes the input word by word to obtain the encoding vector of the complete sentence.

- **Decoder**: Accepts the encoder's output and decodes the target-language sequence $u = \{u_1, u_2, ..., u_{T'}\}$ word by word. At each time step, the RNN unit outputs a hidden vector, from which the conditional probability of the next target word, $P(u_t|w, u_1, u_2, ..., u_{t-1})$, is obtained by `Softmax` normalization. Thus, given the input $w$, the corresponding translation result $u$ is

$$u = \arg\max_{u}\prod_{t=1}^{T'}P(u_t|w, u_1, u_2, ..., u_{t-1})$$
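
As a side note, the step from one-hot codes to word embeddings in the encoder is just a table lookup: multiplying a one-hot vector by the embedding matrix selects the corresponding row. The following NumPy sketch (with toy, made-up sizes) illustrates this equivalence:

```python
import numpy as np

# Toy sizes for illustration only; real models use much larger dictionaries.
vocab_size, emb_dim = 10, 4
emb_matrix = np.random.rand(vocab_size, emb_dim)  # learned during training

word_ids = [3, 7, 1]  # an id sequence w = {w_1, w_2, w_3}

# One-hot encoding of each id, multiplied by the embedding matrix ...
one_hot = np.eye(vocab_size)[word_ids]    # shape: (3, vocab_size)
emb_via_matmul = one_hot.dot(emb_matrix)  # shape: (3, emb_dim)
# ... is equivalent to directly looking up the corresponding rows.
emb_via_lookup = emb_matrix[word_ids]

assert np.allclose(emb_via_matmul, emb_via_lookup)
```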

Take Chinese-to-English translation as an example: the source language is Chinese and the target language is English. The following is a source-language sentence after word segmentation:

```
Wish the motherland prosperity
```

The corresponding English translation is:

```
Wish motherland rich and powerful
```

In the preprocessing step, we prepare parallel corpus data for the source and target languages and build a dictionary for each language. In the training stage, the model is trained on the paired parallel corpus. In the testing stage, the model generates the English translation for each input, and the generated translations are evaluated against reference translations. BLEU is the most commonly used evaluation metric.
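
As a minimal illustration of the evaluation step (not part of this example's code), sentence-level BLEU can be computed with NLTK, assuming it is installed; the tokenized sentences below are made up:

```python
from nltk.translate.bleu_score import sentence_bleu

# Hypothetical tokenized sentences, for illustration only.
reference = ["wish", "motherland", "rich", "and", "powerful"]
candidate = ["wish", "the", "motherland", "rich", "and", "powerful"]

# sentence_bleu takes a list of reference sentences and one candidate.
score = sentence_bleu([reference], candidate)
print("sentence-level BLEU: %.4f" % score)
```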

### RNN unit
The original RNN structure stores the hidden state in a single vector. However, such an RNN is prone to the vanishing gradient problem, which makes it hard to model long-term dependencies. This issue can be addressed by using LSTM \[[1](#references)\] or GRU (Gated Recurrent Unit) \[[2](#references)\] units, which handle long-term dependencies by selectively forgetting previous information through gates. In this example, we demonstrate a GRU-based model.

<p align="center">
Figure 2. GRU unit
</p>

In addition to the hidden state, the GRU contains two gates: the update gate and the reset gate. At each time step, the gates and the hidden state are updated according to the formulas on the right side of Figure 2, and these two gates determine how the hidden state is updated.
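
Since the formulas from Figure 2 are not reproduced here, the following NumPy sketch spells out one common formulation of a single GRU step (weight names are made up and biases are omitted; PaddlePaddle's implementation may differ in details):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step for input x_t and previous hidden state h_prev."""
    z = sigmoid(Wz.dot(x_t) + Uz.dot(h_prev))            # update gate
    r = sigmoid(Wr.dot(x_t) + Ur.dot(h_prev))            # reset gate
    h_tilde = np.tanh(Wh.dot(x_t) + Uh.dot(r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde              # new hidden state

# Toy dimensions for illustration only.
input_dim, hidden_dim = 4, 3
rng = np.random.RandomState(0)
Wz, Wr, Wh = (rng.randn(hidden_dim, input_dim) for _ in range(3))
Uz, Ur, Uh = (rng.randn(hidden_dim, hidden_dim) for _ in range(3))
h = np.zeros(hidden_dim)
for x in rng.randn(5, input_dim):  # process a sequence of 5 inputs
    h = gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh)
```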

### Bi-directional Encoder
In the basic model above, the encoder processes the input sentence sequentially, so the state at the current time step contains only information about past inputs and nothing about the future part of the sequence. For sequence modeling, the future context also carries important information. With a bi-directional encoder (Figure 3), both kinds of information can be captured at the same time:
<p align="center">
Figure 3. Bi-directional encoder structure diagram
</p>

The bi-directional encoder \[[3](#references)\] shown in Figure 3 consists of two independent RNNs that encode the input sequence in the forward and backward directions, respectively. The outputs of the two RNNs are then concatenated as the final encoding output.

In PaddlePaddle, a bi-directional encoder can be built easily by calling the corresponding API:

```python
src_word_id = paddle.layer.data(
    name='source_language_word',
    type=paddle.data_type.integer_value_sequence(source_dict_dim))

# source embedding
src_embedding = paddle.layer.embedding(
    input=src_word_id, size=word_vector_dim)

# bidirectional GRU as encoder
encoded_vector = paddle.networks.bidirectional_gru(
    input=src_embedding,
    size=encoder_size,
    fwd_act=paddle.activation.Tanh(),
    fwd_gate_act=paddle.activation.Sigmoid(),
    bwd_act=paddle.activation.Tanh(),
    bwd_gate_act=paddle.activation.Sigmoid(),
    return_seq=True)
```

### Beam Search Algorithm
After training is complete, the model decodes the corresponding target-language translation for a given source-language input. A straightforward decoding strategy is to pick, at each step, the word with the largest conditional probability as the output of the current time step. However, such a locally optimal choice does not necessarily lead to a globally optimal sequence, while searching the full space is prohibitively expensive. Beam search is a heuristic graph search algorithm commonly used to address this problem; it controls the search width with a parameter $k$ and works as follows:

**1**. During decoding, always maintain $k$ decoded sub-sequences;

**2**. At each intermediate time step $t$, for each of the $k$ sub-sequences, compute the probability of the next word and take the $k$ words with the largest probabilities, forming $k^2$ new candidate sub-sequences;

**3**. Keep the $k$ candidates with the largest probabilities among these combined sequences and use them to update the set of sub-sequences;

**4**. Repeat until $k$ complete sentences are obtained as candidate translation results. A minimal sketch of this procedure is shown below.
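
The following is a minimal, self-contained Python sketch of this procedure over a generic `next_word_probs(prefix)` function (a made-up callable returning the next-word distribution); it is illustrative only and independent of the PaddlePaddle `beam_search()` layer used later:

```python
import math

def beam_search(next_word_probs, bos_id, eos_id, k, max_length):
    """next_word_probs(prefix) -> dict mapping word id to P(word | prefix)."""
    beams = [(0.0, [bos_id])]  # each entry: (accumulated log prob, id sequence)
    finished = []
    for _ in range(max_length):
        candidates = []
        for log_prob, seq in beams:
            # Expand each kept sub-sequence with its k most probable next
            # words, giving at most k^2 new candidate sub-sequences.
            probs = next_word_probs(seq)
            top_k = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
            for word_id, p in top_k:
                candidates.append((log_prob + math.log(p), seq + [word_id]))
        # Keep only the k candidates with the largest probabilities.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for log_prob, seq in candidates[:k]:
            if seq[-1] == eos_id:
                finished.append((log_prob, seq))  # a complete sentence
            else:
                beams.append((log_prob, seq))
        if len(finished) >= k or not beams:
            break
    return sorted(finished + beams, key=lambda c: c[0], reverse=True)[:k]
```

In practice, the decoder's `Softmax` output at each time step plays the role of `next_word_probs`.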

For more information on beam search, refer to the beam search section of the PaddleBook [machine translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.cn.md) chapter.


### Decoder without attention mechanism
The attention mechanism has already been introduced in the [machine translation chapter of PaddleBook](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.cn.md). This example demonstrates the Encoder-Decoder structure without the attention mechanism. For details on the attention mechanism, please refer to PaddleBook and reference \[[3](#references)\].

In PaddlePaddle, commonly used RNN units can be called conveniently through APIs. For example, `recurrent_layer_group` can be used to implement custom behavior at each time step of the RNN: first define the single-step logic as a function, then use `recurrent_group()` to apply that function to every element of the sequence. In this example, the decoder without attention uses `recurrent_layer_group`, and the single-step logic is implemented in the function `gru_decoder_without_attention()`. The corresponding code is as follows:


```python
# the initialization state for decoder GRU
encoder_last = paddle.layer.last_seq(input=encoded_vector)
encoder_last_projected = paddle.layer.fc(
    size=decoder_size, act=paddle.activation.Tanh(), input=encoder_last)

# the step function for decoder GRU
def gru_decoder_without_attention(enc_vec, current_word):
    '''
    Step function for gru decoder
    :param enc_vec: encoded vector of source language
    :type enc_vec: layer object
    :param current_word: current input of decoder
    :type current_word: layer object
    '''
    decoder_mem = paddle.layer.memory(
        name="gru_decoder",
        size=decoder_size,
        boot_layer=encoder_last_projected)

    context = paddle.layer.last_seq(input=enc_vec)

    decoder_inputs = paddle.layer.fc(
        size=decoder_size * 3, input=[context, current_word])

    gru_step = paddle.layer.gru_step(
        name="gru_decoder",
        act=paddle.activation.Tanh(),
        gate_act=paddle.activation.Sigmoid(),
        input=decoder_inputs,
        output_mem=decoder_mem,
        size=decoder_size)

    out = paddle.layer.fc(
        size=target_dict_dim,
        bias_attr=True,
        act=paddle.activation.Softmax(),
        input=gru_step)
    return out  
```

The decoder behaves differently in the training and testing phases:

- **Training phase**: The word embeddings of the target translation, `trg_embedding`, are passed as input to the single-step logic `gru_decoder_without_attention()`. The function `recurrent_group()` calls the single-step logic in a loop, and the cost is computed against the ground-truth translation;
- **Testing phase**: The decoder predicts the next word based on the previously generated words. `GeneratedInput()` automatically fetches the embeddings of the $k$ words with the highest predicted probabilities and passes them to the single-step logic. The `beam_search()` function then calls `gru_decoder_without_attention()` to complete the beam search, and the search result is returned.

The training and generation branches are implemented in the following `if-else` conditional:

```python
group_input1 = paddle.layer.StaticInput(input=encoded_vector)
group_inputs = [group_input1]

decoder_group_name = "decoder_group"
if is_generating:
    trg_embedding = paddle.layer.GeneratedInput(
        size=target_dict_dim,
        embedding_name="_target_language_embedding",
        embedding_size=word_vector_dim)
    group_inputs.append(trg_embedding)

    beam_gen = paddle.layer.beam_search(
        name=decoder_group_name,
        step=gru_decoder_without_attention,
        input=group_inputs,
        bos_id=0,
        eos_id=1,
        beam_size=beam_size,
        max_length=max_length)

    return beam_gen
else:
    trg_embedding = paddle.layer.embedding(
        input=paddle.layer.data(
            name="target_language_word",
            type=paddle.data_type.integer_value_sequence(target_dict_dim)),
        size=word_vector_dim,
        param_attr=paddle.attr.ParamAttr(name="_target_language_embedding"))
    group_inputs.append(trg_embedding)

    decoder = paddle.layer.recurrent_group(
        name=decoder_group_name,
        step=gru_decoder_without_attention,
        input=group_inputs)

    lbl = paddle.layer.data(
        name="target_language_next_word",
        type=paddle.data_type.integer_value_sequence(target_dict_dim))
    cost = paddle.layer.classification_cost(input=decoder, label=lbl)

    return cost
```

## Data Preparation
The data used in this example comes from [WMT14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), a parallel corpus for French-to-English translation. We use [bitexts](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) as training data and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as validation and test data. PaddlePaddle provides a packaged reader interface for this dataset; on the first run, the program downloads it automatically, so no manual data preparation is needed.
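
As a quick sanity check, the packaged reader can be inspected directly. The sketch below assumes each training sample contains the source word-id sequence, the target word-id sequence, and the target sequence shifted by one position (the next-word labels); verify against your PaddlePaddle version:

```python
import paddle.v2 as paddle

dict_size = 30000
# Peek at the first training sample (triggers the automatic download on first run).
reader = paddle.dataset.wmt14.train(dict_size)
first_sample = next(reader())
src_ids, trg_ids, trg_next_ids = first_sample
print(len(src_ids), len(trg_ids), len(trg_next_ids))
```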

## Model Training and Testing

### Model Training

Starting model training is very simple: just execute `python train.py` in a command-line window. The `train()` function in the `train.py` script carries out the following steps:

**a) Define the network, parse the network structure, and initialize the model parameters.**

```python
# define the network topology
cost = seq2seq_net(source_dict_dim, target_dict_dim)
parameters = paddle.parameters.create(cost)
```

**b) Set the optimization strategy for training and define the training data `reader`.**

```python
# define optimization method
optimizer = paddle.optimizer.RMSProp(
    learning_rate=1e-3,
    gradient_clipping_threshold=10.0,
    regularization=paddle.optimizer.L2Regularization(rate=8e-4))

# define the trainer instance
trainer = paddle.trainer.SGD(
    cost=cost, parameters=parameters, update_equation=optimizer)

# define data reader
wmt14_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.wmt14.train(source_dict_dim), buf_size=8192),
    batch_size=55)
```

**c) Define the event handler to print intermediate training results and save model snapshots.**

```python
# define the event_handler callback
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if not event.batch_id % 100 and event.batch_id:
            with gzip.open(
                    os.path.join(save_path,
                                 "nmt_without_att_%05d_batch_%05d.tar.gz" %
                                 (event.pass_id, event.batch_id)), "w") as f:
                parameters.to_tar(f)

        if event.batch_id and not event.batch_id % 10:
            logger.info("Pass %d, Batch %d, Cost %f, %s" %
                        (event.pass_id, event.batch_id, event.cost,
                         event.metrics))
```

**d) Start training**

```python
# start training
trainer.train(
    reader=wmt14_reader, event_handler=event_handler, num_passes=2)
```

A sample of the output is:

```text
Pass 0, Batch 0, Cost 267.674663, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 172.892294, {'classification_error_evaluator': 0.953895092010498}
.........
Pass 0, Batch 20, Cost 177.989329, {'classification_error_evaluator': 0.9052488207817078}
.........
Pass 0, Batch 30, Cost 153.633665, {'classification_error_evaluator': 0.8643803596496582}
.........
Pass 0, Batch 40, Cost 168.170543, {'classification_error_evaluator': 0.8348183631896973}
```

### Generate Translation Results
In PaddlePaddle, it is also easy to use a trained model to generate translations.

1. First, modify the parameters that the `main` function of the `generate.py` script passes to the `generate()` function, in order to choose which saved model to use. The default parameters are as follows:

    ```python
    generate(
        source_dict_dim=30000,
        target_dict_dim=30000,
        batch_size=20,
        beam_size=3,
        model_path="models/nmt_without_att_params_batch_00100.tar.gz")
    ```

2. Then, execute the `python generate.py` command in the terminal. The `generate()` function in the script executes the following code:

    **a) Load the test samples**

    ```python
    # load data samples for generation
    gen_creator = paddle.dataset.wmt14.gen(source_dict_dim)
    gen_data = []
    for item in gen_creator():
        gen_data.append((item[0], ))
    ```

    **b) Initialize the model and execute `infer()` on the input samples to generate the beam search translation results**

    ```python
    beam_gen = seq2seq_net(source_dict_dim, target_dict_dim, True)
    with gzip.open(init_models_path) as f:
        parameters = paddle.parameters.Parameters.from_tar(f)
    # prob contains the prediction probabilities; id contains the predicted word ids.
    beam_result = paddle.infer(
        output_layer=beam_gen,
        parameters=parameters,
        input=gen_data,
        field=['prob', 'id'])
    ```

    **c) Next, load the source and target language dictionaries, convert the `id` sequences back into words, and output the results.**

    ```python
    beam_result = inferer.infer(input=test_batch, field=["prob", "id"])

    gen_sen_idx = np.where(beam_result[1] == -1)[0]
    assert len(gen_sen_idx) == len(test_batch) * beam_size

    start_pos, end_pos = 1, 0
    for i, sample in enumerate(test_batch):
        print(" ".join([
            src_dict[w] for w in sample[0][1:-1]
        ]))  # skip the start and end marks when printing the source sentence
        for j in xrange(beam_size):
            end_pos = gen_sen_idx[i * beam_size + j]
            print("%.4f\t%s" % (beam_result[0][i][j], " ".join(
                trg_dict[w] for w in beam_result[1][start_pos:end_pos])))
            start_pos = end_pos + 2
        print("\n")
    ```

Set the beam search width to 3 and input a French sentence; the model then automatically generates the corresponding translations. The output format is as follows:

```text
Elles connaissent leur entreprise mieux que personne.
-3.754819 They know their business better than anyone. <E>
-4.445528 They know their businesses better than anyone. <E>
-5.026885 They know their business better than anybody. <E>

```
- The first line is the input source-language sentence.
- Lines 2 through `beam_size` + 1 are the `beam_size` translation results generated by the beam search.
    - Each of these lines is split into two columns by "\t": the first column is the log probability of the sentence, and the second column is the text of the translation result.
    - The symbol `<s>` marks the beginning of a sentence, the symbol `<e>` marks the end of a sentence, and any word not included in the dictionary is replaced with the symbol `<unk>`.

So far, we have implemented a basic machine translation model with PaddlePaddle. As we can see, PaddlePaddle provides a flexible and rich API that lets users easily configure a variety of complex networks. NMT itself is a rapidly developing field in which new ideas continue to emerge; this example is a basic implementation, and users can build more sophisticated NMT models on top of it with PaddlePaddle.


## References
[1] Sutskever I, Vinyals O, Le Q V. [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)[J]. 2014, 4: 3104-3112.

[2] Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](http://www.aclweb.org/anthology/D/D14/D14-1179.pdf)[C]. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1724-1734.

[3] Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[C]. Proceedings of ICLR 2015, 2015.