# Neural Machine Translation Model

## Background Introduction
Neural Machine Translation (NMT) is an end-to-end architecture for getting machines to learn to translate. Traditional machine translation methods are mainly phrase-based statistical approaches built from separately engineered subcomponents (rules or statistical models), whereas NMT models rely on deep learning and representation learning. This example describes how to construct an end-to-end neural machine translation (NMT) model using a recurrent neural network (RNN) in PaddlePaddle.

## Model Overview
RNN-based neural machine translation follows the encoder-decoder architecture, with a recurrent neural network (RNN) as the common choice for both the encoder and the decoder. Below is a diagram of the general approach for NMT.

<p align="center"><img src="images/encoder-decoder.png" width = "90%" align="center"/><br/>Figure 1. Encoder-Decoder framework</p>

The input and output units of a neural machine translation model can be characters, words, or phrases. This example illustrates word-based NMT.

- **Encoder**: Encodes the source language sentence into a vector that serves as input to the decoder. The raw input to the encoder is the word `id` sequence $w = {w_1, w_2, ..., w_T}$, where each `id` is expressed as a one-hot vector. To reduce the input dimensionality and to capture semantic associations between words, each one-hot vector is mapped to a word embedding (word vector). For more information about word vectors, please refer to the [word vector](https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/README.cn.md) chapter of PaddleBook. Finally, the RNN unit processes the input word by word to obtain the encoding vector of the complete sentence.

- **Decoder**: Accepts the encoder's output and decodes the target language sequence $u = {u_1, u_2, ..., u_{T'}}$ word by word. At each time step, the RNN unit outputs a hidden vector, from which the conditional probability of the next target word, $P(u_t | w, u_1, u_2, ..., u_{t-1})$, is computed via `Softmax` normalization. Thus, given the input $w$, the probability of the corresponding translation result $u$ is

$$P(u_1,u_2,...,u_{T'} | w) = \prod_{t=1}^{T'}p(u_t|w, u_1, u_2, ..., u_{t-1})$$
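
For example, for a three-word translation the product above unrolls step by step (a purely illustrative expansion of the formula):

$$P(u_1,u_2,u_3 | w) = p(u_1|w)\,p(u_2|w, u_1)\,p(u_3|w, u_1, u_2)$$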

Take Chinese-to-English translation as an example: the source language is Chinese and the target language is English. The following is a source-language sentence after word segmentation:

```
祝愿 祖国 繁荣 昌盛
```

The corresponding English translation is:

```
Wish motherland rich and powerful
```

In the preprocessing step, we prepare the parallel corpus data of the source and target languages, and then construct a dictionary for each language. In the training stage, we train the model on the paired parallel sentences. In the test stage, the model automatically generates an English translation for each input, which is then evaluated against reference translations. BLEU is the most commonly used evaluation metric.
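
To make the metric concrete, here is a simplified sentence-level BLEU sketch: the geometric mean of modified n-gram precisions multiplied by a brevity penalty. It uses uniform n-gram weights and no smoothing; real evaluations normally use a corpus-level implementation such as the `multi-bleu.perl` script, so treat this only as an illustration of the formula (the toy sentences in the usage lines are invented):

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions times a brevity penalty (no smoothing)."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # clip each candidate n-gram count by its maximum count in any reference
        max_ref_counts = Counter()
        for ref in references:
            for gram, cnt in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
        overlap = sum(min(cnt, max_ref_counts[gram])
                      for gram, cnt in cand_counts.items())
        if overlap == 0:
            return 0.0  # without smoothing, an empty n-gram overlap zeroes the score
        log_precisions.append(math.log(float(overlap) / sum(cand_counts.values())))

    # brevity penalty against the reference closest in length
    closest_ref_len = min((len(r) for r in references),
                          key=lambda l: abs(l - len(candidate)))
    brevity = (1.0 if len(candidate) >= closest_ref_len else
               math.exp(1.0 - float(closest_ref_len) / len(candidate)))
    return brevity * math.exp(sum(log_precisions) / max_n)

# toy usage (invented sentences, single reference)
cand = "wish motherland rich and powerful".split()
ref = "wish the motherland rich and powerful".split()
print("BLEU = %.4f" % bleu(cand, [ref]))
```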

### RNN unit
The original RNN structure uses a single vector to store the hidden state. An RNN of this structure is prone to the vanishing-gradient problem, which makes it difficult to model long-range dependencies. This issue can be addressed with LSTM \[[1](#References)\] or GRU (Gated Recurrent Unit) \[[2](#References)\], both of which mitigate the long-term dependency problem by selectively retaining and forgetting previous information through gating. In this example, we demonstrate a GRU-based model.

<p align="center">
<img src="images/gru.png" width = "90%" align="center"/><br/>
Figure 2. GRU unit
 </p>

We can see that, in addition to the hidden state, the GRU contains two gates: the update gate and the reset gate. At each time step, the gates and the hidden state are updated according to the formulas on the right side of Figure 2; these two gates determine how the hidden state is updated.
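
To make the gating concrete, below is a minimal NumPy sketch of one GRU step following the gate equations of \[[2](#References)\]. The weight names `W_*`/`U_*` are illustrative, and biases are omitted for brevity; in the PaddlePaddle code later in this example, this per-step computation is handled by the `gru_step` layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step; p maps names to input weights W_* and recurrent weights U_*."""
    z = sigmoid(p["W_z"].dot(x) + p["U_z"].dot(h_prev))  # update gate
    r = sigmoid(p["W_r"].dot(x) + p["U_r"].dot(h_prev))  # reset gate
    h_tilde = np.tanh(p["W_h"].dot(x) + p["U_h"].dot(r * h_prev))  # candidate state
    # the update gate interpolates between the old state and the candidate
    return z * h_prev + (1.0 - z) * h_tilde

# toy usage: 4-dim input, 3-dim hidden state
rng = np.random.RandomState(0)
p = {k: rng.randn(3, 4) if k.startswith("W") else rng.randn(3, 3)
     for k in ("W_z", "U_z", "W_r", "U_r", "W_h", "U_h")}
h = gru_step(rng.randn(4), np.zeros(3), p)
```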

### Bi-directional Encoder
In the basic model above, when the encoder processes the input sentence sequentially, the hidden state at each time step contains only information about past inputs, without any information about future positions in the sequence. For sequence modeling, future context also carries important information. With a bi-directional encoder (Figure 3), we can capture both at the same time:

<p align="center">
<img src="images/bidirectional-encoder.png" width = "90%" align="center"/><br/>
Figure 3. Bi-directional encoder structure diagram
 </p>


The bi-directional encoder \[[3](#References)\] shown in Figure 3 consists of two independent RNNs that encode the input sequence in the forward and backward directions, respectively. The outputs of the two RNNs are then combined as the final encoding output.

In PaddlePaddle, a bi-directional encoder can easily be built by calling the corresponding API:

```python
src_word_id = paddle.layer.data(
    name='source_language_word',
    type=paddle.data_type.integer_value_sequence(source_dict_dim))

# source embedding
src_embedding = paddle.layer.embedding(
    input=src_word_id, size=word_vector_dim)

# bidirectional GRU as encoder
encoded_vector = paddle.networks.bidirectional_gru(
    input=src_embedding,
    size=encoder_size,
    fwd_act=paddle.activation.Tanh(),
    fwd_gate_act=paddle.activation.Sigmoid(),
    bwd_act=paddle.activation.Tanh(),
    bwd_gate_act=paddle.activation.Sigmoid(),
    return_seq=True)
```
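
Here `return_seq=True` asks the network to return the whole sequence of concatenated forward and backward hidden states rather than only the final state; the decoder below summarizes this sequence with `paddle.layer.last_seq`.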

### Beam Search Algorithm
After training is completed, the model decodes a source-language input into the corresponding target-language translation. A straightforward decoding strategy is to greedily take the word with the largest conditional probability at each step as the output of that moment. However, such local optima do not guarantee a global optimum, and exhaustively searching the full output space is far too expensive. The beam search algorithm is commonly used to address this problem. Beam search is a heuristic graph search algorithm that controls the search width with a parameter $k$ and works as follows (a runnable sketch follows the steps):

**1**. During decoding, always maintain $k$ decoded sub-sequences;

**2**. At each intermediate time step $t$, for each of the $k$ sub-sequences, compute the probability of the next word and keep the $k$ words with the largest probability, forming $k^2$ new candidate sub-sequences;

**3**. Keep the $k$ candidates with the largest probability among these combinations to replace the original sub-sequences;

**4**. Iterate in this way until $k$ complete sentences are obtained as candidate translation results.
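
Below is a minimal, framework-independent sketch of these four steps. The callback `next_word_probs` is hypothetical: it stands in for the model's next-word distribution given the decoded prefix (probabilities assumed strictly positive). In this example, that role is played by the `paddle.layer.beam_search` call shown later.

```python
import heapq
import math

def beam_search(next_word_probs, k, bos_id=0, eos_id=1, max_length=50):
    """next_word_probs(prefix) -> {word_id: probability} of the next word."""
    beams = [(0.0, [bos_id])]  # step 1: k live hypotheses (log prob, sequence)
    finished = []
    for _ in range(max_length):
        candidates = []
        for log_p, seq in beams:  # step 2: expand each beam with every next word
            for word_id, prob in next_word_probs(tuple(seq)).items():
                candidates.append((log_p + math.log(prob), seq + [word_id]))
        beams = heapq.nlargest(k, candidates)  # step 3: keep the k best expansions
        live = []
        for log_p, seq in beams:  # completed sentences leave the beam
            (finished if seq[-1] == eos_id else live).append((log_p, seq))
        beams = live
        if not beams:  # step 4: stop once k complete sentences are collected
            break
    return heapq.nlargest(k, finished + beams)
```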

For more information on beam search, please refer to the beam search section of the [machine translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.cn.md) chapter in PaddleBook.


### Decoder without Attention mechanism
In the [machine translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.cn.md) chapter of PaddleBook, the attention mechanism has already been introduced. This example demonstrates an Encoder-Decoder structure without the attention mechanism; for the attention mechanism, please refer to PaddleBook and reference \[[3](#References)\].

In PaddlePaddle, commonly used RNN units can be conveniently invoked through APIs. For example, `recurrent_layer_group` can be used to implement custom behavior at each step of an RNN: first define the single-step logic as a function, then use `recurrent_group()` to apply that function step by step over the whole sequence. In this example, the attention-free decoder is implemented with `recurrent_layer_group`, and the single-step logic is defined in the function `gru_decoder_without_attention()`. The corresponding code is as follows:


```python
# the initialization state for decoder GRU
encoder_last = paddle.layer.last_seq(input=encoded_vector)
encoder_last_projected = paddle.layer.fc(
    size=decoder_size, act=paddle.activation.Tanh(), input=encoder_last)

# the step function for decoder GRU
def gru_decoder_without_attention(enc_vec, current_word):
    '''
    Step function for gru decoder
    :param enc_vec: encoded vector of source language
    :type enc_vec: layer object
    :param current_word: current input of decoder
    :type current_word: layer object
    '''
    decoder_mem = paddle.layer.memory(
            name="gru_decoder",
            size=decoder_size,
            boot_layer=encoder_last_projected)

    context = paddle.layer.last_seq(input=enc_vec)

    decoder_inputs = paddle.layer.fc(
        size=decoder_size * 3, input=[context, current_word])

    gru_step = paddle.layer.gru_step(
        name="gru_decoder",
        act=paddle.activation.Tanh(),
        gate_act=paddle.activation.Sigmoid(),
        input=decoder_inputs,
        output_mem=decoder_mem,
        size=decoder_size)

    out = paddle.layer.fc(
        size=target_dict_dim,
        bias_attr=True,
        act=paddle.activation.Softmax(),
        input=gru_step)
    return out  
```

In the training and testing phases, the decoder behaves differently:

- **Training phase**: The word embeddings of the target translation, `trg_embedding`, are passed to the single-step logic `gru_decoder_without_attention()`. The function `recurrent_group()` calls the single-step logic in a loop, and the decoder output is finally compared with the ground-truth translation to compute the cost;
- **Testing phase**: The decoder predicts the next word based on the previously generated words. `GeneratedInput()` automatically fetches the embeddings of the $k$ words with the highest predicted probabilities from the model and passes them to the single-step logic; `beam_search()` then calls `gru_decoder_without_attention()` to complete the beam search and returns the search result.

The training and generation logic is implemented in the following `if-else` conditional branches:

```python
group_input1 = paddle.layer.StaticInput(input=encoded_vector)
group_inputs = [group_input1]

decoder_group_name = "decoder_group"
if is_generating:
    trg_embedding = paddle.layer.GeneratedInput(
        size=target_dict_dim,
        embedding_name="_target_language_embedding",
        embedding_size=word_vector_dim)
    group_inputs.append(trg_embedding)

    beam_gen = paddle.layer.beam_search(
        name=decoder_group_name,
        step=gru_decoder_without_attention,
        input=group_inputs,
        bos_id=0,
        eos_id=1,
        beam_size=beam_size,
        max_length=max_length)

    return beam_gen
else:
    trg_embedding = paddle.layer.embedding(
        input=paddle.layer.data(
            name="target_language_word",
            type=paddle.data_type.integer_value_sequence(target_dict_dim)),
        size=word_vector_dim,
        param_attr=paddle.attr.ParamAttr(name="_target_language_embedding"))
    group_inputs.append(trg_embedding)

    decoder = paddle.layer.recurrent_group(
        name=decoder_group_name,
        step=gru_decoder_without_attention,
        input=group_inputs)

    lbl = paddle.layer.data(
        name="target_language_next_word",
        type=paddle.data_type.integer_value_sequence(target_dict_dim))
    cost = paddle.layer.classification_cost(input=decoder, label=lbl)

    return cost
```

## Data Preparation
The data used in this example is from [WMT14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), a parallel corpus for French-to-English translation. We use [bitexts](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) as training data and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as validation and test data. PaddlePaddle packages a reader interface for this dataset, and on the first run the program downloads it automatically, so no manual data preparation is needed.
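
For illustration, here is a minimal sketch of triggering that download (assuming the `paddle.v2` package and the same `wmt14` reader used by this example's scripts):

```python
import paddle.v2 as paddle

# creating the reader and pulling one sample downloads/caches WMT14 on first use;
# each sample is expected to match the three data layers used in this example:
# (source word ids, target word ids, target next-word ids)
train_reader = paddle.dataset.wmt14.train(30000)  # 30000 = dictionary size
src_ids, trg_ids, trg_next_ids = next(train_reader())
print(len(src_ids), len(trg_ids), len(trg_next_ids))
```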

## Model Training and Testing

### Model Training

Starting model training is simple: just execute `python train.py` in a command-line window. The `train()` function in the `train.py` script performs the following logic:

**a) Define the network, parse the network structure, and initialize the model parameters.**

```python
# define the network topology.
cost = seq2seq_net(source_dict_dim, target_dict_dim)
parameters = paddle.parameters.create(cost)
```

**b) Set the optimization strategy for training and define the training data `reader`.**

```python
# define optimization method
optimizer = paddle.optimizer.RMSProp(
    learning_rate=1e-3,
    gradient_clipping_threshold=10.0,
    regularization=paddle.optimizer.L2Regularization(rate=8e-4))

# define the trainer instance
trainer = paddle.trainer.SGD(
    cost=cost, parameters=parameters, update_equation=optimizer)

# define data reader
wmt14_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.wmt14.train(source_dict_dim), buf_size=8192),
    batch_size=55)
```

**c) Define the event handler to print intermediate training results and save model snapshots.**

```python
# define the event_handler callback
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if not event.batch_id % 100 and event.batch_id:
            with gzip.open(
                    os.path.join(save_path,
                                 "nmt_without_att_%05d_batch_%05d.tar.gz" %
                                 (event.pass_id, event.batch_id)), "w") as f:
                parameters.to_tar(f)

        if event.batch_id and not event.batch_id % 10:
            logger.info("Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics))
```

**d) Start training**

```python
# start training
trainer.train(
    reader=wmt14_reader, event_handler=event_handler, num_passes=2)
```

A sample of the training output:

```text
Pass 0, Batch 0, Cost 267.674663, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 172.892294, {'classification_error_evaluator': 0.953895092010498}
.........
Pass 0, Batch 20, Cost 177.989329, {'classification_error_evaluator': 0.9052488207817078}
.........
Pass 0, Batch 30, Cost 153.633665, {'classification_error_evaluator': 0.8643803596496582}
.........
Pass 0, Batch 40, Cost 168.170543, {'classification_error_evaluator': 0.8348183631896973}
```

### Generate Translation Results
In PaddlePaddle, it is also easy to use a trained model to generate translations.

1. First, modify the parameters passed to the `generate()` function in the `main` function of the `generate.py` script, to choose which saved model to use. The default parameters are as follows:

    ```python
    generate(
        source_dict_dim=30000,
        target_dict_dim=30000,
        batch_size=20,
        beam_size=3,
        model_path="models/nmt_without_att_params_batch_00100.tar.gz")
    ```

2. Then, execute the `python generate.py` command in the terminal. The `generate()` function in the script performs the following steps:

    **a) Load the test samples**

    ```python
    # load data samples for generation
    gen_creator = paddle.dataset.wmt14.gen(source_dict_dim)
    gen_data = []
    for item in gen_creator():
        gen_data.append((item[0], ))
    ```

    **b) Initialize the model and execute `infer()` on the input samples to generate beam-search translation results**

    ```python
    beam_gen = seq2seq_net(source_dict_dim, target_dict_dim, True)
    with gzip.open(init_models_path) as f:
        parameters = paddle.parameters.Parameters.from_tar(f)
    # prob is the prediction probabilities, and id is the ids of the predicted words.
    beam_result = paddle.infer(
        output_layer=beam_gen,
        parameters=parameters,
        input=gen_data,
        field=['prob', 'id'])
    ```

    **c) Next, load the source and target language dictionaries, convert the sentences represented by `id` sequences back into the original words, and output the results.**

    ```python
    beam_result = inferer.infer(input=test_batch, field=["prob", "id"])

    gen_sen_idx = np.where(beam_result[1] == -1)[0]
    assert len(gen_sen_idx) == len(test_batch) * beam_size

    start_pos, end_pos = 1, 0
    for i, sample in enumerate(test_batch):
        print(" ".join([
            src_dict[w] for w in sample[0][1:-1]
        ]))  # skip the start and ending mark when print the source sentence
        for j in xrange(beam_size):
            end_pos = gen_sen_idx[i * beam_size + j]
            print("%.4f\t%s" % (beam_result[0][i][j], " ".join(
                trg_dict[w] for w in beam_result[1][start_pos:end_pos])))
            start_pos = end_pos + 2
        print("\n")
    ```

Setting the beam search width to 3 and feeding in French sentences, the model automatically generates translation results for the test data. The output format is as follows:

```text
Elles connaissent leur entreprise mieux que personne .
-3.754819        They know their business better than anyone . <e>
-4.445528        They know their businesses better than anyone . <e>
-5.026885        They know their business better than anybody . <e>

```
- The first line is the input source-language sentence.
- Lines 2 through `beam_size + 1` are the `beam_size` translation results generated by beam search:
    - each row is separated into two columns by `\t`: the first column is the log probability of the sentence, and the second column is the text of the translation result;
    - the symbol `<s>` represents the beginning of a sentence, the symbol `<e>` indicates the end of a sentence, and any word not included in the dictionary is replaced with the symbol `<unk>` (a small parsing sketch follows this list).
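
For downstream use, that printed format is easy to parse. Below is a hypothetical post-processing sketch (not part of this example's scripts) that splits the output back into scored hypotheses:

```python
def parse_generation_output(text, beam_size=3):
    """Split printed output into (source, [(log_prob, translation), ...]) pairs."""
    results = []
    for block in text.strip().split("\n\n"):
        lines = block.strip().split("\n")
        source = lines[0]
        hypotheses = []
        for line in lines[1:1 + beam_size]:
            log_prob, translation = line.split("\t", 1)
            hypotheses.append((float(log_prob), translation.strip()))
        results.append((source, hypotheses))
    return results
```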

So far, we have implemented a basic machine translation model with PaddlePaddle. As we can see, PaddlePaddle provides flexible and rich APIs that allow users to easily assemble a variety of complex network configurations. NMT itself is a rapidly developing field in which new ideas continue to emerge; this example is a basic implementation, and users can build more sophisticated NMT models on top of PaddlePaddle.


## References
[1] Sutskever I, Vinyals O, Le Q V. [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)[J]. 2014, 4: 3104-3112.

[2] Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](http://www.aclweb.org/anthology/D/D14/D14-1179.pdf)[C]. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1724-1734.

[3] Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[C]. Proceedings of ICLR 2015, 2015.