提交 9b786b57 编写于 作者: Y Yi Wang 提交者: GitHub

Merge pull request #189 from helinwang/machine_translation_en

English translation for code part of chapter machine translation
...@@ -185,77 +185,10 @@ Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder ...@@ -185,77 +185,10 @@ Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder
## Data Preparation ## Data Preparation
### Download and Uncompression
This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as test and generation set. This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as test and generation set.
Run the following command in Linux to obtain the data:
```bash
cd data
./wmt14_data.sh
```
There are three folders in the downloaded dataset `data/wmt14`:
<p align = "center">
<table>
<tr>
<td>Folder Name</td>
<td>French-English Parallel Corpus</td>
<td>Number of Files</td>
<td>Size of Files</td>
</tr>
<tr>
<td>train</td>
<td>ccb2_pc30.src, ccb2_pc30.trg, etc</td>
<td>12</td>
<td>3.55G</td>
</tr>
<tr>
<td>test</td>
<td>ntst1213.src, ntst1213.trg</td>
<td>2</td>
<td>1636k</td>
</tr>
</tr>
<tr>
<td>gen</td>
<td>ntst14.src, ntst14.trg</td>
<td>2</td>
<td>864k</td>
</tr>
</table>
</p>
- `XXX.src` is the source file in French and `XXX.trg`is the target file in English. Each row of the file contains one sentence.
- `XXX.src` and `XXX.trg` has the same number of rows and there is a one-to-one correspondance between the sentences at any row from the two files.
### User Defined Dataset (Optional) ### Data Preprocessing
To use your own dataset, just put it under the `data` folder and organize it as follows
```text
user_dataset
├── train
│   ├── train_file1.src
│   ├── train_file1.trg
│   └── ...
├── test
│   ├── test_file1.src
│   ├── test_file1.trg
│   └── ...
├── gen
│   ├── gen_file1.src
│   ├── gen_file1.trg
│   └── ...
```
Explanation of the directories:
- First level: `user_dataset`: the name of the user defined dataset.
- Second level: `train``test` and `gen`: these names should not be changed.
- Third level: Parallel corpus in source language and target language, each with a postfix of `.src` and `.trg`.
### Data Pre-processing
There are two steps for pre-processing: There are two steps for pre-processing:
- Merge the source and target parallel corpus files into one file - Merge the source and target parallel corpus files into one file
...@@ -264,248 +197,103 @@ There are two steps for pre-processing: ...@@ -264,248 +197,103 @@ There are two steps for pre-processing:
- Create source dictionary and target dictionary, each containing **DICTSIZE** number of words, including the most frequent (DICTSIZE - 3) fo word from the corpus and 3 special token `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown words that are not in the vocabulary). - Create source dictionary and target dictionary, each containing **DICTSIZE** number of words, including the most frequent (DICTSIZE - 3) fo word from the corpus and 3 special token `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown words that are not in the vocabulary).
`preprocess.py` is used for pre-processing: ### A Subset of Dataset
```python
python preprocess.py -i INPUT [-d DICTSIZE] [-m]
```
- `-i INPUT`: path to the original dataset.
- `-d DICTSIZE`: number of words in the dictionary. If unspecified, the dictionary will contain all the words appeared in the input dataset.
- `-m --mergeDict`: merge the source dictionary with target dictionary, making the two dictionaries have the same content.
The specific command to run the script is as follows:
```python
python preprocess.py -i data/wmt14 -d 30000
```
You will see the following messages after a few minutes:
```text
concat parallel corpora for dataset
build source dictionary for train data
build target dictionary for train data
dictionary size is 30000
```
The pre-processed data is located at `data/pre-wmt14`:
```text
pre-wmt14
├── train
│   └── train
├── test
│   └── test
├── gen
│   └── gen
├── train.list
├── test.list
├── gen.list
├── src.dict
└── trg.dict
```
- `train`, `test` and `gen`: contains French-English parallel corpus for training, testing and generation. Each row from each file is separated into two columns with a "\t", where the first column is the sequence in French and the second one is in English.
- `train.list`, `test.list` and `gen.list`: record respectively the path to `train`, `test` and `gen` folders.
- `src.dict` and `trg.dict`: source (French) and target (English) dictionary. Each dictionary contains 30000 words (29997 most frequent words and 3 special tokens).
### Providing Data to PaddlePaddle
We use `dataprovider.py` to provide data to PaddlePaddle as follows:
1. Import PyDataProvider2 package from PaddlePaddle and define three special tokens: Because the full dataset is very big, to reduce the time for downloading the full dataset. PadddlePaddle package `paddle.dataset.wmt14` provides a preprocessed `subset of dataset`(http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz).
```python This subset has 193319 instances of training data and 6003 instances of test data. Dictionary size is 30000. Because of the limitation of size of the subset, the effectiveness of trained model from this subset is not guaranteed.
from paddle.trainer.PyDataProvider2 import *
UNK_IDX = 2 #out of vocabulary word
START = "<s>" #begin of sequence
END = "<e>" #end of sequence
```
2. Use initialization function `hook` to define the input data types (`input_types`) for training and generation:
- Training: there are three input sequences, where "source language sequence" and "target language sequence" are input and the "target language next word sequence" is the label.
- Generation: there are two input sequences, where the "source language sequence" is the input and “source language sequence id” are the ids for the input data (optional).
`src_dict_path` in the `hook` function is the path to the source language dictionary, while `trg_dict_path` the path to target language dictionary. `is_generating` is passed from model config file. For more details on the usage of the `hook` function please refer to [Model Config](#Model Config).
```python
def hook(settings, src_dict_path, trg_dict_path, is_generating, file_list,
**kwargs):
# job_mode = 1: training 0: generation
settings.job_mode = not is_generating
def fun(dict_path): # load dictionary according to the path
out_dict = dict()
with open(dict_path, "r") as fin:
out_dict = {
line.strip(): line_count
for line_count, line in enumerate(fin)
}
return out_dict
settings.src_dict = fun(src_dict_path)
settings.trg_dict = fun(trg_dict_path)
if settings.job_mode: #training
settings.input_types = {
'source_language_word': #source language sequence
integer_value_sequence(len(settings.src_dict)),
'target_language_word': #target language sequence
integer_value_sequence(len(settings.trg_dict)),
'target_language_next_word': #target language next word sequence
integer_value_sequence(len(settings.trg_dict))
}
else: #generation
settings.input_types = {
'source_language_word': #source language sequence
integer_value_sequence(len(settings.src_dict)),
'sent_id': #source language sequence id
integer_value_sequence(len(open(file_list[0], "r").readlines()))
}
```
3. Use `process` function to open the file `file_name`, read each row of the file, convert the data to be compatible with `input_types`, and then use `yield` to return to PaddlePaddle process. More specifically
- add `<s>` to the beginning of each source language sequence and add `<e>` to the end, producing "source_language_word".
- add `<s>` to the beginning of each target language senquence, producing "target_language_word".
- add `<e>` to the end of each target language senquence, producing "target_language_next_word".
```python
def _get_ids(s, dictionary): # get the location of each word from the source language sequence in the dictionary
words = s.strip().split()
return [dictionary[START]] + \
[dictionary.get(w, UNK_IDX) for w in words] + \
[dictionary[END]]
@provider(init_hook=hook, pool_size=50000)
def process(settings, file_name):
with open(file_name, 'r') as f:
for line_count, line in enumerate(f):
line_split = line.strip().split('\t')
if settings.job_mode and len(line_split) != 2:
continue
src_seq = line_split[0]
src_ids = _get_ids(src_seq, settings.src_dict)
if settings.job_mode:
trg_seq = line_split[1]
trg_words = trg_seq.split()
trg_ids = [settings.trg_dict.get(w, UNK_IDX) for w in trg_words]
# sequence with length longer than 80 with be removed during training to avoid an overly deep RNN.
if len(src_ids) > 80 or len(trg_ids) > 80:
continue
trg_ids_next = trg_ids + [settings.trg_dict[END]]
trg_ids = [settings.trg_dict[START]] + trg_ids
yield {
'source_language_word': src_ids,
'target_language_word': trg_ids,
'target_language_next_word': trg_ids_next
}
else:
yield {'source_language_word': src_ids, 'sent_id': [line_count]}
```
Note: The size of the training data is 3.55G. For machines with limited memories, it is recommended to use `pool_size` to set the number of data samples stored in memory.
## Model Config
### Data Definition ## Training Instructions
1. Specify the path to data and source/target dictionaries. `is_generating` accepts argument passed from command lines and is used to denote whether the current configuration is for training (default) or generation. See [Usage and Resutls](#Usage and Results). ### Initialize PaddlePaddle
```python ```python
import os import paddle.v2 as paddle
from paddle.trainer_config_helpers import *
data_dir = "./data/pre-wmt14" # data path # train with a single CPU
src_lang_dict = os.path.join(data_dir, 'src.dict') # path to the source language dictionary paddle.init(use_gpu=False, trainer_count=1)
trg_lang_dict = os.path.join(data_dir, 'trg.dict') # path to the target language dictionary ```
is_generating = get_config_arg("is_generating", bool, False) # config mode
```
2. Use `define_py_data_sources2` to get data from `dataprovider.py`, and use `args` variable to input the source/target language dicitonary path and config mode.
```python ### Define DataSet
if not is_generating:
train_list = os.path.join(data_dir, 'train.list')
test_list = os.path.join(data_dir, 'test.list')
else:
train_list = None
test_list = os.path.join(data_dir, 'gen.list')
define_py_data_sources2(
train_list,
test_list,
module="dataprovider",
obj="process",
args={
"src_dict_path": src_lang_dict, # source language dictionary path
"trg_dict_path": trg_lang_dict, # target language dictionary path
"is_generating": is_generating # config mode
})
```
### Algorithm Configuration We will define dictionary size, and create [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for WMT-14 dataset.
```python ```python
settings( # source and target dict dim.
learning_method = AdamOptimizer(), dict_size = 30000
batch_size = 50,
learning_rate = 5e-4) feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
``` ```
This tutorial will use the default SGD and Adam learning algorithm, with a learning rate of 5e-4. Note that the `batch_size = 50` denotes generating 50 sequence each time.
### Model Structure ### Model Configuration
1. Define some global variables 1. Define some global variables
```python ```python
source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary source_dict_dim = dict_size # source language dictionary size
target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of target language dictionary target_dict_dim = dict_size # destination language dictionary size
word_vector_dim = 512 # dimensionality of word vector word_vector_dim = 512 # word embedding dimension
encoder_size = 512 # dimensionality of the hidden state of encoder GRU encoder_size = 512 # hidden layer size of GRU in encoder
decoder_size = 512 # dimentionality of the hidden state of decoder GRU decoder_size = 512 # hidden layer size of GRU in decoder
if is_generating:
beam_size=3 # beam size for the beam search algorithm
max_length=250 # maximum length for the generated sentence
gen_trans_file = get_config_arg("gen_trans_file", str, None) # generate file
``` ```
2. Implement Encoder as follows: 1. Implement Encoder as follows:
1. Input is a sequence of words represented by an integer word index sequence. So we define data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`
2.1 Input one-hot vector representations $\mathbf{w}$ converted with `dataprovider.py` from the source language sentence
```python ```python
src_word_id = data_layer(name='source_language_word', size=source_dict_dim) src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
``` ```
2.2 Map the one-hot vector into a word vector $\mathbf{s}$ in a low-dimensional semantic space
1. Map the one-hot vector (represented by word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space
```python ```python
src_embedding = embedding_layer( src_embedding = paddle.layer.embedding(
input=src_word_id, input=src_word_id,
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_source_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
``` ```
2.3 Use bi-direcitonal GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
1. Use bi-direcitonal GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
```python ```python
src_forward = simple_gru(input=src_embedding, size=encoder_size) src_forward = paddle.networks.simple_gru(
src_backward = simple_gru( input=src_embedding, size=encoder_size)
input=src_embedding, size=encoder_size, reverse=True) src_backward = paddle.networks.simple_gru(
encoded_vector = concat_layer(input=[src_forward, src_backward]) input=src_embedding, size=encoder_size, reverse=True)
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
``` ```
3. Implement Attention-based Decoder as follows: 1. Implement Attention-based Decoder as follows:
3.1 Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network 1. Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network
```python ```python
with mixed_layer(size=decoder_size) as encoded_proj: with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += full_matrix_projection(input=encoded_vector) encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector)
``` ```
3.2 Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$ 1. Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
```python ```python
backward_first = first_seq(input=src_backward) backward_first = paddle.layer.first_seq(input=src_backward)
with mixed_layer( with paddle.layer.mixed(
size=decoder_size, size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
act=TanhActivation(), ) as decoder_boot: decoder_boot += paddle.layer.full_matrix_projection(
decoder_boot += full_matrix_projection(input=backward_first) input=backward_first)
``` ```
3.3 Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word. 1. Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word.
- decoder_mem records the hidden state $z_i$ from the previous time step, with an initial state as decoder_boot. - decoder_mem records the hidden state $z_i$ from the previous time step, with an initial state as decoder_boot.
- context is computed via `simple_attention` as $c_i=\sum {j=1}^{T}a_{ij}h_j$, where enc_vec is the projection of $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`. - context is computed via `simple_attention` as $c_i=\sum {j=1}^{T}a_{ij}h_j$, where enc_vec is the projection of $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`.
...@@ -514,184 +302,146 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn ...@@ -514,184 +302,146 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn
- Softmax normalization is used in the end to computed the probability of words, i.e., $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. The output is returned. - Softmax normalization is used in the end to computed the probability of words, i.e., $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. The output is returned.
```python ```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word): def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
decoder_mem = memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot) decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = simple_attention(
encoded_sequence=enc_vec, context = paddle.networks.simple_attention(
encoded_proj=enc_proj, encoded_sequence=enc_vec,
decoder_state=decoder_mem, ) encoded_proj=enc_proj,
decoder_state=decoder_mem)
with mixed_layer(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += full_matrix_projection(input=context) with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += full_matrix_projection(input=current_word) decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
gru_step = gru_step_layer( input=current_word)
name='gru_decoder',
input=decoder_inputs, gru_step = paddle.layer.gru_step(
output_mem=decoder_mem, name='gru_decoder',
size=decoder_size) input=decoder_inputs,
output_mem=decoder_mem,
with mixed_layer( size=decoder_size)
size=target_dict_dim, bias_attr=True,
act=SoftmaxActivation()) as out: with paddle.layer.mixed(
out += full_matrix_projection(input=gru_step) size=target_dict_dim,
return out bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
``` ```
4. Decoder differences between the training and generation 1. Define the name for the decoder and the first two input for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
4.1 Define the name for the decoder and the first two input for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
```python ```python
decoder_group_name = "decoder_group" decoder_group_name = "decoder_group"
group_input1 = StaticInput(input=encoded_vector, is_seq=True) group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
group_input2 = StaticInput(input=encoded_proj, is_seq=True) group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2] group_inputs = [group_input1, group_input2]
``` ```
4.2 In training mode: 1. Training mode:
- word embedding from the target langauge trg_embedding is passed to `gru_decoder_with_attention` as current_word. - word embedding from the target language trg_embedding is passed to `gru_decoder_with_attention` as current_word.
- `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way - `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way
- the sequence of next words from the target language is used as label (lbl) - the sequence of next words from the target language is used as label (lbl)
- multi-class cross-entropy (`classification_cost`) is used to calculate the cost - multi-class cross-entropy (`classification_cost`) is used to calculate the cost
```python ```python
if not is_generating: trg_embedding = paddle.layer.embedding(
trg_embedding = embedding_layer( input=paddle.layer.data(
input=data_layer( name='target_language_word',
name='target_language_word', size=target_dict_dim), type=paddle.data_type.integer_value_sequence(target_dict_dim)),
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_target_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
group_inputs.append(trg_embedding) group_inputs.append(trg_embedding)
decoder = recurrent_group( # For decoder equipped with attention mechanism, in training,
name=decoder_group_name, # target embeding (the groudtruth) is the data input,
step=gru_decoder_with_attention, # while encoded source sequence is accessed to as an unbounded memory.
input=group_inputs) # Here, the StaticInput defines a read-only memory
# for the recurrent_group.
lbl = data_layer(name='target_language_next_word', size=target_dict_dim) decoder = paddle.layer.recurrent_group(
cost = classification_cost(input=decoder, label=lbl) name=decoder_group_name,
outputs(cost) step=gru_decoder_with_attention,
``` input=group_inputs)
lbl = paddle.layer.data(
name='target_language_next_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim))
cost = paddle.layer.classification_cost(input=decoder, label=lbl)
```
4.3 In generation mode: Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
- during generation, as the decoder RNN will take the word vector generated from the previous time step as input, `GeneratedInput` is used to implement this automatically. Please refer to [GeneratedInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details. ### Create Parameters
- `beam_search` will call `gru_decoder_with_attention` to generate id
- `seqtext_printer_evaluator` outputs the generated sentence to `gen_trans_file` according to `trg_lang_dict`
```python Create every parameter that `cost` layer needs.
else:
trg_embedding = GeneratedInput(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
beam_gen = beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
seqtext_printer_evaluator(
input=beam_gen,
id_input=data_layer(
name="sent_id", size=1),
dict_file=trg_lang_dict,
result_file=gen_trans_file)
outputs(beam_gen)
```
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
```python
parameters = paddle.parameters.create(cost)
```
We can get parameter names. If the parameter name is not specified during model configuration, it will be generated.
```python
for param in parameters.keys():
print param
```
## Model Training ## Model Training
Training can be started with the following command: 1. Create trainer
```bash We need to tell trainer what to optimize, and how to optimize. Here trainer will optimize `cost` layer using stochastic gradient descent (SDG).
./train.sh
```
where `train.sh` contains
```bash ```python
paddle train \ optimizer = paddle.optimizer.Adam(learning_rate=1e-4)
--config='seqToseq_net.py' \ trainer = paddle.trainer.SGD(cost=cost,
--save_dir='model' \ parameters=parameters,
--use_gpu=false \ update_equation=optimizer)
--num_passes=16 \ ```
--show_parameter_stats_period=100 \
--trainer_count=4 \
--log_period=10 \
--dot_period=5 \
2>&1 | tee 'train.log'
```
- config: configuration file for the network
- save_dir: path to save the trained model
- use_gpu: whether to use GPU for training; CPU is used here
- num_passes: number of passes for training. In PaddlePaddle, one pass meansing one pass of complete training pass using all the data in the training set
- show_parameter_stats_period: here we show the statistics of parameters every 100 batches
- trainer_count: the number of CPU processes or GPU devices
- log_period: here we print log every 10 batches
- dot_period: we print one "." every 5 batches
The training loss will the printed every 10 batches, and you will see messages like those below:
```text
I0719 19:16:45.952062 15563 TrainerInternal.cpp:160] Batch=10 samples=500 AvgCost=198.475 CurrentCost=198.475 Eval: classification_error_evaluator=0.737155 CurrentEval: classification_error_evaluator=0.737155
I0719 19:17:56.707319 15563 TrainerInternal.cpp:160] Batch=20 samples=1000 AvgCost=157.479 CurrentCost=116.483 Eval: classification_error_evaluator=0.698392 CurrentEval: classification_error_evaluator=0.659065
.....
```
- AvgCost: average cost from batch-0 to the current batch.
- CurrentCost: the cost for the current batch
- classification\_error\_evaluator (Eval): average error rate from evaluator-0 to the current evaluator for each word
- classification\_error\_evaluator (CurrentEval): error rate for the current evaluator for each word
The model training is successful when the classification\_error\_evaluator is lower than 0.35. 1. Define event handler
## Model Usage The event handler is a callback function invoked by trainer when an event happens. Here we will print log in event handler.
### Download Pre-trained Model ```python
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
```
As the training of an NMT model is very time consuming, we provide a pre-trained model (pass-00012, ~205M). The model is trained with a cluster of 50 physical nodes (each node has two 6-core CPU). We trained 16 passes (taking about 5 days) with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. Run the following command to down load the model: 1. Start training
```bash
cd pretrained
./wmt14_model.sh
```
### Usage and Results ```python
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=10000,
feeding=feeding)
```
Run the following command to perform translation from French to English: ```text
Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
...
```
```bash The model training is successful when the `classification_error_evaluator` is lower than 0.35.
./gen.sh
``` ## Model Usage
where `gen.sh` contains:
### Download Pre-trained Model
As the training of an NMT model is very time consuming, we provide a pre-trained model (pass-00012, ~205M). The model is trained with a cluster of 50 physical nodes (each node has two 6-core CPU). We trained 16 passes (taking about 5 days) with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. Run the following command to download the model:
```bash ```bash
paddle train \ cd pretrained
--job=test \ ./wmt14_model.sh
--config='seqToseq_net.py' \
--save_dir='pretrained/wmt14_model' \
--use_gpu=true \
--num_passes=13 \
--test_pass=12 \
--trainer_count=1 \
--config_args=is_generating=1,gen_trans_file="gen_result" \
2>&1 | tee 'translation/gen.log'
``` ```
Parameters different training are listed as follows:
- job: set the mode as testing.
- save_dir: path to the pre-trained model.
- num_passes and test_pass: load the model parameters from pass $i\epsilon \left [ test\\_pass,num\\_passes-1 \right ]$. Here we only load `data/wmt14_model/pass-00012`.
- config_args: pass the self-defined command line parameters to model configuration. `is_generating=1` indicates generation mode and `gen_trans_file="gen_result"` represents the file generated.
For translation results please refer to [Illustrative Results](#Illustrative Results).
### BLEU Evaluation ### BLEU Evaluation
...@@ -717,7 +467,7 @@ BLEU = 26.92 ...@@ -717,7 +467,7 @@ BLEU = 26.92
## Summary ## Summary
End-to-end neural machine translation is a recently developed way to perform machine translations. In this chapter, we introduced the typical "Encoder-Decoder" framework and "attention" mechanism. Since NMT is a typical Sequence-to-Sequence (Seq2Seq) learning problem, tasks such as query rewriting, abstraction generation and single-turn dialogues can all be solved with the model presented in this chapter. End-to-end neural machine translation is a recently developed way to perform machine translations. In this chapter, we introduced the typical "Encoder-Decoder" framework and "attention" mechanism. Since NMT is a typical Sequence-to-Sequence (Seq2Seq) learning problem, tasks such as query rewriting, abstraction generation, and single-turn dialogues can all be solved with the model presented in this chapter.
## References ## References
......
...@@ -152,54 +152,8 @@ e_{ij}&=align(z_i,h_j)\\\\ ...@@ -152,54 +152,8 @@ e_{ij}&=align(z_i,h_j)\\\\
## 数据介绍 ## 数据介绍
### 下载与解压缩
本教程使用[WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/)数据集中的[bitexts(after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz)作为训练集,[dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz)作为测试集和生成集。 本教程使用[WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/)数据集中的[bitexts(after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz)作为训练集,[dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz)作为测试集和生成集。
在Linux下,只需简单地运行以下命令:
```bash
cd data
./wmt14_data.sh
```
得到的数据集`data/wmt14`包含如下三个文件夹:
<p align = "center">
<table>
<tr>
<td>文件夹名</td>
<td>法英平行语料文件</td>
<td>文件数</td>
<td>文件大小</td>
</tr>
<tr>
<td>train</td>
<td>ccb2_pc30.src, ccb2_pc30.trg, etc</td>
<td>12</td>
<td>3.55G</td>
</tr>
<tr>
<td>test</td>
<td>ntst1213.src, ntst1213.trg</td>
<td>2</td>
<td>1636k</td>
</tr>
</tr>
<tr>
<td>gen</td>
<td>ntst14.src, ntst14.trg</td>
<td>2</td>
<td>864k</td>
</tr>
</table>
</p>
- `XXX.src`是源法语文件,`XXX.trg`是目标英语文件,文件中的每行存放一个句子
- `XXX.src``XXX.trg`的行数一致,且两者任意第$i$行的句子之间都有着一一对应的关系。
### 数据预处理 ### 数据预处理
我们的预处理流程包括两步: 我们的预处理流程包括两步:
...@@ -256,17 +210,16 @@ wmt14_reader = paddle.batch( ...@@ -256,17 +210,16 @@ wmt14_reader = paddle.batch(
decoder_size = 512 # 解码器中的GRU隐层大小 decoder_size = 512 # 解码器中的GRU隐层大小
``` ```
2. 其次实现编码器框架分为三步 1. 其次实现编码器框架分为三步
2.1 将在dataset reader中生成的用每个单词在字典中的索引表示的源语言序列 1 输入是一个文字序列被表示成整型的序列序列中每个元素是文字在字典中的索引所以我们定义数据层的数据类型为`integer_value_sequence`整型序列),序列中每个元素的范围是`[0, source_dict_dim)`
转换成one-hot vector表示的源语言序列$\mathbf{w}$,其类型为integer_value_sequence
```python ```python
src_word_id = paddle.layer.data( src_word_id = paddle.layer.data(
name='source_language_word', name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim)) type=paddle.data_type.integer_value_sequence(source_dict_dim))
``` ```
2.2 将上述编码映射到低维语言空间的词向量$\mathbf{s}$。 1. 将上述编码映射到低维语言空间的词向量$\mathbf{s}$。
```python ```python
src_embedding = paddle.layer.embedding( src_embedding = paddle.layer.embedding(
...@@ -274,7 +227,7 @@ wmt14_reader = paddle.batch( ...@@ -274,7 +227,7 @@ wmt14_reader = paddle.batch(
size=word_vector_dim, size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_source_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
``` ```
2.3 用双向GRU编码源语言序列,拼接两个GRU的编码结果得到$\mathbf{h}$。 1. 用双向GRU编码源语言序列,拼接两个GRU的编码结果得到$\mathbf{h}$。
```python ```python
src_forward = paddle.networks.simple_gru( src_forward = paddle.networks.simple_gru(
...@@ -284,16 +237,17 @@ wmt14_reader = paddle.batch( ...@@ -284,16 +237,17 @@ wmt14_reader = paddle.batch(
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward]) encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
``` ```
3. 接着,定义基于注意力机制的解码器框架。分为三步: 1. 接着,定义基于注意力机制的解码器框架。分为三步:
3.1 对源语言序列编码后的结果(见2.3),过一个前馈神经网络(Feed Forward Neural Network),得到其映射。 1. 对源语言序列编码后的结果(见2.3),过一个前馈神经网络(Feed Forward Neural Network),得到其映射。
```python ```python
with paddle.layer.mixed(size=decoder_size) as encoded_proj: with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += paddle.layer.full_matrix_projection( encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector) input=encoded_vector)
``` ```
3.2 构造解码器RNN的初始状态。由于解码器需要预测时序目标序列,但在0时刻并没有初始值,所以我们希望对其进行初始化。这里采用的是将源语言序列逆序编码后的最后一个状态进行非线性映射,作为该初始值,即$c_0=h_T$。
1. 构造解码器RNN的初始状态。由于解码器需要预测时序目标序列,但在0时刻并没有初始值,所以我们希望对其进行初始化。这里采用的是将源语言序列逆序编码后的最后一个状态进行非线性映射,作为该初始值,即$c_0=h_T$。
```python ```python
backward_first = paddle.layer.first_seq(input=src_backward) backward_first = paddle.layer.first_seq(input=src_backward)
...@@ -302,7 +256,8 @@ wmt14_reader = paddle.batch( ...@@ -302,7 +256,8 @@ wmt14_reader = paddle.batch(
decoder_boot += paddle.layer.full_matrix_projection( decoder_boot += paddle.layer.full_matrix_projection(
input=backward_first) input=backward_first)
``` ```
3.3 定义解码阶段每一个时间步的RNN行为,即根据当前时刻的源语言上下文向量$c_i$、解码器隐层状态$z_i$和目标语言中第$i$个词$u_i$,来预测第$i+1$个词的概率$p_{i+1}$。
1. 定义解码阶段每一个时间步的RNN行为,即根据当前时刻的源语言上下文向量$c_i$、解码器隐层状态$z_i$和目标语言中第$i$个词$u_i$,来预测第$i+1$个词的概率$p_{i+1}$。
- decoder_mem记录了前一个时间步的隐层状态$z_i$,其初始状态是decoder_boot。 - decoder_mem记录了前一个时间步的隐层状态$z_i$,其初始状态是decoder_boot。
- context通过调用`simple_attention`函数,实现公式$c_i=\sum {j=1}^{T}a_{ij}h_j$。其中,enc_vec是$h_j$,enc_proj是$h_j$的映射(见3.1),权重$a_{ij}$的计算已经封装在`simple_attention`函数中。 - context通过调用`simple_attention`函数,实现公式$c_i=\sum {j=1}^{T}a_{ij}h_j$。其中,enc_vec是$h_j$,enc_proj是$h_j$的映射(见3.1),权重$a_{ij}$的计算已经封装在`simple_attention`函数中。
- decoder_inputs融合了$c_i$和当前目标词current_word(即$u_i$)的表示。 - decoder_inputs融合了$c_i$和当前目标词current_word(即$u_i$)的表示。
...@@ -339,24 +294,23 @@ wmt14_reader = paddle.batch( ...@@ -339,24 +294,23 @@ wmt14_reader = paddle.batch(
return out return out
``` ```
4. 训练模式与生成模式下的解码器调用区别。 1. 定义解码器框架名字,和`gru_decoder_with_attention`函数的前两个输入。注意:这两个输入使用`StaticInput`,具体说明可见[StaticInput文档](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入)
4.1 定义解码器框架名字,和`gru_decoder_with_attention`函数的前两个输入。注意:这两个输入使用`StaticInput`,具体说明可见[StaticInput文档](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入)
```python ```python
decoder_group_name = "decoder_group" decoder_group_name = "decoder_group"
group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True) group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True) group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2] group_inputs = [group_input1, group_input2]
``` ```
4.2 训练模式下的解码器调用:
- 首先,将目标语言序列的词向量trg_embedding,直接作为训练模式下的current_word传给`gru_decoder_with_attention`函数。 1. 训练模式下的解码器调用:
- 其次,使用`recurrent_group`函数循环调用`gru_decoder_with_attention`函数。
- 接着,使用目标语言的下一个词序列作为标签层lbl,即预测目标词。
- 最后,用多类交叉熵损失函数`classification_cost`来计算损失值。
```python - 首先,将目标语言序列的词向量trg_embedding,直接作为训练模式下的current_word传给`gru_decoder_with_attention`函数。
- 其次,使用`recurrent_group`函数循环调用`gru_decoder_with_attention`函数。
- 接着,使用目标语言的下一个词序列作为标签层lbl,即预测目标词。
- 最后,用多类交叉熵损失函数`classification_cost`来计算损失值。
```python
trg_embedding = paddle.layer.embedding( trg_embedding = paddle.layer.embedding(
input=paddle.layer.data( input=paddle.layer.data(
name='target_language_word', name='target_language_word',
...@@ -379,7 +333,8 @@ wmt14_reader = paddle.batch( ...@@ -379,7 +333,8 @@ wmt14_reader = paddle.batch(
name='target_language_next_word', name='target_language_next_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim)) type=paddle.data_type.integer_value_sequence(target_dict_dim))
cost = paddle.layer.classification_cost(input=decoder, label=lbl) cost = paddle.layer.classification_cost(input=decoder, label=lbl)
``` ```
注意:我们提供的配置在Bahdanau的论文\[[4](#参考文献)\]上做了一些简化,可参考[issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) 注意:我们提供的配置在Bahdanau的论文\[[4](#参考文献)\]上做了一些简化,可参考[issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133)
### 参数定义 ### 参数定义
...@@ -387,7 +342,6 @@ wmt14_reader = paddle.batch( ...@@ -387,7 +342,6 @@ wmt14_reader = paddle.batch(
首先依据模型配置的`cost`定义模型参数。 首先依据模型配置的`cost`定义模型参数。
```python ```python
# create parameters
parameters = paddle.parameters.create(cost) parameters = paddle.parameters.create(cost)
``` ```
...@@ -411,7 +365,7 @@ for param in parameters.keys(): ...@@ -411,7 +365,7 @@ for param in parameters.keys():
update_equation=optimizer) update_equation=optimizer)
``` ```
2. 构造event_handler 1. 构造event_handler
可以通过自定义回调函数来评估训练过程中的各种状态,比如错误率等。下面的代码通过event.batch_id % 10 == 0 指定没10个batch打印一次日志,包含cost等信息。 可以通过自定义回调函数来评估训练过程中的各种状态,比如错误率等。下面的代码通过event.batch_id % 10 == 0 指定没10个batch打印一次日志,包含cost等信息。
```python ```python
...@@ -422,7 +376,7 @@ for param in parameters.keys(): ...@@ -422,7 +376,7 @@ for param in parameters.keys():
event.pass_id, event.batch_id, event.cost, event.metrics) event.pass_id, event.batch_id, event.cost, event.metrics)
``` ```
3. 启动训练: 1. 启动训练:
```python ```python
trainer.train( trainer.train(
...@@ -438,23 +392,19 @@ for param in parameters.keys(): ...@@ -438,23 +392,19 @@ for param in parameters.keys():
... ...
``` ```
当`classification_error_evaluator`的值低于0.35的时候,表示训练成功。
## 应用模型 ## 应用模型
### 下载预训练的模型 ### 下载预训练的模型
由于NMT模型的训练非常耗时,我们在50个物理节点(每节点含有2颗6核CPU)的集群中,花了5天时间训练了16个pass,其中每个pass耗时7个小时。因此,我们提供了一个预先训练好的模型(pass-00012)供大家直接下载使用。该模型大小为205MB,在所有16个模型中有最高的[BLEU评估](#BLEU评估)值26.92。下载并解压模型的命令如下: 由于NMT模型的训练非常耗时,我们在50个物理节点(每节点含有2颗6核CPU)的集群中,花了5天时间训练了16个pass,其中每个pass耗时7个小时。因此,我们提供了一个预先训练好的模型(pass-00012)供大家直接下载使用。该模型大小为205MB,在所有16个模型中有最高的[BLEU评估](#BLEU评估)值26.92。下载并解压模型的命令如下:
```bash ```bash
cd pretrained cd pretrained
./wmt14_model.sh ./wmt14_model.sh
``` ```
### 应用命令与结果
新版api尚未支持机器翻译的翻译过程,尽请期待。
翻译结果请见[效果展示](#效果展示)
### BLEU评估 ### BLEU评估
BLEU(Bilingual Evaluation understudy)是一种广泛使用的机器翻译自动评测指标,由IBM的watson研究中心于2002年提出\[[5](#参考文献)\],基本出发点是:机器译文越接近专业翻译人员的翻译结果,翻译系统的性能越好。其中,机器译文与人工参考译文之间的接近程度,采用句子精确度(precision)的计算方法,即比较两者的n元词组相匹配的个数,匹配的个数越多,BLEU得分越好。 BLEU(Bilingual Evaluation understudy)是一种广泛使用的机器翻译自动评测指标,由IBM的watson研究中心于2002年提出\[[5](#参考文献)\],基本出发点是:机器译文越接近专业翻译人员的翻译结果,翻译系统的性能越好。其中,机器译文与人工参考译文之间的接近程度,采用句子精确度(precision)的计算方法,即比较两者的n元词组相匹配的个数,匹配的个数越多,BLEU得分越好。
......
...@@ -227,77 +227,10 @@ Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder ...@@ -227,77 +227,10 @@ Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder
## Data Preparation ## Data Preparation
### Download and Uncompression
This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as test and generation set. This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as test and generation set.
Run the following command in Linux to obtain the data:
```bash
cd data
./wmt14_data.sh
```
There are three folders in the downloaded dataset `data/wmt14`:
<p align = "center">
<table>
<tr>
<td>Folder Name</td>
<td>French-English Parallel Corpus</td>
<td>Number of Files</td>
<td>Size of Files</td>
</tr>
<tr>
<td>train</td>
<td>ccb2_pc30.src, ccb2_pc30.trg, etc</td>
<td>12</td>
<td>3.55G</td>
</tr>
<tr>
<td>test</td>
<td>ntst1213.src, ntst1213.trg</td>
<td>2</td>
<td>1636k</td>
</tr>
</tr>
<tr>
<td>gen</td>
<td>ntst14.src, ntst14.trg</td>
<td>2</td>
<td>864k</td>
</tr>
</table>
</p>
- `XXX.src` is the source file in French and `XXX.trg`is the target file in English. Each row of the file contains one sentence.
- `XXX.src` and `XXX.trg` has the same number of rows and there is a one-to-one correspondance between the sentences at any row from the two files.
### User Defined Dataset (Optional) ### Data Preprocessing
To use your own dataset, just put it under the `data` folder and organize it as follows
```text
user_dataset
├── train
│   ├── train_file1.src
│   ├── train_file1.trg
│   └── ...
├── test
│   ├── test_file1.src
│   ├── test_file1.trg
│   └── ...
├── gen
│   ├── gen_file1.src
│   ├── gen_file1.trg
│   └── ...
```
Explanation of the directories:
- First level: `user_dataset`: the name of the user defined dataset.
- Second level: `train`、`test` and `gen`: these names should not be changed.
- Third level: Parallel corpus in source language and target language, each with a postfix of `.src` and `.trg`.
### Data Pre-processing
There are two steps for pre-processing: There are two steps for pre-processing:
- Merge the source and target parallel corpus files into one file - Merge the source and target parallel corpus files into one file
...@@ -306,248 +239,103 @@ There are two steps for pre-processing: ...@@ -306,248 +239,103 @@ There are two steps for pre-processing:
- Create source dictionary and target dictionary, each containing **DICTSIZE** number of words, including the most frequent (DICTSIZE - 3) fo word from the corpus and 3 special token `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown words that are not in the vocabulary). - Create source dictionary and target dictionary, each containing **DICTSIZE** number of words, including the most frequent (DICTSIZE - 3) fo word from the corpus and 3 special token `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown words that are not in the vocabulary).
`preprocess.py` is used for pre-processing: ### A Subset of Dataset
```python
python preprocess.py -i INPUT [-d DICTSIZE] [-m]
```
- `-i INPUT`: path to the original dataset.
- `-d DICTSIZE`: number of words in the dictionary. If unspecified, the dictionary will contain all the words appeared in the input dataset.
- `-m --mergeDict`: merge the source dictionary with target dictionary, making the two dictionaries have the same content.
The specific command to run the script is as follows:
```python
python preprocess.py -i data/wmt14 -d 30000
```
You will see the following messages after a few minutes:
```text
concat parallel corpora for dataset
build source dictionary for train data
build target dictionary for train data
dictionary size is 30000
```
The pre-processed data is located at `data/pre-wmt14`:
```text
pre-wmt14
├── train
│   └── train
├── test
│   └── test
├── gen
│   └── gen
├── train.list
├── test.list
├── gen.list
├── src.dict
└── trg.dict
```
- `train`, `test` and `gen`: contains French-English parallel corpus for training, testing and generation. Each row from each file is separated into two columns with a "\t", where the first column is the sequence in French and the second one is in English.
- `train.list`, `test.list` and `gen.list`: record respectively the path to `train`, `test` and `gen` folders.
- `src.dict` and `trg.dict`: source (French) and target (English) dictionary. Each dictionary contains 30000 words (29997 most frequent words and 3 special tokens).
### Providing Data to PaddlePaddle
We use `dataprovider.py` to provide data to PaddlePaddle as follows: Because the full dataset is very big, to reduce the time for downloading the full dataset. PadddlePaddle package `paddle.dataset.wmt14` provides a preprocessed `subset of dataset`(http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz).
1. Import PyDataProvider2 package from PaddlePaddle and define three special tokens: This subset has 193319 instances of training data and 6003 instances of test data. Dictionary size is 30000. Because of the limitation of size of the subset, the effectiveness of trained model from this subset is not guaranteed.
```python ## Training Instructions
from paddle.trainer.PyDataProvider2 import *
UNK_IDX = 2 #out of vocabulary word
START = "<s>" #begin of sequence
END = "<e>" #end of sequence
```
2. Use initialization function `hook` to define the input data types (`input_types`) for training and generation:
- Training: there are three input sequences, where "source language sequence" and "target language sequence" are input and the "target language next word sequence" is the label.
- Generation: there are two input sequences, where the "source language sequence" is the input and “source language sequence id” are the ids for the input data (optional).
`src_dict_path` in the `hook` function is the path to the source language dictionary, while `trg_dict_path` the path to target language dictionary. `is_generating` is passed from model config file. For more details on the usage of the `hook` function please refer to [Model Config](#Model Config). ### Initialize PaddlePaddle
```python ```python
def hook(settings, src_dict_path, trg_dict_path, is_generating, file_list, import paddle.v2 as paddle
**kwargs):
# job_mode = 1: training 0: generation
settings.job_mode = not is_generating
def fun(dict_path): # load dictionary according to the path
out_dict = dict()
with open(dict_path, "r") as fin:
out_dict = {
line.strip(): line_count
for line_count, line in enumerate(fin)
}
return out_dict
settings.src_dict = fun(src_dict_path)
settings.trg_dict = fun(trg_dict_path)
if settings.job_mode: #training
settings.input_types = {
'source_language_word': #source language sequence
integer_value_sequence(len(settings.src_dict)),
'target_language_word': #target language sequence
integer_value_sequence(len(settings.trg_dict)),
'target_language_next_word': #target language next word sequence
integer_value_sequence(len(settings.trg_dict))
}
else: #generation
settings.input_types = {
'source_language_word': #source language sequence
integer_value_sequence(len(settings.src_dict)),
'sent_id': #source language sequence id
integer_value_sequence(len(open(file_list[0], "r").readlines()))
}
```
3. Use `process` function to open the file `file_name`, read each row of the file, convert the data to be compatible with `input_types`, and then use `yield` to return to PaddlePaddle process. More specifically
- add `<s>` to the beginning of each source language sequence and add `<e>` to the end, producing "source_language_word".
- add `<s>` to the beginning of each target language senquence, producing "target_language_word".
- add `<e>` to the end of each target language senquence, producing "target_language_next_word".
```python
def _get_ids(s, dictionary): # get the location of each word from the source language sequence in the dictionary
words = s.strip().split()
return [dictionary[START]] + \
[dictionary.get(w, UNK_IDX) for w in words] + \
[dictionary[END]]
@provider(init_hook=hook, pool_size=50000)
def process(settings, file_name):
with open(file_name, 'r') as f:
for line_count, line in enumerate(f):
line_split = line.strip().split('\t')
if settings.job_mode and len(line_split) != 2:
continue
src_seq = line_split[0]
src_ids = _get_ids(src_seq, settings.src_dict)
if settings.job_mode:
trg_seq = line_split[1]
trg_words = trg_seq.split()
trg_ids = [settings.trg_dict.get(w, UNK_IDX) for w in trg_words]
# sequence with length longer than 80 with be removed during training to avoid an overly deep RNN.
if len(src_ids) > 80 or len(trg_ids) > 80:
continue
trg_ids_next = trg_ids + [settings.trg_dict[END]]
trg_ids = [settings.trg_dict[START]] + trg_ids
yield {
'source_language_word': src_ids,
'target_language_word': trg_ids,
'target_language_next_word': trg_ids_next
}
else:
yield {'source_language_word': src_ids, 'sent_id': [line_count]}
```
Note: The size of the training data is 3.55G. For machines with limited memories, it is recommended to use `pool_size` to set the number of data samples stored in memory.
## Model Config
### Data Definition
1. Specify the path to data and source/target dictionaries. `is_generating` accepts argument passed from command lines and is used to denote whether the current configuration is for training (default) or generation. See [Usage and Resutls](#Usage and Results).
```python
import os
from paddle.trainer_config_helpers import *
data_dir = "./data/pre-wmt14" # data path # train with a single CPU
src_lang_dict = os.path.join(data_dir, 'src.dict') # path to the source language dictionary paddle.init(use_gpu=False, trainer_count=1)
trg_lang_dict = os.path.join(data_dir, 'trg.dict') # path to the target language dictionary ```
is_generating = get_config_arg("is_generating", bool, False) # config mode
```
2. Use `define_py_data_sources2` to get data from `dataprovider.py`, and use `args` variable to input the source/target language dicitonary path and config mode.
```python ### Define DataSet
if not is_generating:
train_list = os.path.join(data_dir, 'train.list')
test_list = os.path.join(data_dir, 'test.list')
else:
train_list = None
test_list = os.path.join(data_dir, 'gen.list')
define_py_data_sources2(
train_list,
test_list,
module="dataprovider",
obj="process",
args={
"src_dict_path": src_lang_dict, # source language dictionary path
"trg_dict_path": trg_lang_dict, # target language dictionary path
"is_generating": is_generating # config mode
})
```
### Algorithm Configuration We will define dictionary size, and create [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for WMT-14 dataset.
```python ```python
settings( # source and target dict dim.
learning_method = AdamOptimizer(), dict_size = 30000
batch_size = 50,
learning_rate = 5e-4) feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
``` ```
This tutorial will use the default SGD and Adam learning algorithm, with a learning rate of 5e-4. Note that the `batch_size = 50` denotes generating 50 sequence each time.
### Model Structure ### Model Configuration
1. Define some global variables 1. Define some global variables
```python ```python
source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary source_dict_dim = dict_size # source language dictionary size
target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of target language dictionary target_dict_dim = dict_size # destination language dictionary size
word_vector_dim = 512 # dimensionality of word vector word_vector_dim = 512 # word embedding dimension
encoder_size = 512 # dimensionality of the hidden state of encoder GRU encoder_size = 512 # hidden layer size of GRU in encoder
decoder_size = 512 # dimentionality of the hidden state of decoder GRU decoder_size = 512 # hidden layer size of GRU in decoder
if is_generating:
beam_size=3 # beam size for the beam search algorithm
max_length=250 # maximum length for the generated sentence
gen_trans_file = get_config_arg("gen_trans_file", str, None) # generate file
``` ```
2. Implement Encoder as follows: 1. Implement Encoder as follows:
1. Input is a sequence of words represented by an integer word index sequence. So we define data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`
2.1 Input one-hot vector representations $\mathbf{w}$ converted with `dataprovider.py` from the source language sentence
```python ```python
src_word_id = data_layer(name='source_language_word', size=source_dict_dim) src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
``` ```
2.2 Map the one-hot vector into a word vector $\mathbf{s}$ in a low-dimensional semantic space
1. Map the one-hot vector (represented by word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space
```python ```python
src_embedding = embedding_layer( src_embedding = paddle.layer.embedding(
input=src_word_id, input=src_word_id,
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_source_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
``` ```
2.3 Use bi-direcitonal GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
1. Use bi-direcitonal GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
```python ```python
src_forward = simple_gru(input=src_embedding, size=encoder_size) src_forward = paddle.networks.simple_gru(
src_backward = simple_gru( input=src_embedding, size=encoder_size)
input=src_embedding, size=encoder_size, reverse=True) src_backward = paddle.networks.simple_gru(
encoded_vector = concat_layer(input=[src_forward, src_backward]) input=src_embedding, size=encoder_size, reverse=True)
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
``` ```
3. Implement Attention-based Decoder as follows: 1. Implement Attention-based Decoder as follows:
3.1 Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network 1. Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network
```python ```python
with mixed_layer(size=decoder_size) as encoded_proj: with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += full_matrix_projection(input=encoded_vector) encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector)
``` ```
3.2 Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$ 1. Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
```python ```python
backward_first = first_seq(input=src_backward) backward_first = paddle.layer.first_seq(input=src_backward)
with mixed_layer( with paddle.layer.mixed(
size=decoder_size, size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
act=TanhActivation(), ) as decoder_boot: decoder_boot += paddle.layer.full_matrix_projection(
decoder_boot += full_matrix_projection(input=backward_first) input=backward_first)
``` ```
3.3 Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word. 1. Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word.
- decoder_mem records the hidden state $z_i$ from the previous time step, with an initial state as decoder_boot. - decoder_mem records the hidden state $z_i$ from the previous time step, with an initial state as decoder_boot.
- context is computed via `simple_attention` as $c_i=\sum {j=1}^{T}a_{ij}h_j$, where enc_vec is the projection of $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`. - context is computed via `simple_attention` as $c_i=\sum {j=1}^{T}a_{ij}h_j$, where enc_vec is the projection of $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`.
...@@ -556,184 +344,146 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn ...@@ -556,184 +344,146 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn
- Softmax normalization is used in the end to computed the probability of words, i.e., $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. The output is returned. - Softmax normalization is used in the end to computed the probability of words, i.e., $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. The output is returned.
```python ```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word): def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
decoder_mem = memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot) decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = simple_attention(
encoded_sequence=enc_vec, context = paddle.networks.simple_attention(
encoded_proj=enc_proj, encoded_sequence=enc_vec,
decoder_state=decoder_mem, ) encoded_proj=enc_proj,
decoder_state=decoder_mem)
with mixed_layer(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += full_matrix_projection(input=context) with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += full_matrix_projection(input=current_word) decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
gru_step = gru_step_layer( input=current_word)
name='gru_decoder',
input=decoder_inputs, gru_step = paddle.layer.gru_step(
output_mem=decoder_mem, name='gru_decoder',
size=decoder_size) input=decoder_inputs,
output_mem=decoder_mem,
with mixed_layer( size=decoder_size)
size=target_dict_dim, bias_attr=True,
act=SoftmaxActivation()) as out: with paddle.layer.mixed(
out += full_matrix_projection(input=gru_step) size=target_dict_dim,
return out bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
``` ```
4. Decoder differences between the training and generation 1. Define the name for the decoder and the first two input for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
4.1 Define the name for the decoder and the first two input for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details. ```python
decoder_group_name = "decoder_group"
```python group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
decoder_group_name = "decoder_group" group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_input1 = StaticInput(input=encoded_vector, is_seq=True) group_inputs = [group_input1, group_input2]
group_input2 = StaticInput(input=encoded_proj, is_seq=True) ```
group_inputs = [group_input1, group_input2]
```
4.2 In training mode: 1. Training mode:
- word embedding from the target langauge trg_embedding is passed to `gru_decoder_with_attention` as current_word. - word embedding from the target language trg_embedding is passed to `gru_decoder_with_attention` as current_word.
- `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way - `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way
- the sequence of next words from the target language is used as label (lbl) - the sequence of next words from the target language is used as label (lbl)
- multi-class cross-entropy (`classification_cost`) is used to calculate the cost - multi-class cross-entropy (`classification_cost`) is used to calculate the cost
```python ```python
if not is_generating: trg_embedding = paddle.layer.embedding(
trg_embedding = embedding_layer( input=paddle.layer.data(
input=data_layer( name='target_language_word',
name='target_language_word', size=target_dict_dim), type=paddle.data_type.integer_value_sequence(target_dict_dim)),
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_target_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
group_inputs.append(trg_embedding) group_inputs.append(trg_embedding)
decoder = recurrent_group( # For decoder equipped with attention mechanism, in training,
name=decoder_group_name, # target embeding (the groudtruth) is the data input,
step=gru_decoder_with_attention, # while encoded source sequence is accessed to as an unbounded memory.
input=group_inputs) # Here, the StaticInput defines a read-only memory
# for the recurrent_group.
lbl = data_layer(name='target_language_next_word', size=target_dict_dim) decoder = paddle.layer.recurrent_group(
cost = classification_cost(input=decoder, label=lbl) name=decoder_group_name,
outputs(cost) step=gru_decoder_with_attention,
``` input=group_inputs)
lbl = paddle.layer.data(
name='target_language_next_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim))
cost = paddle.layer.classification_cost(input=decoder, label=lbl)
```
4.3 In generation mode: Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
- during generation, as the decoder RNN will take the word vector generated from the previous time step as input, `GeneratedInput` is used to implement this automatically. Please refer to [GeneratedInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details. ### Create Parameters
- `beam_search` will call `gru_decoder_with_attention` to generate id
- `seqtext_printer_evaluator` outputs the generated sentence to `gen_trans_file` according to `trg_lang_dict`
```python Create every parameter that `cost` layer needs.
else:
trg_embedding = GeneratedInput(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
beam_gen = beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
seqtext_printer_evaluator(
input=beam_gen,
id_input=data_layer(
name="sent_id", size=1),
dict_file=trg_lang_dict,
result_file=gen_trans_file)
outputs(beam_gen)
```
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
```python
parameters = paddle.parameters.create(cost)
```
We can get parameter names. If the parameter name is not specified during model configuration, it will be generated.
```python
for param in parameters.keys():
print param
```
## Model Training ## Model Training
Training can be started with the following command: 1. Create trainer
```bash We need to tell trainer what to optimize, and how to optimize. Here trainer will optimize `cost` layer using stochastic gradient descent (SDG).
./train.sh
```
where `train.sh` contains
```bash ```python
paddle train \ optimizer = paddle.optimizer.Adam(learning_rate=1e-4)
--config='seqToseq_net.py' \ trainer = paddle.trainer.SGD(cost=cost,
--save_dir='model' \ parameters=parameters,
--use_gpu=false \ update_equation=optimizer)
--num_passes=16 \ ```
--show_parameter_stats_period=100 \
--trainer_count=4 \
--log_period=10 \
--dot_period=5 \
2>&1 | tee 'train.log'
```
- config: configuration file for the network
- save_dir: path to save the trained model
- use_gpu: whether to use GPU for training; CPU is used here
- num_passes: number of passes for training. In PaddlePaddle, one pass meansing one pass of complete training pass using all the data in the training set
- show_parameter_stats_period: here we show the statistics of parameters every 100 batches
- trainer_count: the number of CPU processes or GPU devices
- log_period: here we print log every 10 batches
- dot_period: we print one "." every 5 batches
The training loss will the printed every 10 batches, and you will see messages like those below:
```text
I0719 19:16:45.952062 15563 TrainerInternal.cpp:160] Batch=10 samples=500 AvgCost=198.475 CurrentCost=198.475 Eval: classification_error_evaluator=0.737155 CurrentEval: classification_error_evaluator=0.737155
I0719 19:17:56.707319 15563 TrainerInternal.cpp:160] Batch=20 samples=1000 AvgCost=157.479 CurrentCost=116.483 Eval: classification_error_evaluator=0.698392 CurrentEval: classification_error_evaluator=0.659065
.....
```
- AvgCost: average cost from batch-0 to the current batch.
- CurrentCost: the cost for the current batch
- classification\_error\_evaluator (Eval): average error rate from evaluator-0 to the current evaluator for each word
- classification\_error\_evaluator (CurrentEval): error rate for the current evaluator for each word
The model training is successful when the classification\_error\_evaluator is lower than 0.35. 1. Define event handler
## Model Usage The event handler is a callback function invoked by trainer when an event happens. Here we will print log in event handler.
### Download Pre-trained Model ```python
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
```
As the training of an NMT model is very time consuming, we provide a pre-trained model (pass-00012, ~205M). The model is trained with a cluster of 50 physical nodes (each node has two 6-core CPU). We trained 16 passes (taking about 5 days) with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. Run the following command to down load the model: 1. Start training
```bash
cd pretrained
./wmt14_model.sh
```
### Usage and Results ```python
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=10000,
feeding=feeding)
```
Run the following command to perform translation from French to English: ```text
Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
...
```
```bash The model training is successful when the `classification_error_evaluator` is lower than 0.35.
./gen.sh
``` ## Model Usage
where `gen.sh` contains:
### Download Pre-trained Model
As the training of an NMT model is very time consuming, we provide a pre-trained model (pass-00012, ~205M). The model is trained with a cluster of 50 physical nodes (each node has two 6-core CPU). We trained 16 passes (taking about 5 days) with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. Run the following command to download the model:
```bash ```bash
paddle train \ cd pretrained
--job=test \ ./wmt14_model.sh
--config='seqToseq_net.py' \
--save_dir='pretrained/wmt14_model' \
--use_gpu=true \
--num_passes=13 \
--test_pass=12 \
--trainer_count=1 \
--config_args=is_generating=1,gen_trans_file="gen_result" \
2>&1 | tee 'translation/gen.log'
``` ```
Parameters different training are listed as follows:
- job: set the mode as testing.
- save_dir: path to the pre-trained model.
- num_passes and test_pass: load the model parameters from pass $i\epsilon \left [ test\\_pass,num\\_passes-1 \right ]$. Here we only load `data/wmt14_model/pass-00012`.
- config_args: pass the self-defined command line parameters to model configuration. `is_generating=1` indicates generation mode and `gen_trans_file="gen_result"` represents the file generated.
For translation results please refer to [Illustrative Results](#Illustrative Results).
### BLEU Evaluation ### BLEU Evaluation
...@@ -759,7 +509,7 @@ BLEU = 26.92 ...@@ -759,7 +509,7 @@ BLEU = 26.92
## Summary ## Summary
End-to-end neural machine translation is a recently developed way to perform machine translations. In this chapter, we introduced the typical "Encoder-Decoder" framework and "attention" mechanism. Since NMT is a typical Sequence-to-Sequence (Seq2Seq) learning problem, tasks such as query rewriting, abstraction generation and single-turn dialogues can all be solved with the model presented in this chapter. End-to-end neural machine translation is a recently developed way to perform machine translations. In this chapter, we introduced the typical "Encoder-Decoder" framework and "attention" mechanism. Since NMT is a typical Sequence-to-Sequence (Seq2Seq) learning problem, tasks such as query rewriting, abstraction generation, and single-turn dialogues can all be solved with the model presented in this chapter.
## References ## References
......
...@@ -194,54 +194,8 @@ e_{ij}&=align(z_i,h_j)\\\\ ...@@ -194,54 +194,8 @@ e_{ij}&=align(z_i,h_j)\\\\
## 数据介绍 ## 数据介绍
### 下载与解压缩
本教程使用[WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/)数据集中的[bitexts(after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz)作为训练集,[dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz)作为测试集和生成集。 本教程使用[WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/)数据集中的[bitexts(after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz)作为训练集,[dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz)作为测试集和生成集。
在Linux下,只需简单地运行以下命令:
```bash
cd data
./wmt14_data.sh
```
得到的数据集`data/wmt14`包含如下三个文件夹:
<p align = "center">
<table>
<tr>
<td>文件夹名</td>
<td>法英平行语料文件</td>
<td>文件数</td>
<td>文件大小</td>
</tr>
<tr>
<td>train</td>
<td>ccb2_pc30.src, ccb2_pc30.trg, etc</td>
<td>12</td>
<td>3.55G</td>
</tr>
<tr>
<td>test</td>
<td>ntst1213.src, ntst1213.trg</td>
<td>2</td>
<td>1636k</td>
</tr>
</tr>
<tr>
<td>gen</td>
<td>ntst14.src, ntst14.trg</td>
<td>2</td>
<td>864k</td>
</tr>
</table>
</p>
- `XXX.src`是源法语文件,`XXX.trg`是目标英语文件,文件中的每行存放一个句子
- `XXX.src`和`XXX.trg`的行数一致,且两者任意第$i$行的句子之间都有着一一对应的关系。
### 数据预处理 ### 数据预处理
我们的预处理流程包括两步: 我们的预处理流程包括两步:
...@@ -298,17 +252,16 @@ wmt14_reader = paddle.batch( ...@@ -298,17 +252,16 @@ wmt14_reader = paddle.batch(
decoder_size = 512 # 解码器中的GRU隐层大小 decoder_size = 512 # 解码器中的GRU隐层大小
``` ```
2. 其次,实现编码器框架。分为三步: 1. 其次,实现编码器框架。分为三步:
2.1 将在dataset reader中生成的用每个单词在字典中的索引表示的源语言序列 1 输入是一个文字序列,被表示成整型的序列。序列中每个元素是文字在字典中的索引。所以,我们定义数据层的数据类型为`integer_value_sequence`(整型序列),序列中每个元素的范围是`[0, source_dict_dim)`。
转换成one-hot vector表示的源语言序列$\mathbf{w}$,其类型为integer_value_sequence。
```python ```python
src_word_id = paddle.layer.data( src_word_id = paddle.layer.data(
name='source_language_word', name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim)) type=paddle.data_type.integer_value_sequence(source_dict_dim))
``` ```
2.2 将上述编码映射到低维语言空间的词向量$\mathbf{s}$。 1. 将上述编码映射到低维语言空间的词向量$\mathbf{s}$。
```python ```python
src_embedding = paddle.layer.embedding( src_embedding = paddle.layer.embedding(
...@@ -316,7 +269,7 @@ wmt14_reader = paddle.batch( ...@@ -316,7 +269,7 @@ wmt14_reader = paddle.batch(
size=word_vector_dim, size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_source_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
``` ```
2.3 用双向GRU编码源语言序列,拼接两个GRU的编码结果得到$\mathbf{h}$。 1. 用双向GRU编码源语言序列,拼接两个GRU的编码结果得到$\mathbf{h}$。
```python ```python
src_forward = paddle.networks.simple_gru( src_forward = paddle.networks.simple_gru(
...@@ -326,16 +279,17 @@ wmt14_reader = paddle.batch( ...@@ -326,16 +279,17 @@ wmt14_reader = paddle.batch(
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward]) encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
``` ```
3. 接着,定义基于注意力机制的解码器框架。分为三步: 1. 接着,定义基于注意力机制的解码器框架。分为三步:
3.1 对源语言序列编码后的结果(见2.3),过一个前馈神经网络(Feed Forward Neural Network),得到其映射。 1. 对源语言序列编码后的结果(见2.3),过一个前馈神经网络(Feed Forward Neural Network),得到其映射。
```python ```python
with paddle.layer.mixed(size=decoder_size) as encoded_proj: with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += paddle.layer.full_matrix_projection( encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector) input=encoded_vector)
``` ```
3.2 构造解码器RNN的初始状态。由于解码器需要预测时序目标序列,但在0时刻并没有初始值,所以我们希望对其进行初始化。这里采用的是将源语言序列逆序编码后的最后一个状态进行非线性映射,作为该初始值,即$c_0=h_T$。
1. 构造解码器RNN的初始状态。由于解码器需要预测时序目标序列,但在0时刻并没有初始值,所以我们希望对其进行初始化。这里采用的是将源语言序列逆序编码后的最后一个状态进行非线性映射,作为该初始值,即$c_0=h_T$。
```python ```python
backward_first = paddle.layer.first_seq(input=src_backward) backward_first = paddle.layer.first_seq(input=src_backward)
...@@ -344,7 +298,8 @@ wmt14_reader = paddle.batch( ...@@ -344,7 +298,8 @@ wmt14_reader = paddle.batch(
decoder_boot += paddle.layer.full_matrix_projection( decoder_boot += paddle.layer.full_matrix_projection(
input=backward_first) input=backward_first)
``` ```
3.3 定义解码阶段每一个时间步的RNN行为,即根据当前时刻的源语言上下文向量$c_i$、解码器隐层状态$z_i$和目标语言中第$i$个词$u_i$,来预测第$i+1$个词的概率$p_{i+1}$。
1. 定义解码阶段每一个时间步的RNN行为,即根据当前时刻的源语言上下文向量$c_i$、解码器隐层状态$z_i$和目标语言中第$i$个词$u_i$,来预测第$i+1$个词的概率$p_{i+1}$。
- decoder_mem记录了前一个时间步的隐层状态$z_i$,其初始状态是decoder_boot。 - decoder_mem记录了前一个时间步的隐层状态$z_i$,其初始状态是decoder_boot。
- context通过调用`simple_attention`函数,实现公式$c_i=\sum {j=1}^{T}a_{ij}h_j$。其中,enc_vec是$h_j$,enc_proj是$h_j$的映射(见3.1),权重$a_{ij}$的计算已经封装在`simple_attention`函数中。 - context通过调用`simple_attention`函数,实现公式$c_i=\sum {j=1}^{T}a_{ij}h_j$。其中,enc_vec是$h_j$,enc_proj是$h_j$的映射(见3.1),权重$a_{ij}$的计算已经封装在`simple_attention`函数中。
- decoder_inputs融合了$c_i$和当前目标词current_word(即$u_i$)的表示。 - decoder_inputs融合了$c_i$和当前目标词current_word(即$u_i$)的表示。
...@@ -381,24 +336,23 @@ wmt14_reader = paddle.batch( ...@@ -381,24 +336,23 @@ wmt14_reader = paddle.batch(
return out return out
``` ```
4. 训练模式与生成模式下的解码器调用区别。 1. 定义解码器框架名字,和`gru_decoder_with_attention`函数的前两个输入。注意:这两个输入使用`StaticInput`,具体说明可见[StaticInput文档](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入)。
4.1 定义解码器框架名字,和`gru_decoder_with_attention`函数的前两个输入。注意:这两个输入使用`StaticInput`,具体说明可见[StaticInput文档](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入)。
```python ```python
decoder_group_name = "decoder_group" decoder_group_name = "decoder_group"
group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True) group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True) group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2] group_inputs = [group_input1, group_input2]
``` ```
4.2 训练模式下的解码器调用:
- 首先,将目标语言序列的词向量trg_embedding,直接作为训练模式下的current_word传给`gru_decoder_with_attention`函数。 1. 训练模式下的解码器调用:
- 其次,使用`recurrent_group`函数循环调用`gru_decoder_with_attention`函数。
- 接着,使用目标语言的下一个词序列作为标签层lbl,即预测目标词。
- 最后,用多类交叉熵损失函数`classification_cost`来计算损失值。
```python - 首先,将目标语言序列的词向量trg_embedding,直接作为训练模式下的current_word传给`gru_decoder_with_attention`函数。
- 其次,使用`recurrent_group`函数循环调用`gru_decoder_with_attention`函数。
- 接着,使用目标语言的下一个词序列作为标签层lbl,即预测目标词。
- 最后,用多类交叉熵损失函数`classification_cost`来计算损失值。
```python
trg_embedding = paddle.layer.embedding( trg_embedding = paddle.layer.embedding(
input=paddle.layer.data( input=paddle.layer.data(
name='target_language_word', name='target_language_word',
...@@ -421,7 +375,8 @@ wmt14_reader = paddle.batch( ...@@ -421,7 +375,8 @@ wmt14_reader = paddle.batch(
name='target_language_next_word', name='target_language_next_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim)) type=paddle.data_type.integer_value_sequence(target_dict_dim))
cost = paddle.layer.classification_cost(input=decoder, label=lbl) cost = paddle.layer.classification_cost(input=decoder, label=lbl)
``` ```
注意:我们提供的配置在Bahdanau的论文\[[4](#参考文献)\]上做了一些简化,可参考[issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133)。 注意:我们提供的配置在Bahdanau的论文\[[4](#参考文献)\]上做了一些简化,可参考[issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133)。
### 参数定义 ### 参数定义
...@@ -429,7 +384,6 @@ wmt14_reader = paddle.batch( ...@@ -429,7 +384,6 @@ wmt14_reader = paddle.batch(
首先依据模型配置的`cost`定义模型参数。 首先依据模型配置的`cost`定义模型参数。
```python ```python
# create parameters
parameters = paddle.parameters.create(cost) parameters = paddle.parameters.create(cost)
``` ```
...@@ -453,7 +407,7 @@ for param in parameters.keys(): ...@@ -453,7 +407,7 @@ for param in parameters.keys():
update_equation=optimizer) update_equation=optimizer)
``` ```
2. 构造event_handler 1. 构造event_handler
可以通过自定义回调函数来评估训练过程中的各种状态,比如错误率等。下面的代码通过event.batch_id % 10 == 0 指定没10个batch打印一次日志,包含cost等信息。 可以通过自定义回调函数来评估训练过程中的各种状态,比如错误率等。下面的代码通过event.batch_id % 10 == 0 指定没10个batch打印一次日志,包含cost等信息。
```python ```python
...@@ -464,7 +418,7 @@ for param in parameters.keys(): ...@@ -464,7 +418,7 @@ for param in parameters.keys():
event.pass_id, event.batch_id, event.cost, event.metrics) event.pass_id, event.batch_id, event.cost, event.metrics)
``` ```
3. 启动训练: 1. 启动训练:
```python ```python
trainer.train( trainer.train(
...@@ -480,23 +434,19 @@ for param in parameters.keys(): ...@@ -480,23 +434,19 @@ for param in parameters.keys():
... ...
``` ```
当`classification_error_evaluator`的值低于0.35的时候,表示训练成功。
## 应用模型 ## 应用模型
### 下载预训练的模型 ### 下载预训练的模型
由于NMT模型的训练非常耗时,我们在50个物理节点(每节点含有2颗6核CPU)的集群中,花了5天时间训练了16个pass,其中每个pass耗时7个小时。因此,我们提供了一个预先训练好的模型(pass-00012)供大家直接下载使用。该模型大小为205MB,在所有16个模型中有最高的[BLEU评估](#BLEU评估)值26.92。下载并解压模型的命令如下: 由于NMT模型的训练非常耗时,我们在50个物理节点(每节点含有2颗6核CPU)的集群中,花了5天时间训练了16个pass,其中每个pass耗时7个小时。因此,我们提供了一个预先训练好的模型(pass-00012)供大家直接下载使用。该模型大小为205MB,在所有16个模型中有最高的[BLEU评估](#BLEU评估)值26.92。下载并解压模型的命令如下:
```bash ```bash
cd pretrained cd pretrained
./wmt14_model.sh ./wmt14_model.sh
``` ```
### 应用命令与结果
新版api尚未支持机器翻译的翻译过程,尽请期待。
翻译结果请见[效果展示](#效果展示)。
### BLEU评估 ### BLEU评估
BLEU(Bilingual Evaluation understudy)是一种广泛使用的机器翻译自动评测指标,由IBM的watson研究中心于2002年提出\[[5](#参考文献)\],基本出发点是:机器译文越接近专业翻译人员的翻译结果,翻译系统的性能越好。其中,机器译文与人工参考译文之间的接近程度,采用句子精确度(precision)的计算方法,即比较两者的n元词组相匹配的个数,匹配的个数越多,BLEU得分越好。 BLEU(Bilingual Evaluation understudy)是一种广泛使用的机器翻译自动评测指标,由IBM的watson研究中心于2002年提出\[[5](#参考文献)\],基本出发点是:机器译文越接近专业翻译人员的翻译结果,翻译系统的性能越好。其中,机器译文与人工参考译文之间的接近程度,采用句子精确度(precision)的计算方法,即比较两者的n元词组相匹配的个数,匹配的个数越多,BLEU得分越好。
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册