Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder).
## Data Preparation
### Download and Uncompression
This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as the test and generation sets.
Run the following command in Linux to obtain the data:
```bash
cd data
./wmt14_data.sh
```
There are three folders in the downloaded dataset `data/wmt14`:
<p align="center">
<table>
<tr>
<td>Folder Name</td>
<td>French-English Parallel Corpus</td>
<td>Number of Files</td>
<td>Size of Files</td>
</tr>
<tr>
<td>train</td>
<td>ccb2_pc30.src, ccb2_pc30.trg, etc</td>
<td>12</td>
<td>3.55G</td>
</tr>
<tr>
<td>test</td>
<td>ntst1213.src, ntst1213.trg</td>
<td>2</td>
<td>1636k</td>
</tr>
<tr>
<td>gen</td>
<td>ntst14.src, ntst14.trg</td>
<td>2</td>
<td>864k</td>
</tr>
</table>
</p>
- `XXX.src` is the source file in French and `XXX.trg` is the target file in English. Each row of the file contains one sentence.
- `XXX.src` and `XXX.trg` have the same number of rows, and there is a one-to-one correspondence between the sentences at any row of the two files.
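The following snippet is a quick sanity check, not part of the tutorial scripts; the file names are taken from the table above.

```python
# Verify that a source/target file pair has the same number of lines,
# i.e. that the sentences are aligned one-to-one.
with open("data/wmt14/train/ccb2_pc30.src") as src_f, \
        open("data/wmt14/train/ccb2_pc30.trg") as trg_f:
    assert sum(1 for _ in src_f) == sum(1 for _ in trg_f)
```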
### User Defined Dataset (Optional)
To use your own dataset, just put it under the `data` folder and organize it as follows:
```text
user_dataset
├── train
│ ├── train_file1.src
│ ├── train_file1.trg
│ └── ...
├── test
│ ├── test_file1.src
│ ├── test_file1.trg
│ └── ...
├── gen
│ ├── gen_file1.src
│ ├── gen_file1.trg
│ └── ...
```
Explanation of the directories:
- First level: `user_dataset`: the name of the user defined dataset.
- Second level: `train`, `test` and `gen`: these names should not be changed.
- Third level: parallel corpus files in the source and target languages, with the suffixes `.src` and `.trg` respectively.
### Data Preprocessing
There are two steps for pre-processing:
- Merge the source and target parallel corpus files into one file
...
...
@@ -306,248 +239,103 @@ There are two steps for pre-processing:
- Create a source dictionary and a target dictionary, each containing **DICTSIZE** words, including the (DICTSIZE - 3) most frequent words from the corpus and 3 special tokens `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown words that are not in the vocabulary). A sketch of this step follows.
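The sketch below illustrates how such a dictionary can be built; it is only illustrative, and `preprocess.py` remains the authoritative implementation. The id order (`<s>`=0, `<e>`=1, `<unk>`=2) follows the token definitions used later in this tutorial.

```python
from collections import Counter

def build_dict(corpus_lines, dict_size):
    # count word frequencies over the whole corpus
    counter = Counter(w for line in corpus_lines for w in line.strip().split())
    # keep the (dict_size - 3) most frequent words
    most_frequent = [w for w, _ in counter.most_common(dict_size - 3)]
    # 3 special tokens come first, then the frequent words
    return ['<s>', '<e>', '<unk>'] + most_frequent
```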
`preprocess.py` is used for pre-processing:
```bash
python preprocess.py -i INPUT [-d DICTSIZE] [-m]
```
- `-i INPUT`: path to the original dataset.
- `-d DICTSIZE`: number of words in the dictionary. If unspecified, the dictionary will contain all the words that appear in the input dataset.
- `-m --mergeDict`: merge the source dictionary with the target dictionary, making the two dictionaries have the same content.
The specific command to run the script is as follows:
```bash
python preprocess.py -i data/wmt14 -d 30000
```
You will see the following messages after a few minutes:
```text
concat parallel corpora for dataset
build source dictionary for train data
build target dictionary for train data
dictionary size is 30000
```
The pre-processed data is located at `data/pre-wmt14`:
```text
pre-wmt14
├── train
│ └── train
├── test
│ └── test
├── gen
│ └── gen
├── train.list
├── test.list
├── gen.list
├── src.dict
└── trg.dict
```
- `train`, `test` and `gen`: contain the French-English parallel corpora for training, testing and generation respectively. Each row of each file is separated into two columns with a "\t", where the first column is the sequence in French and the second one is in English.
- `train.list`, `test.list` and `gen.list`: record the paths to the `train`, `test` and `gen` folders respectively.
- `src.dict` and `trg.dict`: source (French) and target (English) dictionary. Each dictionary contains 30000 words (29997 most frequent words and 3 special tokens).
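The helper below is not part of the tutorial scripts; it assumes `src.dict`/`trg.dict` store one word per line, so that the line number is the word id.

```python
def load_dict(dict_path):
    # map each word to its line number (= word id)
    with open(dict_path) as f:
        return {line.strip(): idx for idx, line in enumerate(f)}

src_dict = load_dict("data/pre-wmt14/src.dict")
print len(src_dict)  # expected: 30000
```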
### A Subset of Dataset

Because the full dataset is very big, the PaddlePaddle package `paddle.dataset.wmt14` provides a preprocessed [subset of the dataset](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz) to reduce the download time.

This subset has 193319 instances of training data and 6003 instances of test data. The dictionary size is 30000. Because of the limited size of the subset, the effectiveness of a model trained on it is not guaranteed.

### Providing Data to PaddlePaddle

We use `dataprovider.py` to provide data to PaddlePaddle as follows:

1. Import the PyDataProvider2 package from PaddlePaddle and define three special tokens:
```python
from paddle.trainer.PyDataProvider2 import *
UNK_IDX = 2 #out of vocabulary word
START = "<s>" #begin of sequence
END = "<e>" #end of sequence
```
2. Use the initialization function `hook` to define the input data types (`input_types`) for training and generation:
- Training: there are three input sequences, where the "source language sequence" and "target language sequence" are the inputs and the "target language next word sequence" is the label.
- Generation: there are two input sequences, where the "source language sequence" is the input and the "source language sequence id" holds the ids for the input data (optional).
`src_dict_path` in the `hook` function is the path to the source language dictionary, while `trg_dict_path` is the path to the target language dictionary. `is_generating` is passed from the model config file. For more details on the usage of the `hook` function please refer to [Model Config](#Model Config).
3. Use the `process` function to open the file `file_name`, read each row of the file, convert the data to be compatible with `input_types`, and then use `yield` to return the data to PaddlePaddle. More specifically:
- add `<s>` to the beginning of each source language sequence and `<e>` to the end, producing "source_language_word";
- add `<s>` to the beginning of each target language sequence, producing "target_language_word";
- add `<e>` to the end of each target language sequence, producing "target_language_next_word".
```python
def _get_ids(s, dictionary):  # get the index of each word of the source language sequence in the dictionary
    words = s.strip().split()
    return [dictionary[START]] + \
           [dictionary.get(w, UNK_IDX) for w in words] + \
           [dictionary[END]]


@provider(init_hook=hook, pool_size=50000)
def process(settings, file_name):
    with open(file_name, 'r') as f:
        for line_count, line in enumerate(f):
            line_split = line.strip().split('\t')
            if settings.job_mode and len(line_split) != 2:
                continue
            src_seq = line_split[0]
            src_ids = _get_ids(src_seq, settings.src_dict)
            if settings.job_mode:
                trg_seq = line_split[1]
                trg_words = trg_seq.split()
                trg_ids = [settings.trg_dict.get(w, UNK_IDX) for w in trg_words]
                # sequences longer than 80 words are removed during training to avoid an overly deep RNN
```
Note: The size of the training data is 3.55G. For machines with limited memory, it is recommended to use `pool_size` to set the number of data samples kept in memory.
## Model Config
### Data Definition
1. Specify the path to the data and the source/target dictionaries. `is_generating` accepts an argument passed from the command line and denotes whether the current configuration is for training (default) or generation. See [Usage and Results](#Usage and Results).

```python
import os
from paddle.trainer_config_helpers import *

data_dir = "./data/pre-wmt14"  # data path
src_lang_dict = os.path.join(data_dir, 'src.dict')  # path to the source language dictionary
trg_lang_dict = os.path.join(data_dir, 'trg.dict')  # path to the target language dictionary
```

With the v2 API, PaddlePaddle is initialized first:

```python
import paddle.v2 as paddle

# train with a single CPU
paddle.init(use_gpu=False, trainer_count=1)
```

2. Use `define_py_data_sources2` to get data from `dataprovider.py`, and use the `args` variable to pass in the source/target language dictionary paths and the config mode.
```python
if not is_generating:
    train_list = os.path.join(data_dir, 'train.list')
    test_list = os.path.join(data_dir, 'test.list')
else:
    train_list = None
    test_list = os.path.join(data_dir, 'gen.list')

define_py_data_sources2(
    train_list,
    test_list,
    module="dataprovider",
    obj="process",
    args={
        "src_dict_path": src_lang_dict,  # source language dictionary path
        "trg_dict_path": trg_lang_dict,  # target language dictionary path
        "is_generating": is_generating  # config mode
    })
```
### Define DataSet

We will define the dictionary size and create a [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for the WMT-14 dataset, as sketched below.
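A minimal sketch of such a reader with the v2 API follows; the shuffle buffer and batch size are illustrative values, not prescribed by the tutorial.

```python
dict_size = 30000  # dictionary size used throughout this tutorial

# shuffled, batched reader over the preprocessed WMT-14 subset
wmt14_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.wmt14.train(dict_size), buf_size=8192),
    batch_size=5)
```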
### Algorithm Configuration

This tutorial uses the default SGD and Adam learning algorithm, with a learning rate of 5e-4. Note that `batch_size = 50` denotes generating 50 sequences each time.

### Model Configuration
1. Define some global variables
```python
source_dict_dim = len(open(src_lang_dict, "r").readlines())  # size of the source language dictionary
target_dict_dim = len(open(trg_lang_dict, "r").readlines())  # size of the target language dictionary
word_vector_dim = 512  # dimensionality of the word vectors
encoder_size = 512     # dimensionality of the hidden state of the encoder GRU
decoder_size = 512     # dimensionality of the hidden state of the decoder GRU

if is_generating:
    beam_size = 3      # beam size for the beam search algorithm
    max_length = 250   # maximum length of the generated sentence

# with the preprocessed paddle.dataset.wmt14 subset, the dictionary sizes
# can instead be set directly from dict_size:
source_dict_dim = dict_size  # source language dictionary size
target_dict_dim = dict_size  # destination language dictionary size
```
2. Implement the encoder as follows:
- The input is a sequence of words represented by an integer word-index sequence (the one-hot representations are converted by `dataprovider.py` from the source language sentence), so we define a data layer of type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`. A sketch of the encoder is shown below.
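A sketch of this encoder with the v2 layer API follows; it assumes the global variables defined above and is not the complete model configuration.

```python
# word ids -> embeddings -> bidirectional GRU -> concatenated annotations h_j
src_word_id = paddle.layer.data(
    name='source_language_word',
    type=paddle.data_type.integer_value_sequence(source_dict_dim))
src_embedding = paddle.layer.embedding(input=src_word_id, size=word_vector_dim)

src_forward = paddle.networks.simple_gru(input=src_embedding, size=encoder_size)
src_backward = paddle.networks.simple_gru(
    input=src_embedding, size=encoder_size, reverse=True)
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
```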
3.2 Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN, $c_0=h_T$.
3.3 Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, the hidden state of the decoder $z_i$ and the $i$-th word $u_i$ in the target language, predict the probability $p_{i+1}$ for the $(i+1)$-th word:
- `decoder_mem` records the hidden state $z_i$ from the previous time step, with an initial state `decoder_boot`.
- `context` is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where `enc_vec` is the projection of $h_j$ and `enc_proj` is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`. A sketch of these two steps is shown below.
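The fragment below sketches how these two bullets map onto the decoder step function; `enc_vec`, `enc_proj` and `decoder_boot` are assumed to be defined in the elided parts of this section, so this is not a complete definition.

```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
    # decoder_mem: hidden state z_i from the previous time step,
    # bootstrapped with decoder_boot at the first step
    decoder_mem = paddle.layer.memory(
        name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
    # context: c_i = sum_j a_ij * h_j, computed by simple_attention
    context = paddle.networks.simple_attention(
        encoded_sequence=enc_vec,
        encoded_proj=enc_proj,
        decoder_state=decoder_mem)
    # ... (GRU step and softmax output, as described below)
```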
...
...
- Softmax normalization is used in the end to compute the probability of words, i.e., $p\left ( u_i|u_{<i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. The output is returned.
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
```
4. Decoder differences between training and generation
4.1 Define the name for the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for these two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
- During generation, as the decoder RNN takes the word vector generated in the previous time step as input, `GeneratedInput` is used to implement this automatically. Please refer to [GeneratedInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
- `beam_search` will call `gru_decoder_with_attention` to generate ids.
- `seqtext_printer_evaluator` outputs the generated sentences to `gen_trans_file` according to `trg_lang_dict`.
```python
else:
    trg_embedding = GeneratedInput(
        size=target_dict_dim,
        embedding_name='_target_language_embedding',
        embedding_size=word_vector_dim)
    group_inputs.append(trg_embedding)

    beam_gen = beam_search(
        name=decoder_group_name,
        step=gru_decoder_with_attention,
        input=group_inputs,
        bos_id=0,
        eos_id=1,
        beam_size=beam_size,
        max_length=max_length)

    seqtext_printer_evaluator(
        input=beam_gen,
        id_input=data_layer(name="sent_id", size=1),
        dict_file=trg_lang_dict,
        result_file=gen_trans_file)

    outputs(beam_gen)
```
### Create Parameters
Create every parameter that the `cost` layer needs.
```python
parameters = paddle.parameters.create(cost)
```
We can get parameter names. If the parameter name is not specified during model configuration, it will be generated.
```python
for param in parameters.keys():
    print param
```
## Model Training
1. Create trainer

We need to tell the trainer what to optimize, and how to optimize. Here the trainer will optimize the `cost` layer using stochastic gradient descent (SGD); a minimal sketch of creating it with the v2 API is shown just before the "Start training" step below.

Alternatively, training can be started with the following command:
```bash
./train.sh
```
where `train.sh` contains
```bash
paddle train \
--config='seqToseq_net.py' \
--save_dir='model' \
--use_gpu=false \
--num_passes=16 \
--show_parameter_stats_period=100 \
--trainer_count=4 \
--log_period=10 \
--dot_period=5 \
2>&1 | tee 'train.log'
```
- config: configuration file for the network
- save_dir: path to save the trained model
- use_gpu: whether to use GPU for training; CPU is used here
- num_passes: number of passes for training. In PaddlePaddle, one pass means one complete training pass over all the data in the training set
- show_parameter_stats_period: here we show the statistics of parameters every 100 batches
- trainer_count: the number of CPU processes or GPU devices
- log_period: here we print log every 10 batches
- dot_period: we print one "." every 5 batches
The training loss will be printed every 10 batches, and you will see messages like those below.

As the training of an NMT model is very time consuming, we provide a pre-trained model (pass-00012, ~205M). The model was trained on a cluster of 50 physical nodes (each node has two 6-core CPUs). We trained 16 passes (taking about 5 days), with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. Run the following command to download the model:
```bash
cd pretrained
./wmt14_model.sh
```
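A minimal sketch of creating the trainer mentioned in step 1 above, assuming the `cost` layer and `parameters` from the previous sections; the optimizer settings are illustrative.

```python
optimizer = paddle.optimizer.Adam(learning_rate=5e-4)

trainer = paddle.trainer.SGD(
    cost=cost, parameters=parameters, update_equation=optimizer)
```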
1. Start training
```python
trainer.train(
    reader=wmt14_reader,
    event_handler=event_handler,
    num_passes=10000,
    feeding=feeding)
```
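The `feeding` map and `event_handler` passed to `trainer.train` above are not shown in this excerpt; one possible sketch is given below, with the feeding indices following the three input sequences produced by the data provider.

```python
feeding = {
    'source_language_word': 0,
    'target_language_word': 1,
    'target_language_next_word': 2
}

def event_handler(event):
    # print the cost every 10 batches
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 10 == 0:
            print "Pass %d, Batch %d, Cost %f" % (
                event.pass_id, event.batch_id, event.cost)
```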
The model training is successful when the `classification_error_evaluator` is lower than 0.35.

### Usage and Results

Run the following command to perform translation from French to English:
## Model Usage
### Download Pre-trained Model
As the training of an NMT model is very time consuming, we provide a pre-trained model (pass-00012, ~205M). The model was trained on a cluster of 50 physical nodes (each node has two 6-core CPUs). We trained 16 passes (taking about 5 days), with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. It can be downloaded with the `wmt14_model.sh` command shown under [Model Training](#Model Training) above.
Parameters that differ from training are listed as follows:
- job: set the mode as testing.
- save_dir: path to the pre-trained model.
- num_passes and test_pass: load the model parameters from pass $i\in \left [ test\\_pass, num\\_passes-1 \right ]$. Here we only load `data/wmt14_model/pass-00012`.
- config_args: pass self-defined command line parameters to the model configuration. `is_generating=1` indicates generation mode and `gen_trans_file="gen_result"` specifies the file to which the generated results are written.
For translation results please refer to [Illustrative Results](#Illustrative Results).
### BLEU Evaluation
...
...
BLEU = 26.92
## Summary
End-to-end neural machine translation is a recently developed way to perform machine translations. In this chapter, we introduced the typical "Encoder-Decoder" framework and "attention" mechanism. Since NMT is a typical Sequence-to-Sequence (Seq2Seq) learning problem, tasks such as query rewriting, abstraction generation, and single-turn dialogues can all be solved with the model presented in this chapter.