This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
3. Eight LSTM units are trained in alternating "forward / backward" order.
```python
hidden_0 = paddle.layer.mixed(
...
...
for i in range(1, depth):
    input_tmp = [mix_hidden, lstm]
```
4. We concatenate the output of the top LSTM unit with its input and project the result into a hidden layer. A fully connected layer is then stacked on top to obtain the final vector representation.
```python
feature_out = paddle.layer.mixed(
...
...
], )
```
5. We use CRF as the cost function; the parameter of the CRF cost layer is named `crfw`.
```python
crf_cost = paddle.layer.crf(
...
...
        learning_rate=mix_hidden_lr))
```
6. The CRF decoding layer is used for evaluation and inference. It shares its parameters with the CRF layer; parameter sharing among multiple layers is specified by giving those layers the same parameter name.
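A minimal sketch of how this sharing might look with the `paddle.layer` API used above (assuming `feature_out` and a label dictionary size `label_dict_len` from the earlier steps; the exact arguments are illustrative):
```python
# Sketch: the decoding layer reuses the CRF transition weights by naming its
# parameter 'crfw', the same name given to the parameter of crf_cost above.
predict = paddle.layer.crf_decoding(
    size=label_dict_len,                        # assumed: number of SRL labels
    input=feature_out,
    param_attr=paddle.attr.Param(name='crfw'))
```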
This tutorial uses the default SGD-based Adam learning algorithm with a learning rate of 5e-4. Note that `batch_size = 50` means each batch contains 50 sequences.
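A minimal sketch of the corresponding optimizer configuration in the old-style trainer config helpers used by the code below (values taken from the text; the exact call is illustrative):
```python
from paddle.trainer_config_helpers import *

# Illustrative settings matching the text: Adam, learning rate 5e-4,
# and batches of 50 sequences.
settings(
    batch_size=50,
    learning_rate=5e-4,
    learning_method=AdamOptimizer())
```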
### Model Structure
1. Define some global variables
```python
...
...
with mixed_layer(size=decoder_size) as encoded_proj:
    ...
```
3.2 Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN, i.e., $c_0 = h_T$ (a sketch follows the code block below).
```python
...
...
```
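A sketch of how this initialization can be expressed with the same config helpers (assuming `src_backward` is the backward GRU sequence over the source sentence; the names are illustrative):
```python
# Sketch (assumed names): the first element of the backward GRU sequence
# corresponds to the last source word; a tanh projection of it serves as the
# decoder's initial state c_0.
backward_first = first_seq(input=src_backward)
with mixed_layer(size=decoder_size, act=TanhActivation()) as decoder_boot:
    decoder_boot += full_matrix_projection(input=backward_first)
```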
3.3 Define the computation in each time step for the decoder RNN, i.e., given the current context vector $c_i$, the decoder hidden state $z_i$, and the $i$-th word $u_i$ of the target language, predict the probability $p_{i+1}$ of the $(i+1)$-th word (a condensed sketch of this step function follows the code block below).
- `decoder_mem` records the hidden state $z_i$ from the previous time step, initialized with `decoder_boot`.
```python
...
...
        out += full_matrix_projection(input=gru_step)
    return out
```
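A condensed sketch of such a step function, following the bullets above (the names `decoder_boot`, `decoder_size` and `target_dict_dim` are assumed from earlier steps; the tutorial's full version may differ in detail):
```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
    # Hidden state z_i from the previous step, bootstrapped with decoder_boot.
    decoder_mem = memory(
        name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)

    # Attention-weighted context vector c_i over the encoded source sequence.
    context = simple_attention(
        encoded_sequence=enc_vec,
        encoded_proj=enc_proj,
        decoder_state=decoder_mem)

    # Combine the context and the current target word, then advance the GRU
    # one step (a GRU step consumes 3 * size inputs for its gates).
    with mixed_layer(size=decoder_size * 3) as decoder_inputs:
        decoder_inputs += full_matrix_projection(input=context)
        decoder_inputs += full_matrix_projection(input=current_word)

    gru_step = gru_step_layer(
        name='gru_decoder',
        input=decoder_inputs,
        output_mem=decoder_mem,
        size=decoder_size)

    # Softmax over the target vocabulary gives p_{i+1}.
    with mixed_layer(
            size=target_dict_dim, act=SoftmaxActivation(),
            bias_attr=True) as out:
        out += full_matrix_projection(input=gru_step)
    return out
```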
4. Decoder differences between training and generation
4.1 Define the name of the decoder and the first two inputs to `gru_decoder_with_attention`. Note that `StaticInput` is used for both inputs; please refer to the [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details (a sketch follows the code block below).
```python
...
...
```
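A minimal sketch of what these two read-only inputs typically look like (assuming `encoded_vector` and `encoded_proj` are the encoder outputs defined earlier; the names are illustrative):
```python
# Sketch (assumed names): both encoder outputs are exposed to every decoder
# time step as read-only sequences via StaticInput.
decoder_group_name = "decoder_group"
group_input1 = StaticInput(input=encoded_vector, is_seq=True)
group_input2 = StaticInput(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2]
```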
4.2 In training mode:
- The word embedding of the target language, `trg_embedding`, is passed to `gru_decoder_with_attention` as `current_word` (see the sketch after the code block below).
```python
...
...
cost = classification_cost(input=decoder, label=lbl)
outputs(cost)
```
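The elided part of the block above unrolls the decoder over the target sequence; a minimal sketch of that wiring under the assumed names from the previous steps (not the tutorial's exact code):
```python
# Sketch (assumed names): append the target-word embedding and unroll the
# step function over the target sequence with recurrent_group.
group_inputs.append(trg_embedding)
decoder = recurrent_group(
    name=decoder_group_name,
    step=gru_decoder_with_attention,
    input=group_inputs)
```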
4.3 In generation mode:
- During generation, the decoder RNN takes the word vector generated at the previous time step as its input; `GeneratedInput` implements this automatically. Please refer to the [GeneratedInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
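A condensed sketch of how generation is typically wired with `GeneratedInput` and beam search (the hyper-parameters `beam_size` and `max_length`, and the start/end token ids, are assumptions; see the tutorial's full code for the exact version):
```python
# Sketch (assumed names): during generation, GeneratedInput feeds back the
# word predicted at the previous step as the current word embedding.
trg_embedding = GeneratedInput(
    size=target_dict_dim,
    embedding_name='_target_language_embedding',
    embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)

beam_gen = beam_search(
    name=decoder_group_name,
    step=gru_decoder_with_attention,
    input=group_inputs,
    bos_id=0,                 # assumed id of the start-of-sentence token
    eos_id=1,                 # assumed id of the end-of-sentence token
    beam_size=beam_size,
    max_length=max_length)
outputs(beam_gen)
```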
where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
For an $N$-class classification problem with $N$ output nodes, an $N$-dimensional vector is normalized into $N$ real values in the range $[0, 1]$, each representing the probability that the sample belongs to the corresponding class. Here $y_i$ is the prediction probability that an image is the digit $i$.
In such a classification problem, we usually use the cross-entropy loss function: $ L_{CE}(label, y) = -\sum_i label_i \log(y_i) $
Fig. 2 shows a softmax regression network, with weights in blue and biases in red. +1 indicates the bias is 1.
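As a quick numerical illustration (not part of the original tutorial), the softmax and cross-entropy definitions above can be checked with a few lines of NumPy:
```python
import numpy as np

# Raw scores for a 3-class example, and a one-hot label for class 1.
x = np.array([2.0, 1.0, 0.1])
label = np.array([0.0, 1.0, 0.0])

# softmax(x_i) = exp(x_i) / sum_j exp(x_j)
y = np.exp(x) / np.sum(np.exp(x))      # approx. [0.659, 0.242, 0.099]

# Cross-entropy loss: -sum_i label_i * log(y_i)
loss = -np.sum(label * np.log(y))
print(y, loss)
```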
...
...
The softmax regression model described above uses the simplest two-layer neural network, containing only an input layer and an output layer. To achieve better recognition results, we can add hidden layers between them:
1. After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function. Some common ones are sigmoid, tanh and ReLU.
2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
3. Finally, after the output layer, we get $Y=\text{softmax}(W_3H_2 + b_3)$, the final classification result vector.
Fig. 3 shows the Multilayer Perceptron network, with weights in blue and biases in red. +1 indicates the bias is 1.
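A minimal sketch of this multilayer perceptron in the `paddle.layer` API used earlier in this document (the hidden sizes 128 and 64 are illustrative; `img` is assumed to be the flattened 784-dimensional image input):
```python
import paddle.v2 as paddle

def multilayer_perceptron(img):
    # H1 and H2 with ReLU activations, followed by a 10-way softmax output Y.
    hidden1 = paddle.layer.fc(input=img, size=128, act=paddle.activation.Relu())
    hidden2 = paddle.layer.fc(input=hidden1, size=64, act=paddle.activation.Relu())
    predict = paddle.layer.fc(
        input=hidden2, size=10, act=paddle.activation.Softmax())
    return predict
```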