merge from paddlepaddle/book/develop

3743b3b2 · gongweibao · 03a01fa1 · fe3bc159
68 changed files
--- a/.gitignore
+++ b/.gitignore
+deprecated
+*~
 pandoc.template
 .DS_Store
\ No newline at end of file
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
-   repo: https://github.com/Lucas-C/pre-commit-hooks.git
-    sha: c25201a00e6b0514370501050cf2a8538ac12270
-    hooks:
-    -   id: remove-crlf
 -   repo: https://github.com/reyoung/mirrors-yapf.git
    sha: v0.13.2
    hooks:
    - id: yapf
      files: (.*\.(py|bzl)|BUILD|.*\.BUILD|WORKSPACE)$  # Bazel BUILD files follow Python syntax.
 -   repo: https://github.com/pre-commit/pre-commit-hooks
-    sha: 7539d8bd1a00a3c1bfd34cdb606d3a6372e83469
+    sha: v0.7.1
    hooks:
    -   id: check-merge-conflict
    -   id: check-symlinks
    -   id: detect-private-key
    -   id: end-of-file-fixer
+        files: \.md$
+    -   id: trailing-whitespace
+        files: \.md$
+-   repo: https://github.com/Lucas-C/pre-commit-hooks
+    sha: v1.0.1
+    hooks:
+    -   id: forbid-crlf
+        files: \.md$
+    -   id: remove-crlf
+        files: \.md$
+    -   id: forbid-tabs
+        files: \.md$
+    -   id: remove-tabs
+        files: \.md$
+- repo: local
+  hooks:
+    -  id: convert-markdown-into-html
+       name: convert-markdown-into-html
+       description: "Convert README.md into index.html and README.en.md into index.en.html"
+       entry: python pre-commit-hooks/convert_markdown_into_html.py
+       language: system
+       files: \.md$
--- a/.tmpl/marked.js
+++ b/.tmpl/marked.js
@@ -1093,7 +1093,7 @@ function escape(html, encode) {
 }
 function unescape(html) {
-	// explicitly match decimal, hex, and named HTML entities 
+    // explicitly match decimal, hex, and named HTML entities
  return html.replace(/&(#(?:\d+)|(?:#x[0-9A-Fa-f]+)|(?:\w+));?/g, function(_, n) {
    n = n.toLowerCase();
    if (n === 'colon') return ':';

--- a/build.sh
+++ b/build.sh
-#!/bin/bash
-for i in $(du -a | grep '\.\/.\+\/README.md' | cut -f 2); do
-    .tmpl/convert-markdown-into-html.sh $i > $(dirname $i)/index.html
-done
-for i in $(du -a | grep '\.\/.\+\/README.en.md' | cut -f 2); do
-    .tmpl/convert-markdown-into-html.sh $i > $(dirname $i)/index.en.html
-done
-.tmpl/convert_2_ipynb.sh
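
The removed `build.sh` walked the subdirectories and rendered each `README.md` to `index.html` and each `README.en.md` to `index.en.html`, which matches the description of the new local `convert-markdown-into-html` hook. A minimal Python sketch of that path mapping follows. The names `NAME_MAP`, `output_path_for` and `find_readmes` are hypothetical, and this is not the contents of `pre-commit-hooks/convert_markdown_into_html.py`, which this diff does not show.

```python
import os

# Hypothetical mapping, read off the two loops in the removed build.sh.
NAME_MAP = {
    "README.md": "index.html",
    "README.en.md": "index.en.html",
}


def output_path_for(md_path):
    """Return the HTML path for a README, or None for any other file."""
    directory, name = os.path.split(md_path)
    target = NAME_MAP.get(name)
    if target is None:
        return None
    return os.path.join(directory, target)


def find_readmes(root="."):
    """Yield READMEs below root; like the old grep pattern, skip root itself."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.abspath(dirpath) == os.path.abspath(root):
            continue
        for name in sorted(filenames):
            if name in NAME_MAP:
                yield os.path.join(dirpath, name)
```

Keeping the mapping in one dictionary means a further language variant would need only one extra entry, should the book add one.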
--- a/fit_a_line/README.en.md
+++ b/fit_a_line/README.en.md
--- a/fit_a_line/README.md
+++ b/fit_a_line/README.md
@@ -39,16 +39,16 @@ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
 ### 训练过程
-定义好模型结构之后，我们要通过以下几个步骤进行模型训练  
+定义好模型结构之后，我们要通过以下几个步骤进行模型训练
- 1. 初始化参数，其中包括权重$\omega_i$和偏置$b$，对其进行初始化（如0均值，1方差）。  
+ 1. 初始化参数，其中包括权重$\omega_i$和偏置$b$，对其进行初始化（如0均值，1方差）。
- 2. 网络正向传播计算网络输出和损失函数。  
+ 2. 网络正向传播计算网络输出和损失函数。
- 3. 根据损失函数进行反向误差传播 （[backpropagation](https://en.wikipedia.org/wiki/Backpropagation)），将网络误差从输出层依次向前传递, 并更新网络中的参数。  
+ 3. 根据损失函数进行反向误差传播 （[backpropagation](https://en.wikipedia.org/wiki/Backpropagation)），将网络误差从输出层依次向前传递, 并更新网络中的参数。
- 4. 重复2~3步骤，直至网络训练误差达到规定的程度或训练轮次达到设定值。  
+ 4. 重复2~3步骤，直至网络训练误差达到规定的程度或训练轮次达到设定值。
 ## 数据集
 ### 数据集接口的封装
-首先加载需要的包   
+首先加载需要的包
 ```python
 import paddle.v2 as paddle
@@ -59,9 +59,8 @@ import paddle.v2.dataset.uci_housing as uci_housing
 其中，在uci_housing模块中封装了：
-1.   数据下载的过程<br>
+1. 数据下载的过程。下载数据保存在~/.cache/paddle/dataset/uci_housing/housing.data。
-      下载数据保存在~/.cache/paddle/dataset/uci_housing/housing.data<br>
+2. [数据预处理](#数据预处理)的过程。
-2.   [数据预处理](#数据预处理)的过程<br>
 ### 数据集介绍
@@ -105,25 +104,23 @@ import paddle.v2.dataset.uci_housing as uci_housing
 我们将数据集分割为两份：一份用于调整模型的参数，即进行模型的训练，模型在这份数据集上的误差被称为**训练误差**；另外一份被用来测试，模型在这份数据集上的误差被称为**测试误差**。我们训练模型的目的是为了通过从训练数据中找到规律来预测未知的新数据，所以测试误差是更能反映模型表现的指标。分割数据的比例要考虑到两个因素：更多的训练数据会降低参数估计的方差，从而得到更可信的模型；而更多的测试数据会降低测试误差的方差，从而得到更可信的测试误差。我们这个例子中设置的分割比例为$8:2$
 在更复杂的模型训练过程中，我们往往还会多使用一种数据集：验证集。因为复杂的模型中常常还有一些超参数（[Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization)）需要调节，所以我们会尝试多种超参数的组合来分别训练多个模型，然后对比它们在验证集上的表现选择相对最好的一组超参数，最后才使用这组参数下训练的模型在测试集上评估测试误差。由于本章训练的模型比较简单，我们暂且忽略掉这个过程。
 ## 训练
-fit_a_line下trainer.py演示了训练的整体过程  
-### 初始化paddlepaddle  
+`fit_a_line/trainer.py`演示了训练的整体过程。
+### 初始化PaddlePaddle
 ```python
-# init
 paddle.init(use_gpu=False, trainer_count=1)
 ```
-### 模型配置  
+### 模型配置
-使用`fc_layer`和`LinearActivation`来表示线性回归的模型本身。  
+线性回归的模型其实就是一个采用线性激活函数（linear activation，`LinearActivation`）的全连接层（fully-connected layer，`fc_layer`）：
 ```python
-#输入数据，13维的房屋信息
 x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
 y_predict = paddle.layer.fc(input=x,
                                size=1,
@@ -131,17 +128,15 @@ y_predict = paddle.layer.fc(input=x,
 y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
 cost = paddle.layer.regression_cost(input=y_predict, label=y)
 ```
-### 创建参数 
+### 创建参数
 ```python
-# create parameters
 parameters = paddle.parameters.create(cost)
 ```
-### 创建trainer  
+### 创建Trainer
 ```python
-# create optimizer
 optimizer = paddle.optimizer.Momentum(momentum=0)
 trainer = paddle.trainer.SGD(cost=cost,
@@ -149,14 +144,20 @@ trainer = paddle.trainer.SGD(cost=cost,
                             update_equation=optimizer)
 ```
-### 读取数据且打印训练的中间信息  
+### 读取数据且打印训练的中间信息
-在程序中，我们通过reader接口来获取训练或者测试的数据,通过eventhandler来打印训练的中间信息  
-feeding中设置了训练数据和测试数据的下标,reader通过下标区分训练和测试数据。
+PaddlePaddle提供一个
+[reader机制](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
+来读取数据。 Reader返回的数据可以包括多列，我们需要一个Python dict把列
+序号映射到网络里的数据层。
 ```python
-feeding={'x': 0,
+feeding={'x': 0, 'y': 1}
-             'y': 1}
+```
+此外，我们还可以提供一个 event handler，来打印训练的进度：
+```python
 # event_handler to print training and testing info
 def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
@@ -171,10 +172,10 @@ def event_handler(event):
            feeding=feeding)
        print "Test %d, Cost %f" % (event.pass_id, result.cost)
 ```
-### 开始训练  
+### 开始训练
 ```python
-# training
 trainer.train(
    reader=paddle.batch(
        paddle.reader.shuffle(
@@ -185,13 +186,6 @@ trainer.train(
    num_passes=30)
 ```
-## bash中执行训练程序  
-**注意设置好paddle的安装包路径**
-```bash
-python train.py
-```
 ## 总结
 在这章里，我们借助波士顿房价这一数据集，介绍了线性回归模型的基本概念，以及如何使用PaddlePaddle实现训练和测试的过程。很多的模型和技巧都是从简单的线性回归模型演化而来，因此弄清楚线性模型的原理和局限非常重要。
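
The four training steps listed in the README hunk above (initialize the parameters, run the forward pass, backpropagate the error, repeat) can be sketched without PaddlePaddle. The version below is an illustration under stated assumptions: one input feature instead of 13, zero initialization, full-batch gradient descent, synthetic data, and a hypothetical `fit_line` helper with a hand-picked learning rate. It is not the code in `fit_a_line/trainer.py`.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b on the data."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def fit_line(xs, ys, learning_rate=0.05, num_passes=2000):
    """Fit y = w * x + b by gradient descent on the MSE (hypothetical helper)."""
    w, b = 0.0, 0.0  # step 1: initialize the parameters
    n = len(xs)
    for _ in range(num_passes):  # step 4: repeat steps 2 and 3
        # Step 2: forward pass, prediction error for every sample.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Step 3: gradients of the MSE with respect to w and b, then update.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # synthetic data, not the UCI housing set
w, b = fit_line(xs, ys)
```

On this noise-free data the fitted `w` and `b` approach the generating values 2 and 1, and `mse(w, b, xs, ys)` approaches zero.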

--- a/fit_a_line/index.en.html
+++ b/fit_a_line/index.en.html
--- a/fit_a_line/index.html
+++ b/fit_a_line/index.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -86,14 +87,25 @@ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
 3. 根据损失函数进行反向误差传播 （[backpropagation](https://en.wikipedia.org/wiki/Backpropagation)），将网络误差从输出层依次向前传递, 并更新网络中的参数。
 4. 重复2~3步骤，直至网络训练误差达到规定的程度或训练轮次达到设定值。
+## 数据集
+### 数据集接口的封装
+首先加载需要的包
-## 数据准备
+```python
-执行以下命令来准备数据:
+import paddle.v2 as paddle
-```bash
+import paddle.v2.dataset.uci_housing as uci_housing
-cd data && python prepare_data.py
 ```
-这段代码将从[UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing)下载数据并进行[预处理](#数据预处理)，最后数据将被分为训练集和测试集。
+我们通过uci_housing模块引入了数据集合[UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing)
+其中，在uci_housing模块中封装了：
+1. 数据下载的过程。下载数据保存在~/.cache/paddle/dataset/uci_housing/housing.data。
+2. [数据预处理](#数据预处理)的过程。
+### 数据集介绍
 这份数据集共506行，每行包含了波士顿郊区的一类房屋的相关信息及该类房屋价格的中位数。其各维属性的意义如下：
 | 属性名 | 解释 | 类型 |
@@ -131,89 +143,89 @@ cd data && python prepare_data.py
 </p>
 #### 整理训练集与测试集
-我们将数据集分割为两份：一份用于调整模型的参数，即进行模型的训练，模型在这份数据集上的误差被称为**训练误差**；另外一份被用来测试，模型在这份数据集上的误差被称为**测试误差**。我们训练模型的目的是为了通过从训练数据中找到规律来预测未知的新数据，所以测试误差是更能反映模型表现的指标。分割数据的比例要考虑到两个因素：更多的训练数据会降低参数估计的方差，从而得到更可信的模型；而更多的测试数据会降低测试误差的方差，从而得到更可信的测试误差。一种常见的分割比例为$8:2$，感兴趣的读者朋友们也可以尝试不同的设置来观察这两种误差的变化。
+我们将数据集分割为两份：一份用于调整模型的参数，即进行模型的训练，模型在这份数据集上的误差被称为**训练误差**；另外一份被用来测试，模型在这份数据集上的误差被称为**测试误差**。我们训练模型的目的是为了通过从训练数据中找到规律来预测未知的新数据，所以测试误差是更能反映模型表现的指标。分割数据的比例要考虑到两个因素：更多的训练数据会降低参数估计的方差，从而得到更可信的模型；而更多的测试数据会降低测试误差的方差，从而得到更可信的测试误差。我们这个例子中设置的分割比例为$8:2$
+在更复杂的模型训练过程中，我们往往还会多使用一种数据集：验证集。因为复杂的模型中常常还有一些超参数（[Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization)）需要调节，所以我们会尝试多种超参数的组合来分别训练多个模型，然后对比它们在验证集上的表现选择相对最好的一组超参数，最后才使用这组参数下训练的模型在测试集上评估测试误差。由于本章训练的模型比较简单，我们暂且忽略掉这个过程。
+## 训练
+`fit_a_line/trainer.py`演示了训练的整体过程。
+### 初始化PaddlePaddle
-执行如下命令可以分割数据集，并将训练集和测试集的地址分别写入train.list 和 test.list两个文件中，供PaddlePaddle读取。
 ```python
-python prepare_data.py -r 0.8 #默认使用8:2的比例进行分割
+paddle.init(use_gpu=False, trainer_count=1)
 ```
-在更复杂的模型训练过程中，我们往往还会多使用一种数据集：验证集。因为复杂的模型中常常还有一些超参数（[Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization)）需要调节，所以我们会尝试多种超参数的组合来分别训练多个模型，然后对比它们在验证集上的表现选择相对最好的一组超参数，最后才使用这组参数下训练的模型在测试集上评估测试误差。由于本章训练的模型比较简单，我们暂且忽略掉这个过程。
+### 模型配置
-### 提供数据给PaddlePaddle
+线性回归的模型其实就是一个采用线性激活函数（linear activation，`LinearActivation`）的全连接层（fully-connected layer，`fc_layer`）：
-准备好数据之后，我们使用一个Python data provider来为PaddlePaddle的训练过程提供数据。一个 data provider 就是一个Python函数，它会被PaddlePaddle的训练过程调用。在这个例子里，只需要读取已经保存好的数据，然后一行一行地返回给PaddlePaddle的训练进程即可。
 ```python
-from paddle.trainer.PyDataProvider2 import *
+x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
-import numpy as np
+y_predict = paddle.layer.fc(input=x,
-#定义数据的类型和维度
+                                size=1,
-@provider(input_types=[dense_vector(13), dense_vector(1)])
+                                act=paddle.activation.Linear())
-def process(settings, input_file):
+y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
-    data = np.load(input_file.strip())
+cost = paddle.layer.regression_cost(input=y_predict, label=y)
-    for row in data:
+```
-	    yield row[:-1].tolist(), row[-1:].tolist()
+### 创建参数
+```python
+parameters = paddle.parameters.create(cost)
 ```
-## 模型配置说明
+### 创建Trainer
-### 数据定义
-首先，通过 `define_py_data_sources2` 来配置PaddlePaddle从上面的`dataprovider.py`里读入训练数据和测试数据。 PaddlePaddle接受从命令行读入的配置信息，例如这里我们传入一个名为`is_predict`的变量来控制模型在训练和测试时的不同结构。
 ```python
-from paddle.trainer_config_helpers import *
+optimizer = paddle.optimizer.Momentum(momentum=0)
-is_predict = get_config_arg('is_predict', bool, False)
+trainer = paddle.trainer.SGD(cost=cost,
+                             parameters=parameters,
+                             update_equation=optimizer)
+```
-define_py_data_sources2(
+### 读取数据且打印训练的中间信息
-    train_list='data/train.list',
-    test_list='data/test.list',
-    module='dataprovider',
-    obj='process')
-```
+PaddlePaddle提供一个
+[reader机制](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
+来读取数据。 Reader返回的数据可以包括多列，我们需要一个Python dict把列
+序号映射到网络里的数据层。
-### 算法配置
-接着，指定模型优化算法的细节。由于线性回归模型比较简单，我们只要设置基本的`batch_size`即可，它指定每次更新参数的时候使用多少条数据计算梯度信息。
 ```python
-settings(batch_size=2)
+feeding={'x': 0, 'y': 1}
 ```
-### 网络结构
+此外，我们还可以提供一个 event handler，来打印训练的进度：
-最后，使用`fc_layer`和`LinearActivation`来表示线性回归的模型本身。
 ```python
-#输入数据，13维的房屋信息
+# event_handler to print training and testing info
-x = data_layer(name='x', size=13)
+def event_handler(event):
+    if isinstance(event, paddle.event.EndIteration):
-y_predict = fc_layer(
+        if event.batch_id % 100 == 0:
-    input=x,
+            print "Pass %d, Batch %d, Cost %f" % (
-    param_attr=ParamAttr(name='w'),
+                event.pass_id, event.batch_id, event.cost)
-    size=1,
-    act=LinearActivation(),
+    if isinstance(event, paddle.event.EndPass):
-    bias_attr=ParamAttr(name='b'))
+        result = trainer.test(
+            reader=paddle.batch(
-if not is_predict: #训练时，我们使用MSE，即regression_cost作为损失函数
+                uci_housing.test(), batch_size=2),
-    y = data_layer(name='y', size=1)
+            feeding=feeding)
-    cost = regression_cost(input=y_predict, label=y)
+        print "Test %d, Cost %f" % (event.pass_id, result.cost)
-    outputs(cost) #训练时输出MSE来监控损失的变化
-else: #测试时，输出预测值
-    outputs(y_predict)
 ```
-## 训练模型
+### 开始训练
-在对应代码的根目录下执行PaddlePaddle的命令行训练程序。这里指定模型配置文件为`trainer_config.py`，训练30轮，结果保存在`output`路径下。
-```bash
-./train.sh
-```
-## 应用模型
+```python
-现在来看下如何使用已经训练好的模型进行预测。
+trainer.train(
-```bash
+    reader=paddle.batch(
-python predict.py
+        paddle.reader.shuffle(
-```
+            uci_housing.train(), buf_size=500),
-这里默认使用`output/pass-00029`中保存的模型进行预测，并将数据中的房价与预测结果进行对比，结果保存在 `predictions.png`中。
+        batch_size=2),
-如果你想使用别的模型或者其它的数据进行预测，只要传入新的路径即可：
+    feeding=feeding,
-```bash
+    event_handler=event_handler,
-python predict.py -m output/pass-00020 -t data/housing.test.npy
+    num_passes=30)
 ```
 ## 总结
@@ -228,6 +240,7 @@ python predict.py -m output/pass-00020 -t data/housing.test.npy
 <br/>
 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span> 由 <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作，采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
 </div>
 <!-- You can change the lines below now. -->
@@ -246,6 +259,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
--- a/gan/index.html
+++ b/gan/index.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -40,6 +41,7 @@
 <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
 <div id="markdown" style='display:none'>
 TODO: Write about https://github.com/PaddlePaddle/Paddle/tree/develop/demo/gan
 </div>
 <!-- You can change the lines below now. -->
@@ -58,6 +60,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
--- a/image_caption/index.html
+++ b/image_caption/index.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -39,6 +40,7 @@
 <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
 <div id="markdown" style='display:none'>
 </div>
 <!-- You can change the lines below now. -->
@@ -57,6 +59,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
--- a/image_classification/README.en.md
+++ b/image_classification/README.en.md
@@ -248,48 +248,48 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
        The input to the network is defined as `data_layer`, or image pixels in the context of image classification. The images in CIFAR10 are 32x32 color images of three channels. Therefore, the size of the input data is 3072 (3x32x32), and the number of categories is 10.
-	```python
+    ```python
-	datadim = 3 * 32 * 32
+    datadim = 3 * 32 * 32
-	classdim = 10
+    classdim = 10
-	data = data_layer(name='image', size=datadim)
+    data = data_layer(name='image', size=datadim)
-	```
+    ```
 2. Define VGG main module
-	```python
+    ```python
-	net = vgg_bn_drop(data)
+    net = vgg_bn_drop(data)
-	```
+    ```
        The input to VGG main module is from data layer. `vgg_bn_drop` defines a 16-layer VGG network, with each convolutional layer followed by BN and dropout layers. Here is the definition in detail:
-	```python
+    ```python
-	def vgg_bn_drop(input, num_channels):
+    def vgg_bn_drop(input, num_channels):
-	    def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
+        def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
-	        return img_conv_group(
+            return img_conv_group(
-	            input=ipt,
+                input=ipt,
-	            num_channels=num_channels_,
+                num_channels=num_channels_,
-	            pool_size=2,
+                pool_size=2,
-	            pool_stride=2,
+                pool_stride=2,
-	            conv_num_filter=[num_filter] * groups,
+                conv_num_filter=[num_filter] * groups,
-	            conv_filter_size=3,
+                conv_filter_size=3,
-	            conv_act=ReluActivation(),
+                conv_act=ReluActivation(),
-	            conv_with_batchnorm=True,
+                conv_with_batchnorm=True,
-	            conv_batchnorm_drop_rate=dropouts,
+                conv_batchnorm_drop_rate=dropouts,
-	            pool_type=MaxPooling())
+                pool_type=MaxPooling())
-	    conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
+        conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
-	    conv2 = conv_block(conv1, 128, 2, [0.4, 0])
+        conv2 = conv_block(conv1, 128, 2, [0.4, 0])
-	    conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
+        conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
-	    conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
+        conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
-	    conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
+        conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
-	    drop = dropout_layer(input=conv5, dropout_rate=0.5)
+        drop = dropout_layer(input=conv5, dropout_rate=0.5)
-	    fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
+        fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
-	    bn = batch_norm_layer(
+        bn = batch_norm_layer(
-	        input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
+            input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
-	    fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
+        fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
-	    return fc2
+        return fc2
-	```
+    ```
        2.1. First defines a convolution block or conv_block. The default convolution kernel is 3x3, and the default pooling size is 2x2 with stride 2. Dropout specifies the probability in dropout operation. Function `img_conv_group` is defined in `paddle.trainer_config_helpers` consisting of a series of `Conv->BN->ReLu->Dropout` and a `Pooling`.
@@ -303,22 +303,22 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
        The above VGG network extracts high-level features and maps them to a vector of the same size as the categories. Softmax function or classifier is then used for calculating the probability of the image belonging to each category.
-	```python
+    ```python
-	out = fc_layer(input=net, size=class_num, act=SoftmaxActivation())
+    out = fc_layer(input=net, size=class_num, act=SoftmaxActivation())
-	```
+    ```
 4. Define Loss Function and Outputs
        In the context of supervised learning, labels of training images are defined in `data_layer`, too. During training, cross-entropy is used as loss function and as the output of the network; During testing, the outputs are the probabilities calculated in the classifier.
-	```python
+    ```python
-	if not is_predict:
+    if not is_predict:
-	    lbl = data_layer(name="label", size=class_num)
+        lbl = data_layer(name="label", size=class_num)
-	    cost = classification_cost(input=out, label=lbl)
+        cost = classification_cost(input=out, label=lbl)
-	    outputs(cost)
+        outputs(cost)
-	else:
+    else:
-	    outputs(out)
+        outputs(out)
-	```
+    ```
 ### ResNet

--- a/image_classification/README.md
+++ b/image_classification/README.md
@@ -3,7 +3,7 @@
 本教程源代码目录在[book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification)， 初次使用请参考PaddlePaddle[安装教程](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html)。
-## 背景介绍 
+## 背景介绍
 图像相比文字能够提供更加生动、容易理解及更具艺术感的信息，是人们转递与交换信息的重要来源。在本教程中，我们专注于图像识别领域的一个重要问题，即图像分类。
@@ -51,7 +51,7 @@
  2). **特征编码**: 底层特征中包含了大量冗余与噪声，为了提高特征表达的鲁棒性，需要使用一种特征变换算法对底层特征进行编码，称作特征编码。常用的特征编码包括向量量化编码 \[[4](#参考文献)\]、稀疏编码 \[[5](#参考文献)\]、局部线性约束编码 \[[6](#参考文献)\]、Fisher向量编码 \[[7](#参考文献)\] 等。
  3). **空间特征约束**: 特征编码之后一般会经过空间特征约束，也称作**特征汇聚**。特征汇聚是指在一个空间范围内，对每一维特征取最大值或者平均值，可以获得一定特征不变形的特征表达。金字塔特征匹配是一种常用的特征聚会方法，这种方法提出将图像均匀分块，在分块内做特征汇聚。
  4). **通过分类器分类**: 经过前面步骤之后一张图像可以用一个固定维度的向量进行描述，接下来就是经过分类器对图像进行分类。通常使用的分类器包括SVM(Support Vector Machine, 支持向量机)、随机森林等。而使用核方法的SVM是最为广泛的分类器，在传统图像分类任务上性能很好。
 这种方法在PASCAL VOC竞赛中的图像分类算法中被广泛使用 \[[18](#参考文献)\]。[NEC实验室](http://www.nec-labs.com/)在ILSVRC2010中采用SIFT和LBP特征，两个非线性编码器以及SVM分类器获得图像分类的冠军 \[[8](#参考文献)\]。
 Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得了历史性的突破，效果大幅度超越传统方法，获得了ILSVRC2012冠军，该模型被称作AlexNet。这也是首次将深度学习用于大规模图像分类中。从AlexNet之后，涌现了一系列CNN模型，不断地在ImageNet上刷新成绩，如图4展示。随着模型变得越来越深以及精妙的结构设计，Top-5的错误率也越来越低，降到了3.5%附近。而在同样的ImageNet数据集上，人眼的辨识错误率大概在5.1%，也就是目前的深度学习模型的识别能力已经超过了人眼。
@@ -67,8 +67,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
 <p align="center">
 <img src="image/lenet.png"><br/>
-图5. CNN网络示例[20] 
+图5. CNN网络示例[20]
-</p> 
+</p>
 - 卷积层(convolution layer): 执行卷积操作提取底层到高层的特征，发掘出图片局部关联性质和空间不变性质。
 - 池化层(pooling layer): 执行降采样操作。通过取卷积输出特征图中局部区块的最大值(max-pooling)或者均值(avg-pooling)。降采样也是图像处理中常见的一种操作，可以过滤掉一些不重要的高频信息。
@@ -108,7 +108,7 @@ GoogleNet整体网络结构如图8所示，总共22层网络：开始由3层普
 <p align="center">
 <img src="image/googlenet.jpeg" ><br/>
-图8. GoogleNet[12] 
+图8. GoogleNet[12]
 </p>
@@ -174,7 +174,7 @@ paddle.init(use_gpu=False, trainer_count=1)
 1. 定义数据输入及其维度
 	网络输入定义为 `data_layer` (数据层)，在图像分类中即为图像像素信息。CIFRAR10是RGB 3通道32x32大小的彩色图，因此输入数据大小为3072(3x32x32)，类别大小为10，即10分类。
 	```python
    datadim = 3 * 32 * 32
    classdim = 10
@@ -189,7 +189,7 @@ paddle.init(use_gpu=False, trainer_count=1)
 	net = vgg_bn_drop(image)
 	```
 	VGG核心模块的输入是数据层，`vgg_bn_drop` 定义了16层VGG结构，每层卷积后面引入BN层和Dropout层，详细的定义如下：
 	```python
    def vgg_bn_drop(input):
        def conv_block(ipt, num_filter, groups, dropouts, num_channels=None):
@@ -220,11 +220,11 @@ paddle.init(use_gpu=False, trainer_count=1)
        fc2 = paddle.layer.fc(input=bn, size=512, act=paddle.activation.Linear())
        return fc2
 	```
 	2.1. 首先定义了一组卷积网络，即conv_block。卷积核大小为3x3，池化窗口大小为2x2，窗口滑动大小为2，groups决定每组VGG模块是几次连续的卷积操作，dropouts指定Dropout操作的概率。所使用的`img_conv_group`是在`paddle.networks`中预定义的模块，由若干组 `Conv->BN->ReLu->Dropout` 和 一组 `Pooling` 组成，
 	2.2. 五组卷积操作，即 5个conv_block。 第一、二组采用两次连续的卷积操作。第三、四、五组采用三次连续的卷积操作。每组最后一个卷积后面Dropout概率为0，即不使用Dropout操作。
 	2.3. 最后接两层512维的全连接。
 3. 定义分类器
@@ -240,7 +240,7 @@ paddle.init(use_gpu=False, trainer_count=1)
 4. 定义损失函数和网络输出
 	在有监督训练中需要输入图像对应的类别信息，同样通过`paddle.layer.data`来定义。训练中采用多类交叉熵作为损失函数，并作为网络的输出，预测阶段定义网络的输出为分类器得到的概率信息。
 	```python
    lbl = paddle.layer.data(
        name="label", type=paddle.data_type.integer_value(classdim))
@@ -305,9 +305,9 @@ def layer_warp(block_func, ipt, features, count, stride):
 `resnet_cifar10` 的连接结构主要有以下几个过程。
-1. 底层输入连接一层 `conv_bn_layer`，即带BN的卷积层。 
+1. 底层输入连接一层 `conv_bn_layer`，即带BN的卷积层。
 2. 然后连接3组残差模块即下面配置3组 `layer_warp` ，每组采用图 10 左边残差模块组成。
-3. 最后对网络做均值池化并返回该层。 
+3. 最后对网络做均值池化并返回该层。
 注意：除过第一层卷积层和最后一层全连接层之外，要求三组 `layer_warp` 总的含参层数能够被6整除，即 `resnet_cifar10` 的 depth 要满足 $(depth - 2) % 6 == 0$ 。
@@ -452,7 +452,7 @@ Test with Pass 0, {'classification_error_evaluator': 0.885200023651123}
 [2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
-[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28. 
+[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
 [4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003.
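
The depth rule quoted in the `resnet_cifar10` hunk above, $(depth - 2) % 6 == 0$, says that once the first convolution and the final fully connected layer are set aside, the three `layer_warp` groups must contain a multiple of 6 parameterized layers. A small sketch of that arithmetic follows. The helper name `blocks_per_group` is hypothetical, and reading the quotient as the number of residual blocks per group assumes each block holds two parameterized layers, which this diff does not state.

```python
def blocks_per_group(depth):
    """Return residual blocks per group for a CIFAR-10 ResNet of this depth.

    Assumes 3 groups and 2 parameterized layers per residual block, so the
    layers other than the first convolution and the final fully connected
    layer must number a multiple of 6.
    """
    if depth < 8 or (depth - 2) % 6 != 0:
        raise ValueError("depth must satisfy (depth - 2) % 6 == 0, got %d" % depth)
    return (depth - 2) // 6


# Depths that pass the check, e.g. 8, 14, 20, 26, 32.
valid_depths = [d for d in range(8, 60) if (d - 2) % 6 == 0]
```

Rejecting an invalid depth up front is a design choice of this sketch; it keeps a silent integer division from producing a network shallower than requested.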

--- a/image_classification/deprecated/README.md
+++ b/image_classification/deprecated/README.md
@@ -3,7 +3,7 @@
 本教程源代码目录在[book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification)， 初次使用请参考PaddlePaddle[安装教程](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html)。
-## 背景介绍 
+## 背景介绍
 图像相比文字能够提供更加生动、容易理解及更具艺术感的信息，是人们转递与交换信息的重要来源。在本教程中，我们专注于图像识别领域的一个重要问题，即图像分类。
@@ -51,7 +51,7 @@
  2). **特征编码**: 底层特征中包含了大量冗余与噪声，为了提高特征表达的鲁棒性，需要使用一种特征变换算法对底层特征进行编码，称作特征编码。常用的特征编码包括向量量化编码 \[[4](#参考文献)\]、稀疏编码 \[[5](#参考文献)\]、局部线性约束编码 \[[6](#参考文献)\]、Fisher向量编码 \[[7](#参考文献)\] 等。
  3). **空间特征约束**: 特征编码之后一般会经过空间特征约束，也称作**特征汇聚**。特征汇聚是指在一个空间范围内，对每一维特征取最大值或者平均值，可以获得一定特征不变形的特征表达。金字塔特征匹配是一种常用的特征聚会方法，这种方法提出将图像均匀分块，在分块内做特征汇聚。
  4). **通过分类器分类**: 经过前面步骤之后一张图像可以用一个固定维度的向量进行描述，接下来就是经过分类器对图像进行分类。通常使用的分类器包括SVM(Support Vector Machine, 支持向量机)、随机森林等。而使用核方法的SVM是最为广泛的分类器，在传统图像分类任务上性能很好。
 这种方法在PASCAL VOC竞赛中的图像分类算法中被广泛使用 \[[18](#参考文献)\]。[NEC实验室](http://www.nec-labs.com/)在ILSVRC2010中采用SIFT和LBP特征，两个非线性编码器以及SVM分类器获得图像分类的冠军 \[[8](#参考文献)\]。
 Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得了历史性的突破，效果大幅度超越传统方法，获得了ILSVRC2012冠军，该模型被称作AlexNet。这也是首次将深度学习用于大规模图像分类中。从AlexNet之后，涌现了一系列CNN模型，不断地在ImageNet上刷新成绩，如图4展示。随着模型变得越来越深以及精妙的结构设计，Top-5的错误率也越来越低，降到了3.5%附近。而在同样的ImageNet数据集上，人眼的辨识错误率大概在5.1%，也就是目前的深度学习模型的识别能力已经超过了人眼。
@@ -67,8 +67,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
 <p align="center">
 <img src="image/lenet.png"><br/>
-图5. CNN网络示例[20] 
+图5. CNN网络示例[20]
-</p> 
+</p>
 - 卷积层(convolution layer): 执行卷积操作提取底层到高层的特征，发掘出图片局部关联性质和空间不变性质。
 - 池化层(pooling layer): 执行降采样操作。通过取卷积输出特征图中局部区块的最大值(max-pooling)或者均值(avg-pooling)。降采样也是图像处理中常见的一种操作，可以过滤掉一些不重要的高频信息。
@@ -108,7 +108,7 @@ GoogleNet整体网络结构如图8所示，总共22层网络：开始由3层普
 <p align="center">
 <img src="image/googlenet.jpeg" ><br/>
-图8. GoogleNet[12] 
+图8. GoogleNet[12]
 </p>
@@ -245,7 +245,7 @@ $$  lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
 1. 定义数据输入及其维度
 	网络输入定义为 `data_layer` (数据层)，在图像分类中即为图像像素信息。CIFRAR10是RGB 3通道32x32大小的彩色图，因此输入数据大小为3072(3x32x32)，类别大小为10，即10分类。
 	```python
 	datadim = 3 * 32 * 32
 	classdim = 10
@@ -258,7 +258,7 @@ $$  lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
 	net = vgg_bn_drop(data)
 	```
 	VGG核心模块的输入是数据层，`vgg_bn_drop` 定义了16层VGG结构，每层卷积后面引入BN层和Dropout层，详细的定义如下：
 	```python
 	def vgg_bn_drop(input, num_channels):
 	    def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
@@ -273,26 +273,26 @@ $$  lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
 	            conv_with_batchnorm=True,
 	            conv_batchnorm_drop_rate=dropouts,
 	            pool_type=MaxPooling())
 	    conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
 	    conv2 = conv_block(conv1, 128, 2, [0.4, 0])
 	    conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
 	    conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
 	    conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
 	    drop = dropout_layer(input=conv5, dropout_rate=0.5)
 	    fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
 	    bn = batch_norm_layer(
 	        input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
 	    fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
 	    return fc2
 	```
 	2.1. 首先定义了一组卷积网络，即conv_block。卷积核大小为3x3，池化窗口大小为2x2，窗口滑动大小为2，groups决定每组VGG模块是几次连续的卷积操作，dropouts指定Dropout操作的概率。所使用的`img_conv_group`是在`paddle.trainer_config_helpers`中预定义的模块，由若干组 `Conv->BN->ReLu->Dropout` 和 一组 `Pooling` 组成，
 	2.2. 五组卷积操作，即 5个conv_block。 第一、二组采用两次连续的卷积操作。第三、四、五组采用三次连续的卷积操作。每组最后一个卷积后面Dropout概率为0，即不使用Dropout操作。
 	2.3. 最后接两层512维的全连接。
 3. 定义分类器
@@ -306,7 +306,7 @@ $$  lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
 4. 定义损失函数和网络输出
 	在有监督训练中需要输入图像对应的类别信息，同样通过`data_layer`来定义。训练中采用多类交叉熵作为损失函数，并作为网络的输出，预测阶段定义网络的输出为分类器得到的概率信息。
 	```python
 	if not is_predict:
 	    lbl = data_layer(name="label", size=class_num)
@@ -383,9 +383,9 @@ def layer_warp(block_func, ipt, features, count, stride):
 `resnet_cifar10` 的连接结构主要有以下几个过程。
-1. 底层输入连接一层 `conv_bn_layer`，即带BN的卷积层。 
+1. 底层输入连接一层 `conv_bn_layer`，即带BN的卷积层。
 2. 然后连接3组残差模块即下面配置3组 `layer_warp` ，每组采用图 10 左边残差模块组成。
-3. 最后对网络做均值池化并返回该层。 
+3. 最后对网络做均值池化并返回该层。
 注意：除过第一层卷积层和最后一层全连接层之外，要求三组 `layer_warp` 总的含参层数能够被6整除，即 `resnet_cifar10` 的 depth 要满足 $(depth - 2) % 6 == 0$ 。
@@ -487,7 +487,7 @@ python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png
 <p align="center">
 <img src="image/fea_conv0.png" width="500"><br/>
-图13. 卷积特征可视化图 
+图13. 卷积特征可视化图
 </p>
 ## 总结
@@ -501,7 +501,7 @@ python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png
 [2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
-[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28. 
+[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
 [4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003.

--- a/image_classification/index.en.html
+++ b/image_classification/index.en.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -289,48 +290,48 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
        The input to the network is defined as `data_layer`, or image pixels in the context of image classification. The images in CIFAR10 are 32x32 color images of three channels. Therefore, the size of the input data is 3072 (3x32x32), and the number of categories is 10.
-	```python
+    ```python
-	datadim = 3 * 32 * 32
+    datadim = 3 * 32 * 32
-	classdim = 10
+    classdim = 10
-	data = data_layer(name='image', size=datadim)
+    data = data_layer(name='image', size=datadim)
-	```
+    ```
 2. Define VGG main module
-	```python
+    ```python
-	net = vgg_bn_drop(data)
+    net = vgg_bn_drop(data)
-	```
+    ```
        The input to VGG main module is from data layer. `vgg_bn_drop` defines a 16-layer VGG network, with each convolutional layer followed by BN and dropout layers. Here is the definition in detail:
-	```python
+    ```python
-	def vgg_bn_drop(input, num_channels):
+    def vgg_bn_drop(input, num_channels):
-	    def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
+        def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
-	        return img_conv_group(
+            return img_conv_group(
-	            input=ipt,
+                input=ipt,
-	            num_channels=num_channels_,
+                num_channels=num_channels_,
-	            pool_size=2,
+                pool_size=2,
-	            pool_stride=2,
+                pool_stride=2,
-	            conv_num_filter=[num_filter] * groups,
+                conv_num_filter=[num_filter] * groups,
-	            conv_filter_size=3,
+                conv_filter_size=3,
-	            conv_act=ReluActivation(),
+                conv_act=ReluActivation(),
-	            conv_with_batchnorm=True,
+                conv_with_batchnorm=True,
-	            conv_batchnorm_drop_rate=dropouts,
+                conv_batchnorm_drop_rate=dropouts,
-	            pool_type=MaxPooling())
+                pool_type=MaxPooling())
-	    conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
+        conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
-	    conv2 = conv_block(conv1, 128, 2, [0.4, 0])
+        conv2 = conv_block(conv1, 128, 2, [0.4, 0])
-	    conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
+        conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
-	    conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
+        conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
-	    conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
+        conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
-	    drop = dropout_layer(input=conv5, dropout_rate=0.5)
+        drop = dropout_layer(input=conv5, dropout_rate=0.5)
-	    fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
+        fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
-	    bn = batch_norm_layer(
+        bn = batch_norm_layer(
-	        input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
+            input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
-	    fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
+        fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
-	    return fc2
+        return fc2
-	```
+    ```
        2.1. First, define a convolution block, `conv_block`. The convolution kernel is 3x3, and the pooling window is 2x2 with a stride of 2. `dropouts` specifies the dropout probability after each convolution. The function `img_conv_group`, defined in `paddle.trainer_config_helpers`, consists of a series of `Conv->BN->ReLu->Dropout` layers followed by a `Pooling` layer.
@@ -344,22 +345,22 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
        The VGG network above extracts high-level features. A fully connected layer then maps them to a vector whose size equals the number of categories, and the softmax function computes the probability of the image belonging to each category.
-	```python
+    ```python
-	out = fc_layer(input=net, size=class_num, act=SoftmaxActivation())
+    out = fc_layer(input=net, size=class_num, act=SoftmaxActivation())
-	```
+    ```
 4. Define Loss Function and Outputs
        In supervised learning, the labels of the training images are also defined through `data_layer`. During training, cross-entropy is used as the loss function and as the output of the network; during testing, the outputs are the probabilities calculated by the classifier.
-	```python
+    ```python
-	if not is_predict:
+    if not is_predict:
-	    lbl = data_layer(name="label", size=class_num)
+        lbl = data_layer(name="label", size=class_num)
-	    cost = classification_cost(input=out, label=lbl)
+        cost = classification_cost(input=out, label=lbl)
-	    outputs(cost)
+        outputs(cost)
-	else:
+    else:
-	    outputs(out)
+        outputs(out)
-	```
+    ```
 ### ResNet
@@ -589,6 +590,7 @@ Traditional image classification methods involve multiple stages of processing a
 <br/>
 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
 </div>
 <!-- You can change the lines below now. -->
@@ -607,6 +609,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
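
The shape arithmetic behind the VGG description in this file (3072 = 3x32x32 inputs, five `conv_block` groups, each ending in 2x2 pooling with stride 2) can be traced with a small sketch. `vgg_shapes` is a hypothetical helper written for illustration and is not part of PaddlePaddle; it assumes that every 3x3 convolution is padded so that only the pooling step changes the spatial size.

```python
# Hypothetical helper, not part of PaddlePaddle: traces feature-map shapes
# through the five conv blocks of vgg_bn_drop.
# Assumption: each 3x3 convolution preserves the spatial size (padding of 1),
# so only the 2x2, stride-2 pooling at the end of a block halves it.

def vgg_shapes(height=32, width=32, channels=3):
    # (num_filter, groups) for conv1..conv5, copied from vgg_bn_drop
    blocks = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]
    shapes = [(channels, height, width)]
    conv_layers = 0
    for num_filter, groups in blocks:
        conv_layers += groups          # `groups` consecutive convolutions
        height //= 2                   # pooling halves height and width
        width //= 2
        shapes.append((num_filter, height, width))
    return shapes, conv_layers


shapes, conv_layers = vgg_shapes()
input_size = shapes[0][0] * shapes[0][1] * shapes[0][2]
```

Under these assumptions the input size is 3 * 32 * 32 and the last block emits a 512x1x1 feature map. The blocks contain 13 convolutions in total; counting `fc1`, `fc2`, and the softmax layer as well gives 16 weight layers, which is one plausible reading of the "16-layer" description.
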
--- a/image_classification/index.html
+++ b/image_classification/index.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -44,7 +45,7 @@
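 The source code for this tutorial is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification). First-time users should refer to the PaddlePaddle [installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
-## Background 
+## Background
 Compared with text, images convey information that is more vivid, easier to understand, and more artistic, and they are an important medium for communicating and exchanging information. In this tutorial we focus on one important problem in image recognition: image classification.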
@@ -92,7 +93,7 @@
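  2). **Feature encoding**: Low-level features contain a great deal of redundancy and noise. To make the feature representation more robust, a feature transformation algorithm is used to encode the low-level features; this is called feature encoding. Common methods include vector quantization encoding \[[4](#参考文献)\], sparse coding \[[5](#参考文献)\], locality-constrained linear coding \[[6](#参考文献)\], and Fisher vector encoding \[[7](#参考文献)\].
  3). **Spatial feature constraint**: Feature encoding is usually followed by a spatial feature constraint, also called **feature pooling**. Feature pooling takes the maximum or the average of each feature dimension within a spatial region, which yields a representation with some degree of feature invariance. Pyramid feature matching is a common feature pooling method: it divides the image evenly into blocks and pools the features within each block.
  4). **Classification**: After the previous steps, an image is described by a vector of fixed dimension, which is then passed to a classifier. Commonly used classifiers include SVM (Support Vector Machine) and random forests. SVMs with kernel methods are the most widely used and perform well on traditional image classification tasks.
 This approach was widely used by the image classification algorithms in the PASCAL VOC challenge \[[18](#参考文献)\]. [NEC Labs](http://www.nec-labs.com/) won the image classification task of ILSVRC 2010 using SIFT and LBP features, two nonlinear encoders, and an SVM classifier \[[8](#参考文献)\].
 The CNN model \[[9](#参考文献)\] that Alex Krizhevsky proposed at ILSVRC 2012 was a historic breakthrough: it outperformed traditional methods by a wide margin and won ILSVRC 2012. The model is known as AlexNet, and it was the first use of deep learning for large-scale image classification. Since AlexNet, a series of CNN models have kept improving the results on ImageNet, as shown in Figure 4. As the models became deeper and their architectures more refined, the Top-5 error rate fell to around 3.5%. On the same ImageNet dataset the human error rate is about 5.1%, so current deep learning models have surpassed human recognition on this benchmark.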
@@ -108,8 +109,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
 <p align="center">
 <img src="image/lenet.png"><br/>
-Figure 5. Example CNN network[20] 
+Figure 5. Example CNN network[20]
-</p> 
+</p>
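 - Convolution layer: performs convolution to extract features from low level to high level, exploiting the local correlation and spatial invariance of images.
 - Pooling layer: performs downsampling by taking the maximum (max-pooling) or the mean (avg-pooling) of local regions of the convolution output feature map. Downsampling is a common operation in image processing and filters out some unimportant high-frequency information.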
@@ -149,7 +150,7 @@ GoogleNet整体网络结构如图8所示，总共22层网络：开始由3层普
 <p align="center">
 <img src="image/googlenet.jpeg" ><br/>
-Figure 8. GoogleNet[12] 
+Figure 8. GoogleNet[12]
 </p>
@@ -215,7 +216,7 @@ paddle.init(use_gpu=False, trainer_count=1)
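 1. Define the data input and its dimensions
 	The network input is defined by `data_layer` (the data layer), which in image classification carries the image pixels. CIFAR10 images are 32x32 color images with 3 RGB channels, so the input size is 3072 (3x32x32) and the number of categories is 10.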
 	```python
    datadim = 3 * 32 * 32
    classdim = 10
@@ -230,7 +231,7 @@ paddle.init(use_gpu=False, trainer_count=1)
 	net = vgg_bn_drop(image)
 	```
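 	The input to the VGG core module is the data layer. `vgg_bn_drop` defines a 16-layer VGG structure with BN and Dropout layers after each convolution. The detailed definition is as follows: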
 	```python
    def vgg_bn_drop(input):
        def conv_block(ipt, num_filter, groups, dropouts, num_channels=None):
@@ -261,11 +262,11 @@ paddle.init(use_gpu=False, trainer_count=1)
        fc2 = paddle.layer.fc(input=bn, size=512, act=paddle.activation.Linear())
        return fc2
 	```
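 	2.1. First define a group of convolutions, `conv_block`. The kernel size is 3x3, the pooling window is 2x2 with a stride of 2, `groups` sets how many consecutive convolutions each VGG block contains, and `dropouts` gives the Dropout probabilities. `img_conv_group` is a module predefined in `paddle.networks`; it consists of several `Conv->BN->ReLu->Dropout` groups followed by one `Pooling` layer.
 	2.2. Five groups of convolutions, i.e. five `conv_block`s. The first and second groups use two consecutive convolutions; the third, fourth, and fifth groups use three. The Dropout probability after the last convolution in each group is 0, i.e. no Dropout is applied there.
 	2.3. Finally, two 512-dimensional fully connected layers follow.
 3. Define the classifier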
@@ -281,7 +282,7 @@ paddle.init(use_gpu=False, trainer_count=1)
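 4. Define the loss function and the network output
 	Supervised training also needs the class label of each image as input, which is likewise defined through `paddle.layer.data`. Training uses multi-class cross-entropy as the loss function and as the network output; at prediction time the network output is the probabilities produced by the classifier.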
 	```python
    lbl = paddle.layer.data(
        name="label", type=paddle.data_type.integer_value(classdim))
@@ -346,9 +347,9 @@ def layer_warp(block_func, ipt, features, count, stride):
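 The connection structure of `resnet_cifar10` consists of the following steps.
-1. The input is connected to a `conv_bn_layer`, i.e. a convolution layer with BN. 
+1. The input is connected to a `conv_bn_layer`, i.e. a convolution layer with BN.
 2. Three groups of residual modules follow, i.e. the three `layer_warp` groups configured below, each built from the residual module on the left of Figure 10.
-3. Finally, average pooling is applied to the network and that layer is returned. 
+3. Finally, average pooling is applied to the network and that layer is returned.
 Note: excluding the first convolution layer and the last fully connected layer, the total number of parameterized layers in the three `layer_warp` groups must be divisible by 6; that is, the depth of `resnet_cifar10` must satisfy $(depth - 2) % 6 == 0$.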
@@ -493,7 +494,7 @@ Test with Pass 0, {'classification_error_evaluator': 0.885200023651123}
 [2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
-[3] Ahonen, T., Hadid, A., and Pietikäinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28. 
+[3] Ahonen, T., Hadid, A., and Pietikäinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
 [4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003.
@@ -535,6 +536,7 @@ Test with Pass 0, {'classification_error_evaluator': 0.885200023651123}
 <br/>
 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
 </div>
 <!-- You can change the lines below now. -->
@@ -553,6 +555,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
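
The depth constraint stated for `resnet_cifar10` in this file, $(depth - 2) % 6 == 0$, can be checked with a short sketch. `blocks_per_group` is a hypothetical name used only for illustration. The sketch assumes that each residual block holds two parameterized layers, so three groups of n blocks contribute 6n layers; rejecting depths below 8 (which would give zero blocks) is an extra choice made here, not something the tutorial states.

```python
# Hypothetical helper, not taken from the tutorial's code.
# Assumption: each residual block contains two parameterized layers, so the
# three layer_warp groups hold 3 * n * 2 = 6n layers; the first convolution
# and the final fully connected layer add the remaining 2.

def blocks_per_group(depth):
    """Return n, the number of residual blocks in each of the three groups."""
    if depth < 8 or (depth - 2) % 6 != 0:
        raise ValueError(
            "depth must be at least 8 and satisfy (depth - 2) %% 6 == 0, got %d"
            % depth)
    return (depth - 2) // 6


n_for_depth_32 = blocks_per_group(32)
```

A depth of 32 therefore gives 5 blocks per group, while a depth such as 30 is rejected because 28 is not divisible by 6.
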
--- a/image_detection/index.html
+++ b/image_detection/index.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -39,6 +40,7 @@
 <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
 <div id="markdown" style='display:none'>
 </div>
 <!-- You can change the lines below now. -->
@@ -57,6 +59,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
--- a/image_qa/index.html
+++ b/image_qa/index.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -39,6 +40,7 @@
 <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
 <div id="markdown" style='display:none'>
 </div>
 <!-- You can change the lines below now. -->
@@ -57,6 +59,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
--- a/index.html
+++ b/index.html
 <html>
 <head>
-	<meta http-equiv="refresh" content="0; url=https://github.com/paddlepaddle/book" />
+  <script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
+    jax: ["input/TeX", "output/HTML-CSS"],
+    tex2jax: {
+      inlineMath: [ ['$','$'] ],
+      displayMath: [ ['$$','$$'] ],
+      processEscapes: true
+    },
+    "HTML-CSS": { availableFonts: ["TeX"] }
+  });
+  </script>
+  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js" async></script>
+  <script type="text/javascript" src="../.tmpl/marked.js">
+  </script>
+  <link href="http://cdn.bootcss.com/highlight.js/9.9.0/styles/darcula.min.css" rel="stylesheet">
+  <script src="http://cdn.bootcss.com/highlight.js/9.9.0/highlight.min.js"></script>
+  <link href="http://cdn.bootcss.com/bootstrap/4.0.0-alpha.6/css/bootstrap.min.css" rel="stylesheet">
+  <link href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" rel="stylesheet">
+  <link href="../.tmpl/github-markdown.css" rel='stylesheet'>
 </head>
+<style type="text/css" >
+.markdown-body {
+    box-sizing: border-box;
+    min-width: 200px;
+    max-width: 980px;
+    margin: 0 auto;
+    padding: 45px;
+}
+</style>
 <body>
-	<a href="https://github.com/paddlepaddle/book">Please access github home page</a>
+<div id="context" class="container markdown-body">
+</div>
+<!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
+<div id="markdown" style='display:none'>
+# Introduction to Deep Learning
+1. Getting Started [[fit_a_line](fit_a_line/)] [[html](http://book.paddlepaddle.org/fit_a_line)]
+1. Recognize Digits [[recognize_digits](recognize_digits/)] [[html](http://book.paddlepaddle.org/recognize_digits)]
+1. Image Classification [[image_classification](image_classification/)] [[html](http://book.paddlepaddle.org/image_classification)]
+1. Word Vectors [[word2vec](word2vec/)] [[html](http://book.paddlepaddle.org/word2vec)]
+1. Sentiment Analysis [[understand_sentiment](understand_sentiment/)] [[html](http://book.paddlepaddle.org/understand_sentiment)]
+1. Semantic Role Labeling [[label_semantic_roles](label_semantic_roles/)] [[html](http://book.paddlepaddle.org/label_semantic_roles)]
+1. Machine Translation [[machine_translation](machine_translation/)] [[html](http://book.paddlepaddle.org/machine_translation)]
+1. Personalized Recommendation [[recommender_system](recommender_system/)] [[html](http://book.paddlepaddle.org/recommender_system)]
+<br/>
+<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
+</div>
+<!-- You can change the lines below now. -->
+<script type="text/javascript">
+marked.setOptions({
+  renderer: new marked.Renderer(),
+  gfm: true,
+  breaks: false,
+  smartypants: true,
+  highlight: function(code, lang) {
+    code = code.replace(/&amp;/g, "&")
+    code = code.replace(/&gt;/g, ">")
+    code = code.replace(/&lt;/g, "<")
+    code = code.replace(/&nbsp;/g, " ")
+    return hljs.highlightAuto(code, [lang]).value;
+  }
+});
+document.getElementById("context").innerHTML = marked(
+        document.getElementById("markdown").innerHTML)
+</script>
 </body>
--- a/label_semantic_roles/README.en.md
+++ b/label_semantic_roles/README.en.md
@@ -22,34 +22,20 @@ Standard SRL system mostly builds on top of Syntactic Analysis and contains five
 <div  align="center">
-<img src="image/dependency_parsing.png" width = "80%" align=center /><br>
+<img src="image/dependency_parsing_en.png" width = "80%" align=center /><br>
 Fig 1. Syntactic parse tree
 </div>
-核心关系-> HED
-定中关系-> ATT
-主谓关系-> SBV
-状中结构-> ADV
-介宾关系-> POB
-右附加关系-> RAD
-动宾关系-> VOB
-标点-> WP
 However, complete syntactic analysis requires identifying the relations among all constituents, and the performance of SRL is sensitive to the precision of syntactic analysis, which makes SRL a very challenging task. To reduce the complexity while still obtaining some syntactic structure information, shallow syntactic analysis is often used. Shallow syntactic analysis, also called partial parsing or chunking, does not build the complete parse tree that complete syntactic analysis requires; it only needs to identify some independent components with relatively simple structure, such as verb phrases (chunks). To avoid the difficulty of constructing a highly accurate syntactic tree, some work \[[1](#Reference)\] proposed SRL methods based on semantic chunking, which cast SRL as a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using the BIO representation. For syntactic chunks forming a chunk of type A, the first chunk receives the tag B-A (Begin), the remaining ones receive the tag I-A (Inside), and all chunks outside receive the tag O.
 The BIO representation of the above example is shown in Fig. 2.
 <div  align="center">
-<img src="image/bio_example.png" width = "90%"  align=center /><br>
+<img src="image/bio_example_en.png" width = "90%"  align=center /><br>
 Fig 2. BIO representation
 </div>
-输入序列-> input sequence
-语块-> chunk
-标注序列-> label sequence
-角色-> role
 This example illustrates the simplicity of sequence tagging: (1) shallow syntactic analysis lowers the precision required of syntactic analysis; (2) the step of pruning candidate arguments is removed; (3) argument identification and tagging are completed at the same time. Such unified methods simplify the procedure, reduce the risk of accumulating errors, and further improve performance.
 In this tutorial, our SRL system is built as an end-to-end system based on a neural network. It takes only text sequences as input, without any syntactic parsing results or complex hand-designed features. We use the public dataset of the [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) as an example: given a sentence with its predicates marked, identify the corresponding arguments and their semantic roles by sequence tagging.
@@ -70,14 +56,11 @@ The operation of a single LSTM cell contain 3 parts: (1) input-to-hidden: map in
 Fig. 3 illustrates the final stacked recurrent neural network.
-<p align="center">    
+<p align="center">  
-<img src="./image/stacked_lstm.png" width = "40%"  align=center><br>
+<img src="./image/stacked_lstm_en.png" width = "40%"  align=center><br>
 Fig 3. Stacked Recurrent Neural Networks
 </p>
-线性变换-> linear transformation
-输入层到隐层-> input-to-hidden
 ### Bidirectional Recurrent Neural Network
 LSTMs can summarize the history of the inputs seen so far, but they cannot see the future. In most NLP (natural language processing) tasks the entire sentence is available, so sequential learning may be much more effective if the future can be encoded in the same way as the history.
@@ -85,16 +68,11 @@ LSTMs can summarize the history of previous inputs seen up to now, but can not s
 To address this drawback, we can design a bidirectional recurrent neural network with a minor modification: each higher LSTM layer processes the sequence in the direction opposite to the layer below it, i.e., the deep LSTM layers operate left-to-right, right-to-left, left-to-right, ..., in depth. From the second layer on, the LSTM layer at time step $t$ can therefore see both the history and the future. Fig. 4 illustrates the bidirectional recurrent neural network.
-<p align="center">    
+<p align="center">  
-<img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br>
+<img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br>
 Fig 4. Bidirectional LSTMs
 </p>
-线性变换-> linear transformation
-输入层到隐层-> input-to-hidden
-正向处理输出序列->process sequence in the forward direction
-反向处理上一层序列-> process sequence from the previous layer in backward direction
 Note that this bidirectional RNN differs from the one proposed by Bengio et al. for machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce that other bidirectional RNN in a later chapter, [machine translation](https://github.com/PaddlePaddle/book/blob/develop/machine_translation/README.md).
 ### Conditional Random Field
@@ -106,12 +84,12 @@ CRF is a probabilistic graph model (undirected) with nodes denoting random varia
 Sequence tagging tasks treat both the input and the output as linear sequences, with no extra dependency assumptions on the graph model. The graph is therefore a simple chain, which yields the Linear-Chain Conditional Random Field shown in Fig. 5.
-<p align="center">    
+<p align="center">  
 <img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
 Fig 5. Linear Chain Conditional Random Field used in SRL tasks
 </p>
-By the fundamental theorem of random fields \[[5](#Reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form: 
+By the fundamental theorem of random fields \[[5](#Reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form:
 $$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
@@ -155,19 +133,11 @@ After modification, the model is as follows:
 4. Take the representation from step 3 as the input to the CRF and the label sequence as the supervision signal, and perform sequence tagging
-<div  align="center">    
+<div  align="center">  
-<img src="image/db_lstm_network.png" width = "60%"  align=center /><br>
+<img src="image/db_lstm_en.png" width = "60%"  align=center /><br>
 Fig 6. DB-LSTM for SRL tasks
 </div>
-论元-> argu
-谓词-> pred
-谓词上下文-> ctx-p
-谓词上下文区域标记-> $m_r$
-输入-> input
-原句-> sentence
-反向LSTM-> LSTM Reverse
 ## Data Preparation
 In the tutorial, we use [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task open dataset as an example. It is important to note that the training set and development set of the CoNLL 2005 SRL task are not free to download after the competition. Currently, only the test set can be obtained, including 23 sections of the Wall Street Journal and three sections of the Brown corpus. In this tutorial, we use the WSJ corpus as the training dataset to explain the model. However, since the training set is small, if you want to train a usable neural network SRL system, consider paying for the full corpus.
@@ -259,10 +229,10 @@ def d_type(value_range):
 # word sequence
 word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
 # predicate
-predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len)) 
+predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
 # 5 features for predicate context
-ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len)) 
+ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
 ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
 ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
 ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
@@ -274,12 +244,12 @@ mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
 # label sequence
 target = paddle.layer.data(name='target', type=d_type(label_dict_len))
 ```
 Special note: hidden_dim = 512 means that the LSTM hidden vector has 128 dimensions (512/4). Please refer to the official PaddlePaddle documentation for details: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory).
 - 2. The word sequence, predicate, predicate context, and region mark sequence are transformed into embedding vector sequences.
-```python   
+```python  
 # Since the word vector lookup table is pre-trained, we won't update it this time.
 # is_static being True prevents updating the lookup table during training.
@@ -405,7 +375,7 @@ parameters = paddle.parameters.create([crf_cost, crf_dec])
 ```
 We can print the parameter names. A name is generated automatically if none is specified in the network configuration.
 ```python
 print parameters.keys()
 ```

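The BIO scheme described in this README can be sketched in a few lines. `spans_to_bio`, the four-token sentence, and its spans are hypothetical and chosen only for illustration; tokens outside every span receive the plain tag `O`, following common BIO practice.

```python
# Hypothetical helper illustrating BIO tagging; not part of the tutorial code.

def spans_to_bio(num_tokens, spans):
    """Convert labeled spans into one BIO tag per token.

    spans is a list of (start, end, label) tuples with `end` exclusive.
    """
    tags = ["O"] * num_tokens              # default: outside any chunk
    for start, end, label in spans:
        tags[start] = "B-" + label         # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label         # remaining tokens of the chunk
    return tags


# Invented example sentence and spans (assumption, not from the dataset).
tokens = ["A", "record", "was", "set"]
tags = spans_to_bio(len(tokens), [(0, 2, "A1"), (3, 4, "V")])
```

Here the two-token argument is tagged `B-A1`, `I-A1`, the predicate is tagged `B-V`, and the remaining token is tagged `O`.
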
--- a/label_semantic_roles/README.md
+++ b/label_semantic_roles/README.md
@@ -52,7 +52,7 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mb
 Figure 3 shows the resulting stacked recurrent neural network.
-<p align="center">    
+<p align="center">  
 <img src="./image/stacked_lstm.png" width = "40%"  align=center><br>
 Figure 3. Stacked recurrent neural network based on LSTM
 </p>
@@ -63,7 +63,7 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mb
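 To overcome this limitation, we can design a bidirectional recurrent network unit. The idea is simple and direct: make a small modification to the stacked recurrent neural network of the previous section by stacking multiple LSTM units so that each layer learns the output sequence of the layer below in alternating order: forward, backward, forward, and so on. From the second layer on, the LSTM unit at time step $t$ can therefore always see both history and future information. Figure 4 shows the structure of the bidirectional recurrent neural network based on LSTM.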
-<p align="center">    
+<p align="center">  
 <img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br>
 Figure 4. Bidirectional recurrent neural network based on LSTM
 </p>
@@ -78,7 +78,7 @@ CRF是一种概率化结构模型，可以看作是一个概率无向图模型
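 A sequence tagging task only needs to consider the case in which both the input and the output are linear sequences. Moreover, because the input sequence is used only as a condition and no conditional independence assumptions are made, there is no graph structure among the elements of the input sequence. Sequence tagging therefore uses a CRF defined on a chain graph, as shown in Figure 5, called a Linear Chain Conditional Random Field.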
-<p align="center">    
+<p align="center">  
 <img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
 Figure 5. Linear chain conditional random field used in sequence tagging tasks
 </p>
@@ -122,7 +122,7 @@ $$L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \fra
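 3. The four word-vector sequences from step 2 are the input to the bidirectional LSTM model, which learns a feature representation of the input sequence and produces a new sequence of feature representations;
 4. The CRF takes the features learned by the LSTM in step 3 as input and the label sequence as the supervision signal, and performs sequence tagging;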
-<div  align="center">    
+<div  align="center">  
 <img src="image/db_lstm_network.png" width = "60%"  align=center /><br>
 Figure 6. Deep bidirectional LSTM model for the SRL task
 </div>
@@ -161,7 +161,7 @@ conll05st-release/
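 After preprocessing, each training sample contains 9 features: the sentence sequence, the predicate, the predicate context (5 columns), the predicate context region mark, and the label sequence. The table below shows one training sample.
-| Sentence sequence | Predicate | Predicate context (window = 5) | Predicate context region mark | Label sequence | 
+| Sentence sequence | Predicate | Predicate context (window = 5) | Predicate context region mark | Label sequence |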
 |---|---|---|---|---|
 | A | set | n't been set . × | 0 | B-A1 |
 | record | set | n't been set . × | 0 | I-A1 |
@@ -214,7 +214,7 @@ word_dim = 32        # 词向量维度
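 mark_dim = 5         # the predicate context region mark is mapped to a real vector through a lookup table; this is the dimension of that vector
 hidden_dim = 512     # LSTM hidden vector dimension: 512 / 4
 depth = 8            # depth of the stacked LSTM
 # Each sample has 9 features in total. Nine data layers are defined below, each of type integer_value_sequence, i.e. a sequence of integer IDs.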
 def d_type(size):
    return paddle.data_type.integer_value_sequence(size)
@@ -222,10 +222,10 @@ def d_type(size):
 # sentence sequence
 word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
 # predicate
-predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len)) 
+predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
 # 5 features for predicate context
-ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len)) 
+ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
 ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
 ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
 ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
@@ -237,12 +237,12 @@ mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
 # label sequence
 target = paddle.layer.data(name='target', type=d_type(label_dict_len))
 ```
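 Note that hidden_dim = 512 sets the LSTM hidden vector dimension to 128. See the description of [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory) in the official PaddlePaddle documentation for details.
 - 2. The sentence sequence, predicate, predicate context, and predicate context region mark are converted through lookup tables into sequences of real-valued word vectors.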
-```python   
+```python  
 # In this tutorial we load pre-trained word vectors, so is_static=True is set here
 # When is_static is True, the lookup table is not updated while training the SRL model
@@ -369,7 +369,7 @@ parameters = paddle.parameters.create([crf_cost, crf_dec])
 ```
 We can print the parameter names; a name is generated by default if none is specified in the network configuration.
 ```python
 print parameters.keys()
 ```

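The note that `hidden_dim = 512` corresponds to a 128-dimensional LSTM hidden vector (512 / 4) can be read as follows: the projection that feeds the LSTM packs four equally sized chunks into one vector. The sketch below only illustrates that arithmetic. `split_projection` is a hypothetical helper, and what each chunk represents (commonly the cell candidate and three gates) and their order are assumptions, not taken from the PaddlePaddle source.

```python
# Hypothetical helper showing the 512 / 4 arithmetic; not PaddlePaddle code.
# Assumption: the vector fed to the LSTM concatenates four equally sized
# chunks, so the hidden size is a quarter of the projection size.

def split_projection(projection):
    """Split a projected input vector into four equally sized chunks."""
    if len(projection) % 4 != 0:
        raise ValueError("projection size must be divisible by 4")
    size = len(projection) // 4
    return [projection[i * size:(i + 1) * size] for i in range(4)]


hidden_dim = 512
chunks = split_projection([0.0] * hidden_dim)
lstm_size = len(chunks[0])
```

With `hidden_dim = 512` each of the four chunks, and hence the LSTM hidden vector, has 128 entries.
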
--- a/label_semantic_roles/data/extract_pairs.py
+++ b/label_semantic_roles/data/extract_pairs.py
@@ -20,7 +20,7 @@ from optparse import OptionParser
 def read_labels(props_file):
    '''
     a sentence may have more than one verb, and each verb has its own label sequence
-    label[],  is a 3-dimension list. 
+    label[],  is a 3-dimension list.
     the first dim stores the label sequences of all sentences; its length is the number of sentences
     the second dim stores all label sequences for one sentence
     the third dim stores the label of each word

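The three-level list described in the `read_labels` docstring can be pictured with a small sketch. `transpose_labels` and its input layout (one row per word, one column per verb) are hypothetical; this is not the actual `read_labels` implementation, only an illustration of the `labels[sentence][verb][word]` indexing.

```python
# Hypothetical illustration of the 3-dimension list; not the real read_labels.
# Assumption: the input gives, for each sentence, one row per word, and each
# row holds one label per verb in that sentence.

def transpose_labels(sentences):
    """Return labels[sentence][verb][word] from per-word rows."""
    labels = []
    for rows in sentences:
        num_verbs = len(rows[0]) if rows else 0
        # One label sequence per verb, each covering every word in order.
        labels.append([[row[v] for row in rows] for v in range(num_verbs)])
    return labels


# Invented toy data: the first sentence has 3 words and 2 verbs,
# the second has 2 words and 1 verb.
sentences = [
    [["B-A0", "O"], ["B-V", "B-A0"], ["O", "B-V"]],
    [["B-V"], ["B-A1"]],
]
labels = transpose_labels(sentences)
```

The first index selects the sentence, the second selects one verb's label sequence, and the third selects the label of a single word.
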
--- a/label_semantic_roles/image/bd_lstm_en.png
+++ b/label_semantic_roles/image/bd_lstm_en.png
--- a/label_semantic_roles/image/bidirectional_stacked_lstm_en.png
+++ b/label_semantic_roles/image/bidirectional_stacked_lstm_en.png
--- a/label_semantic_roles/image/bio_example.png
+++ b/label_semantic_roles/image/bio_example.png
--- a/label_semantic_roles/image/bio_example_en.png
+++ b/label_semantic_roles/image/bio_example_en.png
--- a/label_semantic_roles/image/dependency_parsing.png
+++ b/label_semantic_roles/image/dependency_parsing.png
--- a/label_semantic_roles/image/dependency_parsing_en.png
+++ b/label_semantic_roles/image/dependency_parsing_en.png
--- a/label_semantic_roles/image/stacked_lstm_en.png
+++ b/label_semantic_roles/image/stacked_lstm_en.png
--- a/label_semantic_roles/index.en.html
+++ b/label_semantic_roles/index.en.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -63,34 +64,20 @@ Standard SRL system mostly builds on top of Syntactic Analysis and contains five
 <div  align="center">
-<img src="image/dependency_parsing.png" width = "80%" align=center /><br>
+<img src="image/dependency_parsing_en.png" width = "80%" align=center /><br>
 Fig 1. Syntactic parse tree
 </div>
-核心关系-> HED
-定中关系-> ATT
-主谓关系-> SBV
-状中结构-> ADV
-介宾关系-> POB
-右附加关系-> RAD
-动宾关系-> VOB
-标点-> WP
 However, complete syntactic analysis requires identifying the relations among all constituents, and the performance of SRL is sensitive to the precision of syntactic analysis, which makes SRL a very challenging task. To reduce the complexity while still obtaining some syntactic structure information, shallow syntactic analysis is often used. Shallow syntactic analysis, also called partial parsing or chunking, does not build the complete parse tree that complete syntactic analysis requires; it only needs to identify some independent components with relatively simple structure, such as verb phrases (chunks). To avoid the difficulty of constructing a highly accurate syntactic tree, some work \[[1](#Reference)\] proposed SRL methods based on semantic chunking, which cast SRL as a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using the BIO representation. For syntactic chunks forming a chunk of type A, the first chunk receives the tag B-A (Begin), the remaining ones receive the tag I-A (Inside), and all chunks outside receive the tag O.
 The BIO representation of the above example is shown in Fig. 2.
 <div  align="center">
-<img src="image/bio_example.png" width = "90%"  align=center /><br>
+<img src="image/bio_example_en.png" width = "90%"  align=center /><br>
 Fig 2. BIO representation
 </div>
-输入序列-> input sequence
-语块-> chunk
-标注序列-> label sequence
-角色-> role
 This example illustrates the simplicity of sequence tagging: (1) it relies only on shallow syntactic analysis, which lowers the precision required of syntactic analysis; (2) the step of pruning candidate arguments is removed; (3) argument identification and tagging are finished at the same time. Such a unified method simplifies the procedure, reduces the risk of accumulating errors, and further boosts performance.
 In this tutorial, our SRL system is built as an end-to-end system via a neural network. It takes only text sequences as input, without using any syntactic parsing results or complex hand-designed features. We use the public dataset of the [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) as an example to illustrate the task: given a sentence with predicates marked, identify the corresponding arguments and their semantic roles by sequence tagging.
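
The BIO scheme described above can be sketched in a few lines of Python. This is only an illustration, not code from the tutorial: the helper name `spans_to_bio`, the example sentence, and its role labels are assumptions chosen to show how labeled spans turn into one tag per position.

```python
def spans_to_bio(num_positions, spans):
    """Convert labeled spans into a BIO tag sequence.

    `spans` is a list of (start, end, role) triples with `end` exclusive.
    Positions not covered by any span receive the tag "O".
    """
    tags = ["O"] * num_positions
    for start, end, role in spans:
        tags[start] = "B-" + role          # first position of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + role          # remaining positions of the span
    return tags


# Hypothetical example sentence and role spans, for illustration only.
tokens = ["A", "record", "date", "has", "n't", "been", "set", "."]
spans = [(0, 3, "A1"), (4, 5, "AM-NEG"), (6, 7, "V")]
tags = spans_to_bio(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print("%-8s %s" % (token, tag))
```

Because every position gets exactly one tag, argument identification and role labeling collapse into a single per-position classification, which is what makes the sequence tagging formulation simple.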
@@ -111,14 +98,11 @@ The operation of a single LSTM cell contain 3 parts: (1) input-to-hidden: map in
 Fig. 3 illustrates the final stacked recurrent neural network.
-<p align="center">    
+<p align="center">  
-<img src="./image/stacked_lstm.png" width = "40%"  align=center><br>
+<img src="./image/stacked_lstm_en.png" width = "40%"  align=center><br>
 Fig 3. Stacked Recurrent Neural Networks
 </p>
-线性变换-> linear transformation
-输入层到隐层-> input-to-hidden
 ### Bidirectional Recurrent Neural Network
 LSTMs can summarize the history of the inputs seen so far, but cannot see the future. In most NLP (natural language processing) tasks, the entire sentence is available at once. Therefore, sequential learning might be much more effective if the future could be encoded in the same way as the history.
@@ -126,16 +110,11 @@ LSTMs can summarize the history of previous inputs seen up to now, but can not s
 To address the above drawback, we can design a bidirectional recurrent neural network by making a minor modification: each higher LSTM layer processes the sequence in the direction opposite to that of the layer below it, i.e., deep LSTMs operate left-to-right, right-to-left, left-to-right, ..., in depth. Therefore, from the second layer onward, the LSTM layers at time step $t$ can see both the history and the future. Fig. 4 illustrates the bidirectional recurrent neural network.
-<p align="center">    
+<p align="center">  
-<img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br>
+<img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br>
 Fig 4. Bidirectional LSTMs
 </p>
-线性变换-> linear transformation
-输入层到隐层-> input-to-hidden
-正向处理输出序列->process sequence in the forward direction
-反向处理上一层序列-> process sequence from the previous layer in backward direction
 Note that this bidirectional RNN is different from the one proposed by Bengio et al. for machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce that other bidirectional RNN in a later chapter, [machine translation](https://github.com/PaddlePaddle/book/blob/develop/machine_translation/README.md).
 ### Conditional Random Field
@@ -147,12 +126,12 @@ CRF is a probabilistic graph model (undirected) with nodes denoting random varia
 Sequence tagging tasks treat the input and output as linear sequences, with no extra dependency assumptions in the graph model. Thus, the graph model of a sequence tagging task is a simple chain or line, which results in a Linear-Chain Conditional Random Field, shown in Fig. 5.
-<p align="center">    
+<p align="center">  
 <img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
 Fig 5. Linear Chain Conditional Random Field used in SRL tasks
 </p>
-By the fundamental theorem of random fields \[[5](#Reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form: 
+By the fundamental theorem of random fields \[[5](#Reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form:
 $$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
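
To make the formula above concrete, here is a minimal sketch that evaluates $p(Y|X)$ for a toy linear-chain CRF by brute force. It rests on several assumptions that are not part of the tutorial: the table `transition[a][b]` stands in for the weighted sum $\sum_{j}\lambda_{j}t_{j}$ of moving from label `a` to label `b`, `emission[i][y]` stands in for $\sum_{k}\mu_k s_k$ at position `i`, the first position has no transition term, and all numbers are made up. Enumerating every label sequence to obtain $Z(X)$ is only feasible at toy sizes.

```python
import itertools
import math


def sequence_score(labels, transition, emission):
    """Unnormalized log-score of one label sequence (the exponent in p(Y|X))."""
    score = 0.0
    for i, y in enumerate(labels):
        score += emission[i][y]                      # state features at position i
        if i > 0:
            score += transition[labels[i - 1]][y]    # transition features
    return score


def crf_probability(labels, transition, emission):
    """p(Y|X) = exp(score(Y)) / Z(X), with Z(X) computed by enumeration."""
    num_labels = len(transition)
    length = len(emission)
    z = sum(
        math.exp(sequence_score(candidate, transition, emission))
        for candidate in itertools.product(range(num_labels), repeat=length)
    )
    return math.exp(sequence_score(labels, transition, emission)) / z


# Toy scores for 2 labels and a sequence of length 3 (illustrative values).
transition = [[0.5, -0.2],
              [-0.3, 0.8]]
emission = [[1.0, 0.1],
            [0.2, 0.9],
            [0.3, 0.4]]

print(crf_probability((0, 1, 1), transition, emission))
```

The normalizer $Z(X)$ is what ties the positions together: a label sequence is scored as a whole rather than one position at a time.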
@@ -196,19 +175,11 @@ After modification, the model is as follows:
 4. Take the representation from step 3 as the input of the CRF and the label sequence as the supervision signal to complete the sequence tagging task.
-<div  align="center">    
+<div  align="center">  
-<img src="image/db_lstm_network.png" width = "60%"  align=center /><br>
+<img src="image/db_lstm_en.png" width = "60%"  align=center /><br>
 Fig 6. DB-LSTM for SRL tasks
 </div>
-论元-> argu
-谓词-> pred
-谓词上下文-> ctx-p
-谓词上下文区域标记-> $m_r$
-输入-> input
-原句-> sentence
-反向LSTM-> LSTM Reverse
 ## Data Preparation
 In the tutorial, we use [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task open dataset as an example. It is important to note that the training set and development set of the CoNLL 2005 SRL task are not free to download after the competition. Currently, only the test set can be obtained, including 23 sections of the Wall Street Journal and three sections of the Brown corpus. In this tutorial, we use the WSJ corpus as the training dataset to explain the model. However, since the training set is small, if you want to train a usable neural network SRL system, consider paying for the full corpus.
@@ -300,10 +271,10 @@ def d_type(value_range):
 # word sequence
 word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
 # predicate
-predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len)) 
+predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
 # 5 features for predicate context
-ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len)) 
+ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
 ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
 ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
 ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
@@ -315,12 +286,12 @@ mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
 # label sequence
 target = paddle.layer.data(name='target', type=d_type(label_dict_len))
 ```
 Special note: hidden_dim = 512 means the LSTM hidden vector has 128 dimensions (512/4). Please refer to the PaddlePaddle official documentation for details: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory).
 - 2. The word sequence, predicate, predicate context, and region mark sequence are transformed into embedding vector sequences.
-```python   
+```python  
 # Since the word vector lookup table is pre-trained, we won't update it this time.
 # is_static being True prevents updating the lookup table during training.
@@ -446,7 +417,7 @@ parameters = paddle.parameters.create([crf_cost, crf_dec])
 ```
 We can print out the parameter names. A name is generated automatically if it is not specified.
 ```python
 print parameters.keys()
 ```
@@ -542,6 +513,7 @@ Semantic Role Labeling is an important intermediate step in a wide range of natu
 <br/>
 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span> 由 <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作，采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
 </div>
 <!-- You can change the lines below now. -->
@@ -560,6 +532,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
--- a/label_semantic_roles/index.html
+++ b/label_semantic_roles/index.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -93,7 +94,7 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mb
 图3是最终得到的栈式循环神经网络结构示意图。
-<p align="center">    
+<p align="center">  
 <img src="./image/stacked_lstm.png" width = "40%"  align=center><br>
 图3. 基于LSTM的栈式循环神经网络结构示意图
 </p>
@@ -104,7 +105,7 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mb
 为了克服这一缺陷，我们可以设计一种双向循环网络单元，它的思想简单且直接：对上一节的栈式循环神经网络进行一个小小的修改，堆叠多个LSTM单元，让每一层LSTM单元分别以：正向、反向、正向 …… 的顺序学习上一层的输出序列。于是，从第2层开始，$t$时刻我们的LSTM单元便总是可以看到历史和未来的信息。图4是基于LSTM的双向循环神经网络结构示意图。
-<p align="center">    
+<p align="center">  
 <img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br>
 图4. 基于LSTM的双向循环神经网络结构示意图
 </p>
@@ -119,7 +120,7 @@ CRF是一种概率化结构模型，可以看作是一个概率无向图模型
 序列标注任务只需要考虑输入和输出都是一个线性序列，并且由于我们只是将输入序列作为条件，不做任何条件独立假设，因此输入序列的元素之间并不存在图结构。综上，在序列标注任务中使用的是如图5所示的定义在链式图上的CRF，称之为线性链条件随机场（Linear Chain Conditional Random Field）。
-<p align="center">    
+<p align="center">  
 <img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
 图5. 序列标注任务中使用的线性链条件随机场
 </p>
@@ -163,7 +164,7 @@ $$L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \fra
 3. 第2步的4个词向量序列作为双向LSTM模型的输入；LSTM模型学习输入序列的特征表示，得到新的特性表示序列；
 4. CRF以第3步中LSTM学习到的特征为输入，以标记序列为监督信号，完成序列标注；
-<div  align="center">    
+<div  align="center">  
 <img src="image/db_lstm_network.png" width = "60%"  align=center /><br>
 图6. SRL任务上的深层双向LSTM模型
 </div>
@@ -202,7 +203,7 @@ conll05st-release/
 预处理完成之后一条训练样本包含9个特征，分别是：句子序列、谓词、谓词上下文（占 5 列）、谓词上下区域标志、标注序列。下表是一条训练样本的示例。
-| 句子序列 | 谓词 | 谓词上下文（窗口 = 5） | 谓词上下文区域标记 | 标注序列 | 
+| 句子序列 | 谓词 | 谓词上下文（窗口 = 5） | 谓词上下文区域标记 | 标注序列 |
 |---|---|---|---|---|
 | A | set | n't been set . × | 0 | B-A1 |
 | record | set | n't been set . × | 0 | I-A1 |
@@ -255,7 +256,7 @@ word_dim = 32        # 词向量维度
 mark_dim = 5         # 谓词上下文区域通过词表被映射为一个实向量，这个是相邻的维度
 hidden_dim = 512     # LSTM隐层向量的维度 ： 512 / 4
 depth = 8            # 栈式LSTM的深度
 # 一条样本总共9个特征，下面定义了9个data层，每个层类型为integer_value_sequence，表示整数ID的序列类型.
 def d_type(size):
    return paddle.data_type.integer_value_sequence(size)
@@ -263,10 +264,10 @@ def d_type(size):
 # 句子序列
 word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
 # 谓词
-predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len)) 
+predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
 # 谓词上下文5个特征
-ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len)) 
+ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
 ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
 ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
 ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
@@ -278,12 +279,12 @@ mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
 # 标注序列
 target = paddle.layer.data(name='target', type=d_type(label_dict_len))
 ```
 这里需要特别说明的是hidden_dim = 512指定了LSTM隐层向量的维度为128维，关于这一点请参考PaddlePaddle官方文档中[lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory)的说明。
 - 2. 将句子序列、谓词、谓词上下文、谓词上下文区域标记通过词表，转换为实向量表示的词向量序列。
-```python   
+```python  
 # 在本教程中，我们加载了预训练的词向量，这里设置了：is_static=True
 # is_static 为 True 时保证了在训练 SRL 模型过程中，词表不再更新
@@ -410,7 +411,7 @@ parameters = paddle.parameters.create([crf_cost, crf_dec])
 ```
 可以打印参数名字，如果在网络配置中没有指定名字，则默认生成。
 ```python
 print parameters.keys()
 ```
@@ -509,6 +510,7 @@ trainer.train(
 <br/>
 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span> 由 <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作，采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
 </div>
 <!-- You can change the lines below now. -->
@@ -527,6 +529,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
--- a/label_semantic_roles/predict.sh
+++ b/label_semantic_roles/predict.sh
@@ -19,7 +19,7 @@ function get_best_pass() {
  cat $1  | grep -Pzo 'Test .*\n.*pass-.*' | \
  sed  -r 'N;s/Test.* cost=([0-9]+\.[0-9]+).*\n.*pass-([0-9]+)/\1 \2/g' | \
  sort -n | head -n 1
-}   
+}
 log=train.log
 LOG=`get_best_pass $log`
@@ -28,11 +28,11 @@ best_model_path="output/pass-${LOG[1]}"
 config_file=db_lstm.py
 dict_file=./data/wordDict.txt
-label_file=./data/targetDict.txt 
+label_file=./data/targetDict.txt
 predicate_dict_file=./data/verbDict.txt
 input_file=./data/feature
 output_file=predict.res
 python predict.py \
     -c $config_file \
     -w $best_model_path \

--- a/machine_translation/README.en.md
+++ b/machine_translation/README.en.md
@@ -9,19 +9,19 @@ Machine translation (MT) leverages computers to translate from one language to a
 Early machine translation systems were mainly rule-based, i.e., they relied on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in even one language, let alone to specify all possible rules for two or more languages. Hence, a major challenge in conventional machine translation has been the difficulty of obtaining a complete rule set \[[1](#References)\].
-To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example: 
+To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example:
-1. human designed features cannot cover all possible linguistic variations; 
+1. human designed features cannot cover all possible linguistic variations;
-2. it is difficult to use global features; 
+2. it is difficult to use global features;
 3. the techniques heavily rely on pre-processing techniques like word alignment, word segmentation and tokenization, rule-extraction and syntactic parsing etc. The error introduced in any of these steps could accumulate and impact translation quality.
-The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are: 
+The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are:
-1. techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1); 
+1. techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1);
 2. techniques mapping from source language to target language directly using a neural network, or end-to-end neural machine translation (NMT).
@@ -57,7 +57,7 @@ This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent
 We already introduced RNN and LSTM in the [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md) chapter.
 Compared to a simple RNN, the LSTM adds a memory cell, an input gate, a forget gate, and an output gate. These gates, combined with the memory cell, greatly improve the ability to handle long-term dependencies.
-GRU\[[2](#References)\] proposed by Cho et al is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below. 
+GRU\[[2](#References)\] proposed by Cho et al is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below.
 A GRU unit has only two gates:
 - reset gate: when this gate is closed, the history information is discarded, i.e., the irrelevant historical information has no effect on the future output.
 - update gate: it combines the input gate and the forget gate and is used to control the impact of historical information on the hidden output. The historical information is passed over when the update gate is close to 1.
@@ -96,20 +96,20 @@ There are three steps for encoding a sentence:
 1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\in R^{\left | V \right |},i=1,2,...,T$, where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the location corresponding to the location of the word in the dictionary and zero elsewhere.
-2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation 
+2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation
-  * the dimensionality of the vector is typically large, leading to the curse of dimensionality; 
+  * the dimensionality of the vector is typically large, leading to the curse of dimensionality;
  * it is hard to capture the relationships between words, i.e., semantic similarities. Therefore, it is useful to project the one-hot vector into a low-dimensional semantic space as a dense vector with fixed dimensions, i.e., $s_i=Cw_i$ for the $i$-th word, with $C\in R^{K\times \left | V \right |}$ as the projection matrix and $K$ as the dimensionality of the word embedding vector.
 3. Encoding of the source sequence via RNN: This can be described mathematically as:
    $$h_i=\varnothing _\theta \left ( h_{i-1}, s_i \right )$$
-    where 
+    where
-    $h_0$ is a zero vector, 
+    $h_0$ is a zero vector,
-    $\varnothing _\theta$ is a non-linear activation function, and 
+    $\varnothing _\theta$ is a non-linear activation function, and
-    $\mathbf{h}=\left \{ h_1,..., h_T \right \}$ 
+    $\mathbf{h}=\left \{ h_1,..., h_T \right \}$
    is the sequential encoding of the first $T$ words from the source sequence. The vector representation of the whole sentence can be represented as the encoding vector at the last time step $T$ from $\mathbf{h}$, or by temporal pooling over $\mathbf{h}$.
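
The three encoding steps above can be sketched in plain Python. This is a hedged illustration rather than the tutorial's implementation: the matrix `C`, the weights `W_h` and `W_s`, and the choice of a plain `tanh` recurrence for the non-linear function are all assumptions made for the example (the tutorial itself uses GRU units).

```python
import math


def one_hot(index, vocab_size):
    """Step 1: a |V|-dimensional vector with a single 1 at the word's index."""
    return [1.0 if j == index else 0.0 for j in range(vocab_size)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical projection matrix C with K = 2 rows and |V| = 3 columns.
C = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]


def embed(word_index):
    """Step 2: s_i = C w_i, which simply selects column `word_index` of C."""
    return matvec(C, one_hot(word_index, len(C[0])))


def rnn_step(h_prev, s, W_h, W_s):
    """Step 3: h_i = tanh(W_h h_{i-1} + W_s s_i), a simple stand-in recurrence."""
    pre = [a + b for a, b in zip(matvec(W_h, h_prev), matvec(W_s, s))]
    return [math.tanh(p) for p in pre]


def encode(word_indices, W_h, W_s):
    """Run the recurrence over the sentence, starting from a zero vector h_0."""
    h = [0.0] * len(W_h)
    states = []
    for index in word_indices:
        h = rnn_step(h, embed(index), W_h, W_s)
        states.append(h)
    return states


# Illustrative recurrent and input weights.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_s = [[1.0, 0.0], [0.0, 1.0]]
states = encode([1, 0, 2], W_h, W_s)
print(states[-1])  # encoding at the last time step
```

Since multiplying by a one-hot vector only picks out one column of `C`, practical implementations replace the matrix product with a table lookup.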
@@ -142,8 +142,8 @@ The generation process of machine translation is to translate the source sentenc
 ### Attention Mechanism
-There are a few problems with the fixed dimensional vector representation from the encoding stage: 
+There are a few problems with the fixed dimensional vector representation from the encoding stage:
-  * It is very challenging to encode both the semantic and syntactic information a sentence with a fixed dimensional vector regardless of the length of the sentence. 
+  * It is very challenging to encode both the semantic and syntactic information a sentence with a fixed dimensional vector regardless of the length of the sentence.
  * Intuitively, when translating a sentence, we typically pay more attention to the parts in the source sentence more relevant to the current translation. Moreover, the focus changes along the process of the translation. With a fixed dimensional vector, all the information from the source sentence is treated equally in terms of attention. This is not reasonable. Therefore, Bahdanau et al. \[[4](#References)\] introduced attention mechanism, which can decode based on different fragments of the context sequence in order to address the difficulty of feature learning for long sentences. Decoder with attention will be explained in the following.
 Different from the simple decoder, $z_i$ is computed as:
@@ -172,7 +172,7 @@ Figure 6. Decoder with Attention Mechanism
 [Beam Search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is typically used when the solution space is huge (e.g., for machine translation, speech recognition), and there is not enough memory for all the possible solutions. For example, if we want to translate “`<s>你好<e>`” into English, even if there are only three words in the dictionary (`<s>`, `<e>`, `hello`), it is still possible to generate an infinite number of sentences, where the word `hello` can appear a different number of times. Beam search could be used to find a good translation among them.
-Beam search builds a search tree using breadth first search and sorts the nodes according to a heuristic cost (sum of the log probability of the generated words) at each level of the tree. Only a fixed number of nodes according to the pre-specified beam size (or beam width) are considered. Thus, only nodes with highest scores are expanded in the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed. 
+Beam search builds a search tree using breadth first search and sorts the nodes according to a heuristic cost (sum of the log probability of the generated words) at each level of the tree. Only a fixed number of nodes according to the pre-specified beam size (or beam width) are considered. Thus, only nodes with highest scores are expanded in the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed.
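
As a rough illustration of the idea, the sketch below runs beam search over the three-word dictionary from the example. The function `toy_next_token_probs` is a hypothetical stand-in for a trained decoder and its probabilities are invented; the search is also simplified in that finished hypotheses are set aside as soon as they emit `<e>`.

```python
import math


def toy_next_token_probs(prefix):
    """Hypothetical model: next-token probabilities given the prefix so far."""
    if len(prefix) == 1:                  # right after <s>
        return {"hello": 0.9, "<e>": 0.1}
    return {"hello": 0.2, "<e>": 0.8}     # after at least one generated word


def beam_search(next_probs, beam_size=2, max_len=5):
    """Return the (log-probability, tokens) pair with the best score found."""
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_len):
        # Expand every hypothesis in the beam by every possible next token.
        candidates = []
        for score, seq in beams:
            for token, prob in next_probs(seq).items():
                candidates.append((score + math.log(prob), seq + [token]))
        # Keep only the `beam_size` candidates with the highest scores.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == "<e>":
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[0])


best_score, best_seq = beam_search(toy_next_token_probs)
print(best_seq, best_score)
```

With `beam_size=1` the search degenerates to greedy decoding; larger beams trade memory and time for a better chance of finding a high-probability sequence, still without any guarantee of global optimality.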
 The goal of beam search in decoding is to maximize the probability of the generated sequence. The procedure is as follows:
@@ -452,7 +452,7 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn
   source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary
   target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of target language dictionary
   word_vector_dim = 512 # dimensionality of word vector
-   encoder_size = 512 	 # dimensionality of the hidden state of encoder GRU
+   encoder_size = 512      # dimensionality of the hidden state of encoder GRU
   decoder_size = 512    # dimensionality of the hidden state of decoder GRU
   if is_generating:

--- a/machine_translation/README.md
+++ b/machine_translation/README.md
--- a/machine_translation/api_train.py
+++ b/machine_translation/api_train.py
+import paddle.v2 as paddle
+def seqToseq_net(source_dict_dim, target_dict_dim):
+    ### Network Architecture
+    word_vector_dim = 512  # dimension of word vector
+    decoder_size = 512  # dimension of hidden unit in GRU Decoder network
+    encoder_size = 512  # dimension of hidden unit in GRU Encoder network
+    #### Encoder
+    src_word_id = paddle.layer.data(
+        name='source_language_word',
+        type=paddle.data_type.integer_value_sequence(source_dict_dim))
+    src_embedding = paddle.layer.embedding(
+        input=src_word_id,
+        size=word_vector_dim,
+        param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
+    src_forward = paddle.networks.simple_gru(
+        input=src_embedding, size=encoder_size)
+    src_backward = paddle.networks.simple_gru(
+        input=src_embedding, size=encoder_size, reverse=True)
+    encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
+    #### Decoder
+    with paddle.layer.mixed(size=decoder_size) as encoded_proj:
+        encoded_proj += paddle.layer.full_matrix_projection(
+            input=encoded_vector)
+    backward_first = paddle.layer.first_seq(input=src_backward)
+    with paddle.layer.mixed(
+            size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
+        decoder_boot += paddle.layer.full_matrix_projection(
+            input=backward_first)
+    def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
+        decoder_mem = paddle.layer.memory(
+            name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
+        context = paddle.networks.simple_attention(
+            encoded_sequence=enc_vec,
+            encoded_proj=enc_proj,
+            decoder_state=decoder_mem)
+        with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
+            decoder_inputs += paddle.layer.full_matrix_projection(input=context)
+            decoder_inputs += paddle.layer.full_matrix_projection(
+                input=current_word)
+        gru_step = paddle.layer.gru_step(
+            name='gru_decoder',
+            input=decoder_inputs,
+            output_mem=decoder_mem,
+            size=decoder_size)
+        with paddle.layer.mixed(
+                size=target_dict_dim,
+                bias_attr=True,
+                act=paddle.activation.Softmax()) as out:
+            out += paddle.layer.full_matrix_projection(input=gru_step)
+        return out
+    decoder_group_name = "decoder_group"
+    group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
+    group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
+    group_inputs = [group_input1, group_input2]
+    trg_embedding = paddle.layer.embedding(
+        input=paddle.layer.data(
+            name='target_language_word',
+            type=paddle.data_type.integer_value_sequence(target_dict_dim)),
+        size=word_vector_dim,
+        param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
+    group_inputs.append(trg_embedding)
+    # For decoder equipped with attention mechanism, in training,
+    # target embedding (the ground truth) is the data input,
+    # while the encoded source sequence is accessed as an unbounded memory.
+    # Here, the StaticInput defines a read-only memory
+    # for the recurrent_group.
+    decoder = paddle.layer.recurrent_group(
+        name=decoder_group_name,
+        step=gru_decoder_with_attention,
+        input=group_inputs)
+    lbl = paddle.layer.data(
+        name='target_language_next_word',
+        type=paddle.data_type.integer_value_sequence(target_dict_dim))
+    cost = paddle.layer.classification_cost(input=decoder, label=lbl)
+    return cost
+def main():
+    paddle.init(use_gpu=False, trainer_count=1)
+    # source and target dict dim.
+    dict_size = 30000
+    source_dict_dim = target_dict_dim = dict_size
+    # define network topology
+    cost = seqToseq_net(source_dict_dim, target_dict_dim)
+    parameters = paddle.parameters.create(cost)
+    # define optimize method and trainer
+    optimizer = paddle.optimizer.Adam(learning_rate=1e-4)
+    trainer = paddle.trainer.SGD(cost=cost,
+                                 parameters=parameters,
+                                 update_equation=optimizer)
+    # define data reader
+    feeding = {
+        'source_language_word': 0,
+        'target_language_word': 1,
+        'target_language_next_word': 2
+    }
+    wmt14_reader = paddle.batch(
+        paddle.reader.shuffle(
+            paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
+        batch_size=5)
+    # define event_handler callback
+    def event_handler(event):
+        if isinstance(event, paddle.event.EndIteration):
+            if event.batch_id % 10 == 0:
+                print "Pass %d, Batch %d, Cost %f, %s" % (
+                    event.pass_id, event.batch_id, event.cost, event.metrics)
+    # start to train
+    trainer.train(
+        reader=wmt14_reader,
+        event_handler=event_handler,
+        num_passes=10000,
+        feeding=feeding)
+if __name__ == '__main__':
+    main()
--- a/machine_translation/data/wmt14_data.sh
+++ b/machine_translation/data/wmt14_data.sh
@@ -32,17 +32,17 @@ rm dev+test.tgz
 # separate the dev and test dataset
 mkdir test gen
 mv dev/ntst1213.* test
-mv dev/ntst14.* gen 
+mv dev/ntst14.* gen
 rm -rf dev
 set +x
 # rename the suffix, .fr->.src, .en->.trg
 for dir in train test gen
-do 
+do
  filelist=`ls $dir`
  cd $dir
  for file in $filelist
-  do 
+  do
    if [ ${file##*.} = "fr" ]; then
      mv $file ${file/%fr/src}
    elif [ ${file##*.} = 'en' ]; then

--- a/machine_translation/eval_bleu.sh
+++ b/machine_translation/eval_bleu.sh
@@ -31,7 +31,7 @@ else
            print $3;
            read_pos += (2 + res_num);
      }}' res_num=$beam_size $gen_file >$top1
-fi 
+fi
 # evaluate bleu value
 bleu_script=multi-bleu.perl

--- a/machine_translation/index.en.html
+++ b/machine_translation/index.en.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -50,19 +51,19 @@ Machine translation (MT) leverages computers to translate from one language to a
 Early machine translation systems were mainly rule-based, i.e., they relied on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in even one language, let alone to specify all possible rules for two or more languages. Hence, a major challenge in conventional machine translation has been the difficulty of obtaining a complete rule set \[[1](#References)\].
-To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example: 
+To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example:
-1. human designed features cannot cover all possible linguistic variations; 
+1. human designed features cannot cover all possible linguistic variations;
-2. it is difficult to use global features; 
+2. it is difficult to use global features;
 3. the techniques heavily rely on pre-processing techniques like word alignment, word segmentation and tokenization, rule-extraction and syntactic parsing etc. The error introduced in any of these steps could accumulate and impact translation quality.
-The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are: 
+The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are:
-1. techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1); 
+1. techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1);
 2. techniques mapping from source language to target language directly using a neural network, or end-to-end neural machine translation (NMT).
@@ -98,7 +99,7 @@ This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent
 We already introduced RNN and LSTM in the [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md) chapter.
 Compared to a simple RNN, the LSTM adds a memory cell, an input gate, a forget gate, and an output gate. Combined with the memory cell, these gates greatly improve the ability to handle long-term dependencies.
-The GRU \[[2](#References)\], proposed by Cho et al., is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below. 
+The GRU \[[2](#References)\], proposed by Cho et al., is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below.
 A GRU unit has only two gates:
 - reset gate: when this gate is closed, the history information is discarded, i.e., the irrelevant historical information has no effect on the future output.
 - update gate: it combines the input gate and the forget gate and is used to control the impact of historical information on the hidden output. The historical information is passed over when the update gate is close to 1.
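
The behaviour of the two gates can be illustrated with a scalar toy cell. This is a minimal sketch, not PaddlePaddle's implementation: the function `gru_step`, the weight names in `w`, and the combination rule `u * h_prev + (1 - u) * h_cand` (chosen so that an update gate near 1 carries the history forward, as described above) are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, w):
    """One step of a scalar GRU; `w` is a dict of hypothetical weights."""
    # Reset gate: near 0 ("closed") means the history is ignored
    # when computing the candidate state.
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev + w["br"])
    # Update gate: near 1 means the previous state is passed through.
    u = sigmoid(w["wu"] * x + w["uu"] * h_prev + w["bu"])
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h_prev))
    return u * h_prev + (1.0 - u) * h_cand


base = dict(wr=0.0, ur=0.0, wu=0.0, uu=0.0, wh=1.0, uh=1.0)

# Update gate saturated at 1: the old state 0.3 is kept.
keep = gru_step(0.5, 0.3, dict(base, br=0.0, bu=20.0))

# Both gates saturated at 0: the new state depends on the input only.
forget = gru_step(0.5, 0.3, dict(base, br=-20.0, bu=-20.0))
```

With the update gate saturated, `keep` stays at the previous state; with both gates closed, `forget` is approximately `tanh(0.5)`, independent of the history.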
@@ -137,20 +138,20 @@ There are three steps for encoding a sentence:
 1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\in R^{\left | V \right |},i=1,2,...,T$, where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the position of the word in the dictionary and zeros elsewhere.
-2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation 
+2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation
-  * the dimensionality of the vector is typically large, leading to the curse of dimensionality; 
+  * the dimensionality of the vector is typically large, leading to the curse of dimensionality;
   * it is hard to capture the relationships between words, i.e., semantic similarities. Therefore, it is useful to project the one-hot vector into a low-dimensional semantic space as a dense vector with fixed dimensions, i.e., $s_i=Cw_i$ for the $i$-th word, with $C\in R^{K\times \left | V \right |}$ as the projection matrix and $K$ as the dimensionality of the word embedding vector.
 3. Encoding of the source sequence via RNN: This can be described mathematically as:
    $$h_i=\varnothing _\theta \left ( h_{i-1}, s_i \right )$$
-    where 
+    where
-    $h_0$ is a zero vector, 
+    $h_0$ is a zero vector,
-    $\varnothing _\theta$ is a non-linear activation function, and 
+    $\varnothing _\theta$ is a non-linear activation function, and
-    $\mathbf{h}=\left \{ h_1,..., h_T \right \}$ 
+    $\mathbf{h}=\left \{ h_1,..., h_T \right \}$
    is the sequential encoding of the first $T$ words from the source sequence. The vector representation of the whole sentence can be represented as the encoding vector at the last time step $T$ from $\mathbf{h}$, or by temporal pooling over $\mathbf{h}$.
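
The three encoding steps above can be sketched in plain Python. This is an illustrative toy, assuming a simple `tanh` RNN as the non-linear function; the helper names (`one_hot`, `matvec`, `encode`) and the matrices `W` and `U` are hypothetical and not part of the tutorial's code.

```python
import math


def one_hot(index, vocab_size):
    """Step 1: a vector with a single 1 at the word's dictionary index."""
    return [1.0 if j == index else 0.0 for j in range(vocab_size)]


def matvec(m, v):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def encode(word_ids, C, W, U):
    """Steps 2-3: embed each word (s_i = C w_i) and run a tanh RNN over them."""
    vocab_size = len(C[0])
    h = [0.0] * len(W)  # h_0 is a zero vector
    states = []
    for idx in word_ids:
        s = matvec(C, one_hot(idx, vocab_size))  # selects column idx of C
        pre = [a + b for a, b in zip(matvec(W, h), matvec(U, s))]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states


# Toy sizes: |V| = 3 words, K = 2 embedding dimensions, 2 hidden units.
C = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[1.0, 0.0], [0.0, 1.0]]

states = encode([2, 0], C, W, U)
sentence_vector = states[-1]  # encoding at the last time step T
```

Multiplying `C` by a one-hot vector simply picks out one column of `C`, which is why embedding layers are usually implemented as table lookups rather than matrix products.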
@@ -183,8 +184,8 @@ The generation process of machine translation is to translate the source sentenc
 ### Attention Mechanism
-There are a few problems with the fixed dimensional vector representation from the encoding stage: 
+There are a few problems with the fixed dimensional vector representation from the encoding stage:
-  * It is very challenging to encode both the semantic and syntactic information of a sentence with a fixed dimensional vector regardless of the length of the sentence. 
+  * It is very challenging to encode both the semantic and syntactic information of a sentence with a fixed dimensional vector regardless of the length of the sentence.
   * Intuitively, when translating a sentence, we typically pay more attention to the parts of the source sentence that are most relevant to the current translation step. Moreover, the focus changes as the translation proceeds. With a fixed dimensional vector, all the information from the source sentence is treated equally in terms of attention, which is not reasonable. Therefore, Bahdanau et al. \[[4](#References)\] introduced the attention mechanism, which decodes based on different fragments of the context sequence in order to address the difficulty of feature learning for long sentences. The decoder with attention is explained below.
 Different from the simple decoder, $z_i$ is computed as:
@@ -213,7 +214,7 @@ Figure 6. Decoder with Attention Mechanism
 [Beam Search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is typically used when the solution space is huge (e.g., in machine translation or speech recognition) and there is not enough memory to hold all possible solutions. For example, if we want to translate “`<s>你好<e>`” into English, even if there are only three words in the dictionary (`<s>`, `<e>`, `hello`), it is still possible to generate an infinite number of sentences, because the word `hello` can appear any number of times. Beam search can be used to find a good translation among them.
-Beam search builds a search tree using breadth first search and sorts the nodes according to a heuristic cost (sum of the log probability of the generated words) at each level of the tree. Only a fixed number of nodes according to the pre-specified beam size (or beam width) are considered. Thus, only nodes with highest scores are expanded in the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed. 
+Beam search builds a search tree using breadth first search and sorts the nodes according to a heuristic cost (sum of the log probability of the generated words) at each level of the tree. Only a fixed number of nodes according to the pre-specified beam size (or beam width) are considered. Thus, only nodes with highest scores are expanded in the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed.
 The goal of beam search in decoding is to maximize the probability of the generated sequence. The procedure is as follows:
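
A toy version of beam search, using the three-word dictionary from the example above, might look as follows. This is a sketch under stated assumptions rather than the tutorial's decoder: `beam_search` and `toy_model` are hypothetical names, the probabilities are made up, and finished hypotheses are assumed to compete for beam slots with unfinished ones.

```python
import math


def beam_search(next_log_probs, start, end, beam_size, max_len):
    """Keep the `beam_size` best partial sequences at each level of the tree."""
    beams = [([start], 0.0)]  # (tokens, sum of log probabilities)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # Sort by score and prune to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)


def toy_model(tokens):
    """Hypothetical next-word distribution over {hello, <e>}."""
    if tokens[-1] == "<s>":
        return {"hello": math.log(0.9), "<e>": math.log(0.1)}
    return {"hello": math.log(0.2), "<e>": math.log(0.8)}


results = beam_search(toy_model, "<s>", "<e>", beam_size=2, max_len=5)
best_tokens, best_score = results[0]
```

Here the best hypothesis is `<s> hello <e>` with probability 0.9 × 0.8 = 0.72. The `max_len` cap is what keeps the search finite even though `hello` could be repeated indefinitely.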
@@ -493,7 +494,7 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn
   source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary
   target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of target language dictionary
   word_vector_dim = 512 # dimensionality of word vector
-   encoder_size = 512 	 # dimensionality of the hidden state of encoder GRU
+   encoder_size = 512      # dimensionality of the hidden state of encoder GRU
   decoder_size = 512    # dimensionality of the hidden state of decoder GRU
   if is_generating:
@@ -764,6 +765,7 @@ End-to-end neural machine translation is a recently developed way to perform mac
 <br/>
 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
 </div>
 <!-- You can change the lines below now. -->
@@ -782,6 +784,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
--- a/machine_translation/index.html
+++ b/machine_translation/index.html
--- a/machine_translation/pretrained/wmt14_model.sh
+++ b/machine_translation/pretrained/wmt14_model.sh
@@ -20,4 +20,4 @@ wget http://paddlepaddle.bj.bcebos.com/model_zoo/wmt14_model.tar.gz
 # untar the model
 tar -zxvf wmt14_model.tar.gz
-rm wmt14_model.tar.gz 
+rm wmt14_model.tar.gz
--- a/.tmpl/convert-markdown-into-html.sh
+++ b/.tmpl/convert-markdown-into-html.sh
-markdown_file=$1
+import argparse
+import re
+import sys
-# Notice: the single-quotes around EOF below make outputs
+HEAD = """
-# verbatium. c.f. http://stackoverflow.com/a/9870274/724872
-cat <<'EOF'
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -10,8 +10,8 @@ cat <<'EOF'
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -44,11 +44,9 @@ cat <<'EOF'
 <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
 <div id="markdown" style='display:none'>
-EOF
+"""
-cat $markdown_file
+TAIL = """
-cat <<'EOF'
 </div>
 <!-- You can change the lines below now. -->
@@ -67,7 +65,31 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
-EOF
+"""
+def convert_markdown_into_html(argv=None):
+    parser = argparse.ArgumentParser()
+    parser.add_argument('filenames', nargs='*', help='Filenames to fix')
+    args = parser.parse_args(argv)
+    retv = 0
+    for filename in args.filenames:
+        with open(
+                re.sub(r"README", "index", re.sub(r"\.md$", ".html", filename)),
+                "w") as output:
+            output.write(HEAD)
+            with open(filename) as input:
+                for line in input:
+                    output.write(line)
+            output.write(TAIL)
+    return retv
+if __name__ == '__main__':
+    sys.exit(convert_markdown_into_html())
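
The output path used by the hook comes from two nested `re.sub` calls. The snippet below isolates that expression in a hypothetical helper, `output_name`, to show how Markdown filenames map to HTML filenames; it is a sketch for illustration and not part of the commit.

```python
import re


def output_name(filename):
    """Apply the same substitutions the hook uses for its output path."""
    return re.sub(r"README", "index", re.sub(r"\.md$", ".html", filename))


names = [
    output_name(f)
    for f in ("README.md", "README.en.md", "word2vec/README.md")
]
# names == ["index.html", "index.en.html", "word2vec/index.html"]
```

Note that the outer substitution replaces every occurrence of `README` in the path, so a directory whose name contains `README` would be renamed in the output path as well.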
--- a/query_relationship/index.html
+++ b/query_relationship/index.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -39,6 +40,7 @@
 <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
 <div id="markdown" style='display:none'>
 </div>
 <!-- You can change the lines below now. -->
@@ -57,6 +59,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
--- a/recognize_digits/README.en.md
+++ b/recognize_digits/README.en.md
@@ -87,7 +87,7 @@ Fig. 5 Pooling layer<br/>
 A pooling layer performs downsampling. Its main purpose is to reduce computation by reducing the network parameters. It also prevents overfitting to some extent. Usually, a pooling layer is added after a convolutional layer. Pooling layers come in various types, such as max pooling and average pooling. Max pooling uses rectangles to segment the input layer into several parts and computes the maximum value in each part as the output (Fig. 5).
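
As a small illustration of max pooling, the sketch below splits a 4×4 input into non-overlapping 2×2 rectangles and keeps the maximum of each. The function `max_pool2d` is a hypothetical helper written for this example; it assumes a stride equal to the window size and ignores any incomplete border blocks.

```python
def max_pool2d(image, size):
    """Split `image` into size x size blocks and keep each block's maximum."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r + dr][c + dc]
                for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


image = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool2d(image, 2)  # [[4, 2], [2, 8]]
```

The 4×4 input shrinks to 2×2, so any layer that follows has a quarter as many inputs to process.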
-#### LeNet-5 Network 
+#### LeNet-5 Network
 <p align="center">
 <img src="image/cnn_en.png"><br/>
@@ -227,7 +227,7 @@ trainer = paddle.trainer.SGD(cost=cost,
 Then we specify the training data `paddle.dataset.mnist.train()` and testing data `paddle.dataset.mnist.test()`.  These two functions are *reader creators*: once called, each returns a *reader*.  A reader is a Python function which, once called, returns a Python generator that yields instances of data.
-Here `shuffle` is a reader decorator, which takes a reader A as its parameter and returns a new reader B. Each time, B calls A to read `buffer_size` data instances into a buffer, then shuffles the buffer and yields its instances.  If you want more thoroughly shuffled data, try using a larger buffer size. 
+Here `shuffle` is a reader decorator, which takes a reader A as its parameter and returns a new reader B. Each time, B calls A to read `buffer_size` data instances into a buffer, then shuffles the buffer and yields its instances.  If you want more thoroughly shuffled data, try using a larger buffer size.
 `batch` is a special decorator, whose input is a reader and output is a *batch reader*, which doesn't yield an instance at a time, but a minibatch.
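
The reader, reader creator, and decorator pattern described above can be mimicked in a few lines of plain Python. The functions below are stand-ins written for illustration (including the `seed` parameter, which is an assumption added to make the example reproducible); they are not PaddlePaddle's own implementations.

```python
import random


def reader_creator(data):
    """Reader creator: returns a reader, i.e. a function returning a generator."""
    def reader():
        for instance in data:
            yield instance
    return reader


def shuffle(reader, buffer_size, seed=None):
    """Decorator: buffer `buffer_size` instances, shuffle them, then yield them."""
    def shuffled_reader():
        rng = random.Random(seed)
        buf = []
        for instance in reader():
            buf.append(instance)
            if len(buf) >= buffer_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        rng.shuffle(buf)  # flush the last, possibly smaller, buffer
        for item in buf:
            yield item
    return shuffled_reader


def batch(reader, batch_size):
    """Decorator: turn a reader into a batch reader that yields minibatches."""
    def batch_reader():
        minibatch = []
        for instance in reader():
            minibatch.append(instance)
            if len(minibatch) == batch_size:
                yield minibatch
                minibatch = []
        if minibatch:
            yield minibatch
    return batch_reader


train_reader = batch(
    shuffle(reader_creator(range(10)), buffer_size=4, seed=0), batch_size=3)
batches = list(train_reader())
```

Because shuffling happens within one buffer at a time, the first four yielded instances are always some permutation of the first four inputs. This is why a larger buffer gives a more thorough shuffle.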

--- a/recognize_digits/README.md
+++ b/recognize_digits/README.md
@@ -56,7 +56,7 @@ Softmax回归模型采用了最简单的两层神经网络，即只有输入层
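 1.  After the first hidden layer, we obtain $ H_1 = \phi(W_1X + b_1) $, where $\phi$ denotes the activation function; common choices include sigmoid, tanh, and ReLU.
 2.  After the second hidden layer, we obtain $ H_2 = \phi(W_2H_1 + b_2) $.
 3.  Finally, the output layer gives $Y=softmax(W_3H_2 + b_3)$, which is the final classification result vector.
 Figure 3 shows the network structure of the multilayer perceptron: weights are drawn as blue lines, biases as red lines, and +1 indicates that the coefficient of the bias parameter is 1.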
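
The three formulas for $H_1$, $H_2$, and $Y$ can be traced with a tiny forward pass in plain Python. This is a minimal sketch with made-up weights, assuming ReLU as the activation $\phi$; the helper names (`affine`, `relu`, `softmax`, `mlp_forward`) are hypothetical and not the tutorial's PaddlePaddle code.

```python
import math


def affine(W, x, b):
    """Compute W x + b for a nested-list matrix W and list vectors x, b."""
    return [sum(w * v for w, v in zip(row, x)) + bias
            for row, bias in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, params, phi=relu):
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = phi(affine(W1, x, b1))      # H_1 = phi(W_1 X + b_1)
    h2 = phi(affine(W2, h1, b2))     # H_2 = phi(W_2 H_1 + b_2)
    return softmax(affine(W3, h2, b3))  # Y = softmax(W_3 H_2 + b_3)


params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),
]
y = mlp_forward([2.0, 1.0], params)
predicted_class = y.index(max(y))
```

The output `y` is a probability vector over three classes; the index of its largest entry is the predicted class.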

--- a/recognize_digits/index.en.html
+++ b/recognize_digits/index.en.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -128,7 +129,7 @@ Fig. 5 Pooling layer<br/>
 A pooling layer performs downsampling. Its main purpose is to reduce computation by reducing the network parameters. It also prevents overfitting to some extent. Usually, a pooling layer is added after a convolutional layer. Pooling layers come in various types, such as max pooling and average pooling. Max pooling uses rectangles to segment the input layer into several parts and computes the maximum value in each part as the output (Fig. 5).
-#### LeNet-5 Network 
+#### LeNet-5 Network
 <p align="center">
 <img src="image/cnn_en.png"><br/>
@@ -268,7 +269,7 @@ trainer = paddle.trainer.SGD(cost=cost,
 Then we specify the training data `paddle.dataset.mnist.train()` and testing data `paddle.dataset.mnist.test()`.  These two functions are *reader creators*: once called, each returns a *reader*.  A reader is a Python function which, once called, returns a Python generator that yields instances of data.
-Here `shuffle` is a reader decorator, which takes a reader A as its parameter and returns a new reader B. Each time, B calls A to read `buffer_size` data instances into a buffer, then shuffles the buffer and yields its instances.  If you want more thoroughly shuffled data, try using a larger buffer size. 
+Here `shuffle` is a reader decorator, which takes a reader A as its parameter and returns a new reader B. Each time, B calls A to read `buffer_size` data instances into a buffer, then shuffles the buffer and yields its instances.  If you want more thoroughly shuffled data, try using a larger buffer size.
 `batch` is a special decorator, whose input is a reader and output is a *batch reader*, which doesn't yield an instance at a time, but a minibatch.
@@ -338,6 +339,7 @@ This tutorial describes a few basic Deep Learning models viz. Softmax regression
 <br/>
 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This book</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
 </div>
 <!-- You can change the lines below now. -->
@@ -356,6 +358,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
--- a/recognize_digits/index.html
+++ b/recognize_digits/index.html
 <html>
 <head>
  <script type="text/x-mathjax-config">
@@ -5,8 +6,8 @@
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
-      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
+      inlineMath: [ ['$','$'] ],
-      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
+      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
@@ -97,7 +98,7 @@ Softmax回归模型采用了最简单的两层神经网络，即只有输入层
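 1.  After the first hidden layer, we obtain $ H_1 = \phi(W_1X + b_1) $, where $\phi$ denotes the activation function; common choices include sigmoid, tanh, and ReLU.
 2.  After the second hidden layer, we obtain $ H_2 = \phi(W_2H_1 + b_2) $.
 3.  Finally, the output layer gives $Y=softmax(W_3H_2 + b_3)$, which is the final classification result vector.
 Figure 3 shows the network structure of the multilayer perceptron: weights are drawn as blue lines, biases as red lines, and +1 indicates that the coefficient of the bias parameter is 1.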
@@ -340,6 +341,7 @@ trainer.train(
 <br/>
 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
 </div>
 <!-- You can change the lines below now. -->
@@ -358,6 +360,6 @@ marked.setOptions({
  }
 });
 document.getElementById("context").innerHTML = marked(
-		document.getElementById("markdown").innerHTML)
+        document.getElementById("markdown").innerHTML)
 </script>
 </body>
--- a/recommender_system/.gitignore
+++ b/recommender_system/.gitignore
+.idea
+.ipynb_checkpoints
--- a/recommender_system/README.en.md
+++ b/recommender_system/README.en.md
@@ -72,7 +72,7 @@ Given the feature vectors of users and movies, we compute the relevance using co
 <img src="image/rec_regression_network_en.png" width="90%" ><br/>
 Figure 3. A hybrid recommendation model.
-</p> 
+</p>
 ## Dataset

--- a/recommender_system/README.ipynb
+++ b/recommender_system/README.ipynb
--- a/recommender_system/README.md
+++ b/recommender_system/README.md
--- a/recommender_system/image/output_32_0.png
+++ b/recommender_system/image/output_32_0.png
--- a/recommender_system/image/rec_regression_network_en.png
+++ b/recommender_system/image/rec_regression_network_en.png
--- a/recommender_system/index.en.html
+++ b/recommender_system/index.en.html
--- a/recommender_system/index.html
+++ b/recommender_system/index.html
--- a/recommender_system/preprocess.sh
+++ b/recommender_system/preprocess.sh
@@ -17,9 +17,9 @@ set -e
 UNAME_STR=`uname`
 if [[ ${UNAME_STR} == 'Linux' ]]; then
-	SHUF_PROG='shuf'
+    SHUF_PROG='shuf'
 else
-	SHUF_PROG='gshuf'
+    SHUF_PROG='gshuf'
 fi

--- a/skip_thought/index.html
+++ b/skip_thought/index.html
--- a/speech_recognition/index.html
+++ b/speech_recognition/index.html
--- a/understand_sentiment/README.en.md
+++ b/understand_sentiment/README.en.md
--- a/understand_sentiment/README.md
+++ b/understand_sentiment/README.md
--- a/understand_sentiment/data/get_imdb.sh
+++ b/understand_sentiment/data/get_imdb.sh
@@ -33,7 +33,7 @@ echo "Unzipping..."
 tar -zxvf aclImdb_v1.tar.gz
 unzip master.zip
-#move train and test set to imdb_data directory 
+#move train and test set to imdb_data directory
 #in order to process when training
 mkdir -p imdb/train
 mkdir -p imdb/test

--- a/understand_sentiment/index.en.html
+++ b/understand_sentiment/index.en.html
--- a/understand_sentiment/index.html
+++ b/understand_sentiment/index.html
--- a/understand_sentiment/preprocess.py
+++ b/understand_sentiment/preprocess.py
--- a/understand_sentiment/train.py
+++ b/understand_sentiment/train.py
--- a/word2vec/README.en.md
+++ b/word2vec/README.en.md
--- a/word2vec/README.md
+++ b/word2vec/README.md
--- a/word2vec/format_convert.py
+++ b/word2vec/format_convert.py
--- a/word2vec/index.en.html
+++ b/word2vec/index.en.html
--- a/word2vec/index.html
+++ b/word2vec/index.html