Merge pull request #13095 from shanyi15/0.14.0

fix_bookdoc_0.14.0

0dbe8eb0 · Shan Yi · GitHub · 3165ddf3 · 837b7e17 · 3165ddf3
53 changed files
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/.gitignore
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/.gitignore
-*.pyc
-train.log
-output
-data/cifar-10-batches-py/
-data/cifar-10-python.tar.gz
-data/*.txt
-data/*.list
-data/mean.meta
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/index.md
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/index.md

 # Image Classification

-The source code for this tutorial is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification). If you are new to PaddlePaddle, see the [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
+The source code for this tutorial is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification). If you are new to PaddlePaddle, see the [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书). For more material, see the [video course](http://bit.baidu.com/course/detail/id/168.html) that accompanies this tutorial.

 ## Background

@@ -20,24 +20,25 @@

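 Image classification covers general-purpose image classification, fine-grained image classification, and related tasks. Figure 1 shows general-purpose classification, where the model correctly identifies the main object in the image.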

-![dogCatClassification](./image/dog_cat.png)
 <p align="center">
+<img src="image/dog_cat.png" width="350" ><br/>
 Figure 1. General-purpose image classification
 </p>


 Figure 2 shows fine-grained image classification, here flower recognition, where the model must identify the species of each flower.

-![flowersClassification](./image/flowers.png)
+
 <p align="center">
+<img src="image/flowers.png" width="400" ><br/>
 Figure 2. Fine-grained image classification
 </p>


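 A good model must not only tell categories apart, it must also recognize images correctly under changes in viewpoint, illumination, background, deformation, or partial occlusion (collectively called image perturbations here). Figure 3 shows some of these perturbations; a good model, like a person, should still recognize the objects correctly.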

-![imageVariations](https://raw.githubusercontent.com/PaddlePaddle/book/develop/03.image_classification/image/variations.png)
 <p align="center">
+<img src="image/variations.png" width="550" ><br/>
 Figure 3. Examples of perturbed images [22]
 </p>

@@ -46,17 +47,21 @@
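 Much of the research in image recognition is built on public datasets such as [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/) and [ImageNet](http://image-net.org/), and many algorithms are tested and compared on them. PASCAL VOC is a visual recognition challenge launched in 2005, and ImageNet is the dataset of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), launched in 2010. This chapter introduces image classification models based on papers from these competitions.

 Before 2012, traditional image classification could be described by the three steps mentioned in the background section, but a complete image recognition pipeline usually includes several stages: low-level feature learning, feature encoding, spatial constraints, classifier design, and model ensembling.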
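-1). **Low-level feature extraction**: A large number of local feature descriptors are usually extracted from the image at a fixed stride and scale. Common local features include SIFT (Scale-Invariant Feature Transform) \[[1](#参考文献)\], HOG (Histogram of Oriented Gradient) \[[2](#参考文献)\], and LBP (Local Binary Pattern) \[[3](#参考文献)\]. Several descriptors are typically combined to avoid losing too much useful information.
-2). **Feature encoding**: Low-level features contain a lot of redundancy and noise. To make the representation more robust, a feature transformation algorithm encodes the low-level features; this is called feature encoding. Common encodings include vector quantization \[[4](#参考文献)\], sparse coding \[[5](#参考文献)\], locality-constrained linear coding \[[6](#参考文献)\], and Fisher vector encoding \[[7](#参考文献)\].
-3). **Spatial feature constraints**: Feature encoding is usually followed by spatial feature constraints, also called **feature pooling**. Feature pooling takes the maximum or the average of each feature dimension within a spatial region, which yields a representation with some degree of feature invariance. Spatial pyramid matching is a common pooling method: it divides the image evenly into blocks and pools features within each block.
-4). **Classification**: After the previous steps, an image is described by a fixed-dimensional vector, which is then fed to a classifier. Common classifiers include SVM (Support Vector Machine) and random forests. Kernel SVMs are the most widely used and perform well on traditional image classification tasks.
+
+  1). **Low-level feature extraction**: A large number of local feature descriptors are usually extracted from the image at a fixed stride and scale. Common local features include SIFT (Scale-Invariant Feature Transform) \[[1](#参考文献)\], HOG (Histogram of Oriented Gradient) \[[2](#参考文献)\], and LBP (Local Binary Pattern) \[[3](#参考文献)\]. Several descriptors are typically combined to avoid losing too much useful information.
+
+  2). **Feature encoding**: Low-level features contain a lot of redundancy and noise. To make the representation more robust, a feature transformation algorithm encodes the low-level features; this is called feature encoding. Common encodings include vector quantization \[[4](#参考文献)\], sparse coding \[[5](#参考文献)\], locality-constrained linear coding \[[6](#参考文献)\], and Fisher vector encoding \[[7](#参考文献)\].
+
+  3). **Spatial feature constraints**: Feature encoding is usually followed by spatial feature constraints, also called **feature pooling**. Feature pooling takes the maximum or the average of each feature dimension within a spatial region, which yields a representation with some degree of feature invariance. Spatial pyramid matching is a common pooling method: it divides the image evenly into blocks and pools features within each block.
+
+  4). **Classification**: After the previous steps, an image is described by a fixed-dimensional vector, which is then fed to a classifier. Common classifiers include SVM (Support Vector Machine) and random forests. Kernel SVMs are the most widely used and perform well on traditional image classification tasks.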
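
The feature pooling in step 3 can be sketched with NumPy as follows. This is only an illustration under assumed inputs: the helper `block_pool`, the 8x8 grid of 16-dimensional codes, and the three pyramid levels are hypothetical choices, not taken from the tutorial or the cited papers.

```python
import numpy as np


def block_pool(codes, grid, mode="max"):
    """Pool an (H, W, D) map of encoded local features over a grid x grid layout.

    Returns one D-dimensional vector per block, concatenated into a 1-D array.
    """
    height, width, dim = codes.shape
    pooled = []
    for rows in np.array_split(np.arange(height), grid):
        for cols in np.array_split(np.arange(width), grid):
            block = codes[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
            block = block.reshape(-1, dim)
            pooled.append(block.max(axis=0) if mode == "max" else block.mean(axis=0))
    return np.concatenate(pooled)


# Hypothetical input: an 8x8 grid of 16-dimensional encoded features.
codes = np.random.RandomState(0).rand(8, 8, 16)

# Pool at three pyramid levels (1x1, 2x2, 4x4 blocks) and concatenate.
pyramid = np.concatenate([block_pool(codes, grid) for grid in (1, 2, 4)])
print(pyramid.shape)  # (1 + 4 + 16) blocks * 16 dims = (336,)
```

Whatever the image content, the output has a fixed length, which is what lets step 4 feed it to a classifier such as an SVM.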

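 This approach was widely used by image classification entries in the PASCAL VOC competition \[[18](#参考文献)\]. [NEC Labs](http://www.nec-labs.com/) won the ILSVRC 2010 image classification task using SIFT and LBP features, two nonlinear encoders, and an SVM classifier \[[8](#参考文献)\].

 The CNN model \[[9](#参考文献)\] that Alex Krizhevsky presented at ILSVRC 2012 was a historic breakthrough: it outperformed traditional methods by a wide margin and won ILSVRC 2012. The model is known as AlexNet, and it was the first application of deep learning to large-scale image classification. Since AlexNet, a series of CNN models have kept improving results on ImageNet, as shown in Figure 4. As models grew deeper and their structures more refined, the Top-5 error rate dropped to around 3.5%. On the same ImageNet dataset, the human error rate is about 5.1%, so current deep learning models already surpass human recognition on this benchmark.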

-![ilsvrc](./image/ilsvrc.png)
 <p align="center">
+<img src="image/ilsvrc.png" width="500" ><br/>
 Figure 4. Top-5 error rates of ILSVRC image classification
 </p>

@@ -64,8 +69,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得

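 A traditional CNN consists of convolutional layers, fully connected layers, and other components, with a softmax multi-class classifier and a multi-class cross-entropy loss. Figure 5 shows a typical convolutional neural network. We first introduce the common components used to build a CNN.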

-![cnnStructure](./image/lenet.png)
 <p align="center">
+<img src="image/lenet.png"><br/>
 Figure 5. An example CNN [20]
 </p>

@@ -83,8 +88,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得

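 The model proposed by the Visual Geometry Group (VGG) at Oxford University for ILSVRC 2014 is known as the VGG model \[[11](#参考文献)\]. Compared with earlier models, it further widens and deepens the network. Its core is five groups of convolutions, with max-pooling between consecutive groups to reduce the spatial dimensions. Each group applies several consecutive 3x3 convolutions; the number of filters grows from 64 in the shallowest group to 512 in the deepest, and is the same within a group. The convolutions are followed by two fully connected layers and then the classification layer. Depending on the number of convolutional layers in each group, there are 11-, 13-, 16-, and 19-layer variants; the figure below shows a 16-layer network. The VGG structure is relatively simple, and many later papers build on it; for example, the first publicly reported model to surpass human performance on ImageNet \[[19](#参考文献)\] borrows the VGG structure.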

-![vgg16](./image/vgg16.png)
 <p align="center">
+<img src="image/vgg16.png" width="750" ><br/>
 Figure 6. VGG16 model for ImageNet
 </p>

@@ -92,12 +97,16 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得

 GoogleNet \[[12](#参考文献)\] won ILSVRC 2014. Before introducing it, we first look at the NIN (Network in Network) model \[[13](#参考文献)\] and the Inception module, because GoogleNet is built from multiple Inception modules and its design borrows ideas from NIN.

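-NIN has two main characteristics: 1) It replaces a single linear convolution layer with a multi-layer perceptron convolution (MLPconv) network. MLPconv is a tiny multi-layer convolutional network: several 1x1 convolution layers are added after the linear convolution, which allows it to extract highly nonlinear features. 2) The last few layers of a traditional CNN are usually fully connected and contain many parameters. In NIN, the last convolutional layer instead outputs one feature map per class, and global average pooling (Avg-Pooling) replaces the fully connected layers to produce a vector with one entry per class, which is then used for classification. Replacing the fully connected layers this way reduces the number of parameters.
+NIN has two main characteristics:
+
+1) It replaces a single linear convolution layer with a multi-layer perceptron convolution (MLPconv) network. MLPconv is a tiny multi-layer convolutional network: several 1x1 convolution layers are added after the linear convolution, which allows it to extract highly nonlinear features.
+
+2) The last few layers of a traditional CNN are usually fully connected and contain many parameters. In NIN, the last convolutional layer instead outputs one feature map per class, and global average pooling (Avg-Pooling) replaces the fully connected layers to produce a vector with one entry per class, which is then used for classification. Replacing the fully connected layers this way reduces the number of parameters.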
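
A minimal NumPy sketch of the second point follows. The sizes (10 classes, 8x8 feature maps) are illustrative assumptions rather than values from the NIN paper, and `global_avg_pool` is a hypothetical helper.

```python
import numpy as np


def global_avg_pool(feature_maps):
    """Average each channel of a (C, H, W) array over its spatial dimensions."""
    return feature_maps.mean(axis=(1, 2))


num_classes, height, width = 10, 8, 8  # illustrative sizes, not from NIN

# Last convolutional layer: one feature map per class.
feature_maps = np.random.RandomState(0).rand(num_classes, height, width)
class_scores = global_avg_pool(feature_maps)  # one score per class

# A fully connected layer on the flattened maps would need weights and biases.
fc_params = num_classes * height * width * num_classes + num_classes
gap_params = 0  # global average pooling has no trainable parameters
print(class_scores.shape, fc_params, gap_params)  # (10,) 6410 0
```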

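 The Inception module is shown in Figure 7. Figure (a) is the simplest design: the output is the concatenation of features from three convolutional layers and one pooling layer. Its drawback is that the pooling layer does not change the number of feature channels, so the concatenated output has many channels; after stacking several such modules the channel count keeps growing, and the parameters and computation grow with it. To address this, Figure (b) introduces three 1x1 convolutional layers for dimensionality reduction, that is, reducing the number of channels; as mentioned for NIN, the 1x1 convolution also applies a rectified linear transformation to the features.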

-![inception](./image/inception.png)
 <p align="center">
+<img src="image/inception.png" width="800" ><br/>
 Figure 7. Inception module
 </p>
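
The channel growth described for design (a), and the effect of a 1x1 projection after the pooling branch in design (b), can be checked with simple arithmetic. The branch widths below are illustrative assumptions, and `inception_out_channels` is a hypothetical helper.

```python
def inception_out_channels(c_in, c1x1, c3x3, c5x5, pool_proj=None):
    """Channels after concatenating the four branches of an Inception module.

    Without a 1x1 projection the pooling branch keeps all c_in channels.
    """
    pool_channels = c_in if pool_proj is None else pool_proj
    return c1x1 + c3x3 + c5x5 + pool_channels


# Assumed widths: 192 input channels; 64, 128, 32 filters in the conv branches.
naive = inception_out_channels(192, 64, 128, 32)                  # design (a)
reduced = inception_out_channels(192, 64, 128, 32, pool_proj=32)  # design (b)
print(naive, reduced)  # 416 256
```

In design (a) the output is always wider than the input by the sum of the convolution branches, so stacked modules keep widening; the projection in design (b) decouples the output width from the input width.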

@@ -105,8 +114,8 @@ GoogleNet由多组Inception模块堆积而成。另外，在网络最后也没

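 The overall structure of GoogleNet is shown in Figure 8. It has 22 layers in total: it starts with 3 ordinary convolutional layers, followed by three groups of sub-networks, the first with 2 Inception modules, the second with 5, and the third with 2; these are followed by an average pooling layer and a fully connected layer.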

-![googleNet](./image/googlenet.jpeg)
 <p align="center">
+<img src="image/googlenet.jpeg" ><br/>
 Figure 8. GoogleNet [12]
 </p>

@@ -120,15 +129,15 @@ ResNet(Residual Network) \[[15](#参考文献)\] 是2015年ImageNet图像分类

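 The residual module is shown in Figure 9. The left side is the basic block, made of two 3x3 convolutions with the same number of output channels. The right side is the bottleneck block, so called because the upper 1x1 convolution reduces the dimension (256->64 in the figure) and the lower 1x1 convolution restores it (64->256 in the figure), so the 3x3 convolution in the middle has few input and output channels (64->64 in the figure).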

-![ResNetBlock](./image/resnet_block.jpg)
 <p align="center">
+<img src="image/resnet_block.jpg" width="400"><br/>
 Figure 9. Residual module
 </p>
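
To see why the bottleneck is cheaper, the weights of the 256->64->64->256 path can be counted and compared with two 3x3 convolutions applied directly at 256 channels. The comparison baseline is an assumption made for illustration, and `conv_weights` is a hypothetical helper that ignores biases and BN parameters.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution, ignoring biases."""
    return kernel * kernel * c_in * c_out


# Bottleneck path from Figure 9: 1x1 (256->64), 3x3 (64->64), 1x1 (64->256).
bottleneck = (conv_weights(1, 256, 64)
              + conv_weights(3, 64, 64)
              + conv_weights(1, 64, 256))

# Assumed baseline: a basic block with two 3x3 convolutions at 256 channels.
basic_at_256 = 2 * conv_weights(3, 256, 256)

print(bottleneck, basic_at_256)  # 69632 1179648
```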

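 Figure 10 shows the connections of the 50-, 101-, and 152-layer networks, all of which use the bottleneck block. The three models differ in how many times the residual module is repeated in each group (see the upper right corner of the figure). ResNet converges quickly in training and has been used to train convolutional networks with hundreds or even nearly a thousand layers.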

-![ResNet](./image/resnet.png)
 <p align="center">
+<img src="image/resnet.png"><br/>
 Figure 10. ResNet model for ImageNet
 </p>

@@ -139,8 +148,8 @@ ResNet(Residual Network) \[[15](#参考文献)\] 是2015年ImageNet图像分类

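 Because the ImageNet dataset is large and slow to download and train on, we use the [CIFAR10](<https://www.cs.toronto.edu/~kriz/cifar.html>) dataset to make learning easier. CIFAR10 contains 60,000 32x32 color images in 10 classes, with 6,000 images per class; 50,000 images form the training set and 10,000 the test set. Figure 11 shows 10 randomly selected images from each class.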

-![CIFAR](https://raw.githubusercontent.com/PaddlePaddle/book/develop/03.image_classification/image/cifar.png)
 <p align="center">
+<img src="image/cifar.png" width="350"><br/>
 Figure 11. CIFAR10 dataset [21]
 </p>

@@ -159,6 +168,7 @@ import paddle
 import paddle.fluid as fluid
 import numpy
 import sys
+from __future__ import print_function
 ```

 This tutorial provides configurations for two models, VGG and ResNet.
@@ -170,33 +180,34 @@ VGG核心模块的输入是数据层，`vgg_bn_drop` 定义了16层VGG结构，

 ```python
 def vgg_bn_drop(input):
-def conv_block(ipt, num_filter, groups, dropouts):
-return fluid.nets.img_conv_group(
-input=ipt,
-pool_size=2,
-pool_stride=2,
-conv_num_filter=[num_filter] * groups,
-conv_filter_size=3,
-conv_act='relu',
-conv_with_batchnorm=True,
-conv_batchnorm_drop_rate=dropouts,
-pool_type='max')
-
-conv1 = conv_block(input, 64, 2, [0.3, 0])
-conv2 = conv_block(conv1, 128, 2, [0.4, 0])
-conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
-conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
-conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
-
-drop = fluid.layers.dropout(x=conv5, dropout_prob=0.5)
-fc1 = fluid.layers.fc(input=drop, size=512, act=None)
-bn = fluid.layers.batch_norm(input=fc1, act='relu')
-drop2 = fluid.layers.dropout(x=bn, dropout_prob=0.5)
-fc2 = fluid.layers.fc(input=drop2, size=512, act=None)
-predict = fluid.layers.fc(input=fc2, size=10, act='softmax')
-return predict
+    def conv_block(ipt, num_filter, groups, dropouts):
+        return fluid.nets.img_conv_group(
+            input=ipt,
+            pool_size=2,
+            pool_stride=2,
+            conv_num_filter=[num_filter] * groups,
+            conv_filter_size=3,
+            conv_act='relu',
+            conv_with_batchnorm=True,
+            conv_batchnorm_drop_rate=dropouts,
+            pool_type='max')
+
+    conv1 = conv_block(input, 64, 2, [0.3, 0])
+    conv2 = conv_block(conv1, 128, 2, [0.4, 0])
+    conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
+    conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
+    conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
+
+    drop = fluid.layers.dropout(x=conv5, dropout_prob=0.5)
+    fc1 = fluid.layers.fc(input=drop, size=512, act=None)
+    bn = fluid.layers.batch_norm(input=fc1, act='relu')
+    drop2 = fluid.layers.dropout(x=bn, dropout_prob=0.5)
+    fc2 = fluid.layers.fc(input=drop2, size=512, act=None)
+    predict = fluid.layers.fc(input=fc2, size=10, act='softmax')
+    return predict
 ```

+
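 1. First, a group of convolutional layers is defined, namely conv_block. The kernel size is 3x3, the pooling window is 2x2 with a stride of 2, `groups` sets how many consecutive convolutions each VGG block performs, and `dropouts` specifies the dropout probabilities. The `img_conv_group` used here is a predefined module in `fluid.nets`, made of several Conv->BN->ReLU->Dropout groups followed by one Pooling layer.

 2. Five groups of convolutions, that is, five conv_blocks. The first and second groups use two consecutive convolutions; the third, fourth, and fifth use three. The dropout probability after the last convolution in each group is 0, meaning no dropout is applied.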
@@ -211,74 +222,76 @@ ResNet模型的第1、3、4步和VGG模型相同，这里不再介绍。主要

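 We first introduce some basic functions in `resnet_cifar10`, then the network connections.

- `conv_bn_layer` : a convolutional layer with BN.
- `shortcut` : the "shortcut" path of a residual module. There are two forms: when the input and output channel counts of the residual module differ, a 1x1 convolution changes the dimension; when they are equal, the input is passed through directly.
- `basicblock` : a basic residual module, shown on the left of Figure 9, consisting of a path of two 3x3 convolutions and a "shortcut" path.
- `bottleneck` : a bottleneck residual module, shown on the right of Figure 9, consisting of a path with 1x1 convolutions at both ends and a 3x3 convolution in the middle, plus a "shortcut" path.
- `layer_warp` : a group of residual modules, made by stacking several residual modules. The first module in each group may use a different stride from the others, to reduce the height and width of the feature map.
+  - `conv_bn_layer` : a convolutional layer with BN.
+  - `shortcut` : the "shortcut" path of a residual module. There are two forms: when the input and output channel counts of the residual module differ, a 1x1 convolution changes the dimension; when they are equal, the input is passed through directly.
+  - `basicblock` : a basic residual module, shown on the left of Figure 9, consisting of a path of two 3x3 convolutions and a "shortcut" path.
+  - `bottleneck` : a bottleneck residual module, shown on the right of Figure 9, consisting of a path with 1x1 convolutions at both ends and a 3x3 convolution in the middle, plus a "shortcut" path.
+  - `layer_warp` : a group of residual modules, made by stacking several residual modules. The first module in each group may use a different stride from the others, to reduce the height and width of the feature map.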

 ```python
 def conv_bn_layer(input,
-ch_out,
-filter_size,
-stride,
-padding,
-act='relu',
-bias_attr=False):
-tmp = fluid.layers.conv2d(
-input=input,
-filter_size=filter_size,
-num_filters=ch_out,
-stride=stride,
-padding=padding,
-act=None,
-bias_attr=bias_attr)
-return fluid.layers.batch_norm(input=tmp, act=act)
+                  ch_out,
+                  filter_size,
+                  stride,
+                  padding,
+                  act='relu',
+                  bias_attr=False):
+    tmp = fluid.layers.conv2d(
+        input=input,
+        filter_size=filter_size,
+        num_filters=ch_out,
+        stride=stride,
+        padding=padding,
+        act=None,
+        bias_attr=bias_attr)
+    return fluid.layers.batch_norm(input=tmp, act=act)


 def shortcut(input, ch_in, ch_out, stride):
-if ch_in != ch_out:
-return conv_bn_layer(input, ch_out, 1, stride, 0, None)
-else:
-return input
+    if ch_in != ch_out:
+        return conv_bn_layer(input, ch_out, 1, stride, 0, None)
+    else:
+        return input


 def basicblock(input, ch_in, ch_out, stride):
-tmp = conv_bn_layer(input, ch_out, 3, stride, 1)
-tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, act=None, bias_attr=True)
-short = shortcut(input, ch_in, ch_out, stride)
-return fluid.layers.elementwise_add(x=tmp, y=short, act='relu')
+    tmp = conv_bn_layer(input, ch_out, 3, stride, 1)
+    tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, act=None, bias_attr=True)
+    short = shortcut(input, ch_in, ch_out, stride)
+    return fluid.layers.elementwise_add(x=tmp, y=short, act='relu')


 def layer_warp(block_func, input, ch_in, ch_out, count, stride):
-tmp = block_func(input, ch_in, ch_out, stride)
-for i in range(1, count):
-tmp = block_func(tmp, ch_out, ch_out, 1)
-return tmp
+    tmp = block_func(input, ch_in, ch_out, stride)
+    for i in range(1, count):
+        tmp = block_func(tmp, ch_out, ch_out, 1)
+    return tmp
 ```

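 The connection structure of `resnet_cifar10` consists of the following steps.

 1. The input is connected to a `conv_bn_layer`, a convolutional layer with BN.
+
 2. Then come three groups of residual modules, configured below as three `layer_warp` calls, each built from the basic residual module shown on the left of Figure 9.
+
 3. Finally, average pooling is applied to the network and the result is returned.

 Note: excluding the first convolutional layer and the last fully connected layer, the total number of parameterized layers in the three `layer_warp` groups must be divisible by 6, that is, the depth of `resnet_cifar10` must satisfy $(depth - 2) % 6 == 0$.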

 ```python
 def resnet_cifar10(ipt, depth=32):
-# depth should be one of 20, 32, 44, 56, 110, 1202
-assert (depth - 2) % 6 == 0
-n = (depth - 2) / 6
-nStages = {16, 64, 128}
-conv1 = conv_bn_layer(ipt, ch_out=16, filter_size=3, stride=1, padding=1)
-res1 = layer_warp(basicblock, conv1, 16, 16, n, 1)
-res2 = layer_warp(basicblock, res1, 16, 32, n, 2)
-res3 = layer_warp(basicblock, res2, 32, 64, n, 2)
-pool = fluid.layers.pool2d(
-input=res3, pool_size=8, pool_type='avg', pool_stride=1)
-predict = fluid.layers.fc(input=pool, size=10, act='softmax')
-return predict
+    # depth should be one of 20, 32, 44, 56, 110, 1202
+    assert (depth - 2) % 6 == 0
+    n = (depth - 2) // 6
+    nStages = {16, 64, 128}
+    conv1 = conv_bn_layer(ipt, ch_out=16, filter_size=3, stride=1, padding=1)
+    res1 = layer_warp(basicblock, conv1, 16, 16, n, 1)
+    res2 = layer_warp(basicblock, res1, 16, 32, n, 2)
+    res3 = layer_warp(basicblock, res2, 32, 64, n, 2)
+    pool = fluid.layers.pool2d(
+        input=res3, pool_size=8, pool_type='avg', pool_stride=1)
+    predict = fluid.layers.fc(input=pool, size=10, act='softmax')
+    return predict
 ```
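
The depth constraint can be checked without PaddlePaddle. The sketch below assumes each basic block contributes two convolutional layers, as in `basicblock` above; `blocks_per_group` is a hypothetical helper, and it uses integer division so that the result can be passed to `range()` under Python 3 as well.

```python
def blocks_per_group(depth):
    """Return n, the number of basic blocks in each of the three groups."""
    if (depth - 2) % 6 != 0:
        raise ValueError("depth must satisfy (depth - 2) % 6 == 0, got %d" % depth)
    return (depth - 2) // 6  # integer division keeps n usable in range()


for depth in (20, 32, 44, 56, 110, 1202):
    n = blocks_per_group(depth)
    # 1 input conv + 3 groups * n blocks * 2 convs per block + 1 fc layer
    assert 1 + 3 * n * 2 + 1 == depth
    print(depth, n)
```

For the default `depth=32` this gives 5 blocks per group.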

 ## Inference Program Configuration
@@ -287,13 +300,13 @@ return predict

 ```python
 def inference_program():
-# The image is 32 * 32 with RGB representation.
-data_shape = [3, 32, 32]
-images = fluid.layers.data(name='pixel', shape=data_shape, dtype='float32')
+    # The image is 32 * 32 with RGB representation.
+    data_shape = [3, 32, 32]
+    images = fluid.layers.data(name='pixel', shape=data_shape, dtype='float32')

-predict = resnet_cifar10(images, 32)
-# predict = vgg_bn_drop(images) # un-comment to use vgg net
-return predict
+    predict = resnet_cifar10(images, 32)
+    # predict = vgg_bn_drop(images) # un-comment to use vgg net
+    return predict
 ```

 ## Train Program Configuration
@@ -306,13 +319,13 @@ return predict

 ```python
 def train_program():
-predict = inference_program()
+    predict = inference_program()

-label = fluid.layers.data(name='label', shape=[1], dtype='int64')
-cost = fluid.layers.cross_entropy(input=predict, label=label)
-avg_cost = fluid.layers.mean(cost)
-accuracy = fluid.layers.accuracy(input=predict, label=label)
-return [avg_cost, accuracy]
+    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+    cost = fluid.layers.cross_entropy(input=predict, label=label)
+    avg_cost = fluid.layers.mean(cost)
+    accuracy = fluid.layers.accuracy(input=predict, label=label)
+    return [avg_cost, accuracy]
 ```
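
What `train_program` returns can be illustrated with a NumPy sketch of the same two quantities, average cross-entropy cost and accuracy. This is not PaddlePaddle's implementation; the probabilities and labels are made-up values for a 3-class toy problem.

```python
import numpy as np


def cross_entropy(probs, labels):
    """Per-sample cross-entropy given softmax outputs and integer labels."""
    return -np.log(probs[np.arange(len(labels)), labels])


def accuracy(probs, labels):
    """Fraction of samples whose highest-probability class equals the label."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))


# Made-up softmax outputs for 4 samples and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.5, 0.4, 0.1]])
labels = np.array([0, 1, 0, 1])

avg_cost = cross_entropy(probs, labels).mean()
acc = accuracy(probs, labels)
print(round(avg_cost, 3), acc)  # 0.675 0.5
```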

 ## Optimizer Function Configuration
@@ -321,7 +334,7 @@ return [avg_cost, accuracy]

 ```python
 def optimizer_program():
-return fluid.optimizer.Adam(learning_rate=0.001)
+    return fluid.optimizer.Adam(learning_rate=0.001)
 ```

 ## Training the Model
@@ -334,9 +347,9 @@ return fluid.optimizer.Adam(learning_rate=0.001)
 use_cuda = False
 place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
 trainer = fluid.Trainer(
-train_func=train_program,
-optimizer_func=optimizer_program,
-place=place)
+    train_func=train_program,
+    optimizer_func=optimizer_program,
+    place=place)
 ```

 ### Data Feeders Configuration
@@ -349,12 +362,12 @@ BATCH_SIZE = 128

 # Reader for training
 train_reader = paddle.batch(
-paddle.reader.shuffle(paddle.dataset.cifar.train10(), buf_size=50000),
-batch_size=BATCH_SIZE)
+    paddle.reader.shuffle(paddle.dataset.cifar.train10(), buf_size=50000),
+    batch_size=BATCH_SIZE)

 # Reader for testing. A separated data set for testing.
 test_reader = paddle.batch(
-paddle.dataset.cifar.test10(), batch_size=BATCH_SIZE)
+    paddle.dataset.cifar.test10(), batch_size=BATCH_SIZE)
 ```

 ### Event Handler
@@ -363,7 +376,11 @@ paddle.dataset.cifar.test10(), batch_size=BATCH_SIZE)

 `event_handler_plot` uses the callback data to plot the training curve:

-![png](./image/train_and_test.png)
+<p align="center">
+<img src="image/train_and_test.png" width="350"><br/>
+Figure 12. Training result
+</p>
+

 ```python
 params_dirname = "image_classification_resnet.inference.model"
@@ -376,21 +393,21 @@ cost_ploter = Ploter(train_title, test_title)

 step = 0
 def event_handler_plot(event):
-global step
-if isinstance(event, fluid.EndStepEvent):
-if step % 1 == 0:
-cost_ploter.append(train_title, step, event.metrics[0])
-cost_ploter.plot()
-step += 1
-if isinstance(event, fluid.EndEpochEvent):
-avg_cost, accuracy = trainer.test(
-reader=test_reader,
-feed_order=['pixel', 'label'])
-cost_ploter.append(test_title, step, avg_cost)
-
-# save parameters
-if params_dirname is not None:
-trainer.save_params(params_dirname)
+    global step
+    if isinstance(event, fluid.EndStepEvent):
+        if step % 1 == 0:
+            cost_ploter.append(train_title, step, event.metrics[0])
+            cost_ploter.plot()
+        step += 1
+    if isinstance(event, fluid.EndEpochEvent):
+        avg_cost, accuracy = trainer.test(
+            reader=test_reader,
+            feed_order=['pixel', 'label'])
+        cost_ploter.append(test_title, step, avg_cost)
+
+        # save parameters
+        if params_dirname is not None:
+            trainer.save_params(params_dirname)
 ```

 `event_handler` prints text logs during training:
@@ -400,39 +417,39 @@ params_dirname = "image_classification_resnet.inference.model"

 # event handler to track training and testing process
 def event_handler(event):
-if isinstance(event, fluid.EndStepEvent):
-if event.step % 100 == 0:
-print("\nPass %d, Batch %d, Cost %f, Acc %f" %
-(event.step, event.epoch, event.metrics[0],
-event.metrics[1]))
-else:
-sys.stdout.write('.')
-sys.stdout.flush()
-
-if isinstance(event, fluid.EndEpochEvent):
-# Test against with the test dataset to get accuracy.
-avg_cost, accuracy = trainer.test(
-reader=test_reader, feed_order=['pixel', 'label'])
-
-print('\nTest with Pass {0}, Loss {1:2.2}, Acc {2:2.2}'.format(event.epoch, avg_cost, accuracy))
-
-# save parameters
-if params_dirname is not None:
-trainer.save_params(params_dirname)
+    if isinstance(event, fluid.EndStepEvent):
+        if event.step % 100 == 0:
+            print("\nPass %d, Batch %d, Cost %f, Acc %f" %
+                  (event.step, event.epoch, event.metrics[0],
+                   event.metrics[1]))
+        else:
+            sys.stdout.write('.')
+            sys.stdout.flush()
+
+    if isinstance(event, fluid.EndEpochEvent):
+        # Test against with the test dataset to get accuracy.
+        avg_cost, accuracy = trainer.test(
+            reader=test_reader, feed_order=['pixel', 'label'])
+
+        print('\nTest with Pass {0}, Loss {1:2.2}, Acc {2:2.2}'.format(event.epoch, avg_cost, accuracy))
+
+        # save parameters
+        if params_dirname is not None:
+            trainer.save_params(params_dirname)
 ```

 ### Training

 Train with the `trainer.train` function:

-**注意:** CPU，每个 Epoch 将花费大约15～20分钟。这部分可能需要一段时间。请随意修改代码，在GPU上运行测试，以提高培训速度。
+**Note:** On a CPU, each epoch takes about 15 to 20 minutes, so this part may take a while. Feel free to modify the code and run it on a GPU to speed up training.

 ```python
 trainer.train(
-reader=train_reader,
-num_epochs=2,
-event_handler=event_handler,
-feed_order=['pixel', 'label'])
+    reader=train_reader,
+    num_epochs=2,
+    event_handler=event_handler,
+    feed_order=['pixel', 'label'])
 ```

 A sample training log for one epoch is shown below. After 1 pass, the average accuracy is 0.59 on the training set and 0.6 on the test set.
@@ -449,11 +466,11 @@ Pass 300, Batch 0, Cost 1.223424, Acc 0.593750
 Test with Pass 0, Loss 1.1, Acc 0.6
 ```

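-Figure 12 shows the classification error rate curve during training. It has essentially converged by pass 200, and the final classification error rate on the test set is 8.54%.
+Figure 13 shows the classification error rate curve during training. It has essentially converged by pass 200, and the final classification error rate on the test set is 8.54%.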

-![CIFARErrorRate](./image/plot.png)
 <p align="center">
-Figure 12. Classification error rate of the VGG model on CIFAR10
+<img src="image/plot.png" width="400" ><br/>
+Figure 13. Classification error rate of the VGG model on CIFAR10
 </p>

 ## Applying the Model
@@ -471,19 +488,19 @@ import numpy as np
 import os

 def load_image(file):
-im = Image.open(file)
-im = im.resize((32, 32), Image.ANTIALIAS)
+    im = Image.open(file)
+    im = im.resize((32, 32), Image.ANTIALIAS)

-im = np.array(im).astype(np.float32)
-# The storage order of the loaded image is W(width),
-# H(height), C(channel). PaddlePaddle requires
-# the CHW order, so transpose them.
-im = im.transpose((2, 0, 1))  # CHW
-im = im / 255.0
+    im = np.array(im).astype(np.float32)
+    # The array converted from the loaded image is stored in
+    # H(height), W(width), C(channel) order. PaddlePaddle requires
+    # the CHW order, so transpose the axes.
+    im = im.transpose((2, 0, 1))  # CHW
+    im = im / 255.0

-# Add one dimension to mimic the list format.
-im = numpy.expand_dims(im, axis=0)
-return im
+    # Add one dimension to mimic the list format.
+    im = numpy.expand_dims(im, axis=0)
+    return im

 cur_dir = os.getcwd()
 img = load_image(cur_dir + '/image/dog.png')
@@ -497,11 +514,11 @@ img = load_image(cur_dir + '/image/dog.png')

 ```python
 inferencer = fluid.Inferencer(
-infer_func=inference_program, param_path=params_dirname, place=place)
-
+    infer_func=inference_program, param_path=params_dirname, place=place)
+label_list = ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]
 # inference
 results = inferencer.infer({'pixel': img})
-print("infer results: ", results)
+print("infer results: %s" % label_list[np.argmax(results[0])])
 ```
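
The preprocessing and the label lookup can be tried without an image file or a trained model. In the sketch below the input array and the probability vector are fabricated stand-ins for a real image and for `results[0]`; only the shape handling and the `argmax` lookup mirror the code above.

```python
import numpy as np

# Stand-in for a 32x32 RGB image loaded as an H x W x C array.
hwc = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)

chw = hwc.transpose((2, 0, 1))              # HWC -> CHW
batch = np.expand_dims(chw / 255.0, axis=0)  # scale to [0, 1], add batch axis
print(batch.shape)  # (1, 3, 32, 32)

label_list = ["airplane", "automobile", "bird", "cat", "deer",
              "dog", "frog", "horse", "ship", "truck"]

# Made-up softmax output standing in for results[0] from the inferencer.
probs = np.array([[0.02, 0.01, 0.05, 0.10, 0.02, 0.70, 0.03, 0.03, 0.02, 0.02]])
predicted = label_list[int(np.argmax(probs[0]))]
print(predicted)  # dog
```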

 ## Summary

--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/cifar.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/cifar.png
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/inception_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/inception_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/lenet_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/lenet_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/plot_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/plot_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/variations.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/variations.png
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/variations_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/variations_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/index.rst
+++ b/doc/fluid/new_docs/beginners_guide/basics/index.rst
@@ -10,9 +10,9 @@
 ..  toctree::
    :maxdepth: 2

-    image_classification/index.md
-    word2vec/index.md
-    recommender_system/index.md
-    understand_sentiment/index.md
-    label_semantic_roles/index.md
-    machine_translation/index.md
+    image_classification/README.cn.md
+    word2vec/README.cn.md
+    recommender_system/README.cn.md
+    understand_sentiment/README.cn.md
+    label_semantic_roles/README.cn.md
+    machine_translation/README.cn.md
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/.gitignore
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/.gitignore
-data/train.list
-data/test.*
-data/conll05st-release.tar.gz
-data/conll05st-release
-data/predicate_dict
-data/label_dict
-data/word_dict
-data/emb
-data/feature
-output
-predict.res
-train.log
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/index.md
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/index.md
 # Semantic Role Labeling

-The source code for this tutorial is in [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles). If you are new to PaddlePaddle, see the [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
+The source code for this tutorial is in [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles). If you are new to PaddlePaddle, see the [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书). For more material, see the [video course](http://bit.baidu.com/course/detail/id/178.html) that accompanies this tutorial.

 ## Background

@@ -8,7 +8,7 @@

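 Consider the following example. "遇到" (met) is the predicate (Predicate, usually abbreviated as "Pred"), "小明" (Xiao Ming) is the agent (Agent), "小红" (Xiao Hong) is the patient (Patient), "昨天" (yesterday) is the time of the event (Time), and "公园" (the park) is the location of the event (Location).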

-$$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_{\mbox{Time}}\mbox{在[公园]}_{\mbox{Location}}\mbox{[遇到]}_{\mbox{Predicate}}\mbox{了[小红]}_{\mbox{Patient}}\mbox{。}$$
+$$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mbox{Time}\mbox{在[公园]}_{\mbox{Location}}\mbox{[遇到]}_{\mbox{Predicate}}\mbox{了[小红]}_{\mbox{Patient}}\mbox{。}$$

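 Semantic role labeling (SRL) centers on the predicate of a sentence. It does not analyze the semantic content of the sentence in depth; it only analyzes the relationship between each constituent and the predicate, that is, the predicate-argument structure, and describes these relationships with semantic roles. SRL is an important intermediate step in many natural language understanding tasks, such as information extraction, discourse analysis, and deep question answering. Research generally assumes that the predicate is given, so the task is to find the arguments of the given predicate and their semantic roles.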

@@ -20,17 +20,17 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_{\m
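 4. Argument identification: this step decides which of the candidates remaining after pruning are real arguments, and is usually treated as a binary classification problem.
 5. Multi-class classification is applied to the results of step 4 to obtain the semantic role labels of the arguments. As this shows, syntactic parsing is the foundation, and the later steps often construct hand-crafted features that also tend to come from syntactic parsing.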

-![dependencyParsing](./image/dependency_parsing.png)
 <div  align="center">
+<img src="image/dependency_parsing.png" width = "80%" align=center /><br>
 Figure 1. Example of a dependency parse tree
 </div>

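-However, full syntactic parsing has to determine all the syntactic information in a sentence and the relationships among its constituents, which is a very hard task. Parsing accuracy with current technology is not high, and even small parsing errors lead to SRL errors. To reduce the complexity of the problem while still obtaining some syntactic structure, the idea of "shallow parsing" emerged. Shallow parsing is also called partial parsing or chunking. Unlike full parsing, which produces a complete syntax tree, shallow parsing only needs to identify certain structurally simple, independent constituents of the sentence, such as verb phrases; the identified structures are called chunks. To sidestep the difficulty of obtaining an accurate syntax tree, some studies \[[1](#参考文献)\] have proposed chunk-based SRL methods, which treat SRL as a sequence labeling problem. Sequence labeling tasks generally use the BIO scheme to define the label set, so we introduce it first. In the BIO scheme, B marks the beginning of a chunk, I marks the inside of a chunk, and O marks tokens outside any chunk. The B, I, and O markers assign different labels to different chunks. For example, for an argument with role A, its first chunk is labeled B-A, its remaining chunks are labeled I-A, and chunks that do not belong to any argument are labeled O.
+However, full syntactic parsing has to determine all the syntactic information in a sentence and the relationships among its constituents, which is a very hard task. Parsing accuracy with current technology is not high, and even small parsing errors lead to SRL errors. To reduce the complexity of the problem while still obtaining some syntactic structure, the idea of "shallow parsing" emerged. Shallow parsing is also called partial parsing or chunking. Unlike full parsing, which produces a complete syntax tree, shallow parsing only needs to identify certain structurally simple, independent constituents of the sentence, such as verb phrases; the identified structures are called chunks. To sidestep the difficulty of obtaining an accurate syntax tree, some studies \[[1](#参考文献)\] have proposed chunk-based SRL methods, which treat SRL as a sequence labeling problem. Sequence labeling tasks generally use the BIO scheme to define the label set, so we introduce it first. In the BIO scheme, B marks the beginning of a chunk, I marks the inside of a chunk, and O marks tokens outside any chunk. The B, I, and O markers assign different labels to different chunks. For example, for a group of chunks expanded from role A, its first chunk is labeled B-A, its remaining chunks are labeled I-A, and chunks that do not belong to any argument are labeled O.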

 We continue with the sentence above; Figure 2 shows the BIO representation.

-![bioExample](./image/bio_example.png)
 <div  align="center">
+<img src="image/bio_example.png" width = "90%"  align=center /><br>
 Figure 2. Example of BIO labeling
 </div>
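
The BIO scheme can be sketched as a small conversion from argument spans to per-token tags. The tokenization of the example sentence and the span boundaries below are assumptions made for illustration, and `spans_to_bio` is a hypothetical helper.

```python
def spans_to_bio(num_tokens, spans):
    """Convert (start, end_exclusive, role) spans into a list of BIO tags."""
    tags = ["O"] * num_tokens           # tokens outside every span stay "O"
    for start, end, role in spans:
        tags[start] = "B-" + role       # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + role       # remaining tokens of the span
    return tags


# Assumed tokenization of the example sentence.
tokens = ["小明", "昨天", "晚上", "在", "公园", "遇到", "了", "小红", "。"]
# Assumed spans; "昨天 晚上" is treated as one two-token Time argument.
spans = [(0, 1, "Agent"), (1, 3, "Time"), (4, 5, "Location"),
         (5, 6, "Predicate"), (7, 8, "Patient")]

for token, tag in zip(tokens, spans_to_bio(len(tokens), spans)):
    print(token, tag)
```

Under these assumptions "昨天" receives B-Time and "晚上" receives I-Time, while "在", "了", and "。" receive O.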

@@ -52,8 +52,8 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_{\m

 Figure 3 shows the resulting stacked recurrent neural network.

-![lstmStructure](./image/stacked_lstm.png)
 <p align="center">  
+<img src="./image/stacked_lstm.png" width = "40%"  align=center><br>
 Figure 3. Stacked recurrent neural network based on LSTM
 </p>

@@ -63,8 +63,8 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_{\m

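 To overcome this limitation, we can design a bidirectional recurrent unit. The idea is simple and direct: make a small change to the stacked recurrent network from the previous section by stacking multiple LSTM units so that each layer processes the output sequence of the layer below in alternating order: forward, backward, forward, and so on. From the second layer on, the LSTM unit at time $t$ can then always see both past and future information. Figure 4 shows the bidirectional recurrent neural network based on LSTM.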

-![lstmStructure](./image/bidirectional_stacked_lstm.png)
 <p align="center">  
+<img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br>
 Figure 4. Bidirectional recurrent neural network based on LSTM
 </p>

@@ -78,8 +78,8 @@ CRF是一种概率化结构模型，可以看作是一个概率无向图模型

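 A sequence labeling task only needs to consider the case where both input and output are linear sequences. Moreover, because the input sequence is used only as a condition and no conditional independence assumptions are made about it, there is no graph structure among the elements of the input sequence. In summary, sequence labeling uses a CRF defined on a chain graph, as shown in Figure 5, called a linear chain conditional random field (Linear Chain Conditional Random Field).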

-![linear_chain_crf](./image/linear_chain_crf.png)
 <p align="center">  
+<img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
 Figure 5. Linear chain conditional random field used in sequence labeling
 </p>

@@ -102,8 +102,8 @@ $$\DeclareMathOperator*{\argmax}{arg\,max} L(\lambda, D) = - \text{log}\left(\pr
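 In the SRL task, the input is a "predicate" and a "sentence", and the goal is to find the arguments of the predicate in the sentence and label their semantic roles. If a sentence contains $n$ predicates, it is processed $n$ times. The most straightforward model looks like this:

 1. Construct the input;
- Input 1 is the predicate, input 2 is the sentence
- Input 1 is expanded into a sequence of the same length as input 2 and represented in one-hot form;
+ - Input 1 is the predicate, input 2 is the sentence
+ - Input 1 is expanded into a sequence of the same length as input 2 and represented in one-hot form;
 2. The one-hot predicate sequence and sentence sequence are converted through a vocabulary into sequences of real-valued word vectors;
 3. The two word-vector sequences from step 2 are fed to a bidirectional LSTM, which learns a feature representation of the input sequence;
 4. The CRF takes the features learned by the model in step 3 as input and the label sequence as the supervision signal to perform sequence labeling;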
@@ -116,14 +116,14 @@ $$\DeclareMathOperator*{\argmax}{arg\,max} L(\lambda, D) = - \text{log}\left(\pr
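 The modified model is as follows (Figure 6 shows a model of depth 4):

 1. Construct the input
- Input 1 is the sentence sequence, input 2 is the predicate sequence, input 3 is the predicate context, built by extracting the $n$ words before and after the predicate from the sentence and represented in one-hot form, and input 4 is the predicate context region mark, which marks whether each word in the sentence falls within the predicate context;
- Inputs 2~3 are expanded into sequences of the same length as input 1;
+ - Input 1 is the sentence sequence, input 2 is the predicate sequence, input 3 is the predicate context, built by extracting the $n$ words before and after the predicate from the sentence and represented in one-hot form, and input 4 is the predicate context region mark, which marks whether each word in the sentence falls within the predicate context;
+ - Inputs 2~3 are expanded into sequences of the same length as input 1;
 2. Inputs 1~4 are each converted through a vocabulary lookup into sequences of real-valued word vectors; inputs 1 and 3 share one vocabulary, while inputs 2 and 4 each have their own;
 3. The four word-vector sequences from step 2 are the input to the bidirectional LSTM model, which learns a feature representation of the input sequences and produces a new sequence of feature representations;
 4. The CRF takes the features learned by the LSTM in step 3 as input and the label sequence as the supervision signal to complete sequence labeling;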

-![db_lstm_network](./image/db_lstm_network.png)
 <div  align="center">  
+<img src="image/db_lstm_network.png" width = "60%"  align=center /><br>
 Figure 6. Deep bidirectional LSTM model for the SRL task
 </div>

@@ -137,8 +137,8 @@ $$\DeclareMathOperator*{\argmax}{arg\,max} L(\lambda, D) = - \text{log}\left(\pr
 ```text
 conll05st-release/
 └── test.wsj
-├── props  # labeling results
-└── words  # input text sequences
+    ├── props  # labeling results
+    └── words  # input text sequences
 ```

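 The annotations come from Penn TreeBank \[[7](#参考文献)\] and PropBank \[[8](#参考文献)\]. The PropBank labels differ from the ones used in the example at the start of this article, but the principle is the same; for the meaning of the labels, see the paper \[[9](#参考文献)\].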
@@ -151,14 +151,6 @@ conll05st-release/
 4. Construct labels in BIO form;
 5. Look up the integer index of each word in the dictionary.

-
-```python
-# import paddle.v2.dataset.conll05 as conll05
-# conll05.corpus_reader函数完成上面第1步和第2步.
-# conll05.reader_creator函数完成上面第3步到第5步.
-# conll05.test函数可以获取处理之后的每条样本来供PaddlePaddle训练.
-```
-
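 After preprocessing, each training sample contains 9 features: the sentence sequence, the predicate, the predicate context (5 columns), the predicate context region mark, and the label sequence. The table below shows one training sample.

 | Sentence sequence | Predicate | Predicate context (window = 5) | Predicate context region mark | Label sequence |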
@@ -187,6 +179,8 @@ conll05st-release/
 Get the dictionaries and print their sizes:

 ```python
+from __future__ import print_function
+
 import math, os
 import numpy as np
 import paddle
@@ -201,9 +195,9 @@ word_dict_len = len(word_dict)
 label_dict_len = len(label_dict)
 pred_dict_len = len(verb_dict)

-print word_dict_len
-print label_dict_len
-print pred_dict_len
+print('word_dict_len: ', word_dict_len)
+print('label_dict_len: ', label_dict_len)
+print('pred_dict_len: ', pred_dict_len)
 ```

 ## Model Configuration
@@ -232,96 +226,96 @@ embedding_name = 'emb'
 ```python
 # Load a binary model saved by the previous version of PaddlePaddle
 def load_parameter(file_name, h, w):
-with open(file_name, 'rb') as f:
-f.read(16)  # skip header.
-return np.fromfile(f, dtype=np.float32).reshape(h, w)
+    with open(file_name, 'rb') as f:
+        f.read(16)  # skip header.
+        return np.fromfile(f, dtype=np.float32).reshape(h, w)
 ```

 - Eight LSTM units learn from all input sequences in alternating forward/backward order.

 ```python  
 def db_lstm(word, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark,
-**ignored):
-# 8 features
-predicate_embedding = fluid.layers.embedding(
-input=predicate,
-size=[pred_dict_len, word_dim],
-dtype='float32',
-is_sparse=IS_SPARSE,
-param_attr='vemb')
-
-mark_embedding = fluid.layers.embedding(
-input=mark,
-size=[mark_dict_len, mark_dim],
-dtype='float32',
-is_sparse=IS_SPARSE)
-
-word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
-# Since word vector lookup table is pre-trained, we won't update it this time.
-# trainable being False prevents updating the lookup table during training.
-emb_layers = [
-fluid.layers.embedding(
-size=[word_dict_len, word_dim],
-input=x,
-param_attr=fluid.ParamAttr(
-name=embedding_name, trainable=False)) for x in word_input
-]
-emb_layers.append(predicate_embedding)
-emb_layers.append(mark_embedding)
-
-# 8 LSTM units are trained through alternating left-to-right / right-to-left order
-# denoted by the variable `reverse`.
-hidden_0_layers = [
-fluid.layers.fc(input=emb, size=hidden_dim, act='tanh')
-for emb in emb_layers
-]
-
-hidden_0 = fluid.layers.sums(input=hidden_0_layers)
-
-lstm_0 = fluid.layers.dynamic_lstm(
-input=hidden_0,
-size=hidden_dim,
-candidate_activation='relu',
-gate_activation='sigmoid',
-cell_activation='sigmoid')
-
-# stack L-LSTM and R-LSTM with direct edges
-input_tmp = [hidden_0, lstm_0]
-
-# In PaddlePaddle, state features and transition features of a CRF are implemented
-# by a fully connected layer and a CRF layer seperately. The fully connected layer
-# with linear activation learns the state features, here we use fluid.layers.sums
-# (fluid.layers.fc can be uesed as well), and the CRF layer in PaddlePaddle:
-# fluid.layers.linear_chain_crf only
-# learns the transition features, which is a cost layer and is the last layer of the network.
-# fluid.layers.linear_chain_crf outputs the log probability of true tag sequence
-# as the cost by given the input sequence and it requires the true tag sequence
-# as target in the learning process.
-
-for i in range(1, depth):
-mix_hidden = fluid.layers.sums(input=[
-fluid.layers.fc(input=input_tmp[0], size=hidden_dim, act='tanh'),
-fluid.layers.fc(input=input_tmp[1], size=hidden_dim, act='tanh')
-])
-
-lstm = fluid.layers.dynamic_lstm(
-input=mix_hidden,
-size=hidden_dim,
-candidate_activation='relu',
-gate_activation='sigmoid',
-cell_activation='sigmoid',
-is_reverse=((i % 2) == 1))
-
-input_tmp = [mix_hidden, lstm]
-
-# 取最后一个栈式LSTM的输出和这个LSTM单元的输入到隐层映射，
-# 经过一个全连接层映射到标记字典的维度，来学习 CRF 的状态特征
-feature_out = fluid.layers.sums(input=[
-fluid.layers.fc(input=input_tmp[0], size=label_dict_len, act='tanh'),
-fluid.layers.fc(input=input_tmp[1], size=label_dict_len, act='tanh')
-])
-
-return feature_out
+            **ignored):
+    # 8 features
+    predicate_embedding = fluid.layers.embedding(
+        input=predicate,
+        size=[pred_dict_len, word_dim],
+        dtype='float32',
+        is_sparse=IS_SPARSE,
+        param_attr='vemb')
+
+    mark_embedding = fluid.layers.embedding(
+        input=mark,
+        size=[mark_dict_len, mark_dim],
+        dtype='float32',
+        is_sparse=IS_SPARSE)
+
+    word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
+    # Since word vector lookup table is pre-trained, we won't update it this time.
+    # trainable being False prevents updating the lookup table during training.
+    emb_layers = [
+        fluid.layers.embedding(
+            size=[word_dict_len, word_dim],
+            input=x,
+            param_attr=fluid.ParamAttr(
+                name=embedding_name, trainable=False)) for x in word_input
+    ]
+    emb_layers.append(predicate_embedding)
+    emb_layers.append(mark_embedding)
+
+    # 8 LSTM units are trained in alternating left-to-right / right-to-left order,
+    # controlled by the `is_reverse` argument.
+    hidden_0_layers = [
+        fluid.layers.fc(input=emb, size=hidden_dim, act='tanh')
+        for emb in emb_layers
+    ]
+
+    hidden_0 = fluid.layers.sums(input=hidden_0_layers)
+
+    lstm_0 = fluid.layers.dynamic_lstm(
+        input=hidden_0,
+        size=hidden_dim,
+        candidate_activation='relu',
+        gate_activation='sigmoid',
+        cell_activation='sigmoid')
+
+    # stack L-LSTM and R-LSTM with direct edges
+    input_tmp = [hidden_0, lstm_0]
+
+    # In PaddlePaddle, the state features and transition features of a CRF are implemented
+    # by a fully connected layer and a CRF layer separately. The fully connected layer
+    # with linear activation learns the state features; here we use fluid.layers.sums
+    # (fluid.layers.fc can be used as well). The CRF layer in PaddlePaddle,
+    # fluid.layers.linear_chain_crf, only
+    # learns the transition features. It is a cost layer and the last layer of the network.
+    # fluid.layers.linear_chain_crf uses the log probability of the true tag sequence,
+    # given the input sequence, as the cost, and it requires the true tag sequence
+    # as the target during learning.
+
+    for i in range(1, depth):
+        mix_hidden = fluid.layers.sums(input=[
+            fluid.layers.fc(input=input_tmp[0], size=hidden_dim, act='tanh'),
+            fluid.layers.fc(input=input_tmp[1], size=hidden_dim, act='tanh')
+        ])
+
+        lstm = fluid.layers.dynamic_lstm(
+            input=mix_hidden,
+            size=hidden_dim,
+            candidate_activation='relu',
+            gate_activation='sigmoid',
+            cell_activation='sigmoid',
+            is_reverse=((i % 2) == 1))
+
+        input_tmp = [mix_hidden, lstm]
+
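+    # Take the output of the last stacked LSTM and that LSTM unit's input-to-hidden
+    # projection, and map both through fully connected layers to the label dictionary
+    # dimension to learn the CRF state features.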
+    feature_out = fluid.layers.sums(input=[
+        fluid.layers.fc(input=input_tmp[0], size=label_dict_len, act='tanh'),
+        fluid.layers.fc(input=input_tmp[1], size=label_dict_len, act='tanh')
+    ])
+
+    return feature_out
 ```
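
To see how the `is_reverse=((i % 2) == 1)` expression alternates directions across the stack, the following sketch lists the direction of each LSTM unit. It assumes `depth = 8` (the value is defined outside this section) and that `lstm_0`, which does not pass `is_reverse`, runs in the forward direction.

```python
depth = 8  # assumed number of stacked LSTM units

# lstm_0 is created without is_reverse, so it is taken to run forward.
directions = ['forward']
for i in range(1, depth):
    is_reverse = (i % 2) == 1
    directions.append('backward' if is_reverse else 'forward')

for index, direction in enumerate(directions):
    print('lstm_%d: %s' % (index, direction))
```

Under these assumptions the stack holds four forward and four backward units, starting with a forward one.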

 ## Training the Model
@@ -338,116 +332,116 @@ return feature_out

 ```python
 def train(use_cuda, save_dirname=None, is_local=True):
-# define network topology
-
-# 句子序列
-word = fluid.layers.data(
-name='word_data', shape=[1], dtype='int64', lod_level=1)
-
-# 谓词
-predicate = fluid.layers.data(
-name='verb_data', shape=[1], dtype='int64', lod_level=1)
-
-# 谓词上下文5个特征
-ctx_n2 = fluid.layers.data(
-name='ctx_n2_data', shape=[1], dtype='int64', lod_level=1)
-ctx_n1 = fluid.layers.data(
-name='ctx_n1_data', shape=[1], dtype='int64', lod_level=1)
-ctx_0 = fluid.layers.data(
-name='ctx_0_data', shape=[1], dtype='int64', lod_level=1)
-ctx_p1 = fluid.layers.data(
-name='ctx_p1_data', shape=[1], dtype='int64', lod_level=1)
-ctx_p2 = fluid.layers.data(
-name='ctx_p2_data', shape=[1], dtype='int64', lod_level=1)
-
-# 谓词上下区域标志
-mark = fluid.layers.data(
-name='mark_data', shape=[1], dtype='int64', lod_level=1)
-
-# define network topology
-feature_out = db_lstm(**locals())
-
-# 标注序列
-target = fluid.layers.data(
-name='target', shape=[1], dtype='int64', lod_level=1)
-
-# 学习 CRF 的转移特征
-crf_cost = fluid.layers.linear_chain_crf(
-input=feature_out,
-label=target,
-param_attr=fluid.ParamAttr(
-name='crfw', learning_rate=mix_hidden_lr))
-
-avg_cost = fluid.layers.mean(crf_cost)
-
-sgd_optimizer = fluid.optimizer.SGD(
-learning_rate=fluid.layers.exponential_decay(
-learning_rate=0.01,
-decay_steps=100000,
-decay_rate=0.5,
-staircase=True))
-
-sgd_optimizer.minimize(avg_cost)
-
-# The CRF decoding layer is used for evaluation and inference.
-# It shares weights with CRF layer.  The sharing of parameters among multiple layers
-# is specified by using the same parameter name in these layers. If true tag sequence
-# is provided in training process, `fluid.layers.crf_decoding` calculates labelling error
-# for each input token and sums the error over the entire sequence.
-# Otherwise, `fluid.layers.crf_decoding`  generates the labelling tags.
-crf_decode = fluid.layers.crf_decoding(
-input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
-
-train_data = paddle.batch(
-paddle.reader.shuffle(
-paddle.dataset.conll05.test(), buf_size=8192),
-batch_size=BATCH_SIZE)
-
-place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
-
-
-feeder = fluid.DataFeeder(
-feed_list=[
-word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, predicate, mark, target
-],
-place=place)
-exe = fluid.Executor(place)
-
-def train_loop(main_program):
-exe.run(fluid.default_startup_program())
-embedding_param = fluid.global_scope().find_var(
-embedding_name).get_tensor()
-embedding_param.set(
-load_parameter(conll05.get_embedding(), word_dict_len, word_dim),
-place)
-
-start_time = time.time()
-batch_id = 0
-for pass_id in xrange(PASS_NUM):
-for data in train_data():
-cost = exe.run(main_program,
-feed=feeder.feed(data),
-fetch_list=[avg_cost])
-cost = cost[0]
-
-if batch_id % 10 == 0:
-print("avg_cost:" + str(cost))
-if batch_id != 0:
-print("second per batch: " + str((time.time(
-) - start_time) / batch_id))
-# Set the threshold low to speed up the CI test
-if float(cost) < 60.0:
-if save_dirname is not None:
-fluid.io.save_inference_model(save_dirname, [
-'word_data', 'verb_data', 'ctx_n2_data',
-'ctx_n1_data', 'ctx_0_data', 'ctx_p1_data',
-'ctx_p2_data', 'mark_data'
-], [feature_out], exe)
-return
-
-batch_id = batch_id + 1
-
-train_loop(fluid.default_main_program())
+    # define the input data layers
+
+    # sentence sequence
+    word = fluid.layers.data(
+        name='word_data', shape=[1], dtype='int64', lod_level=1)
+
+    # predicate
+    predicate = fluid.layers.data(
+        name='verb_data', shape=[1], dtype='int64', lod_level=1)
+
+    # 5 predicate context features
+    ctx_n2 = fluid.layers.data(
+        name='ctx_n2_data', shape=[1], dtype='int64', lod_level=1)
+    ctx_n1 = fluid.layers.data(
+        name='ctx_n1_data', shape=[1], dtype='int64', lod_level=1)
+    ctx_0 = fluid.layers.data(
+        name='ctx_0_data', shape=[1], dtype='int64', lod_level=1)
+    ctx_p1 = fluid.layers.data(
+        name='ctx_p1_data', shape=[1], dtype='int64', lod_level=1)
+    ctx_p2 = fluid.layers.data(
+        name='ctx_p2_data', shape=[1], dtype='int64', lod_level=1)
+
+    # predicate context region mark
+    mark = fluid.layers.data(
+        name='mark_data', shape=[1], dtype='int64', lod_level=1)
+
+    # define network topology
+    feature_out = db_lstm(**locals())
+
+    # label sequence
+    target = fluid.layers.data(
+        name='target', shape=[1], dtype='int64', lod_level=1)
+
+    # learn the CRF transition features
+    crf_cost = fluid.layers.linear_chain_crf(
+        input=feature_out,
+        label=target,
+        param_attr=fluid.ParamAttr(
+            name='crfw', learning_rate=mix_hidden_lr))
+
+    avg_cost = fluid.layers.mean(crf_cost)
+
+    sgd_optimizer = fluid.optimizer.SGD(
+        learning_rate=fluid.layers.exponential_decay(
+            learning_rate=0.01,
+            decay_steps=100000,
+            decay_rate=0.5,
+            staircase=True))
+
+    sgd_optimizer.minimize(avg_cost)
+
+    # The CRF decoding layer is used for evaluation and inference.
+    # It shares weights with CRF layer.  The sharing of parameters among multiple layers
+    # is specified by using the same parameter name in these layers. If true tag sequence
+    # is provided in training process, `fluid.layers.crf_decoding` calculates labelling error
+    # for each input token and sums the error over the entire sequence.
+    # Otherwise, `fluid.layers.crf_decoding`  generates the labelling tags.
+    crf_decode = fluid.layers.crf_decoding(
+        input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
+
+    train_data = paddle.batch(
+        paddle.reader.shuffle(
+            paddle.dataset.conll05.test(), buf_size=8192),
+        batch_size=BATCH_SIZE)
+
+    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+
+
+    feeder = fluid.DataFeeder(
+        feed_list=[
+            word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, predicate, mark, target
+        ],
+        place=place)
+    exe = fluid.Executor(place)
+
+    def train_loop(main_program):
+        exe.run(fluid.default_startup_program())
+        embedding_param = fluid.global_scope().find_var(
+            embedding_name).get_tensor()
+        embedding_param.set(
+            load_parameter(conll05.get_embedding(), word_dict_len, word_dim),
+            place)
+
+        start_time = time.time()
+        batch_id = 0
+        for pass_id in range(PASS_NUM):
+            for data in train_data():
+                cost = exe.run(main_program,
+                               feed=feeder.feed(data),
+                               fetch_list=[avg_cost])
+                cost = cost[0]
+
+                if batch_id % 10 == 0:
+                    print("avg_cost: " + str(cost))
+                    if batch_id != 0:
+                        print("seconds per batch: " + str(
+                            (time.time() - start_time) / batch_id))
+                    # Set the threshold low to speed up the CI test
+                    if float(cost) < 60.0:
+                        if save_dirname is not None:
+                            fluid.io.save_inference_model(save_dirname, [
+                                'word_data', 'verb_data', 'ctx_n2_data',
+                                'ctx_n1_data', 'ctx_0_data', 'ctx_p1_data',
+                                'ctx_p2_data', 'mark_data'
+                            ], [feature_out], exe)
+                        return
+
+                batch_id = batch_id + 1
+
+    train_loop(fluid.default_main_program())
 ```
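
The learning-rate schedule passed to the SGD optimizer can be illustrated in plain Python. The sketch below assumes the usual staircase form, `learning_rate * decay_rate ** floor(step / decay_steps)`; the function name `staircase_exponential_decay` is hypothetical and is not part of the Fluid API.

```python
import math


def staircase_exponential_decay(step,
                                learning_rate=0.01,
                                decay_steps=100000,
                                decay_rate=0.5):
    """Assumed staircase schedule: the rate is halved once every decay_steps."""
    exponent = math.floor(step / float(decay_steps))
    return learning_rate * decay_rate ** exponent


for step in (0, 99999, 100000, 250000):
    print(step, staircase_exponential_decay(step))
```

With `staircase=True` the rate stays constant inside each window of `decay_steps` batches instead of decaying continuously.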


@@ -457,92 +451,92 @@ train_loop(fluid.default_main_program())

 ```python
 def infer(use_cuda, save_dirname=None):
-if save_dirname is None:
-return
-
-place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
-exe = fluid.Executor(place)
-
-inference_scope = fluid.core.Scope()
-with fluid.scope_guard(inference_scope):
-# Use fluid.io.load_inference_model to obtain the inference program desc,
-# the feed_target_names (the names of variables that will be fed
-# data using feed operators), and the fetch_targets (variables that
-# we want to obtain data from using fetch operators).
-[inference_program, feed_target_names,
-fetch_targets] = fluid.io.load_inference_model(save_dirname, exe)
-
-# Setup inputs by creating LoDTensors to represent sequences of words.
-# Here each word is the basic element of these LoDTensors and the shape of
-# each word (base_shape) should be [1] since it is simply an index to
-# look up for the corresponding word vector.
-# Suppose the length_based level of detail (lod) info is set to [[3, 4, 2]],
-# which has only one lod level. Then the created LoDTensors will have only
-# one higher level structure (sequence of words, or sentence) than the basic
-# element (word). Hence the LoDTensor will hold data for three sentences of
-# length 3, 4 and 2, respectively.
-# Note that lod info should be a list of lists.
-lod = [[3, 4, 2]]
-base_shape = [1]
-# The range of random integers is [low, high]
-word = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=word_dict_len - 1)
-pred = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=pred_dict_len - 1)
-ctx_n2 = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=word_dict_len - 1)
-ctx_n1 = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=word_dict_len - 1)
-ctx_0 = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=word_dict_len - 1)
-ctx_p1 = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=word_dict_len - 1)
-ctx_p2 = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=word_dict_len - 1)
-mark = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=mark_dict_len - 1)
-
-# Construct feed as a dictionary of {feed_target_name: feed_target_data}
-# and results will contain a list of data corresponding to fetch_targets.
-assert feed_target_names[0] == 'word_data'
-assert feed_target_names[1] == 'verb_data'
-assert feed_target_names[2] == 'ctx_n2_data'
-assert feed_target_names[3] == 'ctx_n1_data'
-assert feed_target_names[4] == 'ctx_0_data'
-assert feed_target_names[5] == 'ctx_p1_data'
-assert feed_target_names[6] == 'ctx_p2_data'
-assert feed_target_names[7] == 'mark_data'
-
-results = exe.run(inference_program,
-feed={
-feed_target_names[0]: word,
-feed_target_names[1]: pred,
-feed_target_names[2]: ctx_n2,
-feed_target_names[3]: ctx_n1,
-feed_target_names[4]: ctx_0,
-feed_target_names[5]: ctx_p1,
-feed_target_names[6]: ctx_p2,
-feed_target_names[7]: mark
-},
-fetch_list=fetch_targets,
-return_numpy=False)
-print(results[0].lod())
-np_data = np.array(results[0])
-print("Inference Shape: ", np_data.shape)
+    if save_dirname is None:
+        return
+
+    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+    exe = fluid.Executor(place)
+
+    inference_scope = fluid.core.Scope()
+    with fluid.scope_guard(inference_scope):
+        # Use fluid.io.load_inference_model to obtain the inference program desc,
+        # the feed_target_names (the names of variables that will be fed
+        # data using feed operators), and the fetch_targets (variables that
+        # we want to obtain data from using fetch operators).
+        [inference_program, feed_target_names,
+         fetch_targets] = fluid.io.load_inference_model(save_dirname, exe)
+
+        # Setup inputs by creating LoDTensors to represent sequences of words.
+        # Here each word is the basic element of these LoDTensors and the shape of
+        # each word (base_shape) should be [1] since it is simply an index to
+        # look up for the corresponding word vector.
+        # Suppose the length_based level of detail (lod) info is set to [[3, 4, 2]],
+        # which has only one lod level. Then the created LoDTensors will have only
+        # one higher level structure (sequence of words, or sentence) than the basic
+        # element (word). Hence the LoDTensor will hold data for three sentences of
+        # length 3, 4 and 2, respectively.
+        # Note that lod info should be a list of lists.
+        lod = [[3, 4, 2]]
+        base_shape = [1]
+        # The range of random integers is [low, high]
+        word = fluid.create_random_int_lodtensor(
+            lod, base_shape, place, low=0, high=word_dict_len - 1)
+        pred = fluid.create_random_int_lodtensor(
+            lod, base_shape, place, low=0, high=pred_dict_len - 1)
+        ctx_n2 = fluid.create_random_int_lodtensor(
+            lod, base_shape, place, low=0, high=word_dict_len - 1)
+        ctx_n1 = fluid.create_random_int_lodtensor(
+            lod, base_shape, place, low=0, high=word_dict_len - 1)
+        ctx_0 = fluid.create_random_int_lodtensor(
+            lod, base_shape, place, low=0, high=word_dict_len - 1)
+        ctx_p1 = fluid.create_random_int_lodtensor(
+            lod, base_shape, place, low=0, high=word_dict_len - 1)
+        ctx_p2 = fluid.create_random_int_lodtensor(
+            lod, base_shape, place, low=0, high=word_dict_len - 1)
+        mark = fluid.create_random_int_lodtensor(
+            lod, base_shape, place, low=0, high=mark_dict_len - 1)
+
+        # Construct feed as a dictionary of {feed_target_name: feed_target_data}
+        # and results will contain a list of data corresponding to fetch_targets.
+        assert feed_target_names[0] == 'word_data'
+        assert feed_target_names[1] == 'verb_data'
+        assert feed_target_names[2] == 'ctx_n2_data'
+        assert feed_target_names[3] == 'ctx_n1_data'
+        assert feed_target_names[4] == 'ctx_0_data'
+        assert feed_target_names[5] == 'ctx_p1_data'
+        assert feed_target_names[6] == 'ctx_p2_data'
+        assert feed_target_names[7] == 'mark_data'
+
+        results = exe.run(inference_program,
+                          feed={
+                              feed_target_names[0]: word,
+                              feed_target_names[1]: pred,
+                              feed_target_names[2]: ctx_n2,
+                              feed_target_names[3]: ctx_n1,
+                              feed_target_names[4]: ctx_0,
+                              feed_target_names[5]: ctx_p1,
+                              feed_target_names[6]: ctx_p2,
+                              feed_target_names[7]: mark
+                          },
+                          fetch_list=fetch_targets,
+                          return_numpy=False)
+        print(results[0].lod())
+        np_data = np.array(results[0])
+        print("Inference Shape: ", np_data.shape)
 ```
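
The comments in `infer` describe a length-based LoD of `[[3, 4, 2]]`. As an illustration only, the NumPy sketch below shows how such lengths partition one flat array of word indices into three sentences; it does not use the LoDTensor API itself, and the variable names are chosen for this example.

```python
import numpy as np

lod = [[3, 4, 2]]  # length-based LoD: three sentences of length 3, 4 and 2
flat = np.arange(sum(lod[0])).reshape(-1, 1)  # 9 word ids, each with base shape [1]

# Convert lengths to offsets, then slice the flat data into sentences.
offsets = np.cumsum([0] + lod[0])
sentences = [flat[start:end] for start, end in zip(offsets[:-1], offsets[1:])]

for sentence in sentences:
    print(sentence.ravel().tolist())
```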

 The entry point of the whole program is as follows:

 ```python
 def main(use_cuda, is_local=True):
-if use_cuda and not fluid.core.is_compiled_with_cuda():
-return
+    if use_cuda and not fluid.core.is_compiled_with_cuda():
+        return

-# Directory for saving the trained model
-save_dirname = "label_semantic_roles.inference.model"
+    # Directory for saving the trained model
+    save_dirname = "label_semantic_roles.inference.model"

-train(use_cuda, save_dirname, is_local)
-infer(use_cuda, save_dirname)
+    train(use_cuda, save_dirname, is_local)
+    infer(use_cuda, save_dirname)


 main(use_cuda=False)

--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/bidirectional_stacked_lstm_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/bidirectional_stacked_lstm_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/bio_example.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/bio_example.png
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/bio_example_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/bio_example_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/db_lstm_network_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/db_lstm_network_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/dependency_parsing.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/dependency_parsing.png
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/dependency_parsing_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/dependency_parsing_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/stacked_lstm_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/stacked_lstm_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/.gitignore
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/.gitignore
-data/wmt14
-data/pre-wmt14
-pretrained/wmt14_model
-gen.log
-gen_result
-train.log
-dataprovider_copy_1.py
-*.pyc
-multi-bleu.perl
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/index.md
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/index.md
@@ -30,7 +30,9 @@
 1 -6.23177   These are the light of hope and relief . <e>
 2 -7.7914  These are the light of hope and the relief of hope . <e>
 ```
+
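 - The first column from the left is the index of the generated sentence; the second column is the sentence's score (sorted in descending order), where higher is better; the third column is the generated English sentence.
+
 - There are also two special tokens: `<e>` marks the end of a sentence, and `<unk>` marks an unknown word, that is, a word that does not appear in the training dictionary.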
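
The three-column output format can be read with a few lines of Python. This is a minimal sketch that assumes the columns are separated by whitespace, as in the sample lines above; `parse_candidate` is a hypothetical helper name.

```python
def parse_candidate(line):
    """Split one output line into (index, score, words), dropping a trailing <e>."""
    index, score, sentence = line.split(None, 2)
    words = sentence.split()
    if words and words[-1] == '<e>':
        words = words[:-1]
    return int(index), float(score), words


lines = [
    '1 -6.23177   These are the light of hope and relief . <e>',
    '2 -7.7914  These are the light of hope and the relief of hope . <e>',
]
candidates = [parse_candidate(line) for line in lines]
best = max(candidates, key=lambda item: item[1])  # higher score is better
print(best[0], ' '.join(best[2]))
```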

 ## Model Overview
@@ -41,7 +43,7 @@

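 We have already introduced one bidirectional recurrent neural network in the [Semantic Role Labeling](https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/README.cn.md) chapter. Here we introduce another structure, proposed by Bengio's team in \[[2](#参考文献),[4](#参考文献)\]. Its purpose is to take a sequence as input and produce a feature representation at every time step, so that each output step is a fixed-length vector encoding the contextual semantic information up to that step.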

-具体来说，该双向循环神经网络分别在时间维以顺序和逆序——即前向（forward）和后向（backward）——依次处理输入序列，并将每个时间步RNN的输出拼接成为最终的输出层。这样每个时间步的输出节点，都包含了输入序列中当前时刻完整的过去和未来的上下文信息。下图展示的是一个按时间步展开的双向循环神经网络。该网络包含一个前向和一个后向RNN，其中有六个权重矩阵：输入到前向隐层和后向隐层的权重矩阵（$W_1, W_3$），隐层到隐层自己的权重矩阵（$W_2,W_5$），前向隐层和后向隐层到输出层的权重矩阵（$W_4, W_6$）。注意，该网络的前向隐层和后向隐层之间没有连接。
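+Specifically, this bidirectional recurrent neural network processes the input sequence along the time dimension in both forward and backward order, and concatenates the RNN outputs at each time step to form the final output layer. Each output node therefore contains the complete past and future context of the current time step in the input sequence. The figure below shows a bidirectional recurrent neural network unrolled over time. The network contains one forward and one backward RNN with six weight matrices: the input-to-hidden weights for the forward and backward hidden layers (`$W_1, W_3$`), the hidden-to-hidden weights (`$W_2,W_5$`), and the weights from the forward and backward hidden layers to the output layer (`$W_4, W_6$`). Note that there are no connections between the forward and backward hidden layers.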

 ![bi_rnn](./image/bi_rnn.png)
 <p align="center">
@@ -60,13 +62,13 @@

 The encoding stage consists of three steps:

-1. one-hot vector表示：将源语言句子$x=\left \{ x_1,x_2,...,x_T \right \}$的每个词$x_i$表示成一个列向量$w_i\epsilon \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$。这个向量$w_i$的维度与词汇表大小$\left | V \right |$ 相同，并且只有一个维度上有值1（该位置对应该词在词汇表中的位置），其余全是0。
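+1. One-hot vector representation: each word `$x_i$` of the source sentence `$x=\left \{ x_1,x_2,...,x_T \right \}$` is represented as a column vector `$w_i\in \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$`. The dimension of `$w_i$` equals the vocabulary size `$\left | V \right |$`; exactly one entry is 1 (at the position of the word in the vocabulary) and all others are 0.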

-2. 映射到低维语义空间的词向量：one-hot vector表示存在两个问题，1）生成的向量维度往往很大，容易造成维数灾难；2）难以刻画词与词之间的关系（如语义相似性，也就是无法很好地表达语义）。因此，需再one-hot vector映射到低维的语义空间，由一个固定维度的稠密向量（称为词向量）表示。记映射矩阵为$C\epsilon R^{K\times \left | V \right |}$，用$s_i=Cw_i$表示第$i$个词的词向量，$K$为向量维度。
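+2. Word vectors in a low-dimensional semantic space: the one-hot representation has two problems: 1) the resulting vectors are usually very high-dimensional, which easily leads to the curse of dimensionality; 2) it is hard to capture relationships between words (such as semantic similarity), so it cannot express semantics well. The one-hot vector is therefore mapped into a low-dimensional semantic space and represented by a dense vector of fixed dimension, called a word vector. Let the mapping matrix be `$C\in R^{K\times \left | V \right |}$`; then `$s_i=Cw_i$` is the word vector of the `$i$`-th word, and `$K$` is the vector dimension.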

-3. 用RNN编码源语言词序列：这一过程的计算公式为$h_i=\varnothing _\theta \left ( h_{i-1}, s_i \right )$，其中$h_0$是一个全零的向量，$\varnothing _\theta$是一个非线性激活函数，最后得到的$\mathbf{h}=\left \{ h_1,..., h_T \right \}$就是RNN依次读入源语言$T$个词的状态编码序列。整句话的向量表示可以采用$\mathbf{h}$在最后一个时间步$T$的状态编码，或使用时间维上的池化（pooling）结果。
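+3. Encoding the source word sequence with an RNN: this step computes `$h_i=\varnothing _\theta \left ( h_{i-1}, s_i \right )$`, where `$h_0$` is an all-zero vector and `$\varnothing _\theta$` is a nonlinear activation function. The resulting `$\mathbf{h}=\left \{ h_1,..., h_T \right \}$` is the sequence of state encodings produced as the RNN reads the `$T$` source words in order. The vector representation of the whole sentence can be the state encoding of `$\mathbf{h}$` at the last time step `$T$`, or the result of pooling over the time dimension.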
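
The three encoding steps can be sketched with NumPy. This is only an illustration under stated assumptions: the sizes `V`, `K`, `H`, the weight matrices `W_h` and `W_s`, and the choice of a plain `tanh` recurrence for the nonlinear function are all made up for the example and are not the tutorial's model.

```python
import numpy as np

rng = np.random.RandomState(0)
V, K, H = 6, 4, 3      # assumed vocabulary size, word vector size, hidden size
sentence = [2, 5, 1]   # word ids x_1..x_T of a toy source sentence

C = rng.randn(K, V)          # mapping matrix C in R^{K x |V|}
W_h = rng.randn(H, H) * 0.1  # hypothetical hidden-to-hidden weights
W_s = rng.randn(H, K) * 0.1  # hypothetical input-to-hidden weights

h = np.zeros(H)              # h_0 is an all-zero vector
states = []
for word_id in sentence:
    w = np.zeros(V)
    w[word_id] = 1.0                      # step 1: one-hot vector w_i
    s = C.dot(w)                          # step 2: word vector s_i = C w_i
    h = np.tanh(W_h.dot(h) + W_s.dot(s))  # step 3: h_i from h_{i-1} and s_i
    states.append(h)

sentence_vector = states[-1]      # state encoding at the last time step T
pooled = np.mean(states, axis=0)  # alternative: pooling over the time dimension
print(sentence_vector.shape, pooled.shape)
```

Multiplying `C` by a one-hot vector simply selects one column of `C`, which is why embedding layers are usually implemented as table lookups.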

-第3步也可以使用双向循环神经网络实现更复杂的句编码表示，具体可以用双向GRU实现。前向GRU按照词序列$(x_1,x_2,...,x_T)$的顺序依次编码源语言端词，并得到一系列隐层状态$(\overrightarrow{h_1},\overrightarrow{h_2},...,\overrightarrow{h_T})$。类似的，后向GRU按照$(x_T,x_{T-1},...,x_1)$的顺序依次编码源语言端词，得到$(\overleftarrow{h_1},\overleftarrow{h_2},...,\overleftarrow{h_T})$。最后对于词$x_i$，通过拼接两个GRU的结果得到它的隐层状态，即$h_i=\left [ \overrightarrow{h_i^T},\overleftarrow{h_i^T} \right ]^{T}$。
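+Step 3 can also use a bidirectional recurrent neural network to obtain a richer sentence encoding, for example with a bidirectional GRU. The forward GRU encodes the source words in the order `$(x_1,x_2,...,x_T)$` and produces a series of hidden states `$(\overrightarrow{h_1},\overrightarrow{h_2},...,\overrightarrow{h_T})$`. Similarly, the backward GRU encodes the source words in the order `$(x_T,x_{T-1},...,x_1)$` and produces `$(\overleftarrow{h_1},\overleftarrow{h_2},...,\overleftarrow{h_T})$`. Finally, for word `$x_i$`, its hidden state is obtained by concatenating the results of the two GRUs, that is, `$h_i=\left [ \overrightarrow{h_i^T},\overleftarrow{h_i^T} \right ]^{T}$`.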
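
The concatenation of forward and backward states can be sketched as follows. To keep the example short it swaps the GRU for a plain `tanh` RNN; the helper `run_rnn`, the sizes, and the random weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
T, K, H = 4, 3, 2             # assumed sequence length, input size, hidden size
inputs = rng.randn(T, K)      # stand-in for the word vectors s_1..s_T


def run_rnn(xs, W_h, W_x):
    """Plain tanh RNN (not a GRU); returns one hidden state per input row."""
    h = np.zeros(W_h.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W_h.dot(h) + W_x.dot(x))
        out.append(h)
    return np.stack(out)


fw = run_rnn(inputs, rng.randn(H, H) * 0.1, rng.randn(H, K) * 0.1)
bw_reversed = run_rnn(inputs[::-1], rng.randn(H, H) * 0.1, rng.randn(H, K) * 0.1)
bw = bw_reversed[::-1]               # realign so that bw[i] belongs to x_i
h = np.concatenate([fw, bw], axis=1)  # h_i = [forward h_i; backward h_i]
print(h.shape)
```

The backward pass reads the sequence from `x_T` to `x_1`, so its outputs must be reversed again before concatenation.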

 ![encoder_attention](./image/encoder_attention.png)
 <p align="center">
@@ -77,19 +79,16 @@

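 During training for the machine translation task, the goal of the decoding stage is to maximize the probability of the next correct target-language word. The procedure is: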

-1. 每一个时刻，根据源语言句子的编码信息（又叫上下文向量，context vector）$c$、真实目标语言序列的第$i$个词$u_i$和$i$时刻RNN的隐层状态$z_i$，计算出下一个隐层状态$z_{i+1}$。计算公式如下：
-
-$$z_{i+1}=\phi _{\theta '}\left ( c,u_i,z_i \right )$$
-
-其中$\phi _{\theta '}$是一个非线性激活函数；$c=q\mathbf{h}$是源语言句子的上下文向量，在不使用[注意力机制](#注意力机制)时，如果[编码器](#编码器)的输出是源语言句子编码后的最后一个元素，则可以定义$c=h_T$；$u_i$是目标语言序列的第$i$个单词，$u_0$是目标语言序列的开始标记`<s>`，表示解码开始；$z_i$是$i$时刻解码RNN的隐层状态，$z_0$是一个全零的向量。
-
-2. 将$z_{i+1}$通过`softmax`归一化，得到目标语言序列的第$i+1$个单词的概率分布$p_{i+1}$。概率分布公式如下：
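+1. At each time step, compute the next hidden state `$z_{i+1}$` from the encoding of the source sentence (also called the context vector) `$c$`, the `$i$`-th word `$u_i$` of the ground-truth target sequence, and the RNN hidden state `$z_i$` at step `$i$`. The formula is:
+$$z_{i+1}=\phi_{\theta '} \left ( c,u_i,z_i \right )$$
+where `$\phi _{\theta '}$` is a nonlinear activation function and `$c=q\mathbf{h}$` is the context vector of the source sentence. Without the [attention mechanism](#注意力机制), if the output of the [encoder](#编码器) is the last element of the encoded source sentence, we can define `$c=h_T$`. `$u_i$` is the `$i$`-th word of the target sequence, and `$u_0$` is the start token `<s>` of the target sequence, which marks the beginning of decoding. `$z_i$` is the hidden state of the decoder RNN at step `$i$`, and `$z_0$` is an all-zero vector.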

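+2. Normalize `$z_{i+1}$` with `softmax` to obtain the probability distribution `$p_{i+1}$` over the `$i+1$`-th word of the target sequence. The distribution is:
 $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
+where `$W_sz_{i+1}+b_z$` scores every candidate output word, and the softmax normalization then yields the probability `$p_{i+1}$` of the `$i+1$`-th word.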

-其中$W_sz_{i+1}+b_z$是对每个可能的输出单词进行打分，再用softmax归一化就可以得到第$i+1$个词的概率$p_{i+1}$。
+3. Compute the cost from `$p_{i+1}$` and `$u_{i+1}$`.

-3. 根据$p_{i+1}$和$u_{i+1}$计算代价。
 4. Repeat steps 1 to 3 until all words of the target sequence have been processed.
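
The four training steps can be written out as a small NumPy loop. This sketch makes several assumptions that the text does not fix: the nonlinear function is taken to be `tanh` over a sum of linear maps, the cost is cross-entropy (negative log probability of the correct next word), token id 0 stands for `<s>`, and all weights are random placeholders.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


rng = np.random.RandomState(0)
V, K, H = 5, 3, 4          # assumed target vocabulary, embedding and hidden sizes
E = rng.randn(V, K) * 0.1  # hypothetical target word embeddings
W_c = rng.randn(H, H) * 0.1
W_u = rng.randn(H, K) * 0.1
W_z = rng.randn(H, H) * 0.1
W_s = rng.randn(V, H) * 0.1
b_z = np.zeros(V)

c = rng.randn(H)       # context vector, e.g. c = h_T
target = [0, 3, 1, 4]  # u_0 = <s> (id 0 assumed), followed by u_1..u_3
z = np.zeros(H)        # z_0 is an all-zero vector
total_cost = 0.0
for i in range(len(target) - 1):
    u_i = E[target[i]]
    z = np.tanh(W_c.dot(c) + W_u.dot(u_i) + W_z.dot(z))  # step 1: z_{i+1}
    p = softmax(W_s.dot(z) + b_z)                         # step 2: p_{i+1}
    total_cost += -np.log(p[target[i + 1]])               # step 3: cost for u_{i+1}
print(total_cost)
```

Because the ground-truth word `u_i` is fed at every step, an early mistake in the model's own prediction does not propagate through the rest of the sequence during training.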

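 The generation process of the machine translation task, put simply, translates a source sentence with a pre-trained model. The decoding stage during generation differs from the training process above; see the [beam search algorithm](#柱搜索算法) for details.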
@@ -102,12 +101,15 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$

 In the decoding stage with beam search, the goal is to maximize the probability of the generated sequence. The procedure is:

-1. 每一个时刻，根据源语言句子的编码信息$c$、生成的第$i$个目标语言序列单词$u_i$和$i$时刻RNN的隐层状态$z_i$，计算出下一个隐层状态$z_{i+1}$。
-2. 将$z_{i+1}$通过`softmax`归一化，得到目标语言序列的第$i+1$个单词的概率分布$p_{i+1}$。
-3. 根据$p_{i+1}$采样出单词$u_{i+1}$。
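+1. At each time step, compute the next hidden state `$z_{i+1}$` from the encoding `$c$` of the source sentence, the `$i$`-th generated target word `$u_i$`, and the RNN hidden state `$z_i$` at step `$i$`.
+
+2. Normalize `$z_{i+1}$` with `softmax` to obtain the probability distribution `$p_{i+1}$` over the `$i+1$`-th word of the target sequence.
+
+3. Sample the word `$u_{i+1}$` according to `$p_{i+1}$`.
+
 4. Repeat steps 1 to 3 until the end-of-sentence token `<e>` is produced or the maximum generation length is exceeded.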

-注意：$z_{i+1}$和$p_{i+1}$的计算公式同[解码器](#解码器)中的一样。且由于生成时的每一步都是通过贪心法实现的，因此并不能保证得到全局最优解。
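+Note: `$z_{i+1}$` and `$p_{i+1}$` are computed with the same formulas as in the [decoder](#解码器). Because each generation step is chosen greedily, a globally optimal solution is not guaranteed.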
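
A greedy version of this generation loop might look like the sketch below. It picks the highest-probability word at each step (argmax rather than random sampling) and is not the beam search used by the tutorial code. The `step_fn` interface and the toy scoring function are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def greedy_decode(step_fn, start_id, end_id, max_length):
    """step_fn(prev_id, state) -> (scores, new_state); hypothetical interface."""
    ids, state, prev = [], None, start_id
    for _ in range(max_length):
        scores, state = step_fn(prev, state)
        prev = int(np.argmax(softmax(scores)))  # greedy choice of u_{i+1}
        ids.append(prev)
        if prev == end_id:                      # stop at the <e> token
            break
    return ids


def toy_step(prev_id, state):
    """Toy scorer over 4 tokens: always prefers the id after the previous one."""
    scores = np.zeros(4)
    scores[(prev_id + 1) % 4] = 1.0
    return scores, state


full = greedy_decode(toy_step, start_id=0, end_id=3, max_length=10)
truncated = greedy_decode(toy_step, start_id=0, end_id=3, max_length=2)
print(full, truncated)
```

Beam search differs by keeping several candidate prefixes per step instead of a single one, which is why it can recover from a locally poor choice.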

 ## Data

@@ -116,9 +118,13 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
 ### Data Preprocessing

 Our preprocessing pipeline has two steps:
+
 - Merge each source-to-target parallel corpus file pair into a single file:
+
 - Merge each `XXX.src` and `XXX.trg` file into `XXX`.
- `XXX`中的第$i$行内容为`XXX.src`中的第$i$行和`XXX.trg`中的第$i$行连接，用'\t'分隔。
+
+- Line `$i$` of `XXX` is line `$i$` of `XXX.src` joined with line `$i$` of `XXX.trg`, separated by '\t'.
+
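 - Create the "source dictionary" and "target dictionary" for the training data. Each dictionary contains **DICTSIZE** words: the (DICTSIZE - 3) most frequent words in the corpus plus 3 special symbols, `<s>` (start of sequence), `<e>` (end of sequence), and `<unk>` (unknown word).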
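
Both preprocessing steps can be sketched in a few lines of Python. The helper names `merge_parallel` and `build_dict`, the toy corpus, and the order in which the three special symbols receive ids are assumptions for illustration; the actual preprocessing script is not shown in this section.

```python
from collections import Counter


def merge_parallel(src_lines, trg_lines):
    """Join line i of the source file with line i of the target file by a tab."""
    return [s.rstrip('\n') + '\t' + t.rstrip('\n')
            for s, t in zip(src_lines, trg_lines)]


def build_dict(sentences, dict_size):
    """Three special symbols plus the (dict_size - 3) most frequent words."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    words = ['<s>', '<e>', '<unk>']  # assumed id order of the special symbols
    words += [w for w, _ in counts.most_common(dict_size - 3)]
    return {w: i for i, w in enumerate(words)}


src = ['a b a\n', 'a c\n']
trg = ['x y\n', 'x z\n']
merged = merge_parallel(src, trg)
src_dict = build_dict([line.split('\t')[0] for line in merged], dict_size=5)
print(merged)
print(sorted(src_dict.items(), key=lambda item: item[1]))
```

Any word that does not make it into the dictionary would be mapped to the id of `<unk>` when the corpus is converted to ids.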

 ### Sample Data
@@ -132,6 +138,7 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
 Next we configure the model according to the form of the input data. First, import the required libraries and define the global variables.

 ```python
+from __future__ import print_function
 import contextlib

 import numpy as np
@@ -157,139 +164,152 @@ decoder_size = hidden_dim

 Then implement the encoder as follows:

-```python
-def encoder(is_sparse):
-src_word_id = pd.data(
-name="src_word_id", shape=[1], dtype='int64', lod_level=1)
-src_embedding = pd.embedding(
-input=src_word_id,
-size=[dict_size, word_dim],
-dtype='float32',
-is_sparse=is_sparse,
-param_attr=fluid.ParamAttr(name='vemb'))
-
-fc1 = pd.fc(input=src_embedding, size=hidden_dim * 4, act='tanh')
-lstm_hidden0, lstm_0 = pd.dynamic_lstm(input=fc1, size=hidden_dim * 4)
-encoder_out = pd.sequence_last_step(input=lstm_hidden0)
-return encoder_out
-```
+```python
+def encoder(is_sparse):
+    src_word_id = pd.data(
+        name="src_word_id", shape=[1], dtype='int64', lod_level=1)
+    src_embedding = pd.embedding(
+        input=src_word_id,
+        size=[dict_size, word_dim],
+        dtype='float32',
+        is_sparse=is_sparse,
+        param_attr=fluid.ParamAttr(name='vemb'))
+
+    fc1 = pd.fc(input=src_embedding, size=hidden_dim * 4, act='tanh')
+    lstm_hidden0, lstm_0 = pd.dynamic_lstm(input=fc1, size=hidden_dim * 4)
+    encoder_out = pd.sequence_last_step(input=lstm_hidden0)
+    return encoder_out
+```

 Next, implement the decoder for training mode:

 ```python
-def train_decoder(context, is_sparse):
-trg_language_word = pd.data(
-name="target_language_word", shape=[1], dtype='int64', lod_level=1)
-trg_embedding = pd.embedding(
-input=trg_language_word,
-size=[dict_size, word_dim],
-dtype='float32',
-is_sparse=is_sparse,
-param_attr=fluid.ParamAttr(name='vemb'))
-
-rnn = pd.DynamicRNN()
-with rnn.block():
-current_word = rnn.step_input(trg_embedding)
-pre_state = rnn.memory(init=context)
-current_state = pd.fc(input=[current_word, pre_state],
-size=decoder_size,
-act='tanh')
-
-current_score = pd.fc(input=current_state,
-size=target_dict_dim,
-act='softmax')
-rnn.update_memory(pre_state, current_state)
-rnn.output(current_score)
-
-return rnn()
+def train_decoder(context, is_sparse):
+    trg_language_word = pd.data(
+        name="target_language_word", shape=[1], dtype='int64', lod_level=1)
+    trg_embedding = pd.embedding(
+        input=trg_language_word,
+        size=[dict_size, word_dim],
+        dtype='float32',
+        is_sparse=is_sparse,
+        param_attr=fluid.ParamAttr(name='vemb'))
+
+    rnn = pd.DynamicRNN()
+    with rnn.block():
+        current_word = rnn.step_input(trg_embedding)
+        pre_state = rnn.memory(init=context)
+        current_state = pd.fc(input=[current_word, pre_state],
+                              size=decoder_size,
+                              act='tanh')
+
+        current_score = pd.fc(input=current_state,
+                              size=target_dict_dim,
+                              act='softmax')
+        rnn.update_memory(pre_state, current_state)
+        rnn.output(current_score)
+
+    return rnn()
 ```

 Implement the decoder for inference mode:

 ```python
 def decode(context, is_sparse):
-init_state = context
-array_len = pd.fill_constant(shape=[1], dtype='int64', value=max_length)
-counter = pd.zeros(shape=[1], dtype='int64', force_cpu=True)
-
-# fill the first element with init_state
-state_array = pd.create_array('float32')
-pd.array_write(init_state, array=state_array, i=counter)
-
-# ids, scores as memory
-ids_array = pd.create_array('int64')
-scores_array = pd.create_array('float32')
-
-init_ids = pd.data(name="init_ids", shape=[1], dtype="int64", lod_level=2)
-init_scores = pd.data(
-name="init_scores", shape=[1], dtype="float32", lod_level=2)
-
-pd.array_write(init_ids, array=ids_array, i=counter)
-pd.array_write(init_scores, array=scores_array, i=counter)
-
-cond = pd.less_than(x=counter, y=array_len)
-
-while_op = pd.While(cond=cond)
-with while_op.block():
-pre_ids = pd.array_read(array=ids_array, i=counter)
-pre_state = pd.array_read(array=state_array, i=counter)
-pre_score = pd.array_read(array=scores_array, i=counter)
-
-# expand the lod of pre_state to be the same with pre_score
-pre_state_expanded = pd.sequence_expand(pre_state, pre_score)
-
-pre_ids_emb = pd.embedding(
-input=pre_ids,
-size=[dict_size, word_dim],
-dtype='float32',
-is_sparse=is_sparse)
-
-# use rnn unit to update rnn
-current_state = pd.fc(input=[pre_state_expanded, pre_ids_emb],
-size=decoder_size,
-act='tanh')
-current_state_with_lod = pd.lod_reset(x=current_state, y=pre_score)
-# use score to do beam search
-current_score = pd.fc(input=current_state_with_lod,
-size=target_dict_dim,
-act='softmax')
-topk_scores, topk_indices = pd.topk(current_score, k=topk_size)
-selected_ids, selected_scores = pd.beam_search(
-pre_ids, topk_indices, topk_scores, beam_size, end_id=10, level=0)
-
-pd.increment(x=counter, value=1, in_place=True)
-
-# update the memories
-pd.array_write(current_state, array=state_array, i=counter)
-pd.array_write(selected_ids, array=ids_array, i=counter)
-pd.array_write(selected_scores, array=scores_array, i=counter)
-
-pd.less_than(x=counter, y=array_len, cond=cond)
-
-translation_ids, translation_scores = pd.beam_search_decode(
-ids=ids_array, scores=scores_array)
-
-return translation_ids, translation_scores
+    init_state = context
+    array_len = pd.fill_constant(shape=[1], dtype='int64', value=max_length)
+    counter = pd.zeros(shape=[1], dtype='int64', force_cpu=True)
+
+    # fill the first element with init_state
+    state_array = pd.create_array('float32')
+    pd.array_write(init_state, array=state_array, i=counter)
+
+    # ids, scores as memory
+    ids_array = pd.create_array('int64')
+    scores_array = pd.create_array('float32')
+
+    init_ids = pd.data(name="init_ids", shape=[1], dtype="int64", lod_level=2)
+    init_scores = pd.data(
+        name="init_scores", shape=[1], dtype="float32", lod_level=2)
+
+    pd.array_write(init_ids, array=ids_array, i=counter)
+    pd.array_write(init_scores, array=scores_array, i=counter)
+
+    cond = pd.less_than(x=counter, y=array_len)
+
+    while_op = pd.While(cond=cond)
+    with while_op.block():
+        pre_ids = pd.array_read(array=ids_array, i=counter)
+        pre_state = pd.array_read(array=state_array, i=counter)
+        pre_score = pd.array_read(array=scores_array, i=counter)
+
+        # expand the lod of pre_state to be the same with pre_score
+        pre_state_expanded = pd.sequence_expand(pre_state, pre_score)
+
+        pre_ids_emb = pd.embedding(
+            input=pre_ids,
+            size=[dict_size, word_dim],
+            dtype='float32',
+            is_sparse=is_sparse)
+
+        # use rnn unit to update rnn
+        current_state = pd.fc(input=[pre_state_expanded, pre_ids_emb],
+                              size=decoder_size,
+                              act='tanh')
+        current_state_with_lod = pd.lod_reset(x=current_state, y=pre_score)
+        # use score to do beam search
+        current_score = pd.fc(input=current_state_with_lod,
+                              size=target_dict_dim,
+                              act='softmax')
+        topk_scores, topk_indices = pd.topk(current_score, k=beam_size)
+        # calculate accumulated scores after topk to reduce computation cost
+        accu_scores = pd.elementwise_add(
+            x=pd.log(topk_scores), y=pd.reshape(pre_score, shape=[-1]), axis=0)
+        selected_ids, selected_scores = pd.beam_search(
+            pre_ids,
+            pre_score,
+            topk_indices,
+            accu_scores,
+            beam_size,
+            end_id=10,
+            level=0)
+
+        pd.increment(x=counter, value=1, in_place=True)
+
+        # update the memories
+        pd.array_write(current_state, array=state_array, i=counter)
+        pd.array_write(selected_ids, array=ids_array, i=counter)
+        pd.array_write(selected_scores, array=scores_array, i=counter)
+
+        # update the break condition: up to the max length or all candidates of
+        # source sentences have ended.
+        length_cond = pd.less_than(x=counter, y=array_len)
+        finish_cond = pd.logical_not(pd.is_empty(x=selected_ids))
+        pd.logical_and(x=length_cond, y=finish_cond, out=cond)
+
+    translation_ids, translation_scores = pd.beam_search_decode(
+        ids=ids_array, scores=scores_array, beam_size=beam_size, end_id=10)
+
+    return translation_ids, translation_scores
 ```
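The new `topk` step keeps `beam_size` candidates per prefix, then adds the log of each candidate's probability to the running score of its prefix before pruning. The arithmetic can be sketched in plain Python, outside Paddle. `accumulate_scores` and `select_beam` are hypothetical helper names, and the pruning below is a simple global top-k that stands in for the `beam_search` op rather than reproducing it:

```python
import math


def accumulate_scores(topk_probs, prefix_scores):
    """Add each candidate's log-probability to the score of its prefix.

    topk_probs[i][j] is the softmax probability of the j-th candidate that
    extends prefix i; prefix_scores[i] is the accumulated log-score of prefix i.
    """
    return [[prefix_scores[i] + math.log(p) for p in row]
            for i, row in enumerate(topk_probs)]


def select_beam(topk_ids, accu_scores, beam_size):
    """Keep the beam_size best (prefix index, token id, score) triples."""
    candidates = [(i, topk_ids[i][j], accu_scores[i][j])
                  for i in range(len(topk_ids))
                  for j in range(len(topk_ids[i]))]
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates[:beam_size]


# Two live prefixes, each with its two most probable next tokens.
topk_probs = [[0.5, 0.25], [0.8, 0.1]]
topk_ids = [[7, 3], [4, 9]]
prefix_scores = [math.log(0.5), math.log(0.25)]

accu = accumulate_scores(topk_probs, prefix_scores)
beam = select_beam(topk_ids, accu, beam_size=2)
```

Working in log space turns the product of per-step probabilities into a sum, which is why the diff applies `pd.log` to `topk_scores` before `elementwise_add`.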

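 Next, we define a `train_program` that takes the result computed by `inference_program` and computes the error against the labeled data. We also define an `optimizer_func` that specifies the optimizer.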

 ```python
 def train_program(is_sparse):
-context = encoder(is_sparse)
-rnn_out = train_decoder(context, is_sparse)
-label = pd.data(
-name="target_language_next_word", shape=[1], dtype='int64', lod_level=1)
-cost = pd.cross_entropy(input=rnn_out, label=label)
-avg_cost = pd.mean(cost)
-return avg_cost
+    context = encoder(is_sparse)
+    rnn_out = train_decoder(context, is_sparse)
+    label = pd.data(
+        name="target_language_next_word", shape=[1], dtype='int64', lod_level=1)
+    cost = pd.cross_entropy(input=rnn_out, label=label)
+    avg_cost = pd.mean(cost)
+    return avg_cost


 def optimizer_func():
-return fluid.optimizer.Adagrad(
-learning_rate=1e-4,
-regularization=fluid.regularizer.L2DecayRegularizer(
-regularization_coeff=0.1))
+    return fluid.optimizer.Adagrad(
+        learning_rate=1e-4,
+        regularization=fluid.regularizer.L2DecayRegularizer(
+            regularization_coeff=0.1))
 ```

 ## Training the Model
@@ -307,9 +327,9 @@ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()

 ```python
 train_reader = paddle.batch(
-paddle.reader.shuffle(
-paddle.dataset.wmt14.train(dict_size), buf_size=1000),
-batch_size=batch_size)
+        paddle.reader.shuffle(
+            paddle.dataset.wmt14.train(dict_size), buf_size=1000),
+        batch_size=batch_size)
 ```

 ### Create the Trainer
@@ -318,9 +338,9 @@ batch_size=batch_size)
 ```python
 is_sparse = False
 trainer = fluid.Trainer(
-train_func=partial(train_program, is_sparse),
-place=place,
-optimizer_func=optimizer_func)
+        train_func=partial(train_program, is_sparse),
+        place=place,
+        optimizer_func=optimizer_func)
 ```

 ### Feeding Data
@@ -329,8 +349,8 @@ optimizer_func=optimizer_func)

 ```python
 feed_order = [
-'src_word_id', 'target_language_word', 'target_language_next_word'
-]
+        'src_word_id', 'target_language_word', 'target_language_next_word'
+    ]
 ```

 ### Event Handler
@@ -338,12 +358,12 @@ feed_order = [

 ```python
 def event_handler(event):
-if isinstance(event, fluid.EndStepEvent):
-if event.step % 10 == 0:
-print('pass_id=' + str(event.epoch) + ' batch=' + str(event.step))
+    if isinstance(event, fluid.EndStepEvent):
+        if event.step % 10 == 0:
+            print('pass_id=' + str(event.epoch) + ' batch=' + str(event.step))

-if event.step == 20:
-trainer.stop()
+        if event.step == 20:
+            trainer.stop()
 ```

 ### Start Training
@@ -353,10 +373,10 @@ trainer.stop()
 EPOCH_NUM = 1

 trainer.train(
-reader=train_reader,
-num_epochs=EPOCH_NUM,
-event_handler=event_handler,
-feed_order=feed_order)
+        reader=train_reader,
+        num_epochs=EPOCH_NUM,
+        event_handler=event_handler,
+        feed_order=feed_order)
 ```

 ## Applying the Model
@@ -377,7 +397,7 @@ translation_ids, translation_scores = decode(context, is_sparse)
 ```python
 init_ids_data = np.array([1 for _ in range(batch_size)], dtype='int64')
 init_scores_data = np.array(
-[1. for _ in range(batch_size)], dtype='float32')
+    [1. for _ in range(batch_size)], dtype='float32')
 init_ids_data = init_ids_data.reshape((batch_size, 1))
 init_scores_data = init_scores_data.reshape((batch_size, 1))
 init_lod = [1] * batch_size
@@ -387,14 +407,14 @@ init_ids = fluid.create_lod_tensor(init_ids_data, init_lod, place)
 init_scores = fluid.create_lod_tensor(init_scores_data, init_lod, place)

 test_data = paddle.batch(
-paddle.reader.shuffle(
-paddle.dataset.wmt14.test(dict_size), buf_size=1000),
-batch_size=batch_size)
+    paddle.reader.shuffle(
+        paddle.dataset.wmt14.test(dict_size), buf_size=1000),
+    batch_size=batch_size)

 feed_order = ['src_word_id']
 feed_list = [
-framework.default_main_program().global_block().var(var_name)
-for var_name in feed_order
+    framework.default_main_program().global_block().var(var_name)
+    for var_name in feed_order
 ]
 feeder = fluid.DataFeeder(feed_list, place)

@@ -409,27 +429,30 @@ exe = Executor(place)
 exe.run(framework.default_startup_program())

 for data in test_data():
-feed_data = map(lambda x: [x[0]], data)
-feed_dict = feeder.feed(feed_data)
-feed_dict['init_ids'] = init_ids
-feed_dict['init_scores'] = init_scores
-
-results = exe.run(
-framework.default_main_program(),
-feed=feed_dict,
-fetch_list=[translation_ids, translation_scores],
-return_numpy=False)
-
-result_ids = np.array(results[0])
-result_scores = np.array(results[1])
-
-print("Original sentence:")
-print(" ".join([src_dict[w] for w in feed_data[0][0]]))
-print("Translated sentence:")
-print(" ".join([trg_dict[w] for w in result_ids]))
-print("Corresponding score: ", result_scores)
-
-break
+    feed_data = map(lambda x: [x[0]], data)
+    feed_dict = feeder.feed(feed_data)
+    feed_dict['init_ids'] = init_ids
+    feed_dict['init_scores'] = init_scores
+
+    results = exe.run(
+        framework.default_main_program(),
+        feed=feed_dict,
+        fetch_list=[translation_ids, translation_scores],
+        return_numpy=False)
+
+    result_ids = np.array(results[0])
+    result_scores = np.array(results[1])
+
+    print("Original sentence:")
+    print(" ".join([src_dict[w] for w in feed_data[0][0][1:-1]]))
+    print("Translated score and sentence:")
+    for i in xrange(beam_size):
+        start_pos = result_ids_lod[1][i] + 1
+        end_pos = result_ids_lod[1][i+1]
+        print("%d\t%.4f\t%s\n" % (i+1, result_scores[end_pos-1],
+                " ".join([trg_dict[w] for w in result_ids[start_pos:end_pos]])))
+
+    break
 ```
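The new printing loop slices the flat `result_ids` array with offsets taken from `result_ids_lod[1]`; the hunk itself does not show where `result_ids_lod` is assigned. The slicing logic can be sketched in plain Python, assuming the offsets mark where each hypothesis starts and ends in the flat array. `split_hypotheses` is a hypothetical helper, and the sample ids and scores are made up for illustration:

```python
def split_hypotheses(flat_ids, flat_scores, offsets):
    """Slice a flat id list into hypotheses using LoD-style offsets.

    offsets[i] and offsets[i + 1] bound hypothesis i. As in the loop above,
    the first token (the start mark) is skipped and the score stored at the
    last token is reported as the score of the hypothesis.
    """
    hypotheses = []
    for i in range(len(offsets) - 1):
        start = offsets[i] + 1
        end = offsets[i + 1]
        hypotheses.append((flat_scores[end - 1], flat_ids[start:end]))
    return hypotheses


# Two hypotheses packed back to back; 0 stands in for the start mark.
flat_ids = [0, 11, 12, 1, 0, 21, 1]
flat_scores = [0.0, -0.5, -1.2, -1.5, 0.0, -0.7, -2.0]
offsets = [0, 4, 7]

hyps = split_hypotheses(flat_ids, flat_scores, offsets)
```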

 ## Summary

--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/bi_rnn_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/bi_rnn_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/decoder_attention_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/decoder_attention_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/encoder_attention_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/encoder_attention_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/encoder_decoder.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/encoder_decoder.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/encoder_decoder_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/encoder_decoder_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/gru_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/gru_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/nmt_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/nmt_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/recommender_system/.gitignore
+++ b/doc/fluid/new_docs/beginners_guide/basics/recommender_system/.gitignore
-.idea
-.ipynb_checkpoints
--- a/doc/fluid/new_docs/beginners_guide/basics/recommender_system/index.md
+++ b/doc/fluid/new_docs/beginners_guide/basics/recommender_system/index.md
 # Personalized Recommendation

-The source code for this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system). If you are new to PaddlePaddle, see the [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
+The source code for this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system). If you are new to PaddlePaddle, see the [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书); for more, see the [video course](http://bit.baidu.com/course/detail/id/176.html) for this tutorial.

 ## Background

@@ -36,8 +36,8 @@ Prediction Score is 4.25

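 YouTube is the world's largest website for uploading, sharing, and discovering videos. Its recommender system recommends personalized content from an ever-growing video library to more than one billion users. The system consists of two neural networks: a candidate generation network and a ranking network. The candidate generation network produces hundreds of candidates from a library of millions of videos; the ranking network scores and sorts those candidates and outputs the few dozen top-ranked results. Figure 1 shows the system structure: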

-![YouTube_Overview](./image/YouTube_Overview.png)
 <p align="center">
+<img src="image/YouTube_Overview.png" width="70%" ><br/>
 Figure 1. Structure of the YouTube recommender system
 </p>

@@ -47,8 +47,8 @@ YouTube是世界上最大的视频上传、分享和发现网站，YouTube推荐

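 First, historical information such as watch history and search queries is mapped to vectors and averaged to obtain a fixed-length representation. Demographic features are also fed in to improve recommendations for new users, and binary and continuous features are normalized to the range [0, 1]. Next, all feature representations are concatenated into a single vector and fed to a nonlinear multilayer perceptron (MLP; see the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/README.cn.md) tutorial). Finally, during training the MLP output is passed to a softmax for classification; at prediction time, the similarity between the user's combined feature (the MLP output) and every video is computed, and the $k$ highest-scoring videos are taken as the output of the candidate generation network. Figure 2 shows the structure of the candidate generation network.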

-![Deep_candidate_generation_model_architecture](./image/Deep_candidate_generation_model_architecture.png)
 <p align="center">
+<img src="image/Deep_candidate_generation_model_architecture.png" width="70%" ><br/>
 Figure 2. Structure of the candidate generation network
 </p>
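The averaging and top-$k$ steps of candidate generation can be sketched in plain Python. This is only an illustration under simplifying assumptions: the MLP is omitted, so the averaged history vector stands in for the user's combined feature; similarity is taken to be a dot product; and `average_vectors`, `top_k_videos`, and the toy vectors are hypothetical.

```python
def average_vectors(vectors):
    """Average variable-length history into one fixed-length vector."""
    n = float(len(vectors))
    return [sum(column) / n for column in zip(*vectors)]


def top_k_videos(user_vec, video_vecs, k):
    """Return the ids of the k videos with the highest dot-product score."""
    scores = {video_id: sum(a * b for a, b in zip(user_vec, vec))
              for video_id, vec in video_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Three watched videos, each already mapped to a 2-dimensional vector.
watch_history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
user_vec = average_vectors(watch_history)

video_vecs = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [-1.0, 0.0]}
candidates = top_k_videos(user_vec, video_vecs, k=2)
```

However many items the history contains, `user_vec` keeps the same length, which is what allows it to be concatenated with the other features.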

@@ -72,8 +72,8 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$

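 A convolutional neural network is built mainly from convolution and pooling operations, which can be applied and combined in many different ways. In this subsection we use the network shown in Figure 3 as the example: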

-![text_cnn](./image/text_cnn.png)
 <p align="center">
+<img src="image/text_cnn.png" width = "80%" align="center"/><br/>
 Figure 3. Convolutional neural network for text classification
 </p>

@@ -95,9 +95,9 @@ $$\hat c=max(c)$$

 1. First, user features and movie features are used as the input to the neural network, where:

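- User features combine four attributes: user ID, gender, occupation, and age.
+   - User features combine four attributes: user ID, gender, occupation, and age.

- Movie features combine three attributes: movie ID, movie category ID, and movie title.
+   - Movie features combine three attributes: movie ID, movie category ID, and movie title.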

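 2. For user features, the user ID is mapped to a 256-dimensional vector and fed into a fully connected layer; the other three attributes are handled in a similar way. The feature representations of the four attributes are then each passed through a fully connected layer and summed.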

@@ -105,8 +105,9 @@ $$\hat c=max(c)$$

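 4. Once the vector representations of the user and the movie are obtained, their cosine similarity is computed as the recommender system's score. Finally, the squared difference between this similarity score and the user's actual rating is used as the loss function of the regression model.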

-![rec_regression_network](./image/rec_regression_network.png)
 <p align="center">
+
+<img src="image/rec_regression_network.png" width="90%" ><br/>
 Figure 4. Fusion recommendation model
 </p>
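The scoring and loss in step 4 come down to a few lines of arithmetic. A minimal plain-Python sketch, assuming the similarity is scaled by 5.0 to match the rating range, as `inference_program` does with `layers.scale`; the helper names and the two toy vectors are hypothetical:

```python
import math


def cos_sim(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predicted_rating(user_vec, movie_vec, scale=5.0):
    """Scale the similarity so that a perfect match maps to a rating of 5."""
    return scale * cos_sim(user_vec, movie_vec)


def square_error(prediction, label):
    """Squared difference between the predicted and the actual rating."""
    return (prediction - label) ** 2


user_vec = [3.0, 4.0]
movie_vec = [4.0, 3.0]

rating = predicted_rating(user_vec, movie_vec)  # cosine is 24 / 25
loss = square_error(rating, 5.0)
```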

@@ -141,7 +142,7 @@ movie_info = paddle.dataset.movielens.movie_info()
 print movie_info.values()[0]
 ```

-<MovieInfo id(1), title(Toy Story ), categories(['Animation', "Children's", 'Comedy'])>
+    <MovieInfo id(1), title(Toy Story ), categories(['Animation', "Children's", 'Comedy'])>


 This means the movie ID is 1, the title is "Toy Story", and the movie falls into three categories: Animation, Children's, and Comedy.
@@ -152,13 +153,14 @@ user_info = paddle.dataset.movielens.user_info()
 print user_info.values()[0]
 ```

-<UserInfo id(1), gender(F), age(1), job(10)>
+    <UserInfo id(1), gender(F), age(1), job(10)>


 This means the user ID is 1, the user is female and under 18, and the occupation ID is 10.


 Age is encoded using the following buckets:
+
 *  1:  "Under 18"
 * 18:  "18-24"
 * 25:  "25-34"
@@ -168,6 +170,7 @@ print user_info.values()[0]
 * 56:  "56+"

 Occupation is chosen from the following options:
+
 *  0:  "other" or not specified
 *  1:  "academic/educator"
 *  2:  "artist"
@@ -203,7 +206,7 @@ mov_id = train_sample[len(user_info[uid].value())]
 print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1])
 ```

-User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0]
+    User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0]


 That is, user 1 gave movie 1193 a rating of 5.
@@ -214,6 +217,7 @@ User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193


 ```python
+from __future__ import print_function
 import math
 import sys
 import numpy as np
@@ -232,59 +236,59 @@ BATCH_SIZE = 256
 ```python
 def get_usr_combined_features():

-USR_DICT_SIZE = paddle.dataset.movielens.max_user_id() + 1
+    USR_DICT_SIZE = paddle.dataset.movielens.max_user_id() + 1

-uid = layers.data(name='user_id', shape=[1], dtype='int64')
+    uid = layers.data(name='user_id', shape=[1], dtype='int64')

-usr_emb = layers.embedding(
-input=uid,
-dtype='float32',
-size=[USR_DICT_SIZE, 32],
-param_attr='user_table',
-is_sparse=IS_SPARSE)
+    usr_emb = layers.embedding(
+        input=uid,
+        dtype='float32',
+        size=[USR_DICT_SIZE, 32],
+        param_attr='user_table',
+        is_sparse=IS_SPARSE)

-usr_fc = layers.fc(input=usr_emb, size=32)
+    usr_fc = layers.fc(input=usr_emb, size=32)

-USR_GENDER_DICT_SIZE = 2
+    USR_GENDER_DICT_SIZE = 2

-usr_gender_id = layers.data(name='gender_id', shape=[1], dtype='int64')
+    usr_gender_id = layers.data(name='gender_id', shape=[1], dtype='int64')

-usr_gender_emb = layers.embedding(
-input=usr_gender_id,
-size=[USR_GENDER_DICT_SIZE, 16],
-param_attr='gender_table',
-is_sparse=IS_SPARSE)
+    usr_gender_emb = layers.embedding(
+        input=usr_gender_id,
+        size=[USR_GENDER_DICT_SIZE, 16],
+        param_attr='gender_table',
+        is_sparse=IS_SPARSE)

-usr_gender_fc = layers.fc(input=usr_gender_emb, size=16)
+    usr_gender_fc = layers.fc(input=usr_gender_emb, size=16)

-USR_AGE_DICT_SIZE = len(paddle.dataset.movielens.age_table)
-usr_age_id = layers.data(name='age_id', shape=[1], dtype="int64")
+    USR_AGE_DICT_SIZE = len(paddle.dataset.movielens.age_table)
+    usr_age_id = layers.data(name='age_id', shape=[1], dtype="int64")

-usr_age_emb = layers.embedding(
-input=usr_age_id,
-size=[USR_AGE_DICT_SIZE, 16],
-is_sparse=IS_SPARSE,
-param_attr='age_table')
+    usr_age_emb = layers.embedding(
+        input=usr_age_id,
+        size=[USR_AGE_DICT_SIZE, 16],
+        is_sparse=IS_SPARSE,
+        param_attr='age_table')

-usr_age_fc = layers.fc(input=usr_age_emb, size=16)
+    usr_age_fc = layers.fc(input=usr_age_emb, size=16)

-USR_JOB_DICT_SIZE = paddle.dataset.movielens.max_job_id() + 1
-usr_job_id = layers.data(name='job_id', shape=[1], dtype="int64")
+    USR_JOB_DICT_SIZE = paddle.dataset.movielens.max_job_id() + 1
+    usr_job_id = layers.data(name='job_id', shape=[1], dtype="int64")

-usr_job_emb = layers.embedding(
-input=usr_job_id,
-size=[USR_JOB_DICT_SIZE, 16],
-param_attr='job_table',
-is_sparse=IS_SPARSE)
+    usr_job_emb = layers.embedding(
+        input=usr_job_id,
+        size=[USR_JOB_DICT_SIZE, 16],
+        param_attr='job_table',
+        is_sparse=IS_SPARSE)

-usr_job_fc = layers.fc(input=usr_job_emb, size=16)
+    usr_job_fc = layers.fc(input=usr_job_emb, size=16)

-concat_embed = layers.concat(
-input=[usr_fc, usr_gender_fc, usr_age_fc, usr_job_fc], axis=1)
+    concat_embed = layers.concat(
+        input=[usr_fc, usr_gender_fc, usr_age_fc, usr_job_fc], axis=1)

-usr_combined_features = layers.fc(input=concat_embed, size=200, act="tanh")
+    usr_combined_features = layers.fc(input=concat_embed, size=200, act="tanh")

-return usr_combined_features
+    return usr_combined_features
 ```

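 As the code above shows, each user is described by four input features: user_id, gender_id, age_id, and job_id. Each is a plain integer. To make them easier for the network to process, we borrow from language models in NLP and map each discrete integer to an embedding, which yields usr_emb, usr_gender_emb, usr_age_emb, and usr_job_emb.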
@@ -297,51 +301,51 @@ return usr_combined_features
 ```python
 def get_mov_combined_features():

-MOV_DICT_SIZE = paddle.dataset.movielens.max_movie_id() + 1
+    MOV_DICT_SIZE = paddle.dataset.movielens.max_movie_id() + 1

-mov_id = layers.data(name='movie_id', shape=[1], dtype='int64')
+    mov_id = layers.data(name='movie_id', shape=[1], dtype='int64')

-mov_emb = layers.embedding(
-input=mov_id,
-dtype='float32',
-size=[MOV_DICT_SIZE, 32],
-param_attr='movie_table',
-is_sparse=IS_SPARSE)
+    mov_emb = layers.embedding(
+        input=mov_id,
+        dtype='float32',
+        size=[MOV_DICT_SIZE, 32],
+        param_attr='movie_table',
+        is_sparse=IS_SPARSE)

-mov_fc = layers.fc(input=mov_emb, size=32)
+    mov_fc = layers.fc(input=mov_emb, size=32)

-CATEGORY_DICT_SIZE = len(paddle.dataset.movielens.movie_categories())
+    CATEGORY_DICT_SIZE = len(paddle.dataset.movielens.movie_categories())

-category_id = layers.data(
-name='category_id', shape=[1], dtype='int64', lod_level=1)
+    category_id = layers.data(
+        name='category_id', shape=[1], dtype='int64', lod_level=1)

-mov_categories_emb = layers.embedding(
-input=category_id, size=[CATEGORY_DICT_SIZE, 32], is_sparse=IS_SPARSE)
+    mov_categories_emb = layers.embedding(
+        input=category_id, size=[CATEGORY_DICT_SIZE, 32], is_sparse=IS_SPARSE)

-mov_categories_hidden = layers.sequence_pool(
-input=mov_categories_emb, pool_type="sum")
+    mov_categories_hidden = layers.sequence_pool(
+        input=mov_categories_emb, pool_type="sum")

-MOV_TITLE_DICT_SIZE = len(paddle.dataset.movielens.get_movie_title_dict())
+    MOV_TITLE_DICT_SIZE = len(paddle.dataset.movielens.get_movie_title_dict())

-mov_title_id = layers.data(
-name='movie_title', shape=[1], dtype='int64', lod_level=1)
+    mov_title_id = layers.data(
+        name='movie_title', shape=[1], dtype='int64', lod_level=1)

-mov_title_emb = layers.embedding(
-input=mov_title_id, size=[MOV_TITLE_DICT_SIZE, 32], is_sparse=IS_SPARSE)
+    mov_title_emb = layers.embedding(
+        input=mov_title_id, size=[MOV_TITLE_DICT_SIZE, 32], is_sparse=IS_SPARSE)

-mov_title_conv = nets.sequence_conv_pool(
-input=mov_title_emb,
-num_filters=32,
-filter_size=3,
-act="tanh",
-pool_type="sum")
+    mov_title_conv = nets.sequence_conv_pool(
+        input=mov_title_emb,
+        num_filters=32,
+        filter_size=3,
+        act="tanh",
+        pool_type="sum")

-concat_embed = layers.concat(
-input=[mov_fc, mov_categories_hidden, mov_title_conv], axis=1)
+    concat_embed = layers.concat(
+        input=[mov_fc, mov_categories_hidden, mov_title_conv], axis=1)

-mov_combined_features = layers.fc(input=concat_embed, size=200, act="tanh")
+    mov_combined_features = layers.fc(input=concat_embed, size=200, act="tanh")

-return mov_combined_features
+    return mov_combined_features
 ```

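 The movie title is a sequence of integers, each one the index of a word in the vocabulary. The sequence is fed to the `sequence_conv_pool` layer, which applies convolution and pooling along the time dimension. As a result, the output has a fixed length even though the input sequences vary in length.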
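Why the output length is fixed can be seen in a small plain-Python sketch: each filter yields one activation per time step, and sum pooling folds those activations into a single number, so the result has one entry per filter. This is an illustration only. The function reuses the layer's name but is a hypothetical stand-in, and the zero padding at the sequence edges is an assumption that may differ from what Paddle does.

```python
import math


def sequence_conv_pool(seq, filters, filter_size):
    """Convolve over time with tanh, then sum-pool each filter's activations.

    seq is a list of T vectors of dimension D; each filter is a list of
    filter_size weight vectors of dimension D. Returns one value per filter.
    """
    dim = len(seq[0])
    pad = [[0.0] * dim] * (filter_size // 2)
    padded = pad + seq + pad
    output = []
    for filt in filters:
        total = 0.0
        for t in range(len(seq)):
            window = padded[t:t + filter_size]
            act = sum(w * x
                      for w_vec, x_vec in zip(filt, window)
                      for w, x in zip(w_vec, x_vec))
            total += math.tanh(act)
        output.append(total)
    return output


# Two filters of size 3 over 1-dimensional word vectors.
filters = [[[1.0], [0.0], [0.0]],   # looks at the previous word
           [[0.0], [1.0], [0.0]]]   # looks at the current word

short_title = [[1.0], [2.0]]
long_title = [[1.0], [2.0], [0.5], [0.0], [3.0]]

short_out = sequence_conv_pool(short_title, filters, filter_size=3)
long_out = sequence_conv_pool(long_title, filters, filter_size=3)
```

Both `short_out` and `long_out` have two entries, one per filter, even though the titles have two and five words.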
@@ -350,13 +354,13 @@ return mov_combined_features

 ```python
 def inference_program():
-usr_combined_features = get_usr_combined_features()
-mov_combined_features = get_mov_combined_features()
+    usr_combined_features = get_usr_combined_features()
+    mov_combined_features = get_mov_combined_features()

-inference = layers.cos_sim(X=usr_combined_features, Y=mov_combined_features)
-scale_infer = layers.scale(x=inference, scale=5.0)
+    inference = layers.cos_sim(X=usr_combined_features, Y=mov_combined_features)
+    scale_infer = layers.scale(x=inference, scale=5.0)

-return scale_infer
+    return scale_infer
 ```

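 Next, we define a `train_program` that takes the result computed by `inference_program` and computes the error against the labeled data. We also define an `optimizer_func` that specifies the optimizer.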
@@ -364,17 +368,17 @@ return scale_infer
 ```python
 def train_program():

-scale_infer = inference_program()
+    scale_infer = inference_program()

-label = layers.data(name='score', shape=[1], dtype='float32')
-square_cost = layers.square_error_cost(input=scale_infer, label=label)
-avg_cost = layers.mean(square_cost)
+    label = layers.data(name='score', shape=[1], dtype='float32')
+    square_cost = layers.square_error_cost(input=scale_infer, label=label)
+    avg_cost = layers.mean(square_cost)

-return [avg_cost, scale_infer]
+    return [avg_cost, scale_infer]


 def optimizer_func():
-return fluid.optimizer.SGD(learning_rate=0.2)
+    return fluid.optimizer.SGD(learning_rate=0.2)
 ```


@@ -393,12 +397,12 @@ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()

 ```python
 train_reader = paddle.batch(
-paddle.reader.shuffle(
-paddle.dataset.movielens.train(), buf_size=8192),
-batch_size=BATCH_SIZE)
+    paddle.reader.shuffle(
+        paddle.dataset.movielens.train(), buf_size=8192),
+    batch_size=BATCH_SIZE)

 test_reader = paddle.batch(
-paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
+    paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
 ```

 ### Create the Trainer
@@ -406,7 +410,7 @@ paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)

 ```python
 trainer = fluid.Trainer(
-train_func=train_program, place=place, optimizer_func=optimizer_func)
+    train_func=train_program, place=place, optimizer_func=optimizer_func)
 ```

 ### Feeding Data
@@ -415,8 +419,8 @@ train_func=train_program, place=place, optimizer_func=optimizer_func)

 ```python
 feed_order = [
-'user_id', 'gender_id', 'age_id', 'job_id', 'movie_id', 'category_id',
-'movie_title', 'score'
+    'user_id', 'gender_id', 'age_id', 'job_id', 'movie_id', 'category_id',
+    'movie_title', 'score'
 ]
 ```

@@ -433,23 +437,23 @@ plot_cost = Ploter(test_title)


 def event_handler(event):
-if isinstance(event, fluid.EndStepEvent):
-avg_cost_set = trainer.test(
-reader=test_reader, feed_order=feed_order)
+    if isinstance(event, fluid.EndStepEvent):
+        avg_cost_set = trainer.test(
+            reader=test_reader, feed_order=feed_order)

-# get avg cost
-avg_cost = np.array(avg_cost_set).mean()
+        # get avg cost
+        avg_cost = np.array(avg_cost_set).mean()

-plot_cost.append(test_title, event.step, avg_cost_set[0])
-plot_cost.plot()
+        plot_cost.append(test_title, event.step, avg_cost_set[0])
+        plot_cost.plot()

-print("avg_cost: %s" % avg_cost)
-print('BatchID {0}, Test Loss {1:0.2}'.format(event.epoch + 1,
-float(avg_cost)))
+        print("avg_cost: %s" % avg_cost)
+        print('BatchID {0}, Test Loss {1:0.2}'.format(event.epoch + 1,
+                                                          float(avg_cost)))

-if event.step == 20: # Adjust this number for accuracy
-trainer.save_params(params_dirname)
-trainer.stop()
+        if event.step == 20: # Adjust this number for accuracy
+            trainer.save_params(params_dirname)
+            trainer.stop()
 ```

 ### Start Training
@@ -457,10 +461,10 @@ trainer.stop()

 ```python
 trainer.train(
-num_epochs=1,
-event_handler=event_handler,
-reader=train_reader,
-feed_order=feed_order)
+    num_epochs=1,
+    event_handler=event_handler,
+    reader=train_reader,
+    feed_order=feed_order)
 ```

 ## Applying the Model
@@ -470,7 +474,7 @@ feed_order=feed_order)

 ```python
 inferencer = fluid.Inferencer(
-inference_program, param_path=params_dirname, place=place)
+        inference_program, param_path=params_dirname, place=place)
 ```

 ### Build Test Input Data
@@ -488,7 +492,7 @@ job_id = fluid.create_lod_tensor([[10]], [[1]], place)
 movie_id = fluid.create_lod_tensor([[783]], [[1]], place) # Hunchback of Notre Dame
 category_id = fluid.create_lod_tensor([[10, 8, 9]], [[3]], place) # Animation, Children's, Musical
 movie_title = fluid.create_lod_tensor([[1069, 4140, 2923, 710, 988]], [[5]],
-place) # 'hunchback','of','notre','dame','the'
+                                      place) # 'hunchback','of','notre','dame','the'
 ```

 ### Testing
@@ -497,16 +501,21 @@ place) # 'hunchback','of','notre','dame','the'

 ```python
 results = inferencer.infer(
-{
-'user_id': user_id,
-'gender_id': gender_id,
-'age_id': age_id,
-'job_id': job_id,
-'movie_id': movie_id,
-'category_id': category_id,
-'movie_title': movie_title
-},
-return_numpy=False)
+    {
+        'user_id': user_id,
+        'gender_id': gender_id,
+        'age_id': age_id,
+        'job_id': job_id,
+        'movie_id': movie_id,
+        'category_id': category_id,
+        'movie_title': movie_title
+    },
+    return_numpy=False)
+
+predict_rating = np.array(results[0])
+print("Predict Rating of user id 1 on movie \"" + infer_movie_name + "\" is " + str(predict_rating[0][0]))
+print("Actual Rating of user id 1 on movie \"" + infer_movie_name + "\" is 4.")
+
 ```

 ## Summary
@@ -515,12 +524,12 @@ return_numpy=False)

 ## References

-1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
-2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
+1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
+2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
 3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186.
-4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th international conference on World Wide Web*. ACM, 2001.
+4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th international conference on World Wide Web*. ACM, 2001.
 5. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: combining social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA
-6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
+6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
 7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198.



--- a/doc/fluid/new_docs/beginners_guide/basics/recommender_system/image/rec_regression_network_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/recommender_system/image/rec_regression_network_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/.gitignore
+++ b/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/.gitignore
-data/aclImdb
-data/imdb
-data/pre-imdb
-data/mosesdecoder-master
-*.log
-model_output
-dataprovider_copy_1.py
-model.list
-*.pyc
-.DS_Store
--- a/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/index.md
+++ b/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/index.md
 # Sentiment Analysis

-The source code for this tutorial is in [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment). If you are new to PaddlePaddle, see the [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
+The source code for this tutorial is in [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment). If you are new to PaddlePaddle, see the [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书); for more, see the [video course](http://bit.baidu.com/course/detail/id/177.html) for this tutorial.

 ## Background

@@ -36,8 +36,8 @@

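 A recurrent neural network is a powerful tool for accurately modeling sequence data. In fact, the theoretical computational power of recurrent neural networks is Turing-complete \[[4](#参考文献)\]. Natural language is a typical kind of sequence data (a sequence of words). In recent years, recurrent neural networks and their variants (such as long short-term memory \[[5](#参考文献)\]) have performed well across many areas of natural language processing, including language modeling, syntactic parsing, semantic role labeling (or sequence labeling in general), semantic representation, image captioning, dialogue, and machine translation, and in some of these tasks they have become the best-performing method.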

-![rnn](./image/rnn.png)
 <p align="center">
+<img src="image/rnn.png" width = "60%" align="center"/><br/>
 Figure 1. A recurrent neural network unrolled over time
 </p>

@@ -65,8 +65,8 @@ $$ o_t = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o) $$
 $$ h_t = o_t\odot tanh(c_t) $$
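 Here $i_t, f_t, c_t, o_t$ denote the vector values of the input gate, forget gate, memory cell, and output gate, respectively; the subscripted $W$ and $b$ are model parameters; $tanh$ is the hyperbolic tangent function; and $\odot$ denotes elementwise multiplication. The input gate controls how strongly new input enters the memory cell $c$, the forget gate controls how strongly the memory cell retains its value from the previous time step, and the output gate controls how strongly the memory cell is output. The three gates are computed in a similar way but with entirely separate parameters, and each controls the memory cell $c$ in its own way, as shown in Figure 2: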

-![lstm](./image/lstm.png)
 <p align="center">
+<img src="image/lstm.png" width = "65%" align="center"/><br/>
 图2. 时刻$t$的LSTM [7]
 </p>
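
The gate description above can be made concrete with a single time step on scalars. This is a minimal sketch, not Paddle's `dynamic_lstm` implementation: only the $o_t$ and $h_t$ equations appear in this excerpt, so the forms used for $i_t$, $f_t$, and $c_t$ are assumed to be the usual peephole variants, and the parameter names in the dictionary are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, the gate activation sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step on scalars (illustrative only).

    `p` is a dict of scalar parameters; the key names are hypothetical.
    """
    # Assumed form: input and forget gates peek at the previous cell state.
    i_t = sigmoid(p["W_xi"] * x_t + p["W_hi"] * h_prev + p["W_ci"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_xf"] * x_t + p["W_hf"] * h_prev + p["W_cf"] * c_prev + p["b_f"])
    # The cell keeps part of its old value and admits part of the new candidate.
    c_t = f_t * c_prev + i_t * math.tanh(p["W_xc"] * x_t + p["W_hc"] * h_prev + p["b_c"])
    # The output gate uses the updated cell state c_t, as in the o_t equation.
    o_t = sigmoid(p["W_xo"] * x_t + p["W_ho"] * h_prev + p["W_co"] * c_t + p["b_o"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


PARAM_NAMES = [
    "W_xi", "W_hi", "W_ci", "b_i",
    "W_xf", "W_hf", "W_cf", "b_f",
    "W_xc", "W_hc", "b_c",
    "W_xo", "W_ho", "W_co", "b_o",
]

# With all parameters at zero every gate equals 0.5, so the cell halves its old value.
zero_params = {name: 0.0 for name in PARAM_NAMES}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
print(h_t, c_t)
```

Scalars keep the sketch readable; in the real layer each quantity is a vector and each product is a matrix-vector product.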

@@ -82,8 +82,8 @@ $$ h_t=Recrurent(x_t,h_{t-1})$$

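 As shown in Figure 3 (using three layers as an example), odd-numbered LSTM layers run forward and even-numbered layers run backward. Each higher LSTM layer takes as input the output of the layer below it together with the information from all earlier layers. Max pooling over the time dimension of the top LSTM layer's output sequence yields a fixed-length vector representation of the text (one that incorporates the text's context and abstracts it at a deep level). Finally, we connect this representation to a softmax layer to build the classification model.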

-![stacked_lstm](./image/stacked_lstm.jpg)
 <p align="center">
+<img src="image/stacked_lstm.jpg" width=450><br/>
 图3. 栈式双向LSTM用于文本分类
 </p>

@@ -94,11 +94,11 @@ $$ h_t=Recrurent(x_t,h_{t-1})$$
 ```text
 aclImdb
 |- test
-|-- neg
-|-- pos
+   |-- neg
+   |-- pos
 |- train
-|-- neg
-|-- pos
+   |-- neg
+   |-- pos
 ```
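 Paddle implements automatic downloading and reading of the IMDB dataset in `dataset/imdb.py`, and provides APIs for reading the dictionary, the training data, and the test data.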

@@ -107,6 +107,7 @@ Paddle在`dataset/imdb.py`中提实现了imdb数据集的自动下载和读取
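 In this example we implement two text classification algorithms: one based on the text convolutional neural network introduced in the [recommender system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system) chapter, and one based on the [stacked bidirectional LSTM](#栈式双向LSTM（Stacked Bidirectional LSTM）). We first import the libraries we need and define the global variables: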

 ```python
+from __future__ import print_function
 import paddle
 import paddle.fluid as fluid
 from functools import partial
@@ -115,6 +116,7 @@ import numpy as np
 CLASS_DIM = 2
 EMB_DIM = 128
 HID_DIM = 512
+STACKED_NUM = 3
 BATCH_SIZE = 128
 USE_GPU = False
 ```
@@ -126,23 +128,23 @@ USE_GPU = False

 ```python
 def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim):
-emb = fluid.layers.embedding(
-input=data, size=[input_dim, emb_dim], is_sparse=True)
-conv_3 = fluid.nets.sequence_conv_pool(
-input=emb,
-num_filters=hid_dim,
-filter_size=3,
-act="tanh",
-pool_type="sqrt")
-conv_4 = fluid.nets.sequence_conv_pool(
-input=emb,
-num_filters=hid_dim,
-filter_size=4,
-act="tanh",
-pool_type="sqrt")
-prediction = fluid.layers.fc(
-input=[conv_3, conv_4], size=class_dim, act="softmax")
-return prediction
+    emb = fluid.layers.embedding(
+        input=data, size=[input_dim, emb_dim], is_sparse=True)
+    conv_3 = fluid.nets.sequence_conv_pool(
+        input=emb,
+        num_filters=hid_dim,
+        filter_size=3,
+        act="tanh",
+        pool_type="sqrt")
+    conv_4 = fluid.nets.sequence_conv_pool(
+        input=emb,
+        num_filters=hid_dim,
+        filter_size=4,
+        act="tanh",
+        pool_type="sqrt")
+    prediction = fluid.layers.fc(
+        input=[conv_3, conv_4], size=class_dim, act="softmax")
+    return prediction
 ```

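 The `input_dim` argument of the network is the size of the dictionary, and `class_dim` is the number of classes. Here we use the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py) API to implement the convolution and pooling operations.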
@@ -154,27 +156,26 @@ return prediction
 ```python
 def stacked_lstm_net(data, input_dim, class_dim, emb_dim, hid_dim, stacked_num):

-emb = fluid.layers.embedding(
-input=data, size=[input_dim, emb_dim], is_sparse=True)
+    emb = fluid.layers.embedding(
+        input=data, size=[input_dim, emb_dim], is_sparse=True)

-fc1 = fluid.layers.fc(input=emb, size=hid_dim)
-lstm1, cell1 = fluid.layers.dynamic_lstm(input=fc1, size=hid_dim)
+    fc1 = fluid.layers.fc(input=emb, size=hid_dim)
+    lstm1, cell1 = fluid.layers.dynamic_lstm(input=fc1, size=hid_dim)

-inputs = [fc1, lstm1]
+    inputs = [fc1, lstm1]

-for i in range(2, stacked_num + 1):
-fc = fluid.layers.fc(input=inputs, size=hid_dim)
-lstm, cell = fluid.layers.dynamic_lstm(
-input=fc, size=hid_dim, is_reverse=(i % 2) == 0)
-inputs = [fc, lstm]
+    for i in range(2, stacked_num + 1):
+        fc = fluid.layers.fc(input=inputs, size=hid_dim)
+        lstm, cell = fluid.layers.dynamic_lstm(
+            input=fc, size=hid_dim, is_reverse=(i % 2) == 0)
+        inputs = [fc, lstm]

-fc_last = fluid.layers.sequence_pool(input=inputs[0], pool_type='max')
-lstm_last = fluid.layers.sequence_pool(input=inputs[1], pool_type='max')
+    fc_last = fluid.layers.sequence_pool(input=inputs[0], pool_type='max')
+    lstm_last = fluid.layers.sequence_pool(input=inputs[1], pool_type='max')

-prediction = fluid.layers.fc(input=[fc_last, lstm_last],
-size=class_dim,
-act='softmax')
-return prediction
+    prediction = fluid.layers.fc(
+        input=[fc_last, lstm_last], size=class_dim, act='softmax')
+    return prediction
 ```
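 The stacked bidirectional LSTM above extracts high-level features and maps them to a vector whose size equals the number of classes. The softmax activation (`act='softmax'` on the final `fc` layer) then computes the probability that the input belongs to each class.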
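
As a reminder of what the softmax step computes, here is a small standalone sketch in plain Python. It is independent of Paddle and only illustrates the arithmetic; the input scores are made up.

```python
import math


def softmax(scores):
    """Map a list of real scores to probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two scores, matching the two classes of this task (CLASS_DIM = 2).
probs = softmax([2.0, 0.0])
print(probs)
```

The larger score receives the larger probability, and equal scores receive equal probabilities.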

@@ -184,12 +185,13 @@ return prediction

 ```python
 def inference_program(word_dict):
-data = fluid.layers.data(
-name="words", shape=[1], dtype="int64", lod_level=1)
+    data = fluid.layers.data(
+        name="words", shape=[1], dtype="int64", lod_level=1)

-dict_dim = len(word_dict)
-net = convolution_net(data, dict_dim, CLASS_DIM, EMB_DIM, HID_DIM)
-return net
+    dict_dim = len(word_dict)
+    net = convolution_net(data, dict_dim, CLASS_DIM, EMB_DIM, HID_DIM)
+    # net = stacked_lstm_net(data, dict_dim, CLASS_DIM, EMB_DIM, HID_DIM, STACKED_NUM)
+    return net
 ```

 Here we define `train_program`, which uses the result returned by `inference_program` to compute the error. We also define the optimizer function `optimizer_func`.
@@ -200,16 +202,16 @@ return net

 ```python
 def train_program(word_dict):
-prediction = inference_program(word_dict)
-label = fluid.layers.data(name="label", shape=[1], dtype="int64")
-cost = fluid.layers.cross_entropy(input=prediction, label=label)
-avg_cost = fluid.layers.mean(cost)
-accuracy = fluid.layers.accuracy(input=prediction, label=label)
-return [avg_cost, accuracy]
+    prediction = inference_program(word_dict)
+    label = fluid.layers.data(name="label", shape=[1], dtype="int64")
+    cost = fluid.layers.cross_entropy(input=prediction, label=label)
+    avg_cost = fluid.layers.mean(cost)
+    accuracy = fluid.layers.accuracy(input=prediction, label=label)
+    return [avg_cost, accuracy]


 def optimizer_func():
-return fluid.optimizer.Adagrad(learning_rate=0.002)
+    return fluid.optimizer.Adagrad(learning_rate=0.002)
 ```

 ## 训练模型
@@ -236,9 +238,9 @@ word_dict = paddle.dataset.imdb.word_dict()

 print ("Reading training data....")
 train_reader = paddle.batch(
-paddle.reader.shuffle(
-paddle.dataset.imdb.train(word_dict), buf_size=25000),
-batch_size=BATCH_SIZE)
+    paddle.reader.shuffle(
+        paddle.dataset.imdb.train(word_dict), buf_size=25000),
+    batch_size=BATCH_SIZE)
 ```

 ### 构造训练器(trainer)
@@ -246,9 +248,9 @@ batch_size=BATCH_SIZE)

 ```python
 trainer = fluid.Trainer(
-train_func=partial(train_program, word_dict),
-place=place,
-optimizer_func=optimizer_func)
+    train_func=partial(train_program, word_dict),
+    place=place,
+    optimizer_func=optimizer_func)
 ```

 ### 提供数据
@@ -268,13 +270,13 @@ feed_order = ['words', 'label']
 params_dirname = "understand_sentiment_conv.inference.model"

 def event_handler(event):
-if isinstance(event, fluid.EndStepEvent):
-print("Step {0}, Epoch {1} Metrics {2}".format(
-event.step, event.epoch, map(np.array, event.metrics)))
+    if isinstance(event, fluid.EndStepEvent):
+        print("Step {0}, Epoch {1} Metrics {2}".format(
+                event.step, event.epoch, list(map(np.array, event.metrics))))

-if event.step == 10:
-trainer.save_params(params_dirname)
-trainer.stop()
+        if event.step == 10:
+            trainer.save_params(params_dirname)
+            trainer.stop()
 ```

 ### 开始训练
@@ -283,10 +285,10 @@ trainer.stop()

 ```python
 trainer.train(
-num_epochs=1,
-event_handler=event_handler,
-reader=train_reader,
-feed_order=feed_order)
+    num_epochs=1,
+    event_handler=event_handler,
+    reader=train_reader,
+    feed_order=feed_order)
 ```

 ## 应用模型
@@ -297,7 +299,7 @@ feed_order=feed_order)

 ```python
 inferencer = fluid.Inferencer(
-inference_program, param_path=params_dirname, place=place)
+        infer_func=partial(inference_program, word_dict), param_path=params_dirname, place=place)
 ```

 ### 生成测试用输入数据
@@ -307,14 +309,14 @@ inference_program, param_path=params_dirname, place=place)

 ```python
 reviews_str = [
-'read the book forget the movie', 'this is a great movie', 'this is very bad'
+    'read the book forget the movie', 'this is a great movie', 'this is very bad'
 ]
 reviews = [c.split() for c in reviews_str]

 UNK = word_dict['<unk>']
 lod = []
 for c in reviews:
-lod.append([word_dict.get(words, UNK) for words in c])
+    lod.append([word_dict.get(words, UNK) for words in c])

 base_shape = [[len(c) for c in lod]]

@@ -329,7 +331,7 @@ tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
 results = inferencer.infer({'words': tensor_words})

 for i, r in enumerate(results[0]):
-print("Predict probability of ", r[0], " to be positive and ", r[1], " to be negative for review \'", reviews_str[i], "\'")
+    print("Predict probability of ", r[0], " to be positive and ", r[1], " to be negative for review \'", reviews_str[i], "\'")

 ```
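
The `base_shape` list built above records how many words each review contains, which is what lets variable-length sequences share one flat tensor. The sketch below shows that bookkeeping in plain Python. It is an illustration of the idea only, not how Paddle stores a LoD tensor internally, and the word ids are hypothetical rather than taken from the real IMDB dictionary.

```python
def flatten_with_lengths(sequences):
    """Flatten variable-length sequences and record each sequence's length."""
    flat = [item for seq in sequences for item in seq]
    lengths = [len(seq) for seq in sequences]
    return flat, [lengths]  # one level of nesting, hence a list of lists


def split_by_lengths(flat, lengths):
    """Recover the original sequences from the flat data and the length list."""
    out, start = [], 0
    for n in lengths:
        out.append(flat[start:start + n])
        start += n
    return out


# Hypothetical ids for three reviews of 6, 5, and 4 words.
ids = [[11, 2, 35, 7, 2, 40], [9, 4, 3, 18, 40], [9, 4, 21, 60]]
flat, base_shape = flatten_with_lengths(ids)
print(base_shape)  # [[6, 5, 4]]
print(split_by_lengths(flat, base_shape[0]) == ids)  # True
```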


--- a/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/image/lstm_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/image/lstm_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/image/rnn.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/image/rnn.png
--- a/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/image/stacked_lstm_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/image/stacked_lstm_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/word2vec/.gitignore
+++ b/doc/fluid/new_docs/beginners_guide/basics/word2vec/.gitignore
-data/train.list
-data/test.list
-data/simple-examples*
--- a/doc/fluid/new_docs/beginners_guide/basics/word2vec/index.md
+++ b/doc/fluid/new_docs/beginners_guide/basics/word2vec/index.md

 # Word Vectors

-The source code for this tutorial is in [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec). First-time users should refer to the PaddlePaddle [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
+The source code for this tutorial is in [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec). First-time users should refer to the PaddlePaddle [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书). For more material, see this tutorial's [video course](http://bit.baidu.com/course/detail/id/175.html).

 ## Background

@@ -12,17 +12,19 @@

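 One-hot vectors are natural but of limited use. For example, in an online advertising system, suppose the user's query is “母亲节” (Mother's Day) and an ad has the keyword “康乃馨” (carnation). Common sense tells us the two words are related, since one usually gives one's mother a bouquet of carnations on Mother's Day. Yet under any distance measure between the corresponding one-hot vectors, whether Euclidean distance or cosine similarity, the vectors are orthogonal, so the two words appear completely unrelated. The root cause of this counterintuitive conclusion is that each word on its own carries too little information. Given only the two words, we cannot judge accurately whether they are related. To compute relatedness precisely we need more information: knowledge induced from large amounts of data by machine learning methods.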

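-In machine learning, various kinds of “knowledge” are represented by various models, and the word embedding model is one of them. A word embedding model maps a one-hot vector to a lower-dimensional real-valued vector (embedding vector), for example $embedding(Mother's\ Day) = [0.3, 4.2, -1.5, ...], embedding(Carnation) = [0.2, 5.6, -2.3, ...]$. In this real-valued representation, we want the vectors of two words that are similar in meaning (or usage) to be “more alike”, so that the cosine similarity between the vectors for “Mother's Day” and “carnation” is no longer zero.
+In machine learning, various kinds of “knowledge” are represented by various models, and the word embedding model is one of them. A word embedding model maps a one-hot vector to a lower-dimensional real-valued vector (embedding vector), for example $embedding(母亲节) = [0.3, 4.2, -1.5, ...], embedding(康乃馨) = [0.2, 5.6, -2.3, ...]$. In this real-valued representation, we want the vectors of two words that are similar in meaning (or usage) to be “more alike”, so that the cosine similarity between the vectors for “母亲节” (Mother's Day) and “康乃馨” (carnation) is no longer zero.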

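 A word embedding model can be a probabilistic model, a co-occurrence matrix model, or a neural network model. Before neural networks were used to learn word vectors, the traditional approach was to compute a co-occurrence matrix $X$ of words. $X$ is a $|V| \times |V|$ matrix in which $X_{ij}$ is the number of times the $i$-th and $j$-th words of the vocabulary `V` appear together across the whole corpus, and $|V|$ is the vocabulary size. Applying a matrix factorization to $X$ (such as singular value decomposition\[[5](#参考文献)\]) gives $U$, which is taken as the word vectors of all words: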

 $$X = USV^T$$
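
To make the construction of $X$ concrete, the sketch below counts co-occurrences on a toy corpus. It assumes that two words "appear together" when they share a sentence, which is one possible definition and is not specified by the text; the corpus is made up for illustration, and the SVD step itself is left out because it needs a linear algebra library.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how often each pair of distinct words shares a sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # keep X symmetric
    return counts


# A toy corpus made up for illustration.
corpus = [
    "mother loves carnation",
    "mother loves flowers",
    "carnation is a flower",
]
X = cooccurrence_counts(corpus)
vocab = sorted({w for s in corpus for w in s.split()})
# Dense |V| x |V| matrix; pairs that never co-occur are 0.
matrix = [[X[(wi, wj)] for wj in vocab] for wi in vocab]
print(vocab)
print(matrix)
```

Even on three sentences most entries are zero, which previews the sparsity problem with this approach.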

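-This traditional approach has many problems, however:<br/>
-1) Many word pairs never appear together, so the matrix is extremely sparse, and extra processing of the word frequencies is needed for the factorization to work well;<br/>
-2) The matrix is very large and its dimensionality too high (typically on the order of $10^6*10^6$);<br/>
-3) Stop words (such as although, a, ...) must be removed by hand, otherwise these frequent words also degrade the factorization.
+This traditional approach has many problems, however:
+
+1) Many word pairs never appear together, so the matrix is extremely sparse, and extra processing of the word frequencies is needed for the factorization to work well;
 
+2) The matrix is very large and its dimensionality too high (typically on the order of $10^6 \times 10^6$);
+
+3) Stop words (such as although, a, ...) must be removed by hand, otherwise these frequent words also degrade the factorization.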

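 Neural-network-based models do not need to compute and store a large table of statistics over the whole corpus. Instead they obtain word vectors by learning semantic information, which addresses the problems above well. In this chapter we show the details of training word vectors with a neural network and how to train a word embedding model with PaddlePaddle.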

@@ -31,19 +33,21 @@ $$X = USV^T$$

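 In this chapter, once the word vectors are trained, we can use the data visualization algorithm t-SNE\[[4](#参考文献)\] to plot a two-dimensional projection of the word features (see the figure below). The figure shows that semantically related words (such as a, the, these; big, huge) lie close together in the projection, while semantically unrelated words (such as say, business; decision, japan) lie far apart.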

-![2d_similarity](./image/2d_similarity.png)
 <p align="center">
-图1. 词向量的二维投影
+    <img src = "image/2d_similarity.png" width=400><br/>
+    图1. 词向量的二维投影
 </p>

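 On the other hand, the cosine of the angle between two vectors lies in the interval $[-1,1]$: two identical vectors have cosine 1, two perpendicular vectors have cosine 0, and two vectors pointing in exactly opposite directions have cosine -1. In other words, the larger the cosine, the more strongly related the two vectors are. We can therefore also compute the cosine similarity of two word vectors: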

 ```
-similarity: 0.899180685161
+
 please input two words: big huge
+similarity: 0.899180685161

 please input two words: from company
 similarity: -0.0997506977351
+
 ```
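
The cosine similarity reported above can be computed in a few lines. The following is a minimal plain-Python sketch, not the contents of `calculate_dis.py`; the second call uses only the three truncated embedding components quoted earlier in the text, so its value is illustrative rather than a real model output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors of two different words are orthogonal.
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0
# Dense embeddings of related words can point in similar directions.
print(cosine_similarity([0.3, 4.2, -1.5], [0.2, 5.6, -2.3]))
```

This is why one-hot vectors cannot express relatedness, while dense embeddings can.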

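 The results above can be obtained by running `calculate_dis.py`, which loads the words in the dictionary and their trained feature vectors. We describe its usage in detail in [Applying the Model](#应用模型).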
@@ -85,31 +89,31 @@ $$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$

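 where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ is the conditional probability of the current word $w_t$ given the previous $n-1$ words, and $R(\theta)$ is the parameter regularization term.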

-![nnlm](./image/nnlm.png)
 <p align="center">
-图2. N-gram神经网络模型
+       <img src="image/nnlm.png" width=500><br/>
+       图2. N-gram神经网络模型
 </p>

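 Figure 2 shows the N-gram neural network model. Read from bottom to top, the model consists of the following parts:
- For each sample, the model takes $w_{t-n+1},...w_{t-1}$ as input and outputs the probability of each of the `|V|` words in the dictionary being the $t$-th word of the sentence.
+ - For each sample, the model takes $w_{t-n+1},...w_{t-1}$ as input and outputs the probability of each of the `|V|` words in the dictionary being the $t$-th word of the sentence.

-Each input word $w_{t-n+1},...w_{t-1}$ is first mapped by a mapping matrix to its word vector $C(w_{t-n+1}),...C(w_{t-1})$.
+   Each input word $w_{t-n+1},...w_{t-1}$ is first mapped by a mapping matrix to its word vector $C(w_{t-n+1}),...C(w_{t-1})$.

- The word vectors of all the words are then concatenated into one large vector, which passes through a nonlinear mapping to give the hidden representation of the history words:
+ - The word vectors of all the words are then concatenated into one large vector, which passes through a nonlinear mapping to give the hidden representation of the history words:

-$$g=U\tanh(\theta^Tx + b_1) + Wx + b_2$$
+    $$g=U\tanh(\theta^Tx + b_1) + Wx + b_2$$

-Here $x$ is the large vector formed by concatenating the word vectors of all the words and represents the features of the text history; $\theta$, $U$, $b_1$, $b_2$, and $W$ are the parameters connecting the word-vector layer to the hidden and output layers. $g$ holds the unnormalized output probabilities of all words, and $g_i$ is the unnormalized output probability of the $i$-th word in the dictionary.
+    Here $x$ is the large vector formed by concatenating the word vectors of all the words and represents the features of the text history; $\theta$, $U$, $b_1$, $b_2$, and $W$ are the parameters connecting the word-vector layer to the hidden and output layers. $g$ holds the unnormalized output probabilities of all words, and $g_i$ is the unnormalized output probability of the $i$-th word in the dictionary.

- By the definition of softmax, normalizing $g_i$ gives the probability of generating the target word $w_t$:
+ - By the definition of softmax, normalizing $g_i$ gives the probability of generating the target word $w_t$:

-$$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$
+  $$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$

- The loss (cost) of the whole network is the multi-class cross-entropy, given by
+ - The loss (cost) of the whole network is the multi-class cross-entropy, given by

-$$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}\log(softmax(g_k^i))$$
+   $$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}\log(softmax(g_k^i))$$

-where $y_k^i$ is the true label (0 or 1) of class $k$ for the $i$-th sample, and $softmax(g_k^i)$ is the softmax output probability of class $k$ for the $i$-th sample.
+   where $y_k^i$ is the true label (0 or 1) of class $k$ for the $i$-th sample, and $softmax(g_k^i)$ is the softmax output probability of class $k$ for the $i$-th sample.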
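
The steps above can be sketched as a forward pass in plain Python. This is an illustration under toy assumptions, not the Paddle model defined later: the sizes (2 input features, 2 hidden units, 3 words) and every parameter value are made up, and `theta_t` denotes $\theta$ already transposed so that it has one row per hidden unit.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def ngram_scores(x, theta_t, b1, U, W, b2):
    """Unnormalized scores g = U tanh(theta^T x + b1) + W x + b2."""
    hidden = [math.tanh(z + b) for z, b in zip(matvec(theta_t, x), b1)]
    return [u + w + b for u, w, b in zip(matvec(U, hidden), matvec(W, x), b2)]


def target_probability(g, target):
    """Softmax probability of the word with index `target`."""
    m = max(g)
    exps = [math.exp(s - m) for s in g]
    return exps[target] / sum(exps)


# Toy parameters chosen only for illustration.
x = [0.5, -1.0]
theta_t = [[0.1, 0.2], [0.0, -0.3]]
b1 = [0.0, 0.1]
U = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
b2 = [0.0, 0.0, 0.0]

g = ngram_scores(x, theta_t, b1, U, W, b2)
p = target_probability(g, target=2)
loss = -math.log(p)  # cross-entropy of one sample whose true next word has index 2
print(g, p, loss)
```

Because the label is one-hot, the inner sum of $J(\theta)$ reduces to the single term `-log(p)` for the true word.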



@@ -117,9 +121,9 @@ $$J(\theta) = -\sum_{i=1}^N\sum_{c=1}^{|V|}y_k^{i}log(softmax(g_k^i))$$

 The CBOW model predicts the current word from its context (N words on each side). For N=2, the model is shown in the figure below:

-![cbow](./image/cbow.png)
 <p align="center">
-图3. CBOW模型
+    <img src="image/cbow.png" width=250><br/>
+    图3. CBOW模型
 </p>

 Specifically, ignoring the order of the context words, CBOW uses the mean of the context words' vectors to predict the current word. That is:
@@ -132,9 +136,9 @@ $$context = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$

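 The advantage of CBOW is that it smooths the distribution of context words over the word vectors and removes noise, so it is effective on small datasets. Skip-gram, by contrast, uses one word to predict its context, which yields many samples of the current word's context, so it can be used on larger datasets.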

-![skipgram](./image/skipgram.png)
 <p align="center">
-图4. Skip-gram模型
+    <img src="image/skipgram.png" width=250><br/>
+    图4. Skip-gram模型
 </p>

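 As the figure above shows, the Skip-gram model maps the vector of one word to the vectors of $2n$ words ($2n$ being the $n$ words before and the $n$ words after the current input word), and then sums the classification losses of these $2n$ words, each obtained through softmax.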
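
The difference between the two models is easiest to see in how training inputs are formed. The sketch below is an assumption-laden illustration, not Paddle code: `cbow_context` averages the neighbours' vectors as in the $context$ formula, and `skipgram_pairs` lists the (center, context) pairs that Skip-gram would train on. Both function names are hypothetical.

```python
def cbow_context(vectors, t, n):
    """Average the vectors of the n words before and after position t."""
    if not n <= t < len(vectors) - n:
        raise ValueError("position t needs n neighbours on each side")
    neighbours = [vectors[t + k] for k in range(-n, n + 1) if k != 0]
    dim = len(vectors[t])
    return [sum(v[d] for v in neighbours) / len(neighbours) for d in range(dim)]


def skipgram_pairs(words, n):
    """List (center, context) pairs using up to n words on each side."""
    pairs = []
    for t, center in enumerate(words):
        for k in range(-n, n + 1):
            if k != 0 and 0 <= t + k < len(words):
                pairs.append((center, words[t + k]))
    return pairs


# Made-up 2-dimensional word vectors for a 5-word window.
vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [1.0, 1.0], [2.0, 2.0]]
print(cbow_context(vectors, t=2, n=2))       # [1.0, 1.0]
print(skipgram_pairs(["a", "b", "c"], n=1))
```

CBOW produces one averaged input per position, while Skip-gram produces up to $2n$ pairs per position, which is where its larger number of training samples comes from.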
@@ -148,21 +152,21 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑，去

 <p align="center">
 <table>
-<tr>
-<td>训练数据</td>
-<td>验证数据</td>
-<td>测试数据</td>
-</tr>
-<tr>
-<td>ptb.train.txt</td>
-<td>ptb.valid.txt</td>
-<td>ptb.test.txt</td>
-</tr>
-<tr>
-<td>42068句</td>
-<td>3370句</td>
-<td>3761句</td>
-</tr>
+    <tr>
+        <td>训练数据</td>
+        <td>验证数据</td>
+        <td>测试数据</td>
+    </tr>
+    <tr>
+        <td>ptb.train.txt</td>
+        <td>ptb.valid.txt</td>
+        <td>ptb.test.txt</td>
+    </tr>
+    <tr>
+        <td>42068句</td>
+        <td>3370句</td>
+        <td>3761句</td>
+    </tr>
 </table>
 </p>

@@ -189,9 +193,9 @@ dream that one day <e>

 The model structure used in this configuration is shown in the figure below:

-![ngram](./image/ngram.png)
 <p align="center">
-图5. 模型配置中的N-gram神经网络模型
+    <img src="image/ngram.png" width=400><br/>
+    图5. 模型配置中的N-gram神经网络模型
 </p>

 First, load the required packages:
@@ -204,6 +208,7 @@ from functools import partial
 import math
 import os
 import sys
+from __future__ import print_function  # must be the first statement in the file, ahead of the other imports
 ```

 Then, define the parameters:
@@ -226,57 +231,57 @@ dict_size = len(word_dict)

 ```python
 def inference_program(is_sparse):
-first_word = fluid.layers.data(name='firstw', shape=[1], dtype='int64')
-second_word = fluid.layers.data(name='secondw', shape=[1], dtype='int64')
-third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='int64')
-fourth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='int64')
-
-embed_first = fluid.layers.embedding(
-input=first_word,
-size=[dict_size, EMBED_SIZE],
-dtype='float32',
-is_sparse=is_sparse,
-param_attr='shared_w')
-embed_second = fluid.layers.embedding(
-input=second_word,
-size=[dict_size, EMBED_SIZE],
-dtype='float32',
-is_sparse=is_sparse,
-param_attr='shared_w')
-embed_third = fluid.layers.embedding(
-input=third_word,
-size=[dict_size, EMBED_SIZE],
-dtype='float32',
-is_sparse=is_sparse,
-param_attr='shared_w')
-embed_fourth = fluid.layers.embedding(
-input=fourth_word,
-size=[dict_size, EMBED_SIZE],
-dtype='float32',
-is_sparse=is_sparse,
-param_attr='shared_w')
-
-concat_embed = fluid.layers.concat(
-input=[embed_first, embed_second, embed_third, embed_fourth], axis=1)
-hidden1 = fluid.layers.fc(input=concat_embed,
-size=HIDDEN_SIZE,
-act='sigmoid')
-predict_word = fluid.layers.fc(input=hidden1, size=dict_size, act='softmax')
-return predict_word
+    first_word = fluid.layers.data(name='firstw', shape=[1], dtype='int64')
+    second_word = fluid.layers.data(name='secondw', shape=[1], dtype='int64')
+    third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='int64')
+    fourth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='int64')
+
+    embed_first = fluid.layers.embedding(
+        input=first_word,
+        size=[dict_size, EMBED_SIZE],
+        dtype='float32',
+        is_sparse=is_sparse,
+        param_attr='shared_w')
+    embed_second = fluid.layers.embedding(
+        input=second_word,
+        size=[dict_size, EMBED_SIZE],
+        dtype='float32',
+        is_sparse=is_sparse,
+        param_attr='shared_w')
+    embed_third = fluid.layers.embedding(
+        input=third_word,
+        size=[dict_size, EMBED_SIZE],
+        dtype='float32',
+        is_sparse=is_sparse,
+        param_attr='shared_w')
+    embed_fourth = fluid.layers.embedding(
+        input=fourth_word,
+        size=[dict_size, EMBED_SIZE],
+        dtype='float32',
+        is_sparse=is_sparse,
+        param_attr='shared_w')
+
+    concat_embed = fluid.layers.concat(
+        input=[embed_first, embed_second, embed_third, embed_fourth], axis=1)
+    hidden1 = fluid.layers.fc(input=concat_embed,
+                              size=HIDDEN_SIZE,
+                              act='sigmoid')
+    predict_word = fluid.layers.fc(input=hidden1, size=dict_size, act='softmax')
+    return predict_word
 ```

 - Based on the neural network structure above, we can define our training program as follows

 ```python
 def train_program(is_sparse):
-# The declaration of 'next_word' must be after the invoking of inference_program,
-# or the data input order of train program would be [next_word, firstw, secondw,
-# thirdw, fourthw], which is not correct.
-predict_word = inference_program(is_sparse)
-next_word = fluid.layers.data(name='nextw', shape=[1], dtype='int64')
-cost = fluid.layers.cross_entropy(input=predict_word, label=next_word)
-avg_cost = fluid.layers.mean(cost)
-return avg_cost
+    # The declaration of 'next_word' must be after the invoking of inference_program,
+    # or the data input order of train program would be [next_word, firstw, secondw,
+    # thirdw, fourthw], which is not correct.
+    predict_word = inference_program(is_sparse)
+    next_word = fluid.layers.data(name='nextw', shape=[1], dtype='int64')
+    cost = fluid.layers.cross_entropy(input=predict_word, label=next_word)
+    avg_cost = fluid.layers.mean(cost)
+    return avg_cost
 ```

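 - Now we can start training, which is much simpler than in earlier versions. Ready-made training and test sets are available: `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()`. Both return a reader. In PaddlePaddle, a reader is a Python function that reads the next data item each time it is called; it is a Python generator.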
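
To illustrate the reader idea, here is a plain-Python sketch of a reader that yields N-word windows and a helper that groups samples into batches. Both functions are hypothetical stand-ins written for this example; they are not the implementations of `paddle.dataset.imikolov.train` or `paddle.batch`, and the real dataset reader yields word ids rather than strings.

```python
def ngram_reader(sentences, n):
    """Return a reader: a function that yields n-word windows when called."""
    def reader():
        for sentence in sentences:
            words = sentence.split()
            for i in range(len(words) - n + 1):
                yield tuple(words[i:i + n])
    return reader


def batch(reader, batch_size):
    """Group samples from `reader` into lists of at most `batch_size` items."""
    def batched_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) == batch_size:
                yield buf
                buf = []
        if buf:
            yield buf  # final, possibly smaller, batch
    return batched_reader


reader = ngram_reader(["i have a dream that one day"], n=5)
for mini_batch in batch(reader, batch_size=2)():
    print(mini_batch)
```

Because the data is produced lazily by a generator, the whole corpus never has to sit in memory at once.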
@@ -285,59 +290,59 @@ return avg_cost

 ```python
 def optimizer_func():
-# Note here we need to choose more sophisticated optimizers
-# such as AdaGrad with a decay rate. The normal SGD converges
-# very slowly.
-# optimizer=fluid.optimizer.SGD(learning_rate=0.001),
-return fluid.optimizer.AdagradOptimizer(
-learning_rate=3e-3,
-regularization=fluid.regularizer.L2DecayRegularizer(8e-4))
+    # Note here we need to choose more sophisticated optimizers
+    # such as AdaGrad with a decay rate. The normal SGD converges
+    # very slowly.
+    # optimizer=fluid.optimizer.SGD(learning_rate=0.001),
+    return fluid.optimizer.AdagradOptimizer(
+        learning_rate=3e-3,
+        regularization=fluid.regularizer.L2DecayRegularizer(8e-4))


 def train(use_cuda, train_program, params_dirname):
-train_reader = paddle.batch(
-paddle.dataset.imikolov.train(word_dict, N), BATCH_SIZE)
-test_reader = paddle.batch(
-paddle.dataset.imikolov.test(word_dict, N), BATCH_SIZE)
-
-place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
-
-def event_handler(event):
-if isinstance(event, fluid.EndStepEvent):
-# We output cost every 10 steps.
-if event.step % 10 == 0:
-outs = trainer.test(
-reader=test_reader,
-feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])
-avg_cost = outs[0]
-
-print "Step %d: Average Cost %f" % (event.step, avg_cost)
-
-# If average cost is lower than 5.8, we consider the model good enough to stop.
-# Note 5.8 is a relatively high value. In order to get a better model, one should
-# aim for avg_cost lower than 3.5. But the training could take longer time.
-if avg_cost < 5.8:
-trainer.save_params(params_dirname)
-trainer.stop()
-
-if math.isnan(avg_cost):
-sys.exit("got NaN loss, training failed.")
-
-trainer = fluid.Trainer(
-train_func=train_program,
-optimizer_func=optimizer_func,
-place=place)
-
-trainer.train(
-reader=train_reader,
-num_epochs=1,
-event_handler=event_handler,
-feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])
+    train_reader = paddle.batch(
+        paddle.dataset.imikolov.train(word_dict, N), BATCH_SIZE)
+    test_reader = paddle.batch(
+        paddle.dataset.imikolov.test(word_dict, N), BATCH_SIZE)
+
+    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+
+    def event_handler(event):
+        if isinstance(event, fluid.EndStepEvent):
+            # We output cost every 10 steps.
+            if event.step % 10 == 0:
+                outs = trainer.test(
+                    reader=test_reader,
+                    feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])
+                avg_cost = outs[0]
+
+                print("Step %d: Average Cost %f" % (event.step, avg_cost))
+
+                # If average cost is lower than 5.8, we consider the model good enough to stop.
+                # Note 5.8 is a relatively high value. In order to get a better model, one should
+                # aim for avg_cost lower than 3.5. But the training could take longer time.
+                if avg_cost < 5.8:
+                    trainer.save_params(params_dirname)
+                    trainer.stop()
+
+                if math.isnan(avg_cost):
+                    sys.exit("got NaN loss, training failed.")
+
+    trainer = fluid.Trainer(
+        train_func=train_program,
+        optimizer_func=optimizer_func,
+        place=place)
+
+    trainer.train(
+        reader=train_reader,
+        num_epochs=1,
+        event_handler=event_handler,
+        feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])
 ```

 - `trainer.train` starts the training. The monitoring output from `event_handler` looks like this:

-```python
+```text
 Step 0: Average Cost 7.337213
 Step 10: Average Cost 6.136128
 Step 20: Average Cost 5.766995
@@ -352,50 +357,49 @@ Step 20: Average Cost 5.766995

 ```python
 def infer(use_cuda, inference_program, params_dirname=None):
-place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
-inferencer = fluid.Inferencer(
-infer_func=inference_program, param_path=params_dirname, place=place)
-
-# Setup inputs by creating 4 LoDTensors representing 4 words. Here each word
-# is simply an index to look up for the corresponding word vector and hence
-# the shape of word (base_shape) should be [1]. The length-based level of
-# detail (lod) info of each LoDtensor should be [[1]] meaning there is only
-# one lod_level and there is only one sequence of one word on this level.
-# Note that lod info should be a list of lists.
-
-data1 = [[211]]  # 'among'
-data2 = [[6]]    # 'a'
-data3 = [[96]]   # 'group'
-data4 = [[4]]    # 'of'
-lod = [[1]]
-
-first_word  = fluid.create_lod_tensor(data1, lod, place)
-second_word = fluid.create_lod_tensor(data2, lod, place)
-third_word  = fluid.create_lod_tensor(data3, lod, place)
-fourth_word = fluid.create_lod_tensor(data4, lod, place)
-
-result = inferencer.infer(
-{
-'firstw': first_word,
-'secondw': second_word,
-'thirdw': third_word,
-'fourthw': fourth_word
-},
-return_numpy=False)
-
-print(numpy.array(result[0]))
-most_possible_word_index = numpy.argmax(result[0])
-print(most_possible_word_index)
-print([
-key for key, value in word_dict.iteritems()
-if value == most_possible_word_index
-][0])
+    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+    inferencer = fluid.Inferencer(
+        infer_func=inference_program, param_path=params_dirname, place=place)
+
+    # Setup inputs by creating 4 LoDTensors representing 4 words. Here each word
+    # is simply an index to look up for the corresponding word vector and hence
+    # the shape of word (base_shape) should be [1]. The length-based level of
+    # detail (lod) info of each LoDtensor should be [[1]] meaning there is only
+    # one lod_level and there is only one sequence of one word on this level.
+    # Note that lod info should be a list of lists.
+
+    data1 = [[211]]  # 'among'
+    data2 = [[6]]    # 'a'
+    data3 = [[96]]   # 'group'
+    data4 = [[4]]    # 'of'
+    lod = [[1]]
+
+    first_word  = fluid.create_lod_tensor(data1, lod, place)
+    second_word = fluid.create_lod_tensor(data2, lod, place)
+    third_word  = fluid.create_lod_tensor(data3, lod, place)
+    fourth_word = fluid.create_lod_tensor(data4, lod, place)
+
+    result = inferencer.infer(
+        {
+            'firstw': first_word,
+            'secondw': second_word,
+            'thirdw': third_word,
+            'fourthw': fourth_word
+        },
+        return_numpy=False)
+
+    print(numpy.array(result[0]))
+    most_possible_word_index = numpy.argmax(result[0])
+    print(most_possible_word_index)
+    print([
+        key for key, value in word_dict.items()
+        if value == most_possible_word_index
+    ][0])
 ```

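 After a short training run of 3 minutes, we get the prediction below. Our model predicts that the word following `among a group of` is `a`, which is fairly consistent with grammar. If we train longer, say for a few hours, the predicted next word becomes `workers`.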

-
-```python
+```text
 [[0.00106646 0.0007907  0.00072041 ... 0.00049024 0.00041355 0.00084464]]
 6
 a
@@ -405,20 +409,20 @@ a

 ```python
 def main(use_cuda, is_sparse):
-if use_cuda and not fluid.core.is_compiled_with_cuda():
-return
+    if use_cuda and not fluid.core.is_compiled_with_cuda():
+        return

-params_dirname = "word2vec.inference.model"
+    params_dirname = "word2vec.inference.model"

-train(
-use_cuda=use_cuda,
-train_program=partial(train_program, is_sparse),
-params_dirname=params_dirname)
+    train(
+        use_cuda=use_cuda,
+        train_program=partial(train_program, is_sparse),
+        params_dirname=params_dirname)

-infer(
-use_cuda=use_cuda,
-inference_program=partial(inference_program, is_sparse),
-params_dirname=params_dirname)
+    infer(
+        use_cuda=use_cuda,
+        inference_program=partial(inference_program, is_sparse),
+        params_dirname=params_dirname)


 main(use_cuda=use_cuda, is_sparse=True)

--- a/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/cbow_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/cbow_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/ngram.en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/ngram.en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/nnlm_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/nnlm_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/skipgram_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/skipgram_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/README.cn.md
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/README.cn.md
@@ -231,11 +231,13 @@ trainer.train(
    event_handler=event_handler_plot,
    feed_order=feed_order)
 ```
+
 <p align="center">
-    <img src = "image/train_and_test.png" width=400><br/>
+    <img src = "image/train_and_test1.png" width=400><br/>
    图3. 训练结果
 </p>

+
 ## Prediction
 Provide an `inference_program` and a `params_dirname` to initialize the inferencer. `params_dirname` is where our parameters are stored.


--- a/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/image/train_and_test.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/image/train_and_test.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/README.cn.md
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/README.cn.md
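 # Recognize Digits

-The source code for this tutorial is in [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits). First-time users should refer to the PaddlePaddle [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
+The source code for this tutorial is in [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits). First-time users should refer to the PaddlePaddle [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书). For more material, see this tutorial's [video course](http://bit.baidu.com/course/detail/id/167.html).

 ## Background
 When learning to program, the first program we write usually prints "Hello World". For machine learning (or deep learning), the introductory tutorial is usually handwritten digit recognition on the [MNIST](http://yann.lecun.com/exdb/mnist/) database. The reason is that handwriting recognition is a typical image classification problem and fairly simple, and the MNIST dataset is also very complete. As a simple computer vision dataset, MNIST contains a series of handwritten digit images like those in Figure 1, together with their labels. Each image is a 28x28 pixel matrix, and each label is one of the 10 digits 0~9. Every image has been size-normalized and centered.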

-![MNIST](./image/mnist_example_image.png)
-<p align="center">图1. MNIST图片示例</p>
+<p align="center">
+    <img src="image/mnist_example_image.png" width="400"><br/>
+    图1. MNIST图片示例
+</p>

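 The MNIST dataset is built from Special Database 3 (SD-3) and Special Database 1 (SD-1) of [NIST](https://www.nist.gov/srd/nist-special-database-19). SD-3 was collected from employees of the U.S. Census Bureau and SD-1 from American high school students, so SD-3 is cleaner and easier to recognize than SD-1. Yann LeCun et al. composed the MNIST training set (60,000 samples) and test set (10,000 samples) so that half of each set is drawn from SD-1 and half from SD-3. The training set comes from about 250 different writers, and the writers of the training set and the test set do not overlap.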

@@ -36,44 +38,54 @@ $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$

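 For a multi-class problem with $N$ classes, we specify $N$ output nodes. Softmax normalizes the $N$-dimensional result vector into $N$ real values in the range [0,1], which represent the probabilities that the sample belongs to each of the $N$ classes. Here $y_i$ is the predicted probability that the image is the digit $i$.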

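-In classification problems we generally use the cross-entropy cost function (cross entropy), given by:
+In classification problems we generally use the cross-entropy loss function (cross entropy loss), given by:

-$$  \text{crossentropy}(label, y) = -\sum_i label_i\log(y_i) $$
+$$  L_{\text{cross-entropy}} (label, y) = -\sum_i label_i\log(y_i) $$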
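
A small worked instance of this loss, as a plain-Python sketch. The label is assumed to be one-hot and the prediction values are hypothetical softmax outputs, not results from a trained model.

```python
import math


def cross_entropy(label, y):
    """-sum_i label_i * log(y_i) for a one-hot label and predicted probabilities y."""
    # Terms with label_i == 0 contribute nothing, so they are skipped to avoid log(0).
    return -sum(l * math.log(p) for l, p in zip(label, y) if l > 0)


label = [0, 0, 1]      # the true class is index 2
y = [0.1, 0.2, 0.7]    # hypothetical softmax output
print(cross_entropy(label, y))  # equals -log(0.7)
```

The loss depends only on the probability assigned to the true class, and it shrinks toward 0 as that probability approaches 1.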

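 Figure 2 shows the network diagram of softmax regression. Weights are drawn as blue lines and biases as red lines, and +1 indicates that the coefficient of the bias parameter is 1.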

-![softmaxRegression](./image/softmax_regression.png)
-<p align="center">图2. softmax回归网络结构图</p>
+<p align="center">
+<img src="image/softmax_regression.png" width=400><br/>
+图2. softmax回归网络结构图<br/>
+</p>

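 ### Multilayer Perceptron (MLP)

 The softmax regression model uses the simplest two-layer neural network, with only an input layer and an output layer, so its fitting capacity is limited. To achieve better recognition, we consider adding several hidden layers between the input layer and the output layer\[[10](#参考文献)\].

 1.  After the first hidden layer we obtain $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function; common choices are sigmoid, tanh, and ReLU.
+
 2.  After the second hidden layer we obtain $ H_2 = \phi(W_2H_1 + b_2) $.
+
 3.  Finally, after the output layer we obtain $Y=\text{softmax}(W_3H_2 + b_3)$, which is the final classification result vector.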
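
The three steps above can be sketched as a forward pass in plain Python. This is a minimal illustration, not the Paddle network: ReLU is chosen as $\phi$ for concreteness, and the tiny weight matrices are made up so the result is easy to follow by hand.

```python
import math


def dense(weights, bias, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    """Elementwise max(0, z)."""
    return [max(0.0, z) for z in v]


def softmax(v):
    """Normalize scores into probabilities."""
    m = max(v)
    exps = [math.exp(z - m) for z in v]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, layers):
    """layers = [(W1, b1), (W2, b2), (W3, b3)]; ReLU plays the role of phi."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h1 = relu(dense(w1, b1, x))        # H1 = phi(W1 X + b1)
    h2 = relu(dense(w2, b2, h1))       # H2 = phi(W2 H1 + b2)
    return softmax(dense(w3, b3, h2))  # Y = softmax(W3 H2 + b3)


# Made-up weights: 2 inputs -> 2 hidden -> 1 hidden -> 2 classes.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
    ([[1.0], [-1.0]], [0.0, 0.0]),
]
y = mlp_forward([1.0, 2.0], layers)
print(y)
```

For MNIST the input would be the 784 flattened pixel values and the output would have 10 entries, one per digit.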


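 Figure 3 shows the network structure of the multilayer perceptron. Weights are drawn as blue lines and biases as red lines, and +1 indicates that the coefficient of the bias parameter is 1.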

-![multilayerPerceptron](./image/mlp.png)
-<p align="center">图3. 多层感知器网络结构图</p>
+<p align="center">
+<img src="image/mlp.png" width=500><br/>
+图3. 多层感知器网络结构图<br/>
+</p>

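 ### Convolutional Neural Network (CNN)

 In the multilayer perceptron model, the image is flattened into a one-dimensional vector before being fed into the network, which discards the positional and structural information of the image. A convolutional neural network makes better use of the image's structure. [LeNet-5](http://yann.lecun.com/exdb/lenet/) is a relatively simple convolutional neural network. Figure 4 shows its structure: the two-dimensional input image passes twice through a convolutional layer followed by a pooling layer, then through fully connected layers, and finally a softmax classifier serves as the output layer. Below we mainly introduce the convolutional and pooling layers.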

-![cnnStructure](./image/cnn.png)
-<p align="center">图4. LeNet-5卷积神经网络结构</p>
+<p align="center">
+<img src="image/cnn.png"><br/>
+图4. LeNet-5卷积神经网络结构<br/>
+</p>

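 #### Convolutional Layer

 The convolutional layer is the core building block of a convolutional neural network. The convolution referred to in image recognition is two-dimensional convolution, in which a discrete two-dimensional filter (also called a convolution kernel) is convolved with a two-dimensional image. Put simply, the filter slides over every position of the image and, at each position, takes the inner product with that pixel and its neighboring pixels. Convolution is widely used in image processing, and different kernels extract different features, such as edges, lines, and corners. In deep convolutional neural networks, convolution can extract features ranging from low-level to complex.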

-![cnn](https://raw.githubusercontent.com/PaddlePaddle/book/develop/02.recognize_digits/image/conv_layer.png)
-<p align="center">图5. 卷积层图片</p>
+<p align="center">
+<img src="image/conv_layer.png" width='750'><br/>
+Figure 5. Convolution layer<br/>
+</p>

-图5给出一个卷积计算过程的示例图，输入图像大小为$H=5,W=5,D=3$，即$5 \times 5$大小的3通道（RGB，也称作深度）彩色图像。这个示例图中包含两（用$K$表示）组卷积核，即图中滤波器$W_0$和$W_1$。在卷积计算中，通常对不同的输入通道采用不同的卷积核，如图示例中每组卷积核包含（$D=3$）个$3 \times 3$（用$F \times F$表示）大小的卷积核。另外，这个示例中卷积核在图像的水平方向（$W$方向）和垂直方向（$H$方向）的滑动步长为2（用$S$表示）；对输入图像周围各填充1（用$P$表示）个0，即图中输入层原始数据为蓝色部分，灰色部分是进行了大小为1的扩展，用0来进行扩展。经过卷积操作得到输出为$3 \times 3 \times 2$（用$H_{o} \times W_{o} \times K$表示）大小的特征图，即$3 \times 3$大小的2通道特征图，其中$H_o$计算公式为：$H_o = (H - F + 2 \times P)/S + 1$，$W_o$同理。 而输出特征图中的每个像素，是每组滤波器与输入图像每个特征图的内积再求和，再加上偏置$b_o$，偏置通常对于每个输出特征图是共享的。输出特征图$o[:,:,0]$中的最后一个$-2$计算如图5右下角公式所示。
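+Figure 5 illustrates a convolution computation. The input image has size $H=5,W=5,D=3$, that is, a $5 \times 5$ color image with 3 channels (RGB, also called depth). The example contains two (denoted $K$) groups of convolution kernels, the filters $W_0$ and $W_1$ in the figure. A convolution usually applies a different kernel to each input channel, so each group in the example contains $D=3$ kernels of size $3 \times 3$ (denoted $F \times F$). The kernels slide with a stride of 2 (denoted $S$) in both the horizontal ($W$) and vertical ($H$) directions of the image. The input is padded with 1 (denoted $P$) ring of zeros: the blue part of the input layer in the figure is the original data and the gray part is the zero padding of size 1. The convolution produces an output of size $3 \times 3 \times 2$ (denoted $H_{o} \times W_{o} \times K$), that is, a 2-channel feature map of size $3 \times 3$, where $H_o = (H - F + 2 \times P)/S + 1$ and $W_o$ is computed in the same way. Each pixel of the output feature map is the sum of the inner products between one group of filters and the corresponding input channels, plus the bias $b_o$, which is usually shared across each output feature map. The computation of the last value, $-2$, in the output feature map $o[:,:,0]$ is shown in the formula at the bottom right of Figure 5.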

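 The convolution kernels are learnable parameters. Following the example above, each convolution layer has $D \times F \times F \times K$ kernel weights. In a multilayer perceptron the neurons are usually fully connected, which requires many parameters. A convolution layer needs far fewer, which follows from its two main properties: local connectivity and weight sharing.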
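
The output-size formula and the weight count can be verified with the numbers from Figure 5. The helper names below are hypothetical and exist only for this sketch; biases are not counted.

```python
def conv_output_size(size, filter_size, padding, stride):
    # H_o = (H - F + 2 * P) / S + 1, assuming the division is exact.
    return (size - filter_size + 2 * padding) // stride + 1

def conv_weight_count(depth, filter_size, num_filters):
    # D * F * F * K kernel weights.
    return depth * filter_size * filter_size * num_filters

# Values from the Figure 5 example: H = W = 5, D = 3, F = 3, K = 2, S = 2, P = 1.
h_out = conv_output_size(size=5, filter_size=3, padding=1, stride=2)
n_weights = conv_weight_count(depth=3, filter_size=3, num_filters=2)
print(h_out, n_weights)  # 3 54
```

For comparison, a fully connected layer mapping the same $5 \times 5 \times 3$ input to a $3 \times 3 \times 2$ output would need $75 \times 18 = 1350$ weights.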

@@ -85,19 +97,22 @@ Softmax回归模型采用了最简单的两层神经网络，即只有输入层

 #### Pooling Layer

-![pooling](./image/max_pooling.png)
-<p align="center">图6. 池化层图片</p>
+<p align="center">
+<img src="image/max_pooling.png" width="400px"><br/>
+Figure 6. Pooling layer<br/>
+</p>

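 Pooling is a form of non-linear downsampling. Its main purpose is to reduce the amount of computation by shrinking the feature maps, and with them the number of parameters in the layers that follow. It also helps control overfitting to some extent. A pooling layer is usually added after a convolution layer. Pooling includes max pooling, average pooling, and other variants. Max pooling divides the input into non-overlapping rectangular regions and outputs the maximum value of each region, as shown in Figure 6.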
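
Max pooling over non-overlapping regions is easy to reproduce in NumPy. The sketch below assumes a single-channel input whose height and width are even, and uses a $2 \times 2$ window; the input values are made up for illustration and are not taken from Figure 6.

```python
import numpy as np

def max_pool_2x2(x):
    # Split the (h, w) input into non-overlapping 2x2 blocks and
    # keep the maximum of each block. Assumes h and w are even.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool_2x2(x)
print(pooled)
# [[6 8]
#  [3 4]]
```

The $4 \times 4$ input becomes a $2 \times 2$ output, so every later layer sees a quarter of the values.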

 For more details about convolutional neural networks, see the [Stanford open course](http://cs231n.github.io/convolutional-networks/) and the [Image Classification](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md) tutorial.

 ### Common Activation Functions
+
 - sigmoid activation function: $ f(x) = \text{sigmoid}(x) = \frac{1}{1+e^{-x}} $

 - tanh activation function: $ f(x) = \tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $

-实际上，tanh函数只是规模变化的sigmoid函数，将sigmoid函数值放大2倍之后再向下平移1个单位：tanh(x) = 2sigmoid(2x) - 1 。
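+  In fact, the tanh function is just a rescaled sigmoid function: evaluate sigmoid at $2x$, double the result, and shift it down by 1 unit, so that $\tanh(x) = 2\,\text{sigmoid}(2x) - 1$.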

 - ReLU activation function: $ f(x) = \max(0, x) $
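
The relation between tanh and sigmoid stated above can be checked numerically. This is a small NumPy sketch with hypothetical helper names, not part of the PaddlePaddle API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

x = np.linspace(-3, 3, 13)

# Largest difference between tanh(x) and 2 * sigmoid(2x) - 1 on the grid;
# it should be at the level of floating-point rounding error.
gap = np.max(np.abs(np.tanh(x) - (2 * sigmoid(2 * x) - 1)))

# ReLU clips negative inputs to zero and leaves positive inputs unchanged.
clipped = relu(np.array([-1.0, 2.0]))
```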

@@ -107,35 +122,13 @@ Softmax回归模型采用了最简单的两层神经网络，即只有输入层

 PaddlePaddle provides the module `paddle.dataset.mnist` in its API, which automatically loads the [MNIST](http://yann.lecun.com/exdb/mnist/) data. The loaded data is stored under `/home/username/.cache/paddle/dataset/mnist`:

-<p align="center">
-<table>
-    <thead>
-    <tr>
-        <th>文件名称</th>
-        <th>说明</th>
-    </tr>
-    </thead>
-
-    <tbody>
-    <tr>
-        <td>train-images-idx3-ubyte</td>
-        <td>训练数据图片，60,000条数据</td>
-    </tr>
-    <tr>
-        <td>train-labels-idx1-ubyte</td>
-        <td>训练数据标签，60,000条数据</td>
-    </tr>
-    <tr>
-        <td>t10k-images-idx3-ubyte</td>
-        <td>测试数据图片，10,000条数据</td>
-    </tr>
-    <tr>
-        <td>t10k-labels-idx1-ubyte</td>
-        <td>测试数据标签，10,000条数据</td>
-    </tr>
-    </tbody>
-</table>
-</p>
+
+| File name               | Description                      |
+|-------------------------|----------------------------------|
+| train-images-idx3-ubyte | Training images, 60,000 samples  |
+| train-labels-idx1-ubyte | Training labels, 60,000 samples  |
+| t10k-images-idx3-ubyte  | Test images, 10,000 samples      |
+| t10k-labels-idx1-ubyte  | Test labels, 10,000 samples      |

 ## Fluid API Overview

@@ -143,18 +136,20 @@ PaddlePaddle在API中提供了自动加载[MNIST](http://yann.lecun.com/exdb/mni
 We recommend the Fluid API because it is easier to learn.

 Here is a quick overview of the Fluid API.
+
 1. `inference_program`: a function that specifies how to obtain a prediction from the data input.
 This is where the network flow is defined.

-1. `train_program`：指定如何从 `inference_program` 和`标签值`中获取 `loss` 的函数。
+2. `train_program`: a function that specifies how to obtain the `loss` from `inference_program` and the `label`.
 This is where the loss calculation is defined.

-1. `optimizer_func`: “指定优化器配置的函数。优化器负责减少损失并驱动培训。Paddle 支持多种不同的优化器。
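+3. `optimizer_func`: a function that specifies the optimizer configuration. The optimizer is responsible for reducing the loss and driving the training. Paddle supports many different optimizers.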

-1. `Trainer`：PaddlePaddle Trainer 管理由 `train_program` 和 `optimizer` 指定的训练过程。
+4. `Trainer`: the PaddlePaddle Trainer manages the training process specified by `train_program` and `optimizer`.
 Through the `event_handler` callback function, users can monitor the training progress.

-1. `Inferencer`：Fluid inferencer 加载 `inference_program` 和由 Trainer 训练的参数。
+5. `Inferencer`: the Fluid Inferencer loads `inference_program` and the parameters trained by the Trainer.
+
 It can then run inference on data and return predictions.

 In this demo, we take a closer look at each of them.
@@ -165,6 +160,7 @@ PaddlePaddle在API中提供了自动加载[MNIST](http://yann.lecun.com/exdb/mni
 ```python
+from __future__ import print_function
 import paddle
 import paddle.fluid as fluid
 ```

 ### Program Functions Configuration
@@ -177,51 +173,51 @@ import paddle.fluid as fluid

 ```python
 def softmax_regression():
-img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
-predict = fluid.layers.fc(
-input=img, size=10, act='softmax')
-return predict
+    img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
+    predict = fluid.layers.fc(
+        input=img, size=10, act='softmax')
+    return predict
 ```

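 - Multilayer perceptron: the code below implements a multilayer perceptron with two hidden layers (fully connected layers). Both hidden layers use ReLU as the activation function, and the output layer uses Softmax.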

 ```python
 def multilayer_perceptron():
-img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
-# 第一个全连接层，激活函数为ReLU
-hidden = fluid.layers.fc(input=img, size=200, act='relu')
-# 第二个全连接层，激活函数为ReLU
-hidden = fluid.layers.fc(input=hidden, size=200, act='relu')
-# 以softmax为激活函数的全连接输出层，输出层的大小必须为数字的个数10
-prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
-return prediction
+    img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
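+    # First fully connected layer, with ReLU activation
+    hidden = fluid.layers.fc(input=img, size=200, act='relu')
+    # Second fully connected layer, with ReLU activation
+    hidden = fluid.layers.fc(input=hidden, size=200, act='relu')
+    # Fully connected output layer with softmax activation; its size must be 10, the number of digit classes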
+    prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
+    return prediction
 ```

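 - Convolutional neural network LeNet-5: the two-dimensional input image first passes through two stages of a convolution layer followed by a pooling layer, and then through a fully connected output layer with softmax activation.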

 ```python
 def convolutional_neural_network():
-img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
-# 第一个卷积-池化层
-conv_pool_1 = fluid.nets.simple_img_conv_pool(
-input=img,
-filter_size=5,
-num_filters=20,
-pool_size=2,
-pool_stride=2,
-act="relu")
-conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
-# 第二个卷积-池化层
-conv_pool_2 = fluid.nets.simple_img_conv_pool(
-input=conv_pool_1,
-filter_size=5,
-num_filters=50,
-pool_size=2,
-pool_stride=2,
-act="relu")
-# 以softmax为激活函数的全连接输出层，输出层的大小必须为数字的个数10
-prediction = fluid.layers.fc(input=conv_pool_2, size=10, act='softmax')
-return prediction
+    img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
+    # First convolution-pooling layer
+    conv_pool_1 = fluid.nets.simple_img_conv_pool(
+        input=img,
+        filter_size=5,
+        num_filters=20,
+        pool_size=2,
+        pool_stride=2,
+        act="relu")
+    conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
+    # Second convolution-pooling layer
+    conv_pool_2 = fluid.nets.simple_img_conv_pool(
+        input=conv_pool_1,
+        filter_size=5,
+        num_filters=50,
+        pool_size=2,
+        pool_stride=2,
+        act="relu")
+    # Fully connected output layer with softmax activation; its size must be 10, the number of digit classes
+    prediction = fluid.layers.fc(input=conv_pool_2, size=10, act='softmax')
+    return prediction
 ```

 #### Train Program Configuration
@@ -234,18 +230,16 @@ return prediction

 ```python
 def train_program():
-label = fluid.layers.data(name='label', shape=[1], dtype='int64')
-
-# predict = softmax_regression() # uncomment for Softmax回归
-# predict = multilayer_perceptron() # uncomment for 多层感知器
-predict = convolutional_neural_network() # uncomment for LeNet5卷积神经网络
-cost = fluid.layers.cross_entropy(input=predict, label=label)
-avg_cost = fluid.layers.mean(cost)
-acc = fluid.layers.accuracy(input=predict, label=label)
-return [avg_cost, acc]
+    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

+    # predict = softmax_regression() # uncomment for softmax regression
+    # predict = multilayer_perceptron() # uncomment for the multilayer perceptron
+    predict = convolutional_neural_network() # LeNet-5 convolutional neural network (active by default)
+    cost = fluid.layers.cross_entropy(input=predict, label=label)
+    avg_cost = fluid.layers.mean(cost)
+    acc = fluid.layers.accuracy(input=predict, label=label)
+    return [avg_cost, acc]

-# 该模型运行在单个CPU上
 ```

 #### Optimizer Function Configuration
@@ -254,25 +248,25 @@ return [avg_cost, acc]

 ```python
 def optimizer_program():
-return fluid.optimizer.Adam(learning_rate=0.001)
+    return fluid.optimizer.Adam(learning_rate=0.001)
 ```

 ### Dataset Feeders Configuration

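 Next, we start the training process. `paddle.dataset.mnist.train()` and `paddle.dataset.mnist.test()` provide the training and test datasets respectively. Each of these functions returns a reader. In PaddlePaddle, a reader is a Python function that returns a Python yield generator each time it is called.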

-下面`shuffle`是一个reader decorator，它接受一个reader A，返回另一个reader B —— reader B 每次读入`buffer_size`条训练数据到一个buffer里，然后随机打乱其顺序，并且逐条输出。
+Below, `shuffle` is a reader decorator. It takes a reader A and returns another reader B. Reader B reads `buf_size` training samples into a buffer each time, shuffles them randomly, and then outputs them one by one.

-`batch`是一个特殊的decorator，它的输入是一个reader，输出是一个batched reader —— 在PaddlePaddle里，一个reader每次yield一条训练数据，而一个batched reader每次yield一个minibatch。
+`batch` is a special decorator: its input is a reader and its output is a batched reader. In PaddlePaddle, a reader yields one training sample at a time, while a batched reader yields one minibatch at a time.

 ```python
 train_reader = paddle.batch(
-paddle.reader.shuffle(
-paddle.dataset.mnist.train(), buf_size=500),
-batch_size=64)
+    paddle.reader.shuffle(
+        paddle.dataset.mnist.train(), buf_size=500),
+    batch_size=64)

 test_reader = paddle.batch(
-paddle.dataset.mnist.test(), batch_size=64)
+    paddle.dataset.mnist.test(), batch_size=64)
 ```
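
The reader, shuffle, and batch concepts described in this section can be imitated in plain Python. The sketch below is not PaddlePaddle's implementation; `number_reader`, `shuffle`, and `batch` are hypothetical stand-ins that only follow the behavior described in the text (fill a buffer, shuffle it, emit samples one by one, then group samples into minibatches).

```python
import random

def number_reader():
    # A reader: each call returns a generator that yields one sample at a time.
    for i in range(10):
        yield i

def shuffle(reader, buf_size):
    # Reader decorator: buffers buf_size samples, shuffles them, emits them one by one.
    def shuffled_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                random.shuffle(buf)
                for s in buf:
                    yield s
                buf = []
        random.shuffle(buf)      # flush the remaining samples
        for s in buf:
            yield s
    return shuffled_reader

def batch(reader, batch_size):
    # Reader decorator: groups single samples into lists of batch_size samples.
    def batched_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:              # last, possibly smaller, minibatch
            yield current
    return batched_reader

train_batches = batch(shuffle(number_reader, buf_size=4), batch_size=3)
batches = list(train_batches())
```

Each decorator takes a reader and returns a new reader, which is why they can be nested in the same way as `paddle.batch(paddle.reader.shuffle(...))`.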

 ### Trainer Configuration
@@ -285,7 +279,8 @@ use_cuda = False # set to True if training with GPU
 place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()

 trainer = fluid.Trainer(
-train_func=train_program, place=place, optimizer_func=optimizer_program)
+    train_func=train_program, place=place, optimizer_func=optimizer_program)
+
 ```

 #### Event Handler Configuration
@@ -300,27 +295,32 @@ Fluid API 在训练期间为回调函数提供了一个钩子。用户能够通
 params_dirname = "recognize_digits_network.inference.model"
 lists = []
 def event_handler(event):
-if isinstance(event, fluid.EndStepEvent):
-if event.step % 100 == 0:
-# event.metrics maps with train program return arguments.
-# event.metrics[0] will yeild avg_cost and event.metrics[1] will yeild acc in this example.
-print "Pass %d, Batch %d, Cost %f" % (
-event.step, event.epoch, event.metrics[0])
-
-if isinstance(event, fluid.EndEpochEvent):
-avg_cost, acc = trainer.test(
-reader=test_reader, feed_order=['img', 'label'])
-
-print("Test with Epoch %d, avg_cost: %s, acc: %s" % (event.epoch, avg_cost, acc))
-
-# save parameters
-trainer.save_params(params_dirname)
-lists.append((event.epoch, avg_cost, acc))
+    if isinstance(event, fluid.EndStepEvent):
+        if event.step % 100 == 0:
+            # event.metrics corresponds to the values returned by the train program.
+            # In this example, event.metrics[0] is avg_cost and event.metrics[1] is acc.
+            print("Pass %d, Batch %d, Cost %f" % (
+                event.epoch, event.step, event.metrics[0]))
+
+    if isinstance(event, fluid.EndEpochEvent):
+        avg_cost, acc = trainer.test(
+            reader=test_reader, feed_order=['img', 'label'])
+
+        print("Test with Epoch %d, avg_cost: %s, acc: %s" % (event.epoch, avg_cost, acc))
+
+        # save parameters
+        trainer.save_params(params_dirname)
+        lists.append((event.epoch, avg_cost, acc))
 ```

 `event_handler_plot` can be used to plot the training progress, as shown below:

-![png](./image/train_and_test.png)
+
+<p align="center">
+<img src="image/train_and_test2.png" width="400"><br/>
+Figure 7. Training results
+</p>
+

 ```python
 from paddle.v2.plot import Ploter
@@ -333,22 +333,22 @@ lists = []

 # event_handler to plot a figure
 def event_handler_plot(event):
-global step
-if isinstance(event, fluid.EndStepEvent):
-if step % 100 == 0:
-# event.metrics maps with train program return arguments.
-# event.metrics[0] will yeild avg_cost and event.metrics[1] will yeild acc in this example.
-cost_ploter.append(train_title, step, event.metrics[0])
-cost_ploter.plot()
-step += 1
-if isinstance(event, fluid.EndEpochEvent):
-# save parameters
-trainer.save_params(params_dirname)
-
-avg_cost, acc = trainer.test(
-reader=test_reader, feed_order=['img', 'label'])
-cost_ploter.append(test_title, step, avg_cost)
-lists.append((event.epoch, avg_cost, acc))
+    global step
+    if isinstance(event, fluid.EndStepEvent):
+        if step % 100 == 0:
+            # event.metrics corresponds to the values returned by the train program.
+            # In this example, event.metrics[0] is avg_cost and event.metrics[1] is acc.
+            cost_ploter.append(train_title, step, event.metrics[0])
+            cost_ploter.plot()
+        step += 1
+    if isinstance(event, fluid.EndEpochEvent):
+        # save parameters
+        trainer.save_params(params_dirname)
+
+        avg_cost, acc = trainer.test(
+            reader=test_reader, feed_order=['img', 'label'])
+        cost_ploter.append(test_title, step, avg_cost)
+        lists.append((event.epoch, avg_cost, acc))
 ```

 #### Start Training
@@ -359,10 +359,10 @@ lists.append((event.epoch, avg_cost, acc))

 ```python
 trainer.train(
-num_epochs=5,
-event_handler=event_handler,
-reader=train_reader,
-feed_order=['img', 'label'])
+    num_epochs=5,
+    event_handler=event_handler,
+    reader=train_reader,
+    feed_order=['img', 'label'])
 ```

 The training process is fully automatic. The log printed by `event_handler` looks like the following:
@@ -395,11 +395,11 @@ Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338

 ```python
 inferencer = fluid.Inferencer(
-# infer_func=softmax_regression, # uncomment for softmax regression
-# infer_func=multilayer_perceptron, # uncomment for MLP
-infer_func=convolutional_neural_network,  # uncomment for LeNet5
-param_path=params_dirname,
-place=place)
+    # infer_func=softmax_regression, # uncomment for softmax regression
+    # infer_func=multilayer_perceptron, # uncomment for MLP
+    infer_func=convolutional_neural_network,  # LeNet-5 (active by default)
+    param_path=params_dirname,
+    place=place)
 ```

 ### Generate Input Data for Inference
@@ -412,11 +412,11 @@ import os
 import numpy as np
 from PIL import Image
 def load_image(file):
-im = Image.open(file).convert('L')
-im = im.resize((28, 28), Image.ANTIALIAS)
-im = np.array(im).reshape(1, 1, 28, 28).astype(np.float32)
-im = im / 255.0 * 2.0 - 1.0
-return im
+    im = Image.open(file).convert('L')
+    im = im.resize((28, 28), Image.ANTIALIAS)
+    im = np.array(im).reshape(1, 1, 28, 28).astype(np.float32)
+    im = im / 255.0 * 2.0 - 1.0
+    return im

 cur_dir = os.getcwd()
 img = load_image(cur_dir + '/image/infer_3.png')
@@ -429,7 +429,7 @@ img = load_image(cur_dir + '/image/infer_3.png')
 ```python
 results = inferencer.infer({'img': img})
 lab = np.argsort(results)  # class indices sorted by ascending probability; the last one is the prediction
-print "Label of image/infer_3.png is: %d" % lab[0][0][-1]
+print("Inference result of image/infer_3.png is: %d" % lab[0][0][-1])
 ```
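
The indexing `lab[0][0][-1]` in the block above can be understood with a stand-in for `results`. The sketch assumes, based only on that indexing, that `results` is a list holding one array of shape (batch size, 10); the probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical inferencer output: one array with a single row of 10 class probabilities.
results = [np.array([[0.01, 0.02, 0.05, 0.80, 0.02,
                      0.02, 0.02, 0.02, 0.02, 0.02]])]

# argsort orders class indices by ascending probability along the last axis,
# so the last index of the first sample is the most probable digit.
lab = np.argsort(results)
predicted = lab[0][0][-1]

# The same answer, obtained directly with argmax.
same = np.argmax(results[0][0])
print("Predicted digit: %d" % predicted)
```

Using `argsort` instead of `argmax` also makes it easy to read the runner-up classes, for example `lab[0][0][-3:]` for the top three.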

 ## Summary

--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/cnn_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/cnn_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/cnn_train_log_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/cnn_train_log_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/conv_layer.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/conv_layer.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/max_pooling_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/max_pooling_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/mlp_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/mlp_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/mlp_train_log_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/mlp_train_log_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/softmax_regression_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/softmax_regression_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/softmax_train_log_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/softmax_train_log_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/train_and_test.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/train_and_test.png