Merge branch 'develop' into optimize/op/fusion_gru

7bdd11d8 · tensor-tang · 59621390 · c709a04a · 59621390 · 7bdd11d8
121 changed file
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/.gitignore
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/.gitignore
-*.pyc
-train.log
-output
-data/cifar-10-batches-py/
-data/cifar-10-python.tar.gz
-data/*.txt
-data/*.list
-data/mean.meta
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/index.md
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/index.md
 # 图像分类
-本教程源代码目录在[book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification)， 初次使用请参考PaddlePaddle[安装教程](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)。
+本教程源代码目录在[book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification)， 初次使用请参考PaddlePaddle[安装教程](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)，更多内容请参考本教程的[视频课堂](http://bit.baidu.com/course/detail/id/168.html)。
 ## 背景介绍
@@ -20,24 +20,25 @@
 图像分类包括通用图像分类、细粒度图像分类等。图1展示了通用图像分类效果，即模型可以正确识别图像上的主要物体。
-![dogCatClassification](./image/dog_cat.png)
 <p align="center">
+<img src="image/dog_cat.png" width="350"><br/>
 图1. 通用图像分类展示
 </p>
 图2展示了细粒度图像分类-花卉识别的效果，要求模型可以正确识别花的类别。
-![flowersClassification](./image/flowers.png)
 <p align="center">
+<img src="image/flowers.png" width="400" ><br/>
 图2. 细粒度图像分类展示
 </p>
 一个好的模型既要对不同类别识别正确，同时也应该能够对不同视角、光照、背景、变形或部分遮挡的图像正确识别(这里我们统一称作图像扰动)。图3展示了一些图像的扰动，较好的模型会像聪明的人类一样能够正确识别。
-![imageVariations](https://raw.githubusercontent.com/PaddlePaddle/book/develop/03.image_classification/image/variations.png)
 <p align="center">
+<img src="image/variations.png" width="550" ><br/>
 图3. 扰动图片展示[22]
 </p>
@@ -46,17 +47,21 @@
 图像识别领域大量的研究成果都是建立在[PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/)、[ImageNet](http://image-net.org/)等公开的数据集上，很多图像识别算法通常在这些数据集上进行测试和比较。PASCAL VOC是2005年发起的一个视觉挑战赛，ImageNet是2010年发起的大规模视觉识别竞赛(ILSVRC)的数据集，在本章中我们基于这些竞赛的一些论文介绍图像分类模型。
 在2012年之前的传统图像分类方法可以用背景描述中提到的三步完成，但通常完整建立图像识别模型一般包括底层特征学习、特征编码、空间约束、分类器设计、模型融合等几个阶段。
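-1). **Low-level feature extraction**: A large number of local feature descriptors are usually extracted from the image at a fixed stride and scale. Common local features include SIFT (Scale-Invariant Feature Transform) \[[1](#参考文献)\], HOG (Histogram of Oriented Gradient) \[[2](#参考文献)\], and LBP (Local Binary Pattern) \[[3](#参考文献)\]; several descriptors are typically combined so that not too much useful information is lost.
-2). **Feature encoding**: Low-level features contain a great deal of redundancy and noise. To make the feature representation more robust, a feature transformation algorithm is applied to encode them; this step is called feature encoding. Common encodings include vector quantization \[[4](#参考文献)\], sparse coding \[[5](#参考文献)\], locality-constrained linear coding \[[6](#参考文献)\], and Fisher vector coding \[[7](#参考文献)\].
+  1). **Low-level feature extraction**: A large number of local feature descriptors are usually extracted from the image at a fixed stride and scale. Common local features include SIFT (Scale-Invariant Feature Transform) \[[1](#参考文献)\], HOG (Histogram of Oriented Gradient) \[[2](#参考文献)\], and LBP (Local Binary Pattern) \[[3](#参考文献)\]; several descriptors are typically combined so that not too much useful information is lost.
-3). **Spatial feature constraint**: Feature encoding is usually followed by a spatial feature constraint, also called **feature pooling**. Feature pooling takes the maximum or the average of each feature dimension within a spatial region, which gives the representation a degree of invariance. Pyramid feature matching is a commonly used pooling method: it divides the image evenly into blocks and pools the features within each block.
-4). **Classification**: After the preceding steps an image is described by a fixed-dimensional vector, which is then passed to a classifier. Commonly used classifiers include SVM (Support Vector Machine) and random forests. Kernel SVMs are the most widely used and perform well on traditional image classification tasks.
+  2). **Feature encoding**: Low-level features contain a great deal of redundancy and noise. To make the feature representation more robust, a feature transformation algorithm is applied to encode them; this step is called feature encoding. Common encodings include vector quantization \[[4](#参考文献)\], sparse coding \[[5](#参考文献)\], locality-constrained linear coding \[[6](#参考文献)\], and Fisher vector coding \[[7](#参考文献)\].
+  3). **Spatial feature constraint**: Feature encoding is usually followed by a spatial feature constraint, also called **feature pooling**. Feature pooling takes the maximum or the average of each feature dimension within a spatial region, which gives the representation a degree of invariance. Pyramid feature matching is a commonly used pooling method: it divides the image evenly into blocks and pools the features within each block.
+  4). **Classification**: After the preceding steps an image is described by a fixed-dimensional vector, which is then passed to a classifier. Commonly used classifiers include SVM (Support Vector Machine) and random forests. Kernel SVMs are the most widely used and perform well on traditional image classification tasks.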
 这种方法在PASCAL VOC竞赛中的图像分类算法中被广泛使用 \[[18](#参考文献)\]。[NEC实验室](http://www.nec-labs.com/)在ILSVRC2010中采用SIFT和LBP特征，两个非线性编码器以及SVM分类器获得图像分类的冠军 \[[8](#参考文献)\]。
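The feature pooling step described above (take the maximum or the average of each feature dimension inside evenly divided image blocks) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `grid_pool`, the `(H, W, D)` layout of the encoded features, and the single-level `grid x grid` split are hypothetical choices, not part of the tutorial or of any library.

```python
import numpy as np


def grid_pool(codes, grid=2, mode="max"):
    """Pool an (H, W, D) map of encoded local features over a grid x grid layout.

    Returns a fixed-length vector of size grid * grid * D, regardless of H and W.
    """
    h, w, d = codes.shape
    # Block boundaries along each axis; linspace keeps the blocks near-equal in size.
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    reduce_fn = np.max if mode == "max" else np.mean
    pooled = []
    for i in range(grid):
        for j in range(grid):
            block = codes[ys[i]:ys[i + 1], xs[j]:xs[j + 1], :]
            # Reduce over all spatial positions of the block, per feature dimension.
            pooled.append(reduce_fn(block.reshape(-1, d), axis=0))
    return np.concatenate(pooled)


# Toy input: a 4x4 grid of 3-dimensional codes.
codes = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
vec = grid_pool(codes, grid=2, mode="max")
print(vec.shape)
```

Whatever the image size, the output length depends only on `grid` and `D`, which is what lets the next step feed a fixed-dimensional vector to the classifier.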
 Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得了历史性的突破，效果大幅度超越传统方法，获得了ILSVRC2012冠军，该模型被称作AlexNet。这也是首次将深度学习用于大规模图像分类中。从AlexNet之后，涌现了一系列CNN模型，不断地在ImageNet上刷新成绩，如图4展示。随着模型变得越来越深以及精妙的结构设计，Top-5的错误率也越来越低，降到了3.5%附近。而在同样的ImageNet数据集上，人眼的辨识错误率大概在5.1%，也就是目前的深度学习模型的识别能力已经超过了人眼。
-![ilsvrc](./image/ilsvrc.png)
 <p align="center">
+<img src="image/ilsvrc.png" width="500" ><br/>
 图4. ILSVRC图像分类Top-5错误率
 </p>
@@ -64,8 +69,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
 传统CNN包含卷积层、全连接层等组件，并采用softmax多类别分类器和多类交叉熵损失函数，一个典型的卷积神经网络如图5所示，我们先介绍用来构造CNN的常见组件。
-![cnnStructure](./image/lenet.png)
 <p align="center">
+<img src="image/lenet.png"><br/>
 图5. CNN网络示例[20]
 </p>
@@ -83,8 +88,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
 牛津大学VGG(Visual Geometry Group)组在2014年ILSVRC提出的模型被称作VGG模型 \[[11](#参考文献)\] 。该模型相比以往模型进一步加宽和加深了网络结构，它的核心是五组卷积操作，每两组之间做Max-Pooling空间降维。同一组内采用多次连续的3X3卷积，卷积核的数目由较浅组的64增多到最深组的512，同一组内的卷积核数目是一样的。卷积之后接两层全连接层，之后是分类层。由于每组内卷积层的不同，有11、13、16、19层这几种模型，下图展示一个16层的网络结构。VGG模型结构相对简洁，提出之后也有很多文章基于此模型进行研究，如在ImageNet上首次公开超过人眼识别的模型\[[19](#参考文献)\]就是借鉴VGG模型的结构。
-![vgg16](./image/vgg16.png)
 <p align="center">
+<img src="image/vgg16.png" width="750" ><br/>
 图6. 基于ImageNet的VGG16模型
 </p>
@@ -92,12 +97,16 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
 GoogleNet \[[12](#参考文献)\] 在2014年ILSVRC的获得了冠军，在介绍该模型之前我们先来了解NIN(Network in Network)模型 \[[13](#参考文献)\] 和Inception模块，因为GoogleNet模型由多组Inception模块组成，模型设计借鉴了NIN的一些思想。
-NIN模型主要有两个特点：1) 引入了多层感知卷积网络(Multi-Layer Perceptron Convolution, MLPconv)代替一层线性卷积网络。MLPconv是一个微小的多层卷积网络，即在线性卷积后面增加若干层1x1的卷积，这样可以提取出高度非线性特征。2) 传统的CNN最后几层一般都是全连接层，参数较多。而NIN模型设计最后一层卷积层包含类别维度大小的特征图，然后采用全局均值池化(Avg-Pooling)替代全连接层，得到类别维度大小的向量，再进行分类。这种替代全连接层的方式有利于减少参数。
+NIN模型主要有两个特点：
+1) 引入了多层感知卷积网络(Multi-Layer Perceptron Convolution, MLPconv)代替一层线性卷积网络。MLPconv是一个微小的多层卷积网络，即在线性卷积后面增加若干层1x1的卷积，这样可以提取出高度非线性特征。
+2) 传统的CNN最后几层一般都是全连接层，参数较多。而NIN模型设计最后一层卷积层包含类别维度大小的特征图，然后采用全局均值池化(Avg-Pooling)替代全连接层，得到类别维度大小的向量，再进行分类。这种替代全连接层的方式有利于减少参数。
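To make the second NIN idea concrete, here is a minimal NumPy sketch of global average pooling used as the classification head: the last convolution layer emits one feature map per class, each map is averaged to a single score, and softmax turns the scores into probabilities. The map size (8x8), the helper name, and the parameter comparison against a flatten-plus-FC head are illustrative assumptions.

```python
import numpy as np


def global_avg_pool_classify(feature_maps):
    """feature_maps has shape (num_classes, H, W): one feature map per class."""
    scores = feature_maps.mean(axis=(1, 2))  # one scalar per class
    exp = np.exp(scores - scores.max())      # numerically stable softmax
    return exp / exp.sum()


num_classes, h, w = 10, 8, 8
rng = np.random.RandomState(0)
maps = rng.randn(num_classes, h, w)
probs = global_avg_pool_classify(maps)

# Weights needed by each kind of head (biases ignored):
fc_params = num_classes * h * w * num_classes  # flatten the maps, then a fully connected layer
gap_params = 0                                 # averaging has nothing to learn
print(probs.shape, fc_params, gap_params)
```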
 Inception模块如下图7所示，图(a)是最简单的设计，输出是3个卷积层和一个池化层的特征拼接。这种设计的缺点是池化层不会改变特征通道数，拼接后会导致特征的通道数较大，经过几层这样的模块堆积后，通道数会越来越大，导致参数和计算量也随之增大。为了改善这个缺点，图(b)引入3个1x1卷积层进行降维，所谓的降维就是减少通道数，同时如NIN模型中提到的1x1卷积也可以修正线性特征。
-![inception](./image/inception.png)
 <p align="center">
+<img src="image/inception.png" width="800" ><br/>
 图7. Inception模块
 </p>
@@ -105,8 +114,8 @@ GoogleNet由多组Inception模块堆积而成。另外，在网络最后也没
 GoogleNet整体网络结构如图8所示，总共22层网络：开始由3层普通的卷积组成；接下来由三组子网络组成，第一组子网络包含2个Inception模块，第二组包含5个Inception模块，第三组包含2个Inception模块；然后接均值池化层、全连接层。
-![googleNet](./image/googlenet.jpeg)
 <p align="center">
+<img src="image/googlenet.jpeg" ><br/>
 图8. GoogleNet[12]
 </p>
@@ -120,15 +129,15 @@ ResNet(Residual Network) \[[15](#参考文献)\] 是2015年ImageNet图像分类
 残差模块如图9所示，左边是基本模块连接方式，由两个输出通道数相同的3x3卷积组成。右边是瓶颈模块(Bottleneck)连接方式，之所以称为瓶颈，是因为上面的1x1卷积用来降维(图示例即256->64)，下面的1x1卷积用来升维(图示例即64->256)，这样中间3x3卷积的输入和输出通道数都较小(图示例即64->64)。
-![ResNetBlock](./image/resnet_block.jpg)
 <p align="center">
+<img src="image/resnet_block.jpg" width="400"><br/>
 图9. 残差模块
 </p>
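A quick weight count shows why the bottleneck design is attractive. The sketch below compares two 3x3 convolutions at 256 channels with the 256->64->64->256 bottleneck from the figure; running the basic module at 256 channels is an assumption made only so that the two are comparable, and biases and BN parameters are ignored.

```python
def conv_params(c_in, c_out, k):
    """Number of weights in a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


# Basic module: two 3x3 convolutions that keep 256 channels (assumed width).
basic = conv_params(256, 256, 3) + conv_params(256, 256, 3)

# Bottleneck module: 1x1 reduce (256->64), 3x3 (64->64), 1x1 expand (64->256).
bottleneck = (conv_params(256, 64, 1)
              + conv_params(64, 64, 3)
              + conv_params(64, 256, 1))

print(basic, bottleneck)
```

Under these assumptions the bottleneck uses roughly one seventeenth of the weights, because the expensive 3x3 convolution only sees 64 channels.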
 图10展示了50、101、152层网络连接示意图，使用的是瓶颈模块。这三个模型的区别在于每组中残差模块的重复次数不同(见图右上角)。ResNet训练收敛较快，成功的训练了上百乃至近千层的卷积神经网络。
-![ResNet](./image/resnet.png)
 <p align="center">
+<img src="image/resnet.png"><br/>
 图10. 基于ImageNet的ResNet模型
 </p>
@@ -139,8 +148,8 @@ ResNet(Residual Network) \[[15](#参考文献)\] 是2015年ImageNet图像分类
 由于ImageNet数据集较大，下载和训练较慢，为了方便大家学习，我们使用[CIFAR10](<https://www.cs.toronto.edu/~kriz/cifar.html>)数据集。CIFAR10数据集包含60,000张32x32的彩色图片，10个类别，每个类包含6,000张。其中50,000张图片作为训练集，10000张作为测试集。图11从每个类别中随机抽取了10张图片，展示了所有的类别。
-![CIFAR](https://raw.githubusercontent.com/PaddlePaddle/book/develop/03.image_classification/image/cifar.png)
 <p align="center">
+<img src="image/cifar.png" width="350"><br/>
 图11. CIFAR10数据集[21]
 </p>
@@ -159,6 +168,7 @@ import paddle
 import paddle.fluid as fluid
 import numpy
 import sys
+from __future__ import print_function  # NOTE: a __future__ import must be the first statement; move it above the other imports
 ```
 本教程中我们提供了VGG和ResNet两个模型的配置。
@@ -170,33 +180,34 @@ VGG核心模块的输入是数据层，`vgg_bn_drop` 定义了16层VGG结构，
 ```python
 def vgg_bn_drop(input):
-def conv_block(ipt, num_filter, groups, dropouts):
+    def conv_block(ipt, num_filter, groups, dropouts):
-return fluid.nets.img_conv_group(
+        return fluid.nets.img_conv_group(
-input=ipt,
+            input=ipt,
-pool_size=2,
+            pool_size=2,
-pool_stride=2,
+            pool_stride=2,
-conv_num_filter=[num_filter] * groups,
+            conv_num_filter=[num_filter] * groups,
-conv_filter_size=3,
+            conv_filter_size=3,
-conv_act='relu',
+            conv_act='relu',
-conv_with_batchnorm=True,
+            conv_with_batchnorm=True,
-conv_batchnorm_drop_rate=dropouts,
+            conv_batchnorm_drop_rate=dropouts,
-pool_type='max')
+            pool_type='max')
-conv1 = conv_block(input, 64, 2, [0.3, 0])
+    conv1 = conv_block(input, 64, 2, [0.3, 0])
-conv2 = conv_block(conv1, 128, 2, [0.4, 0])
+    conv2 = conv_block(conv1, 128, 2, [0.4, 0])
-conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
+    conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
-conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
+    conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
-conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
+    conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
-drop = fluid.layers.dropout(x=conv5, dropout_prob=0.5)
+    drop = fluid.layers.dropout(x=conv5, dropout_prob=0.5)
-fc1 = fluid.layers.fc(input=drop, size=512, act=None)
+    fc1 = fluid.layers.fc(input=drop, size=512, act=None)
-bn = fluid.layers.batch_norm(input=fc1, act='relu')
+    bn = fluid.layers.batch_norm(input=fc1, act='relu')
-drop2 = fluid.layers.dropout(x=bn, dropout_prob=0.5)
+    drop2 = fluid.layers.dropout(x=bn, dropout_prob=0.5)
-fc2 = fluid.layers.fc(input=drop2, size=512, act=None)
+    fc2 = fluid.layers.fc(input=drop2, size=512, act=None)
-predict = fluid.layers.fc(input=fc2, size=10, act='softmax')
+    predict = fluid.layers.fc(input=fc2, size=10, act='softmax')
-return predict
+    return predict
 ```
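 1. First, a group of convolution layers, conv_block, is defined. The kernel size is 3x3, the pooling window is 2x2 with a stride of 2, groups sets how many consecutive convolutions each VGG block contains, and dropouts gives the Dropout probabilities. The `img_conv_group` used here is a predefined module in `fluid.nets`, made up of several Conv->BN->ReLu->Dropout groups followed by one Pooling layer.
 2. Five groups of convolutions follow, i.e. five conv_block calls. The first and second groups use two consecutive convolutions; the third, fourth and fifth groups use three. The Dropout probability after the last convolution of each group is 0, meaning no Dropout is applied there.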
@@ -211,74 +222,76 @@ ResNet模型的第1、3、4步和VGG模型相同，这里不再介绍。主要
 先介绍`resnet_cifar10`中的一些基本函数，再介绍网络连接过程。
- `conv_bn_layer` : 带BN的卷积层。
+  - `conv_bn_layer` : 带BN的卷积层。
- `shortcut` : 残差模块的"直连"路径，"直连"实际分两种形式：残差模块输入和输出特征通道数不等时，采用1x1卷积的升维操作；残差模块输入和输出通道相等时，采用直连操作。
+  - `shortcut` : 残差模块的"直连"路径，"直连"实际分两种形式：残差模块输入和输出特征通道数不等时，采用1x1卷积的升维操作；残差模块输入和输出通道相等时，采用直连操作。
- `basicblock` : 一个基础残差模块，即图9左边所示，由两组3x3卷积组成的路径和一条"直连"路径组成。
+  - `basicblock` : 一个基础残差模块，即图9左边所示，由两组3x3卷积组成的路径和一条"直连"路径组成。
- `bottleneck` : 一个瓶颈残差模块，即图9右边所示，由上下1x1卷积和中间3x3卷积组成的路径和一条"直连"路径组成。
+  - `bottleneck` : 一个瓶颈残差模块，即图9右边所示，由上下1x1卷积和中间3x3卷积组成的路径和一条"直连"路径组成。
- `layer_warp` : 一组残差模块，由若干个残差模块堆积而成。每组中第一个残差模块滑动窗口大小与其他可以不同，以用来减少特征图在垂直和水平方向的大小。
+  - `layer_warp` : 一组残差模块，由若干个残差模块堆积而成。每组中第一个残差模块滑动窗口大小与其他可以不同，以用来减少特征图在垂直和水平方向的大小。
 ```python
 def conv_bn_layer(input,
-ch_out,
+                  ch_out,
-filter_size,
+                  filter_size,
-stride,
+                  stride,
-padding,
+                  padding,
-act='relu',
+                  act='relu',
-bias_attr=False):
+                  bias_attr=False):
-tmp = fluid.layers.conv2d(
+    tmp = fluid.layers.conv2d(
-input=input,
+        input=input,
-filter_size=filter_size,
+        filter_size=filter_size,
-num_filters=ch_out,
+        num_filters=ch_out,
-stride=stride,
+        stride=stride,
-padding=padding,
+        padding=padding,
-act=None,
+        act=None,
-bias_attr=bias_attr)
+        bias_attr=bias_attr)
-return fluid.layers.batch_norm(input=tmp, act=act)
+    return fluid.layers.batch_norm(input=tmp, act=act)
 def shortcut(input, ch_in, ch_out, stride):
-if ch_in != ch_out:
+    if ch_in != ch_out:
-return conv_bn_layer(input, ch_out, 1, stride, 0, None)
+        return conv_bn_layer(input, ch_out, 1, stride, 0, None)
-else:
+    else:
-return input
+        return input
 def basicblock(input, ch_in, ch_out, stride):
-tmp = conv_bn_layer(input, ch_out, 3, stride, 1)
+    tmp = conv_bn_layer(input, ch_out, 3, stride, 1)
-tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, act=None, bias_attr=True)
+    tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, act=None, bias_attr=True)
-short = shortcut(input, ch_in, ch_out, stride)
+    short = shortcut(input, ch_in, ch_out, stride)
-return fluid.layers.elementwise_add(x=tmp, y=short, act='relu')
+    return fluid.layers.elementwise_add(x=tmp, y=short, act='relu')
 def layer_warp(block_func, input, ch_in, ch_out, count, stride):
-tmp = block_func(input, ch_in, ch_out, stride)
+    tmp = block_func(input, ch_in, ch_out, stride)
-for i in range(1, count):
+    for i in range(1, count):
-tmp = block_func(tmp, ch_out, ch_out, 1)
+        tmp = block_func(tmp, ch_out, ch_out, 1)
-return tmp
+    return tmp
 ```
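 `resnet_cifar10` is wired up in the following steps.
 1. The input is connected to one `conv_bn_layer`, a convolution layer with BN.
 2. Three groups of residual modules follow, i.e. the three `layer_warp` calls configured below; each group is built from the residual module shown on the left of Figure 9.
 3. Finally, average pooling is applied to the network and that layer is returned.
-Note: excluding the first convolution layer and the last fully connected layer, the total number of parameterized layers in the three `layer_warp` groups must be divisible by 6, i.e. the depth of `resnet_cifar10` must satisfy `$(depth - 2) % 6 == 0$`.
+Note: excluding the first convolution layer and the last fully connected layer, the total number of parameterized layers in the three `layer_warp` groups must be divisible by 6, i.e. the depth of `resnet_cifar10` must satisfy $(depth - 2) \% 6 == 0$.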
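The depth rule comes from simple counting: three groups, each with n basic modules of two convolutions, give 6n layers, plus the first convolution and the final fully connected layer. A minimal sketch of that arithmetic follows; the helper name `blocks_per_stage` is hypothetical and not part of the tutorial code.

```python
def blocks_per_stage(depth):
    """Return n, the number of basic residual modules in each of the three groups."""
    if (depth - 2) % 6 != 0:
        raise ValueError("depth must satisfy (depth - 2) % 6 == 0, got %d" % depth)
    # Integer division keeps n an int, so it can be passed to range().
    return (depth - 2) // 6


for depth in (20, 32, 44, 56, 110, 1202):
    print(depth, blocks_per_stage(depth))
```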
 ```python
 def resnet_cifar10(ipt, depth=32):
-# depth should be one of 20, 32, 44, 56, 110, 1202
+    # depth should be one of 20, 32, 44, 56, 110, 1202
-assert (depth - 2) % 6 == 0
+    assert (depth - 2) % 6 == 0
-n = (depth - 2) / 6
+    n = (depth - 2) // 6  # integer division, so n stays an int for range() under Python 3
-nStages = {16, 64, 128}
+    nStages = {16, 64, 128}
-conv1 = conv_bn_layer(ipt, ch_out=16, filter_size=3, stride=1, padding=1)
+    conv1 = conv_bn_layer(ipt, ch_out=16, filter_size=3, stride=1, padding=1)
-res1 = layer_warp(basicblock, conv1, 16, 16, n, 1)
+    res1 = layer_warp(basicblock, conv1, 16, 16, n, 1)
-res2 = layer_warp(basicblock, res1, 16, 32, n, 2)
+    res2 = layer_warp(basicblock, res1, 16, 32, n, 2)
-res3 = layer_warp(basicblock, res2, 32, 64, n, 2)
+    res3 = layer_warp(basicblock, res2, 32, 64, n, 2)
-pool = fluid.layers.pool2d(
+    pool = fluid.layers.pool2d(
-input=res3, pool_size=8, pool_type='avg', pool_stride=1)
+        input=res3, pool_size=8, pool_type='avg', pool_stride=1)
-predict = fluid.layers.fc(input=pool, size=10, act='softmax')
+    predict = fluid.layers.fc(input=pool, size=10, act='softmax')
-return predict
+    return predict
 ```
 ## Inference Program Configuration
@@ -287,13 +300,13 @@ return predict
 ```python
 def inference_program():
-# The image is 32 * 32 with RGB representation.
+    # The image is 32 * 32 with RGB representation.
-data_shape = [3, 32, 32]
+    data_shape = [3, 32, 32]
-images = fluid.layers.data(name='pixel', shape=data_shape, dtype='float32')
+    images = fluid.layers.data(name='pixel', shape=data_shape, dtype='float32')
-predict = resnet_cifar10(images, 32)
+    predict = resnet_cifar10(images, 32)
-# predict = vgg_bn_drop(images) # un-comment to use vgg net
+    # predict = vgg_bn_drop(images) # un-comment to use vgg net
-return predict
+    return predict
 ```
 ## Train Program 配置
@@ -306,13 +319,13 @@ return predict
 ```python
 def train_program():
-predict = inference_program()
+    predict = inference_program()
-label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
-cost = fluid.layers.cross_entropy(input=predict, label=label)
+    cost = fluid.layers.cross_entropy(input=predict, label=label)
-avg_cost = fluid.layers.mean(cost)
+    avg_cost = fluid.layers.mean(cost)
-accuracy = fluid.layers.accuracy(input=predict, label=label)
+    accuracy = fluid.layers.accuracy(input=predict, label=label)
-return [avg_cost, accuracy]
+    return [avg_cost, accuracy]
 ```
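For readers who want to see what the two returned metrics measure, here is a NumPy sketch of mean cross-entropy and top-1 accuracy on a toy batch. It is not how `fluid.layers.cross_entropy` or `fluid.layers.accuracy` are implemented; the helper names and the three-class probabilities are made up for illustration.

```python
import numpy as np


def mean_cross_entropy(probs, labels):
    """probs: (batch, classes) softmax outputs; labels: (batch,) integer class ids."""
    picked = probs[np.arange(len(labels)), labels]  # probability assigned to the true class
    return float(-np.log(picked).mean())


def accuracy(probs, labels):
    """Fraction of samples whose highest-probability class equals the label."""
    return float((probs.argmax(axis=1) == labels).mean())


probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3]])
labels = np.array([0, 1, 2])

print(mean_cross_entropy(probs, labels), accuracy(probs, labels))
```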
 ## Optimizer Function 配置
@@ -321,7 +334,7 @@ return [avg_cost, accuracy]
 ```python
 def optimizer_program():
-return fluid.optimizer.Adam(learning_rate=0.001)
+    return fluid.optimizer.Adam(learning_rate=0.001)
 ```
 ## 训练模型
@@ -334,9 +347,9 @@ return fluid.optimizer.Adam(learning_rate=0.001)
 use_cuda = False
 place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
 trainer = fluid.Trainer(
-train_func=train_program,
+    train_func=train_program,
-optimizer_func=optimizer_program,
+    optimizer_func=optimizer_program,
-place=place)
+    place=place)
 ```
 ### Data Feeders 配置
@@ -349,12 +362,12 @@ BATCH_SIZE = 128
 # Reader for training
 train_reader = paddle.batch(
-paddle.reader.shuffle(paddle.dataset.cifar.train10(), buf_size=50000),
+    paddle.reader.shuffle(paddle.dataset.cifar.train10(), buf_size=50000),
-batch_size=BATCH_SIZE)
+    batch_size=BATCH_SIZE)
 # Reader for testing. A separated data set for testing.
 test_reader = paddle.batch(
-paddle.dataset.cifar.test10(), batch_size=BATCH_SIZE)
+    paddle.dataset.cifar.test10(), batch_size=BATCH_SIZE)
 ```
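The reader pipeline above composes two decorators: one that shuffles within a buffer and one that groups samples into batches. The pure-Python sketch below mimics that composition on a toy reader. It reflects an assumption about the general behaviour of buffered shuffling and batching, not the actual implementation of `paddle.reader.shuffle` or `paddle.batch`; all names here are hypothetical.

```python
import random


def shuffle_reader(reader, buf_size, seed=None):
    """Wrap a reader so that samples are shuffled within buffers of buf_size."""
    def new_reader():
        rng = random.Random(seed)
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for s in buf:
                    yield s
                buf = []
        # Flush whatever is left in the last, possibly smaller, buffer.
        rng.shuffle(buf)
        for s in buf:
            yield s
    return new_reader


def batch_reader(reader, batch_size):
    """Wrap a reader so that it yields lists of up to batch_size samples."""
    def new_reader():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
    return new_reader


def toy_reader():
    for i in range(10):
        yield i


batches = list(batch_reader(shuffle_reader(toy_reader, buf_size=4, seed=0), 3)())
print([len(b) for b in batches])
```

A buffer as large as the training set (50000 for CIFAR10) amounts to a full shuffle per pass; a smaller buffer trades randomness for memory.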
 ### Event Handler
@@ -363,7 +376,11 @@ paddle.dataset.cifar.test10(), batch_size=BATCH_SIZE)
 `event_handler_plot`可以用来利用回调数据来打点画图:
-![png](./image/train_and_test.png)
+<p align="center">
+<img src="image/train_and_test.png" width="350"><br/>
+图12. 训练结果
+</p>
 ```python
 params_dirname = "image_classification_resnet.inference.model"
@@ -376,21 +393,21 @@ cost_ploter = Ploter(train_title, test_title)
 step = 0
 def event_handler_plot(event):
-global step
+    global step
-if isinstance(event, fluid.EndStepEvent):
+    if isinstance(event, fluid.EndStepEvent):
-if step % 1 == 0:
+        if step % 1 == 0:
-cost_ploter.append(train_title, step, event.metrics[0])
+            cost_ploter.append(train_title, step, event.metrics[0])
-cost_ploter.plot()
+            cost_ploter.plot()
-step += 1
+        step += 1
-if isinstance(event, fluid.EndEpochEvent):
+    if isinstance(event, fluid.EndEpochEvent):
-avg_cost, accuracy = trainer.test(
+        avg_cost, accuracy = trainer.test(
-reader=test_reader,
+            reader=test_reader,
-feed_order=['pixel', 'label'])
+            feed_order=['pixel', 'label'])
-cost_ploter.append(test_title, step, avg_cost)
+        cost_ploter.append(test_title, step, avg_cost)
-# save parameters
+        # save parameters
-if params_dirname is not None:
+        if params_dirname is not None:
-trainer.save_params(params_dirname)
+            trainer.save_params(params_dirname)
 ```
 `event_handler` 用来在训练过程中输出文本日志
@@ -400,39 +417,39 @@ params_dirname = "image_classification_resnet.inference.model"
 # event handler to track training and testing process
 def event_handler(event):
-if isinstance(event, fluid.EndStepEvent):
+    if isinstance(event, fluid.EndStepEvent):
-if event.step % 100 == 0:
+        if event.step % 100 == 0:
-print("\nPass %d, Batch %d, Cost %f, Acc %f" %
+            print("\nPass %d, Batch %d, Cost %f, Acc %f" %
-(event.step, event.epoch, event.metrics[0],
+                  (event.step, event.epoch, event.metrics[0],
-event.metrics[1]))
+                   event.metrics[1]))
-else:
+        else:
-sys.stdout.write('.')
+            sys.stdout.write('.')
-sys.stdout.flush()
+            sys.stdout.flush()
-if isinstance(event, fluid.EndEpochEvent):
+    if isinstance(event, fluid.EndEpochEvent):
-# Test against with the test dataset to get accuracy.
+        # Evaluate on the test dataset to get accuracy.
-avg_cost, accuracy = trainer.test(
+        avg_cost, accuracy = trainer.test(
-reader=test_reader, feed_order=['pixel', 'label'])
+            reader=test_reader, feed_order=['pixel', 'label'])
-print('\nTest with Pass {0}, Loss {1:2.2}, Acc {2:2.2}'.format(event.epoch, avg_cost, accuracy))
+        print('\nTest with Pass {0}, Loss {1:2.2}, Acc {2:2.2}'.format(event.epoch, avg_cost, accuracy))
-# save parameters
+        # save parameters
-if params_dirname is not None:
+        if params_dirname is not None:
-trainer.save_params(params_dirname)
+            trainer.save_params(params_dirname)
 ```
 ### 训练
 通过`trainer.train`函数训练:
-**注意:** CPU，每个 Epoch 将花费大约15～20分钟。这部分可能需要一段时间。请随意修改代码，在GPU上运行测试，以提高培训速度。
+**注意:** CPU，每个 Epoch 将花费大约15～20分钟。这部分可能需要一段时间。请随意修改代码，在GPU上运行测试，以提高训练速度。
 ```python
 trainer.train(
-reader=train_reader,
+    reader=train_reader,
-num_epochs=2,
+    num_epochs=2,
-event_handler=event_handler,
+    event_handler=event_handler,
-feed_order=['pixel', 'label'])
+    feed_order=['pixel', 'label'])
 ```
 一轮训练log示例如下所示，经过1个pass， 训练集上平均 Accuracy 为0.59 ，测试集上平均  Accuracy 为0.6 。
@@ -449,11 +466,11 @@ Pass 300, Batch 0, Cost 1.223424, Acc 0.593750
 Test with Pass 0, Loss 1.1, Acc 0.6
 ```
-图12是训练的分类错误率曲线图，运行到第200个pass后基本收敛，最终得到测试集上分类错误率为8.54%。
+图13是训练的分类错误率曲线图，运行到第200个pass后基本收敛，最终得到测试集上分类错误率为8.54%。
-![CIFARErrorRate](./image/plot.png)
 <p align="center">
-图12. CIFAR10数据集上VGG模型的分类错误率
+<img src="image/plot.png" width="400" ><br/>
+图13. CIFAR10数据集上VGG模型的分类错误率
 </p>
 ## 应用模型
@@ -471,19 +488,19 @@ import numpy as np
 import os
 def load_image(file):
-im = Image.open(file)
+    im = Image.open(file)
-im = im.resize((32, 32), Image.ANTIALIAS)
+    im = im.resize((32, 32), Image.ANTIALIAS)
-im = np.array(im).astype(np.float32)
+    im = np.array(im).astype(np.float32)
-# The storage order of the loaded image is W(width),
+    # The storage order of the loaded image is H(height),
-# H(height), C(channel). PaddlePaddle requires
+    # W(width), C(channel). PaddlePaddle requires
-# the CHW order, so transpose them.
+    # the CHW order, so transpose them.
-im = im.transpose((2, 0, 1))  # CHW
+    im = im.transpose((2, 0, 1))  # CHW
-im = im / 255.0
+    im = im / 255.0
-# Add one dimension to mimic the list format.
+    # Add one dimension to mimic the list format.
-im = numpy.expand_dims(im, axis=0)
+    im = np.expand_dims(im, axis=0)
-return im
+    return im
 cur_dir = os.getcwd()
 img = load_image(cur_dir + '/image/dog.png')
@@ -497,11 +514,11 @@ img = load_image(cur_dir + '/image/dog.png')
 ```python
 inferencer = fluid.Inferencer(
-infer_func=inference_program, param_path=params_dirname, place=place)
+    infer_func=inference_program, param_path=params_dirname, place=place)
+label_list = ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]
 # inference
 results = inferencer.infer({'pixel': img})
-print("infer results: ", results)
+print("infer results: %s" % label_list[np.argmax(results[0])])
 ```
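The two data-handling steps around inference, converting an image array to a CHW batch and mapping the softmax output back to a class name, can be checked without a trained model. The sketch below uses a zero-filled array and made-up probabilities in place of a real image and real `inferencer.infer` output; `to_chw_batch` and `decode` are hypothetical helper names.

```python
import numpy as np

label_list = ["airplane", "automobile", "bird", "cat", "deer",
              "dog", "frog", "horse", "ship", "truck"]


def to_chw_batch(hwc_image):
    """Convert an (H, W, C) uint8 image to a (1, C, H, W) float32 batch scaled to [0, 1]."""
    chw = hwc_image.astype(np.float32).transpose((2, 0, 1)) / 255.0
    return np.expand_dims(chw, axis=0)


def decode(probs, labels):
    """Return the label whose probability is highest."""
    return labels[int(np.argmax(probs))]


fake_image = np.zeros((32, 32, 3), dtype=np.uint8)  # stand-in for a loaded picture
batch = to_chw_batch(fake_image)

# Stand-in for one row of softmax output; the values are invented for illustration.
fake_probs = np.array([0.01, 0.02, 0.02, 0.10, 0.05, 0.70, 0.03, 0.03, 0.02, 0.02])
print(batch.shape, decode(fake_probs, label_list))
```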
 ## 总结

--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/cifar.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/cifar.png
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/inception_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/inception_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/lenet_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/lenet_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/plot_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/plot_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/variations.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/variations.png
--- a/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/variations_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/image_classification/image/variations_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/index.rst
+++ b/doc/fluid/new_docs/beginners_guide/basics/index.rst
@@ -10,9 +10,9 @@
 ..  toctree::
    :maxdepth: 2
-    image_classification/index.md
+    image_classification/README.cn.md
-    word2vec/index.md
+    word2vec/README.cn.md
-    recommender_system/index.md
+    recommender_system/README.cn.md
-    understand_sentiment/index.md
+    understand_sentiment/README.cn.md
-    label_semantic_roles/index.md
+    label_semantic_roles/README.cn.md
-    machine_translation/index.md
+    machine_translation/README.cn.md
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/.gitignore
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/.gitignore
-data/train.list
-data/test.*
-data/conll05st-release.tar.gz
-data/conll05st-release
-data/predicate_dict
-data/label_dict
-data/word_dict
-data/emb
-data/feature
-output
-predict.res
-train.log
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/index.md
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/index.md
 # 语义角色标注
-本教程源代码目录在[book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles)， 初次使用请参考PaddlePaddle[安装教程](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)。
+本教程源代码目录在[book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles)， 初次使用请参考PaddlePaddle[安装教程](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)，更多内容请参考本教程的[视频课堂](http://bit.baidu.com/course/detail/id/178.html)。
 ## 背景介绍
@@ -8,7 +8,7 @@
 请看下面的例子，“遇到” 是谓词（Predicate，通常简写为“Pred”），“小明”是施事者（Agent），“小红”是受事者（Patient），“昨天” 是事件发生的时间（Time），“公园”是事情发生的地点（Location）。
-$$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_{\mbox{Time}}\mbox{在[公园]}_{\mbox{Location}}\mbox{[遇到]}_{\mbox{Predicate}}\mbox{了[小红]}_{\mbox{Patient}}\mbox{。}$$
+$$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mbox{Time}\mbox{在[公园]}_{\mbox{Location}}\mbox{[遇到]}_{\mbox{Predicate}}\mbox{了[小红]}_{\mbox{Patient}}\mbox{。}$$
 语义角色标注（Semantic Role Labeling，SRL）以句子的谓词为中心，不对句子所包含的语义信息进行深入分析，只分析句子中各成分与谓词之间的关系，即句子的谓词（Predicate）- 论元（Argument）结构，并用语义角色来描述这些结构关系，是许多自然语言理解任务（如信息抽取，篇章分析，深度问答等）的一个重要中间步骤。在研究中一般都假定谓词是给定的，所要做的就是找出给定谓词的各个论元和它们的语义角色。
@@ -20,17 +20,17 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_{\m
 4. 论元识别：这个过程是从上一步剪除之后的候选中判断哪些是真正的论元，通常当做一个二分类问题来解决。
 5. 对第4步的结果，通过多分类得到论元的语义角色标签。可以看到，句法分析是基础，并且后续步骤常常会构造的一些人工特征，这些特征往往也来自句法分析。
-![dependencyParsing](./image/dependency_parsing.png)
 <div  align="center">
+<img src="image/dependency_parsing.png" width = "80%" align=center /><br>
 图1. 依存句法分析句法树示例
 </div>
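-However, full syntactic parsing must determine all the syntactic information in a sentence and the relations among its constituents, which is a very difficult task. Parsing accuracy with current techniques is not high, and even small parsing errors lead to SRL errors. To reduce the complexity of the problem while still obtaining some syntactic structure, the idea of "shallow syntactic parsing" emerged. Shallow parsing is also called partial parsing or chunking. Unlike full parsing, which produces a complete syntax tree, shallow parsing only needs to identify certain structurally simple, independent constituents of the sentence, such as verb phrases; the identified structures are called chunks. To sidestep the difficulty of not being able to obtain a highly accurate syntax tree, some studies \[[1](#参考文献)\] have proposed chunk-based SRL methods, which treat SRL as a sequence labeling problem. Sequence labeling tasks generally use the BIO representation to define the label set, so we introduce it first. In BIO, B marks the beginning of a chunk, I marks the inside of a chunk, and O marks a position outside any chunk. The three markers B, I and O assign different labels to different chunks. For example, for an argument with role A, the first chunk it contains is labeled B-A, the other chunks it contains are labeled I-A, and chunks that do not belong to any argument are labeled O.
+However, full syntactic parsing must determine all the syntactic information in a sentence and the relations among its constituents, which is a very difficult task. Parsing accuracy with current techniques is not high, and even small parsing errors lead to SRL errors. To reduce the complexity of the problem while still obtaining some syntactic structure, the idea of "shallow syntactic parsing" emerged. Shallow parsing is also called partial parsing or chunking. Unlike full parsing, which produces a complete syntax tree, shallow parsing only needs to identify certain structurally simple, independent constituents of the sentence, such as verb phrases; the identified structures are called chunks. To sidestep the difficulty of not being able to obtain a highly accurate syntax tree, some studies \[[1](#参考文献)\] have proposed chunk-based SRL methods, which treat SRL as a sequence labeling problem. Sequence labeling tasks generally use the BIO representation to define the label set, so we introduce it first. In BIO, B marks the beginning of a chunk, I marks the inside of a chunk, and O marks a position outside any chunk. The three markers B, I and O assign different labels to different chunks. For example, for a group of chunks expanded from role A, the first chunk it contains is labeled B-A, the other chunks it contains are labeled I-A, and chunks that do not belong to any argument are labeled O.
 Continuing with the sentence above, Figure 2 shows the BIO representation.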
-![bioExample](./image/bio_example.png)
 <div  align="center">
+<img src="image/bio_example.png" width = "90%"  align=center /><br>
 图2. BIO标注方法示例
 </div>
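The BIO scheme is easy to express in code. The sketch below assigns tags to a chunked sentence given argument spans; the function name, the `(start, end, role)` span format, and in particular the choice to treat 昨天 and 晚上 as one two-chunk Time argument are assumptions made for illustration and may not match the figure exactly.

```python
def bio_tags(num_chunks, arguments):
    """Assign BIO tags to a sequence of chunks.

    arguments: list of (start, end, role) with end exclusive, indexing into the chunks.
    """
    tags = ["O"] * num_chunks          # default: outside every argument
    for start, end, role in arguments:
        tags[start] = "B-" + role      # first chunk of the argument
        for i in range(start + 1, end):
            tags[i] = "I-" + role      # remaining chunks of the same argument
    return tags


chunks = ["小明", "昨天", "晚上", "在", "公园", "遇到", "了", "小红", "。"]
arguments = [(0, 1, "Agent"), (1, 3, "Time"), (4, 5, "Location"),
             (5, 6, "Predicate"), (7, 8, "Patient")]

tags = bio_tags(len(chunks), arguments)
for chunk, tag in zip(chunks, tags):
    print(chunk, tag)
```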
@@ -44,27 +44,27 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_{\m
 ### 栈式循环神经网络（Stacked Recurrent Neural Network）
-深层网络有助于形成层次化特征，网络上层在下层已经学习到的初级特征基础上，形成更复杂的高级特征。尽管LSTM沿时间轴展开后等价于一个非常“深”的前馈网络，但由于LSTM各个时间步参数共享，`$t-1$`时刻状态到`$t$`时刻的映射，始终只经过了一次非线性映射，也就是说单层LSTM对状态转移的建模是 “浅” 的。堆叠多个LSTM单元，令前一个LSTM`$t$`时刻的输出，成为下一个LSTM单元`$t$`时刻的输入，帮助我们构建起一个深层网络，我们把它称为第一个版本的栈式循环神经网络。深层网络提高了模型拟合复杂模式的能力，能够更好地建模跨不同时间步的模式\[[2](#参考文献)\]。
+深层网络有助于形成层次化特征，网络上层在下层已经学习到的初级特征基础上，形成更复杂的高级特征。尽管LSTM沿时间轴展开后等价于一个非常“深”的前馈网络，但由于LSTM各个时间步参数共享，$t-1$时刻状态到$t$时刻的映射，始终只经过了一次非线性映射，也就是说单层LSTM对状态转移的建模是 “浅” 的。堆叠多个LSTM单元，令前一个LSTM$t$时刻的输出，成为下一个LSTM单元$t$时刻的输入，帮助我们构建起一个深层网络，我们把它称为第一个版本的栈式循环神经网络。深层网络提高了模型拟合复杂模式的能力，能够更好地建模跨不同时间步的模式\[[2](#参考文献)\]。
 然而，训练一个深层LSTM网络并非易事。纵向堆叠多个LSTM单元可能遇到梯度在纵向深度上传播受阻的问题。通常，堆叠4层LSTM单元可以正常训练，当层数达到4~8层时，会出现性能衰减，这时必须考虑一些新的结构以保证梯度纵向顺畅传播，这是训练深层LSTM网络必须解决的问题。我们可以借鉴LSTM解决 “梯度消失梯度爆炸” 问题的智慧之一：在记忆单元（Memory Cell）这条信息传播的路线上没有非线性映射，当梯度反向传播时既不会衰减、也不会爆炸。因此，深层LSTM模型也可以在纵向上添加一条保证梯度顺畅传播的路径。
-一个LSTM单元完成的运算可以被分为三部分：（1）输入到隐层的映射（input-to-hidden） ：每个时间步输入信息`$x$`会首先经过一个矩阵映射，再作为遗忘门，输入门，记忆单元，输出门的输入，注意，这一次映射没有引入非线性激活；（2）隐层到隐层的映射（hidden-to-hidden）：这一步是LSTM计算的主体，包括遗忘门，输入门，记忆单元更新，输出门的计算；（3）隐层到输出的映射（hidden-to-output）：通常是简单的对隐层向量进行激活。我们在第一个版本的栈式网络的基础上，加入一条新的路径：除上一层LSTM输出之外，将前层LSTM的输入到隐层的映射作为的一个新的输入，同时加入一个线性映射去学习一个新的变换。
+一个LSTM单元完成的运算可以被分为三部分：（1）输入到隐层的映射（input-to-hidden） ：每个时间步输入信息$x$会首先经过一个矩阵映射，再作为遗忘门，输入门，记忆单元，输出门的输入，注意，这一次映射没有引入非线性激活；（2）隐层到隐层的映射（hidden-to-hidden）：这一步是LSTM计算的主体，包括遗忘门，输入门，记忆单元更新，输出门的计算；（3）隐层到输出的映射（hidden-to-output）：通常是简单的对隐层向量进行激活。我们在第一个版本的栈式网络的基础上，加入一条新的路径：除上一层LSTM输出之外，将前层LSTM的输入到隐层的映射作为的一个新的输入，同时加入一个线性映射去学习一个新的变换。
 图3是最终得到的栈式循环神经网络结构示意图。
-![lstmStructure](./image/stacked_lstm.png)
 <p align="center">  
+<img src="./image/stacked_lstm.png" width = "40%"  align=center><br>
 图3. 基于LSTM的栈式循环神经网络结构示意图
 </p>
 ### 双向循环神经网络（Bidirectional Recurrent Neural Network）
-在LSTM中，`$t$`时刻的隐藏层向量编码了到`$t$`时刻为止所有输入的信息，但`$t$`时刻的LSTM可以看到历史，却无法看到未来。在绝大多数自然语言处理任务中，我们几乎总是能拿到整个句子。这种情况下，如果能够像获取历史信息一样，得到未来的信息，对序列学习任务会有很大的帮助。
+在LSTM中，$t$时刻的隐藏层向量编码了到$t$时刻为止所有输入的信息，但$t$时刻的LSTM可以看到历史，却无法看到未来。在绝大多数自然语言处理任务中，我们几乎总是能拿到整个句子。这种情况下，如果能够像获取历史信息一样，得到未来的信息，对序列学习任务会有很大的帮助。
-为了克服这一缺陷，我们可以设计一种双向循环网络单元，它的思想简单且直接：对上一节的栈式循环神经网络进行一个小小的修改，堆叠多个LSTM单元，让每一层LSTM单元分别以：正向、反向、正向 …… 的顺序学习上一层的输出序列。于是，从第2层开始，`$t$`时刻我们的LSTM单元便总是可以看到历史和未来的信息。图4是基于LSTM的双向循环神经网络结构示意图。
+为了克服这一缺陷，我们可以设计一种双向循环网络单元，它的思想简单且直接：对上一节的栈式循环神经网络进行一个小小的修改，堆叠多个LSTM单元，让每一层LSTM单元分别以：正向、反向、正向 …… 的顺序学习上一层的输出序列。于是，从第2层开始，$t$时刻我们的LSTM单元便总是可以看到历史和未来的信息。图4是基于LSTM的双向循环神经网络结构示意图。
-![lstmStructure](./image/bidirectional_stacked_lstm.png)
 <p align="center">  
+<img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br>
 图4. 基于LSTM的双向循环神经网络结构示意图
 </p>
@@ -74,56 +74,56 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_{\m
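 The usual way to solve a problem with a neural network is that the earlier layers learn a feature representation of the input and the last layer completes the final task on top of those features. In the SRL task, a deep LSTM network learns the feature representation of the input, and a Conditional Random Field (CRF) at the end of the network performs sequence labeling on top of those features.
-A CRF is a probabilistic structured model that can be viewed as a probabilistic undirected graphical model, in which nodes represent random variables and edges represent probabilistic dependencies between them. Put simply, a CRF learns the conditional probability `$P(Y|X)$`, where `$X = (x_1, x_2, ... , x_n)$` is the input sequence and `$Y = (y_1, y_2, ... , y_n)$` is the label sequence. Decoding means finding, for a given sequence `$X$`, the sequence `$Y$` that maximizes `$P(Y|X)$`, i.e. `$Y^* = \mbox{arg max}_{Y} P(Y | X)$`.
+A CRF is a probabilistic structured model that can be viewed as a probabilistic undirected graphical model, in which nodes represent random variables and edges represent probabilistic dependencies between them. Put simply, a CRF learns the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ is the input sequence and $Y = (y_1, y_2, ... , y_n)$ is the label sequence. Decoding means finding, for a given sequence $X$, the sequence $Y$ that maximizes $P(Y|X)$, i.e. $Y^* = \mbox{arg max}_{Y} P(Y | X)$.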
 序列标注任务只需要考虑输入和输出都是一个线性序列，并且由于我们只是将输入序列作为条件，不做任何条件独立假设，因此输入序列的元素之间并不存在图结构。综上，在序列标注任务中使用的是如图5所示的定义在链式图上的CRF，称之为线性链条件随机场（Linear Chain Conditional Random Field）。
-![linear_chain_crf](./image/linear_chain_crf.png)
 <p align="center">  
+<img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
 图5. 序列标注任务中使用的线性链条件随机场
 </p>
-根据线性链条件随机场上的因子分解定理\[[5](#参考文献)\]，在给定观测序列`$X$`时，一个特定标记序列`$Y$`的概率可以定义为：
+根据线性链条件随机场上的因子分解定理\[[5](#参考文献)\]，在给定观测序列$X$时，一个特定标记序列$Y$的概率可以定义为：
 $$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
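-Here `$Z(X)$` is the normalization factor. `$t_j$` is a feature function defined on edges; it depends on the current and the previous position and is called a transition feature, expressing, for input sequence `$X$`, the transition probability between the labels at positions `$i$` and `$i - 1$` of the label sequence. `$s_k$` is a feature function defined on nodes; it depends only on the current position and is called a state feature, expressing, for observation sequence `$X$`, the probability of the label at position `$i$`. `$\lambda_j$` and `$\mu_k$` are the weights of the transition and state feature functions, respectively. In fact, `$t$` and `$s$` can be written in the same mathematical form. Summing the transition and state features over all positions `$i$` gives `$f_{k}(Y, X) = \sum_{i=1}^{n}f_k({y_{i - 1}, y_i, X, i})$`; calling `$f$` collectively the feature functions, `$P(Y|X)$` can be written as:
+Here $Z(X)$ is the normalization factor. $t_j$ is a feature function defined on edges; it depends on the current and the previous position and is called a transition feature, expressing, for input sequence $X$, the transition probability between the labels at positions $i$ and $i - 1$ of the label sequence. $s_k$ is a feature function defined on nodes; it depends only on the current position and is called a state feature, expressing, for observation sequence $X$, the probability of the label at position $i$. $\lambda_j$ and $\mu_k$ are the weights of the transition and state feature functions, respectively. In fact, $t$ and $s$ can be written in the same mathematical form. Summing the transition and state features over all positions $i$ gives $f_{k}(Y, X) = \sum_{i=1}^{n}f_k({y_{i - 1}, y_i, X, i})$; calling $f$ collectively the feature functions, $P(Y|X)$ can be written as: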
 $$p(Y|X, W) = \frac{1}{Z(X)}\text{exp}\sum_{k}\omega_{k}f_{k}(Y, X)$$
-`$\omega$`是特征函数对应的权值，是CRF模型要学习的参数。训练时，对于给定的输入序列和对应的标记序列集合`$D = \left[(X_1,  Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$` ，通过正则化的极大似然估计，求解如下优化目标：
+$\omega$是特征函数对应的权值，是CRF模型要学习的参数。训练时，对于给定的输入序列和对应的标记序列集合$D = \left[(X_1,  Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$ ，通过正则化的极大似然估计，求解如下优化目标：
 $$\DeclareMathOperator*{\argmax}{arg\,max} L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \frac{1}{2}\lVert W\rVert^{2}$$
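-This objective can be optimized jointly with the whole neural network through backpropagation. At decoding time, given an input sequence `$X$`, a decoding algorithm (typically the Viterbi algorithm or beam search) finds the output sequence `$\bar{Y}$` that maximizes the conditional probability `$\bar{P}(Y|X)$`.
+This objective can be optimized jointly with the whole neural network through backpropagation. At decoding time, given an input sequence $X$, a decoding algorithm (typically the Viterbi algorithm or beam search) finds the output sequence $\bar{Y}$ that maximizes the conditional probability $\bar{P}(Y|X)$.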
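The decoding step can be illustrated with a small Viterbi sketch. This is not PaddlePaddle's implementation; it assumes the state-feature scores are given as an `emissions` matrix and the transition-feature scores as a `transitions` matrix (both names are hypothetical), and it returns the label path with the highest total score.

```python
import numpy as np


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path for a linear-chain model.

    emissions:   array [n, num_tags], score of each tag at each position
    transitions: array [num_tags, num_tags], transitions[a, b] is the
                 score of moving from tag a to tag b
    """
    n, num_tags = emissions.shape
    # score[b] = best score of any path that ends in tag b at the current position
    score = emissions[0].copy()
    backptr = np.zeros((n, num_tags), dtype=int)
    for i in range(1, n):
        # candidate[a, b]: best path ending in a at i-1, then a -> b, then emit b
        candidate = score[:, None] + transitions + emissions[i][None, :]
        backptr[i] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Follow the back-pointers from the best final tag.
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(backptr[i, path[-1]]))
    path.reverse()
    return path, float(score.max())


rng = np.random.RandomState(0)
emissions = rng.randn(5, 3)      # 5 positions, 3 tags (toy sizes)
transitions = rng.randn(3, 3)
best_path, best_score = viterbi_decode(emissions, transitions)
print(best_path, best_score)
```

Because the normalization factor $Z(X)$ is the same for every candidate $Y$, maximizing the unnormalized score is enough to find the most probable label sequence.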
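 ### Deep bidirectional LSTM (DB-LSTM) SRL model
-In the SRL task, the input is a "predicate" and a "sentence"; the goal is to find the arguments of the predicate in the sentence and label their semantic roles. If a sentence contains `$n$` predicates, it is processed `$n$` times. The most straightforward model is the following:
+In the SRL task, the input is a "predicate" and a "sentence"; the goal is to find the arguments of the predicate in the sentence and label their semantic roles. If a sentence contains $n$ predicates, it is processed $n$ times. The most straightforward model is the following:
 1. Construct the inputs;
- Input 1 is the predicate and input 2 is the sentence
+ - Input 1 is the predicate and input 2 is the sentence
- Expand input 1 into a sequence as long as input 2, represented in one-hot form;
+ - Expand input 1 into a sequence as long as input 2, represented in one-hot form;
 2. The one-hot predicate sequence and sentence sequence are converted, through a vocabulary lookup, into sequences of real-valued word vectors;
 3. The 2 word-vector sequences from step 2 are fed to a bidirectional LSTM, which learns a feature representation of the input sequences;
 4. The CRF takes the features learned by the model in step 3 as input and the label sequence as the supervision signal, and performs sequence labeling;
 You can try the approach above. Here we propose some improvements and introduce two features that are simple yet very effective at improving system performance:
- Predicate context: the method above uses only the predicate's word vector to express all predicate-related information, which is always a weak signal, and ambiguity can arise in particular when the predicate occurs more than once in the sentence. Experience suggests that a small window of a few words around the predicate provides richer information and helps resolve the ambiguity. We therefore add this to the model and extract a "predicate context" window for each predicate, made up of the `$n$` words before and the `$n$` words after it;
+- Predicate context: the method above uses only the predicate's word vector to express all predicate-related information, which is always a weak signal, and ambiguity can arise in particular when the predicate occurs more than once in the sentence. Experience suggests that a small window of a few words around the predicate provides richer information and helps resolve the ambiguity. We therefore add this to the model and extract a "predicate context" window for each predicate, made up of the $n$ words before and the $n$ words after it;
 - Predicate context region mark: a 0-1 binary variable is introduced for every word in the sentence, indicating whether the word lies inside the "predicate context" window;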
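The two features above can be sketched in a few lines. The helper name `predicate_context` and the padding tokens `"bos"` and `"eos"` are assumptions made for illustration; the window includes the predicate itself, so with two words on each side it has 5 columns.

```python
def predicate_context(words, pred_index, n=2, pad_bos="bos", pad_eos="eos"):
    """Return the context window around a predicate and the 0-1 region mark."""
    context = []
    mark = [0] * len(words)
    for offset in range(-n, n + 1):
        pos = pred_index + offset
        if pos < 0:
            context.append(pad_bos)      # window runs past the sentence start
        elif pos >= len(words):
            context.append(pad_eos)      # window runs past the sentence end
        else:
            context.append(words[pos])
            mark[pos] = 1                # this word lies inside the window
    return context, mark


words = ["A", "record", "date", "has", "n't", "been", "set", "."]
context, mark = predicate_context(words, pred_index=6, n=2)
print(context)  # ["n't", 'been', 'set', '.', 'eos']
print(mark)     # [0, 0, 0, 0, 1, 1, 1, 1]
```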
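 The modified model is as follows (Figure 6 shows the structure of a model of depth 4):
 1. Construct the inputs
- Input 1 is the sentence sequence, input 2 is the predicate sequence, input 3 is the predicate context, built from the `$n$` words before and after the predicate in the sentence and represented in one-hot form, and input 4 is the predicate context region mark, which marks whether each word in the sentence lies inside the predicate context;
+ - Input 1 is the sentence sequence, input 2 is the predicate sequence, input 3 is the predicate context, built from the $n$ words before and after the predicate in the sentence and represented in one-hot form, and input 4 is the predicate context region mark, which marks whether each word in the sentence lies inside the predicate context;
- Expand inputs 2~3 into sequences as long as input 1;
+ - Expand inputs 2~3 into sequences as long as input 1;
 2. Inputs 1~4 are all converted into sequences of real-valued word vectors by vocabulary lookup; inputs 1 and 3 share one vocabulary, while inputs 2 and 4 each have their own;
 3. The 4 word-vector sequences from step 2 are fed to the bidirectional LSTM model, which learns a feature representation of the input sequences and produces a new sequence of feature representations;
 4. The CRF takes the features learned by the LSTM in step 3 as input and the label sequence as the supervision signal, and performs sequence labeling;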
-![db_lstm_network](./image/db_lstm_network.png)
 <div  align="center">  
+<img src="image/db_lstm_network.png" width = "60%"  align=center /><br>
 Figure 6. The deep bidirectional LSTM model for the SRL task
 </div>
@@ -137,8 +137,8 @@ $$\DeclareMathOperator*{\argmax}{arg\,max} L(\lambda, D) = - \text{log}\left(\pr
 ```text
 conll05st-release/
 └── test.wsj
-├── props  # annotation results
+    ├── props  # annotation results
-└── words  # input text sequences
+    └── words  # input text sequences
 ```
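 The annotations come from Penn TreeBank\[[7](#参考文献)\] and PropBank\[[8](#参考文献)\]. The labels in the PropBank annotations differ from those used in the example at the beginning of this article, but the principle is the same; for the meaning of the labels, see the paper\[[9](#参考文献)\].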
@@ -146,19 +146,11 @@ conll05st-release/
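 The raw data must be preprocessed before PaddlePaddle can use it. Preprocessing consists of the following steps:
 1. Merge the text sequence and the label sequence into a single record;
-2. If a sentence contains `$n$` predicates, it is processed `$n$` times and becomes `$n$` independent training samples, each with a different predicate;
+2. If a sentence contains $n$ predicates, it is processed $n$ times and becomes $n$ independent training samples, each with a different predicate;
 3. Extract the predicate context and build the predicate context region mark;
 4. Build the labels in BIO notation;
 5. Look up the integer index of each word in the dictionary.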
-```python
-# import paddle.v2.dataset.conll05 as conll05
-# The conll05.corpus_reader function performs steps 1 and 2 above.
-# The conll05.reader_creator function performs steps 3 to 5 above.
-# The conll05.test function returns each processed sample for PaddlePaddle to train on.
-```
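Steps 4 and 5 of the preprocessing can be sketched as follows. This is an illustration only, not the `conll05` reader: the helper `spans_to_bio`, the span format `(start, end, role)` with an exclusive `end`, and the toy label dictionary are all assumptions.

```python
def spans_to_bio(length, spans):
    """Turn labelled spans into one BIO label per token."""
    labels = ["O"] * length                  # O: outside any argument
    for start, end, role in spans:           # end is exclusive
        labels[start] = "B-" + role          # B: first token of the span
        for i in range(start + 1, end):
            labels[i] = "I-" + role          # I: remaining tokens of the span
    return labels


labels = spans_to_bio(8, [(0, 3, "A1"), (4, 5, "AM-NEG"), (6, 7, "V")])
print(labels)
# ['B-A1', 'I-A1', 'I-A1', 'O', 'B-AM-NEG', 'O', 'B-V', 'O']

# Step 5: map each label to an integer index with a (toy) dictionary.
label_to_id = {label: idx for idx, label in enumerate(sorted(set(labels)))}
label_ids = [label_to_id[label] for label in labels]
print(label_ids)
```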
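 After preprocessing, each training sample contains 9 features: the sentence sequence, the predicate, the predicate context (5 columns), the predicate context region mark, and the label sequence. The table below shows one training sample.
 | Sentence sequence | Predicate | Predicate context (window = 5) | Predicate context region mark | Label sequence |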
@@ -187,6 +179,8 @@ conll05st-release/
 Get the dictionaries and print their sizes:
 ```python
+from __future__ import print_function
 import math, os
 import numpy as np
 import paddle
@@ -201,9 +195,9 @@ word_dict_len = len(word_dict)
 label_dict_len = len(label_dict)
 pred_dict_len = len(verb_dict)
-print word_dict_len
+print('word_dict_len: ', word_dict_len)
-print label_dict_len
+print('label_dict_len: ', label_dict_len)
-print pred_dict_len
+print('pred_dict_len: ', pred_dict_len)
 ```
 ## Model configuration
@@ -232,96 +226,96 @@ embedding_name = 'emb'
 ```python
 # Load the binary model saved by an earlier version of PaddlePaddle
 def load_parameter(file_name, h, w):
-with open(file_name, 'rb') as f:
+    with open(file_name, 'rb') as f:
-f.read(16)  # skip header.
+        f.read(16)  # skip header.
-return np.fromfile(f, dtype=np.float32).reshape(h, w)
+        return np.fromfile(f, dtype=np.float32).reshape(h, w)
 ```
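To see what `load_parameter` expects, the sketch below writes a small file with a 16-byte header followed by raw `float32` values and reads it back. The header bytes and the file name `emb.bin` are placeholders chosen for this illustration; the real header layout is not described here.

```python
import os
import tempfile

import numpy as np


def load_parameter(file_name, h, w):
    with open(file_name, "rb") as f:
        f.read(16)  # skip the 16-byte header
        return np.fromfile(f, dtype=np.float32).reshape(h, w)


expected = np.arange(6, dtype=np.float32).reshape(2, 3)
path = os.path.join(tempfile.mkdtemp(), "emb.bin")   # hypothetical file name
with open(path, "wb") as f:
    f.write(b"\0" * 16)          # placeholder header, contents are made up
    f.write(expected.tobytes())  # row-major float32 payload

loaded = load_parameter(path, 2, 3)
print(loaded)
```

The reshape only succeeds when the payload holds exactly `h * w` values, so a shape mismatch with the saved embedding shows up immediately as an error.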
 - 8 LSTM units learn from all input sequences in alternating forward/backward order.
 ```python  
 def db_lstm(word, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark,
-**ignored):
+            **ignored):
-# 8 features
+    # 8 features
-predicate_embedding = fluid.layers.embedding(
+    predicate_embedding = fluid.layers.embedding(
-input=predicate,
+        input=predicate,
-size=[pred_dict_len, word_dim],
+        size=[pred_dict_len, word_dim],
-dtype='float32',
+        dtype='float32',
-is_sparse=IS_SPARSE,
+        is_sparse=IS_SPARSE,
-param_attr='vemb')
+        param_attr='vemb')
-mark_embedding = fluid.layers.embedding(
+    mark_embedding = fluid.layers.embedding(
-input=mark,
+        input=mark,
-size=[mark_dict_len, mark_dim],
+        size=[mark_dict_len, mark_dim],
-dtype='float32',
+        dtype='float32',
-is_sparse=IS_SPARSE)
+        is_sparse=IS_SPARSE)
-word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
+    word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
-# Since word vector lookup table is pre-trained, we won't update it this time.
+    # Since word vector lookup table is pre-trained, we won't update it this time.
-# trainable being False prevents updating the lookup table during training.
+    # trainable being False prevents updating the lookup table during training.
-emb_layers = [
+    emb_layers = [
-fluid.layers.embedding(
+        fluid.layers.embedding(
-size=[word_dict_len, word_dim],
+            size=[word_dict_len, word_dim],
-input=x,
+            input=x,
-param_attr=fluid.ParamAttr(
+            param_attr=fluid.ParamAttr(
-name=embedding_name, trainable=False)) for x in word_input
+                name=embedding_name, trainable=False)) for x in word_input
-]
+    ]
-emb_layers.append(predicate_embedding)
+    emb_layers.append(predicate_embedding)
-emb_layers.append(mark_embedding)
+    emb_layers.append(mark_embedding)
-# 8 LSTM units are trained through alternating left-to-right / right-to-left order
+    # 8 LSTM units are trained through alternating left-to-right / right-to-left order
-# denoted by the variable `reverse`.
+    # denoted by the variable `reverse`.
-hidden_0_layers = [
+    hidden_0_layers = [
-fluid.layers.fc(input=emb, size=hidden_dim, act='tanh')
+        fluid.layers.fc(input=emb, size=hidden_dim, act='tanh')
-for emb in emb_layers
+        for emb in emb_layers
-]
+    ]
-hidden_0 = fluid.layers.sums(input=hidden_0_layers)
+    hidden_0 = fluid.layers.sums(input=hidden_0_layers)
-lstm_0 = fluid.layers.dynamic_lstm(
+    lstm_0 = fluid.layers.dynamic_lstm(
-input=hidden_0,
+        input=hidden_0,
-size=hidden_dim,
+        size=hidden_dim,
-candidate_activation='relu',
+        candidate_activation='relu',
-gate_activation='sigmoid',
+        gate_activation='sigmoid',
-cell_activation='sigmoid')
+        cell_activation='sigmoid')
-# stack L-LSTM and R-LSTM with direct edges
+    # stack L-LSTM and R-LSTM with direct edges
-input_tmp = [hidden_0, lstm_0]
+    input_tmp = [hidden_0, lstm_0]
-# In PaddlePaddle, state features and transition features of a CRF are implemented
+    # In PaddlePaddle, state features and transition features of a CRF are implemented
-# by a fully connected layer and a CRF layer separately. The fully connected layer
+    # by a fully connected layer and a CRF layer separately. The fully connected layer
-# with linear activation learns the state features, here we use fluid.layers.sums
+    # with linear activation learns the state features, here we use fluid.layers.sums
-# (fluid.layers.fc can be used as well), and the CRF layer in PaddlePaddle:
+    # (fluid.layers.fc can be used as well), and the CRF layer in PaddlePaddle:
-# fluid.layers.linear_chain_crf only
+    # fluid.layers.linear_chain_crf only
-# learns the transition features, which is a cost layer and is the last layer of the network.
+    # learns the transition features, which is a cost layer and is the last layer of the network.
-# fluid.layers.linear_chain_crf outputs the log probability of true tag sequence
+    # fluid.layers.linear_chain_crf outputs the log probability of true tag sequence
-# as the cost by given the input sequence and it requires the true tag sequence
+    # as the cost by given the input sequence and it requires the true tag sequence
-# as target in the learning process.
+    # as target in the learning process.
-for i in range(1, depth):
+    for i in range(1, depth):
-mix_hidden = fluid.layers.sums(input=[
+        mix_hidden = fluid.layers.sums(input=[
-fluid.layers.fc(input=input_tmp[0], size=hidden_dim, act='tanh'),
+            fluid.layers.fc(input=input_tmp[0], size=hidden_dim, act='tanh'),
-fluid.layers.fc(input=input_tmp[1], size=hidden_dim, act='tanh')
+            fluid.layers.fc(input=input_tmp[1], size=hidden_dim, act='tanh')
-])
+        ])
-lstm = fluid.layers.dynamic_lstm(
+        lstm = fluid.layers.dynamic_lstm(
-input=mix_hidden,
+            input=mix_hidden,
-size=hidden_dim,
+            size=hidden_dim,
-candidate_activation='relu',
+            candidate_activation='relu',
-gate_activation='sigmoid',
+            gate_activation='sigmoid',
-cell_activation='sigmoid',
+            cell_activation='sigmoid',
-is_reverse=((i % 2) == 1))
+            is_reverse=((i % 2) == 1))
-input_tmp = [mix_hidden, lstm]
+        input_tmp = [mix_hidden, lstm]
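-# Take the output of the last stacked LSTM and the input-to-hidden projection of that LSTM unit,
+    # Take the output of the last stacked LSTM and the input-to-hidden projection of that LSTM unit,
-# and map them through a fully connected layer to the size of the label dictionary to learn the CRF state features
+    # and map them through a fully connected layer to the size of the label dictionary to learn the CRF state features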
-feature_out = fluid.layers.sums(input=[
+    feature_out = fluid.layers.sums(input=[
-fluid.layers.fc(input=input_tmp[0], size=label_dict_len, act='tanh'),
+        fluid.layers.fc(input=input_tmp[0], size=label_dict_len, act='tanh'),
-fluid.layers.fc(input=input_tmp[1], size=label_dict_len, act='tanh')
+        fluid.layers.fc(input=input_tmp[1], size=label_dict_len, act='tanh')
-])
+    ])
-return feature_out
+    return feature_out
 ```
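The alternating direction of the stacked layers follows from the `is_reverse=((i % 2) == 1)` argument. The short sketch below lists the direction of each layer; it assumes `depth = 8`, taken from the "8 LSTM units" in the prose, and that the first layer `lstm_0` runs forward because it is created without `is_reverse`.

```python
def lstm_directions(depth):
    """List the reading direction of each stacked LSTM layer."""
    directions = ["forward"]                 # lstm_0: no is_reverse argument
    for i in range(1, depth):
        is_reverse = (i % 2) == 1            # same test as in db_lstm
        directions.append("reverse" if is_reverse else "forward")
    return directions


for layer, direction in enumerate(lstm_directions(8)):
    print(layer, direction)
```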
 ## Training the model
@@ -338,116 +332,116 @@ return feature_out
 ```python
 def train(use_cuda, save_dirname=None, is_local=True):
-# define network topology
+    # define network topology
-# sentence sequence
+    # sentence sequence
-word = fluid.layers.data(
+    word = fluid.layers.data(
-name='word_data', shape=[1], dtype='int64', lod_level=1)
+        name='word_data', shape=[1], dtype='int64', lod_level=1)
-# predicate
+    # predicate
-predicate = fluid.layers.data(
+    predicate = fluid.layers.data(
-name='verb_data', shape=[1], dtype='int64', lod_level=1)
+        name='verb_data', shape=[1], dtype='int64', lod_level=1)
-# 5 predicate context features
+    # 5 predicate context features
-ctx_n2 = fluid.layers.data(
+    ctx_n2 = fluid.layers.data(
-name='ctx_n2_data', shape=[1], dtype='int64', lod_level=1)
+        name='ctx_n2_data', shape=[1], dtype='int64', lod_level=1)
-ctx_n1 = fluid.layers.data(
+    ctx_n1 = fluid.layers.data(
-name='ctx_n1_data', shape=[1], dtype='int64', lod_level=1)
+        name='ctx_n1_data', shape=[1], dtype='int64', lod_level=1)
-ctx_0 = fluid.layers.data(
+    ctx_0 = fluid.layers.data(
-name='ctx_0_data', shape=[1], dtype='int64', lod_level=1)
+        name='ctx_0_data', shape=[1], dtype='int64', lod_level=1)
-ctx_p1 = fluid.layers.data(
+    ctx_p1 = fluid.layers.data(
-name='ctx_p1_data', shape=[1], dtype='int64', lod_level=1)
+        name='ctx_p1_data', shape=[1], dtype='int64', lod_level=1)
-ctx_p2 = fluid.layers.data(
+    ctx_p2 = fluid.layers.data(
-name='ctx_p2_data', shape=[1], dtype='int64', lod_level=1)
+        name='ctx_p2_data', shape=[1], dtype='int64', lod_level=1)
-# predicate context region mark
+    # predicate context region mark
-mark = fluid.layers.data(
+    mark = fluid.layers.data(
-name='mark_data', shape=[1], dtype='int64', lod_level=1)
+        name='mark_data', shape=[1], dtype='int64', lod_level=1)
-# define network topology
+    # define network topology
-feature_out = db_lstm(**locals())
+    feature_out = db_lstm(**locals())
-# label sequence
+    # label sequence
-target = fluid.layers.data(
+    target = fluid.layers.data(
-name='target', shape=[1], dtype='int64', lod_level=1)
+        name='target', shape=[1], dtype='int64', lod_level=1)
-# learn the CRF transition features
+    # learn the CRF transition features
-crf_cost = fluid.layers.linear_chain_crf(
+    crf_cost = fluid.layers.linear_chain_crf(
-input=feature_out,
+        input=feature_out,
-label=target,
+        label=target,
-param_attr=fluid.ParamAttr(
+        param_attr=fluid.ParamAttr(
-name='crfw', learning_rate=mix_hidden_lr))
+            name='crfw', learning_rate=mix_hidden_lr))
-avg_cost = fluid.layers.mean(crf_cost)
+    avg_cost = fluid.layers.mean(crf_cost)
-sgd_optimizer = fluid.optimizer.SGD(
+    sgd_optimizer = fluid.optimizer.SGD(
-learning_rate=fluid.layers.exponential_decay(
+        learning_rate=fluid.layers.exponential_decay(
-learning_rate=0.01,
+            learning_rate=0.01,
-decay_steps=100000,
+            decay_steps=100000,
-decay_rate=0.5,
+            decay_rate=0.5,
-staircase=True))
+            staircase=True))
-sgd_optimizer.minimize(avg_cost)
+    sgd_optimizer.minimize(avg_cost)
-# The CRF decoding layer is used for evaluation and inference.
+    # The CRF decoding layer is used for evaluation and inference.
-# It shares weights with CRF layer.  The sharing of parameters among multiple layers
+    # It shares weights with CRF layer.  The sharing of parameters among multiple layers
-# is specified by using the same parameter name in these layers. If true tag sequence
+    # is specified by using the same parameter name in these layers. If true tag sequence
-# is provided in training process, `fluid.layers.crf_decoding` calculates labelling error
+    # is provided in training process, `fluid.layers.crf_decoding` calculates labelling error
-# for each input token and sums the error over the entire sequence.
+    # for each input token and sums the error over the entire sequence.
-# Otherwise, `fluid.layers.crf_decoding`  generates the labelling tags.
+    # Otherwise, `fluid.layers.crf_decoding`  generates the labelling tags.
-crf_decode = fluid.layers.crf_decoding(
+    crf_decode = fluid.layers.crf_decoding(
-input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
+        input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
-train_data = paddle.batch(
+    train_data = paddle.batch(
-paddle.reader.shuffle(
+        paddle.reader.shuffle(
-paddle.dataset.conll05.test(), buf_size=8192),
+            paddle.dataset.conll05.test(), buf_size=8192),
-batch_size=BATCH_SIZE)
+        batch_size=BATCH_SIZE)
-place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
-feeder = fluid.DataFeeder(
+    feeder = fluid.DataFeeder(
-feed_list=[
+        feed_list=[
-word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, predicate, mark, target
+            word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, predicate, mark, target
-],
+        ],
-place=place)
+        place=place)
-exe = fluid.Executor(place)
+    exe = fluid.Executor(place)
-def train_loop(main_program):
+    def train_loop(main_program):
-exe.run(fluid.default_startup_program())
+        exe.run(fluid.default_startup_program())
-embedding_param = fluid.global_scope().find_var(
+        embedding_param = fluid.global_scope().find_var(
-embedding_name).get_tensor()
+            embedding_name).get_tensor()
-embedding_param.set(
+        embedding_param.set(
-load_parameter(conll05.get_embedding(), word_dict_len, word_dim),
+            load_parameter(conll05.get_embedding(), word_dict_len, word_dim),
-place)
+            place)
-start_time = time.time()
+        start_time = time.time()
-batch_id = 0
+        batch_id = 0
-for pass_id in range(PASS_NUM):
+        for pass_id in range(PASS_NUM):
-for data in train_data():
+            for data in train_data():
-cost = exe.run(main_program,
+                cost = exe.run(main_program,
-feed=feeder.feed(data),
+                               feed=feeder.feed(data),
-fetch_list=[avg_cost])
+                               fetch_list=[avg_cost])
-cost = cost[0]
+                cost = cost[0]
-if batch_id % 10 == 0:
+                if batch_id % 10 == 0:
-print("avg_cost:" + str(cost))
+                    print("avg_cost: " + str(cost))
-if batch_id != 0:
+                    if batch_id != 0:
-print("second per batch: " + str((time.time(
+                        print("second per batch: " + str((time.time(
-) - start_time) / batch_id))
+                        ) - start_time) / batch_id))
-# Set the threshold low to speed up the CI test
+                    # Set the threshold low to speed up the CI test
-if float(cost) < 60.0:
+                    if float(cost) < 60.0:
-if save_dirname is not None:
+                        if save_dirname is not None:
-fluid.io.save_inference_model(save_dirname, [
+                            fluid.io.save_inference_model(save_dirname, [
-'word_data', 'verb_data', 'ctx_n2_data',
+                                'word_data', 'verb_data', 'ctx_n2_data',
-'ctx_n1_data', 'ctx_0_data', 'ctx_p1_data',
+                                'ctx_n1_data', 'ctx_0_data', 'ctx_p1_data',
-'ctx_p2_data', 'mark_data'
+                                'ctx_p2_data', 'mark_data'
-], [feature_out], exe)
+                            ], [feature_out], exe)
-return
+                        return
-batch_id = batch_id + 1
+                batch_id = batch_id + 1
-train_loop(fluid.default_main_program())
+    train_loop(fluid.default_main_program())
 ```
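The learning-rate schedule passed to the optimizer can be reproduced in plain Python. This is a sketch of the usual staircase exponential decay formula, `lr = base_lr * decay_rate ** floor(step / decay_steps)`, with the values from the call above; the function name is hypothetical.

```python
import math


def staircase_exponential_decay(step, base_lr=0.01, decay_steps=100000,
                                decay_rate=0.5, staircase=True):
    """Learning rate after `step` updates under exponential decay."""
    exponent = step / float(decay_steps)
    if staircase:
        exponent = math.floor(exponent)   # decay in discrete jumps
    return base_lr * decay_rate ** exponent


for step in (0, 99999, 100000, 250000):
    print(step, staircase_exponential_decay(step))
```

With `staircase=True` the rate stays at 0.01 for the first 100000 steps and is then halved every further 100000 steps.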
@@ -457,92 +451,92 @@ train_loop(fluid.default_main_program())
 ```python
 def infer(use_cuda, save_dirname=None):
-if save_dirname is None:
+    if save_dirname is None:
-return
+        return
-place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
-exe = fluid.Executor(place)
+    exe = fluid.Executor(place)
-inference_scope = fluid.core.Scope()
+    inference_scope = fluid.core.Scope()
-with fluid.scope_guard(inference_scope):
+    with fluid.scope_guard(inference_scope):
-# Use fluid.io.load_inference_model to obtain the inference program desc,
+        # Use fluid.io.load_inference_model to obtain the inference program desc,
-# the feed_target_names (the names of variables that will be fed
+        # the feed_target_names (the names of variables that will be fed
-# data using feed operators), and the fetch_targets (variables that
+        # data using feed operators), and the fetch_targets (variables that
-# we want to obtain data from using fetch operators).
+        # we want to obtain data from using fetch operators).
-[inference_program, feed_target_names,
+        [inference_program, feed_target_names,
-fetch_targets] = fluid.io.load_inference_model(save_dirname, exe)
+         fetch_targets] = fluid.io.load_inference_model(save_dirname, exe)
-# Setup inputs by creating LoDTensors to represent sequences of words.
+        # Setup inputs by creating LoDTensors to represent sequences of words.
-# Here each word is the basic element of these LoDTensors and the shape of
+        # Here each word is the basic element of these LoDTensors and the shape of
-# each word (base_shape) should be [1] since it is simply an index to
+        # each word (base_shape) should be [1] since it is simply an index to
-# look up for the corresponding word vector.
+        # look up for the corresponding word vector.
-# Suppose the length_based level of detail (lod) info is set to [[3, 4, 2]],
+        # Suppose the length_based level of detail (lod) info is set to [[3, 4, 2]],
-# which has only one lod level. Then the created LoDTensors will have only
+        # which has only one lod level. Then the created LoDTensors will have only
-# one higher level structure (sequence of words, or sentence) than the basic
+        # one higher level structure (sequence of words, or sentence) than the basic
-# element (word). Hence the LoDTensor will hold data for three sentences of
+        # element (word). Hence the LoDTensor will hold data for three sentences of
-# length 3, 4 and 2, respectively.
+        # length 3, 4 and 2, respectively.
-# Note that lod info should be a list of lists.
+        # Note that lod info should be a list of lists.
-lod = [[3, 4, 2]]
+        lod = [[3, 4, 2]]
-base_shape = [1]
+        base_shape = [1]
-# The range of random integers is [low, high]
+        # The range of random integers is [low, high]
-word = fluid.create_random_int_lodtensor(
+        word = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=word_dict_len - 1)
+            lod, base_shape, place, low=0, high=word_dict_len - 1)
-pred = fluid.create_random_int_lodtensor(
+        pred = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=pred_dict_len - 1)
+            lod, base_shape, place, low=0, high=pred_dict_len - 1)
-ctx_n2 = fluid.create_random_int_lodtensor(
+        ctx_n2 = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=word_dict_len - 1)
+            lod, base_shape, place, low=0, high=word_dict_len - 1)
-ctx_n1 = fluid.create_random_int_lodtensor(
+        ctx_n1 = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=word_dict_len - 1)
+            lod, base_shape, place, low=0, high=word_dict_len - 1)
-ctx_0 = fluid.create_random_int_lodtensor(
+        ctx_0 = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=word_dict_len - 1)
+            lod, base_shape, place, low=0, high=word_dict_len - 1)
-ctx_p1 = fluid.create_random_int_lodtensor(
+        ctx_p1 = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=word_dict_len - 1)
+            lod, base_shape, place, low=0, high=word_dict_len - 1)
-ctx_p2 = fluid.create_random_int_lodtensor(
+        ctx_p2 = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=word_dict_len - 1)
+            lod, base_shape, place, low=0, high=word_dict_len - 1)
-mark = fluid.create_random_int_lodtensor(
+        mark = fluid.create_random_int_lodtensor(
-lod, base_shape, place, low=0, high=mark_dict_len - 1)
+            lod, base_shape, place, low=0, high=mark_dict_len - 1)
-# Construct feed as a dictionary of {feed_target_name: feed_target_data}
+        # Construct feed as a dictionary of {feed_target_name: feed_target_data}
-# and results will contain a list of data corresponding to fetch_targets.
+        # and results will contain a list of data corresponding to fetch_targets.
-assert feed_target_names[0] == 'word_data'
+        assert feed_target_names[0] == 'word_data'
-assert feed_target_names[1] == 'verb_data'
+        assert feed_target_names[1] == 'verb_data'
-assert feed_target_names[2] == 'ctx_n2_data'
+        assert feed_target_names[2] == 'ctx_n2_data'
-assert feed_target_names[3] == 'ctx_n1_data'
+        assert feed_target_names[3] == 'ctx_n1_data'
-assert feed_target_names[4] == 'ctx_0_data'
+        assert feed_target_names[4] == 'ctx_0_data'
-assert feed_target_names[5] == 'ctx_p1_data'
+        assert feed_target_names[5] == 'ctx_p1_data'
-assert feed_target_names[6] == 'ctx_p2_data'
+        assert feed_target_names[6] == 'ctx_p2_data'
-assert feed_target_names[7] == 'mark_data'
+        assert feed_target_names[7] == 'mark_data'
-results = exe.run(inference_program,
+        results = exe.run(inference_program,
-feed={
+                          feed={
-feed_target_names[0]: word,
+                              feed_target_names[0]: word,
-feed_target_names[1]: pred,
+                              feed_target_names[1]: pred,
-feed_target_names[2]: ctx_n2,
+                              feed_target_names[2]: ctx_n2,
-feed_target_names[3]: ctx_n1,
+                              feed_target_names[3]: ctx_n1,
-feed_target_names[4]: ctx_0,
+                              feed_target_names[4]: ctx_0,
-feed_target_names[5]: ctx_p1,
+                              feed_target_names[5]: ctx_p1,
-feed_target_names[6]: ctx_p2,
+                              feed_target_names[6]: ctx_p2,
-feed_target_names[7]: mark
+                              feed_target_names[7]: mark
-},
+                          },
-fetch_list=fetch_targets,
+                          fetch_list=fetch_targets,
-return_numpy=False)
+                          return_numpy=False)
-print(results[0].lod())
+        print(results[0].lod())
-np_data = np.array(results[0])
+        np_data = np.array(results[0])
-print("Inference Shape: ", np_data.shape)
+        print("Inference Shape: ", np_data.shape)
 ```
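The length-based LoD `[[3, 4, 2]]` used in `infer` can be pictured with plain lists. The sketch below converts the lengths to cumulative offsets and splits a flat list of 9 word ids into three sentences; the helper names are hypothetical and nothing here calls PaddlePaddle.

```python
def lengths_to_offsets(lengths):
    """[3, 4, 2] -> [0, 3, 7, 9]: where each sequence starts and ends."""
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets


def split_by_lod(flat, lengths):
    """Cut a flat list of elements into sequences of the given lengths."""
    offsets = lengths_to_offsets(lengths)
    return [flat[offsets[i]:offsets[i + 1]] for i in range(len(lengths))]


lod = [[3, 4, 2]]                  # one LoD level, three sentences
word_ids = list(range(9))          # 3 + 4 + 2 = 9 word indices in total
print(lengths_to_offsets(lod[0]))  # [0, 3, 7, 9]
print(split_by_lod(word_ids, lod[0]))
# [[0, 1, 2], [3, 4, 5, 6], [7, 8]]
```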
 The entry point of the whole program is as follows:
 ```python
 def main(use_cuda, is_local=True):
-if use_cuda and not fluid.core.is_compiled_with_cuda():
+    if use_cuda and not fluid.core.is_compiled_with_cuda():
-return
+        return
-# Directory for saving the trained model
+    # Directory for saving the trained model
-save_dirname = "label_semantic_roles.inference.model"
+    save_dirname = "label_semantic_roles.inference.model"
-train(use_cuda, save_dirname, is_local)
+    train(use_cuda, save_dirname, is_local)
-infer(use_cuda, save_dirname)
+    infer(use_cuda, save_dirname)
 main(use_cuda=False)

--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/bidirectional_stacked_lstm_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/bidirectional_stacked_lstm_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/bio_example.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/bio_example.png
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/bio_example_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/bio_example_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/db_lstm_network_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/db_lstm_network_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/dependency_parsing.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/dependency_parsing.png
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/dependency_parsing_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/dependency_parsing_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/stacked_lstm_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/label_semantic_roles/image/stacked_lstm_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/.gitignore
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/.gitignore
-data/wmt14
-data/pre-wmt14
-pretrained/wmt14_model
-gen.log
-gen_result
-train.log
-dataprovider_copy_1.py
-*.pyc
-multi-bleu.perl
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/index.md
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/index.md
@@ -30,7 +30,9 @@
 1 -6.23177   These are the light of hope and relief . <e>
 2 -7.7914  These are the light of hope and the relief of hope . <e>
 ```
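 - The first column from the left is the index of the generated sentence; the second column is the score of that sentence (sorted from high to low), where higher is better; the third column is the generated English sentence.
 - There are also two special tokens: `<e>` marks the end of a sentence, and `<unk>` stands for an unknown word, that is, a word that does not appear in the training dictionary.
 ## Model overview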
@@ -78,18 +80,15 @@
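 During training for machine translation, the goal of the decoding stage is to maximize the probability of the next correct target-language word. The idea is:
 1. At each time step, compute the next hidden state `$z_{i+1}$` from the encoding of the source sentence (also called the context vector) `$c$`, the `$i$`-th word `$u_i$` of the ground-truth target sequence, and the RNN hidden state `$z_i$` at time `$i$`. The formula is: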
+$$z_{i+1}=\phi_{\theta '} \left ( c,u_i,z_i \right )$$
-$$z_{i+1}=\phi _{\theta '}\left ( c,u_i,z_i \right )$$
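 where `$\phi _{\theta '}$` is a nonlinear activation function and `$c=q\mathbf{h}$` is the context vector of the source sentence. When the [attention mechanism](#注意力机制) is not used and the output of the [encoder](#编码器) is the last element of the encoded source sentence, one can define `$c=h_T$`. `$u_i$` is the `$i$`-th word of the target sequence, and `$u_0$` is the start token `<s>` of the target sequence, which signals the start of decoding. `$z_i$` is the hidden state of the decoder RNN at time `$i$`, and `$z_0$` is an all-zero vector.
 2. Normalize `$z_{i+1}$` with `softmax` to obtain the probability distribution `$p_{i+1}$` over the `$i+1$`-th word of the target sequence. The distribution is: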
 $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
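 where `$W_sz_{i+1}+b_z$` scores every possible output word, and normalizing with softmax gives the probability `$p_{i+1}$` of the `$i+1$`-th word.
 3. Compute the cost from `$p_{i+1}$` and `$u_{i+1}$`.
 4. Repeat steps 1~3 until every word in the target sequence has been processed.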
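Steps 2 and 3 can be sketched with NumPy. The shapes and random values below are toy assumptions, and the cost is taken to be the cross-entropy (negative log-probability of the correct word), which the text does not spell out.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def step_cost(z_next, W_s, b_z, target_id):
    """Distribution over the next word and the cost of the correct word."""
    p_next = softmax(W_s.dot(z_next) + b_z)   # p_{i+1} = softmax(W_s z_{i+1} + b_z)
    cost = -np.log(p_next[target_id])         # assumed cross-entropy cost
    return p_next, cost


rng = np.random.RandomState(0)
hidden_size, vocab_size = 4, 6                # toy sizes
z_next = rng.randn(hidden_size)
W_s = rng.randn(vocab_size, hidden_size)
b_z = np.zeros(vocab_size)
p_next, cost = step_cost(z_next, W_s, b_z, target_id=2)
print(p_next.sum(), cost)
```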
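 Generation in machine translation means, informally, translating a source sentence with a previously trained model. The decoding stage during generation differs from that of the training process described above; see [beam search](#柱搜索算法) for details.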
@@ -103,8 +102,11 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
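 In the decoding stage with beam search, the goal is to maximize the probability of the generated sequence. The idea is:
 1. At each time step, compute the next hidden state `$z_{i+1}$` from the encoding `$c$` of the source sentence, the `$i$`-th generated target word `$u_i$`, and the RNN hidden state `$z_i$` at time `$i$`.
 2. Normalize `$z_{i+1}$` with `softmax` to obtain the probability distribution `$p_{i+1}$` over the `$i+1$`-th word of the target sequence.
 3. Sample the word `$u_{i+1}$` according to `$p_{i+1}$`.
 4. Repeat steps 1~3 until the end-of-sentence token `<e>` is produced or the maximum generation length is exceeded.
 Note: `$z_{i+1}$` and `$p_{i+1}$` are computed with the same formulas as in the [decoder](#解码器). Because every step of generation is greedy, a globally optimal solution is not guaranteed.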
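A toy example makes the last point concrete. The sketch below is a simplified beam search over a hand-written next-word distribution (`toy_model` and its probabilities are made up for illustration); it is not the Fluid implementation. With a beam width of 1 the search is purely greedy and returns a less probable sentence than with a width of 2.

```python
import math


def beam_search(next_log_probs, beam_size, max_length, end_token="<e>"):
    """Keep the `beam_size` best partial sentences at every step."""
    beams = [([], 0.0)]      # (words so far, accumulated log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for words, score in beams:
            for word, logp in next_log_probs(words).items():
                candidates.append((words + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam_size]:
            if words[-1] == end_token:
                finished.append((words, score))   # sentence is complete
            else:
                beams.append((words, score))      # keep expanding
        if not beams:
            break
    finished.extend(beams)   # hypotheses cut off by max_length
    return sorted(finished, key=lambda c: c[1], reverse=True)


def toy_model(prefix):
    """Hypothetical next-word log-probabilities given the words so far."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"<e>": math.log(0.55), "c": math.log(0.45)}
    if prefix[-1] == "b":
        return {"<e>": math.log(0.9), "c": math.log(0.1)}
    return {"<e>": 0.0}


greedy = beam_search(toy_model, beam_size=1, max_length=5)[0]
wider = beam_search(toy_model, beam_size=2, max_length=5)[0]
print(greedy[0], math.exp(greedy[1]))   # ['a', '<e>'] with probability about 0.33
print(wider[0], math.exp(wider[1]))     # ['b', '<e>'] with probability about 0.36
```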
@@ -116,9 +118,13 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
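 ### Data preprocessing
 Our preprocessing pipeline has two steps:
 - Merge each source-to-target parallel corpus file pair into a single file:
 - Merge each pair of `XXX.src` and `XXX.trg` files into `XXX`.
 - Line `$i$` of `XXX` is line `$i$` of `XXX.src` joined with line `$i$` of `XXX.trg`, separated by '\t'.
 - Create the "source dictionary" and "target dictionary" for the training data. Each dictionary has **DICTSIZE** words: the (DICTSIZE - 3) most frequent words in the corpus plus 3 special tokens, `<s>` (start of sequence), `<e>` (end of sequence), and `<unk>` (unknown word).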
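Both preprocessing steps can be sketched on in-memory lists. The helpers `merge_parallel` and `build_dict` are hypothetical, and placing the three special tokens at indices 0 to 2 is an assumption made for this illustration.

```python
from collections import Counter


def merge_parallel(src_lines, trg_lines):
    """Join line i of the source with line i of the target, tab-separated."""
    return [src + "\t" + trg for src, trg in zip(src_lines, trg_lines)]


def build_dict(sentences, dict_size):
    """Keep the (dict_size - 3) most frequent words plus 3 special tokens."""
    counts = Counter(word for s in sentences for word in s.split())
    specials = ["<s>", "<e>", "<unk>"]
    # Sort by frequency; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    words = specials + [word for word, _ in ranked[:dict_size - 3]]
    return {word: idx for idx, word in enumerate(words)}


src = ["the cat sat", "the dog sat", "the end"]
trg = ["le chat", "le chien", "la fin"]
print(merge_parallel(src, trg)[0])       # 'the cat sat\tle chat'
word_dict = build_dict(src, dict_size=5)
print(word_dict)
# {'<s>': 0, '<e>': 1, '<unk>': 2, 'the': 3, 'sat': 4}

# Words outside the dictionary map to the <unk> index.
ids = [word_dict.get(w, word_dict["<unk>"]) for w in "the cat sat".split()]
print(ids)                               # [3, 2, 4]
```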
 ### Example data
@@ -132,6 +138,7 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
 Next we configure the model according to the form of the input data. First import the required libraries and define the global variables.
 ```python
+from __future__ import print_function
 import contextlib
 import numpy as np
@@ -157,139 +164,152 @@ decoder_size = hidden_dim
 Then implement the encoder as follows:
-```python
+   ```python
-def encoder(is_sparse):
+   def encoder(is_sparse):
-src_word_id = pd.data(
+    src_word_id = pd.data(
-name="src_word_id", shape=[1], dtype='int64', lod_level=1)
+        name="src_word_id", shape=[1], dtype='int64', lod_level=1)
-src_embedding = pd.embedding(
+    src_embedding = pd.embedding(
-input=src_word_id,
+        input=src_word_id,
-size=[dict_size, word_dim],
+        size=[dict_size, word_dim],
-dtype='float32',
+        dtype='float32',
-is_sparse=is_sparse,
+        is_sparse=is_sparse,
-param_attr=fluid.ParamAttr(name='vemb'))
+        param_attr=fluid.ParamAttr(name='vemb'))
-fc1 = pd.fc(input=src_embedding, size=hidden_dim * 4, act='tanh')
+    fc1 = pd.fc(input=src_embedding, size=hidden_dim * 4, act='tanh')
-lstm_hidden0, lstm_0 = pd.dynamic_lstm(input=fc1, size=hidden_dim * 4)
+    lstm_hidden0, lstm_0 = pd.dynamic_lstm(input=fc1, size=hidden_dim * 4)
-encoder_out = pd.sequence_last_step(input=lstm_hidden0)
+    encoder_out = pd.sequence_last_step(input=lstm_hidden0)
-return encoder_out
+    return encoder_out
-```
+   ```
 Next, implement the decoder for training mode:
 ```python
-def train_decoder(context, is_sparse):
+   def train_decoder(context, is_sparse):
-trg_language_word = pd.data(
+    trg_language_word = pd.data(
-name="target_language_word", shape=[1], dtype='int64', lod_level=1)
+        name="target_language_word", shape=[1], dtype='int64', lod_level=1)
-trg_embedding = pd.embedding(
+    trg_embedding = pd.embedding(
-input=trg_language_word,
+        input=trg_language_word,
-size=[dict_size, word_dim],
+        size=[dict_size, word_dim],
-dtype='float32',
+        dtype='float32',
-is_sparse=is_sparse,
+        is_sparse=is_sparse,
-param_attr=fluid.ParamAttr(name='vemb'))
+        param_attr=fluid.ParamAttr(name='vemb'))
-rnn = pd.DynamicRNN()
+    rnn = pd.DynamicRNN()
-with rnn.block():
+    with rnn.block():
-current_word = rnn.step_input(trg_embedding)
+        current_word = rnn.step_input(trg_embedding)
-pre_state = rnn.memory(init=context)
+        pre_state = rnn.memory(init=context)
-current_state = pd.fc(input=[current_word, pre_state],
+        current_state = pd.fc(input=[current_word, pre_state],
-size=decoder_size,
+                              size=decoder_size,
-act='tanh')
+                              act='tanh')
-current_score = pd.fc(input=current_state,
+        current_score = pd.fc(input=current_state,
-size=target_dict_dim,
+                              size=target_dict_dim,
-act='softmax')
+                              act='softmax')
-rnn.update_memory(pre_state, current_state)
+        rnn.update_memory(pre_state, current_state)
-rnn.output(current_score)
+        rnn.output(current_score)
-return rnn()
+    return rnn()
 ```
 Implement the decoder for inference mode:
 ```python
 def decode(context, is_sparse):
-init_state = context
+    init_state = context
-array_len = pd.fill_constant(shape=[1], dtype='int64', value=max_length)
+    array_len = pd.fill_constant(shape=[1], dtype='int64', value=max_length)
-counter = pd.zeros(shape=[1], dtype='int64', force_cpu=True)
+    counter = pd.zeros(shape=[1], dtype='int64', force_cpu=True)
-# fill the first element with init_state
+    # fill the first element with init_state
-state_array = pd.create_array('float32')
+    state_array = pd.create_array('float32')
-pd.array_write(init_state, array=state_array, i=counter)
+    pd.array_write(init_state, array=state_array, i=counter)
-# ids, scores as memory
+    # ids, scores as memory
-ids_array = pd.create_array('int64')
+    ids_array = pd.create_array('int64')
-scores_array = pd.create_array('float32')
+    scores_array = pd.create_array('float32')
-init_ids = pd.data(name="init_ids", shape=[1], dtype="int64", lod_level=2)
+    init_ids = pd.data(name="init_ids", shape=[1], dtype="int64", lod_level=2)
-init_scores = pd.data(
+    init_scores = pd.data(
-name="init_scores", shape=[1], dtype="float32", lod_level=2)
+        name="init_scores", shape=[1], dtype="float32", lod_level=2)
-pd.array_write(init_ids, array=ids_array, i=counter)
+    pd.array_write(init_ids, array=ids_array, i=counter)
-pd.array_write(init_scores, array=scores_array, i=counter)
+    pd.array_write(init_scores, array=scores_array, i=counter)
-cond = pd.less_than(x=counter, y=array_len)
+    cond = pd.less_than(x=counter, y=array_len)
-while_op = pd.While(cond=cond)
+    while_op = pd.While(cond=cond)
-with while_op.block():
+    with while_op.block():
-pre_ids = pd.array_read(array=ids_array, i=counter)
+        pre_ids = pd.array_read(array=ids_array, i=counter)
-pre_state = pd.array_read(array=state_array, i=counter)
+        pre_state = pd.array_read(array=state_array, i=counter)
-pre_score = pd.array_read(array=scores_array, i=counter)
+        pre_score = pd.array_read(array=scores_array, i=counter)
-# expand the lod of pre_state to be the same with pre_score
+        # expand the lod of pre_state to be the same with pre_score
-pre_state_expanded = pd.sequence_expand(pre_state, pre_score)
+        pre_state_expanded = pd.sequence_expand(pre_state, pre_score)
-pre_ids_emb = pd.embedding(
+        pre_ids_emb = pd.embedding(
-input=pre_ids,
+            input=pre_ids,
-size=[dict_size, word_dim],
+            size=[dict_size, word_dim],
-dtype='float32',
+            dtype='float32',
-is_sparse=is_sparse)
+            is_sparse=is_sparse)
-# use rnn unit to update rnn
+        # use rnn unit to update rnn
-current_state = pd.fc(input=[pre_state_expanded, pre_ids_emb],
+        current_state = pd.fc(input=[pre_state_expanded, pre_ids_emb],
-size=decoder_size,
+                              size=decoder_size,
-act='tanh')
+                              act='tanh')
-current_state_with_lod = pd.lod_reset(x=current_state, y=pre_score)
+        current_state_with_lod = pd.lod_reset(x=current_state, y=pre_score)
-# use score to do beam search
+        # use score to do beam search
-current_score = pd.fc(input=current_state_with_lod,
+        current_score = pd.fc(input=current_state_with_lod,
-size=target_dict_dim,
+                              size=target_dict_dim,
-act='softmax')
+                              act='softmax')
-topk_scores, topk_indices = pd.topk(current_score, k=topk_size)
+        topk_scores, topk_indices = pd.topk(current_score, k=beam_size)
-selected_ids, selected_scores = pd.beam_search(
+        # calculate accumulated scores after topk to reduce computation cost
-pre_ids, topk_indices, topk_scores, beam_size, end_id=10, level=0)
+        accu_scores = pd.elementwise_add(
+            x=pd.log(topk_scores), y=pd.reshape(pre_score, shape=[-1]), axis=0)
-pd.increment(x=counter, value=1, in_place=True)
+        selected_ids, selected_scores = pd.beam_search(
+            pre_ids,
-# update the memories
+            pre_score,
-pd.array_write(current_state, array=state_array, i=counter)
+            topk_indices,
-pd.array_write(selected_ids, array=ids_array, i=counter)
+            accu_scores,
-pd.array_write(selected_scores, array=scores_array, i=counter)
+            beam_size,
+            end_id=10,
-pd.less_than(x=counter, y=array_len, cond=cond)
+            level=0)
-translation_ids, translation_scores = pd.beam_search_decode(
+        pd.increment(x=counter, value=1, in_place=True)
-ids=ids_array, scores=scores_array)
+        # update the memories
-return translation_ids, translation_scores
+        pd.array_write(current_state, array=state_array, i=counter)
+        pd.array_write(selected_ids, array=ids_array, i=counter)
+        pd.array_write(selected_scores, array=scores_array, i=counter)
+        # update the break condition: up to the max length or all candidates of
+        # source sentences have ended.
+        length_cond = pd.less_than(x=counter, y=array_len)
+        finish_cond = pd.logical_not(pd.is_empty(x=selected_ids))
+        pd.logical_and(x=length_cond, y=finish_cond, out=cond)
+    translation_ids, translation_scores = pd.beam_search_decode(
+        ids=ids_array, scores=scores_array, beam_size=beam_size, end_id=10)
+    return translation_ids, translation_scores
 ```
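The loop above interleaves Fluid-specific array reads and writes with the beam search itself. As a rough illustration of the search alone, the pure-Python sketch below keeps the `beam_size` best partial translations by accumulated log-probability and stops once every surviving hypothesis has emitted the end token or `max_length` steps have run. The toy `next_word_probs` table, the token ids, `END_ID`, and the function name `beam_search` are all assumptions made for this example; none of them is part of the Fluid API, and hypotheses that have not ended by `max_length` are simply dropped here.

```python
import math

END_ID = 2  # hypothetical end-of-sentence id for this toy example


def next_word_probs(prefix):
    """Toy stand-in for the decoder softmax: depends only on the last word."""
    table = {
        0: {1: 0.6, 3: 0.4},
        1: {END_ID: 0.7, 3: 0.3},
        3: {END_ID: 0.9, 1: 0.1},
    }
    return table[prefix[-1]]


def beam_search(start_id, beam_size, max_length):
    beams = [([start_id], 0.0)]  # (token ids, accumulated log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            for word, prob in next_word_probs(tokens).items():
                # Accumulate scores in log space, as accu_scores does above.
                candidates.append((tokens + [word], score + math.log(prob)))
        # Keep only the beam_size best candidates of this step.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == END_ID:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:  # every surviving hypothesis has ended
            break
    finished.sort(key=lambda item: item[1], reverse=True)
    return finished


results = beam_search(start_id=0, beam_size=2, max_length=5)
for tokens, score in results:
    print(tokens, round(math.exp(score), 4))
```

The early exit when no unfinished hypothesis remains plays the same role as `finish_cond` in the Fluid loop, and the `max_length` bound corresponds to `length_cond`.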
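 Next, we define a `train_program` that takes the output of `encoder` and `train_decoder` and computes the error against the labeled data. We also define an `optimizer_func` that specifies the optimizer.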
 ```python
 def train_program(is_sparse):
-context = encoder(is_sparse)
+    context = encoder(is_sparse)
-rnn_out = train_decoder(context, is_sparse)
+    rnn_out = train_decoder(context, is_sparse)
-label = pd.data(
+    label = pd.data(
-name="target_language_next_word", shape=[1], dtype='int64', lod_level=1)
+        name="target_language_next_word", shape=[1], dtype='int64', lod_level=1)
-cost = pd.cross_entropy(input=rnn_out, label=label)
+    cost = pd.cross_entropy(input=rnn_out, label=label)
-avg_cost = pd.mean(cost)
+    avg_cost = pd.mean(cost)
-return avg_cost
+    return avg_cost
 def optimizer_func():
-return fluid.optimizer.Adagrad(
+    return fluid.optimizer.Adagrad(
-learning_rate=1e-4,
+        learning_rate=1e-4,
-regularization=fluid.regularizer.L2DecayRegularizer(
+        regularization=fluid.regularizer.L2DecayRegularizer(
-regularization_coeff=0.1))
+            regularization_coeff=0.1))
 ```
 ## Training the Model
@@ -307,9 +327,9 @@ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
 ```python
 train_reader = paddle.batch(
-paddle.reader.shuffle(
+        paddle.reader.shuffle(
-paddle.dataset.wmt14.train(dict_size), buf_size=1000),
+            paddle.dataset.wmt14.train(dict_size), buf_size=1000),
-batch_size=batch_size)
+        batch_size=batch_size)
 ```
 ### Constructing the trainer
@@ -318,9 +338,9 @@ batch_size=batch_size)
 ```python
 is_sparse = False
 trainer = fluid.Trainer(
-train_func=partial(train_program, is_sparse),
+        train_func=partial(train_program, is_sparse),
-place=place,
+        place=place,
-optimizer_func=optimizer_func)
+        optimizer_func=optimizer_func)
 ```
 ### Feeding data
@@ -329,8 +349,8 @@ optimizer_func=optimizer_func)
 ```python
 feed_order = [
-'src_word_id', 'target_language_word', 'target_language_next_word'
+        'src_word_id', 'target_language_word', 'target_language_next_word'
-]
+    ]
 ```
 ### Event handler
@@ -338,12 +358,12 @@ feed_order = [
 ```python
 def event_handler(event):
-if isinstance(event, fluid.EndStepEvent):
+    if isinstance(event, fluid.EndStepEvent):
-if event.step % 10 == 0:
+        if event.step % 10 == 0:
-print('pass_id=' + str(event.epoch) + ' batch=' + str(event.step))
+            print('pass_id=' + str(event.epoch) + ' batch=' + str(event.step))
-if event.step == 20:
+        if event.step == 20:
-trainer.stop()
+            trainer.stop()
 ```
 ### Starting training
@@ -353,10 +373,10 @@ trainer.stop()
 EPOCH_NUM = 1
 trainer.train(
-reader=train_reader,
+        reader=train_reader,
-num_epochs=EPOCH_NUM,
+        num_epochs=EPOCH_NUM,
-event_handler=event_handler,
+        event_handler=event_handler,
-feed_order=feed_order)
+        feed_order=feed_order)
 ```
 ## Applying the Model
@@ -377,7 +397,7 @@ translation_ids, translation_scores = decode(context, is_sparse)
 ```python
 init_ids_data = np.array([1 for _ in range(batch_size)], dtype='int64')
 init_scores_data = np.array(
-[1. for _ in range(batch_size)], dtype='float32')
+    [1. for _ in range(batch_size)], dtype='float32')
 init_ids_data = init_ids_data.reshape((batch_size, 1))
 init_scores_data = init_scores_data.reshape((batch_size, 1))
 init_lod = [1] * batch_size
@@ -387,14 +407,14 @@ init_ids = fluid.create_lod_tensor(init_ids_data, init_lod, place)
 init_scores = fluid.create_lod_tensor(init_scores_data, init_lod, place)
 test_data = paddle.batch(
-paddle.reader.shuffle(
+    paddle.reader.shuffle(
-paddle.dataset.wmt14.test(dict_size), buf_size=1000),
+        paddle.dataset.wmt14.test(dict_size), buf_size=1000),
-batch_size=batch_size)
+    batch_size=batch_size)
 feed_order = ['src_word_id']
 feed_list = [
-framework.default_main_program().global_block().var(var_name)
+    framework.default_main_program().global_block().var(var_name)
-for var_name in feed_order
+    for var_name in feed_order
 ]
 feeder = fluid.DataFeeder(feed_list, place)
@@ -409,27 +429,30 @@ exe = Executor(place)
 exe.run(framework.default_startup_program())
 for data in test_data():
-feed_data = map(lambda x: [x[0]], data)
+    feed_data = map(lambda x: [x[0]], data)
-feed_dict = feeder.feed(feed_data)
+    feed_dict = feeder.feed(feed_data)
-feed_dict['init_ids'] = init_ids
+    feed_dict['init_ids'] = init_ids
-feed_dict['init_scores'] = init_scores
+    feed_dict['init_scores'] = init_scores
-results = exe.run(
+    results = exe.run(
-framework.default_main_program(),
+        framework.default_main_program(),
-feed=feed_dict,
+        feed=feed_dict,
-fetch_list=[translation_ids, translation_scores],
+        fetch_list=[translation_ids, translation_scores],
-return_numpy=False)
+        return_numpy=False)
-result_ids = np.array(results[0])
+    result_ids = np.array(results[0])
+    result_ids_lod = results[0].lod()  # LoD of the decoded ids, used below to split the candidates
-result_scores = np.array(results[1])
+    result_scores = np.array(results[1])
-print("Original sentence:")
+    print("Original sentence:")
-print(" ".join([src_dict[w] for w in feed_data[0][0]]))
+    print(" ".join([src_dict[w] for w in feed_data[0][0][1:-1]]))
-print("Translated sentence:")
+    print("Translated score and sentence:")
-print(" ".join([trg_dict[w] for w in result_ids]))
+    for i in xrange(beam_size):
-print("Corresponding score: ", result_scores)
+        start_pos = result_ids_lod[1][i] + 1
+        end_pos = result_ids_lod[1][i+1]
-break
+        print("%d\t%.4f\t%s\n" % (i+1, result_scores[end_pos-1],
+                " ".join([trg_dict[w] for w in result_ids[start_pos:end_pos]])))
+    break
 ```
 ## Summary

--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/bi_rnn_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/bi_rnn_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/decoder_attention_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/decoder_attention_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/encoder_attention_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/encoder_attention_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/encoder_decoder.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/encoder_decoder.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/encoder_decoder_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/encoder_decoder_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/gru_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/gru_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/nmt_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/machine_translation/image/nmt_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/recommender_system/.gitignore
+++ b/doc/fluid/new_docs/beginners_guide/basics/recommender_system/.gitignore
-.idea
-.ipynb_checkpoints
--- a/doc/fluid/new_docs/beginners_guide/basics/recommender_system/index.md
+++ b/doc/fluid/new_docs/beginners_guide/basics/recommender_system/index.md
 # Personalized Recommendation
-The source code for this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system). First-time users should refer to the PaddlePaddle [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
+The source code for this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system). First-time users should refer to the PaddlePaddle [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书); for more material, see this tutorial's [video course](http://bit.baidu.com/course/detail/id/176.html).
 ## Background
@@ -36,8 +36,8 @@ Prediction Score is 4.25
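 YouTube is the world's largest website for uploading, sharing, and discovering videos. Its recommender system serves personalized content from a continually growing video corpus to more than one billion users. The system consists of two neural networks: a candidate generation network and a ranking network. The candidate generation network narrows a corpus of millions of videos down to a few hundred candidates; the ranking network scores and ranks those candidates and outputs the top few dozen results. Figure 1 shows the system architecture: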
-![YouTube_Overview](./image/YouTube_Overview.png)
 <p align="center">
+<img src="image/YouTube_Overview.png" width="70%" ><br/>
 Figure 1. Architecture of the YouTube recommender system
 </p>
@@ -45,20 +45,20 @@ YouTube是世界上最大的视频上传、分享和发现网站，YouTube推荐
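 The candidate generation network models recommendation as a multiclass classification problem with an extremely large number of classes. For a YouTube user, it uses the watch history (video IDs), search tokens, demographic information (such as geographic location and login device), binary features (such as gender and whether the user is logged in), and continuous features (such as user age) to classify over all videos in the corpus. This yields a result for every class (that is, a recommendation probability for every video), and the few hundred videos with the highest probabilities are output.
-First, historical information such as watch history and search tokens is mapped to vectors and averaged to obtain a fixed-length representation. Demographic features are also fed in to improve recommendations for new users, and binary and continuous features are normalized to the range [0, 1]. Next, all feature representations are concatenated into a single vector and passed to a nonlinear multilayer perceptron (MLP; see the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/README.cn.md) tutorial). Finally, during training the MLP output is fed to a softmax for classification; during prediction, the similarity between the user's combined feature (the MLP output) and every video is computed, and the `$k$` highest-scoring videos are returned as the result of the candidate generation network. Figure 2 shows the structure of the candidate generation network.
+First, historical information such as watch history and search tokens is mapped to vectors and averaged to obtain a fixed-length representation. Demographic features are also fed in to improve recommendations for new users, and binary and continuous features are normalized to the range [0, 1]. Next, all feature representations are concatenated into a single vector and passed to a nonlinear multilayer perceptron (MLP; see the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/README.cn.md) tutorial). Finally, during training the MLP output is fed to a softmax for classification; during prediction, the similarity between the user's combined feature (the MLP output) and every video is computed, and the $k$ highest-scoring videos are returned as the result of the candidate generation network. Figure 2 shows the structure of the candidate generation network.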
-![Deep_candidate_generation_model_architecture](./image/Deep_candidate_generation_model_architecture.png)
 <p align="center">
+<img src="image/Deep_candidate_generation_model_architecture.png" width="70%" ><br/>
 Figure 2. Structure of the candidate generation network
 </p>
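-For a user `$U$`, the probability that the video `$\omega$` the user wants to watch at this moment is video `$i$` is given by:
+For a user $U$, the probability that the video $\omega$ the user wants to watch at this moment is video $i$ is given by: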
 $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$
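-Here `$u$` is the feature representation of user `$U$`, `$V$` is the set of videos in the corpus, and `$v_i$` is the feature representation of the `$i$`-th video. `$u$` and `$v_i$` are vectors of equal length, and their dot product can be implemented with a fully connected layer.
+Here $u$ is the feature representation of user $U$, $V$ is the set of videos in the corpus, and $v_i$ is the feature representation of the $i$-th video. $u$ and $v_i$ are vectors of equal length, and their dot product can be implemented with a fully connected layer.
-Because the softmax covers a very large number of classes, two measures keep the computation tractable: 1) during training, negative classes are sampled so that only a few thousand classes are actually computed; 2) during recommendation (prediction), the softmax normalization is skipped (this does not change the ranking), which reduces class scoring to a nearest neighbor search in dot product space, and the `$k$` videos nearest to `$u$` are taken as the generated candidates.
+Because the softmax covers a very large number of classes, two measures keep the computation tractable: 1) during training, negative classes are sampled so that only a few thousand classes are actually computed; 2) during recommendation (prediction), the softmax normalization is skipped (this does not change the ranking), which reduces class scoring to a nearest neighbor search in dot product space, and the $k$ videos nearest to $u$ are taken as the generated candidates.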
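The claim that skipping the softmax normalization leaves the result unchanged can be checked on a small example. The sketch below, with made-up vectors and helper names (`dot`, `softmax`, `top_k`) that are assumptions of this illustration only, computes both the raw dot products and the softmax probabilities from the formula above and shows that the top-k selection is the same, because softmax is a monotonic transformation of the scores.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the maximum for numerical stability; the result is unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values, k):
    """Indices of the k largest values, best first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


u = [0.5, -1.0, 2.0]            # user vector u (made-up numbers)
videos = [                      # video vectors v_j (made-up numbers)
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.5, -0.5, 1.0],
    [-1.0, 0.0, 0.2],
]

scores = [dot(v, u) for v in videos]   # v_j . u
probs = softmax(scores)                # P(omega = j | u)

print(top_k(scores, 2))
print(top_k(probs, 2))
```

Both calls select the same two videos, so at prediction time only the dot products need to be computed.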
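 #### Ranking Network
 The ranking network has a structure similar to the candidate generation network, but its goal is to score and rank the candidates more finely. As with feature extraction in traditional ad ranking, a large number of features relevant to video ranking are constructed here (such as video ID and last watch time). These features are processed in the same way as in the candidate generation network; the difference is that the top of the ranking network is a weighted logistic regression, which scores all candidate videos, sorts them from high to low, and returns the highest-scoring videos to the user.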
@@ -72,20 +72,20 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$
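 A convolutional neural network consists mainly of convolution and pooling operations, which can be applied and combined in many flexible ways. In this subsection we use the network shown in Figure 3 as an example: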
-![text_cnn](./image/text_cnn.png)
 <p align="center">
+<img src="image/text_cnn.png" width = "80%" align="center"/><br/>
 Figure 3. Convolutional neural network model for text classification
 </p>
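-Suppose the sentence to be processed has length `$n$`, and the word embedding of the `$i$`-th word is `$x_i\in\mathbb{R}^k$`, where `$k$` is the embedding dimension.
+Suppose the sentence to be processed has length $n$, and the word embedding of the $i$-th word is $x_i\in\mathbb{R}^k$, where $k$ is the embedding dimension.
-First, concatenate word vectors: every `$h$` consecutive words are concatenated to form a word window of size `$h$`, denoted `$x_{i:i+h-1}$`, which is the concatenation of the word sequence `$x_{i},x_{i+1},\ldots,x_{i+h-1}$`. Here `$i$` is the position in the sentence of the first word of the window and ranges from `$1$` to `$n-h+1$`, and `$x_{i:i+h-1}\in\mathbb{R}^{hk}$`.
+First, concatenate word vectors: every $h$ consecutive words are concatenated to form a word window of size $h$, denoted $x_{i:i+h-1}$, which is the concatenation of the word sequence $x_{i},x_{i+1},\ldots,x_{i+h-1}$. Here $i$ is the position in the sentence of the first word of the window and ranges from $1$ to $n-h+1$, and $x_{i:i+h-1}\in\mathbb{R}^{hk}$.
-Second, apply the convolution: the kernel `$w\in\mathbb{R}^{hk}$` is applied to the window `$x_{i:i+h-1}$` of `$h$` words to produce the feature `$c_i=f(w\cdot x_{i:i+h-1}+b)$`, where `$b\in\mathbb{R}$` is the bias and `$f$` is a nonlinear activation function such as `$sigmoid$`. Applying the kernel to every word window in the sentence, `$\{x_{1:h},x_{2:h+1},\ldots,x_{n-h+1:n}\}$`, produces a feature map:
+Second, apply the convolution: the kernel $w\in\mathbb{R}^{hk}$ is applied to the window $x_{i:i+h-1}$ of $h$ words to produce the feature $c_i=f(w\cdot x_{i:i+h-1}+b)$, where $b\in\mathbb{R}$ is the bias and $f$ is a nonlinear activation function such as $sigmoid$. Applying the kernel to every word window in the sentence, $\{x_{1:h},x_{2:h+1},\ldots,x_{n-h+1:n}\}$, produces a feature map: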
 $$c=[c_1,c_2,\ldots,c_{n-h+1}], c \in \mathbb{R}^{n-h+1}$$
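-Next, max pooling over time is applied to the feature map to obtain the feature `$\hat c$` of the whole sentence for this kernel; it is the maximum of all elements in the feature map:
+Next, max pooling over time is applied to the feature map to obtain the feature $\hat c$ of the whole sentence for this kernel; it is the maximum of all elements in the feature map: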
 $$\hat c=max(c)$$
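The three steps above (window concatenation, convolution, and max pooling over time) can be traced with a few lines of plain Python. This is only an illustrative sketch for a single kernel: the function name `conv_max_over_time`, the sigmoid activation, and all numbers are assumptions chosen for the example, not values from the tutorial.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def conv_max_over_time(words, w, b, h):
    """words: n vectors of size k; w: kernel of size h*k; b: scalar bias."""
    n = len(words)
    feature_map = []
    for i in range(n - h + 1):
        # Concatenate h consecutive word vectors into x_{i:i+h-1}.
        window = [value for vec in words[i:i + h] for value in vec]
        # c_i = f(w . x_{i:i+h-1} + b)
        c_i = sigmoid(sum(wj * xj for wj, xj in zip(w, window)) + b)
        feature_map.append(c_i)
    # Max pooling over time: one scalar per kernel, whatever n is.
    return feature_map, max(feature_map)


# n = 4 words, k = 2 dimensions, window size h = 2 (all numbers are made up).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
kernel = [0.5, -0.5, 1.0, 1.0]
c, c_hat = conv_max_over_time(sentence, kernel, b=0.0, h=2)
print(len(c), c_hat == max(c))
```

The feature map has $n-h+1 = 3$ entries, while $\hat c$ is a single number, which is why pooling over time yields a fixed-size feature for sentences of any length.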
@@ -95,9 +95,9 @@ $$\hat c=max(c)$$
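 1. First, user features and movie features are used as the inputs to the neural network, where:
- User features combine four attributes: user ID, gender, occupation, and age.
+   - User features combine four attributes: user ID, gender, occupation, and age.
- Movie features combine three attributes: movie ID, movie category ID, and movie title.
+   - Movie features combine three attributes: movie ID, movie category ID, and movie title.
 2. For user features, the user ID is mapped to a vector of dimension 256 and fed into a fully connected layer; the other three attributes are processed similarly. The feature representations of the four attributes are then each passed through a fully connected layer and summed.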
@@ -105,8 +105,9 @@ $$\hat c=max(c)$$
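 4. Once the vector representations of the user and the movie are obtained, their cosine similarity is computed as the recommender system's score. Finally, the square of the difference between this similarity score and the user's actual rating is used as the loss function of the regression model.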
-![rec_regression_network](./image/rec_regression_network.png)
 <p align="center">
+<img src="image/rec_regression_network.png" width="90%" ><br/>
 Figure 4. The fusion recommendation model
 </p>
@@ -141,7 +142,7 @@ movie_info = paddle.dataset.movielens.movie_info()
 print movie_info.values()[0]
 ```
-<MovieInfo id(1), title(Toy Story ), categories(['Animation', "Children's", 'Comedy'])>
+    <MovieInfo id(1), title(Toy Story ), categories(['Animation', "Children's", 'Comedy'])>
 This means the movie's id is 1, its title is "Toy Story", and it belongs to three categories: Animation, Children's, and Comedy.
@@ -152,13 +153,14 @@ user_info = paddle.dataset.movielens.user_info()
 print user_info.values()[0]
 ```
-<UserInfo id(1), gender(F), age(1), job(10)>
+    <UserInfo id(1), gender(F), age(1), job(10)>
 This means the user's ID is 1, the user is female and under 18 years old, and the job ID is 10.
 Age is encoded using the following buckets:
 *  1:  "Under 18"
 * 18:  "18-24"
 * 25:  "25-34"
@@ -168,6 +170,7 @@ print user_info.values()[0]
 * 56:  "56+"
 The job is chosen from the following options:
 *  0:  "other" or not specified
 *  1:  "academic/educator"
 *  2:  "artist"
@@ -203,7 +206,7 @@ mov_id = train_sample[len(user_info[uid].value())]
 print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1])
 ```
-User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0]
+    User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0]
 That is, user 1 gave movie 1193 a rating of 5.
@@ -214,6 +217,7 @@ User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193
 ```python
+from __future__ import print_function
 import math
 import sys
 import numpy as np
@@ -232,59 +236,59 @@ BATCH_SIZE = 256
 ```python
 def get_usr_combined_features():
-USR_DICT_SIZE = paddle.dataset.movielens.max_user_id() + 1
+    USR_DICT_SIZE = paddle.dataset.movielens.max_user_id() + 1
-uid = layers.data(name='user_id', shape=[1], dtype='int64')
+    uid = layers.data(name='user_id', shape=[1], dtype='int64')
-usr_emb = layers.embedding(
+    usr_emb = layers.embedding(
-input=uid,
+        input=uid,
-dtype='float32',
+        dtype='float32',
-size=[USR_DICT_SIZE, 32],
+        size=[USR_DICT_SIZE, 32],
-param_attr='user_table',
+        param_attr='user_table',
-is_sparse=IS_SPARSE)
+        is_sparse=IS_SPARSE)
-usr_fc = layers.fc(input=usr_emb, size=32)
+    usr_fc = layers.fc(input=usr_emb, size=32)
-USR_GENDER_DICT_SIZE = 2
+    USR_GENDER_DICT_SIZE = 2
-usr_gender_id = layers.data(name='gender_id', shape=[1], dtype='int64')
+    usr_gender_id = layers.data(name='gender_id', shape=[1], dtype='int64')
-usr_gender_emb = layers.embedding(
+    usr_gender_emb = layers.embedding(
-input=usr_gender_id,
+        input=usr_gender_id,
-size=[USR_GENDER_DICT_SIZE, 16],
+        size=[USR_GENDER_DICT_SIZE, 16],
-param_attr='gender_table',
+        param_attr='gender_table',
-is_sparse=IS_SPARSE)
+        is_sparse=IS_SPARSE)
-usr_gender_fc = layers.fc(input=usr_gender_emb, size=16)
+    usr_gender_fc = layers.fc(input=usr_gender_emb, size=16)
-USR_AGE_DICT_SIZE = len(paddle.dataset.movielens.age_table)
+    USR_AGE_DICT_SIZE = len(paddle.dataset.movielens.age_table)
-usr_age_id = layers.data(name='age_id', shape=[1], dtype="int64")
+    usr_age_id = layers.data(name='age_id', shape=[1], dtype="int64")
-usr_age_emb = layers.embedding(
+    usr_age_emb = layers.embedding(
-input=usr_age_id,
+        input=usr_age_id,
-size=[USR_AGE_DICT_SIZE, 16],
+        size=[USR_AGE_DICT_SIZE, 16],
-is_sparse=IS_SPARSE,
+        is_sparse=IS_SPARSE,
-param_attr='age_table')
+        param_attr='age_table')
-usr_age_fc = layers.fc(input=usr_age_emb, size=16)
+    usr_age_fc = layers.fc(input=usr_age_emb, size=16)
-USR_JOB_DICT_SIZE = paddle.dataset.movielens.max_job_id() + 1
+    USR_JOB_DICT_SIZE = paddle.dataset.movielens.max_job_id() + 1
-usr_job_id = layers.data(name='job_id', shape=[1], dtype="int64")
+    usr_job_id = layers.data(name='job_id', shape=[1], dtype="int64")
-usr_job_emb = layers.embedding(
+    usr_job_emb = layers.embedding(
-input=usr_job_id,
+        input=usr_job_id,
-size=[USR_JOB_DICT_SIZE, 16],
+        size=[USR_JOB_DICT_SIZE, 16],
-param_attr='job_table',
+        param_attr='job_table',
-is_sparse=IS_SPARSE)
+        is_sparse=IS_SPARSE)
-usr_job_fc = layers.fc(input=usr_job_emb, size=16)
+    usr_job_fc = layers.fc(input=usr_job_emb, size=16)
-concat_embed = layers.concat(
+    concat_embed = layers.concat(
-input=[usr_fc, usr_gender_fc, usr_age_fc, usr_job_fc], axis=1)
+        input=[usr_fc, usr_gender_fc, usr_age_fc, usr_job_fc], axis=1)
-usr_combined_features = layers.fc(input=concat_embed, size=200, act="tanh")
+    usr_combined_features = layers.fc(input=concat_embed, size=200, act="tanh")
-return usr_combined_features
+    return usr_combined_features
 ```
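 As the code above shows, each user is described by four input features: user_id, gender_id, age_id, and job_id. Each of them is a plain integer. To make them easier for the subsequent network to process, we borrow from language models in NLP and look up an embedding for each discrete integer, producing usr_emb, usr_gender_emb, usr_age_emb, and usr_job_emb.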
@@ -297,51 +301,51 @@ return usr_combined_features
 ```python
 def get_mov_combined_features():
-MOV_DICT_SIZE = paddle.dataset.movielens.max_movie_id() + 1
+    MOV_DICT_SIZE = paddle.dataset.movielens.max_movie_id() + 1
-mov_id = layers.data(name='movie_id', shape=[1], dtype='int64')
+    mov_id = layers.data(name='movie_id', shape=[1], dtype='int64')
-mov_emb = layers.embedding(
+    mov_emb = layers.embedding(
-input=mov_id,
+        input=mov_id,
-dtype='float32',
+        dtype='float32',
-size=[MOV_DICT_SIZE, 32],
+        size=[MOV_DICT_SIZE, 32],
-param_attr='movie_table',
+        param_attr='movie_table',
-is_sparse=IS_SPARSE)
+        is_sparse=IS_SPARSE)
-mov_fc = layers.fc(input=mov_emb, size=32)
+    mov_fc = layers.fc(input=mov_emb, size=32)
-CATEGORY_DICT_SIZE = len(paddle.dataset.movielens.movie_categories())
+    CATEGORY_DICT_SIZE = len(paddle.dataset.movielens.movie_categories())
-category_id = layers.data(
+    category_id = layers.data(
-name='category_id', shape=[1], dtype='int64', lod_level=1)
+        name='category_id', shape=[1], dtype='int64', lod_level=1)
-mov_categories_emb = layers.embedding(
+    mov_categories_emb = layers.embedding(
-input=category_id, size=[CATEGORY_DICT_SIZE, 32], is_sparse=IS_SPARSE)
+        input=category_id, size=[CATEGORY_DICT_SIZE, 32], is_sparse=IS_SPARSE)
-mov_categories_hidden = layers.sequence_pool(
+    mov_categories_hidden = layers.sequence_pool(
-input=mov_categories_emb, pool_type="sum")
+        input=mov_categories_emb, pool_type="sum")
-MOV_TITLE_DICT_SIZE = len(paddle.dataset.movielens.get_movie_title_dict())
+    MOV_TITLE_DICT_SIZE = len(paddle.dataset.movielens.get_movie_title_dict())
-mov_title_id = layers.data(
+    mov_title_id = layers.data(
-name='movie_title', shape=[1], dtype='int64', lod_level=1)
+        name='movie_title', shape=[1], dtype='int64', lod_level=1)
-mov_title_emb = layers.embedding(
+    mov_title_emb = layers.embedding(
-input=mov_title_id, size=[MOV_TITLE_DICT_SIZE, 32], is_sparse=IS_SPARSE)
+        input=mov_title_id, size=[MOV_TITLE_DICT_SIZE, 32], is_sparse=IS_SPARSE)
-mov_title_conv = nets.sequence_conv_pool(
+    mov_title_conv = nets.sequence_conv_pool(
-input=mov_title_emb,
+        input=mov_title_emb,
-num_filters=32,
+        num_filters=32,
-filter_size=3,
+        filter_size=3,
-act="tanh",
+        act="tanh",
-pool_type="sum")
+        pool_type="sum")
-concat_embed = layers.concat(
+    concat_embed = layers.concat(
-input=[mov_fc, mov_categories_hidden, mov_title_conv], axis=1)
+        input=[mov_fc, mov_categories_hidden, mov_title_conv], axis=1)
-mov_combined_features = layers.fc(input=concat_embed, size=200, act="tanh")
+    mov_combined_features = layers.fc(input=concat_embed, size=200, act="tanh")
-return mov_combined_features
+    return mov_combined_features
 ```
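 The movie title is a sequence of integers, each being the index of a word in the vocabulary. The sequence is fed to the `sequence_conv_pool` layer, which applies convolution and pooling along the time dimension. As a result, the output has a fixed length even though the input sequences vary in length.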
@@ -350,13 +354,13 @@ return mov_combined_features
 ```python
 def inference_program():
-usr_combined_features = get_usr_combined_features()
+    usr_combined_features = get_usr_combined_features()
-mov_combined_features = get_mov_combined_features()
+    mov_combined_features = get_mov_combined_features()
-inference = layers.cos_sim(X=usr_combined_features, Y=mov_combined_features)
+    inference = layers.cos_sim(X=usr_combined_features, Y=mov_combined_features)
-scale_infer = layers.scale(x=inference, scale=5.0)
+    scale_infer = layers.scale(x=inference, scale=5.0)
-return scale_infer
+    return scale_infer
 ```
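To make the scoring step concrete, the sketch below reproduces it in plain Python: the cosine similarity of the two feature vectors is scaled by 5, and the squared difference from the actual rating serves as the cost, mirroring `layers.cos_sim`, `layers.scale`, and the squared-error loss described earlier. The two-dimensional vectors and the helper names are assumptions for illustration; the real feature vectors have 200 dimensions.

```python
import math


def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def predict_rating(user_vec, movie_vec, scale=5.0):
    # Cosine similarity lies in [-1, 1]; scaling by 5 maps it to [-5, 5].
    return scale * cos_sim(user_vec, movie_vec)


def square_error(prediction, label):
    return (prediction - label) ** 2


user_features = [3.0, 4.0]     # made-up stand-in for the 200-dim user vector
movie_features = [4.0, 3.0]    # made-up stand-in for the 200-dim movie vector
rating = predict_rating(user_features, movie_features)
print(rating, square_error(rating, 5.0))
```

Here the cosine similarity is 24 / 25 = 0.96, so the predicted rating is about 4.8 and the squared error against a true rating of 5 is about 0.04.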
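 Next, we define a `train_program` that uses the result computed by `inference_program` and computes the error with the help of the labeled data. We also define an `optimizer_func` that specifies the optimizer.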
@@ -364,17 +368,17 @@ return scale_infer
 ```python
 def train_program():
-scale_infer = inference_program()
+    scale_infer = inference_program()
-label = layers.data(name='score', shape=[1], dtype='float32')
+    label = layers.data(name='score', shape=[1], dtype='float32')
-square_cost = layers.square_error_cost(input=scale_infer, label=label)
+    square_cost = layers.square_error_cost(input=scale_infer, label=label)
-avg_cost = layers.mean(square_cost)
+    avg_cost = layers.mean(square_cost)
-return [avg_cost, scale_infer]
+    return [avg_cost, scale_infer]
 def optimizer_func():
-return fluid.optimizer.SGD(learning_rate=0.2)
+    return fluid.optimizer.SGD(learning_rate=0.2)
 ```
@@ -393,12 +397,12 @@ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
 ```python
 train_reader = paddle.batch(
-paddle.reader.shuffle(
+    paddle.reader.shuffle(
-paddle.dataset.movielens.train(), buf_size=8192),
+        paddle.dataset.movielens.train(), buf_size=8192),
-batch_size=BATCH_SIZE)
+    batch_size=BATCH_SIZE)
 test_reader = paddle.batch(
-paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
+    paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
 ```
 ### Constructing the trainer
@@ -406,7 +410,7 @@ paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
 ```python
 trainer = fluid.Trainer(
-train_func=train_program, place=place, optimizer_func=optimizer_func)
+    train_func=train_program, place=place, optimizer_func=optimizer_func)
 ```
 ### Feeding data
@@ -415,8 +419,8 @@ train_func=train_program, place=place, optimizer_func=optimizer_func)
 ```python
 feed_order = [
-'user_id', 'gender_id', 'age_id', 'job_id', 'movie_id', 'category_id',
+    'user_id', 'gender_id', 'age_id', 'job_id', 'movie_id', 'category_id',
-'movie_title', 'score'
+    'movie_title', 'score'
 ]
 ```
@@ -433,23 +437,23 @@ plot_cost = Ploter(test_title)
 def event_handler(event):
-if isinstance(event, fluid.EndStepEvent):
+    if isinstance(event, fluid.EndStepEvent):
-avg_cost_set = trainer.test(
+        avg_cost_set = trainer.test(
-reader=test_reader, feed_order=feed_order)
+            reader=test_reader, feed_order=feed_order)
-# get avg cost
+        # get avg cost
-avg_cost = np.array(avg_cost_set).mean()
+        avg_cost = np.array(avg_cost_set).mean()
-plot_cost.append(test_title, event.step, avg_cost_set[0])
+        plot_cost.append(test_title, event.step, avg_cost_set[0])
-plot_cost.plot()
+        plot_cost.plot()
-print("avg_cost: %s" % avg_cost)
+        print("avg_cost: %s" % avg_cost)
-print('BatchID {0}, Test Loss {1:0.2}'.format(event.epoch + 1,
+        print('BatchID {0}, Test Loss {1:0.2}'.format(event.epoch + 1,
-float(avg_cost)))
+                                                          float(avg_cost)))
-if event.step == 20: # Adjust this number for accuracy
+        if event.step == 20: # Adjust this number for accuracy
-trainer.save_params(params_dirname)
+            trainer.save_params(params_dirname)
-trainer.stop()
+            trainer.stop()
 ```
 ### Starting training
@@ -457,10 +461,10 @@ trainer.stop()
 ```python
 trainer.train(
-num_epochs=1,
+    num_epochs=1,
-event_handler=event_handler,
+    event_handler=event_handler,
-reader=train_reader,
+    reader=train_reader,
-feed_order=feed_order)
+    feed_order=feed_order)
 ```
 ## Applying the Model
@@ -470,7 +474,7 @@ feed_order=feed_order)
 ```python
 inferencer = fluid.Inferencer(
-inference_program, param_path=params_dirname, place=place)
+        inference_program, param_path=params_dirname, place=place)
 ```
 ### Generating input data for testing
@@ -488,7 +492,7 @@ job_id = fluid.create_lod_tensor([[10]], [[1]], place)
 movie_id = fluid.create_lod_tensor([[783]], [[1]], place) # Hunchback of Notre Dame
 category_id = fluid.create_lod_tensor([[10, 8, 9]], [[3]], place) # Animation, Children's, Musical
 movie_title = fluid.create_lod_tensor([[1069, 4140, 2923, 710, 988]], [[5]],
-place) # 'hunchback','of','notre','dame','the'
+                                      place) # 'hunchback','of','notre','dame','the'
 ```
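The second argument of each `fluid.create_lod_tensor` call above describes how the flat data is split into sequences. Assuming, as these calls suggest, that it is a list of sequence lengths (so `[[3]]` means one sequence of three items), the following sketch shows how such lengths relate to start and end offsets into the flat data. The helper names `lengths_to_offsets` and `split_by_lengths`, and the two-sequence batch, are hypothetical and exist only for this illustration; they are not Fluid functions.

```python
def lengths_to_offsets(lengths):
    """Turn per-sequence lengths into cumulative offsets, starting at 0."""
    offsets = [0]
    for length in lengths:
        offsets.append(offsets[-1] + length)
    return offsets


def split_by_lengths(flat, lengths):
    """Recover the individual sequences from flat data and their lengths."""
    offsets = lengths_to_offsets(lengths)
    return [flat[start:end] for start, end in zip(offsets[:-1], offsets[1:])]


# One title of five word ids, as in the movie_title tensor above.
title_ids = [1069, 4140, 2923, 710, 988]
print(split_by_lengths(title_ids, [5]))

# A hypothetical batch holding two category sequences of lengths 3 and 2.
batch = [10, 8, 9, 4, 7]
print(lengths_to_offsets([3, 2]))
print(split_by_lengths(batch, [3, 2]))
```

Storing variable-length sequences as flat data plus lengths is what lets a single tensor carry sequences of different sizes without padding.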
 ### Testing
@@ -497,16 +501,21 @@ place) # 'hunchback','of','notre','dame','the'
 ```python
 results = inferencer.infer(
-{
+    {
-'user_id': user_id,
+        'user_id': user_id,
-'gender_id': gender_id,
+        'gender_id': gender_id,
-'age_id': age_id,
+        'age_id': age_id,
-'job_id': job_id,
+        'job_id': job_id,
-'movie_id': movie_id,
+        'movie_id': movie_id,
-'category_id': category_id,
+        'category_id': category_id,
-'movie_title': movie_title
+        'movie_title': movie_title
-},
+    },
-return_numpy=False)
+    return_numpy=False)
+predict_rating = np.array(results[0])
+print("Predict Rating of user id 1 on movie \"" + infer_movie_name + "\" is " + str(predict_rating[0][0]))
+print("Actual Rating of user id 1 on movie \"" + infer_movie_name + "\" is 4.")
 ```
 ## Summary
@@ -515,12 +524,12 @@ return_numpy=False)
 ## References
-1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
+1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
-2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
+2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
 3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186.
-4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th international conference on World Wide Web*. ACM, 2001.
+4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th international conference on World Wide Web*. ACM, 2001.
 5. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: combining social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA
-6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
+6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
 7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198.

--- a/doc/fluid/new_docs/beginners_guide/basics/recommender_system/image/rec_regression_network_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/recommender_system/image/rec_regression_network_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/.gitignore
+++ b/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/.gitignore
-data/aclImdb
-data/imdb
-data/pre-imdb
-data/mosesdecoder-master
-*.log
-model_output
-dataprovider_copy_1.py
-model.list
-*.pyc
-.DS_Store
--- a/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/index.md
+++ b/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/index.md
 # 情感分析
-本教程源代码目录在[book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment)， 初次使用请参考PaddlePaddle[安装教程](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)。
+本教程源代码目录在[book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment)， 初次使用请参考PaddlePaddle[安装教程](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)，更多内容请参考本教程的[视频课堂](http://bit.baidu.com/course/detail/id/177.html)。
 ## 背景介绍
@@ -36,54 +36,54 @@
 循环神经网络是一种能对序列数据进行精确建模的有力工具。实际上，循环神经网络的理论计算能力是图灵完备的\[[4](#参考文献)\]。自然语言是一种典型的序列数据（词序列），近年来，循环神经网络及其变体（如long short term memory\[[5](#参考文献)\]等）在自然语言处理的多个领域，如语言模型、句法解析、语义角色标注（或一般的序列标注）、语义表示、图文生成、对话、机器翻译等任务上均表现优异甚至成为目前效果最好的方法。
-![rnn](./image/rnn.png)
 <p align="center">
+<img src="image/rnn.png" width = "60%" align="center"/><br/>
 图1. 循环神经网络按时间展开的示意图
 </p>
-循环神经网络按时间展开后如图1所示：在第`$t$`时刻，网络读入第`$t$`个输入`$x_t$`（向量表示）及前一时刻隐层的状态值`$h_{t-1}$`（向量表示，`$h_0$`一般初始化为`$0$`向量），计算得出本时刻隐层的状态值`$h_t$`，重复这一步骤直至读完所有输入。如果将循环神经网络所表示的函数记为`$f$`，则其公式可表示为：
+循环神经网络按时间展开后如图1所示：在第$t$时刻，网络读入第$t$个输入$x_t$（向量表示）及前一时刻隐层的状态值$h_{t-1}$（向量表示，$h_0$一般初始化为$0$向量），计算得出本时刻隐层的状态值$h_t$，重复这一步骤直至读完所有输入。如果将循环神经网络所表示的函数记为$f$，则其公式可表示为：
 $$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$
-其中`$W_{xh}$`是输入到隐层的矩阵参数，`$W_{hh}$`是隐层到隐层的矩阵参数，`$b_h$`为隐层的偏置向量（bias）参数，`$\sigma$`为`$sigmoid$`函数。
+其中$W_{xh}$是输入到隐层的矩阵参数，$W_{hh}$是隐层到隐层的矩阵参数，$b_h$为隐层的偏置向量（bias）参数，$\sigma$为$sigmoid$函数。  
-在处理自然语言时，一般会先将词（one-hot表示）映射为其词向量（word embedding）表示，然后再作为循环神经网络每一时刻的输入`$x_t$`。此外，可以根据实际需要的不同在循环神经网络的隐层上连接其它层。如，可以把一个循环神经网络的隐层输出连接至下一个循环神经网络的输入构建深层（deep or stacked）循环神经网络，或者提取最后一个时刻的隐层状态作为句子表示进而使用分类模型等等。
+在处理自然语言时，一般会先将词（one-hot表示）映射为其词向量（word embedding）表示，然后再作为循环神经网络每一时刻的输入$x_t$。此外，可以根据实际需要的不同在循环神经网络的隐层上连接其它层。如，可以把一个循环神经网络的隐层输出连接至下一个循环神经网络的输入构建深层（deep or stacked）循环神经网络，或者提取最后一个时刻的隐层状态作为句子表示进而使用分类模型等等。  
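The recurrence above can be sketched in a few lines of plain Python. This is only an illustration of the formula $h_t=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$, not the PaddlePaddle implementation; the helper names (`matvec`, `rnn_step`, `rnn_forward`) and the toy parameter values are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic sigmoid for a scalar."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step: h_t = sigmoid(W_xh x_t + W_hh h_{t-1} + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [sigmoid(a + b + c) for a, b, c in zip(from_input, from_state, b_h)]


def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Read the whole sequence, starting from h_0 = 0."""
    h = [0.0] * len(b_h)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states


# Toy parameters (illustrative values only).
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b_h = [0.0, 0.0]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = rnn_forward(inputs, W_xh, W_hh, b_h)
print(states[-1])  # hidden state after the last input
```

The last element of `states` is the kind of vector that could serve as a sentence representation for a classifier, as described above.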
 ### 长短期记忆网络（LSTM）
 对于较长的序列数据，循环神经网络的训练过程中容易出现梯度消失或爆炸现象\[[6](#参考文献)\]。为了解决这一问题，Hochreiter S, Schmidhuber J. (1997)提出了LSTM(long short term memory\[[5](#参考文献)\])。  
-相比于简单的循环神经网络，LSTM增加了记忆单元`$c$`、输入门`$i$`、遗忘门`$f$`及输出门`$o$`。这些门及记忆单元组合起来大大提升了循环神经网络处理长序列数据的能力。若将基于LSTM的循环神经网络表示的函数记为`$F$`，则其公式为：
+相比于简单的循环神经网络，LSTM增加了记忆单元$c$、输入门$i$、遗忘门$f$及输出门$o$。这些门及记忆单元组合起来大大提升了循环神经网络处理长序列数据的能力。若将基于LSTM的循环神经网络表示的函数记为$F$，则其公式为：
 $$ h_t=F(x_t,h_{t-1})$$
-`$F$`由下列公式组合而成\[[7](#参考文献)\]：
+$F$由下列公式组合而成\[[7](#参考文献)\]：
 $$ i_t = \sigma{(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i)} $$
 $$ f_t = \sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f) $$
 $$ c_t = f_t\odot c_{t-1}+i_t\odot tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c) $$
 $$ o_t = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o) $$
 $$ h_t = o_t\odot tanh(c_t) $$
-其中，`$i_t, f_t, c_t, o_t$`分别表示输入门，遗忘门，记忆单元及输出门的向量值，带角标的`$W$`及`$b$`为模型参数，`$tanh$`为双曲正切函数，`$\odot$`表示逐元素（elementwise）的乘法操作。输入门控制着新输入进入记忆单元`$c$`的强度，遗忘门控制着记忆单元维持上一时刻值的强度，输出门控制着输出记忆单元的强度。三种门的计算方式类似，但有着完全不同的参数，它们各自以不同的方式控制着记忆单元`$c$`，如图2所示：
+其中，$i_t, f_t, c_t, o_t$分别表示输入门，遗忘门，记忆单元及输出门的向量值，带角标的$W$及$b$为模型参数，$tanh$为双曲正切函数，$\odot$表示逐元素（elementwise）的乘法操作。输入门控制着新输入进入记忆单元$c$的强度，遗忘门控制着记忆单元维持上一时刻值的强度，输出门控制着输出记忆单元的强度。三种门的计算方式类似，但有着完全不同的参数，它们各自以不同的方式控制着记忆单元$c$，如图2所示：
-![lstm](./image/lstm.png)
 <p align="center">
-图2. 时刻`$t$`的LSTM [7]
+<img src="image/lstm.png" width = "65%" align="center"/><br/>
+图2. 时刻$t$的LSTM [7]
 </p>
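To make the five LSTM equations concrete, the following sketch evaluates one time step in plain Python. It follows the formulas term by term, including the peephole terms $W_{ci}c_{t-1}$, $W_{cf}c_{t-1}$ and $W_{co}c_t$. The parameter dictionary `p`, the helper names and the all-zero toy weights are assumptions chosen for illustration, and the peephole weights are treated as full matrices here for simplicity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def vec_add(*vectors):
    """Element-wise sum of several vectors."""
    return [sum(items) for items in zip(*vectors)]


def hadamard(a, b):
    """Element-wise product (the odot operator)."""
    return [x * y for x, y in zip(a, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the equations for i_t, f_t, c_t, o_t, h_t."""
    i_t = [sigmoid(z) for z in vec_add(
        matvec(p["W_xi"], x_t), matvec(p["W_hi"], h_prev),
        matvec(p["W_ci"], c_prev), p["b_i"])]
    f_t = [sigmoid(z) for z in vec_add(
        matvec(p["W_xf"], x_t), matvec(p["W_hf"], h_prev),
        matvec(p["W_cf"], c_prev), p["b_f"])]
    candidate = [math.tanh(z) for z in vec_add(
        matvec(p["W_xc"], x_t), matvec(p["W_hc"], h_prev), p["b_c"])]
    c_t = vec_add(hadamard(f_t, c_prev), hadamard(i_t, candidate))
    # The output gate uses the new cell value c_t, not c_{t-1}.
    o_t = [sigmoid(z) for z in vec_add(
        matvec(p["W_xo"], x_t), matvec(p["W_ho"], h_prev),
        matvec(p["W_co"], c_t), p["b_o"])]
    h_t = hadamard(o_t, [math.tanh(c) for c in c_t])
    return h_t, c_t


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Toy setup: all weights zero, so every gate equals sigmoid(0) = 0.5.
hidden, inp = 2, 3
params = {}
for gate in "ifco":
    params["W_x" + gate] = zeros(hidden, inp)
    params["W_h" + gate] = zeros(hidden, hidden)
    params["b_" + gate] = [0.0] * hidden
for gate in "ifo":
    params["W_c" + gate] = zeros(hidden, hidden)

h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
print(h_t, c_t)
```

With all-zero weights the forget gate keeps half of the previous cell value and the candidate contributes nothing, which makes the role of each gate easy to check by hand.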
 LSTM通过给简单的循环神经网络增加记忆及控制门的方式，增强了其处理远距离依赖问题的能力。类似原理的改进还有Gated Recurrent Unit (GRU)\[[8](#参考文献)\]，其设计更为简洁一些。**这些改进虽然各有不同，但是它们的宏观描述却与简单的循环神经网络一样（如图2所示），即隐状态依据当前输入及前一时刻的隐状态来改变，不断地循环这一过程直至输入处理完毕：**
 $$ h_t=Recurrent(x_t,h_{t-1})$$
-where `$Recurrent$` can denote a simple recurrent neural network, a GRU, or an LSTM.
+where $Recurrent$ can denote a simple recurrent neural network, a GRU, or an LSTM.
 ### 栈式双向LSTM（Stacked Bidirectional LSTM）
-对于正常顺序的循环神经网络，`$h_t$`包含了`$t$`时刻之前的输入信息，也就是上文信息。同样，为了得到下文信息，我们可以使用反方向（将输入逆序处理）的循环神经网络。结合构建深层循环神经网络的方法（深层神经网络往往能得到更抽象和高级的特征表示），我们可以通过构建更加强有力的基于LSTM的栈式双向循环神经网络\[[9](#参考文献)\]，来对时序数据进行建模。
+对于正常顺序的循环神经网络，$h_t$包含了$t$时刻之前的输入信息，也就是上文信息。同样，为了得到下文信息，我们可以使用反方向（将输入逆序处理）的循环神经网络。结合构建深层循环神经网络的方法（深层神经网络往往能得到更抽象和高级的特征表示），我们可以通过构建更加强有力的基于LSTM的栈式双向循环神经网络\[[9](#参考文献)\]，来对时序数据进行建模。  
 如图3所示（以三层为例），奇数层LSTM正向，偶数层LSTM反向，高一层的LSTM使用低一层LSTM及之前所有层的信息作为输入，对最高层LSTM序列使用时间维度上的最大池化即可得到文本的定长向量表示（这一表示充分融合了文本的上下文信息，并且对文本进行了深层次抽象），最后我们将文本表示连接至softmax构建分类模型。
-![stacked_lstm](./image/stacked_lstm.jpg)
 <p align="center">
+<img src="image/stacked_lstm.jpg" width=450><br/>
 图3. 栈式双向LSTM用于文本分类
 </p>
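The max pooling over the time dimension mentioned above turns a variable-length sequence of hidden states into one fixed-length vector. A minimal sketch in plain Python follows; the function name `max_pool_over_time` and the toy values are assumptions for illustration, and in the tutorial code this step is performed by `fluid.layers.sequence_pool` with `pool_type='max'`.

```python
def max_pool_over_time(states):
    """Take the maximum of each hidden dimension across all time steps.

    `states` is a list of hidden vectors, one per time step.
    """
    return [max(column) for column in zip(*states)]


# Three time steps, hidden size 2 (toy values).
states = [[0.1, 0.9], [0.7, 0.2], [0.3, 0.4]]
pooled = max_pool_over_time(states)
print(pooled)  # one vector regardless of sequence length
```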
@@ -94,11 +94,11 @@ $$ h_t=Recrurent(x_t,h_{t-1})$$
 ```text
 aclImdb
 |- test
-|-- neg
+   |-- neg
-|-- pos
+   |-- pos
 |- train
-|-- neg
+   |-- neg
-|-- pos
+   |-- pos
 ```
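 Paddle implements automatic downloading and reading of the imdb dataset in `dataset/imdb.py`, and provides APIs for reading the dictionary, the training data, and the test data.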
@@ -107,6 +107,7 @@ Paddle在`dataset/imdb.py`中提实现了imdb数据集的自动下载和读取
 在该示例中，我们实现了两种文本分类算法，分别基于[推荐系统](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system)一节介绍过的文本卷积神经网络，以及[栈式双向LSTM](#栈式双向LSTM（Stacked Bidirectional LSTM）)。我们首先引入要用到的库和定义全局变量：
 ```python
+from __future__ import print_function
 import paddle
 import paddle.fluid as fluid
 from functools import partial
@@ -115,6 +116,7 @@ import numpy as np
 CLASS_DIM = 2
 EMB_DIM = 128
 HID_DIM = 512
+STACKED_NUM = 3
 BATCH_SIZE = 128
 USE_GPU = False
 ```
@@ -126,23 +128,23 @@ USE_GPU = False
 ```python
 def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim):
-emb = fluid.layers.embedding(
+    emb = fluid.layers.embedding(
-input=data, size=[input_dim, emb_dim], is_sparse=True)
+        input=data, size=[input_dim, emb_dim], is_sparse=True)
-conv_3 = fluid.nets.sequence_conv_pool(
+    conv_3 = fluid.nets.sequence_conv_pool(
-input=emb,
+        input=emb,
-num_filters=hid_dim,
+        num_filters=hid_dim,
-filter_size=3,
+        filter_size=3,
-act="tanh",
+        act="tanh",
-pool_type="sqrt")
+        pool_type="sqrt")
-conv_4 = fluid.nets.sequence_conv_pool(
+    conv_4 = fluid.nets.sequence_conv_pool(
-input=emb,
+        input=emb,
-num_filters=hid_dim,
+        num_filters=hid_dim,
-filter_size=4,
+        filter_size=4,
-act="tanh",
+        act="tanh",
-pool_type="sqrt")
+        pool_type="sqrt")
-prediction = fluid.layers.fc(
+    prediction = fluid.layers.fc(
-input=[conv_3, conv_4], size=class_dim, act="softmax")
+        input=[conv_3, conv_4], size=class_dim, act="softmax")
-return prediction
+    return prediction
 ```
 网络的输入`input_dim`表示的是词典的大小，`class_dim`表示类别数。这里，我们使用[`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py) API实现了卷积和池化操作。
@@ -154,27 +156,26 @@ return prediction
 ```python
 def stacked_lstm_net(data, input_dim, class_dim, emb_dim, hid_dim, stacked_num):
-emb = fluid.layers.embedding(
+    emb = fluid.layers.embedding(
-input=data, size=[input_dim, emb_dim], is_sparse=True)
+        input=data, size=[input_dim, emb_dim], is_sparse=True)
-fc1 = fluid.layers.fc(input=emb, size=hid_dim)
+    fc1 = fluid.layers.fc(input=emb, size=hid_dim)
-lstm1, cell1 = fluid.layers.dynamic_lstm(input=fc1, size=hid_dim)
+    lstm1, cell1 = fluid.layers.dynamic_lstm(input=fc1, size=hid_dim)
-inputs = [fc1, lstm1]
+    inputs = [fc1, lstm1]
-for i in range(2, stacked_num + 1):
+    for i in range(2, stacked_num + 1):
-fc = fluid.layers.fc(input=inputs, size=hid_dim)
+        fc = fluid.layers.fc(input=inputs, size=hid_dim)
-lstm, cell = fluid.layers.dynamic_lstm(
+        lstm, cell = fluid.layers.dynamic_lstm(
-input=fc, size=hid_dim, is_reverse=(i % 2) == 0)
+            input=fc, size=hid_dim, is_reverse=(i % 2) == 0)
-inputs = [fc, lstm]
+        inputs = [fc, lstm]
-fc_last = fluid.layers.sequence_pool(input=inputs[0], pool_type='max')
+    fc_last = fluid.layers.sequence_pool(input=inputs[0], pool_type='max')
-lstm_last = fluid.layers.sequence_pool(input=inputs[1], pool_type='max')
+    lstm_last = fluid.layers.sequence_pool(input=inputs[1], pool_type='max')
-prediction = fluid.layers.fc(input=[fc_last, lstm_last],
+    prediction = fluid.layers.fc(
-size=class_dim,
+        input=[fc_last, lstm_last], size=class_dim, act='softmax')
-act='softmax')
+    return prediction
-return prediction
 ```
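 The stacked bidirectional LSTM above abstracts high-level features and maps them to a vector whose size equals the number of classes. The `softmax` activation on the final fully connected layer (`act='softmax'`) then computes the probability that the input belongs to each class.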
@@ -184,12 +185,13 @@ return prediction
 ```python
 def inference_program(word_dict):
-data = fluid.layers.data(
+    data = fluid.layers.data(
-name="words", shape=[1], dtype="int64", lod_level=1)
+        name="words", shape=[1], dtype="int64", lod_level=1)
-dict_dim = len(word_dict)
+    dict_dim = len(word_dict)
-net = convolution_net(data, dict_dim, CLASS_DIM, EMB_DIM, HID_DIM)
+    net = convolution_net(data, dict_dim, CLASS_DIM, EMB_DIM, HID_DIM)
-return net
+    # net = stacked_lstm_net(data, dict_dim, CLASS_DIM, EMB_DIM, HID_DIM, STACKED_NUM)
+    return net
 ```
 Here we define `train_program`, which uses the result returned by `inference_program` to compute the cost. We also define the optimizer function `optimizer_func`.
@@ -200,16 +202,16 @@ return net
 ```python
 def train_program(word_dict):
-prediction = inference_program(word_dict)
+    prediction = inference_program(word_dict)
-label = fluid.layers.data(name="label", shape=[1], dtype="int64")
+    label = fluid.layers.data(name="label", shape=[1], dtype="int64")
-cost = fluid.layers.cross_entropy(input=prediction, label=label)
+    cost = fluid.layers.cross_entropy(input=prediction, label=label)
-avg_cost = fluid.layers.mean(cost)
+    avg_cost = fluid.layers.mean(cost)
-accuracy = fluid.layers.accuracy(input=prediction, label=label)
+    accuracy = fluid.layers.accuracy(input=prediction, label=label)
-return [avg_cost, accuracy]
+    return [avg_cost, accuracy]
 def optimizer_func():
-return fluid.optimizer.Adagrad(learning_rate=0.002)
+    return fluid.optimizer.Adagrad(learning_rate=0.002)
 ```
 ## 训练模型
@@ -236,9 +238,9 @@ word_dict = paddle.dataset.imdb.word_dict()
 print ("Reading training data....")
 train_reader = paddle.batch(
-paddle.reader.shuffle(
+    paddle.reader.shuffle(
-paddle.dataset.imdb.train(word_dict), buf_size=25000),
+        paddle.dataset.imdb.train(word_dict), buf_size=25000),
-batch_size=BATCH_SIZE)
+    batch_size=BATCH_SIZE)
 ```
 ### 构造训练器(trainer)
@@ -246,9 +248,9 @@ batch_size=BATCH_SIZE)
 ```python
 trainer = fluid.Trainer(
-train_func=partial(train_program, word_dict),
+    train_func=partial(train_program, word_dict),
-place=place,
+    place=place,
-optimizer_func=optimizer_func)
+    optimizer_func=optimizer_func)
 ```
 ### 提供数据
@@ -268,13 +270,13 @@ feed_order = ['words', 'label']
 params_dirname = "understand_sentiment_conv.inference.model"
 def event_handler(event):
-if isinstance(event, fluid.EndStepEvent):
+    if isinstance(event, fluid.EndStepEvent):
-print("Step {0}, Epoch {1} Metrics {2}".format(
+        print("Step {0}, Epoch {1} Metrics {2}".format(
-event.step, event.epoch, map(np.array, event.metrics)))
+                event.step, event.epoch, list(map(np.array, event.metrics))))
-if event.step == 10:
+        if event.step == 10:
-trainer.save_params(params_dirname)
+            trainer.save_params(params_dirname)
-trainer.stop()
+            trainer.stop()
 ```
 ### 开始训练
@@ -283,10 +285,10 @@ trainer.stop()
 ```python
 trainer.train(
-num_epochs=1,
+    num_epochs=1,
-event_handler=event_handler,
+    event_handler=event_handler,
-reader=train_reader,
+    reader=train_reader,
-feed_order=feed_order)
+    feed_order=feed_order)
 ```
 ## 应用模型
@@ -297,7 +299,7 @@ feed_order=feed_order)
 ```python
 inferencer = fluid.Inferencer(
-inference_program, param_path=params_dirname, place=place)
+    infer_func=partial(inference_program, word_dict),
+    param_path=params_dirname,
+    place=place)
 ```
 ### 生成测试用输入数据
@@ -307,14 +309,14 @@ inference_program, param_path=params_dirname, place=place)
 ```python
 reviews_str = [
-'read the book forget the movie', 'this is a great movie', 'this is very bad'
+    'read the book forget the movie', 'this is a great movie', 'this is very bad'
 ]
 reviews = [c.split() for c in reviews_str]
 UNK = word_dict['<unk>']
 lod = []
 for c in reviews:
-lod.append([word_dict.get(words, UNK) for words in c])
+    lod.append([word_dict.get(words, UNK) for words in c])
 base_shape = [[len(c) for c in lod]]
@@ -329,7 +331,7 @@ tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
 results = inferencer.infer({'words': tensor_words})
 for i, r in enumerate(results[0]):
-print("Predict probability of ", r[0], " to be positive and ", r[1], " to be negative for review \'", reviews_str[i], "\'")
+    print("Predict probability of ", r[0], " to be positive and ", r[1], " to be negative for review \'", reviews_str[i], "\'")
 ```
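The input preparation above maps each word to its dictionary id, falls back to the `<unk>` id for unknown words, and records the length of every review. The sketch below replays that logic with a tiny hand-written dictionary so it can run without downloading the imdb data; `toy_word_dict` and its ids are assumptions for illustration, not the real vocabulary.

```python
# Hypothetical miniature dictionary; the real one comes from
# paddle.dataset.imdb.word_dict().
toy_word_dict = {'this': 0, 'is': 1, 'a': 2, 'great': 3, 'movie': 4, '<unk>': 5}

reviews_str = ['this is a great movie', 'this is very bad']
reviews = [c.split() for c in reviews_str]

UNK = toy_word_dict['<unk>']
lod = []
for c in reviews:
    # Words missing from the dictionary are replaced by the <unk> id.
    lod.append([toy_word_dict.get(word, UNK) for word in c])

# One level of sequence lengths: one entry per review.
base_shape = [[len(c) for c in lod]]

print(lod)
print(base_shape)
```

Here `very` and `bad` are not in the toy dictionary, so both map to the `<unk>` id.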

--- a/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/image/lstm_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/image/lstm_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/image/rnn.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/image/rnn.png
--- a/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/image/stacked_lstm_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/understand_sentiment/image/stacked_lstm_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/word2vec/.gitignore
+++ b/doc/fluid/new_docs/beginners_guide/basics/word2vec/.gitignore
-data/train.list
-data/test.list
-data/simple-examples*
--- a/doc/fluid/new_docs/beginners_guide/basics/word2vec/index.md
+++ b/doc/fluid/new_docs/beginners_guide/basics/word2vec/index.md
 # 词向量
-本教程源代码目录在[book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec)， 初次使用请参考PaddlePaddle[安装教程](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)。
+本教程源代码目录在[book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec)， 初次使用请参考PaddlePaddle[安装教程](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)，更多内容请参考本教程的[视频课堂](http://bit.baidu.com/course/detail/id/175.html)。
 ## 背景介绍
@@ -12,17 +12,19 @@
 One-hot vector虽然自然，但是用处有限。比如，在互联网广告系统里，如果用户输入的query是“母亲节”，而有一个广告的关键词是“康乃馨”。虽然按照常理，我们知道这两个词之间是有联系的——母亲节通常应该送给母亲一束康乃馨；但是这两个词对应的one-hot vectors之间的距离度量，无论是欧氏距离还是余弦相似度(cosine similarity)，由于其向量正交，都认为这两个词毫无相关性。 得出这种与我们相悖的结论的根本原因是：每个词本身的信息量都太小。所以，仅仅给定两个词，不足以让我们准确判别它们是否相关。要想精确计算相关性，我们还需要更多的信息——从大量数据里通过机器学习方法归纳出来的知识。
-在机器学习领域里，各种“知识”被各种模型表示，词向量模型(word embedding model)就是其中的一类。通过词向量模型可将一个 one-hot vector映射到一个维度更低的实数向量（embedding vector），如`$embedding(Mother's\ Day) = [0.3, 4.2, -1.5, ...], embedding(Carnation) = [0.2, 5.6, -2.3, ...]$`。在这个映射到的实数向量表示中，希望两个语义（或用法）上相似的词对应的词向量“更像”，这样如“母亲节”和“康乃馨”的对应词向量的余弦相似度就不再为零了。
+在机器学习领域里，各种“知识”被各种模型表示，词向量模型(word embedding model)就是其中的一类。通过词向量模型可将一个 one-hot vector映射到一个维度更低的实数向量（embedding vector），如$embedding(母亲节) = [0.3, 4.2, -1.5, ...], embedding(康乃馨) = [0.2, 5.6, -2.3, ...]$。在这个映射到的实数向量表示中，希望两个语义（或用法）上相似的词对应的词向量“更像”，这样如“母亲节”和“康乃馨”的对应词向量的余弦相似度就不再为零了。
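-Word vector models can be probabilistic models, co-occurrence matrix models, or neural network models. Before neural networks were used to learn word vectors, the traditional approach was to compute a co-occurrence matrix `$X$` of the words. `$X$` is a matrix of size `$|V| \times |V|$`, where `$X_{ij}$` is the number of times the i-th and j-th words of the vocabulary `V` occur together across the whole corpus, and `$|V|$` is the size of the vocabulary. Applying a matrix factorization to `$X$` (such as Singular Value Decomposition \[[5](#参考文献)\]) gives a matrix `$U$` that is taken as the word vectors of all words:
+Word vector models can be probabilistic models, co-occurrence matrix models, or neural network models. Before neural networks were used to learn word vectors, the traditional approach was to compute a co-occurrence matrix $X$ of the words. $X$ is a matrix of size $|V| \times |V|$, where $X_{ij}$ is the number of times the i-th and j-th words of the vocabulary `V` occur together across the whole corpus, and $|V|$ is the size of the vocabulary. Applying a matrix factorization to $X$ (such as Singular Value Decomposition \[[5](#参考文献)\]) gives a matrix $U$ that is taken as the word vectors of all words: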
 $$X = USV^T$$
-但这样的传统做法有很多问题：<br/>
+但这样的传统做法有很多问题：
-1) 由于很多词没有出现，导致矩阵极其稀疏，因此需要对词频做额外处理来达到好的矩阵分解效果；<br/>
-2) 矩阵非常大，维度太高(通常达到`$10^6*10^6$`的数量级)；<br/>
+1) 由于很多词没有出现，导致矩阵极其稀疏，因此需要对词频做额外处理来达到好的矩阵分解效果；
-3) 需要手动去掉停用词（如although, a,...），不然这些频繁出现的词也会影响矩阵分解的效果。
+2) 矩阵非常大，维度太高(通常达到$10^6 \times 10^6$的数量级)；
+3) 需要手动去掉停用词（如although, a,...），不然这些频繁出现的词也会影响矩阵分解的效果。
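As a rough illustration of the traditional approach, the sketch below builds a small co-occurrence matrix $X$ in plain Python. The definition of "occurring together" is an assumption here: two words are counted when they appear within `window` positions of each other in the same sentence. The function name and the toy corpus are also hypothetical.

```python
def cooccurrence_matrix(sentences, window=1):
    """Count how often each pair of words appears within `window` positions."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in vocab]
    for s in sentences:
        for pos, word in enumerate(s):
            lo = max(0, pos - window)
            hi = min(len(s), pos + window + 1)
            for other in range(lo, hi):
                if other != pos:
                    X[index[word]][index[s[other]]] += 1
    return vocab, X


sentences = [["i", "like", "nlp"], ["i", "like", "paddle"]]
vocab, X = cooccurrence_matrix(sentences)
for word, row in zip(vocab, X):
    print(word, row)
```

Even in this toy corpus most entries are zero, which hints at the sparsity problem listed above. A factorization routine such as `numpy.linalg.svd` could then be applied to `X` to obtain $U$.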
 基于神经网络的模型不需要计算存储一个在全语料上统计的大表，而是通过学习语义信息得到词向量，因此能很好地解决以上问题。在本章里，我们将展示基于神经网络训练词向量的细节，以及如何用PaddlePaddle训练一个词向量模型。
@@ -31,19 +33,21 @@ $$X = USV^T$$
 本章中，当词向量训练好后，我们可以用数据可视化算法t-SNE\[[4](#参考文献)\]画出词语特征在二维上的投影（如下图所示）。从图中可以看出，语义相关的词语（如a, the, these; big, huge）在投影上距离很近，语意无关的词（如say, business; decision, japan）在投影上的距离很远。
-![2d_similarity](./image/2d_similarity.png)
 <p align="center">
-图1. 词向量的二维投影
+    <img src = "image/2d_similarity.png" width=400><br/>
+    图1. 词向量的二维投影
 </p>
-另一方面，我们知道两个向量的余弦值在`$[-1,1]$`的区间内：两个完全相同的向量余弦值为1, 两个相互垂直的向量之间余弦值为0，两个方向完全相反的向量余弦值为-1，即相关性和余弦值大小成正比。因此我们还可以计算两个词向量的余弦相似度:
+另一方面，我们知道两个向量的余弦值在$[-1,1]$的区间内：两个完全相同的向量余弦值为1, 两个相互垂直的向量之间余弦值为0，两个方向完全相反的向量余弦值为-1，即相关性和余弦值大小成正比。因此我们还可以计算两个词向量的余弦相似度:
 ```
-similarity: 0.899180685161
 please input two words: big huge
+similarity: 0.899180685161
 please input two words: from company
 similarity: -0.0997506977351
 ```
 以上结果可以通过运行`calculate_dis.py`, 加载字典里的单词和对应训练特征结果得到，我们将在[应用模型](#应用模型)中详细描述用法。
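The cosine similarity itself is simple to compute. The following sketch is independent of `calculate_dis.py`, whose contents are not shown here; the function name and the three-component vectors, which reuse only the first three example values given earlier for "母亲节" and "康乃馨", are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|), a value in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors of two different words are orthogonal.
one_hot_a = [1.0, 0.0, 0.0]
one_hot_b = [0.0, 1.0, 0.0]
print(cosine_similarity(one_hot_a, one_hot_b))

# Truncated example embeddings point in nearly the same direction.
emb_a = [0.3, 4.2, -1.5]
emb_b = [0.2, 5.6, -2.3]
print(cosine_similarity(emb_a, emb_b))
```

The one-hot pair gives a similarity of zero, while the dense vectors give a value close to one, which is the behaviour the text motivates.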
@@ -56,10 +60,10 @@ similarity: -0.0997506977351
 ### 语言模型
 在介绍词向量模型之前，我们先来引入一个概念：语言模型。
-语言模型旨在为语句的联合概率函数`$P(w_1, ..., w_T)$`建模, 其中`$w_i$`表示句子中的第i个词。语言模型的目标是，希望模型对有意义的句子赋予大概率，对没意义的句子赋予小概率。
+语言模型旨在为语句的联合概率函数$P(w_1, ..., w_T)$建模, 其中$w_i$表示句子中的第i个词。语言模型的目标是，希望模型对有意义的句子赋予大概率，对没意义的句子赋予小概率。
 这样的模型可以应用于很多领域，如机器翻译、语音识别、信息检索、词性标注、手写识别等，它们都希望能得到一个连续序列的概率。 以信息检索为例，当你在搜索“how long is a football bame”时（bame是一个医学名词），搜索引擎会提示你是否希望搜索"how long is a football game", 这是因为根据语言模型计算出“how long is a football bame”的概率很低，而与bame近似的，可能引起错误的词中，game会使该句生成的概率最大。
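-For the target probability `$P(w_1, ..., w_T)$` of a language model, if we assume that the words in a text are mutually independent, the joint probability of the whole sentence is the product of the probabilities of its individual words:
+For the target probability $P(w_1, ..., w_T)$ of a language model, if we assume that the words in a text are mutually independent, the joint probability of the whole sentence is the product of the probabilities of its individual words: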
 $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$
@@ -75,7 +79,7 @@ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$
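 In 2003, Yoshua Bengio and colleagues described how to learn a word vector model represented by a neural network in the well-known paper A Neural Probabilistic Language Model \[[1](#参考文献)\]. The Neural Network Language Model (NNLM) in that paper connects a linear mapping with a nonlinear hidden layer and learns the language model and the word vectors at the same time: it learns vector representations of words from a large corpus and uses these vectors to obtain the probability of a whole sentence. Learning a language model in this way can overcome the curse of dimensionality, that is, the inaccuracy that arises when training and test data differ. Note: because the term "neural network language model" is rather generic, we do not use the name NNLM here; given how the model works, this article calls it the N-gram neural model.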
-我们在上文中已经讲到用条件概率建模语言模型，即一句话中第`$t$`个词的概率和该句话的前`$t-1$`个词相关。可实际上越远的词语其实对该词的影响越小，那么如果考虑一个n-gram, 每个词都只受其前面`n-1`个词的影响，则有：
+我们在上文中已经讲到用条件概率建模语言模型，即一句话中第$t$个词的概率和该句话的前$t-1$个词相关。可实际上越远的词语其实对该词的影响越小，那么如果考虑一个n-gram, 每个词都只受其前面`n-1`个词的影响，则有：
 $$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1})$$
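Under the n-gram assumption, each training example consists of `n-1` history words plus the word to predict. The sketch below slices a sentence into such windows; the function name `ngram_windows` is hypothetical, and the `<s>` and `<e>` boundary markers are assumed to be part of the token list.

```python
def ngram_windows(words, n):
    """Return every run of n consecutive words.

    In each window the first n-1 words are the history and the
    last word is the one to predict.
    """
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "<s> I have a dream that one day <e>".split()
windows = ngram_windows(sentence, 5)
for w in windows:
    print(w[:-1], "->", w[-1])
```

With `n = 5` each window supplies four input words and one target word, which matches the four input layers (`firstw` to `fourthw`) and the `nextw` label used in the model configuration.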
@@ -83,33 +87,33 @@ $$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1})$$
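 $$\frac{1}{T}\sum_t \log f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$
-where `$f(w_t, w_{t-1}, ..., w_{t-n+1})$` is the conditional probability of the current word `$w_t$` given the preceding n-1 words, and `$R(\theta)$` is the parameter regularization term.
+where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ is the conditional probability of the current word $w_t$ given the preceding n-1 words, and $R(\theta)$ is the parameter regularization term.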
-![nnlm](./image/nnlm.png)
 <p align="center">
-图2. N-gram神经网络模型
+       <img src="image/nnlm.png" width=500><br/>
+       图2. N-gram神经网络模型
 </p>
 图2展示了N-gram神经网络模型，从下往上看，该模型分为以下几个部分：
- 对于每个样本，模型输入`$w_{t-n+1},...w_{t-1}$`, 输出句子第t个词为字典中`|V|`个词的概率。
+ - 对于每个样本，模型输入$w_{t-n+1},...w_{t-1}$, 输出句子第t个词为字典中`|V|`个词的概率。
-每个输入词`$w_{t-n+1},...w_{t-1}$`首先通过映射矩阵映射到词向量`$C(w_{t-n+1}),...C(w_{t-1})$`。
+   每个输入词$w_{t-n+1},...w_{t-1}$首先通过映射矩阵映射到词向量$C(w_{t-n+1}),...C(w_{t-1})$。
- 然后所有词语的词向量连接成一个大向量，并经过一个非线性映射得到历史词语的隐层表示：
+ - 然后所有词语的词向量连接成一个大向量，并经过一个非线性映射得到历史词语的隐层表示：
-$$g=Utanh(\theta^Tx + b_1) + Wx + b_2$$
+    $$g=Utanh(\theta^Tx + b_1) + Wx + b_2$$
-其中，`$x$`为所有词语的词向量连接成的大向量，表示文本历史特征；`$\theta$`、`$U$`、`$b_1$`、`$b_2$`和`$W$`分别为词向量层到隐层连接的参数。`$g$`表示未经归一化的所有输出单词概率，`$g_i$`表示未经归一化的字典中第`$i$`个单词的输出概率。
+    其中，$x$为所有词语的词向量连接成的大向量，表示文本历史特征；$\theta$、$U$、$b_1$、$b_2$和$W$分别为词向量层到隐层连接的参数。$g$表示未经归一化的所有输出单词概率，$g_i$表示未经归一化的字典中第$i$个单词的输出概率。
- 根据softmax的定义，通过归一化`$g_i$`, 生成目标词`$w_t$`的概率为：
+ - 根据softmax的定义，通过归一化$g_i$, 生成目标词$w_t$的概率为：
-$$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$
+  $$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$
- 整个网络的损失值(cost)为多类分类交叉熵，用公式表示为
+ - 整个网络的损失值(cost)为多类分类交叉熵，用公式表示为
-$$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}log(softmax(g_k^i))$$
+   $$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}log(softmax(g_k^i))$$
-其中`$y_k^i$`表示第`$i$`个样本第`$k$`类的真实标签(0或1)，`$softmax(g_k^i)$`表示第i个样本第k类softmax输出的概率。
+   其中$y_k^i$表示第$i$个样本第$k$类的真实标签(0或1)，$softmax(g_k^i)$表示第i个样本第k类softmax输出的概率。
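The softmax normalization and the cross-entropy cost for a single sample can be sketched as follows. Because the true label is one-hot, the inner sum over the vocabulary reduces to the negative log probability of the target word. The function names and the toy scores are assumptions for illustration.

```python
import math


def softmax(g):
    """Normalize unnormalized scores g into probabilities."""
    m = max(g)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in g]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(g, target):
    """Cost of one sample whose true word has index `target`."""
    return -math.log(softmax(g)[target])


# Toy scores over a vocabulary of four words.
g = [2.0, 1.0, 0.1, -1.0]
probs = softmax(g)
print(probs)
print(cross_entropy(g, 0))
```

The cost is small when the target word already has the largest score and grows as its probability shrinks.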
@@ -117,27 +121,27 @@ $$J(\theta) = -\sum_{i=1}^N\sum_{c=1}^{|V|}y_k^{i}log(softmax(g_k^i))$$
 CBOW模型通过一个词的上下文（各N个词）预测当前词。当N=2时，模型如下图所示：
-![cbow](./image/cbow.png)
 <p align="center">
-图3. CBOW模型
+    <img src="image/cbow.png" width=250><br/>
+    图3. CBOW模型
 </p>
 具体来说，不考虑上下文的词语输入顺序，CBOW是用上下文词语的词向量的均值来预测当前词。即：
 $$context = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$
-其中`$x_t$`为第`$t$`个词的词向量，分类分数（score）向量 `$z=U*context$`，最终的分类`$y$`采用softmax，损失函数采用多类分类交叉熵。
+其中$x_t$为第$t$个词的词向量，分类分数（score）向量 $z=U*context$，最终的分类$y$采用softmax，损失函数采用多类分类交叉熵。
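The CBOW computation above amounts to averaging the context word vectors and multiplying the result by a score matrix $U$. A minimal sketch in plain Python follows; the function names and toy values are assumptions for illustration, and the softmax over the scores is omitted.

```python
def cbow_context(vectors):
    """Average the context word vectors, ignoring their order."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def scores(U, context):
    """z = U * context, one score per word in the vocabulary."""
    return [sum(u * c for u, c in zip(row, context)) for row in U]


# Vectors of x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2} (toy values).
context_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
context = cbow_context(context_vectors)

# Toy score matrix for a vocabulary of three words.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = scores(U, context)
print(context, z)
```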
 ### Skip-gram model
 CBOW的好处是对上下文词语的分布在词向量上进行了平滑，去掉了噪声，因此在小数据集上很有效。而Skip-gram的方法中，用一个词预测其上下文，得到了当前词上下文的很多样本，因此可用于更大的数据集。
-![skipgram](./image/skipgram.png)
 <p align="center">
-图4. Skip-gram模型
+    <img src="image/skipgram.png" width=250><br/>
+    图4. Skip-gram模型
 </p>
-如上图所示，Skip-gram模型的具体做法是，将一个词的词向量映射到`$2n$`个词的词向量（`$2n$`表示当前输入词的前后各`$n$`个词），然后分别通过softmax得到这`$2n$`个词的分类损失值之和。
+如上图所示，Skip-gram模型的具体做法是，将一个词的词向量映射到$2n$个词的词向量（$2n$表示当前输入词的前后各$n$个词），然后分别通过softmax得到这$2n$个词的分类损失值之和。
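For Skip-gram, each word yields up to $2n$ training pairs of the form (input word, context word). The sketch below enumerates those pairs for a token list; the function name `skipgram_pairs` and the toy sentence are assumptions for illustration.

```python
def skipgram_pairs(words, n):
    """List (center, context) pairs using n words on each side."""
    pairs = []
    for t, center in enumerate(words):
        lo = max(0, t - n)
        hi = min(len(words), t + n + 1)
        for k in range(lo, hi):
            if k != t:
                pairs.append((center, words[k]))
    return pairs


words = ["a", "b", "c", "d", "e"]
pairs = skipgram_pairs(words, 2)
print([p for p in pairs if p[0] == "c"])
```

Words near the sentence boundary have fewer than $2n$ neighbours, so they contribute fewer pairs.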
 ## 数据准备
@@ -148,21 +152,21 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑，去
 <p align="center">
 <table>
-<tr>
+    <tr>
-<td>训练数据</td>
+        <td>训练数据</td>
-<td>验证数据</td>
+        <td>验证数据</td>
-<td>测试数据</td>
+        <td>测试数据</td>
-</tr>
+    </tr>
-<tr>
+    <tr>
-<td>ptb.train.txt</td>
+        <td>ptb.train.txt</td>
-<td>ptb.valid.txt</td>
+        <td>ptb.valid.txt</td>
-<td>ptb.test.txt</td>
+        <td>ptb.test.txt</td>
-</tr>
+    </tr>
-<tr>
+    <tr>
-<td>42068句</td>
+        <td>42068句</td>
-<td>3370句</td>
+        <td>3370句</td>
-<td>3761句</td>
+        <td>3761句</td>
-</tr>
+    </tr>
 </table>
 </p>
@@ -189,9 +193,9 @@ dream that one day <e>
 本配置的模型结构如下图所示：
-![ngram](./image/ngram.png)
 <p align="center">
-图5. 模型配置中的N-gram神经网络模型
+    <img src="image/ngram.png" width=400><br/>
+    图5. 模型配置中的N-gram神经网络模型
 </p>
 首先，加载所需要的包：
@@ -204,6 +208,7 @@ from functools import partial
 import math
 import os
 import sys
+from __future__ import print_function
 ```
 然后，定义参数：
@@ -226,57 +231,57 @@ dict_size = len(word_dict)
 ```python
 def inference_program(is_sparse):
-first_word = fluid.layers.data(name='firstw', shape=[1], dtype='int64')
+    first_word = fluid.layers.data(name='firstw', shape=[1], dtype='int64')
-second_word = fluid.layers.data(name='secondw', shape=[1], dtype='int64')
+    second_word = fluid.layers.data(name='secondw', shape=[1], dtype='int64')
-third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='int64')
+    third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='int64')
-fourth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='int64')
+    fourth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='int64')
-embed_first = fluid.layers.embedding(
+    embed_first = fluid.layers.embedding(
-input=first_word,
+        input=first_word,
-size=[dict_size, EMBED_SIZE],
+        size=[dict_size, EMBED_SIZE],
-dtype='float32',
+        dtype='float32',
-is_sparse=is_sparse,
+        is_sparse=is_sparse,
-param_attr='shared_w')
+        param_attr='shared_w')
-embed_second = fluid.layers.embedding(
+    embed_second = fluid.layers.embedding(
-input=second_word,
+        input=second_word,
-size=[dict_size, EMBED_SIZE],
+        size=[dict_size, EMBED_SIZE],
-dtype='float32',
+        dtype='float32',
-is_sparse=is_sparse,
+        is_sparse=is_sparse,
-param_attr='shared_w')
+        param_attr='shared_w')
-embed_third = fluid.layers.embedding(
+    embed_third = fluid.layers.embedding(
-input=third_word,
+        input=third_word,
-size=[dict_size, EMBED_SIZE],
+        size=[dict_size, EMBED_SIZE],
-dtype='float32',
+        dtype='float32',
-is_sparse=is_sparse,
+        is_sparse=is_sparse,
-param_attr='shared_w')
+        param_attr='shared_w')
-embed_fourth = fluid.layers.embedding(
+    embed_fourth = fluid.layers.embedding(
-input=fourth_word,
+        input=fourth_word,
-size=[dict_size, EMBED_SIZE],
+        size=[dict_size, EMBED_SIZE],
-dtype='float32',
+        dtype='float32',
-is_sparse=is_sparse,
+        is_sparse=is_sparse,
-param_attr='shared_w')
+        param_attr='shared_w')
-concat_embed = fluid.layers.concat(
+    concat_embed = fluid.layers.concat(
-input=[embed_first, embed_second, embed_third, embed_fourth], axis=1)
+        input=[embed_first, embed_second, embed_third, embed_fourth], axis=1)
-hidden1 = fluid.layers.fc(input=concat_embed,
+    hidden1 = fluid.layers.fc(input=concat_embed,
-size=HIDDEN_SIZE,
+                              size=HIDDEN_SIZE,
-act='sigmoid')
+                              act='sigmoid')
-predict_word = fluid.layers.fc(input=hidden1, size=dict_size, act='softmax')
+    predict_word = fluid.layers.fc(input=hidden1, size=dict_size, act='softmax')
-return predict_word
+    return predict_word
 ```
 - 基于以上的神经网络结构，我们可以如下定义我们的`训练`方法
 ```python
 def train_program(is_sparse):
-# The declaration of 'next_word' must be after the invoking of inference_program,
+    # The declaration of 'next_word' must be after the invoking of inference_program,
-# or the data input order of train program would be [next_word, firstw, secondw,
+    # or the data input order of train program would be [next_word, firstw, secondw,
-# thirdw, fourthw], which is not correct.
+    # thirdw, fourthw], which is not correct.
-predict_word = inference_program(is_sparse)
+    predict_word = inference_program(is_sparse)
-next_word = fluid.layers.data(name='nextw', shape=[1], dtype='int64')
+    next_word = fluid.layers.data(name='nextw', shape=[1], dtype='int64')
-cost = fluid.layers.cross_entropy(input=predict_word, label=next_word)
+    cost = fluid.layers.cross_entropy(input=predict_word, label=next_word)
-avg_cost = fluid.layers.mean(cost)
+    avg_cost = fluid.layers.mean(cost)
-return avg_cost
+    return avg_cost
 ```
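 - Now we can start training. This version is much simpler than earlier ones. Ready-made training and test sets are available: `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()`. Both return a reader. In PaddlePaddle, a reader is a Python function that, when called, returns a Python generator yielding one data sample at a time.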
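The reader idea can be illustrated without PaddlePaddle. In the sketch below, `toy_reader` and `batch` are hypothetical stand-ins written for this example: they mimic the general pattern of a reader function and a batching wrapper, and are not the actual `paddle.dataset` or `paddle.batch` code.

```python
def toy_reader():
    """A reader: calling it returns a generator of samples."""
    for i in range(5):
        yield (i, i * i)


def batch(reader, batch_size):
    """Wrap a reader so that it yields lists of `batch_size` samples."""
    def batched_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) == batch_size:
                yield buf
                buf = []
        if buf:
            # This toy version keeps the final, smaller batch.
            yield buf
    return batched_reader


batched = batch(toy_reader, 2)
for mini_batch in batched():
    print(mini_batch)
```

Because the wrapper returns another reader, such decorators can be chained, for example a shuffling wrapper followed by a batching wrapper.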
@@ -285,59 +290,59 @@ return avg_cost
 ```python
 def optimizer_func():
-# Note here we need to choose more sophisticated optimizers
+    # Note here we need to choose more sophisticated optimizers
-# such as AdaGrad with a decay rate. The normal SGD converges
+    # such as AdaGrad with a decay rate. The normal SGD converges
-# very slowly.
+    # very slowly.
-# optimizer=fluid.optimizer.SGD(learning_rate=0.001),
+    # optimizer=fluid.optimizer.SGD(learning_rate=0.001),
-return fluid.optimizer.AdagradOptimizer(
+    return fluid.optimizer.AdagradOptimizer(
-learning_rate=3e-3,
+        learning_rate=3e-3,
-regularization=fluid.regularizer.L2DecayRegularizer(8e-4))
+        regularization=fluid.regularizer.L2DecayRegularizer(8e-4))
 def train(use_cuda, train_program, params_dirname):
-train_reader = paddle.batch(
+    train_reader = paddle.batch(
-paddle.dataset.imikolov.train(word_dict, N), BATCH_SIZE)
+        paddle.dataset.imikolov.train(word_dict, N), BATCH_SIZE)
-test_reader = paddle.batch(
+    test_reader = paddle.batch(
-paddle.dataset.imikolov.test(word_dict, N), BATCH_SIZE)
+        paddle.dataset.imikolov.test(word_dict, N), BATCH_SIZE)
-place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
-def event_handler(event):
+    def event_handler(event):
-if isinstance(event, fluid.EndStepEvent):
+        if isinstance(event, fluid.EndStepEvent):
-# We output cost every 10 steps.
+            # We output cost every 10 steps.
-if event.step % 10 == 0:
+            if event.step % 10 == 0:
-outs = trainer.test(
+                outs = trainer.test(
-reader=test_reader,
+                    reader=test_reader,
-feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])
+                    feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])
-avg_cost = outs[0]
+                avg_cost = outs[0]
-print "Step %d: Average Cost %f" % (event.step, avg_cost)
+                print("Step %d: Average Cost %f" % (event.step, avg_cost))
-# If average cost is lower than 5.8, we consider the model good enough to stop.
+                # If average cost is lower than 5.8, we consider the model good enough to stop.
-# Note 5.8 is a relatively high value. In order to get a better model, one should
+                # Note 5.8 is a relatively high value. In order to get a better model, one should
-# aim for avg_cost lower than 3.5. But the training could take longer time.
+                # aim for avg_cost lower than 3.5. But the training could take longer time.
-if avg_cost < 5.8:
+                if avg_cost < 5.8:
-trainer.save_params(params_dirname)
+                    trainer.save_params(params_dirname)
-trainer.stop()
+                    trainer.stop()
-if math.isnan(avg_cost):
+                if math.isnan(avg_cost):
-sys.exit("got NaN loss, training failed.")
+                    sys.exit("got NaN loss, training failed.")
-trainer = fluid.Trainer(
+    trainer = fluid.Trainer(
-train_func=train_program,
+        train_func=train_program,
-optimizer_func=optimizer_func,
+        optimizer_func=optimizer_func,
-place=place)
+        place=place)
-trainer.train(
+    trainer.train(
-reader=train_reader,
+        reader=train_reader,
-num_epochs=1,
+        num_epochs=1,
-event_handler=event_handler,
+        event_handler=event_handler,
-feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])
+        feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])
 ```
 - `trainer.train`将会开始训练。从`event_handler`返回的监控情况如下：
-```python
+```text
 Step 0: Average Cost 7.337213
 Step 10: Average Cost 6.136128
 Step 20: Average Cost 5.766995
@@ -352,50 +357,49 @@ Step 20: Average Cost 5.766995
 ```python
 def infer(use_cuda, inference_program, params_dirname=None):
-place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
-inferencer = fluid.Inferencer(
+    inferencer = fluid.Inferencer(
-infer_func=inference_program, param_path=params_dirname, place=place)
+        infer_func=inference_program, param_path=params_dirname, place=place)
-# Setup inputs by creating 4 LoDTensors representing 4 words. Here each word
+    # Setup inputs by creating 4 LoDTensors representing 4 words. Here each word
-# is simply an index to look up for the corresponding word vector and hence
+    # is simply an index to look up for the corresponding word vector and hence
-# the shape of word (base_shape) should be [1]. The length-based level of
+    # the shape of word (base_shape) should be [1]. The length-based level of
-# detail (lod) info of each LoDtensor should be [[1]] meaning there is only
+    # detail (lod) info of each LoDtensor should be [[1]] meaning there is only
-# one lod_level and there is only one sequence of one word on this level.
+    # one lod_level and there is only one sequence of one word on this level.
-# Note that lod info should be a list of lists.
+    # Note that lod info should be a list of lists.
-data1 = [[211]]  # 'among'
+    data1 = [[211]]  # 'among'
-data2 = [[6]]    # 'a'
+    data2 = [[6]]    # 'a'
-data3 = [[96]]   # 'group'
+    data3 = [[96]]   # 'group'
-data4 = [[4]]    # 'of'
+    data4 = [[4]]    # 'of'
-lod = [[1]]
+    lod = [[1]]
-first_word  = fluid.create_lod_tensor(data1, lod, place)
+    first_word  = fluid.create_lod_tensor(data1, lod, place)
-second_word = fluid.create_lod_tensor(data2, lod, place)
+    second_word = fluid.create_lod_tensor(data2, lod, place)
-third_word  = fluid.create_lod_tensor(data3, lod, place)
+    third_word  = fluid.create_lod_tensor(data3, lod, place)
-fourth_word = fluid.create_lod_tensor(data4, lod, place)
+    fourth_word = fluid.create_lod_tensor(data4, lod, place)
-result = inferencer.infer(
+    result = inferencer.infer(
-{
+        {
-'firstw': first_word,
+            'firstw': first_word,
-'secondw': second_word,
+            'secondw': second_word,
-'thirdw': third_word,
+            'thirdw': third_word,
-'fourthw': fourth_word
+            'fourthw': fourth_word
-},
+        },
-return_numpy=False)
+        return_numpy=False)
-print(numpy.array(result[0]))
+    print(numpy.array(result[0]))
-most_possible_word_index = numpy.argmax(result[0])
+    most_possible_word_index = numpy.argmax(result[0])
-print(most_possible_word_index)
+    print(most_possible_word_index)
-print([
+    print([
-key for key, value in word_dict.iteritems()
+        key for key, value in word_dict.iteritems()
-if value == most_possible_word_index
+        if value == most_possible_word_index
-][0])
+    ][0])
 ```
 在经历3分钟的短暂训练后，我们得到如下的预测。我们的模型预测 `among a group of` 的下一个词是`a`。这比较符合文法规律。如果我们训练时间更长，比如几个小时，那么我们会得到的下一个预测是 `workers`。
+```text
-```python
 [[0.00106646 0.0007907  0.00072041 ... 0.00049024 0.00041355 0.00084464]]
 6
 a
@@ -405,20 +409,20 @@ a
 ```python
 def main(use_cuda, is_sparse):
-if use_cuda and not fluid.core.is_compiled_with_cuda():
+    if use_cuda and not fluid.core.is_compiled_with_cuda():
-return
+        return
-params_dirname = "word2vec.inference.model"
+    params_dirname = "word2vec.inference.model"
-train(
+    train(
-use_cuda=use_cuda,
+        use_cuda=use_cuda,
-train_program=partial(train_program, is_sparse),
+        train_program=partial(train_program, is_sparse),
-params_dirname=params_dirname)
+        params_dirname=params_dirname)
-infer(
+    infer(
-use_cuda=use_cuda,
+        use_cuda=use_cuda,
-inference_program=partial(inference_program, is_sparse),
+        inference_program=partial(inference_program, is_sparse),
-params_dirname=params_dirname)
+        params_dirname=params_dirname)
 main(use_cuda=use_cuda, is_sparse=True)

--- a/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/cbow_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/cbow_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/ngram.en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/ngram.en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/nnlm_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/nnlm_en.png
--- a/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/skipgram_en.png
+++ b/doc/fluid/new_docs/beginners_guide/basics/word2vec/image/skipgram_en.png
--- a/doc/fluid/new_docs/beginners_guide/install/install_doc.rst
+++ b/doc/fluid/new_docs/beginners_guide/install/install_doc.rst
@@ -57,7 +57,28 @@ paddlepaddle-gpu==0.11.0            使用CUDA 7.5和cuDNN 5编译的0.11.0版
 您可以在 `Release History <https://pypi.org/project/paddlepaddle-gpu/#history>`_
 中找到paddlepaddle-gpu的各个发行版本。
-如果需要获取并安装最新的PaddlePaddle开发分支，可以从我们的 `CI系统 <https://paddleci.ngrok.io/project.html?projectId=Manylinux1&tab=projectOverview>`_ 中下载最新的whl安装包和c-api开发包并安装。如需登录，请点击“Log in as guest”。
+如果需要获取并安装最新的（开发分支）PaddlePaddle，可以从我们的CI系统中下载最新的whl
+安装包和c-api开发包并安装，您可以从下面的表格中找到需要的版本：
+如果在点击下面链接时出现如下登录界面，点击“Log in as guest”即可开始下载：
+.. image:: paddleci.png
+   :scale: 50 %
+   :align: center
+..  csv-table:: 各个版本最新的whl包
+    :header: "版本说明", "cp27-cp27mu", "cp27-cp27m"
+    :widths: 1, 3, 3
+    "stable_cuda9.0_cudnn7", "`paddlepaddle_gpu-0.14.0-cp27-cp27mu-manylinux1_x86_64.whl <https://files.pythonhosted.org/packages/ee/ee/5d96e99d4a6d57bd1a7a8c4c98124a5ba0f6f0e07f38f4cee1365e0d9734/paddlepaddle_gpu-0.14.0-cp27-cp27mu-manylinux1_x86_64.whl>`__", "`paddlepaddle_gpu-0.14.0-cp27-cp27m-manylinux1_x86_64.whl <https://files.pythonhosted.org/packages/2e/65/3c1e44417dfc4afc7004f4db06789876b1237a0b6b234e0bd4213f3258b7/paddlepaddle_gpu-0.14.0-cp27-cp27m-manylinux1_x86_64.whl>`__"
+    "stable_cuda8.0_cudnn7", "`paddlepaddle_gpu-0.14.0.post87-cp27-cp27mu-manylinux1_x86_64.whl <https://files.pythonhosted.org/packages/a1/eb/261d920ede38d4b2b8dfb5817d7f7d25c526b1a70260f23312ad6029c0d3/paddlepaddle_gpu-0.14.0.post87-cp27-cp27mu-manylinux1_x86_64.whl>`__", "`paddlepaddle_gpu-0.14.0.post87-cp27-cp27m-manylinux1_x86_64.whl <https://files.pythonhosted.org/packages/54/1d/2c2a5c8665634b47fa925839108752611202a7c08ba4d65c2ee79f825a0e/paddlepaddle_gpu-0.14.0.post87-cp27-cp27m-manylinux1_x86_64.whl>`__"
+    "stable_cuda8.0_cudnn5", "`paddlepaddle_gpu-0.14.0.post85-cp27-cp27mu-manylinux1_x86_64.whl <https://files.pythonhosted.org/packages/60/50/94d16d34976f06b3cd8818d9b7bf40a9ff16bc48120ac9254d976f8ffc35/paddlepaddle_gpu-0.14.0.post85-cp27-cp27mu-manylinux1_x86_64.whl>`__", "`paddlepaddle_gpu-0.14.0.post85-cp27-cp27m-manylinux1_x86_64.whl <https://files.pythonhosted.org/packages/24/dd/25c1db09524f654c80baa83e7aafdd67109449bd5b500964f4005047dcf8/paddlepaddle_gpu-0.14.0.post85-cp27-cp27m-manylinux1_x86_64.whl>`__"
+    "cpu_avx_mkl", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl <https://paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxCp27cp27mu/845:id/paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl <https://paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxCp27cp27mu/845:id/paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl>`__"
+    "cpu_avx_openblas", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl <https://paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxOpenblas/846:id/paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl <https://paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxOpenblas/846:id/paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl>`__"
+    "cpu_noavx_openblas", "`paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl <https://paddleci.ngrok.io/repository/download/Manylinux1_CpuNoavxOpenblas/847:id/paddlepaddle-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl <https://paddleci.ngrok.io/repository/download/Manylinux1_CpuNoavxOpenblas/847:id/paddlepaddle-latest-cp27-cp27m-linux_x86_64.whl>`__"
+    "cuda8.0_cudnn5_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl <https://paddleci.ngrok.io/repository/download/Manylinux1_Cuda80cudnn5cp27cp27mu/841:id/paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl <https://paddleci.ngrok.io/repository/download/Manylinux1_Cuda80cudnn5cp27cp27mu/841:id/paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl>`__"
+    "cuda8.0_cudnn7_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl <https://paddleci.ngrok.io/repository/download/Manylinux1_Cuda8cudnn7cp27cp27mu/843:id/paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl <https://paddleci.ngrok.io/repository/download/Manylinux1_Cuda8cudnn7cp27cp27mu/843:id/paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl>`__"
+    "cuda9.0_cudnn7_avx_mkl", "`paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl <https://paddleci.ngrok.io/repository/download/Manylinux1_Cuda90cudnn7avxMkl/842:id/paddlepaddle_gpu-latest-cp27-cp27mu-linux_x86_64.whl>`__", "`paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl <https://paddleci.ngrok.io/repository/download/Manylinux1_Cuda90cudnn7avxMkl/842:id/paddlepaddle_gpu-latest-cp27-cp27m-linux_x86_64.whl>`__"
 .. _FAQ:

--- a/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/README.cn.md
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/README.cn.md
-```eval_rst
-..  _quick_start_fit_a_line:
-```
 # 线性回归
 让我们从经典的线性回归（Linear Regression \[[1](#参考文献)\]）模型开始这份教程。在这一章里，你将使用真实的数据集建立起一个房价预测模型，并且了解到机器学习中的若干重要概念。
-本教程源代码目录在[book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line)， 初次使用请参考PaddlePaddle[安装教程](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)。
+本教程源代码目录在[book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line)， 初次使用请参考PaddlePaddle[安装教程](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)，更多内容请参考本教程的[视频课堂](http://bit.baidu.com/course/detail/id/137.html)。
 ## 背景介绍
-给定一个大小为`$n$`的数据集  `${\{y_{i}, x_{i1}, ..., x_{id}\}}_{i=1}^{n}$`，其中`$x_{i1}, \ldots, x_{id}$`是第`$i$`个样本`$d$`个属性上的取值，`$y_i$`是该样本待预测的目标。线性回归模型假设目标`$y_i$`可以被属性间的线性组合描述，即
+给定一个大小为$n$的数据集  ${\{y_{i}, x_{i1}, ..., x_{id}\}}_{i=1}^{n}$，其中$x_{i1}, \ldots, x_{id}$是第$i$个样本$d$个属性上的取值，$y_i$是该样本待预测的目标。线性回归模型假设目标$y_i$可以被属性间的线性组合描述，即
 $$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b,  i=1,\ldots,n$$
-例如，在我们将要建模的房价预测问题里，`$x_{ij}$`是描述房子`$i$`的各种属性（比如房间的个数、周围学校和医院的个数、交通状况等），而 `$y_i$`是房屋的价格。
+例如，在我们将要建模的房价预测问题里，$x_{ij}$是描述房子$i$的各种属性（比如房间的个数、周围学校和医院的个数、交通状况等），而 $y_i$是房屋的价格。
 初看起来，这个假设实在过于简单了，变量间的真实关系很难是线性的。但由于线性回归模型有形式简单和易于建模分析的优点，它在实际问题中得到了大量的应用。很多经典的统计学习、机器学习书籍\[[2,3,4](#参考文献)\]也选择对线性模型独立成章重点讲解。
 ## 效果展示
 我们使用从[UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing)获得的波士顿房价数据集进行模型的训练和预测。下面的散点图展示了使用模型对部分房屋价格进行的预测。其中，每个点的横坐标表示同一类房屋真实价格的中位数，纵坐标表示线性回归模型根据特征预测的结果，当二者值完全相等的时候就会落在虚线上。所以模型预测得越准确，则点离虚线越近。
+<p align="center">
-![BostonHousePricePredictions](./image/predictions.png)
+    <img src = "image/predictions.png" width=400><br/>
-<p align="center">图1. 预测值 V.S. 真实值</p>
+    图1. 预测值 V.S. 真实值
+</p>
 ## 模型概览
 ### 模型定义
-在波士顿房价数据集中，和房屋相关的值共有14个：前13个用来描述房屋相关的各种信息，即模型中的 `$x_i$`；最后一个值为我们要预测的该类房屋价格的中位数，即模型中的 `$y_i$`。因此，我们的模型就可以表示成：
+在波士顿房价数据集中，和房屋相关的值共有14个：前13个用来描述房屋相关的各种信息，即模型中的 $x_i$；最后一个值为我们要预测的该类房屋价格的中位数，即模型中的 $y_i$。因此，我们的模型就可以表示成：
 $$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$
-`$\hat{Y}$` 表示模型的预测结果，用来和真实值`$Y$`区分。模型要学习的参数即：`$\omega_1, \ldots, \omega_{13}, b$`。
+$\hat{Y}$ 表示模型的预测结果，用来和真实值$Y$区分。模型要学习的参数即：$\omega_1, \ldots, \omega_{13}, b$。
-建立模型后，我们需要给模型一个优化目标，使得学到的参数能够让预测值`$\hat{Y}$`尽可能地接近真实值`$Y$`。这里我们引入损失函数（[Loss Function](https://en.wikipedia.org/wiki/Loss_function)，或Cost Function）这个概念。 输入任意一个数据样本的目标值`$y_{i}$`和模型给出的预测值`$\hat{y_{i}}$`，损失函数输出一个非负的实值。这个实值通常用来反映模型误差的大小。
+建立模型后，我们需要给模型一个优化目标，使得学到的参数能够让预测值$\hat{Y}$尽可能地接近真实值$Y$。这里我们引入损失函数（[Loss Function](https://en.wikipedia.org/wiki/Loss_function)，或Cost Function）这个概念。 输入任意一个数据样本的目标值$y_{i}$和模型给出的预测值$\hat{y_{i}}$，损失函数输出一个非负的实值。这个实值通常用来反映模型误差的大小。
 对于线性回归模型来讲，最常见的损失函数就是均方误差（Mean Squared Error， [MSE](https://en.wikipedia.org/wiki/Mean_squared_error)）了，它的形式是：
 $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
-即对于一个大小为`$n$`的测试集，`$MSE$`是`$n$`个数据预测结果误差平方的均值。
+即对于一个大小为$n$的测试集，$MSE$是$n$个数据预测结果误差平方的均值。
 ### 训练过程
 定义好模型结构之后，我们要通过以下几个步骤进行模型训练
-1. 初始化参数，其中包括权重`$\omega_i$`和偏置`$b$`，对其进行初始化（如0均值，1方差）。
-2. 网络正向传播计算网络输出和损失函数。
+ 1. 初始化参数，其中包括权重$\omega_i$和偏置$b$，对其进行初始化（如0均值，1方差）。
-3. 根据损失函数进行反向误差传播 （[backpropagation](https://en.wikipedia.org/wiki/Backpropagation)），将网络误差从输出层依次向前传递, 并更新网络中的参数。
-4. 重复2~3步骤，直至网络训练误差达到规定的程度或训练轮次达到设定值。
+ 2. 网络正向传播计算网络输出和损失函数。
+ 3. 根据损失函数进行反向误差传播 （[backpropagation](https://en.wikipedia.org/wiki/Backpropagation)），将网络误差从输出层依次向前传递, 并更新网络中的参数。
+ 4. 重复2~3步骤，直至网络训练误差达到规定的程度或训练轮次达到设定值。
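The four training steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the Fluid implementation: the function name `train_linear_regression`, the synthetic data, and the learning rate are all hypothetical, and full-batch gradient descent stands in for whatever optimizer the trainer actually runs.

```python
import numpy as np


def train_linear_regression(x, y, learning_rate=0.5, num_epochs=1000, seed=0):
    """Fit y ~ x @ w + b by full-batch gradient descent on the MSE loss."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    # Step 1: initialize the weights (zero mean, unit variance) and the bias.
    w = rng.normal(0.0, 1.0, size=d)
    b = 0.0
    for _ in range(num_epochs):
        # Step 2: forward pass, giving the prediction and its error.
        err = (x @ w + b) - y
        # Step 3: gradients of mean(err ** 2), then the parameter update.
        grad_w = 2.0 / n * (x.T @ err)
        grad_b = 2.0 * np.mean(err)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        # Step 4: the loop repeats steps 2-3 for a fixed number of epochs.
    final_loss = np.mean((x @ w + b - y) ** 2)
    return w, b, final_loss


# Hypothetical noise-free data with three features already scaled to [-0.5, 0.5].
data_rng = np.random.default_rng(42)
x = data_rng.uniform(-0.5, 0.5, size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = x @ true_w + 3.0

w, b, final_loss = train_linear_regression(x, y)
print(w, b, final_loss)
```

Because the synthetic targets are exactly linear in the features, the recovered `w` and `b` should approach `true_w` and `3.0`; on real data the loss would level off at a nonzero value.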
 ## 数据集
 ### 数据集介绍
 这份数据集共506行，每行包含了波士顿郊区的一类房屋的相关信息及该类房屋价格的中位数。其各维属性的意义如下：
-<p align="center">
+| 属性名 | 解释 | 类型 |
-<table>
+| ------| ------ | ------ |
-    <thead>
+| CRIM | 该镇的人均犯罪率 | 连续值 |
-    <tr>
+| ZN | 占地面积超过25,000平方呎的住宅用地比例 | 连续值 |
-        <th>属性名</th>
+| INDUS | 非零售商业用地比例 | 连续值 |
-        <th>解释</th>
+| CHAS | 是否邻近 Charles River  | 离散值，1=邻近；0=不邻近 |
-        <th>类型</th>
+| NOX | 一氧化氮浓度 | 连续值 |
-    </tr>
+| RM | 每栋房屋的平均客房数 | 连续值 |
-    </thead>
+| AGE | 1940年之前建成的自用单位比例 | 连续值 |
-    <tbody>
+| DIS | 到波士顿5个就业中心的加权距离 | 连续值 |
-    <tr>
+| RAD | 到径向公路的可达性指数 | 连续值 |
-        <td>CRIM</td>
+| TAX | 全值财产税率 | 连续值 |
-        <td>该镇的人均犯罪率</td>
+| PTRATIO | 学生与教师的比例 | 连续值 |
-        <td>连续值</td>
+| B | 1000(BK - 0.63)^2，其中BK为黑人占比 | 连续值 |
-    </tr>
+| LSTAT | 低收入人群占比 | 连续值 |
-    <tr>
+| MEDV | 同类房屋价格的中位数 | 连续值 |
-        <td>ZN</td>
-        <td>占地面积超过25,000平方呎的住宅用地比例</td>
-        <td>连续值</td>
-    </tr>
-    <tr>
-        <td>INDUS</td>
-        <td>非零售商业用地比例</td>
-        <td>连续值</td>
-    </tr>
-    <tr>
-        <td>CHAS</td>
-        <td>是否邻近 Charles River</td>
-        <td>离散值，1=邻近；0=不邻近</td>
-    </tr>
-    <tr>
-        <td>NOX</td>
-        <td>一氧化氮浓度</td>
-        <td>连续值</td>
-    </tr>
-    <tr>
-        <td>RM</td>
-        <td>每栋房屋的平均客房数</td>
-        <td>连续值</td>
-    </tr>
-    <tr>
-        <td>AGE</td>
-        <td>1940年之前建成的自用单位比例</td>
-        <td>连续值</td>
-    </tr>
-    <tr>
-        <td>DIS</td>
-        <td>到波士顿5个就业中心的加权距离</td>
-        <td>连续值</td>
-    </tr>
-    <tr>
-        <td>RAD</td>
-        <td>到径向公路的可达性指数</td>
-        <td>连续值</td>
-    </tr>
-    <tr>
-        <td>TAX</td>
-        <td>全值财产税率</td>
-        <td>连续值</td>
-    </tr>
-    <tr>
-        <td>PTRATIO</td>
-        <td>学生与教师的比例</td>
-        <td>连续值</td>
-    </tr>
-    <tr>
-        <td>B</td>
-        <td>1000(BK - 0.63)^2，其中BK为黑人占比</td>
-        <td>连续值</td>
-    </tr>
-    <tr>
-        <td>LSTAT</td>
-        <td>低收入人群占比</td>
-        <td>连续值</td>
-    </tr>
-    <tr>
-        <td>MEDV</td>
-        <td>同类房屋价格的中位数</td>
-        <td>连续值</td>
-    </tr>
-    </tbody>
-</table>
-</p>
 ### 数据预处理
 #### 连续值与离散值
-观察一下数据，我们的第一个发现是：所有的13维属性中，有12维的连续值和1维的离散值（CHAS）。离散值虽然也常使用类似0、1、2这样的数字表示，但是其含义与连续值是不同的，因为这里的差值没有实际意义。例如，我们用0、1、2来分别表示红色、绿色和蓝色的话，我们并不能因此说“蓝色和红色”比“绿色和红色”的距离更远。所以通常对一个有`$d$`个可能取值的离散属性，我们会将它们转为`$d$`个取值为0或1的二值属性或者将每个可能取值映射为一个多维向量。不过就这里而言，因为CHAS本身就是一个二值属性，就省去了这个麻烦。
+观察一下数据，我们的第一个发现是：所有的13维属性中，有12维的连续值和1维的离散值（CHAS）。离散值虽然也常使用类似0、1、2这样的数字表示，但是其含义与连续值是不同的，因为这里的差值没有实际意义。例如，我们用0、1、2来分别表示红色、绿色和蓝色的话，我们并不能因此说“蓝色和红色”比“绿色和红色”的距离更远。所以通常对一个有$d$个可能取值的离散属性，我们会将它们转为$d$个取值为0或1的二值属性或者将每个可能取值映射为一个多维向量。不过就这里而言，因为CHAS本身就是一个二值属性，就省去了这个麻烦。
 #### 属性的归一化
 另外一个稍加观察即可发现的事实是，各维属性的取值范围差别很大（如图2所示）。例如，属性B的取值范围是[0.32, 396.90]，而属性NOX的取值范围是[0.3850, 0.8170]。这里就要用到一个常见的操作-归一化（normalization）了。归一化的目标是把各维属性的取值范围放缩到差不多的区间，例如[-0.5,0.5]。这里我们使用一种很常见的操作方法：减掉均值，然后除以原取值范围。
@@ -148,11 +83,13 @@ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
 - 不同的数值范围会导致不同属性对模型的重要性不同（至少在训练的初始阶段如此），而这个隐含的假设常常是不合理的。这会对优化的过程造成困难，使训练时间大大的加长。
 - 很多的机器学习技巧/模型（例如L1，L2正则项，向量空间模型-Vector Space Model）都基于这样的假设：所有的属性取值都差不多是以0为均值且取值范围相近的。
-![featureScale](./image/ranges.png)
+<p align="center">
-<p align="center">图2. 各维属性的取值范围</p>
+    <img src = "image/ranges.png" width=550><br/>
+    图2. 各维属性的取值范围
+</p>
 #### 整理训练集与测试集
-我们将数据集分割为两份：一份用于调整模型的参数，即进行模型的训练，模型在这份数据集上的误差被称为**训练误差**；另外一份被用来测试，模型在这份数据集上的误差被称为**测试误差**。我们训练模型的目的是为了通过从训练数据中找到规律来预测未知的新数据，所以测试误差是更能反映模型表现的指标。分割数据的比例要考虑到两个因素：更多的训练数据会降低参数估计的方差，从而得到更可信的模型；而更多的测试数据会降低测试误差的方差，从而得到更可信的测试误差。我们这个例子中设置的分割比例为`$8:2$`
+我们将数据集分割为两份：一份用于调整模型的参数，即进行模型的训练，模型在这份数据集上的误差被称为**训练误差**；另外一份被用来测试，模型在这份数据集上的误差被称为**测试误差**。我们训练模型的目的是为了通过从训练数据中找到规律来预测未知的新数据，所以测试误差是更能反映模型表现的指标。分割数据的比例要考虑到两个因素：更多的训练数据会降低参数估计的方差，从而得到更可信的模型；而更多的测试数据会降低测试误差的方差，从而得到更可信的测试误差。我们这个例子中设置的分割比例为$8:2$
 在更复杂的模型训练过程中，我们往往还会多使用一种数据集：验证集。因为复杂的模型中常常还有一些超参数（[Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization)）需要调节，所以我们会尝试多种超参数的组合来分别训练多个模型，然后对比它们在验证集上的表现选择相对最好的一组超参数，最后才使用这组参数下训练的模型在测试集上评估测试误差。由于本章训练的模型比较简单，我们暂且忽略掉这个过程。
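The normalization rule (subtract the mean, divide by the original value range) and the 8:2 split described above can be sketched as follows. This is an illustration only: the helper names `normalize_features` and `split_train_test` are hypothetical, and the data is random, drawn to mimic the ranges quoted for attributes B and NOX rather than loaded from the real dataset.

```python
import numpy as np


def normalize_features(features):
    """Subtract each column's mean and divide by that column's range (max - min)."""
    mean = features.mean(axis=0)
    # Assumes no column is constant; a zero range would divide by zero.
    value_range = features.max(axis=0) - features.min(axis=0)
    return (features - mean) / value_range


def split_train_test(data, train_ratio=0.8):
    """Use the first train_ratio of the rows for training and the rest for testing."""
    offset = int(data.shape[0] * train_ratio)
    return data[:offset], data[offset:]


# Hypothetical stand-in for two of the 13 attributes, 506 rows as in the dataset.
rng = np.random.default_rng(0)
raw = np.column_stack([
    rng.uniform(0.32, 396.90, size=506),    # a column with a range like B
    rng.uniform(0.3850, 0.8170, size=506),  # a column with a range like NOX
])

normalized = normalize_features(raw)
train, test = split_train_test(normalized)
print(normalized.min(axis=0), normalized.max(axis=0))
print(train.shape, test.shape)
```

After normalization every column has zero mean and a range of exactly 1, so two attributes whose raw scales differ by orders of magnitude end up comparable.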
@@ -167,6 +104,7 @@ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
+from __future__ import print_function
 import paddle
 import paddle.fluid as fluid
 import numpy
 ```
 我们通过uci_housing模块引入了数据集合[UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing)
@@ -182,14 +120,14 @@ import numpy
 BATCH_SIZE = 20
 train_reader = paddle.batch(
-paddle.reader.shuffle(
+    paddle.reader.shuffle(
-paddle.dataset.uci_housing.train(), buf_size=500),
+        paddle.dataset.uci_housing.train(), buf_size=500),
-batch_size=BATCH_SIZE)
+    batch_size=BATCH_SIZE)
 test_reader = paddle.batch(
-paddle.reader.shuffle(
+    paddle.reader.shuffle(
-paddle.dataset.uci_housing.test(), buf_size=500),
+        paddle.dataset.uci_housing.test(), buf_size=500),
-batch_size=BATCH_SIZE)
+    batch_size=BATCH_SIZE)
 ```
 ### 配置训练程序
@@ -197,16 +135,25 @@ batch_size=BATCH_SIZE)
 ```python
 def train_program():
-y = fluid.layers.data(name='y', shape=[1], dtype='float32')
+    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
+    # feature vector of length 13
+    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
+    y_predict = fluid.layers.fc(input=x, size=1, act=None)
+    loss = fluid.layers.square_error_cost(input=y_predict, label=y)
+    avg_loss = fluid.layers.mean(loss)
-# feature vector of length 13
+    return avg_loss
-x = fluid.layers.data(name='x', shape=[13], dtype='float32')
+```
-y_predict = fluid.layers.fc(input=x, size=1, act=None)
+### Optimizer Function 配置
-loss = fluid.layers.square_error_cost(input=y_predict, label=y)
+在下面的 `SGD` 优化器中，`learning_rate` 是学习率，它决定每次参数更新的步长，与网络的训练收敛速度有关系。
-avg_loss = fluid.layers.mean(loss)
-return avg_loss
+```python
+def optimizer_program():
+    return fluid.optimizer.SGD(learning_rate=0.001)
 ```
 ### 定义运算场所
@@ -222,9 +169,9 @@ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
 ```python
 trainer = fluid.Trainer(
-train_func=train_program,
+    train_func=train_program,
-place=place,
+    place=place,
-optimizer_func=fluid.optimizer.SGD(learning_rate=0.001))
+    optimizer_func=optimizer_program)
 ```
 ### 开始提供数据
@@ -237,7 +184,7 @@ feed_order=['x', 'y']
 除此之外，可以定义一个事件响应器来处理类似`打印训练进程`的事件：
 ```python
-# Specify the directory path to save the parameters
+# Specify the directory to save the parameters
 params_dirname = "fit_a_line.inference.model"
 # Plot data
@@ -248,27 +195,27 @@ plot_cost = Ploter(train_title, test_title)
 step = 0
-# event_handler to print training and testing info
+# event_handler prints training and testing info
 def event_handler_plot(event):
-global step
+    global step
-if isinstance(event, fluid.EndStepEvent):
+    if isinstance(event, fluid.EndStepEvent):
-if event.step % 10 == 0: # every 10 batches, record a test cost
+        if event.step % 10 == 0: # record the test cost every 10 steps
-test_metrics = trainer.test(
+            test_metrics = trainer.test(
-reader=test_reader, feed_order=feed_order)
+                reader=test_reader, feed_order=feed_order)
-plot_cost.append(test_title, step, test_metrics[0])
+            plot_cost.append(test_title, step, test_metrics[0])
-plot_cost.plot()
+            plot_cost.plot()
-if test_metrics[0] < 10.0:
+            if test_metrics[0] < 10.0:
-# If the accuracy is good enough, we can stop the training.
+                # If the test loss is low enough, we can stop the training.
-print('loss is less than 10.0, stop')
+                print('loss is less than 10.0, stop')
-trainer.stop()
+                trainer.stop()
-# We can save the trained parameters for the inferences later
+        # We can save the trained parameters for the inferences later
-if params_dirname is not None:
+        if params_dirname is not None:
-trainer.save_params(params_dirname)
+            trainer.save_params(params_dirname)
-step += 1
+        step += 1
 ```
 ### 开始训练
@@ -279,13 +226,17 @@ step += 1
 # The training could take up to a few minutes.
 trainer.train(
-reader=train_reader,
+    reader=train_reader,
-num_epochs=100,
+    num_epochs=100,
-event_handler=event_handler_plot,
+    event_handler=event_handler_plot,
-feed_order=feed_order)
+    feed_order=feed_order)
 ```
-![trainTestCost](./image/train_and_test.png)
+<p align="center">
+    <img src = "image/train_and_test1.png" width=400><br/>
+    图3. 训练结果
+</p>
 ## 预测
 提供一个`inference_program`和一个`params_dirname`来初始化预测器。`params_dirname`用来存储我们的参数。
@@ -296,9 +247,9 @@ feed_order=feed_order)
 ```python
 def inference_program():
-x = fluid.layers.data(name='x', shape=[13], dtype='float32')
+    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
-y_predict = fluid.layers.fc(input=x, size=1, act=None)
+    y_predict = fluid.layers.fc(input=x, size=1, act=None)
-return y_predict
+    return y_predict
 ```
 ### 预测
@@ -306,13 +257,23 @@ return y_predict
 ```python
 inferencer = fluid.Inferencer(
-infer_func=inference_program, param_path=params_dirname, place=place)
+    infer_func=inference_program, param_path=params_dirname, place=place)
 batch_size = 10
-tensor_x = numpy.random.uniform(0, 10, [batch_size, 13]).astype("float32")
+test_reader = paddle.batch(paddle.dataset.uci_housing.test(),batch_size=batch_size)
+test_data = next(test_reader())
+test_feat = numpy.array([data[0] for data in test_data]).astype("float32")
+test_label = numpy.array([data[1] for data in test_data]).astype("float32")
+results = inferencer.infer({'x': test_feat})
+print("infer results: (House Price)")
+for k in range(batch_size):
+    print("%d. %f" % (k, results[0][k]))
-results = inferencer.infer({'x': tensor_x})
+print("\nground truth:")
-print("infer results: ", results[0])
+for k in range(batch_size):
+    print("%d. %f" % (k, test_label[k]))
 ```
 ## 总结

--- a/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/image/predictions_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/image/predictions_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/image/ranges.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/image/ranges.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/image/ranges_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/image/ranges_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/image/train_and_test.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/fit_a_line/image/train_and_test.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/README.cn.md
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/README.cn.md
 # 识别数字
-本教程源代码目录在[book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits)， 初次使用请参考PaddlePaddle[安装教程](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)。
+本教程源代码目录在[book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits)， 初次使用请参考PaddlePaddle[安装教程](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)，更多内容请参考本教程的[视频课堂](http://bit.baidu.com/course/detail/id/167.html)。
 ## 背景介绍
 当我们学习编程的时候，编写的第一个程序一般是实现打印"Hello World"。而机器学习（或深度学习）的入门教程，一般都是 [MNIST](http://yann.lecun.com/exdb/mnist/) 数据库上的手写识别问题。原因是手写识别属于典型的图像分类问题，比较简单，同时MNIST数据集也很完备。MNIST数据集作为一个简单的计算机视觉数据集，包含一系列如图1所示的手写数字图片和对应的标签。图片是28x28的像素矩阵，标签则对应着0~9的10个数字。每张图片都经过了大小归一化和居中处理。
-![MNIST](./image/mnist_example_image.png)
+<p align="center">
-<p align="center">图1. MNIST图片示例</p>
+    <img src="image/mnist_example_image.png" width="400"><br/>
+    图1. MNIST图片示例
+</p>
 MNIST数据集是从 [NIST](https://www.nist.gov/srd/nist-special-database-19) 的Special Database 3（SD-3）和Special Database 1（SD-1）构建而来。由于SD-3是由美国人口调查局的员工进行标注，SD-1是由美国高中生进行标注，因此SD-3比SD-1更干净也更容易识别。Yann LeCun等人从SD-1和SD-3中各取一半作为MNIST的训练集（60000条数据）和测试集（10000条数据），其中训练集来自250位不同的标注员，此外还保证了训练集和测试集的标注员是不完全相同的。
@@ -20,86 +22,99 @@ Yann LeCun早先在手写字符识别上做了很多研究，并在研究过程
 ## 模型概览
 基于MNIST数据训练一个分类器，在介绍本教程使用的三个基本图像分类网络前，我们先给出一些定义：
- `$X$`是输入：MNIST图片是`$28\times28$` 的二维图像，为了进行计算，我们将其转化为`$784$`维向量，即`$X=\left ( x_0, x_1, \dots, x_{783} \right )$`。
+- $X$是输入：MNIST图片是$28\times28$ 的二维图像，为了进行计算，我们将其转化为$784$维向量，即$X=\left ( x_0, x_1, \dots, x_{783} \right )$。
- `$Y$`是输出：分类器的输出是10类数字（0-9），即`$Y=\left ( y_0, y_1, \dots, y_9 \right )$`，每一维`$y_i$`代表图片分类为第`$i$`类数字的概率。
+- $Y$是输出：分类器的输出是10类数字（0-9），即$Y=\left ( y_0, y_1, \dots, y_9 \right )$，每一维$y_i$代表图片分类为第$i$类数字的概率。
- `$L$`是图片的真实标签：`$L=\left ( l_0, l_1, \dots, l_9 \right )$`也是10维，但只有一维为1，其他都为0。
+- $L$是图片的真实标签：$L=\left ( l_0, l_1, \dots, l_9 \right )$也是10维，但只有一维为1，其他都为0。
 ### Softmax回归(Softmax Regression)
 最简单的Softmax回归模型是先将输入层经过一个全连接层得到特征，然后直接通过softmax 函数进行多分类\[[9](#参考文献)\]。
-输入层的数据`$X$`传到输出层，在激活操作之前，会乘以相应的权重 `$W$` ，并加上偏置变量 `$b$` ，具体如下：
+输入层的数据$X$传到输出层，在激活操作之前，会乘以相应的权重 $W$ ，并加上偏置变量 $b$ ，具体如下：
 $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$
-其中 `$ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $`
+其中 $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
-对于有 `$N$` 个类别的多分类问题，指定 `$N$` 个输出节点，`$N$` 维结果向量经过softmax将归一化为 `$N$` 个[0,1]范围内的实数值，分别表示该样本属于这 `$N$` 个类别的概率。此处的 `$y_i$` 即对应该图片为数字 `$i$` 的预测概率。
+对于有 $N$ 个类别的多分类问题，指定 $N$ 个输出节点，$N$ 维结果向量经过softmax将归一化为 $N$ 个[0,1]范围内的实数值，分别表示该样本属于这 $N$ 个类别的概率。此处的 $y_i$ 即对应该图片为数字 $i$ 的预测概率。
-在分类问题中，我们一般采用交叉熵代价损失函数（cross entropy），公式如下：
+在分类问题中，我们一般采用交叉熵代价损失函数（cross entropy loss），公式如下：
-$$  \text{crossentropy}(label, y) = -\sum_i label_ilog(y_i) $$
+$$  L_{\text{cross-entropy}}(label, y) = -\sum_i label_i \log(y_i) $$
 图2为softmax回归的网络图，图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。
-![softmaxRegression](./image/softmax_regression.png)
+<p align="center">
-<p align="center">图2. softmax回归网络结构图</p>
+<img src="image/softmax_regression.png" width=400><br/>
+图2. softmax回归网络结构图<br/>
+</p>
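The softmax and cross-entropy formulas above can be checked numerically with a short NumPy sketch. The function names and the three-class logits are illustrative assumptions, and subtracting the maximum before exponentiating is a standard numerical-stability step that the tutorial text does not mention.

```python
import numpy as np


def softmax(z):
    """softmax(z_i) = exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability."""
    shifted = z - np.max(z)
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy(label, y):
    """-sum_i label_i * log(y_i) for a one-hot label and a probability vector y."""
    # Assumes every entry of y is strictly positive, as softmax outputs are.
    return -np.sum(label * np.log(y))


# Hypothetical pre-activation outputs (W x + b) for a 3-class problem.
logits = np.array([2.0, 1.0, 0.1])
y = softmax(logits)

# One-hot ground truth: the sample belongs to class 0.
label = np.array([1.0, 0.0, 0.0])
loss = cross_entropy(label, y)
print(y, loss)
```

With a one-hot label the sum collapses to a single term, so the loss equals the negative log of the probability assigned to the true class.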
 ### 多层感知器(Multilayer Perceptron, MLP)
 Softmax回归模型采用了最简单的两层神经网络，即只有输入层和输出层，因此其拟合能力有限。为了达到更好的识别效果，我们考虑在输入层和输出层中间加上若干个隐藏层\[[10](#参考文献)\]。
-1.  经过第一个隐藏层，可以得到 `$ H_1 = \phi(W_1X + b_1) $`，其中`$\phi$`代表激活函数，常见的有sigmoid、tanh或ReLU等函数。
+1.  经过第一个隐藏层，可以得到 $ H_1 = \phi(W_1X + b_1) $，其中$\phi$代表激活函数，常见的有sigmoid、tanh或ReLU等函数。
-2.  经过第二个隐藏层，可以得到 `$ H_2 = \phi(W_2H_1 + b_2) $`。
-3.  最后，再经过输出层，得到的`$Y=\text{softmax}(W_3H_2 + b_3)$`，即为最后的分类结果向量。
+2.  经过第二个隐藏层，可以得到 $ H_2 = \phi(W_2H_1 + b_2) $。
+3.  最后，再经过输出层，得到的$Y=\text{softmax}(W_3H_2 + b_3)$，即为最后的分类结果向量。
 图3为多层感知器的网络结构图，图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。
-![multilayerPerceptron](./image/mlp.png)
+<p align="center">
-<p align="center">图3. 多层感知器网络结构图</p>
+<img src="image/mlp.png" width=500><br/>
+图3. 多层感知器网络结构图<br/>
+</p>
 ### 卷积神经网络(Convolutional Neural Network, CNN)
 在多层感知器模型中，将图像展开成一维向量输入到网络中，忽略了图像的位置和结构信息，而卷积神经网络能够更好的利用图像的结构信息。[LeNet-5](http://yann.lecun.com/exdb/lenet/)是一个较简单的卷积神经网络。图4显示了其结构：输入的二维图像，先经过两次卷积层到池化层，再经过全连接层，最后使用softmax分类作为输出层。下面我们主要介绍卷积层和池化层。
-![cnnStructure](./image/cnn.png)
+<p align="center">
-<p align="center">图4. LeNet-5卷积神经网络结构</p>
+<img src="image/cnn.png"><br/>
+图4. LeNet-5卷积神经网络结构<br/>
+</p>
 #### 卷积层
 卷积层是卷积神经网络的核心基石。在图像识别里我们提到的卷积是二维卷积，即离散二维滤波器（也称作卷积核）与二维图像做卷积操作，简单的讲是二维滤波器滑动到二维图像上所有位置，并在每个位置上与该像素点及其邻域像素点做内积。卷积操作被广泛应用于图像处理领域，不同卷积核可以提取不同的特征，例如边沿、线性、角等特征。在深层卷积神经网络中，通过卷积操作可以提取出图像低级到复杂的特征。
-![cnn](https://raw.githubusercontent.com/PaddlePaddle/book/develop/02.recognize_digits/image/conv_layer.png)
+<p align="center">
-<p align="center">图5. 卷积层图片</p>
+<img src="image/conv_layer.png" width='750'><br/>
+图5. 卷积层图片<br/>
+</p>
-图5给出一个卷积计算过程的示例图，输入图像大小为`$H=5,W=5,D=3$`，即`$5 \times 5$`大小的3通道（RGB，也称作深度）彩色图像。这个示例图中包含两（用`$K$`表示）组卷积核，即图中滤波器`$W_0$`和`$W_1$`。在卷积计算中，通常对不同的输入通道采用不同的卷积核，如图示例中每组卷积核包含（`$D=3$`）个`$3 \times 3$`（用`$F \times F$`表示）大小的卷积核。另外，这个示例中卷积核在图像的水平方向（`$W$`方向）和垂直方向（`$H$`方向）的滑动步长为2（用`$S$`表示）；对输入图像周围各填充1（用`$P$`表示）个0，即图中输入层原始数据为蓝色部分，灰色部分是进行了大小为1的扩展，用0来进行扩展。经过卷积操作得到输出为`$3 \times 3 \times 2$`（用`$H_{o} \times W_{o} \times K$`表示）大小的特征图，即`$3 \times 3$`大小的2通道特征图，其中`$H_o$`计算公式为：`$H_o = (H - F + 2 \times P)/S + 1$`，`$W_o$`同理。 而输出特征图中的每个像素，是每组滤波器与输入图像每个特征图的内积再求和，再加上偏置`$b_o$`，偏置通常对于每个输出特征图是共享的。输出特征图`$o[:,:,0]$`中的最后一个`$-2$`计算如图5右下角公式所示。
+图5给出一个卷积计算过程的示例图，输入图像大小为$H=5,W=5,D=3$，即$5 \times 5$大小的3通道（RGB，也称作深度）彩色图像。这个示例图中包含两（用$K$表示）组卷积核，即图中滤波器$W_0$和$W_1$。在卷积计算中，通常对不同的输入通道采用不同的卷积核，如图示例中每组卷积核包含（$D=3$）个$3 \times 3$（用$F \times F$表示）大小的卷积核。另外，这个示例中卷积核在图像的水平方向（$W$方向）和垂直方向（$H$方向）的滑动步长为2（用$S$表示）；对输入图像周围各填充1（用$P$表示）个0，即图中输入层原始数据为蓝色部分，灰色部分是进行了大小为1的扩展，用0来进行扩展。经过卷积操作得到输出为$3 \times 3 \times 2$（用$H_{o} \times W_{o} \times K$表示）大小的特征图，即$3 \times 3$大小的2通道特征图，其中$H_o$计算公式为：$H_o = (H - F + 2 \times P)/S + 1$，$W_o$同理。 而输出特征图中的每个像素，是每组滤波器与输入图像每个特征图的内积再求和，再加上偏置$b_o$，偏置通常对于每个输出特征图是共享的。输出特征图$o[:,:,0]$中的最后一个$-2$计算如图5右下角公式所示。
-在卷积操作中卷积核是可学习的参数，经过上面示例介绍，每层卷积的参数大小为`$D \times F \times F \times K$`。在多层感知器模型中，神经元通常是全部连接，参数较多。而卷积层的参数较少，这也是由卷积层的主要特性即局部连接和共享权重所决定。
+在卷积操作中卷积核是可学习的参数，经过上面示例介绍，每层卷积的参数大小为$D \times F \times F \times K$。在多层感知器模型中，神经元通常是全部连接，参数较多。而卷积层的参数较少，这也是由卷积层的主要特性即局部连接和共享权重所决定。
 - 局部连接：每个神经元仅与输入神经元的一块区域连接，这块局部区域称作感受野（receptive field）。在图像卷积操作中，即神经元在空间维度（spatial dimension，即上图示例H和W所在的平面）是局部连接，但在深度上是全部连接。对于二维图像本身而言，也是局部像素关联较强。这种局部连接保证了学习后的过滤器能够对于局部的输入特征有最强的响应。局部连接的思想，也是受启发于生物学里面的视觉系统结构，视觉皮层的神经元就是局部接受信息的。
- 权重共享：计算同一个深度切片的神经元时采用的滤波器是共享的。例如图4中计算`$o[:,:,0]$`的每个每个神经元的滤波器均相同，都为`$W_0$`，这样可以很大程度上减少参数。共享权重在一定程度上讲是有意义的，例如图片的底层边缘特征与特征在图中的具体位置无关。但是在一些场景中是无意的，比如输入的图片是人脸，眼睛和头发位于不同的位置，希望在不同的位置学到不同的特征 (参考[斯坦福大学公开课]( http://cs231n.github.io/convolutional-networks/))。请注意权重只是对于同一深度切片的神经元是共享的，在卷积层，通常采用多组卷积核提取不同特征，即对应不同深度切片的特征，不同深度切片的神经元权重是不共享。另外，偏重对同一深度切片的所有神经元都是共享的。
+- 权重共享：计算同一个深度切片的神经元时采用的滤波器是共享的。例如图5中计算$o[:,:,0]$的每个神经元的滤波器均相同，都为$W_0$，这样可以很大程度上减少参数。共享权重在一定程度上讲是有意义的，例如图片的底层边缘特征与特征在图中的具体位置无关。但是在一些场景中是无意义的，比如输入的图片是人脸，眼睛和头发位于不同的位置，希望在不同的位置学到不同的特征 (参考[斯坦福大学公开课]( http://cs231n.github.io/convolutional-networks/))。请注意权重只是对于同一深度切片的神经元是共享的，在卷积层，通常采用多组卷积核提取不同特征，即对应不同深度切片的特征，不同深度切片的神经元权重是不共享的。另外，偏置对同一深度切片的所有神经元都是共享的。
 通过介绍卷积计算过程及其特性，可以看出卷积是线性操作，并具有平移不变性（shift-invariant），平移不变性即在图像每个位置执行相同的操作。卷积层的局部连接和权重共享使得需要学习的参数大大减小，这样也有利于训练较大卷积神经网络。
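The output-size formula $H_o = (H - F + 2 \times P)/S + 1$ and the weight count $D \times F \times F \times K$ from the convolution example can be checked with a few lines of Python. The helper names are hypothetical; the numbers are the ones quoted in the text for Figure 5.

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """Output length along one spatial axis: (H - F + 2P) / S + 1."""
    numerator = input_size - filter_size + 2 * padding
    if numerator % stride != 0:
        raise ValueError("the filter does not tile the padded input evenly")
    return numerator // stride + 1


def conv_weight_count(depth, filter_size, num_filters):
    """Number of learnable weights, excluding biases: D * F * F * K."""
    return depth * filter_size * filter_size * num_filters


# Values from the example: H = W = 5, D = 3, F = 3, K = 2, S = 2, P = 1.
h_out = conv_output_size(input_size=5, filter_size=3, padding=1, stride=2)
w_out = conv_output_size(input_size=5, filter_size=3, padding=1, stride=2)
weights = conv_weight_count(depth=3, filter_size=3, num_filters=2)

print(h_out, w_out, weights)  # 3 3 54
```

The result matches the $3 \times 3 \times 2$ output feature map described in the text, with 54 shared weights.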
 #### 池化层
-![pooling](./image/max_pooling.png)
+<p align="center">
-<p align="center">图6. 池化层图片</p>
+<img src="image/max_pooling.png" width="400px"><br/>
+图6. 池化层图片<br/>
+</p>
 池化是非线性下采样的一种形式，主要作用是通过减少网络的参数来减小计算量，并且能够在一定程度上控制过拟合。通常在卷积层的后面会加上一个池化层。池化包括最大池化、平均池化等。其中最大池化是用不重叠的矩形框将输入层分成不同的区域，对于每个矩形框的数取最大值作为输出层，如图6所示。
 更详细的关于卷积神经网络的具体知识可以参考[斯坦福大学公开课]( http://cs231n.github.io/convolutional-networks/ )和[图像分类](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md)教程。
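Max pooling with non-overlapping windows, as described above, can be sketched in NumPy by reshaping the feature map into blocks. The function name `max_pool_2d` and the 4x4 input are illustrative assumptions, not values taken from Figure 6.

```python
import numpy as np


def max_pool_2d(feature_map, window):
    """Max pooling over non-overlapping window x window blocks of a 2-D array."""
    h, w = feature_map.shape
    if h % window or w % window:
        raise ValueError("feature map size must be a multiple of the window")
    # Axes after reshape: (block row, row in block, block column, column in block).
    blocks = feature_map.reshape(h // window, window, w // window, window)
    return blocks.max(axis=(1, 3))


# Hypothetical 4x4 single-channel feature map.
feature_map = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])

pooled = max_pool_2d(feature_map, window=2)
print(pooled)
# [[6 8]
#  [3 4]]
```

Each output value is the maximum of one 2x2 block, so the spatial size halves in both directions while the strongest response in each region is kept.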
 ### 常见激活函数介绍
- sigmoid激活函数： `$ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $`
- tanh激活函数： `$ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $`
+- sigmoid激活函数： $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $
-实际上，tanh函数只是规模变化的sigmoid函数，将sigmoid函数值放大2倍之后再向下平移1个单位：tanh(x) = 2sigmoid(2x) - 1 。
+- tanh激活函数： $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $
- ReLU激活函数： `$ f(x) = max(0, x) $`
+  实际上，tanh函数只是规模变化的sigmoid函数，将sigmoid函数值放大2倍之后再向下平移1个单位：tanh(x) = 2sigmoid(2x) - 1 。
+- ReLU激活函数： $ f(x) = max(0, x) $
 更详细的介绍请参考[维基百科激活函数](https://en.wikipedia.org/wiki/Activation_function)。
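The three activation functions listed above, and the stated identity tanh(x) = 2sigmoid(2x) - 1, can be verified numerically. This is a small NumPy sketch with hypothetical function names, not the Fluid implementation of these activations.

```python
import numpy as np


def sigmoid(x):
    """f(x) = 1 / (1 + exp(-x))"""
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    """f(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)


x = np.linspace(-3.0, 3.0, 7)  # -3, -2, ..., 3

# tanh is a rescaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
identity_holds = np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

print(sigmoid(x))
print(np.tanh(x))
print(relu(x))
print(identity_holds)
```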
@@ -107,35 +122,13 @@ Softmax回归模型采用了最简单的两层神经网络，即只有输入层
 PaddlePaddle provides the `paddle.dataset.mnist` module in its API, which automatically loads the [MNIST](http://yann.lecun.com/exdb/mnist/) data. The loaded data is stored under `/home/username/.cache/paddle/dataset/mnist`:
-<p align="center">
-<table>
+|    File name         |       Description       |
-    <thead>
+|----------------------|-------------------------|
-    <tr>
+|train-images-idx3-ubyte|  Training images, 60,000 samples |
-        <th>File name</th>
+|train-labels-idx1-ubyte|  Training labels, 60,000 samples |
-        <th>Description</th>
+|t10k-images-idx3-ubyte |  Test images, 10,000 samples |
-    </tr>
+|t10k-labels-idx1-ubyte |  Test labels, 10,000 samples |
-    </thead>
-    <tbody>
-    <tr>
-        <td>train-images-idx3-ubyte</td>
-        <td>Training images, 60,000 samples</td>
-    </tr>
-    <tr>
-        <td>train-labels-idx1-ubyte</td>
-        <td>Training labels, 60,000 samples</td>
-    </tr>
-    <tr>
-        <td>t10k-images-idx3-ubyte</td>
-        <td>Test images, 10,000 samples</td>
-    </tr>
-    <tr>
-        <td>t10k-labels-idx1-ubyte</td>
-        <td>Test labels, 10,000 samples</td>
-    </tr>
-    </tbody>
-</table>
-</p>
 ## Fluid API Overview
@@ -143,18 +136,20 @@ PaddlePaddle在API中提供了自动加载[MNIST](http://yann.lecun.com/exdb/mni
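 We recommend using the Fluid API because it is easier to learn.
 Below is a quick overview of the Fluid API.
 1. `inference_program`: a function that specifies how to obtain predictions from the input data.
 This is where the network flow is defined.
-1. `train_program`: a function that specifies how to obtain the `loss` from `inference_program` and the `label`.
+2. `train_program`: a function that specifies how to obtain the `loss` from `inference_program` and the `label`.
 This is where the loss computation is defined.
-1. `optimizer_func`: a function that specifies the optimizer configuration. The optimizer is responsible for reducing the loss and driving the training. Paddle supports many different optimizers.
+3. `optimizer_func`: a function that specifies the optimizer configuration. The optimizer is responsible for reducing the loss and driving the training. Paddle supports many different optimizers.
-1. `Trainer`: the PaddlePaddle Trainer manages the training process specified by `train_program` and `optimizer`.
+4. `Trainer`: the PaddlePaddle Trainer manages the training process specified by `train_program` and `optimizer`.
 Through the `event_handler` callback, users can monitor the progress of training.
-1. `Inferencer`: the Fluid inferencer loads the `inference_program` and the parameters trained by the Trainer.
+5. `Inferencer`: the Fluid inferencer loads the `inference_program` and the parameters trained by the Trainer.
 It can then run inference on data and return predictions.
 In this demo, we will take a closer look at each of them.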
@@ -165,6 +160,7 @@ PaddlePaddle在API中提供了自动加载[MNIST](http://yann.lecun.com/exdb/mni
 ```python
+from __future__ import print_function
 import paddle
 import paddle.fluid as fluid
 ```
 ### Program Functions Configuration
@@ -177,51 +173,51 @@ import paddle.fluid as fluid
 ```python
 def softmax_regression():
-img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
+    img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
-predict = fluid.layers.fc(
+    predict = fluid.layers.fc(
-input=img, size=10, act='softmax')
+        input=img, size=10, act='softmax')
-return predict
+    return predict
 ```
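 - Multilayer perceptron: the code below implements a multilayer perceptron with two hidden layers (i.e. fully connected layers). Both hidden layers use ReLU as the activation function, and the output layer uses Softmax.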
 ```python
 def multilayer_perceptron():
-img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
+    img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
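-# First fully connected layer, with ReLU activation
+    # First fully connected layer, with ReLU activation
-hidden = fluid.layers.fc(input=img, size=200, act='relu')
+    hidden = fluid.layers.fc(input=img, size=200, act='relu')
-# Second fully connected layer, with ReLU activation
+    # Second fully connected layer, with ReLU activation
-hidden = fluid.layers.fc(input=hidden, size=200, act='relu')
+    hidden = fluid.layers.fc(input=hidden, size=200, act='relu')
-# Fully connected output layer with softmax activation; its size must be 10, the number of digit classes
+    # Fully connected output layer with softmax activation; its size must be 10, the number of digit classes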
-prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
+    prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
-return prediction
+    return prediction
 ```
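 - Convolutional neural network LeNet-5: the input 2D image first passes through two convolution-pooling stages, then a fully connected layer, and finally a fully connected layer with softmax activation that serves as the output layer.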
 ```python
 def convolutional_neural_network():
-img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
+    img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
-# First convolution-pooling layer
+    # First convolution-pooling layer
-conv_pool_1 = fluid.nets.simple_img_conv_pool(
+    conv_pool_1 = fluid.nets.simple_img_conv_pool(
-input=img,
+        input=img,
-filter_size=5,
+        filter_size=5,
-num_filters=20,
+        num_filters=20,
-pool_size=2,
+        pool_size=2,
-pool_stride=2,
+        pool_stride=2,
-act="relu")
+        act="relu")
-conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
+    conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
-# Second convolution-pooling layer
+    # Second convolution-pooling layer
-conv_pool_2 = fluid.nets.simple_img_conv_pool(
+    conv_pool_2 = fluid.nets.simple_img_conv_pool(
-input=conv_pool_1,
+        input=conv_pool_1,
-filter_size=5,
+        filter_size=5,
-num_filters=50,
+        num_filters=50,
-pool_size=2,
+        pool_size=2,
-pool_stride=2,
+        pool_stride=2,
-act="relu")
+        act="relu")
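-# Fully connected output layer with softmax activation; its size must be 10, the number of digit classes
+    # Fully connected output layer with softmax activation; its size must be 10, the number of digit classes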
-prediction = fluid.layers.fc(input=conv_pool_2, size=10, act='softmax')
+    prediction = fluid.layers.fc(input=conv_pool_2, size=10, act='softmax')
-return prediction
+    return prediction
 ```
 #### Train Program Configuration
@@ -234,18 +230,16 @@ return prediction
 ```python
 def train_program():
-label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
-# predict = softmax_regression() # uncomment for Softmax regression
-# predict = multilayer_perceptron() # uncomment for multilayer perceptron
-predict = convolutional_neural_network() # uncomment for LeNet5 convolutional neural network
-cost = fluid.layers.cross_entropy(input=predict, label=label)
-avg_cost = fluid.layers.mean(cost)
-acc = fluid.layers.accuracy(input=predict, label=label)
-return [avg_cost, acc]
+    # predict = softmax_regression() # uncomment for Softmax regression
+    # predict = multilayer_perceptron() # uncomment for multilayer perceptron
+    predict = convolutional_neural_network() # uncomment for LeNet5 convolutional neural network
+    cost = fluid.layers.cross_entropy(input=predict, label=label)
+    avg_cost = fluid.layers.mean(cost)
+    acc = fluid.layers.accuracy(input=predict, label=label)
+    return [avg_cost, acc]
-# This model runs on a single CPU
 ```
 #### Optimizer Function Configuration
@@ -254,25 +248,25 @@ return [avg_cost, acc]
 ```python
 def optimizer_program():
-return fluid.optimizer.Adam(learning_rate=0.001)
+    return fluid.optimizer.Adam(learning_rate=0.001)
 ```
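 ### Dataset Feeders Configuration
 Next, we start the training process. `paddle.dataset.mnist.train()` and `paddle.dataset.mnist.test()` provide the training and test datasets, respectively. Each of these functions returns a reader. In PaddlePaddle, a reader is a Python function that returns a Python yield generator each time it is called.
-`shuffle` below is a reader decorator: it takes a reader A and returns another reader B; reader B reads `buf_size` training samples into a buffer each time, shuffles them randomly, and outputs them one by one.
+`shuffle` below is a reader decorator: it takes a reader A and returns another reader B. Reader B reads `buf_size` training samples into a buffer each time, shuffles them randomly, and outputs them one by one.
-`batch` is a special decorator: its input is a reader and its output is a batched reader; in PaddlePaddle, a reader yields one training sample at a time, while a batched reader yields one minibatch at a time.
+`batch` is a special decorator: its input is a reader and its output is a batched reader. In PaddlePaddle, a reader yields one training sample at a time, while a batched reader yields one minibatch at a time.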
 ```python
 train_reader = paddle.batch(
-paddle.reader.shuffle(
+        paddle.reader.shuffle(
-paddle.dataset.mnist.train(), buf_size=500),
+            paddle.dataset.mnist.train(), buf_size=500),
-batch_size=64)
+        batch_size=64)
 test_reader = paddle.batch(
-paddle.dataset.mnist.test(), batch_size=64)
+            paddle.dataset.mnist.test(), batch_size=64)
 ```
 ### Trainer Configuration
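Note on the reader hunk above: the behaviour described for the `shuffle` and `batch` decorators can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not PaddlePaddle's actual code: the functions `shuffle`, `batch`, and `toy_reader` below, and the `seed` parameter, are hypothetical stand-ins written for this note.

```python
import random


def shuffle(reader, buf_size, seed=None):
    """Hypothetical stand-in for a shuffling reader decorator."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                # Buffer is full: shuffle it and emit samples one by one.
                rng.shuffle(buf)
                for s in buf:
                    yield s
                buf = []
        # Emit whatever is left in the last, partially filled buffer.
        rng.shuffle(buf)
        for s in buf:
            yield s

    return shuffled_reader


def batch(reader, batch_size):
    """Hypothetical stand-in for a batching reader decorator."""

    def batched_reader():
        minibatch = []
        for sample in reader():
            minibatch.append(sample)
            if len(minibatch) == batch_size:
                yield minibatch
                minibatch = []
        if minibatch:
            yield minibatch

    return batched_reader


def toy_reader():
    # A reader is a function that returns a generator of samples.
    for i in range(10):
        yield i


train_reader = batch(shuffle(toy_reader, buf_size=4, seed=0), batch_size=3)
batches = list(train_reader())
print(batches)
```

Because shuffling only happens within one buffer at a time, a small `buf_size` gives only local shuffling; whether the real decorator behaves identically at buffer boundaries is not confirmed by the diff.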
@@ -285,7 +279,8 @@ use_cuda = False # set to True if training with GPU
 place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
 trainer = fluid.Trainer(
-train_func=train_program, place=place, optimizer_func=optimizer_program)
+    train_func=train_program, place=place, optimizer_func=optimizer_program)
 ```
 #### Event Handler Configuration
@@ -300,27 +295,32 @@ Fluid API 在训练期间为回调函数提供了一个钩子。用户能够通
 params_dirname = "recognize_digits_network.inference.model"
 lists = []
 def event_handler(event):
-if isinstance(event, fluid.EndStepEvent):
+    if isinstance(event, fluid.EndStepEvent):
-if event.step % 100 == 0:
+        if event.step % 100 == 0:
-# event.metrics maps with train program return arguments.
+            # event.metrics maps with train program return arguments.
-# event.metrics[0] will yeild avg_cost and event.metrics[1] will yeild acc in this example.
+            # event.metrics[0] will yield avg_cost and event.metrics[1] will yield acc in this example.
-print "Pass %d, Batch %d, Cost %f" % (
+            print("Pass %d, Batch %d, Cost %f" % (
-event.step, event.epoch, event.metrics[0])
+                event.epoch, event.step, event.metrics[0]))
-if isinstance(event, fluid.EndEpochEvent):
+    if isinstance(event, fluid.EndEpochEvent):
-avg_cost, acc = trainer.test(
+        avg_cost, acc = trainer.test(
-reader=test_reader, feed_order=['img', 'label'])
+            reader=test_reader, feed_order=['img', 'label'])
-print("Test with Epoch %d, avg_cost: %s, acc: %s" % (event.epoch, avg_cost, acc))
+        print("Test with Epoch %d, avg_cost: %s, acc: %s" % (event.epoch, avg_cost, acc))
-# save parameters
+        # save parameters
-trainer.save_params(params_dirname)
+        trainer.save_params(params_dirname)
-lists.append((event.epoch, avg_cost, acc))
+        lists.append((event.epoch, avg_cost, acc))
 ```
 `event_handler_plot` can be used to plot figures during training, as shown below:
-![png](./image/train_and_test.png)
+<p align="center">
+<img src="image/train_and_test2.png" width="400"><br/>
+Figure 7. Training results
+</p>
 ```python
 from paddle.v2.plot import Ploter
@@ -333,22 +333,22 @@ lists = []
 # event_handler to plot a figure
 def event_handler_plot(event):
-global step
+    global step
-if isinstance(event, fluid.EndStepEvent):
+    if isinstance(event, fluid.EndStepEvent):
-if step % 100 == 0:
+        if step % 100 == 0:
-# event.metrics maps with train program return arguments.
+            # event.metrics maps with train program return arguments.
-# event.metrics[0] will yeild avg_cost and event.metrics[1] will yeild acc in this example.
+            # event.metrics[0] will yield avg_cost and event.metrics[1] will yield acc in this example.
-cost_ploter.append(train_title, step, event.metrics[0])
+            cost_ploter.append(train_title, step, event.metrics[0])
-cost_ploter.plot()
+            cost_ploter.plot()
-step += 1
+        step += 1
-if isinstance(event, fluid.EndEpochEvent):
+    if isinstance(event, fluid.EndEpochEvent):
-# save parameters
+        # save parameters
-trainer.save_params(params_dirname)
+        trainer.save_params(params_dirname)
-avg_cost, acc = trainer.test(
+        avg_cost, acc = trainer.test(
-reader=test_reader, feed_order=['img', 'label'])
+            reader=test_reader, feed_order=['img', 'label'])
-cost_ploter.append(test_title, step, avg_cost)
+        cost_ploter.append(test_title, step, avg_cost)
-lists.append((event.epoch, avg_cost, acc))
+        lists.append((event.epoch, avg_cost, acc))
 ```
 #### Start Training
@@ -359,10 +359,10 @@ lists.append((event.epoch, avg_cost, acc))
 ```python
 trainer.train(
-num_epochs=5,
+    num_epochs=5,
-event_handler=event_handler,
+    event_handler=event_handler,
-reader=train_reader,
+    reader=train_reader,
-feed_order=['img', 'label'])
+    feed_order=['img', 'label'])
 ```
 The training process is fully automatic. The log printed by event_handler looks similar to the following:
@@ -395,11 +395,11 @@ Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338
 ```python
 inferencer = fluid.Inferencer(
-# infer_func=softmax_regression, # uncomment for softmax regression
+    # infer_func=softmax_regression, # uncomment for softmax regression
-# infer_func=multilayer_perceptron, # uncomment for MLP
+    # infer_func=multilayer_perceptron, # uncomment for MLP
-infer_func=convolutional_neural_network,  # uncomment for LeNet5
+    infer_func=convolutional_neural_network,  # uncomment for LeNet5
-param_path=params_dirname,
+    param_path=params_dirname,
-place=place)
+    place=place)
 ```
 ### Generate Input Data for Inference
@@ -412,11 +412,11 @@ import os
 import numpy as np
 from PIL import Image
 def load_image(file):
-im = Image.open(file).convert('L')
+    im = Image.open(file).convert('L')
-im = im.resize((28, 28), Image.ANTIALIAS)
+    im = im.resize((28, 28), Image.ANTIALIAS)
-im = np.array(im).reshape(1, 1, 28, 28).astype(np.float32)
+    im = np.array(im).reshape(1, 1, 28, 28).astype(np.float32)
-im = im / 255.0 * 2.0 - 1.0
+    im = im / 255.0 * 2.0 - 1.0
-return im
+    return im
 cur_dir = os.getcwd()
 img = load_image(cur_dir + '/image/infer_3.png')
@@ -429,7 +429,7 @@ img = load_image(cur_dir + '/image/infer_3.png')
 ```python
 results = inferencer.infer({'img': img})
 lab = np.argsort(results)  # probs and lab are the results of one batch data
-print "Label of image/infer_3.png is: %d" % lab[0][0][-1]
+print ("Inference result of image/infer_3.png is: %d" % lab[0][0][-1])
 ```
 ## Summary

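Note on the inference hunks above: `load_image` rescales pixels with `im / 255.0 * 2.0 - 1.0`, and the predicted digit is read as the last index of `np.argsort(results)`. The sketch below reproduces both steps with the standard library only, so it runs without NumPy or PaddlePaddle. The helpers `normalize_pixel` and `argsort` and the probability vector are made up for illustration; they are not output from the tutorial's model.

```python
def normalize_pixel(p):
    # Same arithmetic as load_image: map [0, 255] onto [-1.0, 1.0].
    return p / 255.0 * 2.0 - 1.0


def argsort(values):
    # Indices that would sort `values` in ascending order (pure-Python argsort).
    return sorted(range(len(values)), key=lambda i: values[i])


# Hypothetical softmax output for one image: ten class probabilities.
probs = [0.01, 0.02, 0.05, 0.80, 0.02, 0.03, 0.02, 0.02, 0.02, 0.01]

order = argsort(probs)
predicted = order[-1]  # ascending sort, so the last index is the most probable class

print(normalize_pixel(0), normalize_pixel(255))
print("predicted digit:", predicted)
```

Since the sort is ascending, taking the last element selects the class with the highest probability, which is why the tutorial indexes with `[-1]`.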
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/cnn_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/cnn_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/cnn_train_log_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/cnn_train_log_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/conv_layer.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/conv_layer.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/max_pooling_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/max_pooling_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/mlp_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/mlp_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/mlp_train_log_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/mlp_train_log_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/softmax_regression_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/softmax_regression_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/softmax_train_log_en.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/softmax_train_log_en.png
--- a/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/train_and_test.png
+++ b/doc/fluid/new_docs/beginners_guide/quick_start/recognize_digits/image/train_and_test.png
--- a/paddle/fluid/API.spec
+++ b/paddle/fluid/API.spec
@@ -36,6 +36,7 @@ paddle.fluid.default_startup_program ArgSpec(args=[], varargs=None, keywords=Non
 paddle.fluid.default_main_program ArgSpec(args=[], varargs=None, keywords=None, defaults=None)
 paddle.fluid.program_guard ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None)
 paddle.fluid.get_var ArgSpec(args=['name', 'program'], varargs=None, keywords=None, defaults=(None,))
+paddle.fluid.name_scope ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None)
 paddle.fluid.Executor.__init__ ArgSpec(args=['self', 'place'], varargs=None, keywords=None, defaults=None)
 paddle.fluid.Executor.close ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None)
 paddle.fluid.Executor.run ArgSpec(args=['self', 'program', 'feed', 'fetch_list', 'feed_var_name', 'fetch_var_name', 'scope', 'return_numpy', 'use_program_cache'], varargs=None, keywords=None, defaults=(None, None, None, 'feed', 'fetch', None, True, False))
@@ -55,9 +56,10 @@ paddle.fluid.Inferencer.__init__ ArgSpec(args=['self', 'infer_func', 'param_path
 paddle.fluid.Inferencer.infer ArgSpec(args=['self', 'inputs', 'return_numpy'], varargs=None, keywords=None, defaults=(True,))
 paddle.fluid.DistributeTranspiler.__init__ ArgSpec(args=['self', 'config'], varargs=None, keywords=None, defaults=(None,))
 paddle.fluid.DistributeTranspiler.get_pserver_program ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None)
-paddle.fluid.DistributeTranspiler.get_startup_program ArgSpec(args=['self', 'endpoint', 'pserver_program', 'startup_program'], varargs=None, keywords=None, defaults=(None,))
+paddle.fluid.DistributeTranspiler.get_pserver_programs ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None)
+paddle.fluid.DistributeTranspiler.get_startup_program ArgSpec(args=['self', 'endpoint', 'pserver_program', 'startup_program'], varargs=None, keywords=None, defaults=(None, None))
 paddle.fluid.DistributeTranspiler.get_trainer_program ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None)
-paddle.fluid.DistributeTranspiler.transpile ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'sync_mode'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, True))
+paddle.fluid.DistributeTranspiler.transpile ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'sync_mode', 'startup_program'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, True, None))
 paddle.fluid.InferenceTranspiler.__init__ 
 paddle.fluid.InferenceTranspiler.transpile ArgSpec(args=['self', 'program', 'place', 'scope'], varargs=None, keywords=None, defaults=(None,))
 paddle.fluid.memory_optimize ArgSpec(args=['input_program', 'skip_opt_set', 'print_log', 'level'], varargs=None, keywords=None, defaults=(None, False, 0))
@@ -144,6 +146,8 @@ paddle.fluid.layers.smooth_l1 ArgSpec(args=['x', 'y', 'inside_weight', 'outside_
 paddle.fluid.layers.one_hot ArgSpec(args=['input', 'depth'], varargs=None, keywords=None, defaults=None)
 paddle.fluid.layers.autoincreased_step_counter ArgSpec(args=['counter_name', 'begin', 'step'], varargs=None, keywords=None, defaults=(None, 1, 1))
 paddle.fluid.layers.reshape ArgSpec(args=['x', 'shape', 'actual_shape', 'act', 'inplace', 'name'], varargs=None, keywords=None, defaults=(None, None, True, None))
+paddle.fluid.layers.squeeze ArgSpec(args=['input', 'axes', 'name'], varargs=None, keywords=None, defaults=(None,))
+paddle.fluid.layers.unsqueeze ArgSpec(args=['input', 'axes', 'name'], varargs=None, keywords=None, defaults=(None,))
 paddle.fluid.layers.lod_reset ArgSpec(args=['x', 'y', 'target_lod'], varargs=None, keywords=None, defaults=(None, None))
 paddle.fluid.layers.lrn ArgSpec(args=['input', 'n', 'k', 'alpha', 'beta', 'name'], varargs=None, keywords=None, defaults=(5, 1.0, 0.0001, 0.75, None))
 paddle.fluid.layers.pad ArgSpec(args=['x', 'paddings', 'pad_value', 'name'], varargs=None, keywords=None, defaults=(0.0, None))
@@ -255,6 +259,7 @@ paddle.fluid.layers.logical_xor ArgSpec(args=[], varargs='args', keywords='kwarg
 paddle.fluid.layers.logical_not ArgSpec(args=[], varargs='args', keywords='kwargs', defaults=None)
 paddle.fluid.layers.uniform_random_batch_size_like ArgSpec(args=[], varargs='args', keywords='kwargs', defaults=None)
 paddle.fluid.layers.gaussian_random ArgSpec(args=[], varargs='args', keywords='kwargs', defaults=None)
+paddle.fluid.layers.sampling_id ArgSpec(args=[], varargs='args', keywords='kwargs', defaults=None)
 paddle.fluid.layers.gaussian_random_batch_size_like ArgSpec(args=[], varargs='args', keywords='kwargs', defaults=None)
 paddle.fluid.layers.sum ArgSpec(args=[], varargs='args', keywords='kwargs', defaults=None)
 paddle.fluid.layers.slice ArgSpec(args=[], varargs='args', keywords='kwargs', defaults=None)
@@ -297,8 +302,9 @@ paddle.fluid.layers.target_assign ArgSpec(args=['input', 'matched_indices', 'neg
 paddle.fluid.layers.detection_output ArgSpec(args=['loc', 'scores', 'prior_box', 'prior_box_var', 'background_label', 'nms_threshold', 'nms_top_k', 'keep_top_k', 'score_threshold', 'nms_eta'], varargs=None, keywords=None, defaults=(0, 0.3, 400, 200, 0.01, 1.0))
 paddle.fluid.layers.ssd_loss ArgSpec(args=['location', 'confidence', 'gt_box', 'gt_label', 'prior_box', 'prior_box_var', 'background_label', 'overlap_threshold', 'neg_pos_ratio', 'neg_overlap', 'loc_loss_weight', 'conf_loss_weight', 'match_type', 'mining_type', 'normalize', 'sample_size'], varargs=None, keywords=None, defaults=(None, 0, 0.5, 3.0, 0.5, 1.0, 1.0, 'per_prediction', 'max_negative', True, None))
 paddle.fluid.layers.detection_map ArgSpec(args=['detect_res', 'label', 'class_num', 'background_label', 'overlap_threshold', 'evaluate_difficult', 'has_state', 'input_states', 'out_states', 'ap_version'], varargs=None, keywords=None, defaults=(0, 0.3, True, None, None, None, 'integral'))
-paddle.fluid.layers.rpn_target_assign ArgSpec(args=['loc', 'scores', 'anchor_box', 'gt_box', 'rpn_batch_size_per_im', 'fg_fraction', 'rpn_positive_overlap', 'rpn_negative_overlap'], varargs=None, keywords=None, defaults=(256, 0.25, 0.7, 0.3))
+paddle.fluid.layers.rpn_target_assign ArgSpec(args=['loc', 'scores', 'anchor_box', 'anchor_var', 'gt_box', 'rpn_batch_size_per_im', 'fg_fraction', 'rpn_positive_overlap', 'rpn_negative_overlap'], varargs=None, keywords=None, defaults=(256, 0.25, 0.7, 0.3))
 paddle.fluid.layers.anchor_generator ArgSpec(args=['input', 'anchor_sizes', 'aspect_ratios', 'variance', 'stride', 'offset', 'name'], varargs=None, keywords=None, defaults=(None, None, [0.1, 0.1, 0.2, 0.2], None, 0.5, None))
+paddle.fluid.layers.generate_proposal_labels ArgSpec(args=['rpn_rois', 'gt_classes', 'gt_boxes', 'im_scales', 'batch_size_per_im', 'fg_fraction', 'fg_thresh', 'bg_thresh_hi', 'bg_thresh_lo', 'bbox_reg_weights', 'class_nums'], varargs=None, keywords=None, defaults=(256, 0.25, 0.25, 0.5, 0.0, [0.1, 0.1, 0.2, 0.2], None))
 paddle.fluid.layers.generate_proposals ArgSpec(args=['scores', 'bbox_deltas', 'im_info', 'anchors', 'variances', 'pre_nms_top_n', 'post_nms_top_n', 'nms_thresh', 'min_size', 'eta', 'name'], varargs=None, keywords=None, defaults=(6000, 1000, 0.5, 0.1, 1.0, None))
 paddle.fluid.layers.iou_similarity ArgSpec(args=[], varargs='args', keywords='kwargs', defaults=None)
 paddle.fluid.layers.box_coder ArgSpec(args=[], varargs='args', keywords='kwargs', defaults=None)
@@ -335,9 +341,10 @@ paddle.fluid.contrib.BeamSearchDecoder.update_array ArgSpec(args=['self', 'array
 paddle.fluid.contrib.memory_usage ArgSpec(args=['program', 'batch_size'], varargs=None, keywords=None, defaults=None)
 paddle.fluid.transpiler.DistributeTranspiler.__init__ ArgSpec(args=['self', 'config'], varargs=None, keywords=None, defaults=(None,))
 paddle.fluid.transpiler.DistributeTranspiler.get_pserver_program ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None)
-paddle.fluid.transpiler.DistributeTranspiler.get_startup_program ArgSpec(args=['self', 'endpoint', 'pserver_program', 'startup_program'], varargs=None, keywords=None, defaults=(None,))
+paddle.fluid.transpiler.DistributeTranspiler.get_pserver_programs ArgSpec(args=['self', 'endpoint'], varargs=None, keywords=None, defaults=None)
+paddle.fluid.transpiler.DistributeTranspiler.get_startup_program ArgSpec(args=['self', 'endpoint', 'pserver_program', 'startup_program'], varargs=None, keywords=None, defaults=(None, None))
 paddle.fluid.transpiler.DistributeTranspiler.get_trainer_program ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None)
-paddle.fluid.transpiler.DistributeTranspiler.transpile ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'sync_mode'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, True))
+paddle.fluid.transpiler.DistributeTranspiler.transpile ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'sync_mode', 'startup_program'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, True, None))
 paddle.fluid.transpiler.InferenceTranspiler.__init__ 
 paddle.fluid.transpiler.InferenceTranspiler.transpile ArgSpec(args=['self', 'program', 'place', 'scope'], varargs=None, keywords=None, defaults=(None,))
 paddle.fluid.transpiler.memory_optimize ArgSpec(args=['input_program', 'skip_opt_set', 'print_log', 'level'], varargs=None, keywords=None, defaults=(None, False, 0))
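
Note on reading the API.spec entries above: in an `ArgSpec`, the `defaults` tuple lines up with the trailing entries of `args`, so arguments not covered by `defaults` are required. The sketch below applies that rule to the new `DistributeTranspiler.transpile` entry from this diff; `bind_defaults` is a hypothetical helper written for this note.

```python
def bind_defaults(args, defaults):
    """Split an ArgSpec-style (args, defaults) pair into required and optional parts."""
    defaults = defaults or ()
    split = len(args) - len(defaults)
    required = args[:split]
    optional = dict(zip(args[split:], defaults))
    return required, optional


# Values copied from the updated transpile entry in API.spec.
args = ['self', 'trainer_id', 'program', 'pservers', 'trainers',
        'sync_mode', 'startup_program']
defaults = (None, '127.0.0.1:6174', 1, True, None)

required, optional = bind_defaults(args, defaults)
print("required:", required)
print("optional:", optional)
```

Read this way, the change to `get_startup_program` from `defaults=(None,)` to `defaults=(None, None)` means `pserver_program` gains a default of `None` alongside `startup_program`.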

--- a/paddle/fluid/framework/details/multi_devices_graph_pass.cc
+++ b/paddle/fluid/framework/details/multi_devices_graph_pass.cc
@@ -625,19 +625,11 @@ int MultiDevSSAGraphBuilder::GetVarDeviceID(const ir::Graph &graph,
 void MultiDevSSAGraphBuilder::CreateScaleLossGradOp(
    ir::Graph *result, const std::string &loss_grad_name) const {
  for (size_t i = 0; i < places_.size(); ++i) {
-// Insert ScaleCost OpHandle
+    // Insert ScaleCost OpHandle
-#ifdef PADDLE_WITH_CUDA
+    auto *dev_ctx = platform::DeviceContextPool::Instance().Get(places_[i]);
-    auto *communication_dev_ctx =
-        nccl_ctxs_ ? nccl_ctxs_->DevCtx(places_[i])
-                   : platform::DeviceContextPool::Instance().Get(places_[i]);
-#else
-    auto *communication_dev_ctx =
-        platform::DeviceContextPool::Instance().Get(platform::CPUPlace());
-#endif
    auto *op_handle = new ScaleLossGradOpHandle(
        result->CreateEmptyNode("scale_loss_grad", ir::Node::Type::kOperation),
-        local_scopes_.size(), local_scopes_[i], places_[i],
+        local_scopes_.size(), local_scopes_[i], places_[i], dev_ctx);
-        communication_dev_ctx);
    result->Get<GraphOps>(kGraphOps).emplace_back(op_handle);
    // FIXME: Currently ScaleLossGradOp only use device_count as scale
@@ -744,7 +736,7 @@ void MultiDevSSAGraphBuilder::CreateDistTrainOp(ir::Graph *result,
          .emplace(varname, op_dev_id);
    }
  } else {
-    PADDLE_ENFORCE(
+    PADDLE_THROW(
        "the distribute training related op should be in [split_byref, "
        "concat].");
  }

--- a/paddle/fluid/framework/executor.h
+++ b/paddle/fluid/framework/executor.h
@@ -60,6 +60,7 @@ class Executor {
  void Run(const ProgramDesc& prog, Scope* scope, int block_id,
           bool create_local_scope = true, bool create_vars = true);
+  // This API is very slow.
  void Run(const ProgramDesc& program, Scope* scope,
           std::map<std::string, const LoDTensor*>* feed_targets,
           std::map<std::string, LoDTensor*>* fetch_targets,
@@ -79,6 +80,7 @@ class Executor {
                          bool create_local_scope = true,
                          bool create_vars = true, bool keep_kids = false);
+  // This API is very slow.
  void RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope,
                          std::map<std::string, const LoDTensor*>* feed_targets,
                          std::map<std::string, LoDTensor*>* fetch_targets,

--- a/paddle/fluid/framework/ir/attention_lstm_fuse_pass.cc
+++ b/paddle/fluid/framework/ir/attention_lstm_fuse_pass.cc
@@ -59,7 +59,7 @@ void FindWhileOp(Graph* graph) {
  auto handle = [&](const GraphPatternDetector::subgraph_t& subgraph,
                    Graph* g) {
-    auto* while_pat_node = gpd.pattern().RetriveNode("while");
+    auto* while_pat_node = gpd.pattern().RetrieveNode("while");
    auto* while_node = subgraph.at(while_pat_node);
    marked_nodes.insert(while_node);
  };

--- a/paddle/fluid/framework/ir/fc_fuse_pass.cc
+++ b/paddle/fluid/framework/ir/fc_fuse_pass.cc
@@ -31,77 +31,34 @@ bool VarOutLinksToOp(Node* node, const std::string& op_type) {
 }
 void BuildFCPattern(PDPattern* pattern) {
-  // make sure the selected MUL op has one input argument is a parameter.
+  // Create Operators
-  auto* mul_parameter_var = pattern->NewNode(
+  auto* mul_op = pattern->NewNode("mul")->assert_is_op("mul");
-      [](Node* node) {
+  auto* elementwise_add_op =
-        return node->IsVar() && node->outputs.size() == 1UL &&
+      pattern->NewNode("elementwise_add")->assert_is_op("elementwise_add");
-               node->outputs.front()->Op()->Type() == "mul" && node->Var() &&
+  // Create variables
-               node->Var()->Persistable();  // check is a parameter
+  // w
-      },
+  auto* mul_weight_var = pattern->NewNode("mul_weight")
-      "mul_weight" /*name*/);
+                             ->AsInput()
+                             ->assert_is_op_nth_input("mul", "Y", 0);
-  auto* mul_tmp_input_var = pattern->NewNode(
+  // x
-      [](Node* node) {
+  auto* mul_tmp_var = pattern->NewNode("mul_tmp_var")
-        bool result =
+                          ->AsInput()
-            node->IsVar() && node->outputs.size() >= 1UL && node->Var() &&
+                          ->assert_is_op_nth_input("mul", "X", 0);
-            !node->Var()->Persistable();  // this input is not an parameter.
+  // intermediate variable, will be removed in the IR after fuse.
-        if (!result) return false;
+  auto* mul_out_var = pattern->NewNode("mul_out")
-        // check whether one output is MUL op.
+                          ->AsIntermediate()
-        for (auto* op : node->outputs) {
+                          ->assert_is_only_output_of_op("mul")
-          if (op->IsOp() && op->Op()->Type() == "mul") return true;
+                          ->assert_is_op_input("elementwise_add");
-        }
+  // bias
-        return false;
+  auto* elementwise_add_tmp_var = pattern->NewNode("elementwise_add_tmpvar")
-      },
+                                      ->assert_is_op_input("elementwise_add")
-      "mul_tmp_var" /*name*/);
+                                      ->AsInput();
+  // output
-  // select a MUL op
+  auto* elementwise_add_out_var = pattern->NewNode("elementwise_add_out")
-  auto* mul_op = pattern->NewNode(
+                                      ->AsOutput()
-      [](Node* node) {
+                                      ->assert_is_op_output("elementwise_add");
-        return node->IsOp() &&               // start from an Op
-               node->Op()->Type() == "mul";  // type is mul
+  mul_op->LinksFrom({mul_weight_var, mul_tmp_var}).LinksTo({mul_out_var});
-        // the output should be consumed only by one element_add, that check
-        // leaves in a Var PDNode.
-      },
-      "mul" /*name*/);
-  // make sure the MUL op's output has only one consumer and links to an
-  // ELEMENTWISE_ADD op.
-  auto* mul_out_var = pattern->NewNode(
-      [](Node* node) {
-        return node->IsVar() &&                  // starts from a Var
-               node->outputs.size() == 1UL &&    // only has one consumer
-               node->outputs.front()->IsOp() &&  // check basic logic
-               node->Var() &&                    // not a ControlDepVar
-               node->outputs.front()->Op()->Type() ==
-                   "elementwise_add";  // a very strong validation
-      },
-      "mul_out");
-  // this check is not essential, just to make the corresponding variable Node
-  // retrival easier.
-  auto* elementwise_add_tmp_var = pattern->NewNode(
-      [](Node* node) {
-        return node->IsVar() && node->outputs.size() >= 1UL && node->Var() &&
-               VarOutLinksToOp(node, "elementwise_add");
-      },
-      "elementwise_add_tmpvar");
-  // select an ELEMENTWISE_ADD op
-  auto* elementwise_add_op = pattern->NewNode(
-      [](Node* node) {
-        return node->IsOp() && node->Op()->Type() == "elementwise_add";
-      },
-      "elementwise_add" /*name*/);
-  // get the ELEMENTWISE_ADD op's output
-  auto* elementwise_add_out_var = pattern->NewNode(
-      [](Node* node) {
-        return node->IsVar() && node->inputs.size() == 1UL && node->Var() &&
-               node->inputs.front()->Op()->Type() == "elementwise_add";
-      },
-      "elementwise_add_out");
-  mul_op->LinksFrom({mul_parameter_var, mul_tmp_input_var})
-      .LinksTo({mul_out_var});
  elementwise_add_op->LinksFrom({mul_out_var, elementwise_add_tmp_var})
      .LinksTo({elementwise_add_out_var});
 }
@@ -120,6 +77,7 @@ bool LinksReplace(std::vector<Node*>* links, Node* from, Node* to) {
 std::unique_ptr<ir::Graph> FCFusePass::ApplyImpl(
    std::unique_ptr<ir::Graph> graph) const {
  PADDLE_ENFORCE(graph.get());
+  FusePassBase::Init("fc", graph.get());
  std::unordered_set<Node*> nodes2delete;
@@ -127,11 +85,12 @@ std::unique_ptr<ir::Graph> FCFusePass::ApplyImpl(
  BuildFCPattern(gpd.mutable_pattern());
 #define GET_NODE(id)                                              \
-  PADDLE_ENFORCE(subgraph.count(gpd.pattern().RetriveNode(#id)), \
+  PADDLE_ENFORCE(subgraph.count(gpd.pattern().RetrieveNode(#id)), \
                 "pattern has no Node called %s", #id);           \
-  auto* id = subgraph.at(gpd.pattern().RetriveNode(#id));        \
+  auto* id = subgraph.at(gpd.pattern().RetrieveNode(#id));        \
  PADDLE_ENFORCE_NOT_NULL(id, "subgraph has no node %s", #id);
+  int found_fc_count = 0;
  auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph,
                     Graph* g) {
    VLOG(4) << "handle FC fuse";
@@ -176,10 +135,13 @@ std::unique_ptr<ir::Graph> FCFusePass::ApplyImpl(
    graph->RemoveNode(mul);
    graph->RemoveNode(elementwise_add);
    graph->RemoveNode(mul_out);  // tmp variable
+    found_fc_count++;
  };
  gpd(graph.get(), handler);
+  AddStatis(found_fc_count);
  return graph;
 }
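
Note on the fc_fuse_pass.cc change above: the rewritten `BuildFCPattern` describes a `mul` whose only output feeds an `elementwise_add`, and the handler replaces the pair with a single fused op while counting matches in `found_fc_count`. The toy sketch below illustrates that idea on a flat list of ops. It is an assumption-laden simplification: the dict-based op representation and the `fuse_fc` function are invented for this note, and the real pass works on `ir::Graph` through `GraphPatternDetector`.

```python
def fuse_fc(ops):
    """Fuse mul -> elementwise_add pairs into a single 'fc' op (toy model).

    Each op is a dict: {"type": str, "inputs": [names], "outputs": [names]}.
    Returns the rewritten op list and the number of fused pairs.
    """
    # Map each variable name to the ops that consume it.
    consumers = {}
    for op in ops:
        for name in op["inputs"]:
            consumers.setdefault(name, []).append(op)

    fused, removed, found = [], set(), 0
    for op in ops:
        if id(op) in removed:
            continue
        if op["type"] == "mul" and len(op["outputs"]) == 1:
            mul_out = op["outputs"][0]
            users = consumers.get(mul_out, [])
            # The intermediate variable must have exactly one consumer,
            # and that consumer must be an elementwise_add.
            if len(users) == 1 and users[0]["type"] == "elementwise_add":
                add = users[0]
                bias = [n for n in add["inputs"] if n != mul_out]
                fused.append({"type": "fc",
                              "inputs": op["inputs"] + bias,
                              "outputs": list(add["outputs"])})
                removed.add(id(add))
                found += 1
                continue
        fused.append(op)
    return fused, found


ops = [
    {"type": "mul", "inputs": ["x", "w"], "outputs": ["mul_out"]},
    {"type": "elementwise_add", "inputs": ["mul_out", "b"], "outputs": ["y"]},
    {"type": "relu", "inputs": ["y"], "outputs": ["z"]},
]
new_ops, found_fc_count = fuse_fc(ops)
print(found_fc_count, [op["type"] for op in new_ops])
```

The single-consumer check mirrors the role of the intermediate `mul_out` node in the pattern: if another op also read `mul_out`, removing it would break that op, so the pair is left unfused.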

--- a/paddle/fluid/framework/ir/fc_fuse_pass.h
+++ b/paddle/fluid/framework/ir/fc_fuse_pass.h
@@ -12,6 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
+#include "paddle/fluid/framework/ir/fuse_pass_base.h"
 #include "paddle/fluid/framework/ir/graph.h"
 #include "paddle/fluid/framework/ir/graph_pattern_detector.h"
 #include "paddle/fluid/framework/ir/pass.h"
@@ -23,7 +24,7 @@ namespace ir {
 /*
 * Fuse the MUL and ELEMENTWISE_ADD to a FCOp.
 */
-class FCFusePass : public Pass {
+class FCFusePass : public FusePassBase {
 public:
  virtual ~FCFusePass() {}

--- a/paddle/fluid/framework/ir/fc_fuse_pass_tester.cc
+++ b/paddle/fluid/framework/ir/fc_fuse_pass_tester.cc
@@ -25,8 +25,13 @@ void SetOp(ProgramDesc* prog, const std::string& type,
           const std::vector<std::string>& outputs) {
  auto* op = prog->MutableBlock(0)->AppendOp();
  op->SetType(type);
-  op->SetInput("Xs", inputs);
+  if (type == "mul") {
-  op->SetOutput("Ys", outputs);
+    op->SetInput("X", {inputs[0]});
+    op->SetInput("Y", {inputs[1]});
+  } else if (type == "elementwise_add") {
+    op->SetInput("X", inputs);
+  }
+  op->SetOutput("Out", outputs);
 }
 // a->OP0->b

--- a/paddle/fluid/framework/ir/fc_lstm_fuse_pass.cc
+++ b/paddle/fluid/framework/ir/fc_lstm_fuse_pass.cc
@@ -36,7 +36,7 @@ std::unique_ptr<ir::Graph> FCLstmFusePass::ApplyImpl(
  auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph,
                     Graph* g) {
-    auto* id = subgraph.at(gpd.pattern().RetriveNode("any_node"));
+    auto* id = subgraph.at(gpd.pattern().RetrieveNode("any_node"));
    marked_nodes.insert(id);
  };
  gpd(graph.get(), handler);
@@ -64,9 +64,9 @@ std::unique_ptr<ir::Graph> FCLstmFusePass::ApplyImpl(
 #undef GET_NODE
 #undef SET_IN
-    LOG(INFO) << "hidden_n: " << hidden_n->Name();
+    VLOG(4) << "hidden_n: " << hidden_n->Name();
-    LOG(INFO) << "cell: " << cell_n->Name();
+    VLOG(4) << "cell: " << cell_n->Name();
-    LOG(INFO) << "xx: " << xx_n->Name();
+    VLOG(4) << "xx: " << xx_n->Name();
    op_desc.SetInput("H0", {});
    op_desc.SetInput("C0", {});

--- a/paddle/fluid/framework/ir/fuse_pass_base.h
+++ b/paddle/fluid/framework/ir/fuse_pass_base.h
@@ -22,21 +22,37 @@ namespace paddle {
 namespace framework {
 namespace ir {
-static const char kParamScopeAttr[] = "param_scope";
+static const char kParamScopeAttr[] = "__param_scope__";
+static const char kFuseStatisAttr[] = "__fuse_statis__";
 class FusePassBase : public Pass {
 public:
-  void Init(Graph* graph) const { graph_ = graph; }
+  void Init(const std::string& repr, Graph* graph) const {
+    repr_ = repr;
+    graph_ = graph;
+  }
  Scope* param_scope() const {
    PADDLE_ENFORCE(graph_->Has(kParamScopeAttr));
    return graph_->Get<framework::Scope*>(kParamScopeAttr);
  }
+  void AddStatis(int count_of_fused) const {
+    PADDLE_ENFORCE(graph_);
+    PADDLE_ENFORCE(!repr_.empty());
+    if (!graph_->Has(kFuseStatisAttr)) {
+      graph_->Set(kFuseStatisAttr, new std::unordered_map<std::string, int>);
+    }
+    auto& info =
+        graph_->Get<std::unordered_map<std::string, int>>(kFuseStatisAttr);
+    info[repr_] = count_of_fused;
+  }
  virtual ~FusePassBase() {}
 protected:
  mutable Graph* graph_;
+  mutable std::string repr_;
 };
 }  // namespace ir
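
Reviewer note: `FusePassBase::AddStatis` above stores one counter per pass name in a graph attribute, and it overwrites rather than accumulates (`info[repr_] = count_of_fused`). A minimal sketch of that bookkeeping follows. `FuseStatis`, `StatsRecorder` and `RunFakePass` are hypothetical names used only for illustration, and a plain map stands in for the graph attribute.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the attribute stored under kFuseStatisAttr.
using FuseStatis = std::unordered_map<std::string, int>;

// Hypothetical, simplified pass base: remembers its name and writes one counter.
class StatsRecorder {
 public:
  void Init(const std::string& repr, FuseStatis* statis) {
    repr_ = repr;
    statis_ = statis;
  }

  // Overwrites the entry for this pass, mirroring `info[repr_] = count_of_fused`.
  void AddStatis(int count_of_fused) {
    assert(statis_ != nullptr);
    assert(!repr_.empty());
    (*statis_)[repr_] = count_of_fused;
  }

 private:
  std::string repr_;
  FuseStatis* statis_ = nullptr;
};

// Simulates a pass whose handler fires `matches` times, then records the count.
int RunFakePass(const std::string& name, int matches, FuseStatis* statis) {
  StatsRecorder pass;
  pass.Init(name, statis);
  int found = 0;
  for (int i = 0; i < matches; ++i) ++found;  // one increment per handled match
  pass.AddStatis(found);
  return statis->at(name);
}
```

Because the entry is overwritten, running the same pass twice on one graph reports only the last run's count. Whether that is intended is not stated in this change.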

--- a/paddle/fluid/framework/ir/graph.cc
+++ b/paddle/fluid/framework/ir/graph.cc
@@ -87,6 +87,9 @@ bool IsDistTrainOp(ir::Node *node, const std::vector<std::string> &send_vars,
 }
 Graph::Graph(const ProgramDesc &program) : program_(program) {
+  // Make the node ids start from 0.
+  Node::ResetId();
  VLOG(3) << "block in program:" << program_.Size();
  std::unordered_map<std::string, VarDesc *> all_vars;
  for (auto *var : program.Block(0).AllVars()) {

--- a/paddle/fluid/framework/ir/graph.h
+++ b/paddle/fluid/framework/ir/graph.h
@@ -99,13 +99,13 @@ class Graph {
  // Create a normal variable with non-null VarDesc.
  ir::Node *CreateVarNode(VarDesc *var_desc) {
    PADDLE_ENFORCE(var_desc);
-    return AddNode(new ir::Node(var_desc, node_count_++));
+    return AddNode(new ir::Node(var_desc));
  }
  // Create a normal runnable operator with OpDesc.
  ir::Node *CreateOpNode(OpDesc *op_desc) {
    PADDLE_ENFORCE(op_desc);
-    return AddNode(new ir::Node(op_desc, node_count_++));
+    return AddNode(new ir::Node(op_desc));
  }
  // Create a control dependency var that connects 2 operations. The
@@ -115,14 +115,13 @@ class Graph {
    // TODO(panyx0718): control var name should be really unique.
    const std::string name = string::Sprintf(
        "%s@%llu", ir::Node::kControlDepVarName, node_set_.size());
-    return AddNode(
+    return AddNode(new ir::Node(name, ir::Node::Type::kVariable));
-        new ir::Node(name, ir::Node::Type::kVariable, node_count_++));
  }
  // A more free style way of creating a graph node. Mostly use for test
  // or "copy" from another node. Avoid using it if possible.
  ir::Node *CreateEmptyNode(const std::string &name, ir::Node::Type type) {
-    return AddNode(new ir::Node(name, type, node_count_++));
+    return AddNode(new ir::Node(name, type));
  }
  // Clear all node information of the graph and return the ownership of the
@@ -143,9 +142,13 @@ class Graph {
    nodes_.erase(node);
  }
+  // NOTE low performance, but simple and secure.
  Node *RetriveNode(int id) {
-    auto it = id2node_.find(id);
+    for (auto &node : nodes_) {
-    if (it != id2node_.end()) return it->second;
+      if (node.second->id() == id) {
+        return node.second.get();
+      }
+    }
    return nullptr;
  }
@@ -155,8 +158,6 @@ class Graph {
    PADDLE_ENFORCE(node_set_.find(node) == node_set_.end());
    nodes_[node].reset(node);
    node_set_.insert(node);
-    PADDLE_ENFORCE(!id2node_.count(node->id()), "duplicate id %d", node->id());
-    id2node_[node->id()] = node;
    return node;
  }
@@ -166,7 +167,6 @@ class Graph {
  std::map<std::string, std::function<void(void)>> attr_dels_;
  std::map<ir::Node *, std::unique_ptr<ir::Node>> nodes_;
  std::unordered_set<ir::Node *> node_set_;
-  std::map<int, Node *> id2node_;
  int node_count_{0};
 };

--- a/paddle/fluid/framework/ir/graph_pattern_detector.cc
+++ b/paddle/fluid/framework/ir/graph_pattern_detector.cc
@@ -27,6 +27,19 @@ namespace ir {
 size_t PDPattern::id_ = 0UL;
+PDNode* PDPattern::NewNode(const std::string& name) {
+  if (!name.empty()) {
+    PADDLE_ENFORCE_EQ(node_map_.count(name), 0,
+                      "PDNode's name should be unique, get duplicate [%s]",
+                      name);
+  }
+  nodes_.emplace_back(new PDNode(this, name));
+  auto* cur = nodes_.back().get();
+  node_map_[name] = cur;
+  return cur;
+}
 PDNode* PDPattern::NewNode(PDNode::teller_t&& teller, const std::string& name) {
  if (!name.empty()) {
    PADDLE_ENFORCE_EQ(node_map_.count(name), 0,
@@ -40,7 +53,7 @@ PDNode* PDPattern::NewNode(PDNode::teller_t&& teller, const std::string& name) {
  return cur;
 }
-PDNode* PDPattern::RetriveNode(const std::string& id) const {
+PDNode* PDPattern::RetrieveNode(const std::string& id) const {
  auto it = node_map_.find(id);
  if (it == node_map_.end()) {
    return nullptr;
@@ -62,7 +75,9 @@ void GraphPatternDetector::operator()(Graph* graph,
  auto subgraphs = DetectPatterns();
  UniquePatterns(&subgraphs);
  RemoveOverlappedMatch(&subgraphs);
+  ValidateByNodeRole(&subgraphs);
+  if (subgraphs.empty()) return;
  LOG(INFO) << "detect " << subgraphs.size() << " subgraph matches the pattern";
  int id = 0;
  for (auto& g : subgraphs) {
@@ -83,10 +98,54 @@ bool GraphPatternDetector::MarkPDNodesInGraph(const ir::Graph& graph) {
      }
    }
  }
+  // Stop early if some PDNode has no matched Node.
+  for (auto& pdnode : pattern_.nodes()) {
+    if (!pdnodes2nodes_.count(pdnode.get())) {
+      VLOG(4) << pdnode->name() << " can't find matched Node, early stop";
+      return false;
+    }
+  }
  VLOG(3) << pdnodes2nodes_.size() << " nodes marked";
  return !pdnodes2nodes_.empty();
 }
+// The intermediate Nodes can only link to the nodes inside the pattern, or this
+// subgraph will be dropped.
+void GraphPatternDetector::ValidateByNodeRole(
+    std::vector<GraphPatternDetector::subgraph_t>* subgraphs) {
+  std::vector<GraphPatternDetector::subgraph_t> result;
+  subgraphs->erase(
+      std::remove_if(
+          subgraphs->begin(), subgraphs->end(),
+          [](const GraphPatternDetector::subgraph_t& subgraph) -> bool {
+            // Collect the inputs and outputs.
+            std::unordered_set<Node*> ios;
+            for (auto& item : subgraph) {
+              if (!item.first->IsIntermediate()) {
+                ios.insert(item.second);
+              }
+            }
+            for (auto& item : subgraph) {
+              if (item.first->IsIntermediate()) {
+                for (auto* x : item.second->inputs) {
+                  if (!ios.count(x)) {
+                    return true;
+                  }
+                }
+                for (auto* x : item.second->outputs) {
+                  if (!ios.count(x)) {
+                    return true;
+                  }
+                }
+              }
+            }
+            return false;
+          }),
+      subgraphs->end());
+}
 struct HitGroup {
  std::unordered_map<PDNode*, Node*> roles;
@@ -140,6 +199,7 @@ GraphPatternDetector::DetectPatterns() {
  // in edges of PDNodes.
  for (const auto& edge : pattern_.edges()) {
    VLOG(4) << "check " << edge.first->name() << " -> " << edge.second->name();
+    // TODO(Superjomn) Fix bug here, the groups might be duplicate here.
    // Each role has two PDNodes, which indicates two roles.
    // Detect two Nodes that can match these two roles and they are connected.
    auto& pre_groups = bi_records[step % 2];
@@ -149,6 +209,7 @@ GraphPatternDetector::DetectPatterns() {
    // source -> target
    for (Node* source : pdnodes2nodes_[edge.first]) {
      for (Node* target : pdnodes2nodes_[edge.second]) {
+        VLOG(8) << "check " << source->id() << " -- " << target->id();
        // TODO(Superjomn) add some prune strategies.
        for (const auto& group : pre_groups) {
          HitGroup new_group = group;
@@ -165,6 +226,12 @@ GraphPatternDetector::DetectPatterns() {
      }
    }
    VLOG(3) << "step " << step << " get records: " << cur_groups.size();
+    for (auto& group : cur_groups) {
+      for (auto& item : group.roles) {
+        VLOG(4) << "node " << item.second->id() << " as " << item.first->name();
+      }
+      VLOG(4) << "=========================================================";
+    }
  }
  for (auto& group : bi_records[step % 2]) {
@@ -260,6 +327,118 @@ PDNode& PDNode::LinksFrom(const std::vector<PDNode*>& others) {
  return *this;
 }
+PDNode* PDNode::assert_is_op() {
+  asserts_.emplace_back([this](Node* x) { return x && x->IsOp(); });
+  return this;
+}
+PDNode* PDNode::assert_is_op(const std::string& op_type) {
+  asserts_.emplace_back([this, op_type](Node* x) {
+    return x && x->IsOp() && x->Op()->Type() == op_type;
+  });
+  return this;
+}
+PDNode* PDNode::assert_is_var() {
+  asserts_.emplace_back([this](Node* x) { return x && x->IsVar(); });
+  return this;
+}
+PDNode* PDNode::assert_var_not_persistable() {
+  assert_is_var();
+  asserts_.emplace_back([this](Node* x) { return !x->Var()->Persistable(); });
+  return this;
+}
+PDNode* PDNode::assert_is_persistable_var() {
+  assert_is_var();
+  asserts_.emplace_back([=](Node* x) { return x->Var()->Persistable(); });
+  return this;
+}
+PDNode* PDNode::assert_is_op_nth_input(const std::string& op_type,
+                                       const std::string& argument, int nth) {
+  assert_is_var();
+  assert_is_op_input(op_type);
+  asserts_.emplace_back([=](Node* x) {
+    for (auto* op : x->outputs) {
+      if (IsNthInput(x, op, argument, nth)) return true;
+    }
+    return false;
+  });
+  return this;
+}
+PDNode* PDNode::assert_is_op_nth_output(const std::string& op_type,
+                                        const std::string& argument, int nth) {
+  assert_is_var();
+  asserts_.emplace_back([=](Node* x) {
+    for (auto* op : x->inputs) {
+      if (IsNthOutput(x, op, argument, nth)) return true;
+    }
+    return false;
+  });
+  return this;
+}
+PDNode* PDNode::assert_is_only_input_of_op(const std::string& op_type) {
+  assert_is_var();
+  asserts_.emplace_back([=](Node* x) {
+    for (auto* op : x->outputs) {
+      if (op && op->IsOp() && op->Op() && op->Op()->Type() == op_type &&
+          op->inputs.size() == 1) {
+        return true;
+      }
+    }
+    return false;
+  });
+  return this;
+}
+PDNode* PDNode::assert_is_only_output_of_op(const std::string& op_type) {
+  assert_is_var();
+  asserts_.emplace_back([=](Node* x) {
+    for (auto* op : x->inputs) {
+      if (op && op->IsOp() && op->Op() && op->Op()->Type() == op_type &&
+          op->outputs.size() == 1) {
+        return true;
+      }
+    }
+    return false;
+  });
+  return this;
+}
+PDNode* PDNode::assert_is_op_output(const std::string& op_type) {
+  assert_is_var();
+  asserts_.emplace_back([=](Node* x) {
+    for (auto* op : x->inputs) {
+      if (op && op->IsOp() && op->Op() && op->Op()->Type() == op_type) {
+        return true;
+      }
+    }
+    return false;
+  });
+  return this;
+}
+PDNode* PDNode::assert_is_op_input(const std::string& op_type) {
+  assert_is_var();
+  asserts_.emplace_back([=](Node* x) {
+    for (auto* op : x->outputs) {
+      if (op && op->IsOp() && op->Op() && op->Op()->Type() == op_type) {
+        return true;
+      }
+    }
+    return false;
+  });
+  return this;
+}
+PDNode* PDNode::assert_op_has_n_inputs(const std::string& op_type, size_t n) {
+  assert_is_op(op_type);
+  asserts_.emplace_back([=](Node* x) { return x->inputs.size() == n; });
+  return this;
+}
+PDNode* PDNode::assert_op_has_n_outputs(const std::string& op_type, size_t n) {
+  assert_is_op(op_type);
+  asserts_.emplace_back([=](Node* x) { return x->outputs.size() == n; });
+  return this;
+}
+PDNode* PDNode::assert_more(PDNode::teller_t&& teller) {
+  asserts_.emplace_back(std::move(teller));
+  return this;
+}
 }  // namespace ir
 }  // namespace framework
 }  // namespace paddle
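
Reviewer note: `ValidateByNodeRole` above drops a match when any intermediate node links to a node that is not one of the match's non-intermediate nodes. The sketch below reproduces that filter on a stripped-down graph. `MiniNode`, `Match`, `HasExternalLink` and `DropInvalid` are hypothetical names, not part of the Paddle API. As in the diff, a link from one intermediate node to another intermediate node also counts as external.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical minimal graph node: only the links matter here.
struct MiniNode {
  std::vector<MiniNode*> inputs;
  std::vector<MiniNode*> outputs;
};

enum class Role { kUnknown, kInput, kOutput, kIntermediate };

// One match: each matched graph node with the role of its pattern node.
using Match = std::unordered_map<MiniNode*, Role>;

// True if some intermediate node links to a node outside the set of
// non-intermediate nodes of this match.
bool HasExternalLink(const Match& match) {
  std::unordered_set<MiniNode*> ios;
  for (const auto& item : match) {
    if (item.second != Role::kIntermediate) ios.insert(item.first);
  }
  for (const auto& item : match) {
    if (item.second != Role::kIntermediate) continue;
    for (auto* x : item.first->inputs) {
      if (!ios.count(x)) return true;
    }
    for (auto* x : item.first->outputs) {
      if (!ios.count(x)) return true;
    }
  }
  return false;
}

// Erase-remove idiom: keep only the matches that pass the check.
void DropInvalid(std::vector<Match>* matches) {
  matches->erase(
      std::remove_if(matches->begin(), matches->end(), HasExternalLink),
      matches->end());
}
```

This mirrors the `IntermediateCheck` test in this change: with `o2->v2->o3` and `o2->v2->o4`, a match of `o2, v2, o3` is dropped while `v2` is intermediate and kept once `v2` is marked as an input.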
--- a/paddle/fluid/framework/ir/graph_pattern_detector.h
+++ b/paddle/fluid/framework/ir/graph_pattern_detector.h
@@ -39,14 +39,24 @@ struct PDNode {
  // tell whether an ir::Node* is a candidate for a PDNode.
  using teller_t = std::function<bool(Node*)>;
  enum class Type { kOp, kVar };
+  enum class Role {
+    kUnknown,      // No role,
+    kInput,        // an input and will be retained,
+    kOutput,       // an output and will be retained,
+    kIntermediate  // will be removed after handler.
+  };
  // this link to others
  PDNode& LinksTo(const std::vector<PDNode*>& others);
  PDNode& LinksFrom(const std::vector<PDNode*>& others);
  bool Tell(Node* node) const {
-    PADDLE_ENFORCE(teller_ != nullptr, "teller should be set for a PDNode");
+    if (teller_) return teller_(node);
-    return teller_(node);
+    for (auto& asrt : asserts_) {
+      if (!asrt(node)) return false;
+    }
+    return true;
  }
  bool IsOp() const { return type_ == Type::kOp; }
@@ -54,10 +64,52 @@ struct PDNode {
  const std::string& name() const { return name_; }
-  PDNode(const PDNode&) = delete;
  PDNode& operator=(const PDNode&) = delete;
+  PDNode(const PDNode&) = delete;
+  // Mark this node as an input of a subgraph; it will be retained.
+  PDNode* AsInput() {
+    role_ = Role::kInput;
+    return this;
+  }
+  // Mark this node as an output of a subgraph; it will be retained.
+  PDNode* AsOutput() {
+    role_ = Role::kOutput;
+    return this;
+  }
+  // Mark this node as intermediate: it will be removed, so all of its links
+  // must stay inside the matched sub-graph.
+  PDNode* AsIntermediate() {
+    role_ = Role::kIntermediate;
+    return this;
+  }
+  bool IsIntermediate() const { return role_ == Role::kIntermediate; }
+  bool IsInput() const { return role_ == Role::kInput; }
+  bool IsOutput() const { return role_ == Role::kOutput; }
+  // Assertions, helper functions to simplify the pattern definition.
+  PDNode* assert_is_op();
+  PDNode* assert_is_op(const std::string& op_type);
+  PDNode* assert_is_var();
+  PDNode* assert_var_not_persistable();
+  PDNode* assert_is_persistable_var();
+  PDNode* assert_is_op_output(const std::string& op_type);
+  PDNode* assert_is_op_input(const std::string& op_type);
+  PDNode* assert_is_op_nth_input(const std::string& op_type,
+                                 const std::string& argument, int nth);
+  PDNode* assert_is_op_nth_output(const std::string& op_type,
+                                  const std::string& argument, int nth);
+  PDNode* assert_is_only_input_of_op(const std::string& op_type);
+  PDNode* assert_is_only_output_of_op(const std::string& op_type);
+  PDNode* assert_op_has_n_inputs(const std::string& op_type, size_t n);
+  PDNode* assert_op_has_n_outputs(const std::string& op_type, size_t n);
+  PDNode* assert_more(teller_t&& teller);
 private:
+  PDNode(PDPattern* pattern, const std::string& name = "",
+         Type type = Type::kVar)
+      : pattern_(pattern), name_(name), type_(type) {}
  PDNode(teller_t&& teller, PDPattern* pattern, const std::string& name = "",
         Type type = Type::kVar)
      : teller_(std::move(teller)),
@@ -71,10 +123,13 @@ struct PDNode {
  friend class PDPattern;
+  // Will be removed later.
  teller_t teller_;
+  std::vector<teller_t> asserts_;
  PDPattern* pattern_;
  std::string name_;
  Type type_;
+  Role role_{Role::kUnknown};
 };
 /*
@@ -87,19 +142,18 @@ struct PDNode {
 * This pattern can be defined as with the following pseudo codes
 *
 *     // Create two operator PDNodes.
- *     MUL = PDPattern.NewNode()
+ *     MUL = PDPattern.NewNode().assert_is_op("mul");
- *     ELE = PDPattern.NewNode()
+ *     ELE = PDPattern.NewNode().assert_is_op("elementwise_add");
 *     // Create the variable PDNodes.
- *     MUL_out = PDPattern.NewNode()
+ *     MUL_out = PDPattern.NewNode().assert_is_op_output("mul") \
- *     // Add teller to define some rules that help to filter the target Nodes.
+ *                                  .assert_is_op_input("elementwise_add") \
- *     MUL.teller = lambda(node): node->IsOp() && node->Op()->Type == "mul";
+ *                                  .AsIntermediate();
- *     ELE.teller = lambda(node): \
+ *     // Add relations.
- *                        node->IsOp() && node->Op()->Type == "elementwise_add";
+ *     MUL->LinksTo({MUL_out});
- *     MUL_out.teller = lambda(node): node->IsVar() && (MUL in node->inputs)
+ *     MUL_out->LinksTo({ELE});
- *                                                  && (ELE in node->outputs)
 *
- * One can add more specific tellers for PDNodes or edges, both the Operator
+ * One can add more specific asserts for PDNodes or edges, both the Operator
- * and Variable Nodes can be ruled in PDNode.teller.
+ * and Variable Nodes can be ruled in PDNode.assert_more(...).
 *
 * PDPattern can record the general patterns, such as the pattern represents
 *   - Op in CPU -> Op in GPU -> Op in CPU, to find out where IO is abnormal.
@@ -112,7 +166,8 @@ class PDPattern {
  void AddEdge(PDNode* a, PDNode* b);
  PDNode* NewNode(PDNode::teller_t&& teller, const std::string& name = NewID());
-  PDNode* RetriveNode(const std::string& id) const;
+  PDNode* NewNode(const std::string& name = NewID());
+  PDNode* RetrieveNode(const std::string& id) const;
  const std::vector<std::unique_ptr<PDNode>>& nodes() const { return nodes_; }
  const std::vector<edge_t>& edges() const { return edges_; }
@@ -185,6 +240,9 @@ class GraphPatternDetector {
  // Remove overlapped match subgraphs, when overlapped, keep the previous one.
  void RemoveOverlappedMatch(std::vector<subgraph_t>* subgraphs);
+  // Validate whether the intermediate nodes are linked by external nodes.
+  void ValidateByNodeRole(std::vector<subgraph_t>* subgraphs);
 #ifdef PADDLE_WITH_TESTING
  FRIEND_TEST(GraphPatternDetecter, MarkPDNodesInGraph);
  FRIEND_TEST(GraphPatternDetecter, DetectPatterns);
@@ -228,6 +286,14 @@ static bool IsNthInput(Node* var, Node* op, const std::string& argument,
  return var->Name() == op->Op()->Input(argument)[nth];
 }
+static bool IsNthOutput(Node* var, Node* op, const std::string& argument,
+                        size_t nth) {
+  PADDLE_ENFORCE(var->IsVar());
+  PADDLE_ENFORCE(op->IsOp());
+  if (op->Op()->Output(argument).size() <= nth) return false;
+  return var->Name() == op->Op()->Output(argument)[nth];
+}
 static void GraphSafeRemoveNodes(Graph* graph,
                                 const std::unordered_set<const Node*>& nodes) {
  for (auto* node : nodes) {
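
Reviewer note: the new `PDNode::assert_*` helpers each append a predicate to `asserts_` and return `this`, and `Tell` accepts a node only if every predicate holds (when no legacy `teller_` is set). A minimal sketch of that chaining follows. `FakeNode` and `MiniPDNode` are hypothetical stand-ins for `ir::Node` and `PDNode`, and only two of the helpers are reproduced.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ir::Node.
struct FakeNode {
  bool is_op;
  std::string type;
};

// Hypothetical stand-in for PDNode: collects predicates, all must pass.
class MiniPDNode {
 public:
  using teller_t = std::function<bool(const FakeNode*)>;

  MiniPDNode* assert_is_op(const std::string& op_type) {
    asserts_.emplace_back([op_type](const FakeNode* x) {
      return x && x->is_op && x->type == op_type;
    });
    return this;
  }

  // Escape hatch for rules that have no dedicated helper.
  MiniPDNode* assert_more(teller_t teller) {
    asserts_.emplace_back(std::move(teller));
    return this;
  }

  // A node matches only if every collected predicate accepts it.
  // With no predicates at all, every node matches.
  bool Tell(const FakeNode* node) const {
    for (const auto& asrt : asserts_) {
      if (!asrt(node)) return false;
    }
    return true;
  }

 private:
  std::vector<teller_t> asserts_;
};
```

One consequence visible in the sketch: a pattern node created without any assert matches every graph node, so each pattern node needs at least one assert to be selective.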

--- a/paddle/fluid/framework/ir/graph_pattern_detector_tester.cc
+++ b/paddle/fluid/framework/ir/graph_pattern_detector_tester.cc
@@ -167,6 +167,39 @@ TEST(GraphPatternDetecter, MultiSubgraph) {
  ASSERT_LE(count, 2);
 }
+TEST(GraphPatternDetector, IntermediateCheck) {
+  ProgramDesc program;
+  Graph graph(program);
+  BuildGraph(&graph);
+  // o2->v2->o3
+  // o2->v2->o4
+  // Check the o2+o3 fuse; it should fail because v2 also links to o4.
+  GraphPatternDetector detector;
+  auto* op2 = detector.mutable_pattern()->NewNode(
+      [](Node* x) { return x && x->IsOp() && x->Name() == "op2"; }, "op2");
+  auto* op3 = detector.mutable_pattern()->NewNode(
+      [](Node* x) { return x && x->IsOp() && x->Name() == "op3"; }, "op3");
+  auto* v2 =
+      detector.mutable_pattern()
+          ->NewNode(
+              [](Node* x) { return x && x->IsVar() && x->Name() == "var2"; },
+              "var2")
+          ->AsIntermediate();
+  v2->LinksFrom({op2}).LinksTo({op3});
+  int count = 0;
+  detector(&graph, [&](const GraphPatternDetector::subgraph_t& g,
+                       Graph* graph) { ++count; });
+  EXPECT_EQ(count, 0);
+  count = 0;
+  v2->AsInput();
+  detector(&graph, [&](const GraphPatternDetector::subgraph_t& g,
+                       Graph* graph) { ++count; });
+  ASSERT_EQ(count, 1);
+}
 }  // namespace ir
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/fluid/framework/ir/graph_to_program_pass_test.cc
+++ b/paddle/fluid/framework/ir/graph_to_program_pass_test.cc
@@ -62,10 +62,10 @@ void BuildNoCircleGraph(Graph* g) {
  v2->outputs.push_back(o3);
  v2->outputs.push_back(o4);
  v2->inputs.push_back(o2);
-  // o2->v3->o5
+  // o4->v3->o5
-  o2->outputs.push_back(v3);
+  o4->outputs.push_back(v3);
  o5->inputs.push_back(v3);
-  v3->inputs.push_back(o2);
+  v3->inputs.push_back(o4);
  v3->outputs.push_back(o5);
  // o3-v4->o5
  o3->outputs.push_back(v4);

--- a/paddle/fluid/framework/ir/graph_viz_pass.cc
+++ b/paddle/fluid/framework/ir/graph_viz_pass.cc
@@ -16,13 +16,27 @@ limitations under the License. */
 #include <unordered_set>
 #include "paddle/fluid/framework/ir/graph_viz_pass.h"
+#include "paddle/fluid/framework/op_proto_maker.h"
 #include "paddle/fluid/inference/analysis/dot.h"
+#include "paddle/fluid/string/printf.h"
 namespace paddle {
 namespace framework {
 namespace ir {
-static const char kGraphVizPath[] = "graph_viz_path";
 using inference::analysis::Dot;
+namespace {
+const char kGraphVizPath[] = "graph_viz_path";
+std::string FormatName(const Node* node) {
+  if (!node->IsOp() || !node->Op() ||
+      !node->Op()->HasAttr(OpProtoAndCheckerMaker::OpNamescopeAttrName())) {
+    return node->Name();
+  }
+  const std::string full_scope = boost::get<std::string>(
+      node->Op()->GetAttr(OpProtoAndCheckerMaker::OpNamescopeAttrName()));
+  return string::Sprintf("%s%s", full_scope.c_str(), node->Name().c_str());
+}
+}  // namespace
 std::unique_ptr<ir::Graph> GraphVizPass::ApplyImpl(
    std::unique_ptr<ir::Graph> graph) const {
@@ -54,7 +68,7 @@ std::unique_ptr<ir::Graph> GraphVizPass::ApplyImpl(
  auto marked_nodes = ConsumeMarkedNodes(graph.get());
  // Create nodes
  for (const Node* n : graph->Nodes()) {
-    std::string node_id = n->Name() + "(" + std::to_string(n->id()) + ")";
+    std::string node_id = FormatName(n) + "(" + std::to_string(n->id()) + ")";
    if (n->IsOp()) {
      decltype(op_attrs) attr =
          marked_nodes.count(n) ? marked_op_attrs : op_attrs;

--- a/paddle/fluid/framework/ir/node.cc
+++ b/paddle/fluid/framework/ir/node.cc
@@ -18,6 +18,7 @@ namespace paddle {
 namespace framework {
 namespace ir {
 constexpr char Node::kControlDepVarName[];
+int Node::count_ = 0;
 }  // namespace ir
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/fluid/framework/ir/node.h
+++ b/paddle/fluid/framework/ir/node.h
@@ -29,37 +29,37 @@ class Node {
  enum class Type { kOperation, kVariable };
  static constexpr char kControlDepVarName[] = "__control_var";
-  explicit Node(const std::string& name, Type type, int id = -1)
+  explicit Node(const std::string& name, Type type)
      : name_(name),
        var_desc_(nullptr),
        op_desc_(nullptr),
        type_(type),
-        id_(id) {}
+        id_(count_++) {}
-  explicit Node(VarDesc* var_desc, int id = -1)
+  explicit Node(VarDesc* var_desc)
      : name_(var_desc->Name()),
        var_desc_(new VarDesc(*var_desc)),
        op_desc_(nullptr),
        type_(Type::kVariable),
-        id_(id) {}
+        id_(count_++) {}
-  explicit Node(OpDesc* op_desc, int id = -1)
+  explicit Node(OpDesc* op_desc)
      : name_(op_desc->Type()),
        var_desc_(nullptr),
        op_desc_(new OpDesc(*op_desc, op_desc->Block())),
        type_(Type::kOperation),
-        id_(id) {}
+        id_(count_++) {}
  Type NodeType() const { return type_; }
  std::string Name() const { return name_; }
  VarDesc* Var() {
-    PADDLE_ENFORCE(type_ == Type::kVariable);
+    PADDLE_ENFORCE(IsVar());
    return var_desc_.get();
  }
-  OpDesc* Op() {
+  OpDesc* Op() const {
    PADDLE_ENFORCE(IsOp());
    return op_desc_.get();
  }
@@ -80,6 +80,9 @@ class Node {
  int id_;
 private:
+  friend class Graph;
+  static int count_;
+  static void ResetId() { count_ = 0; }
  DISABLE_COPY_AND_ASSIGN(Node);
 };
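
Reviewer note: with this change `Node` ids come from a class-wide static counter that `Graph`'s constructor resets, instead of a per-graph `node_count_`. The sketch below shows the resulting behavior. `CountedNode` is a hypothetical name, and `ResetId` is made public here only so the example is callable (in the diff it is private and reachable through `friend class Graph`).

```cpp
#include <cassert>

// Hypothetical stand-in for ir::Node: ids are handed out by a shared counter.
class CountedNode {
 public:
  CountedNode() : id_(count_++) {}
  int id() const { return id_; }

  // Public here for illustration only; the diff keeps it private.
  static void ResetId() { count_ = 0; }

 private:
  int id_;
  static int count_;
};

int CountedNode::count_ = 0;
```

Since the counter is shared by all instances, constructing a second graph resets numbering for nodes subsequently added to the first one, and a plain `int` counter like this is not safe for concurrent construction. These are observations about the sketch; the diff does not say how the real code is expected to be used across graphs or threads.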

--- a/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.cc
+++ b/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.cc
@@ -180,16 +180,16 @@ PDNode* BuildFCPattern(PDPattern* pattern, PDNode* fc_x) {
 std::unique_ptr<ir::Graph> SeqConcatFcFusePass::ApplyImpl(
    std::unique_ptr<ir::Graph> graph) const {
-  FusePassBase::Init(graph.get());
+  FusePassBase::Init("seq_concat_fc_fuse", graph.get());
  GraphPatternDetector detector;
  auto* pattern = detector.mutable_pattern();
  auto* concat_out = BuildSeqExpandConcatPattern(pattern);
  BuildFCPattern(pattern, concat_out);
 #define GET_NODE(id, pattern)                               \
-  PADDLE_ENFORCE(subgraph.count(pattern.RetriveNode(#id)), \
+  PADDLE_ENFORCE(subgraph.count(pattern.RetrieveNode(#id)), \
                 "pattern has no Node called %s", #id);     \
-  auto* id = subgraph.at(pattern.RetriveNode(#id));        \
+  auto* id = subgraph.at(pattern.RetrieveNode(#id));        \
  PADDLE_ENFORCE_NOT_NULL(id, "subgraph has no node %s", #id);
  detector(graph.get(), [&](const GraphPatternDetector::subgraph_t& subgraph,

--- a/paddle/fluid/framework/op_proto_maker.cc
+++ b/paddle/fluid/framework/op_proto_maker.cc
@@ -129,6 +129,9 @@ void OpProtoAndCheckerMaker::operator()(proto::OpProto* proto,
                                    "Optimized for variable")
      .SetDefault({});
+  AddAttr<std::string>(OpNamescopeAttrName(), "Operator name with namescope.")
+      .SetDefault("");
  Validate();
 }

--- a/paddle/fluid/framework/op_proto_maker.h
+++ b/paddle/fluid/framework/op_proto_maker.h
@@ -39,6 +39,7 @@ class OpProtoAndCheckerMaker {
 public:
  static const char *OpRoleAttrName() { return "op_role"; }
  static const char *OpRoleVarAttrName() { return "op_role_var"; }
+  static const char *OpNamescopeAttrName() { return "op_namescope"; }
  void operator()(proto::OpProto *proto, OpAttrChecker *attr_checker);

--- a/paddle/fluid/inference/analysis/analyzer.cc
+++ b/paddle/fluid/inference/analysis/analyzer.cc
@@ -93,7 +93,6 @@ class DfgPassManagerImpl final : public DfgPassManager {
  void AddGraphvizDebugerPass(Pass* pass) {
    auto* debuger_pass = pass->CreateGraphvizDebugerPass();
    if (debuger_pass) {
-      LOG(INFO) << " - register debug pass [" << debuger_pass->repr() << "]";
      Register(debuger_pass->repr(), debuger_pass);
    }
  }
@@ -102,7 +101,7 @@ class DfgPassManagerImpl final : public DfgPassManager {
 Analyzer::Analyzer() { Register("manager1", new DfgPassManagerImpl); }
 void Analyzer::Run(Argument* argument) {
-  // Ungly support fluid-to-ir-pass
+  // Ugly support fluid-to-ir-pass
  argument->Set(kFluidToIrPassesAttr,
                new std::vector<std::string>({
                    // Manual update the passes here.

--- a/paddle/fluid/inference/analysis/analyzer_tester.cc
+++ b/paddle/fluid/inference/analysis/analyzer_tester.cc
@@ -16,10 +16,13 @@
 #include <google/protobuf/text_format.h>
 #include <gtest/gtest.h>
+#include "paddle/fluid/framework/ir/fuse_pass_base.h"
 #include "paddle/fluid/framework/ir/pass.h"
 #include "paddle/fluid/inference/analysis/ut_helper.h"
+#include "paddle/fluid/inference/api/analysis_predictor.h"
 #include "paddle/fluid/inference/api/helper.h"
 #include "paddle/fluid/inference/api/paddle_inference_api.h"
+#include "paddle/fluid/inference/utils/singleton.h"
 #include "paddle/fluid/platform/profiler.h"
 DEFINE_string(infer_ditu_rnn_model, "", "model path for ditu RNN");
@@ -31,6 +34,8 @@ namespace paddle {
 namespace inference {
 namespace analysis {
+using namespace framework;
 TEST(Analyzer, analysis_without_tensorrt) {
  FLAGS_IA_enable_tensorrt_subgraph_engine = false;
  Argument argument;
@@ -311,6 +316,20 @@ void TestDituRNNPrediction(const std::string &model_path,
      EXPECT_NEAR(data[i], base_data[i], 1e-3);
    }
  }
+  if (use_analysis && activate_ir) {
+    AnalysisPredictor *analysis_predictor =
+        dynamic_cast<AnalysisPredictor *>(predictor.get());
+    auto &fuse_statis = analysis_predictor->analysis_argument()
+                            .Get<std::unordered_map<std::string, int>>(
+                                framework::ir::kFuseStatisAttr);
+    for (auto &item : fuse_statis) {
+      LOG(INFO) << "fused " << item.first << " " << item.second;
+    }
+    ASSERT_TRUE(fuse_statis.count("fc"));
+    EXPECT_EQ(fuse_statis.at("fc"), 1);
+  }
 }
 // Directly infer with the original model.

--- a/paddle/fluid/inference/analysis/argument.h
+++ b/paddle/fluid/inference/analysis/argument.h
@@ -64,7 +64,8 @@ struct Argument {
  template <typename T>
  void Set(const std::string& key, T* data) {
    PADDLE_ENFORCE_NOT_NULL(data);
-    PADDLE_ENFORCE(!attrs_.count(key), "duplicate attr called %s", key);
+    PADDLE_ENFORCE(!attrs_.count(key), "Duplicate set Argument's attr [%s]",
+                   key);
    attrs_[key] = data;
    attr_deleters_[key] = [data, key, this]() {
      VLOG(3) << "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";

--- a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.cc
+++ b/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.cc
@@ -15,6 +15,7 @@
 #include "paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h"
 #include <vector>
 #include "paddle/fluid/framework/block_desc.h"
+#include "paddle/fluid/framework/ir/fuse_pass_base.h"
 #include "paddle/fluid/framework/op_desc.h"
 #include "paddle/fluid/framework/proto_desc.h"
 #include "paddle/fluid/inference/analysis/analyzer.h"
@@ -34,7 +35,6 @@ std::vector<std::string> ExtractParameters(
 bool DataFlowGraphToFluidPass::Initialize(Argument *argument) {
  ANALYSIS_ARGUMENT_CHECK_FIELD(argument)
  ANALYSIS_ARGUMENT_CHECK_FIELD(argument->origin_program_desc)
-  PADDLE_ENFORCE(!argument->transformed_program_desc);
  // The transformed_program_desc should inherit all the VarDesc and BlockDesc
  // from the original program desc. The operators of the main block(the first
  // block) should rewritten by data flow graph.
@@ -66,7 +66,7 @@ void DataFlowGraphToFluidPass::Run(DataFlowGraph *graph) {
    }
  }
-  if (argument_->Has("param_scope")) {
+  if (argument_->Has(framework::ir::kParamScopeAttr)) {
    LOG(WARNING) << "parameter changes in the scope takes effect";
  }

--- a/paddle/fluid/inference/analysis/fluid_to_ir_pass.cc
+++ b/paddle/fluid/inference/analysis/fluid_to_ir_pass.cc
@@ -14,6 +14,7 @@
 #include "paddle/fluid/inference/analysis/fluid_to_ir_pass.h"
 #include "paddle/fluid/framework/executor.h"
+#include "paddle/fluid/framework/ir/fuse_pass_base.h"
 #include "paddle/fluid/inference/io.h"
 #include "paddle/fluid/platform/device_context.h"
 #include "paddle/fluid/platform/place.h"
@@ -26,11 +27,11 @@ void FluidToIrPass::EnableParamModify(const std::string &model_dir,
                                      const std::string &prog_file,
                                      const std::string &param_file) {
  PADDLE_ENFORCE(argument_);
-  argument_->Set("param_scope", new framework::Scope);
+  argument_->Set(framework::ir::kParamScopeAttr, new framework::Scope);
  // Load parameters.
  VLOG(3) << "Loading parameters from " << model_dir;
-  LoadParams(&argument_->Get<framework::Scope>("param_scope"), model_dir,
+  LoadParams(&argument_->Get<framework::Scope>(framework::ir::kParamScopeAttr),
-             prog_file, param_file);
+             model_dir, prog_file, param_file);
 }
 bool FluidToIrPass::LoadParams(framework::Scope *scope, const std::string &dir,

--- a/paddle/fluid/inference/analysis/fluid_to_ir_pass.h
+++ b/paddle/fluid/inference/analysis/fluid_to_ir_pass.h
@@ -14,12 +14,14 @@
 #pragma once
+#include "paddle/fluid/framework/ir/fuse_pass_base.h"
 #include "paddle/fluid/inference/analysis/ir_pass_manager.h"
 #include "paddle/fluid/inference/analysis/pass.h"
 namespace paddle {
 namespace inference {
 namespace analysis {
+using namespace framework;
 static const char kFluidToIrPassesAttr[] = "__fluid_to_ir_passes__";
@@ -45,13 +47,12 @@ class FluidToIrPass final : public DataFlowGraphPass {
    ANALYSIS_ARGUMENT_CHECK_FIELD(argument->fluid_model_program_path);
    // Load program.
    auto program = LoadProgramDesc(*argument->fluid_model_program_path);
-    argument->origin_program_desc.reset(
+    argument->origin_program_desc.reset(new proto::ProgramDesc(program));
-        new framework::proto::ProgramDesc(program));
    // Create main data flow graph.
    if (!argument->main_dfg) {
      argument->main_dfg.reset(new DataFlowGraph);
    }
-    argument->Set("ir_program_desc", new framework::ProgramDesc(program));
+    argument->Set("ir_program_desc", new ProgramDesc(program));
    LOG(INFO) << "Loading parameters";
    // Load parameters to argument if needed.
@@ -73,15 +74,15 @@ class FluidToIrPass final : public DataFlowGraphPass {
  void Run(DataFlowGraph *graph) override {
    // Call all the IR Passes
-    IRPassManager ir_passes(
+    IRPassManager ir_passes(argument_->Get<ProgramDesc>("ir_program_desc"),
-        argument_->Get<framework::ProgramDesc>("ir_program_desc"), nullptr);
+                            nullptr);
    // Pass the scope from analysis to IR if needed.
-    if (argument_->Has("param_scope")) {
+    if (argument_->Has(ir::kParamScopeAttr)) {
      // Only the address is passed here. Note that IR doesn't own the scope, so
      // the real scope in analysis must stay alive during the IR phase.
      ir_passes.graph().Set(
-          "param_scope", new framework::Scope *(
+          ir::kParamScopeAttr,
-                             &argument_->Get<framework::Scope>("param_scope")));
+          new Scope *(&argument_->Get<Scope>(ir::kParamScopeAttr)));
    }
    const auto &ir_passes_to_apply =
@@ -90,6 +91,14 @@ class FluidToIrPass final : public DataFlowGraphPass {
    PADDLE_ENFORCE(argument_->main_dfg.get());
    argument_->main_dfg->Build(ir_passes.graph());
+    // Inherit the fuse statistics from the IR graph.
+    if (ir_passes.graph().Has(ir::kFuseStatisAttr)) {
+      argument_->Set(
+          ir::kFuseStatisAttr,
+          new std::unordered_map<std::string, int>(
+              ir_passes.graph().Get<std::unordered_map<std::string, int>>(
+                  ir::kFuseStatisAttr)));
+    }
  }
  void EnableParamModify(const std::string &model_dir,
@@ -100,7 +109,7 @@ class FluidToIrPass final : public DataFlowGraphPass {
 private:
  // Load parameters from a single file or from a directory.
-  bool LoadParams(framework::Scope *scope, const std::string &dir,
+  bool LoadParams(Scope *scope, const std::string &dir,
                  const std::string &prog_file, const std::string &param_file);
 private:

--- a/paddle/fluid/inference/analysis/ir_pass_manager.cc
+++ b/paddle/fluid/inference/analysis/ir_pass_manager.cc
@@ -14,6 +14,7 @@
 #include "paddle/fluid/inference/analysis/ir_pass_manager.h"
 #include <string>
+#include "paddle/fluid/framework/ir/fuse_pass_base.h"
 #include "paddle/fluid/framework/ir/graph.h"
 #include "paddle/fluid/framework/scope.h"
@@ -25,7 +26,8 @@ IRPassManager::IRPassManager(const ProgramDesc &program,
                             framework::Scope *scope)
    : program_(program) {
  graph_.reset(new framework::ir::Graph(program));
-  if (scope) graph_->Set("param_scope", new framework::Scope *(scope));
+  if (scope)
+    graph_->Set(framework::ir::kParamScopeAttr, new framework::Scope *(scope));
 }
 void IRPassManager::Apply(const std::vector<std::string> &passes) {

--- a/paddle/fluid/inference/api/analysis_predictor.cc
+++ b/paddle/fluid/inference/api/analysis_predictor.cc
@@ -12,31 +12,18 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
+#include "paddle/fluid/inference/api/analysis_predictor.h"
 #include <memory>
+#include "paddle/fluid/framework/ir/fuse_pass_base.h"
 #include "paddle/fluid/framework/ir/pass.h"
 #include "paddle/fluid/framework/scope.h"
-#include "paddle/fluid/inference/analysis/analyzer.h"
-#include "paddle/fluid/inference/api/api_impl.h"
 #include "paddle/fluid/inference/api/paddle_inference_api.h"
 #include "paddle/fluid/inference/utils/singleton.h"
 namespace paddle {
-using inference::analysis::Argument;
+bool AnalysisPredictor::Init(
-using inference::Singleton;
+    const std::shared_ptr<framework::Scope>& parent_scope) {
-using inference::analysis::Analyzer;
-using framework::proto::ProgramDesc;
-/* This predictor is based on the original native predictor with IR and Analysis
- * support. It will optimize IR and Parameters in the runtime.
- * TODO(Superjomn) Replace the Navive predictor?
- */
-class AnalysisPredictor : public NativePaddlePredictor {
- public:
-  explicit AnalysisPredictor(const NativeConfig& config)
-      : NativePaddlePredictor(config), config_(config) {}
-  bool Init(const std::shared_ptr<framework::Scope>& parent_scope) {
  VLOG(3) << "Predictor::init()";
  if (config_.use_gpu) {
    place_ = paddle::platform::CUDAPlace(config_.device);
@@ -58,8 +45,8 @@ class AnalysisPredictor : public NativePaddlePredictor {
  if (!config_.model_dir.empty()) {
    // Parameters are saved in separate files sited in
    // the specified `dirname`.
-      inference_program_ = paddle::inference::Load(
+    inference_program_ = paddle::inference::Load(executor_.get(), scope_.get(),
-          executor_.get(), scope_.get(), config_.model_dir);
+                                                 config_.model_dir);
  } else if (!config_.prog_file.empty() && !config_.param_file.empty()) {
    // All parameters are saved in a single file.
    // The file names should be consistent with that used
@@ -78,56 +65,43 @@ class AnalysisPredictor : public NativePaddlePredictor {
  PADDLE_ENFORCE(scope_.get());
  executor_->CreateVariables(*inference_program_,
                             sub_scope_ ? sub_scope_ : scope_.get(), 0);
  // Collect the feed and fetch ops of the inference program
-    feed_target_names_ = inference_program_->GetFeedTargetNames();
+  PrepareFeedFetch();
-    fetch_target_names_ = inference_program_->GetFetchTargetNames();
  return true;
-  }
+}
-  bool Run(const std::vector<PaddleTensor>& inputs,
-           std::vector<PaddleTensor>* output_data,
-           int batch_size = -1) override {
-    return NativePaddlePredictor::Run(inputs, output_data, batch_size);
-  }
-  void OptimizeInferenceProgram() {
+void AnalysisPredictor::OptimizeInferenceProgram() {
  LOG(INFO) << "optimize begin";
  FLAGS_IA_enable_ir = true;
  FLAGS_IA_enable_tensorrt_subgraph_engine = false;
  FLAGS_IA_output_storage_path = "";  // Don't output the model.
  // Analyze inference_program
-    Argument argument;
  if (!config_.model_dir.empty()) {
-      argument.fluid_model_dir.reset(new std::string(config_.model_dir));
+    argument_.fluid_model_dir.reset(new std::string(config_.model_dir));
  } else {
    PADDLE_ENFORCE(
        !config_.param_file.empty(),
        "Either model_dir or (param_file, prog_file) should be set.");
    PADDLE_ENFORCE(!config_.prog_file.empty());
-      argument.fluid_model_program_path.reset(
+    argument_.fluid_model_program_path.reset(
        new std::string(config_.prog_file));
-      argument.fluid_model_param_path.reset(
+    argument_.fluid_model_param_path.reset(new std::string(config_.param_file));
-          new std::string(config_.param_file));
  }
-    argument.origin_program_desc.reset(
+  argument_.origin_program_desc.reset(
      new ProgramDesc(*inference_program_->Proto()));
-    Singleton<Analyzer>::Global().Run(&argument);
+  Analyzer().Run(&argument_);
-    CHECK(argument.transformed_program_desc);
+  CHECK(argument_.transformed_program_desc);
  VLOG(5) << "to prepare executor";
  // LOG(INFO) << "transformed_program_desc " <<
  // argument_.transformed_program_desc->DebugString();
  inference_program_.reset(
-        new framework::ProgramDesc(*argument.transformed_program_desc));
+      new framework::ProgramDesc(*argument_.transformed_program_desc));
-    PADDLE_ENFORCE(argument.Has("param_scope"));
+  PADDLE_ENFORCE(argument_.Has(framework::ir::kParamScopeAttr));
  // Update scope.
-    scope_.reset(argument.Release<framework::Scope>("param_scope"));
+  scope_.reset(
+      argument_.Release<framework::Scope>(framework::ir::kParamScopeAttr));
  LOG(INFO) << "optimize end ==";
-  }
+}
- private:
-  NativeConfig config_;
-};
 template <>
 std::unique_ptr<PaddlePredictor> CreatePaddlePredictor<

--- a/paddle/fluid/inference/api/analysis_predictor.h
+++ b/paddle/fluid/inference/api/analysis_predictor.h
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+#include "paddle/fluid/inference/analysis/analyzer.h"
+#include "paddle/fluid/inference/api/api_impl.h"
+#include "paddle/fluid/inference/api/paddle_inference_api.h"
+namespace paddle {
+using inference::analysis::Argument;
+using inference::analysis::Analyzer;
+using framework::proto::ProgramDesc;
+/* This predictor is based on the original native predictor with IR and Analysis
+ * support. It will optimize IR and Parameters at runtime.
+ * TODO(Superjomn) Replace the Native predictor?
+ */
+class AnalysisPredictor : public NativePaddlePredictor {
+ public:
+  explicit AnalysisPredictor(const NativeConfig& config)
+      : NativePaddlePredictor(config), config_(config) {}
+  bool Init(const std::shared_ptr<framework::Scope>& parent_scope);
+  bool Run(const std::vector<PaddleTensor>& inputs,
+           std::vector<PaddleTensor>* output_data,
+           int batch_size = -1) override {
+    return NativePaddlePredictor::Run(inputs, output_data, batch_size);
+  }
+  void OptimizeInferenceProgram();
+  Argument& analysis_argument() { return argument_; }
+ private:
+  NativeConfig config_;
+  Argument argument_;
+};
+}  // namespace paddle
--- a/paddle/fluid/inference/api/api_impl.cc
+++ b/paddle/fluid/inference/api/api_impl.cc
@@ -21,6 +21,7 @@ limitations under the License. */
 #include <utility>
 #include <vector>
+#include "paddle/fluid/framework/feed_fetch_method.h"
 #include "paddle/fluid/inference/api/api_impl.h"
 #include "paddle/fluid/platform/profiler.h"
@@ -57,6 +58,25 @@ std::string num2str(T a) {
 }
 }  // namespace
+void NativePaddlePredictor::PrepareFeedFetch() {
+  for (auto *op : inference_program_->Block(0).AllOps()) {
+    if (op->Type() == "feed") {
+      int idx = boost::get<int>(op->GetAttr("col"));
+      if (feeds_.size() <= static_cast<size_t>(idx)) {
+        feeds_.resize(idx + 1);
+      }
+      feeds_[idx] = op;
+      feed_names_[op->Output("Out")[0]] = idx;
+    } else if (op->Type() == "fetch") {
+      int idx = boost::get<int>(op->GetAttr("col"));
+      if (fetchs_.size() <= static_cast<size_t>(idx)) {
+        fetchs_.resize(idx + 1);
+      }
+      fetchs_[idx] = op;
+    }
+  }
+}
 bool NativePaddlePredictor::Init(
    std::shared_ptr<framework::Scope> parent_scope) {
  VLOG(3) << "Predictor::init()";
@@ -108,8 +128,7 @@ bool NativePaddlePredictor::Init(
                             sub_scope_ ? sub_scope_ : scope_.get(), 0);
  // Collect the feed and fetch ops of the inference program
-  feed_target_names_ = inference_program_->GetFeedTargetNames();
+  PrepareFeedFetch();
-  fetch_target_names_ = inference_program_->GetFetchTargetNames();
  return true;
 }
@@ -130,36 +149,21 @@ bool NativePaddlePredictor::Run(const std::vector<PaddleTensor> &inputs,
  Timer timer;
  timer.tic();
  // set feed variable
-  std::map<std::string, const framework::LoDTensor *> feed_targets;
  std::vector<framework::LoDTensor> feeds;
-  if (!SetFeed(inputs, &feeds)) {
+  framework::Scope *scope = sub_scope_ != nullptr ? sub_scope_ : scope_.get();
+  if (!SetFeed(inputs, scope)) {
    LOG(ERROR) << "fail to set feed";
    return false;
  }
-  for (size_t i = 0; i < feed_target_names_.size(); ++i) {
-    if (config_.specify_input_name) {
-      feed_targets[inputs[i].name] = &feeds[i];
-    } else {
-      feed_targets[feed_target_names_[i]] = &feeds[i];
-    }
-  }
-  // get fetch variable
-  std::map<std::string, framework::LoDTensor *> fetch_targets;
-  std::vector<framework::LoDTensor> fetchs;
-  fetchs.resize(fetch_target_names_.size());
-  for (size_t i = 0; i < fetch_target_names_.size(); ++i) {
-    fetch_targets[fetch_target_names_[i]] = &fetchs[i];
-  }
  // Run the inference program
  // if share variables, we need not create variables
  VLOG(4) << "Run prepared context";
-  executor_->RunPreparedContext(
+  executor_->RunPreparedContext(ctx_.get(), scope,
-      ctx_.get(), sub_scope_ != nullptr ? sub_scope_ : scope_.get(),
-      &feed_targets, &fetch_targets,
                                false, /* don't create local scope each time*/
                                false /* don't create variable each time */);
  VLOG(4) << "Finish prepared context";
-  if (!GetFetch(fetchs, output_data)) {
+  // get fetch variable
+  if (!GetFetch(output_data, scope)) {
    LOG(ERROR) << "fail to get fetches";
    return false;
  }
@@ -180,13 +184,13 @@ std::unique_ptr<PaddlePredictor> NativePaddlePredictor::Clone() {
 }
 bool NativePaddlePredictor::SetFeed(const std::vector<PaddleTensor> &inputs,
-                                    std::vector<framework::LoDTensor> *feeds) {
+                                    framework::Scope *scope) {
  VLOG(3) << "Predictor::set_feed";
-  if (inputs.size() != feed_target_names_.size()) {
+  if (inputs.size() != feeds_.size()) {
    LOG(ERROR) << "wrong feed input size.";
    return false;
  }
-  for (size_t i = 0; i < feed_target_names_.size(); ++i) {
+  for (size_t i = 0; i < inputs.size(); ++i) {
    framework::LoDTensor input;
    framework::DDim ddim = framework::make_ddim(inputs[i].shape);
    void *input_ptr;
@@ -208,29 +212,38 @@ bool NativePaddlePredictor::SetFeed(const std::vector<PaddleTensor> &inputs,
      lod.emplace_back(level);
    }
    input.set_lod(lod);
+    int idx = -1;
-    feeds->push_back(input);
+    if (config_.specify_input_name) {
+      idx = feed_names_[inputs[i].name];
+    } else {
+      idx = boost::get<int>(feeds_[i]->GetAttr("col"));
+    }
+    framework::SetFeedVariable(scope, input, "feed", idx);
  }
  return true;
 }
-bool NativePaddlePredictor::GetFetch(
+bool NativePaddlePredictor::GetFetch(std::vector<PaddleTensor> *outputs,
-    const std::vector<framework::LoDTensor> &fetchs,
+                                     framework::Scope *scope) {
-    std::vector<PaddleTensor> *outputs) {
  VLOG(3) << "Predictor::get_fetch";
-  outputs->resize(fetchs.size());
+  outputs->resize(fetchs_.size());
-  for (size_t i = 0; i < fetchs.size(); ++i) {
+  for (size_t i = 0; i < fetchs_.size(); ++i) {
+    int idx = boost::get<int>(fetchs_[i]->GetAttr("col"));
+    PADDLE_ENFORCE(static_cast<size_t>(idx) == i);
+    framework::LoDTensor &output =
+        framework::GetFetchVariable(*scope, "fetch", idx);
    // TODO(panyx0718): Support fetch of other types.
-    if (fetchs[i].type() != typeid(float)) {
+    if (output.type() != typeid(float)) {
      LOG(ERROR) << "only support fetching float now.";
      return false;
    }
    std::vector<int> shape;
-    auto dims_i = fetchs[i].dims();
+    auto dims_i = output.dims();
-    auto lod = fetchs[i].lod();
+    auto lod = output.lod();
-    const float *output_ptr = fetchs[i].data<float>();
+    const float *output_ptr = output.data<float>();
    // const int64_t* output_ptr = output.data<int64_t>();
-    auto num = fetchs[i].numel();
+    auto num = output.numel();
    std::vector<float> data;
    if (0 == lod.size()) {
      std::copy(output_ptr, output_ptr + num, std::back_inserter(data));
@@ -275,7 +288,7 @@ bool NativePaddlePredictor::GetFetch(
    }
    std::memcpy(buffer.data(), data.data(), buffer.length());
    // copy LoD
-    for (const auto &level : fetchs[i].lod()) {
+    for (const auto &level : output.lod()) {
      outputs->at(i).lod.emplace_back(level);
    }
    outputs->at(i).dtype = PaddleDType::FLOAT32;

--- a/paddle/fluid/inference/api/api_impl.h
+++ b/paddle/fluid/inference/api/api_impl.h
@@ -15,6 +15,7 @@
 #pragma once
 #include <glog/logging.h>
+#include <map>
 #include <memory>
 #include <string>
 #include <vector>
@@ -47,9 +48,11 @@ class NativePaddlePredictor : public PaddlePredictor {
 protected:
  bool SetFeed(const std::vector<PaddleTensor> &input_datas,
-               std::vector<framework::LoDTensor> *feeds);
+               framework::Scope *scope);
-  bool GetFetch(const std::vector<framework::LoDTensor> &fetchs,
+  bool GetFetch(std::vector<PaddleTensor> *output_data,
-                std::vector<PaddleTensor> *output_data);
+                framework::Scope *scope);
+  void PrepareFeedFetch();
  NativeConfig config_;
  platform::Place place_;
@@ -57,8 +60,9 @@ class NativePaddlePredictor : public PaddlePredictor {
  std::shared_ptr<framework::Scope> scope_;
  std::unique_ptr<framework::ExecutorPrepareContext> ctx_;
  std::unique_ptr<framework::ProgramDesc> inference_program_;
-  std::vector<std::string> feed_target_names_;
+  std::vector<framework::OpDesc *> feeds_;
-  std::vector<std::string> fetch_target_names_;
+  std::map<std::string, size_t> feed_names_;
+  std::vector<framework::OpDesc *> fetchs_;
  // Do not use unique_ptr, use parent scope to delete
  framework::Scope *sub_scope_{nullptr};
 };
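
Note on the feed/fetch bookkeeping above: PrepareFeedFetch replaces the name lists with op tables indexed by each op's "col" attribute. The following is a minimal standalone sketch of that slot-indexing idea, not the Paddle implementation. OpStub, FeedFetchTable, PlaceAt and BuildFeedFetchTable are hypothetical names introduced only for illustration, and the sketch assumes every feed and fetch op carries a non-negative "col" value.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for framework::OpDesc: only the fields this sketch needs.
struct OpStub {
  std::string type;      // "feed", "fetch", or any other op type
  int col;               // slot index, mirroring the "col" attribute
  std::string var_name;  // variable produced by a feed op
};

// Hypothetical container mirroring feeds_, fetchs_ and feed_names_.
struct FeedFetchTable {
  std::vector<const OpStub*> feeds;
  std::vector<const OpStub*> fetches;
  std::map<std::string, std::size_t> feed_names;
};

// Stores `op` at position `col`, growing the vector when the slot is new.
inline void PlaceAt(std::vector<const OpStub*>* slots, const OpStub* op) {
  assert(op->col >= 0);  // assumption: slot indices are never negative
  const auto idx = static_cast<std::size_t>(op->col);
  if (slots->size() <= idx) slots->resize(idx + 1, nullptr);
  (*slots)[idx] = op;
}

// The table keeps pointers into `ops`, so `ops` must outlive the result.
inline FeedFetchTable BuildFeedFetchTable(const std::vector<OpStub>& ops) {
  FeedFetchTable table;
  for (const OpStub& op : ops) {
    if (op.type == "feed") {
      PlaceAt(&table.feeds, &op);
      table.feed_names[op.var_name] = static_cast<std::size_t>(op.col);
    } else if (op.type == "fetch") {
      PlaceAt(&table.fetches, &op);
    }
    // Ops of any other type are ignored.
  }
  return table;
}
```

Because the slot comes from "col" rather than from the order in which ops are visited, a feed op that appears late in the block but has col 0 still lands in slot 0. Converting the index to an unsigned type once, before comparing it with size(), also avoids mixed signed/unsigned comparisons.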

--- a/paddle/fluid/inference/api/api_tensorrt_subgraph_engine.cc
+++ b/paddle/fluid/inference/api/api_tensorrt_subgraph_engine.cc
@@ -74,10 +74,8 @@ class TensorRTSubgraphPredictor : public NativePaddlePredictor {
    VLOG(5) << "to create variables";
    executor_->CreateVariables(*inference_program_,
                               sub_scope_ ? sub_scope_ : scope_.get(), 0);
    // Collect the feed and fetch ops of the inference program
-    feed_target_names_ = inference_program_->GetFeedTargetNames();
+    PrepareFeedFetch();
-    fetch_target_names_ = inference_program_->GetFetchTargetNames();
    return true;
  }

--- a/paddle/fluid/inference/tests/book/test_inference_nlp.cc
+++ b/paddle/fluid/inference/tests/book/test_inference_nlp.cc
@@ -21,6 +21,8 @@ limitations under the License. */
 #include "paddle/fluid/inference/tests/test_helper.h"
 #include "paddle/fluid/platform/cpu_helper.h"
+#include "paddle/fluid/framework/feed_fetch_method.h"
 DEFINE_string(model_path, "", "Directory of the inference model.");
 DEFINE_string(data_file, "", "File of input index data.");
 DEFINE_int32(repeat, 100, "Running the inference program repeat times");
@@ -124,14 +126,35 @@ void ThreadRunInfer(
  std::map<std::string, const paddle::framework::LoDTensor*> feed_targets;
  PADDLE_ENFORCE_EQ(feed_target_names.size(), 1UL);
+  // map the data of feed_targets to feed_holder
+  for (auto* op : inference_program->Block(0).AllOps()) {
+    if (op->Type() == "feed") {
+      std::string feed_target_name = op->Output("Out")[0];
+      int idx = boost::get<int>(op->GetAttr("col"));
+      paddle::framework::SetFeedVariable(scope, *feed_targets[feed_target_name],
+                                         "feed", idx);
+    }
+  }
  auto& inputs = jobs[tid];
  auto start_ms = GetCurrentMs();
  for (size_t i = 0; i < inputs.size(); ++i) {
    feed_targets[feed_target_names[0]] = inputs[i];
-    executor.RunPreparedContext(ctx.get(), &sub_scope, &feed_targets,
+    executor.RunPreparedContext(ctx.get(), &sub_scope,
-                                &fetch_targets, false /*create_local_scope*/);
+                                false /*create_local_scope*/);
  }
  auto stop_ms = GetCurrentMs();
+  // obtain the data of fetch_targets from fetch_holder
+  for (auto* op : inference_program->Block(0).AllOps()) {
+    if (op->Type() == "fetch") {
+      std::string fetch_target_name = op->Input("X")[0];
+      int idx = boost::get<int>(op->GetAttr("col"));
+      *fetch_targets[fetch_target_name] =
+          paddle::framework::GetFetchVariable(*scope, "fetch", idx);
+    }
+  }
  scope->DeleteScope(&sub_scope);
  LOG(INFO) << "Tid: " << tid << ", process " << inputs.size()
            << " samples, avg time per sample: "

--- a/paddle/fluid/operators/detection/CMakeLists.txt
+++ b/paddle/fluid/operators/detection/CMakeLists.txt
@@ -29,6 +29,7 @@ target_assign_op.cu)
 detection_library(polygon_box_transform_op SRCS polygon_box_transform_op.cc
 polygon_box_transform_op.cu)
 detection_library(rpn_target_assign_op SRCS rpn_target_assign_op.cc)
+detection_library(generate_proposal_labels_op SRCS generate_proposal_labels_op.cc)
 detection_library(generate_proposals_op SRCS generate_proposals_op.cc)
 #Export local libraries to parent
 set(DETECTION_LIBRARY ${LOCAL_DETECTION_LIBS} PARENT_SCOPE)
--- a/paddle/fluid/operators/detection/generate_proposal_labels_op.cc
+++ b/paddle/fluid/operators/detection/generate_proposal_labels_op.cc
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#include <math.h>
+#include <algorithm>
+#include <random>
+#include <string>
+#include <vector>
+#include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/operators/gather.h"
+#include "paddle/fluid/operators/math/concat.h"
+#include "paddle/fluid/operators/math/math_function.h"
+namespace paddle {
+namespace operators {
+using Tensor = framework::Tensor;
+using LoDTensor = framework::LoDTensor;
+const int kBoxDim = 4;
+template <typename T>
+void AppendRois(LoDTensor* out, int64_t offset, Tensor* to_add) {
+  auto* out_data = out->data<T>();
+  auto* to_add_data = to_add->data<T>();
+  memcpy(out_data + offset, to_add_data, to_add->numel() * sizeof(T));
+}
+class GenerateProposalLabelsOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+  void InferShape(framework::InferShapeContext* ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("RpnRois"),
+                   "Input(RpnRois) shouldn't be null.");
+    PADDLE_ENFORCE(ctx->HasInput("GtClasses"),
+                   "Input(GtClasses) shouldn't be null.");
+    PADDLE_ENFORCE(ctx->HasInput("GtBoxes"),
+                   "Input(GtBoxes) shouldn't be null.");
+    PADDLE_ENFORCE(ctx->HasInput("ImScales"),
+                   "Input(ImScales) shouldn't be null.");
+    PADDLE_ENFORCE(
+        ctx->HasOutput("Rois"),
+        "Output(Rois) of GenerateProposalLabelsOp should not be null");
+    PADDLE_ENFORCE(
+        ctx->HasOutput("LabelsInt32"),
+        "Output(LabelsInt32) of GenerateProposalLabelsOp should not be null");
+    PADDLE_ENFORCE(
+        ctx->HasOutput("BboxTargets"),
+        "Output(BboxTargets) of GenerateProposalLabelsOp should not be null");
+    PADDLE_ENFORCE(ctx->HasOutput("BboxInsideWeights"),
+                   "Output(BboxInsideWeights) of GenerateProposalLabelsOp "
+                   "should not be null");
+    PADDLE_ENFORCE(ctx->HasOutput("BboxOutsideWeights"),
+                   "Output(BboxOutsideWeights) of GenerateProposalLabelsOp "
+                   "should not be null");
+    auto rpn_rois_dims = ctx->GetInputDim("RpnRois");
+    auto gt_classes_dims = ctx->GetInputDim("GtClasses");
+    auto gt_boxes_dims = ctx->GetInputDim("GtBoxes");
+    auto im_scales_dims = ctx->GetInputDim("ImScales");
+    PADDLE_ENFORCE_EQ(rpn_rois_dims.size(), 2,
+                      "The rank of Input(RpnRois) must be 2.");
+    PADDLE_ENFORCE_EQ(gt_classes_dims.size(), 1,
+                      "The rank of Input(GtClasses) must be 1.");
+    PADDLE_ENFORCE_EQ(gt_boxes_dims.size(), 2,
+                      "The rank of Input(GtBoxes) must be 2.");
+    PADDLE_ENFORCE_EQ(im_scales_dims.size(), 1,
+                      "The rank of Input(ImScales) must be 1.");
+    int class_nums = ctx->Attrs().Get<int>("class_nums");
+    ctx->SetOutputDim("Rois", {-1, 4});
+    ctx->SetOutputDim("LabelsInt32", {-1});
+    ctx->SetOutputDim("BboxTargets", {-1, 4 * class_nums});
+    ctx->SetOutputDim("BboxInsideWeights", {-1, 4 * class_nums});
+    ctx->SetOutputDim("BboxOutsideWeights", {-1, 4 * class_nums});
+  }
+ protected:
+  framework::OpKernelType GetExpectedKernelType(
+      const framework::ExecutionContext& ctx) const override {
+    auto data_type = framework::GetDataTypeOfVar(ctx.InputVar("RpnRois"));
+    return framework::OpKernelType(data_type, platform::CPUPlace());
+  }
+};
+template <typename T>
+void Concat(const platform::CPUDeviceContext& context,
+            const Tensor& in_tensor_a, const Tensor& in_tensor_b,
+            Tensor* out_tensor) {
+  int axis = 0;
+  std::vector<Tensor> inputs;
+  inputs.emplace_back(in_tensor_a);
+  inputs.emplace_back(in_tensor_b);
+  math::ConcatFunctor<platform::CPUDeviceContext, T> concat_functor;
+  concat_functor(context, inputs, axis, out_tensor);
+}
+template <typename T>
+void BboxOverlaps(const Tensor& r_boxes, const Tensor& c_boxes,
+                  Tensor* overlaps) {
+  auto r_boxes_et = framework::EigenTensor<T, 2>::From(r_boxes);
+  auto c_boxes_et = framework::EigenTensor<T, 2>::From(c_boxes);
+  auto overlaps_et = framework::EigenTensor<T, 2>::From(*overlaps);
+  int r_num = r_boxes.dims()[0];
+  int c_num = c_boxes.dims()[0];
+  auto zero = static_cast<T>(0.0);
+  T r_box_area, c_box_area, x_min, y_min, x_max, y_max, inter_w, inter_h,
+      inter_area;
+  for (int i = 0; i < r_num; ++i) {
+    r_box_area = (r_boxes_et(i, 2) - r_boxes_et(i, 0) + 1) *
+                 (r_boxes_et(i, 3) - r_boxes_et(i, 1) + 1);
+    for (int j = 0; j < c_num; ++j) {
+      c_box_area = (c_boxes_et(j, 2) - c_boxes_et(j, 0) + 1) *
+                   (c_boxes_et(j, 3) - c_boxes_et(j, 1) + 1);
+      x_min = std::max(r_boxes_et(i, 0), c_boxes_et(j, 0));
+      y_min = std::max(r_boxes_et(i, 1), c_boxes_et(j, 1));
+      x_max = std::min(r_boxes_et(i, 2), c_boxes_et(j, 2));
+      y_max = std::min(r_boxes_et(i, 3), c_boxes_et(j, 3));
+      inter_w = std::max(x_max - x_min + 1, zero);
+      inter_h = std::max(y_max - y_min + 1, zero);
+      inter_area = inter_w * inter_h;
+      overlaps_et(i, j) = inter_area / (r_box_area + c_box_area - inter_area);
+    }
+  }
+}
+template <typename T>
+void BoxToDelta(int box_num, const Tensor& ex_boxes, const Tensor& gt_boxes,
+                const std::vector<float>& weights, Tensor* box_delta) {
+  auto ex_boxes_et = framework::EigenTensor<T, 2>::From(ex_boxes);
+  auto gt_boxes_et = framework::EigenTensor<T, 2>::From(gt_boxes);
+  auto box_delta_et = framework::EigenTensor<T, 2>::From(*box_delta);
+  T ex_w, ex_h, ex_ctr_x, ex_ctr_y, gt_w, gt_h, gt_ctr_x, gt_ctr_y;
+  for (int64_t i = 0; i < box_num; ++i) {
+    ex_w = ex_boxes_et(i, 2) - ex_boxes_et(i, 0) + 1;
+    ex_h = ex_boxes_et(i, 3) - ex_boxes_et(i, 1) + 1;
+    ex_ctr_x = ex_boxes_et(i, 0) + 0.5 * ex_w;
+    ex_ctr_y = ex_boxes_et(i, 1) + 0.5 * ex_h;
+    gt_w = gt_boxes_et(i, 2) - gt_boxes_et(i, 0) + 1;
+    gt_h = gt_boxes_et(i, 3) - gt_boxes_et(i, 1) + 1;
+    gt_ctr_x = gt_boxes_et(i, 0) + 0.5 * gt_w;
+    gt_ctr_y = gt_boxes_et(i, 1) + 0.5 * gt_h;
+    box_delta_et(i, 0) = (gt_ctr_x - ex_ctr_x) / ex_w / weights[0];
+    box_delta_et(i, 1) = (gt_ctr_y - ex_ctr_y) / ex_h / weights[1];
+    // The log-scale size deltas are not normalized by the box size.
+    box_delta_et(i, 2) = log(gt_w / ex_w) / weights[2];
+    box_delta_et(i, 3) = log(gt_h / ex_h) / weights[3];
+  }
+}
+template <typename T>
+std::vector<std::vector<int>> SampleFgBgGt(
+    const platform::CPUDeviceContext& context, Tensor* iou,
+    const int batch_size_per_im, const float fg_fraction, const float fg_thresh,
+    const float bg_thresh_hi, const float bg_thresh_lo,
+    std::minstd_rand engine) {
+  std::vector<int> fg_inds;
+  std::vector<int> bg_inds;
+  std::vector<int> gt_inds;
+  T* proposal_to_gt_overlaps = iou->mutable_data<T>(context.GetPlace());
+  int64_t row = iou->dims()[0];
+  int64_t col = iou->dims()[1];
+  float epsilon = 0.00001;
+  // Follow the Faster RCNN's implementation
+  for (int64_t i = 0; i < row; ++i) {
+    const T* v = proposal_to_gt_overlaps + i * col;
+    T max_overlap = *std::max_element(v, v + col);
+    if (max_overlap > fg_thresh) {
+      for (int64_t j = 0; j < col; ++j) {
+        T val = proposal_to_gt_overlaps[i * col + j];
+        auto diff = std::abs(max_overlap - val);
+        if (diff < epsilon) {
+          fg_inds.emplace_back(i);
+          gt_inds.emplace_back(j);
+          break;
+        }
+      }
+    } else {
+      if ((max_overlap >= bg_thresh_lo) && (max_overlap < bg_thresh_hi)) {
+        bg_inds.emplace_back(i);
+      }
+    }
+  }
+  // Reservoir Sampling
+  int fg_rois_per_im = std::floor(batch_size_per_im * fg_fraction);
+  int fg_rois_this_image = fg_inds.size();
+  int fg_rois_per_this_image = std::min(fg_rois_per_im, fg_rois_this_image);
+  std::uniform_real_distribution<float> uniform(0, 1);
+  const int64_t fg_size = static_cast<int64_t>(fg_inds.size());
+  if (fg_size > fg_rois_per_this_image) {
+    for (int64_t i = fg_rois_per_this_image; i < fg_size; ++i) {
+      int rng_ind = std::floor(uniform(engine) * i);
+      if (rng_ind < fg_rois_per_this_image) {
+        std::iter_swap(fg_inds.begin() + rng_ind, fg_inds.begin() + i);
+        std::iter_swap(gt_inds.begin() + rng_ind, gt_inds.begin() + i);
+      }
+    }
+  }
+  std::vector<int> new_fg_inds(fg_inds.begin(),
+                               fg_inds.begin() + fg_rois_per_this_image);
+  std::vector<int> new_gt_inds(gt_inds.begin(),
+                               gt_inds.begin() + fg_rois_per_this_image);
+  int bg_rois_per_image = batch_size_per_im - fg_rois_per_this_image;
+  int bg_rois_this_image = bg_inds.size();
+  int bg_rois_per_this_image = std::min(bg_rois_per_image, bg_rois_this_image);
+  const int64_t bg_size = static_cast<int64_t>(bg_inds.size());
+  if (bg_size > bg_rois_per_this_image) {
+    for (int64_t i = bg_rois_per_this_image; i < bg_size; ++i) {
+      int rng_ind = std::floor(uniform(engine) * i);
+      if (rng_ind < bg_rois_per_this_image)
+        std::iter_swap(bg_inds.begin() + rng_ind, bg_inds.begin() + i);
+    }
+  }
+  std::vector<int> new_bg_inds(bg_inds.begin(),
+                               bg_inds.begin() + bg_rois_per_this_image);
+  std::vector<std::vector<int>> res;
+  res.emplace_back(new_fg_inds);
+  res.emplace_back(new_bg_inds);
+  res.emplace_back(new_gt_inds);
+  return res;
+}
+template <typename T>
+void GatherBoxesLabels(const platform::CPUDeviceContext& context,
+                       const Tensor& boxes, const Tensor& gt_boxes,
+                       const Tensor& gt_classes,
+                       const std::vector<int>& fg_inds,
+                       const std::vector<int>& bg_inds,
+                       const std::vector<int>& gt_inds, Tensor* sampled_boxes,
+                       Tensor* sampled_labels, Tensor* sampled_gts) {
+  int fg_num = fg_inds.size();
+  int bg_num = bg_inds.size();
+  int gt_num = fg_num + bg_num;
+  Tensor fg_inds_t, bg_inds_t, gt_box_inds_t, gt_label_inds_t;
+  int* fg_inds_data = fg_inds_t.mutable_data<int>({fg_num}, context.GetPlace());
+  int* bg_inds_data = bg_inds_t.mutable_data<int>({bg_num}, context.GetPlace());
+  int* gt_box_inds_data =
+      gt_box_inds_t.mutable_data<int>({gt_num}, context.GetPlace());
+  int* gt_label_inds_data =
+      gt_label_inds_t.mutable_data<int>({fg_num}, context.GetPlace());
+  std::copy(fg_inds.begin(), fg_inds.end(), fg_inds_data);
+  std::copy(bg_inds.begin(), bg_inds.end(), bg_inds_data);
+  std::copy(gt_inds.begin(), gt_inds.end(), gt_box_inds_data);
+  std::copy(gt_inds.begin(), gt_inds.end(), gt_label_inds_data);
+  Tensor fg_boxes, bg_boxes, fg_labels, bg_labels;
+  fg_boxes.mutable_data<T>({fg_num, kBoxDim}, context.GetPlace());
+  CPUGather<T>(context, boxes, fg_inds_t, &fg_boxes);
+  bg_boxes.mutable_data<T>({bg_num, kBoxDim}, context.GetPlace());
+  CPUGather<T>(context, boxes, bg_inds_t, &bg_boxes);
+  Concat<T>(context, fg_boxes, bg_boxes, sampled_boxes);
+  CPUGather<T>(context, gt_boxes, gt_box_inds_t, sampled_gts);
+  fg_labels.mutable_data<int>({fg_num}, context.GetPlace());
+  CPUGather<int>(context, gt_classes, gt_label_inds_t, &fg_labels);
+  bg_labels.mutable_data<int>({bg_num}, context.GetPlace());
+  math::set_constant(context, &bg_labels, 0);
+  Concat<int>(context, fg_labels, bg_labels, sampled_labels);
+}
+template <typename T>
+std::vector<Tensor> SampleRoisForOneImage(
+    const platform::CPUDeviceContext& context, Tensor* rpn_rois,
+    Tensor* gt_classes, Tensor* gt_boxes, Tensor* im_scale,
+    const int batch_size_per_im, const float fg_fraction, const float fg_thresh,
+    const float bg_thresh_hi, const float bg_thresh_lo,
+    const std::vector<float>& bbox_reg_weights, const int class_nums,
+    std::minstd_rand engine) {
+  auto rpn_rois_et = framework::EigenTensor<T, 2>::From(*rpn_rois);
+  auto im_scale_data = im_scale->data<T>()[0];
+  rpn_rois_et = rpn_rois_et / im_scale_data;
+  Tensor boxes;
+  int proposals_num = gt_boxes->dims()[0] + rpn_rois->dims()[0];
+  boxes.mutable_data<T>({proposals_num, kBoxDim}, context.GetPlace());
+  Concat<T>(context, *gt_boxes, *rpn_rois, &boxes);
+  // Overlaps
+  Tensor proposal_to_gt_overlaps;
+  proposal_to_gt_overlaps.mutable_data<T>({proposals_num, gt_boxes->dims()[0]},
+                                          context.GetPlace());
+  BboxOverlaps<T>(boxes, *gt_boxes, &proposal_to_gt_overlaps);
+  // Generate proposal index
+  std::vector<std::vector<int>> fg_bg_gt = SampleFgBgGt<T>(
+      context, &proposal_to_gt_overlaps, batch_size_per_im, fg_fraction,
+      fg_thresh, bg_thresh_hi, bg_thresh_lo, engine);
+  std::vector<int> fg_inds = fg_bg_gt[0];
+  std::vector<int> bg_inds = fg_bg_gt[1];
+  std::vector<int> gt_inds = fg_bg_gt[2];
+  // Gather boxes and labels
+  Tensor sampled_boxes, sampled_labels, sampled_gts;
+  int boxes_num = fg_inds.size() + bg_inds.size();
+  framework::DDim bbox_dim({boxes_num, kBoxDim});
+  sampled_boxes.mutable_data<T>(bbox_dim, context.GetPlace());
+  sampled_labels.mutable_data<int>({boxes_num}, context.GetPlace());
+  sampled_gts.mutable_data<T>(bbox_dim, context.GetPlace());
+  GatherBoxesLabels<T>(context, boxes, *gt_boxes, *gt_classes, fg_inds, bg_inds,
+                       gt_inds, &sampled_boxes, &sampled_labels, &sampled_gts);
+  // Compute targets
+  Tensor bbox_targets_single;
+  bbox_targets_single.mutable_data<T>(bbox_dim, context.GetPlace());
+  BoxToDelta<T>(boxes_num, sampled_boxes, sampled_gts, bbox_reg_weights,
+                &bbox_targets_single);
+  // Scale rois: undo the division by im_scale applied to rpn_rois above
+  Tensor sampled_rois;
+  sampled_rois.mutable_data<T>(sampled_boxes.dims(), context.GetPlace());
+  auto sampled_rois_et = framework::EigenTensor<T, 2>::From(sampled_rois);
+  auto sampled_boxes_et = framework::EigenTensor<T, 2>::From(sampled_boxes);
+  sampled_rois_et = sampled_boxes_et * im_scale_data;
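+  // Expand box targets: each sampled box gets one kBoxDim-wide slot per
+  // class; only the slot of its own positive label receives the regression
+  // targets and unit inside/outside weights, all other slots stay zero.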
+  Tensor bbox_targets, bbox_inside_weights, bbox_outside_weights;
+  framework::DDim bbox_expand_dim({boxes_num, kBoxDim * class_nums});
+  bbox_targets.mutable_data<T>(bbox_expand_dim, context.GetPlace());
+  bbox_inside_weights.mutable_data<T>(bbox_expand_dim, context.GetPlace());
+  bbox_outside_weights.mutable_data<T>(bbox_expand_dim, context.GetPlace());
+  math::set_constant(context, &bbox_targets, 0.0);
+  math::set_constant(context, &bbox_inside_weights, 0.0);
+  math::set_constant(context, &bbox_outside_weights, 0.0);
+  auto* bbox_targets_single_data = bbox_targets_single.data<T>();
+  auto* sampled_labels_data = sampled_labels.data<int>();
+  auto* bbox_targets_data = bbox_targets.data<T>();
+  auto* bbox_inside_weights_data = bbox_inside_weights.data<T>();
+  auto* bbox_outside_weights_data = bbox_outside_weights.data<T>();
+  int width = kBoxDim * class_nums;
+  for (int64_t i = 0; i < boxes_num; ++i) {
+    int label = sampled_labels_data[i];
+    if (label > 0) {
+      int dst_idx = i * width + kBoxDim * label;
+      int src_idx = kBoxDim * i;
+      bbox_targets_data[dst_idx] = bbox_targets_single_data[src_idx];
+      bbox_targets_data[dst_idx + 1] = bbox_targets_single_data[src_idx + 1];
+      bbox_targets_data[dst_idx + 2] = bbox_targets_single_data[src_idx + 2];
+      bbox_targets_data[dst_idx + 3] = bbox_targets_single_data[src_idx + 3];
+      bbox_inside_weights_data[dst_idx] = 1;
+      bbox_inside_weights_data[dst_idx + 1] = 1;
+      bbox_inside_weights_data[dst_idx + 2] = 1;
+      bbox_inside_weights_data[dst_idx + 3] = 1;
+      bbox_outside_weights_data[dst_idx] = 1;
+      bbox_outside_weights_data[dst_idx + 1] = 1;
+      bbox_outside_weights_data[dst_idx + 2] = 1;
+      bbox_outside_weights_data[dst_idx + 3] = 1;
+    }
+  }
+  std::vector<Tensor> res;
+  res.emplace_back(sampled_rois);
+  res.emplace_back(sampled_labels);
+  res.emplace_back(bbox_targets);
+  res.emplace_back(bbox_inside_weights);
+  res.emplace_back(bbox_outside_weights);
+  return res;
+}
+template <typename T>
+class GenerateProposalLabelsKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& context) const override {
+    auto* rpn_rois = context.Input<LoDTensor>("RpnRois");
+    auto* gt_classes = context.Input<LoDTensor>("GtClasses");
+    auto* gt_boxes = context.Input<LoDTensor>("GtBoxes");
+    auto* im_scales = context.Input<LoDTensor>("ImScales");
+    auto* rois = context.Output<LoDTensor>("Rois");
+    auto* labels_int32 = context.Output<LoDTensor>("LabelsInt32");
+    auto* bbox_targets = context.Output<LoDTensor>("BboxTargets");
+    auto* bbox_inside_weights = context.Output<LoDTensor>("BboxInsideWeights");
+    auto* bbox_outside_weights =
+        context.Output<LoDTensor>("BboxOutsideWeights");
+    int batch_size_per_im = context.Attr<int>("batch_size_per_im");
+    float fg_fraction = context.Attr<float>("fg_fraction");
+    float fg_thresh = context.Attr<float>("fg_thresh");
+    float bg_thresh_hi = context.Attr<float>("bg_thresh_hi");
+    float bg_thresh_lo = context.Attr<float>("bg_thresh_lo");
+    std::vector<float> bbox_reg_weights =
+        context.Attr<std::vector<float>>("bbox_reg_weights");
+    int class_nums = context.Attr<int>("class_nums");
+    PADDLE_ENFORCE_EQ(rpn_rois->lod().size(), 1UL,
+                      "GenerateProposalLabelsOp rpn_rois needs 1 level of LoD");
+    PADDLE_ENFORCE_EQ(
+        gt_classes->lod().size(), 1UL,
+        "GenerateProposalLabelsOp gt_classes needs 1 level of LoD");
+    PADDLE_ENFORCE_EQ(gt_boxes->lod().size(), 1UL,
+                      "GenerateProposalLabelsOp gt_boxes needs 1 level of LoD");
+    int64_t n = static_cast<int64_t>(rpn_rois->lod().back().size() - 1);
+    rois->mutable_data<T>({n * batch_size_per_im, kBoxDim}, context.GetPlace());
+    labels_int32->mutable_data<int>({n * batch_size_per_im},
+                                    context.GetPlace());
+    bbox_targets->mutable_data<T>({n * batch_size_per_im, kBoxDim * class_nums},
+                                  context.GetPlace());
+    bbox_inside_weights->mutable_data<T>(
+        {n * batch_size_per_im, kBoxDim * class_nums}, context.GetPlace());
+    bbox_outside_weights->mutable_data<T>(
+        {n * batch_size_per_im, kBoxDim * class_nums}, context.GetPlace());
+    std::random_device rnd;
+    std::minstd_rand engine;
+    int seed =
+        context.Attr<bool>("fix_seed") ? context.Attr<int>("seed") : rnd();
+    engine.seed(seed);
+    framework::LoD lod;
+    std::vector<size_t> lod0(1, 0);
+    int64_t num_rois = 0;
+    auto& dev_ctx = context.device_context<platform::CPUDeviceContext>();
+    auto rpn_rois_lod = rpn_rois->lod().back();
+    auto gt_classes_lod = gt_classes->lod().back();
+    auto gt_boxes_lod = gt_boxes->lod().back();
+    for (int64_t i = 0; i < n; ++i) {
+      Tensor rpn_rois_slice =
+          rpn_rois->Slice(rpn_rois_lod[i], rpn_rois_lod[i + 1]);
+      Tensor gt_classes_slice =
+          gt_classes->Slice(gt_classes_lod[i], gt_classes_lod[i + 1]);
+      Tensor gt_boxes_slice =
+          gt_boxes->Slice(gt_boxes_lod[i], gt_boxes_lod[i + 1]);
+      Tensor im_scales_slice = im_scales->Slice(i, i + 1);
+      std::vector<Tensor> tensor_output = SampleRoisForOneImage<T>(
+          dev_ctx, &rpn_rois_slice, &gt_classes_slice, &gt_boxes_slice,
+          &im_scales_slice, batch_size_per_im, fg_fraction, fg_thresh,
+          bg_thresh_hi, bg_thresh_lo, bbox_reg_weights, class_nums, engine);
+      Tensor sampled_rois = tensor_output[0];
+      Tensor sampled_labels_int32 = tensor_output[1];
+      Tensor sampled_bbox_targets = tensor_output[2];
+      Tensor sampled_bbox_inside_weights = tensor_output[3];
+      Tensor sampled_bbox_outside_weights = tensor_output[4];
+      AppendRois<T>(rois, kBoxDim * num_rois, &sampled_rois);
+      AppendRois<int>(labels_int32, num_rois, &sampled_labels_int32);
+      AppendRois<T>(bbox_targets, kBoxDim * num_rois * class_nums,
+                    &sampled_bbox_targets);
+      AppendRois<T>(bbox_inside_weights, kBoxDim * num_rois * class_nums,
+                    &sampled_bbox_inside_weights);
+      AppendRois<T>(bbox_outside_weights, kBoxDim * num_rois * class_nums,
+                    &sampled_bbox_outside_weights);
+      num_rois += sampled_rois.dims()[0];
+      lod0.emplace_back(num_rois);
+    }
+    lod.emplace_back(lod0);
+    rois->set_lod(lod);
+    labels_int32->set_lod(lod);
+    bbox_targets->set_lod(lod);
+    bbox_inside_weights->set_lod(lod);
+    bbox_outside_weights->set_lod(lod);
+    rois->Resize({num_rois, kBoxDim});
+    labels_int32->Resize({num_rois});
+    bbox_targets->Resize({num_rois, kBoxDim * class_nums});
+    bbox_inside_weights->Resize({num_rois, kBoxDim * class_nums});
+    bbox_outside_weights->Resize({num_rois, kBoxDim * class_nums});
+  }
+};
+class GenerateProposalLabelsOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  void Make() override {
+    // TODO(buxingyuan): Add Document
+    AddInput("RpnRois", "RpnRois.");
+    AddInput("GtClasses", "GtClasses.");
+    AddInput("GtBoxes", "GtBoxes.");
+    AddInput("ImScales", "ImScales.");
+    AddOutput("Rois", "Rois.");
+    AddOutput("LabelsInt32", "LabelsInt32.");
+    AddOutput("BboxTargets", "BboxTargets.");
+    AddOutput("BboxInsideWeights", "BboxInsideWeights.");
+    AddOutput("BboxOutsideWeights", "BboxOutsideWeights.");
+    AddAttr<int>("batch_size_per_im", "batch_size_per_im");
+    AddAttr<float>("fg_fraction", "fg_fraction");
+    AddAttr<float>("fg_thresh", "fg_thresh");
+    AddAttr<float>("bg_thresh_hi", "bg_thresh_hi");
+    AddAttr<float>("bg_thresh_lo", "bg_thresh_lo");
+    AddAttr<std::vector<float>>("bbox_reg_weights", "bbox_reg_weights");
+    AddAttr<int>("class_nums", "class_nums");
+    AddAttr<bool>("fix_seed", "fix_seed").SetDefault(false);
+    AddAttr<int>("seed", "seed").SetDefault(0);
+    AddComment(R"DOC(
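+Generate Proposal Labels Operator.
+For each image, samples foreground and background RoIs from the RPN proposals
+and ground-truth boxes, and outputs their class labels, bounding-box
+regression targets, and the inside/outside weights of those targets.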
+)DOC");
+  }
+};
+}  // namespace operators
+}  // namespace paddle
+namespace ops = paddle::operators;
+REGISTER_OPERATOR(generate_proposal_labels, ops::GenerateProposalLabelsOp,
+                  ops::GenerateProposalLabelsOpMaker,
+                  paddle::framework::EmptyGradOpMaker);
+REGISTER_OP_CPU_KERNEL(generate_proposal_labels,
+                       ops::GenerateProposalLabelsKernel<float>,
+                       ops::GenerateProposalLabelsKernel<double>);
--- a/paddle/fluid/operators/detection/rpn_target_assign_op.cc
+++ b/paddle/fluid/operators/detection/rpn_target_assign_op.cc
@@ -86,7 +86,7 @@ class RpnTargetAssignKernel : public framework::OpKernel<T> {
                         std::minstd_rand engine,
                         std::vector<int>* inds) const {
    std::uniform_real_distribution<float> uniform(0, 1);
-    const int64_t size = static_cast<int64_t>(inds->size());
+    const int64_t size = static_cast<int64_t>(inds->size() - offset);
    if (size > num) {
      for (int64_t i = num; i < size; ++i) {
        int rng_ind = std::floor(uniform(engine) * i);
@@ -126,7 +126,7 @@ class RpnTargetAssignKernel : public framework::OpKernel<T> {
                neg_threshold, target_label_data, fg_inds, bg_inds);
    // Reservoir Sampling
    ReservoirSampling(fg_num, fg_offset, engine, fg_inds);
-    int bg_num = rpn_batch_size - fg_inds->size();
+    int bg_num = rpn_batch_size - (fg_inds->size() - fg_offset);
    ReservoirSampling(bg_num, bg_offset, engine, bg_inds);
  }

--- a/paddle/fluid/operators/elementwise_op_function.h
+++ b/paddle/fluid/operators/elementwise_op_function.h
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */
 #pragma once
 #include <glog/logging.h>
 #include <algorithm>
 #include <vector>
@@ -46,9 +47,9 @@ namespace operators {
 *    pre=2*3, n=4*5, post=1
 *    x.shape(6, 20, 1) * y.shape(1, 20, 1).broadcast(6, 20, 1)
 */
-inline void get_mid_dims(const framework::DDim& x_dims,
+inline void get_mid_dims(const framework::DDim &x_dims,
-                         const framework::DDim& y_dims, const int axis,
+                         const framework::DDim &y_dims, const int axis,
-                         int* pre, int* n, int* post) {
+                         int *pre, int *n, int *post) {
  *pre = 1;
  *n = 1;
  *post = 1;
@@ -68,7 +69,7 @@ inline void get_mid_dims(const framework::DDim& x_dims,
 }
 inline framework::DDim trim_trailing_singular_dims(
-    const framework::DDim& dims) {
+    const framework::DDim &dims) {
  // Remove trailing dimensions of size 1 for y
  auto actual_dims_size = dims.size();
  for (; actual_dims_size != 0; --actual_dims_size) {
@@ -89,15 +90,16 @@ inline framework::DDim trim_trailing_singular_dims(
 template <typename T, typename DeviceContext>
 class RowwiseTransformIterator;
 template <typename T, typename DeviceContext>
 class MidWiseTransformIterator;
 template <typename T>
 class RowwiseTransformIterator<T, platform::CPUDeviceContext> {
 public:
-  RowwiseTransformIterator(const T* ptr, int n) : ptr_(ptr), i_(0), n_(n) {}
+  RowwiseTransformIterator(const T *ptr, int n) : ptr_(ptr), i_(0), n_(n) {}
-  RowwiseTransformIterator<T, platform::CPUDeviceContext>& operator++() {
+  RowwiseTransformIterator<T, platform::CPUDeviceContext> &operator++() {
    ++i_;
    if (UNLIKELY(i_ == n_)) {
      i_ = 0;
@@ -105,20 +107,20 @@ class RowwiseTransformIterator<T, platform::CPUDeviceContext> {
    return *this;
  }
-  bool operator==(const RowwiseTransformIterator<T, platform::CPUDeviceContext>&
+  bool operator==(const RowwiseTransformIterator<T, platform::CPUDeviceContext>
-                      rhs) const {
+                      &rhs) const {
    return (ptr_ + i_) == &(*rhs);
  }
-  bool operator!=(const RowwiseTransformIterator<T, platform::CPUDeviceContext>&
+  bool operator!=(const RowwiseTransformIterator<T, platform::CPUDeviceContext>
-                      rhs) const {
+                      &rhs) const {
    return (ptr_ + i_) != &(*rhs);
  }
-  const T& operator*() { return ptr_[i_]; }
+  const T &operator*() { return ptr_[i_]; }
 private:
-  const T* ptr_;
+  const T *ptr_;
  int i_;
  int64_t n_;
 };
@@ -126,10 +128,10 @@ class RowwiseTransformIterator<T, platform::CPUDeviceContext> {
 template <typename T>
 class MidWiseTransformIterator<T, platform::CPUDeviceContext> {
 public:
-  MidWiseTransformIterator(const T* ptr, int n, int post)
+  MidWiseTransformIterator(const T *ptr, int n, int post)
      : ptr_(ptr), i_(0), j_(0), n_(n), post_(post) {}
-  MidWiseTransformIterator<T, platform::CPUDeviceContext>& operator++() {
+  MidWiseTransformIterator<T, platform::CPUDeviceContext> &operator++() {
    ++j_;
    if (UNLIKELY(j_ == post_)) {
      ++i_;
@@ -141,20 +143,20 @@ class MidWiseTransformIterator<T, platform::CPUDeviceContext> {
    return *this;
  }
-  bool operator==(const MidWiseTransformIterator<T, platform::CPUDeviceContext>&
+  bool operator==(const MidWiseTransformIterator<T, platform::CPUDeviceContext>
-                      rhs) const {
+                      &rhs) const {
    return (ptr_ + i_) == &(*rhs);
  }
-  bool operator!=(const MidWiseTransformIterator<T, platform::CPUDeviceContext>&
+  bool operator!=(const MidWiseTransformIterator<T, platform::CPUDeviceContext>
-                      rhs) const {
+                      &rhs) const {
    return (ptr_ + i_) != &(*rhs);
  }
-  const T& operator*() { return ptr_[i_]; }
+  const T &operator*() { return ptr_[i_]; }
 private:
-  const T* ptr_;
+  const T *ptr_;
  int64_t i_;
  int64_t j_;
  int64_t n_;
@@ -165,18 +167,18 @@ class MidWiseTransformIterator<T, platform::CPUDeviceContext> {
 template <typename T>
 class RowwiseTransformIterator<T, platform::CUDADeviceContext>
    : public thrust::iterator_adaptor<
-          RowwiseTransformIterator<T, platform::CUDADeviceContext>, const T*> {
+          RowwiseTransformIterator<T, platform::CUDADeviceContext>, const T *> {
 public:
  typedef thrust::iterator_adaptor<
-      RowwiseTransformIterator<T, platform::CUDADeviceContext>, const T*>
+      RowwiseTransformIterator<T, platform::CUDADeviceContext>, const T *>
      super_t;
-  HOSTDEVICE RowwiseTransformIterator(const T* x, int n)
+  HOSTDEVICE RowwiseTransformIterator(const T *x, int n)
      : super_t(x), begin_(x), n_(n) {}
  friend class thrust::iterator_core_access;
 private:
  unsigned int n_;
-  const T* begin_;
+  const T *begin_;
  HOSTDEVICE typename super_t::reference dereference() const {
    return *(begin_ + (this->base() - begin_) % n_);
  }
@@ -185,19 +187,19 @@ class RowwiseTransformIterator<T, platform::CUDADeviceContext>
 template <typename T>
 class MidWiseTransformIterator<T, platform::CUDADeviceContext>
    : public thrust::iterator_adaptor<
-          MidWiseTransformIterator<T, platform::CUDADeviceContext>, const T*> {
+          MidWiseTransformIterator<T, platform::CUDADeviceContext>, const T *> {
 public:
  typedef thrust::iterator_adaptor<
-      MidWiseTransformIterator<T, platform::CUDADeviceContext>, const T*>
+      MidWiseTransformIterator<T, platform::CUDADeviceContext>, const T *>
      super_t;
-  HOSTDEVICE MidWiseTransformIterator(const T* x, int n, int post)
+  HOSTDEVICE MidWiseTransformIterator(const T *x, int n, int post)
      : super_t(x), begin_(x), n_(n), post_(post) {}
  friend class thrust::iterator_core_access;
 private:
  unsigned int post_;
  unsigned int n_;
-  const T* begin_;
+  const T *begin_;
  HOSTDEVICE typename super_t::reference dereference() const {
    return *(begin_ + (((this->base() - begin_) / post_) % n_));
  }
@@ -208,8 +210,8 @@ template <typename Functor, typename T, typename DeviceContext,
          typename OutType = T>
 class TransformFunctor {
 public:
-  TransformFunctor(const framework::Tensor* x, const framework::Tensor* y,
+  TransformFunctor(const framework::Tensor *x, const framework::Tensor *y,
-                   framework::Tensor* z, const DeviceContext& ctx, Functor func)
+                   framework::Tensor *z, const DeviceContext &ctx, Functor func)
      : x_(x->data<T>()),
        y_(y->data<T>()),
        z_(z->mutable_data<OutType>(ctx.GetPlace())),
@@ -235,20 +237,20 @@ class TransformFunctor {
  }
 private:
-  const T* x_;
+  const T *x_;
-  const T* y_;
+  const T *y_;
-  OutType* z_;
+  OutType *z_;
  int64_t nx_;
-  const DeviceContext& ctx_;
+  const DeviceContext &ctx_;
  Functor func_;
 };
 #define EIGEN_FUNCTOR(name, eigen_op)                                          \
  struct Eigen##name##Functor {                                                \
    template <typename DeviceContext, typename T>                              \
-    inline void Run(const framework::Tensor* x, const framework::Tensor* y,    \
+    inline void Run(const framework::Tensor *x, const framework::Tensor *y,    \
-                    framework::Tensor* z,                                      \
+                    framework::Tensor *z,                                      \
-                    const framework::ExecutionContext& ctx) {                  \
+                    const framework::ExecutionContext &ctx) {                  \
      auto x_e = framework::EigenVector<T>::Flatten(*x);                       \
      auto y_e = framework::EigenVector<T>::Flatten(*y);                       \
      auto z_e = framework::EigenVector<T>::Flatten(*z);                       \
@@ -257,9 +259,9 @@ class TransformFunctor {
          eigen_op(x_e, y_e);                                                  \
    }                                                                          \
    template <typename DeviceContext, typename T>                              \
-    inline void RunBroadCast(const framework::Tensor* x,                       \
+    inline void RunBroadCast(const framework::Tensor *x,                       \
-                             const framework::Tensor* y, framework::Tensor* z, \
+                             const framework::Tensor *y, framework::Tensor *z, \
-                             const framework::ExecutionContext& ctx, int pre,  \
+                             const framework::ExecutionContext &ctx, int pre,  \
                             int n) {                                          \
      auto x_e = framework::EigenVector<T>::Flatten(*x);                       \
      auto y_e = framework::EigenVector<T>::Flatten(*y);                       \
@@ -272,10 +274,10 @@ class TransformFunctor {
          eigen_op(x_e, y_bcast);                                              \
    }                                                                          \
    template <typename DeviceContext, typename T>                              \
-    inline void RunBroadCast2(const framework::Tensor* x,                      \
+    inline void RunBroadCast2(const framework::Tensor *x,                      \
-                              const framework::Tensor* y,                      \
+                              const framework::Tensor *y,                      \
-                              framework::Tensor* z,                            \
+                              framework::Tensor *z,                            \
-                              const framework::ExecutionContext& ctx, int pre, \
+                              const framework::ExecutionContext &ctx, int pre, \
                              int n, int post) {                               \
      auto x_e = framework::EigenVector<T>::Flatten(*x);                       \
      auto y_e = framework::EigenVector<T>::Flatten(*y);                       \
@@ -290,23 +292,27 @@ class TransformFunctor {
  }
 #define EIGEN_ADD(x, y) ((x) + (y))
 EIGEN_FUNCTOR(Add, EIGEN_ADD);
 #define EIGEN_SUB(x, y) ((x) - (y))
 EIGEN_FUNCTOR(Sub, EIGEN_SUB);
 #define EIGEN_MUL(x, y) ((x) * (y))
 EIGEN_FUNCTOR(Mul, EIGEN_MUL);
 #define EIGEN_DIV(x, y) ((x) / (y))
 EIGEN_FUNCTOR(Div, EIGEN_DIV);
 template <typename T, typename DX_OP, typename DY_OP>
 struct ElemwiseGradNoBroadcast {
-  const T* x_;
+  const T *x_;
-  const T* y_;
+  const T *y_;
-  const T* out_;
+  const T *out_;
-  const T* dout_;
+  const T *dout_;
  HOSTDEVICE void operator()(size_t i) {
    if (dx_ != nullptr) {
@@ -319,14 +325,14 @@ struct ElemwiseGradNoBroadcast {
  DX_OP dx_op_;
  DY_OP dy_op_;
-  T* dx_;
+  T *dx_;
-  T* dy_;
+  T *dy_;
 };
 template <typename T, typename DX_OP, typename DY_OP>
-static void ElemwiseGradBroadcast1CPU(const T* x, const T* y, const T* out,
+static void ElemwiseGradBroadcast1CPU(const T *x, const T *y, const T *out,
-                                      const T* dout, int h, int w, DX_OP dx_op,
+                                      const T *dout, int h, int w, DX_OP dx_op,
-                                      DY_OP dy_op, T* dx, T* dy) {
+                                      DY_OP dy_op, T *dx, T *dy) {
  for (int i = 0; i < h; ++i) {
    for (int j = 0; j < w; ++j) {
      int x_offset = i * w + j;
@@ -348,8 +354,8 @@ static void ElemwiseGradBroadcast1CPU(const T* x, const T* y, const T* out,
 #ifdef __NVCC__
 template <typename T, typename DX_OP, typename DY_OP>
 static __global__ void ElemwiseGradBroadcast1CUDAKernel(
-    const T* x, const T* y, const T* out, const T* dout, int h, int w,
+    const T *x, const T *y, const T *out, const T *dout, int h, int w,
-    DX_OP dx_op, DY_OP dy_op, T* dx, T* dy) {
+    DX_OP dx_op, DY_OP dy_op, T *dx, T *dy) {
  int j = blockIdx.x;
  int i = threadIdx.x;
  int tid = threadIdx.x;
@@ -376,10 +382,10 @@ static __global__ void ElemwiseGradBroadcast1CUDAKernel(
 }
 template <typename T, typename DX_OP, typename DY_OP>
-static void ElemwiseGradBroadcast1CUDA(cudaStream_t stream, const T* x,
+static void ElemwiseGradBroadcast1CUDA(cudaStream_t stream, const T *x,
-                                       const T* y, const T* out, const T* dout,
+                                       const T *y, const T *out, const T *dout,
                                       int h, int w, DX_OP dx_op, DY_OP dy_op,
-                                       T* dx, T* dy) {
+                                       T *dx, T *dy) {
  int block_size = std::min(ELEMWISE_MAX_BLOCK_DIM, h);
  int gird_size = w;
  ElemwiseGradBroadcast1CUDAKernel<<<gird_size, block_size, 0, stream>>>(
@@ -389,9 +395,9 @@ static void ElemwiseGradBroadcast1CUDA(cudaStream_t stream, const T* x,
 #endif
 template <typename T, typename DX_OP, typename DY_OP>
-static void ElemwiseGradBroadcast2CPU(const T* x, const T* y, const T* out,
+static void ElemwiseGradBroadcast2CPU(const T *x, const T *y, const T *out,
-                                      const T* dout, int pre, int n, int post,
+                                      const T *dout, int pre, int n, int post,
-                                      DX_OP dx_op, DY_OP dy_op, T* dx, T* dy) {
+                                      DX_OP dx_op, DY_OP dy_op, T *dx, T *dy) {
  for (int i = 0; i < pre; ++i) {
    for (int j = 0; j < n; ++j) {
      for (int k = 0; k < post; ++k) {
@@ -416,8 +422,8 @@ static void ElemwiseGradBroadcast2CPU(const T* x, const T* y, const T* out,
 #ifdef __NVCC__
 template <typename T, typename DX_OP, typename DY_OP>
 static __global__ void ElemwiseGradBroadcast2CUDAKernel(
-    const T* x, const T* y, const T* out, const T* dout, int pre, int n,
+    const T *x, const T *y, const T *out, const T *dout, int pre, int n,
-    int post, DX_OP dx_op, DY_OP dy_op, T* dx, T* dy) {
+    int post, DX_OP dx_op, DY_OP dy_op, T *dx, T *dy) {
  int tid = threadIdx.x;
  int j = blockIdx.x;
@@ -453,10 +459,10 @@ static __global__ void ElemwiseGradBroadcast2CUDAKernel(
 }
 template <typename T, typename DX_OP, typename DY_OP>
-static void ElemwiseGradBroadcast2CUDA(cudaStream_t stream, const T* x,
+static void ElemwiseGradBroadcast2CUDA(cudaStream_t stream, const T *x,
-                                       const T* y, const T* out, const T* dout,
+                                       const T *y, const T *out, const T *dout,
                                       int pre, int n, int post, DX_OP dx_op,
-                                       DY_OP dy_op, T* dx, T* dy) {
+                                       DY_OP dy_op, T *dx, T *dy) {
  int block_size = std::min(ELEMWISE_MAX_BLOCK_DIM, pre * post);
  int gird_size = n;
  ElemwiseGradBroadcast2CUDAKernel<<<gird_size, block_size, 0, stream>>>(
@@ -467,11 +473,11 @@ static void ElemwiseGradBroadcast2CUDA(cudaStream_t stream, const T* x,
 template <typename DeviceContext, typename T, typename DX_OP, typename DY_OP>
 void ElemwiseGradComputeNoBroadcast(
-    const framework::ExecutionContext& ctx, const framework::DDim& x_dim,
+    const framework::ExecutionContext &ctx, const framework::DDim &x_dim,
-    const framework::DDim& y_dim, const framework::Tensor& x,
+    const framework::DDim &y_dim, const framework::Tensor &x,
-    const framework::Tensor& y, const framework::Tensor& out,
+    const framework::Tensor &y, const framework::Tensor &out,
-    const framework::Tensor& dout, int axis, framework::Tensor* dx,
+    const framework::Tensor &dout, int axis, framework::Tensor *dx,
-    framework::Tensor* dy, DX_OP dx_op, DY_OP dy_op) {
+    framework::Tensor *dy, DX_OP dx_op, DY_OP dy_op) {
  size_t N = static_cast<size_t>(framework::product(x_dim));
  platform::ForRange<DeviceContext> for_range(
      ctx.template device_context<DeviceContext>(), N);
@@ -483,11 +489,11 @@ void ElemwiseGradComputeNoBroadcast(
 template <typename DeviceContext, typename T, typename DX_OP, typename DY_OP>
 void ElemwiseGradComputeWithBroadcast(
-    const framework::ExecutionContext& ctx, const framework::DDim& x_dim,
+    const framework::ExecutionContext &ctx, const framework::DDim &x_dim,
-    const framework::DDim& y_dim_untrimed, const framework::Tensor& x,
+    const framework::DDim &y_dim_untrimed, const framework::Tensor &x,
-    const framework::Tensor& y, const framework::Tensor& out,
+    const framework::Tensor &y, const framework::Tensor &out,
-    const framework::Tensor& dout, int axis, framework::Tensor* dx,
+    const framework::Tensor &dout, int axis, framework::Tensor *dx,
-    framework::Tensor* dy, DX_OP dx_op, DY_OP dy_op) {
+    framework::Tensor *dy, DX_OP dx_op, DY_OP dy_op) {
  axis = (axis == -1 ? x_dim.size() - y_dim_untrimed.size() : axis);
  auto y_dim = trim_trailing_singular_dims(y_dim_untrimed);
  axis = (y_dim.size() == 0) ? x_dim.size() : axis;
@@ -531,14 +537,14 @@ void ElemwiseGradComputeWithBroadcast(
 }
 template <typename DeviceContext, typename T, typename DX_OP, typename DY_OP>
-void ElemwiseGradCompute(const framework::ExecutionContext& ctx,
+void ElemwiseGradCompute(const framework::ExecutionContext &ctx,
-                         const framework::Tensor& x, const framework::Tensor& y,
+                         const framework::Tensor &x, const framework::Tensor &y,
-                         const framework::Tensor& out,
+                         const framework::Tensor &out,
-                         const framework::Tensor& dout, int axis,
+                         const framework::Tensor &dout, int axis,
-                         framework::Tensor* dx, framework::Tensor* dy,
+                         framework::Tensor *dx, framework::Tensor *dy,
                         DX_OP dx_op, DY_OP dy_op) {
-  const framework::DDim& x_dim = x.dims();
+  const framework::DDim &x_dim = x.dims();
-  const framework::DDim& y_dim = y.dims();
+  const framework::DDim &y_dim = y.dims();
  if (x.dims() == y.dims()) {
    ElemwiseGradComputeNoBroadcast<DeviceContext, T, DX_OP, DY_OP>(
        ctx, x_dim, y_dim, x, y, out, dout, axis, dx, dy, dx_op, dy_op);
@@ -553,27 +559,27 @@ void ElemwiseGradCompute(const framework::ExecutionContext& ctx,
 // In elementwise_add, elementwise_sub, we use dout as fake X, Y, Out to reuse
 // elementwise code.
 template <typename DeviceContext, typename T, typename DX_OP, typename DY_OP>
-void ElemwiseExplicitGradCompute(const framework::ExecutionContext& ctx,
+void ElemwiseExplicitGradCompute(const framework::ExecutionContext &ctx,
-                                 const framework::Tensor& x,
+                                 const framework::Tensor &x,
-                                 const framework::Tensor& y,
+                                 const framework::Tensor &y,
-                                 const framework::Tensor& out,
+                                 const framework::Tensor &out,
-                                 const framework::Tensor& dout, int axis,
+                                 const framework::Tensor &dout, int axis,
-                                 framework::Tensor* dx, framework::Tensor* dy,
+                                 framework::Tensor *dx, framework::Tensor *dy,
                                 DX_OP dx_op, DY_OP dy_op) {
  if (dy == nullptr) {
-    const framework::DDim& dx_dims = dout.dims();
+    const framework::DDim &dx_dims = dout.dims();
    auto dy_dims = dx_dims;
    ElemwiseGradComputeNoBroadcast<DeviceContext, T, DX_OP, DY_OP>(
        ctx, dx_dims, dy_dims, x, y, out, dout, axis, dx, dy, dx_op, dy_op);
  } else {
    if (dout.dims() == dy->dims()) {
-      const framework::DDim& dx_dims = dout.dims();
+      const framework::DDim &dx_dims = dout.dims();
-      const framework::DDim& dy_dims = dy->dims();
+      const framework::DDim &dy_dims = dy->dims();
      ElemwiseGradComputeNoBroadcast<DeviceContext, T, DX_OP, DY_OP>(
          ctx, dx_dims, dy_dims, x, y, out, dout, axis, dx, dy, dx_op, dy_op);
    } else {  // Y is a scalar
      auto dx_dims = dout.dims();
-      const framework::DDim& dy_dims = dy->dims();
+      const framework::DDim &dy_dims = dy->dims();
      ElemwiseGradComputeWithBroadcast<DeviceContext, T, DX_OP, DY_OP>(
          ctx, dx_dims, dy_dims, x, y, out, dout, axis, dx, dy, dx_op, dy_op);
    }
@@ -583,13 +589,13 @@ void ElemwiseExplicitGradCompute(const framework::ExecutionContext& ctx,
 // Deprecated
 template <typename DeviceContext, typename T, typename functor,
          typename broadcastfunctor, typename broadcast2functor>
-void ElementwiseGradCompute(const framework::ExecutionContext& ctx,
+void ElementwiseGradCompute(const framework::ExecutionContext &ctx,
-                            const framework::Tensor* x,
+                            const framework::Tensor *x,
-                            const framework::Tensor* y,
+                            const framework::Tensor *y,
-                            const framework::Tensor* out,
+                            const framework::Tensor *out,
-                            const framework::Tensor* dout, int axis,
+                            const framework::Tensor *dout, int axis,
-                            framework::Tensor* dx, framework::Tensor* dy) {
+                            framework::Tensor *dx, framework::Tensor *dy) {
-  auto& place = *ctx.template device_context<DeviceContext>().eigen_device();
+  auto &place = *ctx.template device_context<DeviceContext>().eigen_device();
  auto x_dims = x->dims();
  auto y_dims = y->dims();
@@ -627,10 +633,10 @@ void ElementwiseGradCompute(const framework::ExecutionContext& ctx,
 template <typename Functor, typename DeviceContext, typename T,
          typename OutType = T>
-void ElementwiseComputeEx(const framework::ExecutionContext& ctx,
+void ElementwiseComputeEx(const framework::ExecutionContext &ctx,
-                          const framework::Tensor* x,
+                          const framework::Tensor *x,
-                          const framework::Tensor* y, int axis, Functor func,
+                          const framework::Tensor *y, int axis, Functor func,
-                          framework::Tensor* z) {
+                          framework::Tensor *z) {
  TransformFunctor<Functor, T, DeviceContext, OutType> functor(
      x, y, z, ctx.template device_context<DeviceContext>(), func);
@@ -661,5 +667,823 @@ void ElementwiseComputeEx(const framework::ExecutionContext& ctx,
  }
 }
+// FusedElemwiseAndAct
+// --- forward
+template <typename T, typename CompoundFunctor, bool KeepIntermediateOut>
+struct FusedElemwiseAndActNoBroadcast {
+  HOSTDEVICE void operator()(size_t i) {
+    T y_val = y_[i];
+    T x_val = x_[i];
+    if (KeepIntermediateOut) {
+      T intermediate_val = compound_functor_.GetIntermediateOut(x_val, y_val);
+      intermediate_out_[i] = intermediate_val;
+      out_[i] =
+          compound_functor_.GetOutUseIntermediateOut(x_val, intermediate_val);
+    } else {
+      out_[i] = compound_functor_.GetOut(x_val, y_val);
+    }
+  }
+  const T *x_;
+  const T *y_;
+  CompoundFunctor compound_functor_;
+  T *out_;
+  T *intermediate_out_;
+};
+// FusedElemwiseAndActBroadcast1:
+// In this case, X and Y can be reshaped to a matrix.
+// For example, shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5) and axis = -1 or 2:
+// X can be reshaped to (6, 20) and Y can be reshaped to (1, 20).
+template <typename T, typename CompoundFunctor, bool BcastY,
+          bool KeepIntermediateOut, bool SameShapeOfIntermediateOutAndOut>
+static void FusedElemwiseAndActBroadcast1CPU(const T *x, const T *y,
+                                             CompoundFunctor compound_functor,
+                                             int h, int w, T *out,
+                                             T *intermediate_out) {
+  for (int i = 0; i < h; ++i) {
+    for (int j = 0; j < w; ++j) {
+      int offset = i * w + j;
+      T y_val = BcastY ? y[j] : y[offset];
+      T x_val = BcastY ? x[offset] : x[j];
+      int64_t intermediate_out_offset;
+      if (KeepIntermediateOut) {
+        T intermediate_val = compound_functor.GetIntermediateOut(x_val, y_val);
+        if (SameShapeOfIntermediateOutAndOut) {
+          // for the case of f1(f2(x, y))
+          intermediate_out_offset = offset;
+        } else if (BcastY) {
+          intermediate_out_offset = j;
+        } else {
+          intermediate_out_offset = offset;
+        }
+        intermediate_out[intermediate_out_offset] = intermediate_val;
+        out[offset] =
+            compound_functor.GetOutUseIntermediateOut(x_val, intermediate_val);
+      } else {
+        out[offset] = compound_functor.GetOut(x_val, y_val);
+      }
+    }
+  }
+}
+// FusedElemwiseAndActBroadcast2:
+// In this case, X and Y can be reshaped to 3-D tensors.
+// For example, shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4) and axis = 1:
+// X can be reshaped to (2, 12, 5) and Y can be reshaped to (1, 12, 1),
+// so pre = 2, n = 12, post = 5.
+template <typename T, typename CompoundFunctor, bool BcastY,
+          bool KeepIntermediateOut, bool SameShapeOfIntermediateOutAndOut>
+static void FusedElemwiseAndActBroadcast2CPU(const T *x, const T *y, int pre,
+                                             int n, int post,
+                                             CompoundFunctor compound_functor,
+                                             T *out, T *intermediate_out) {
+  for (int i = 0; i < pre; ++i) {
+    for (int j = 0; j < n; ++j) {
+      for (int k = 0; k < post; ++k) {
+        int offset = i * n * post + j * post + k;
+        T y_val = BcastY ? y[j] : y[offset];
+        T x_val = BcastY ? x[offset] : x[j];
+        int64_t intermediate_out_offset;
+        if (KeepIntermediateOut) {
+          T intermediate_val =
+              compound_functor.GetIntermediateOut(x_val, y_val);
+          if (SameShapeOfIntermediateOutAndOut) {
+            // for the case of f1(f2(x, y))
+            intermediate_out_offset = offset;
+          } else if (BcastY) {
+            intermediate_out_offset = j;
+          } else {
+            intermediate_out_offset = offset;
+          }
+          intermediate_out[intermediate_out_offset] = intermediate_val;
+          out[offset] = compound_functor.GetOutUseIntermediateOut(
+              x_val, intermediate_val);
+        } else {
+          out[offset] = compound_functor.GetOut(x_val, y_val);
+        }
+      }
+    }
+  }
+}
+#ifdef __NVCC__
+template <typename T, typename CompoundFunctor, bool BcastY,
+          bool KeepIntermediateOut, bool SameShapeOfIntermediateOutAndOut>
+static __global__ void FusedElemwiseAndActBroadcast1CUDAKernel(
+    const T *x, const T *y, int h, int w, CompoundFunctor compound_functor,
+    T *out, T *intermediate_out) {
+  int j = blockIdx.x;
+  int i = threadIdx.x;
+  while (i < h) {
+    int offset = i * w + j;
+    T y_val = BcastY ? y[j] : y[offset];
+    T x_val = BcastY ? x[offset] : x[j];
+    int64_t intermediate_out_offset;
+    if (KeepIntermediateOut) {
+      T intermediate_val = compound_functor.GetIntermediateOut(x_val, y_val);
+      if (SameShapeOfIntermediateOutAndOut) {
+        // for the case of f1(f2(x, y))
+        intermediate_out_offset = offset;
+      } else if (BcastY) {
+        intermediate_out_offset = j;
+      } else {
+        intermediate_out_offset = offset;
+      }
+      intermediate_out[intermediate_out_offset] = intermediate_val;
+      out[offset] =
+          compound_functor.GetOutUseIntermediateOut(x_val, intermediate_val);
+    } else {
+      out[offset] = compound_functor.GetOut(x_val, y_val);
+    }
+    i += ELEMWISE_MAX_BLOCK_DIM;
+  }
+}
+template <typename T, typename CompoundFunctor, bool BcastY,
+          bool KeepIntermediateOut, bool SameShapeOfIntermediateOutAndOut>
+static void FusedElemwiseAndActBroadcast1CUDA(cudaStream_t stream, const T *x,
+                                              const T *y,
+                                              CompoundFunctor compound_functor,
+                                              int h, int w, T *out,
+                                              T *intermediate_out) {
+  int block_size = std::min(ELEMWISE_MAX_BLOCK_DIM, h);
+  int grid_size = w;
+  FusedElemwiseAndActBroadcast1CUDAKernel<
+      T, CompoundFunctor, BcastY, KeepIntermediateOut,
+      SameShapeOfIntermediateOutAndOut><<<grid_size, block_size, 0, stream>>>(
+      x, y, h, w, compound_functor, out, intermediate_out);
+}
+template <typename T, typename CompoundFunctor, bool BcastY,
+          bool KeepIntermediateOut, bool SameShapeOfIntermediateOutAndOut>
+static __global__ void FusedElemwiseAndActBroadcast2CUDAKernel(
+    const T *x, const T *y, CompoundFunctor compound_functor, int pre, int n,
+    int post, T *out, T *intermediate_out) {
+  int tid = threadIdx.x;
+  int j = blockIdx.x;
+  while (true) {
+    int i = tid / post;
+    int k = tid % post;
+    if (i >= pre) break;
+    int offset = i * n * post + j * post + k;
+    T y_val = BcastY ? y[j] : y[offset];
+    T x_val = BcastY ? x[offset] : x[j];
+    int64_t intermediate_out_offset;
+    if (KeepIntermediateOut) {
+      T intermediate_val = compound_functor.GetIntermediateOut(x_val, y_val);
+      if (SameShapeOfIntermediateOutAndOut) {
+        // for the case of f1(f2(x, y))
+        intermediate_out_offset = offset;
+      } else if (BcastY) {
+        intermediate_out_offset = j;
+      } else {
+        intermediate_out_offset = offset;
+      }
+      intermediate_out[intermediate_out_offset] = intermediate_val;
+      out[offset] =
+          compound_functor.GetOutUseIntermediateOut(x_val, intermediate_val);
+    } else {
+      out[offset] = compound_functor.GetOut(x_val, y_val);
+    }
+    tid += ELEMWISE_MAX_BLOCK_DIM;
+  }
+}
+template <typename T, typename CompoundFunctor, bool BcastY,
+          bool KeepIntermediateOut, bool SameShapeOfIntermediateOutAndOut>
+static void FusedElemwiseAndActBroadcast2CUDA(cudaStream_t stream, const T *x,
+                                              const T *y, int pre, int n,
+                                              int post,
+                                              CompoundFunctor compound_functor,
+                                              T *out, T *intermediate_out) {
+  int block_size = std::min(ELEMWISE_MAX_BLOCK_DIM, pre * post);
+  int grid_size = n;
+  FusedElemwiseAndActBroadcast2CUDAKernel<
+      T, CompoundFunctor, BcastY, KeepIntermediateOut,
+      SameShapeOfIntermediateOutAndOut><<<grid_size, block_size, 0, stream>>>(
+      x, y, compound_functor, pre, n, post, out, intermediate_out);
+}
+#endif
+template <typename DeviceContext, typename T, typename CompoundFunctor,
+          bool KeepIntermediateOut>
+void FusedElemwiseAndActComputeNoBroadcast(
+    const framework::ExecutionContext &ctx, const framework::DDim &x_dim,
+    const framework::Tensor &x, const framework::Tensor &y,
+    CompoundFunctor compound_functor, framework::Tensor *out,
+    framework::Tensor *intermediate_out) {
+  size_t N = static_cast<size_t>(framework::product(x_dim));
+  platform::ForRange<DeviceContext> for_range(
+      ctx.template device_context<DeviceContext>(), N);
+  for_range(
+      FusedElemwiseAndActNoBroadcast<T, CompoundFunctor, KeepIntermediateOut>{
+          x.data<T>(), y.data<T>(), compound_functor,
+          out->mutable_data<T>(ctx.GetPlace()),
+          intermediate_out == nullptr
+              ? nullptr
+              : intermediate_out->mutable_data<T>(ctx.GetPlace())});
+}
+template <typename DeviceContext, typename T, typename CompoundFunctor,
+          bool BcastY, bool KeepIntermediateOut,
+          bool SameShapeOfIntermediateOutAndOut>
+void FusedElemwiseAndActComputeWithBroadcast(
+    const framework::ExecutionContext &ctx, const framework::DDim &x_dim,
+    const framework::DDim &y_dim_untrimed, const framework::Tensor &x,
+    const framework::Tensor &y, CompoundFunctor compound_functor, int axis,
+    framework::Tensor *out, framework::Tensor *intermediate_out) {
+  axis = (axis == -1 ? x_dim.size() - y_dim_untrimed.size() : axis);
+  auto y_dim = trim_trailing_singular_dims(y_dim_untrimed);
+  axis = (y_dim.size() == 0) ? x_dim.size() : axis;
+  int pre, n, post;
+  get_mid_dims(x_dim, y_dim, axis, &pre, &n, &post);
+  if (post == 1) {
+    int h = pre;
+    int w = n;
+    if (platform::is_gpu_place(ctx.GetPlace())) {
+#ifdef __NVCC__
+      FusedElemwiseAndActBroadcast1CUDA<T, CompoundFunctor, BcastY,
+                                        KeepIntermediateOut,
+                                        SameShapeOfIntermediateOutAndOut>(
+          ctx.template device_context<DeviceContext>().stream(), x.data<T>(),
+          y.data<T>(), compound_functor, h, w,
+          out->mutable_data<T>(ctx.GetPlace()),
+          intermediate_out == nullptr
+              ? nullptr
+              : intermediate_out->mutable_data<T>(ctx.GetPlace()));
+#endif
+    } else {
+      FusedElemwiseAndActBroadcast1CPU<T, CompoundFunctor, BcastY,
+                                       KeepIntermediateOut,
+                                       SameShapeOfIntermediateOutAndOut>(
+          x.data<T>(), y.data<T>(), compound_functor, h, w,
+          out->mutable_data<T>(ctx.GetPlace()),
+          intermediate_out == nullptr
+              ? nullptr
+              : intermediate_out->mutable_data<T>(ctx.GetPlace()));
+    }
+  } else {
+    if (platform::is_gpu_place(ctx.GetPlace())) {
+#ifdef __NVCC__
+      FusedElemwiseAndActBroadcast2CUDA<T, CompoundFunctor, BcastY,
+                                        KeepIntermediateOut,
+                                        SameShapeOfIntermediateOutAndOut>(
+          ctx.template device_context<DeviceContext>().stream(), x.data<T>(),
+          y.data<T>(), pre, n, post, compound_functor,
+          out->mutable_data<T>(ctx.GetPlace()),
+          intermediate_out == nullptr
+              ? nullptr
+              : intermediate_out->mutable_data<T>(ctx.GetPlace()));
+#endif
+    } else {
+      FusedElemwiseAndActBroadcast2CPU<T, CompoundFunctor, BcastY,
+                                       KeepIntermediateOut,
+                                       SameShapeOfIntermediateOutAndOut>(
+          x.data<T>(), y.data<T>(), pre, n, post, compound_functor,
+          out->mutable_data<T>(ctx.GetPlace()),
+          intermediate_out == nullptr
+              ? nullptr
+              : intermediate_out->mutable_data<T>(ctx.GetPlace()));
+    }
+  }
+}
+// --- backward
+template <typename T, typename DX_OP, typename DY_OP, bool UseIntermediateOut>
+struct FusedElemwiseAndActGradNoBroadcast {
+  HOSTDEVICE void operator()(size_t i) {
+    if (dx_ != nullptr) {
+      dx_[i] = UseIntermediateOut ? dx_op_(x_[i], y_[i], intermediate_out_[i],
+                                           out_[i], dout_[i])
+                                  : dx_op_(x_[i], y_[i], out_[i], dout_[i]);
+    }
+    if (dy_ != nullptr) {
+      dy_[i] = UseIntermediateOut ? dy_op_(x_[i], y_[i], intermediate_out_[i],
+                                           out_[i], dout_[i])
+                                  : dy_op_(x_[i], y_[i], out_[i], dout_[i]);
+    }
+  }
+  const T *x_;
+  const T *y_;
+  const T *intermediate_out_;
+  const T *out_;
+  const T *dout_;
+  DX_OP dx_op_;
+  DY_OP dy_op_;
+  T *dx_;
+  T *dy_;
+};
+template <typename DeviceContext, typename T, typename DX_OP, typename DY_OP,
+          bool UseIntermediateOut>
+void FusedElemwiseAndActGradComputeNoBroadcast(
+    const framework::ExecutionContext &ctx, const framework::DDim &x_dim,
+    const framework::DDim &y_dim, const framework::Tensor *x,
+    const framework::Tensor *y, const framework::Tensor *intermediate_out,
+    const framework::Tensor *out, const framework::Tensor *dout, int axis,
+    framework::Tensor *dx, framework::Tensor *dy, DX_OP dx_op, DY_OP dy_op) {
+  size_t N = static_cast<size_t>(framework::product(x_dim));
+  platform::ForRange<DeviceContext> for_range(
+      ctx.template device_context<DeviceContext>(), N);
+  for_range(
+      FusedElemwiseAndActGradNoBroadcast<T, DX_OP, DY_OP, UseIntermediateOut>{
+          x->data<T>(), y->data<T>(),
+          intermediate_out ? intermediate_out->data<T>() : nullptr,
+          out->data<T>(), dout->data<T>(), dx_op, dy_op,
+          dx == nullptr ? nullptr : dx->mutable_data<T>(ctx.GetPlace()),
+          dy == nullptr ? nullptr : dy->mutable_data<T>(ctx.GetPlace())});
+}
+template <typename T, typename DX_OP, typename DY_OP, bool UseIntermediateOut,
+          bool BcastY, bool SameShapeOfIntermediateOutAndOut>
+static void FusedElemwiseAndActGradBroadcast1CPU(const T *x, const T *y,
+                                                 const T *intermediate_out,
+                                                 const T *out, const T *dout,
+                                                 int h, int w, DX_OP dx_op,
+                                                 DY_OP dy_op, T *dx, T *dy) {
+  int64_t tmp_out_idx, x_idx, y_idx;
+  for (int i = 0; i < h; ++i) {
+    for (int j = 0; j < w; ++j) {
+      int offset = i * w + j;
+      tmp_out_idx = BcastY ? j : offset;
+      y_idx = BcastY ? j : offset;
+      x_idx = BcastY ? offset : j;
+      if (SameShapeOfIntermediateOutAndOut) {
+        tmp_out_idx = offset;
+      }
+      if (dx != nullptr) {
+        T tmp = UseIntermediateOut
+                    ? dx_op(x[x_idx], y[y_idx], intermediate_out[tmp_out_idx],
+                            out[offset], dout[offset])
+                    : dx_op(x[x_idx], y[y_idx], out[offset], dout[offset]);
+        if (BcastY) {
+          dx[x_idx] = tmp;
+        } else {
+          if (i == 0) {
+            dx[x_idx] = tmp;
+          } else {
+            dx[x_idx] += tmp;
+          }
+        }
+      }
+      if (dy != nullptr) {
+        T tmp = UseIntermediateOut
+                    ? dy_op(x[x_idx], y[y_idx], intermediate_out[tmp_out_idx],
+                            out[offset], dout[offset])
+                    : dy_op(x[x_idx], y[y_idx], out[offset], dout[offset]);
+        if (BcastY) {
+          if (i == 0) {
+            dy[y_idx] = tmp;
+          } else {
+            dy[y_idx] += tmp;
+          }
+        } else {
+          dy[y_idx] = tmp;
+        }
+      }
+    }
+  }
+}
+template <typename T, typename DX_OP, typename DY_OP, bool UseIntermediateOut,
+          bool BcastY, bool SameShapeOfIntermediateOutAndOut>
+static void FusedElemwiseAndActGradBroadcast2CPU(const T *x, const T *y,
+                                                 const T *intermediate_out,
+                                                 const T *out, const T *dout,
+                                                 int pre, int n, int post,
+                                                 DX_OP dx_op, DY_OP dy_op,
+                                                 T *dx, T *dy) {
+  int64_t tmp_out_idx, x_idx, y_idx;
+  for (int i = 0; i < pre; ++i) {
+    for (int j = 0; j < n; ++j) {
+      for (int k = 0; k < post; ++k) {
+        int offset = i * n * post + j * post + k;
+        tmp_out_idx = BcastY ? j : offset;
+        y_idx = BcastY ? j : offset;
+        x_idx = BcastY ? offset : j;
+        if (SameShapeOfIntermediateOutAndOut) {
+          tmp_out_idx = offset;
+        }
+        if (dx != nullptr) {
+          T tmp = UseIntermediateOut
+                      ? dx_op(x[x_idx], y[y_idx], intermediate_out[tmp_out_idx],
+                              out[offset], dout[offset])
+                      : dx_op(x[x_idx], y[y_idx], out[offset], dout[offset]);
+          if (BcastY) {
+            dx[x_idx] = tmp;
+          } else {
+            if (i == 0 && k == 0) {
+              dx[x_idx] = tmp;
+            } else {
+              dx[x_idx] += tmp;
+            }
+          }
+        }
+        if (dy != nullptr) {
+          T tmp = UseIntermediateOut
+                      ? dy_op(x[x_idx], y[y_idx], intermediate_out[tmp_out_idx],
+                              out[offset], dout[offset])
+                      : dy_op(x[x_idx], y[y_idx], out[offset], dout[offset]);
+          if (BcastY) {
+            if (i == 0 && k == 0) {
+              dy[y_idx] = tmp;
+            } else {
+              dy[y_idx] += tmp;
+            }
+          } else {
+            dy[y_idx] = tmp;
+          }
+        }
+      }
+    }
+  }
+}
+#ifdef __NVCC__
+template <typename T, typename DX_OP, typename DY_OP, bool UseIntermediateOut,
+          bool BcastY, bool SameShapeOfIntermediateOutAndOut>
+static __global__ void FusedElemwiseAndActGradBroadcast1CUDAKernel(
+    const T *x, const T *y, const T *intermediate_out, const T *out,
+    const T *dout, int h, int w, DX_OP dx_op, DY_OP dy_op, T *dx, T *dy) {
+  int j = blockIdx.x;
+  int i = threadIdx.x;
+  int tid = threadIdx.x;
+  T val(0);
+  int64_t tmp_out_idx, x_idx, y_idx;
+  do {
+    int offset = i * w + j;
+    tmp_out_idx = BcastY ? j : offset;
+    y_idx = BcastY ? j : offset;
+    x_idx = BcastY ? offset : j;
+    if (SameShapeOfIntermediateOutAndOut) {
+      tmp_out_idx = offset;
+    }
+    if (dx != nullptr) {
+      T tmp = UseIntermediateOut
+                  ? dx_op(x[x_idx], y[y_idx], intermediate_out[tmp_out_idx],
+                          out[offset], dout[offset])
+                  : dx_op(x[x_idx], y[y_idx], out[offset], dout[offset]);
+      if (BcastY) {
+        dx[x_idx] = tmp;
+      } else {
+        val += tmp;
+      }
+    }
+    if (dy != nullptr) {
+      T tmp = UseIntermediateOut
+                  ? dy_op(x[x_idx], y[y_idx], intermediate_out[tmp_out_idx],
+                          out[offset], dout[offset])
+                  : dy_op(x[x_idx], y[y_idx], out[offset], dout[offset]);
+      if (BcastY) {
+        val += tmp;
+      } else {
+        dy[y_idx] = tmp;
+      }
+    }
+    i += ELEMWISE_MAX_BLOCK_DIM;
+  } while (i < h);
+  if (BcastY) {
+    if (dy) {
+      h = h > ELEMWISE_MAX_BLOCK_DIM ? ELEMWISE_MAX_BLOCK_DIM : h;
+      val = paddle::platform::reduceSum(val, tid, h);
+      if (threadIdx.x == 0) {
+        dy[j] = val;
+      }
+    }
+  } else {
+    if (dx) {
+      h = h > ELEMWISE_MAX_BLOCK_DIM ? ELEMWISE_MAX_BLOCK_DIM : h;
+      val = paddle::platform::reduceSum(val, tid, h);
+      if (threadIdx.x == 0) {
+        dx[j] = val;
+      }
+    }
+  }
+}
+template <typename T, typename DX_OP, typename DY_OP, bool UseIntermediateOut,
+          bool BcastY, bool SameShapeOfIntermediateOutAndOut>
+static void FusedElemwiseAndActGradBroadcast1CUDA(cudaStream_t stream,
+                                                  const T *x, const T *y,
+                                                  const T *intermediate_out,
+                                                  const T *out, const T *dout,
+                                                  int h, int w, DX_OP dx_op,
+                                                  DY_OP dy_op, T *dx, T *dy) {
+  int block_size = std::min(ELEMWISE_MAX_BLOCK_DIM, h);
+  int grid_size = w;
+  FusedElemwiseAndActGradBroadcast1CUDAKernel<
+      T, DX_OP, DY_OP, UseIntermediateOut, BcastY,
+      SameShapeOfIntermediateOutAndOut><<<grid_size, block_size, 0, stream>>>(
+      x, y, intermediate_out, out, dout, h, w, dx_op, dy_op, dx, dy);
+}
+template <typename T, typename DX_OP, typename DY_OP, bool UseIntermediateOut,
+          bool BcastY, bool SameShapeOfIntermediateOutAndOut>
+static __global__ void FusedElemwiseAndActGradBroadcast2CUDAKernel(
+    const T *x, const T *y, const T *intermediate_out, const T *out,
+    const T *dout, int pre, int n, int post, DX_OP dx_op, DY_OP dy_op, T *dx,
+    T *dy) {
+  int tid = threadIdx.x;
+  int j = blockIdx.x;
+  T val(0);
+  int ttid = tid;
+  int64_t tmp_out_idx, x_idx, y_idx;
+  while (true) {
+    int i = ttid / post;
+    int k = ttid % post;
+    if (i >= pre) break;
+    int offset = i * n * post + j * post + k;
+    tmp_out_idx = BcastY ? j : offset;
+    y_idx = BcastY ? j : offset;
+    x_idx = BcastY ? offset : j;
+    if (SameShapeOfIntermediateOutAndOut) {
+      tmp_out_idx = offset;
+    }
+    if (dx != nullptr) {
+      T tmp = UseIntermediateOut
+                  ? dx_op(x[x_idx], y[y_idx], intermediate_out[tmp_out_idx],
+                          out[offset], dout[offset])
+                  : dx_op(x[x_idx], y[y_idx], out[offset], dout[offset]);
+      if (BcastY) {
+        dx[x_idx] = tmp;
+      } else {
+        val += tmp;
+      }
+    }
+    if (dy != nullptr) {
+      T tmp = UseIntermediateOut
+                  ? dy_op(x[x_idx], y[y_idx], intermediate_out[tmp_out_idx],
+                          out[offset], dout[offset])
+                  : dy_op(x[x_idx], y[y_idx], out[offset], dout[offset]);
+      if (BcastY) {
+        val += tmp;
+      } else {
+        dy[y_idx] = tmp;
+      }
+    }
+    ttid += ELEMWISE_MAX_BLOCK_DIM;
+  }
+  if (BcastY) {
+    if (dy) {
+      int h = pre * post;
+      h = h > ELEMWISE_MAX_BLOCK_DIM ? ELEMWISE_MAX_BLOCK_DIM : h;
+      val = paddle::platform::reduceSum(val, tid, h);
+      if (threadIdx.x == 0) {
+        dy[j] = val;
+      }
+    }
+  } else {
+    if (dx) {
+      int h = pre * post;
+      h = h > ELEMWISE_MAX_BLOCK_DIM ? ELEMWISE_MAX_BLOCK_DIM : h;
+      val = paddle::platform::reduceSum(val, tid, h);
+      if (threadIdx.x == 0) {
+        dx[j] = val;
+      }
+    }
+  }
+}
+template <typename T, typename DX_OP, typename DY_OP, bool UseIntermediateOut,
+          bool BcastY, bool SameShapeOfIntermediateOutAndOut>
+static void FusedElemwiseAndActGradBroadcast2CUDA(
+    cudaStream_t stream, const T *x, const T *y, const T *intermediate_out,
+    const T *out, const T *dout, int pre, int n, int post, DX_OP dx_op,
+    DY_OP dy_op, T *dx, T *dy) {
+  int block_size = std::min(ELEMWISE_MAX_BLOCK_DIM, pre * post);
+  int grid_size = n;
+  FusedElemwiseAndActGradBroadcast2CUDAKernel<
+      T, DX_OP, DY_OP, UseIntermediateOut, BcastY,
+      SameShapeOfIntermediateOutAndOut><<<grid_size, block_size, 0, stream>>>(
+      x, y, intermediate_out, out, dout, pre, n, post, dx_op, dy_op, dx, dy);
+}
+#endif
+template <typename DeviceContext, typename T, typename DX_OP, typename DY_OP,
+          bool UseIntermediateOut, bool BcastY,
+          bool SameShapeOfIntermediateOutAndOut>
+void FusedElemwiseAndActGradComputeWithBroadcast(
+    const framework::ExecutionContext &ctx, const framework::DDim &x_dim,
+    const framework::DDim &y_dim_untrimed, const framework::Tensor *x,
+    const framework::Tensor *y, const framework::Tensor *intermediate_out,
+    const framework::Tensor *out, const framework::Tensor *dout, int axis,
+    framework::Tensor *dx, framework::Tensor *dy, DX_OP dx_op, DY_OP dy_op) {
+  axis = (axis == -1 ? x_dim.size() - y_dim_untrimed.size() : axis);
+  auto y_dim = trim_trailing_singular_dims(y_dim_untrimed);
+  axis = (y_dim.size() == 0) ? x_dim.size() : axis;
+  int pre, n, post;
+  get_mid_dims(x_dim, y_dim, axis, &pre, &n, &post);
+  if (post == 1) {
+    int h = pre;
+    int w = n;
+    if (platform::is_gpu_place(ctx.GetPlace())) {
+#ifdef __NVCC__
+      FusedElemwiseAndActGradBroadcast1CUDA<T, DX_OP, DY_OP, UseIntermediateOut,
+                                            BcastY,
+                                            SameShapeOfIntermediateOutAndOut>(
+          ctx.template device_context<DeviceContext>().stream(), x->data<T>(),
+          y->data<T>(),
+          intermediate_out == nullptr ? nullptr : intermediate_out->data<T>(),
+          out->data<T>(), dout->data<T>(), h, w, dx_op, dy_op,
+          dx == nullptr ? nullptr : dx->mutable_data<T>(ctx.GetPlace()),
+          dy == nullptr ? nullptr : dy->mutable_data<T>(ctx.GetPlace()));
+#endif
+    } else {
+      FusedElemwiseAndActGradBroadcast1CPU<T, DX_OP, DY_OP, UseIntermediateOut,
+                                           BcastY,
+                                           SameShapeOfIntermediateOutAndOut>(
+          x->data<T>(), y->data<T>(),
+          intermediate_out == nullptr ? nullptr : intermediate_out->data<T>(),
+          out->data<T>(), dout->data<T>(), h, w, dx_op, dy_op,
+          dx == nullptr ? nullptr : dx->mutable_data<T>(ctx.GetPlace()),
+          dy == nullptr ? nullptr : dy->mutable_data<T>(ctx.GetPlace()));
+    }
+  } else {
+    if (platform::is_gpu_place(ctx.GetPlace())) {
+#ifdef __NVCC__
+      FusedElemwiseAndActGradBroadcast2CUDA<T, DX_OP, DY_OP, UseIntermediateOut,
+                                            BcastY,
+                                            SameShapeOfIntermediateOutAndOut>(
+          ctx.template device_context<DeviceContext>().stream(), x->data<T>(),
+          y->data<T>(),
+          intermediate_out == nullptr ? nullptr : intermediate_out->data<T>(),
+          out->data<T>(), dout->data<T>(), pre, n, post, dx_op, dy_op,
+          dx == nullptr ? nullptr : dx->mutable_data<T>(ctx.GetPlace()),
+          dy == nullptr ? nullptr : dy->mutable_data<T>(ctx.GetPlace()));
+#endif
+    } else {
+      FusedElemwiseAndActGradBroadcast2CPU<T, DX_OP, DY_OP, UseIntermediateOut,
+                                           BcastY,
+                                           SameShapeOfIntermediateOutAndOut>(
+          x->data<T>(), y->data<T>(),
+          intermediate_out == nullptr ? nullptr : intermediate_out->data<T>(),
+          out->data<T>(), dout->data<T>(), pre, n, post, dx_op, dy_op,
+          dx == nullptr ? nullptr : dx->mutable_data<T>(ctx.GetPlace()),
+          dy == nullptr ? nullptr : dy->mutable_data<T>(ctx.GetPlace()));
+    }
+  }
+}
+template <typename DeviceContext, typename T, typename DX_OP, typename DY_OP,
+          bool UseIntermediateOut, bool SameShapeOfIntermediateOutAndOut>
+void FusedElemwiseAndActGradComputeEx(
+    const framework::ExecutionContext &ctx, const framework::Tensor *x,
+    const framework::Tensor *y, const framework::Tensor *out,
+    const framework::Tensor *intermediate_out, const framework::Tensor *dout,
+    int axis, framework::Tensor *dx, framework::Tensor *dy, DX_OP dx_op,
+    DY_OP dy_op) {
+  const framework::DDim &x_dim = x->dims();
+  const framework::DDim &y_dim = y->dims();
+  if (UseIntermediateOut) {
+    PADDLE_ENFORCE(intermediate_out, "intermediate_out should not be nullptr");
+  }
+  if (x_dim == y_dim) {
+    FusedElemwiseAndActGradComputeNoBroadcast<DeviceContext, T, DX_OP, DY_OP,
+                                              UseIntermediateOut>(
+        ctx, x_dim, y_dim, x, y, intermediate_out, out, dout, axis, dx, dy,
+        dx_op, dy_op);
+  } else {  // Shapes differ: one operand must be broadcast.
+    bool bcast_y = x_dim.size() >= y_dim.size();
+    if (x_dim.size() == y_dim.size()) {
+      for (int i = 0; i < x_dim.size(); ++i) {
+        if (x_dim[i] < y_dim[i]) {
+          bcast_y = false;
+          break;
+        }
+      }
+    }
+    // z = f1(x, f2(y))
+    // z = f1(f2(x, y))
+    if (bcast_y) {  // Y should be broadcast.
+      FusedElemwiseAndActGradComputeWithBroadcast<
+          DeviceContext, T, DX_OP, DY_OP, UseIntermediateOut, true /*BcastY*/,
+          SameShapeOfIntermediateOutAndOut>(ctx, x_dim, y_dim, x, y,
+                                            intermediate_out, out, dout, axis,
+                                            dx, dy, dx_op, dy_op);
+    } else {
+      FusedElemwiseAndActGradComputeWithBroadcast<
+          DeviceContext, T, DX_OP, DY_OP, UseIntermediateOut, false /*BcastY*/,
+          SameShapeOfIntermediateOutAndOut>(ctx, y_dim, x_dim, x, y,
+                                            intermediate_out, out, dout, axis,
+                                            dx, dy, dx_op, dy_op);
+    }
+  }
+}
+template <typename DeviceContext, typename T, typename CompoundFunctor,
+          bool KeepIntermediateOut, bool SameShapeOfIntermediateOutAndOut>
+void FusedElemwiseAndActComputeEx(const framework::ExecutionContext &ctx,
+                                  const framework::Tensor &x,
+                                  const framework::Tensor &y, int axis,
+                                  CompoundFunctor compound_functor,
+                                  framework::Tensor *out,
+                                  framework::Tensor *intermediate_out) {
+  if (KeepIntermediateOut) {
+    PADDLE_ENFORCE(intermediate_out,
+                   "keep_intermediate_value is enabled, so "
+                   "intermediate_out should not be nullptr.");
+  }
+  const framework::DDim &x_dim = x.dims();
+  const framework::DDim &y_dim = y.dims();
+  if (x.dims() == y.dims()) {
+    FusedElemwiseAndActComputeNoBroadcast<DeviceContext, T, CompoundFunctor,
+                                          KeepIntermediateOut>(
+        ctx, x_dim, x, y, compound_functor, out, intermediate_out);
+  } else {
+    // Whether the shape of Y is a continuous subsequence of X,
+    // For more information please refer to the op's introduction.
+    bool bcast_y = x.dims().size() >= y.dims().size();
+    if (x.dims().size() == y.dims().size()) {
+      for (int i = 0; i < x.dims().size(); ++i) {
+        if (x.dims()[i] < y.dims()[i]) {
+          bcast_y = false;
+          break;
+        }
+      }
+    }
+    // z = f1(x, f2(y))
+    // z = f1(f2(x, y))
+    if (bcast_y) {  // Y should be broadcast.
+      // In this case,
+      // for 'f2(y)', the shape of intermediate_out should be equal to the shape
+      // of Y.
+      // for 'f2(x, y)', the shape of intermediate_out should be equal to the
+      // shape of Out.
+      // the shape of Out should be equal to the shape of X.
+      FusedElemwiseAndActComputeWithBroadcast<
+          DeviceContext, T, CompoundFunctor, true /*BcastY*/,
+          KeepIntermediateOut, SameShapeOfIntermediateOutAndOut>(
+          ctx, x_dim /*OutShape*/, y_dim, x, y, compound_functor, axis, out,
+          intermediate_out);
+    } else {
+      // In this case,
+      // for 'f2(y)', the shape of intermediate_out should be equal to the shape
+      // of Out.
+      // for 'f2(x, y)', the shape of intermediate_out should be equal to the
+      // shape of Out.
+      // the shape of Out should be equal to the shape of Y.
+      FusedElemwiseAndActComputeWithBroadcast<
+          DeviceContext, T, CompoundFunctor, false /*BcastY*/,
+          KeepIntermediateOut, SameShapeOfIntermediateOutAndOut>(
+          ctx, y_dim /*OutShape*/, x_dim, x, y, compound_functor, axis, out,
+          intermediate_out);
+    }
+  }
+}
 }  // namespace operators
 }  // namespace paddle
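
Reviewer note: both `FusedElemwiseAndActComputeEx` and `FusedElemwiseAndActGradComputeEx` repeat the same rule for choosing the broadcast operand. The following is a minimal standalone sketch of that rule, not the Paddle implementation: `Dims`, `ShouldBroadcastY`, and `OutShape` are hypothetical names, and `Dims` is only a stand-in for `framework::DDim`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for framework::DDim: a plain list of extents.
using Dims = std::vector<int64_t>;

// Returns true when Y is the operand to broadcast (Out takes X's shape) and
// false when X is the operand to broadcast (Out takes Y's shape).
// Precondition, as in the diff: the two shapes differ.
bool ShouldBroadcastY(const Dims &x_dim, const Dims &y_dim) {
  assert(x_dim != y_dim && "equal shapes take the no-broadcast path");
  // A Y of lower (or equal) rank is the broadcast candidate.
  bool bcast_y = x_dim.size() >= y_dim.size();
  if (x_dim.size() == y_dim.size()) {
    // Equal rank: Y is broadcast only if no extent of X is smaller than Y's.
    for (std::size_t i = 0; i < x_dim.size(); ++i) {
      if (x_dim[i] < y_dim[i]) {
        bcast_y = false;
        break;
      }
    }
  }
  return bcast_y;
}

// Shape of Out under the same rule (hypothetical helper).
Dims OutShape(const Dims &x_dim, const Dims &y_dim) {
  if (x_dim == y_dim) return x_dim;
  return ShouldBroadcastY(x_dim, y_dim) ? x_dim : y_dim;
}
```

Factoring the rule into one helper like this would also let `InferShape` in the `.cc` file reuse it instead of carrying a third copy of the loop.
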
--- a/paddle/fluid/operators/fused_elemwise_activation_op.cc
+++ b/paddle/fluid/operators/fused_elemwise_activation_op.cc
@@ -12,14 +12,60 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
+#include "paddle/fluid/operators/fused_elemwise_activation_op.h"
 #include <string>
 #include <vector>
-#include "paddle/fluid/operators/fused_elemwise_activation_op.h"
 namespace paddle {
 namespace operators {
+/*
+ * Whether the compound function is Unary(Binary(X, Y)).
+ * For Unary(Binary(X, Y)), the shape of intermediate_out is the same as that
+ * of the final out.
+ */
+static bool IsUnaryCompound(const std::vector<std::string> &functor_list) {
+  PADDLE_ENFORCE_EQ(functor_list.size(), 2);
+  static std::unordered_set<std::string> binary_fun = {
+      "elementwise_add", "elementwise_mul", "elementwise_add_grad",
+      "elementwise_mul_grad"};
+  return binary_fun.count(functor_list[1]) != 0;
+}
+/*
+ * Whether the Input(X) could be absent.
+ */
+static bool InputXCanBeAbsent(const std::vector<std::string> &functor_list) {
+  PADDLE_ENFORCE_EQ(functor_list.size(), 2);
+  static std::unordered_set<std::string> binary_fun = {"elementwise_add_grad"};
+  return binary_fun.count(functor_list[0]) != 0 ||
+         binary_fun.count(functor_list[1]) != 0;
+}
+/*
+ * Whether the compound function is supported.
+ * For Unary(Binary(X, Y)), the shape of intermediate_out is the same as that
+ * of the final out.
+ */
+static bool IsSupportedCompound(const std::vector<std::string> &functors) {
+  static std::unordered_set<std::string> unary_fun = {"scale", "relu"};
+  static std::unordered_set<std::string> binary_fun = {"elementwise_add",
+                                                       "elementwise_mul"};
+  std::string unary_fun_str;
+  if (binary_fun.count(functors[0])) {
+    unary_fun_str = functors[1];
+  } else if (binary_fun.count(functors[1])) {
+    unary_fun_str = functors[0];
+  } else {
+    PADDLE_THROW("%s and %s are not included in fused_list.", functors[0],
+                 functors[1]);
+  }
+  PADDLE_ENFORCE_EQ(unary_fun.count(unary_fun_str), 1,
+                    "%s is not included in fused_list.", unary_fun_str);
+  return true;
+}
 class FusedElemwiseActivationOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
@@ -37,11 +83,44 @@ class FusedElemwiseActivationOp : public framework::OperatorWithKernel {
    auto x_dim = ctx->GetInputDim("X");
    auto y_dim = ctx->GetInputDim("Y");
-    PADDLE_ENFORCE_GE(x_dim.size(), y_dim.size(),
-                      "Rank of first input must >= rank of second input.");
-    ctx->SetOutputDim("Out", x_dim);
+    // Whether the shape of Y is a continuous subsequence of X,
-    ctx->ShareLoD("X", /*->*/ "Out");
+    // For more information please refer to the op's introduction.
+    bool bcast_y = x_dim.size() >= y_dim.size();
+    if (x_dim.size() == y_dim.size()) {
+      for (int i = 0; i < x_dim.size(); ++i) {
+        if (x_dim[i] < y_dim[i]) {
+          bcast_y = false;
+          break;
+        }
+      }
+    }
+    auto &out_dim = bcast_y ? x_dim : y_dim;
+    std::string out_lod = bcast_y ? "X" : "Y";
+    if (ctx->Attrs().Get<bool>("keep_intermediate_value")) {
+      PADDLE_ENFORCE(ctx->HasOutput("IntermediateOut"),
+                     "Output(IntermediateOut) of FusedElemwiseActivationOp "
+                     "should not be null.");
+      if (IsUnaryCompound(
+              ctx->Attrs().Get<std::vector<std::string>>("functor_list"))) {
+        // for Unary(Binary(X, Y)), the shape and lod of out and
+        // intermediate_out are the same.
+        ctx->SetOutputDim("IntermediateOut", out_dim);
+        // set the lod of intermediate_out
+        ctx->ShareLoD(out_lod, /*->*/ "IntermediateOut");
+      } else {
+        // for Binary(X, Unary(Y)), the shape and lod of Y and
+        // intermediate_out are the same.
+        ctx->SetOutputDim("IntermediateOut", y_dim);
+        // set the lod of intermediate_out
+        ctx->ShareLoD("Y", /*->*/ "IntermediateOut");
+      }
+    }
+    ctx->SetOutputDim("Out", out_dim);
+    ctx->ShareLoD(out_lod, /*->*/ "Out");
  }
 protected:
@@ -59,29 +138,42 @@ class FusedElemwiseActivationOp : public framework::OperatorWithKernel {
 class FusedElemwiseActivationMaker : public framework::OpProtoAndCheckerMaker {
 public:
  void Make() override {
-    AddInput("X", "(vector<Tensor>)");
+    AddInput(
-    AddInput("Y", "(vector<Tensor>)");
+        "X",
-    AddOutput("Out", "vector<Tensor>");
+        "(Tensor) The input tensor of fused_elemwise_activation operator.");
+    AddInput(
+        "Y",
+        "(Tensor) The input tensor of fused_elemwise_activation operator.");
+    AddOutput("Out",
+              "(Tensor) The output tensor of fused_elemwise_activation "
+              "operator.");
+    AddOutput("IntermediateOut",
+              "(Tensor) The IntermediateOut tensor of "
+              "fused_elemwise_activation operator.")
+        .AsIntermediate();
    AddAttr<int>("axis",
                 "axis is used by elementwise_op, the default value is -1.")
        .SetDefault(-1);
    AddAttr<float>("scale",
                   "scale is used by scale_op, the default value is 0.0.")
        .SetDefault(0.0);
-    AddAttr<bool>("recomputation",
+    AddAttr<bool>(
+        "recomputation",
        "Whether to recompute the Out."
-                  "fused_elemwise_activation_grad has two methods to get the "
+        "The computation of fused_elemwise_activation_grad has two methods to "
-                  "dx and dy, one "
+        "get the dx and dy, one is to use the 'Out', and the other is not. "
-                  "is to use the 'Out', and the other is not to use it. "
+        "The former method will save the time of recomputing the 'Out', but it "
-                  "The former method will save the time of recomputing the "
+        "must occupy the memory to store the 'Out'. The latter method "
-                  "'Out', but it must occupy the memory to store the 'out'. "
+        "can avoid occupying the memory, but it must recompute the 'Out'. "
-                  "While, the later method can avoid occupying the memory, "
+        "It is useful for Unary(Binary(X, Y)). The default value is true.")
-                  "but it must recompute the 'Out'. The default value is true.")
        .SetDefault(true);
+    AddAttr<bool>("keep_intermediate_value",
+                  "Whether to save the intermediate_out.")
+        .SetDefault(false);
    AddAttr<std::vector<std::string>>("functor_list",
                                      "The functors that should be fused.")
        .AddCustomChecker([&](const std::vector<std::string> &functor_list) {
-          PADDLE_ENFORCE(ValidCheck(functor_list));
+          PADDLE_ENFORCE(IsSupportedCompound(functor_list));
        });
    AddComment(R"DOC(
@@ -93,30 +185,38 @@ operators (elementwise_op and activation_op):
    Z = Binary(X, Unary(Y))
    Z = Unary(Binary(X, Y))
-The attributions of activation_op can be get from fused_elemwise_activation_op's
+There are two cases for this operator:
-attributions. functor_list records the functors to be fused, for example
-"scale,elementwise_add".
-)DOC");
+1. The shapes of $X$ and $Y$ are the same.
-  }
+2. The shape of $Y$ is a continuous subsequence of $X$ or the shape of $X$ is a continuous subsequence of $Y$.
- private:
+For case 2 (assume that the shape of $Y$ is a continuous subsequence of $X$ ):
-  bool ValidCheck(const std::vector<std::string> &functors) {
-    std::unordered_set<std::string> unary_fun = {"scale", "relu"};
-    std::unordered_set<std::string> binary_fun = {"elementwise_add"};
-    std::string unary_fun_str;
+1. Broadcast $Y$ to match the shape of $X$, where $axis$ is the start dimension index
-    if (binary_fun.count(functors[0])) {
+   for broadcasting $Y$ onto $X$.
-      unary_fun_str = functors[1];
+2. If $axis$ is -1 (default), $axis = rank(X) - rank(Y)$.
-    } else if (binary_fun.count(functors[1])) {
+3. The trailing dimensions of size 1 for $Y$ will be ignored for the consideration of
-      unary_fun_str = functors[0];
+   subsequence, such as shape(Y) = (2, 1) => (2).
-    } else {
-      PADDLE_THROW("%s and %s are not included in fused_list.", functors[0],
+For example:
-                   functors[1]);
-    }
+  .. code-block:: python
-    PADDLE_ENFORCE_EQ(unary_fun.count(unary_fun_str), 1,
-                      "%s is not included in fused_list.", unary_fun_str);
+    shape(X) = (2, 3, 4, 5), shape(Y) = (,)
-    return true;
+    shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
+    shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5), with axis=-1(default) or axis=2
+    shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
+    shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0
+    shape(X) = (2, 3, 4, 5), shape(Y) = (2, 1), with axis=0
+The inputs $X$ and $Y$ can carry different LoD information,
+but the output only shares the LoD information of the input whose shape is the same as that of Out.
+The attributes of activation_op can be obtained from the attributes of fused_elemwise_activation_op.
+The functor_list records the functors to be fused, for example
+["scale", "elementwise_add"].
+)DOC");
  }
 };
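
Reviewer note: the operator comment above describes how `axis` is resolved and how trailing dimensions of size 1 in `Y` are ignored. The sketch below illustrates that description with plain vectors. It assumes that the `pre, n, post` arguments passed to the `Broadcast2` kernels correspond to the extents of `X` before, covered by, and after `Y`. That mapping is an assumption, and `Dims`, `MidDims`, `TrimTrailingOnes`, and `SplitForBroadcast` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for framework::DDim.
using Dims = std::vector<int64_t>;

// Product of X's extents before, covered by, and after Y (assumed meaning).
struct MidDims {
  int64_t pre;
  int64_t n;
  int64_t post;
};

// Drops trailing extents of size 1, e.g. (2, 1) -> (2).
Dims TrimTrailingOnes(Dims y_dim) {
  while (!y_dim.empty() && y_dim.back() == 1) y_dim.pop_back();
  return y_dim;
}

MidDims SplitForBroadcast(const Dims &x_dim, const Dims &y_dim, int axis) {
  const int x_rank = static_cast<int>(x_dim.size());
  // axis == -1 means: align Y with the trailing dimensions of X.
  if (axis == -1) axis = x_rank - static_cast<int>(y_dim.size());
  const Dims y = TrimTrailingOnes(y_dim);
  const int y_rank = static_cast<int>(y.size());
  assert(axis >= 0 && axis + y_rank <= x_rank);
  MidDims m{1, 1, 1};
  for (int i = 0; i < axis; ++i) m.pre *= x_dim[i];
  for (int i = 0; i < y_rank; ++i) {
    // Y must match a contiguous run of X's extents starting at axis.
    assert(x_dim[axis + i] == y[i]);
    m.n *= y[i];
  }
  for (int i = axis + y_rank; i < x_rank; ++i) m.post *= x_dim[i];
  return m;
}
```

For `shape(X) = (2, 3, 4, 5)` and `shape(Y) = (3, 4)` with `axis=1`, this gives `pre = 2`, `n = 12`, `post = 5`. When `post` is 1, the problem reduces to a two-dimensional `(h, w)` layout, which may be what the `Broadcast1` kernels handle.
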
@@ -141,6 +241,7 @@ class FusedElemwiseActivationGradMaker
      op_desc_ptr->SetInput(framework::GradVarName(output_param),
                            this->OutputGrad(output_param));
    }
    op_desc_ptr->SetAttrMap(this->Attrs());
    std::vector<std::string> functor_names =
@@ -158,40 +259,59 @@ class FusedElemwiseActivationOpGrad : public framework::OperatorWithKernel {
  using framework::OperatorWithKernel::OperatorWithKernel;
  void InferShape(framework::InferShapeContext *ctx) const override {
-    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
-    PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should not be null");
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
-                   "Input(Out@GRAD) should not be null");
+                   "Input(Out@Grad) should not be null");
+    if (ctx->Attrs().Get<bool>("keep_intermediate_value")) {
-    auto x_dims = ctx->GetInputDim("X");
+      PADDLE_ENFORCE(ctx->HasInput("IntermediateOut"),
-    auto y_dims = ctx->GetInputDim("Y");
+                     "Input(IntermediateOut) should not be null");
-    auto out_dims = ctx->GetInputDim(framework::GradVarName("Out"));
+    } else {
+      PADDLE_ENFORCE_EQ(ctx->Inputs(framework::GradVarName("Out")).size(), 1);
-    PADDLE_ENFORCE_GE(x_dims.size(), y_dims.size(),
+    }
-                      "Rank of first input must >= rank of second input.");
+    auto funtor_list =
+        ctx->Attrs().Get<std::vector<std::string>>("functor_list");
    auto x_grad_name = framework::GradVarName("X");
    auto y_grad_name = framework::GradVarName("Y");
    if (ctx->HasOutput(x_grad_name)) {
-      ctx->SetOutputDim(x_grad_name, x_dims);
+      if (ctx->HasInputs("X")) {
+        ctx->SetOutputDim(x_grad_name, ctx->GetInputDim("X"));
+        ctx->ShareLoD("X", x_grad_name);
+      } else {
+        // Note: If "X" is absent, the shape of Y should be a continuous
+        // subsequence of X; otherwise, we cannot infer the shape of dx.
+        // Currently, "X" may be absent only when the functor list contains
+        // elementwise_add_grad.
+        PADDLE_ENFORCE(InputXCanBeAbsent(funtor_list),
+                       "Only when BinaryFunctor is elementwise_add, the 'X' "
+                       "could be absent.");
+        // For Unary(Binary(X, Y)), IntermediateOut should not be empty.
+        if (IsUnaryCompound(funtor_list)) {
+          PADDLE_ENFORCE(
+              ctx->HasInputs("IntermediateOut"),
+              "If the compound_functor is Unary(Binary(X, Y)) and Binary "
+              "is elementwise_add, the intermediate_out must not be absent.");
+        }
+        ctx->SetOutputDim(x_grad_name,
+                          ctx->GetInputDim(framework::GradVarName("Out")));
+        ctx->ShareLoD(framework::GradVarName("Out"), x_grad_name);
+      }
    }
    if (ctx->HasOutput(y_grad_name)) {
-      ctx->SetOutputDim(y_grad_name, y_dims);
+      PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should not be null");
+      ctx->SetOutputDim(y_grad_name, ctx->GetInputDim("Y"));
+      ctx->ShareLoD("Y", y_grad_name);
    }
  }
 protected:
  framework::OpKernelType GetExpectedKernelType(
      const framework::ExecutionContext &ctx) const override {
-    auto input_data_type_index = ctx.Input<framework::Tensor>("X")->type();
+    //    PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should not be null");
-    PADDLE_ENFORCE_EQ(input_data_type_index,
+    auto input_data_type_index = ctx.Input<framework::Tensor>("Y")->type();
-                      ctx.Input<framework::Tensor>("Y")->type(),
-                      "The element's type of input should be the same.");
-    PADDLE_ENFORCE_EQ(
-        input_data_type_index,
-        ctx.Input<framework::Tensor>(framework::GradVarName("Out"))->type(),
-        "The element's type of input should be the same.");
    auto input_data_type = framework::ToDataType(input_data_type_index);
    return framework::OpKernelType(input_data_type, ctx.GetPlace());
  }
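
Reviewer note: the two compound forms handled by this operator, `Z = Binary(X, Unary(Y))` and `Z = Unary(Binary(X, Y))`, differ in what the intermediate value is. The scalar sketch below is illustrative only. The functors, `FusedResult`, and `IsUnaryCompoundSketch` are hypothetical simplifications rather than the classes in `math/compound_functors.h`, and the sketch covers forward functor names only.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Simplified scalar functors (hypothetical stand-ins for Paddle's functors).
struct AddFunctor {
  float operator()(float a, float b) const { return a + b; }
};
struct ScaleFunctor {
  explicit ScaleFunctor(float s) : scale(s) {}
  float operator()(float v) const { return scale * v; }
  float scale;
};
struct ReluFunctor {
  float operator()(float v) const { return v > 0.0f ? v : 0.0f; }
};

// Final output plus the value that IntermediateOut would hold.
struct FusedResult {
  float out;
  float intermediate;
};

// Z = Binary(X, Unary(Y)); intermediate = Unary(Y).
template <typename BinaryFun, typename UnaryFun>
FusedResult BinaryCompound(BinaryFun binary, UnaryFun unary, float x, float y) {
  const float mid = unary(y);
  return {binary(x, mid), mid};
}

// Z = Unary(Binary(X, Y)); intermediate = Binary(X, Y).
template <typename UnaryFun, typename BinaryFun>
FusedResult UnaryCompound(UnaryFun unary, BinaryFun binary, float x, float y) {
  const float mid = binary(x, y);
  return {unary(mid), mid};
}

// Same idea as IsUnaryCompound in the diff: the compound is
// Unary(Binary(X, Y)) when the second functor is the binary one.
bool IsUnaryCompoundSketch(const std::vector<std::string> &functor_list) {
  assert(functor_list.size() == 2);
  static const std::unordered_set<std::string> binary_fun = {
      "elementwise_add", "elementwise_mul"};
  return binary_fun.count(functor_list[1]) != 0;
}
```

In the binary form the intermediate depends only on `Y`, which is why `InferShape` gives `IntermediateOut` the shape of `Y`. In the unary form it is computed from both inputs and therefore takes the shape of `Out`.
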

--- a/paddle/fluid/operators/fused_elemwise_activation_op.h
+++ b/paddle/fluid/operators/fused_elemwise_activation_op.h
@@ -20,208 +20,114 @@ limitations under the License. */
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/operators/detail/safe_ref.h"
 #include "paddle/fluid/operators/elementwise_op_function.h"
+#include "paddle/fluid/operators/math/compound_functors.h"
 #include "paddle/fluid/operators/math/functors.h"
-namespace math = paddle::operators::math;
 namespace paddle {
 namespace operators {
-// CompoundFunctors
-// For example: Z = Binary(X, Unary(Y))
-template <typename T, typename BinaryFun, typename UnaryFun>
-struct BinaryCompoundFunctor {
-  BinaryCompoundFunctor(const BinaryFun &binary_fun, const UnaryFun &unary_fun)
-      : binary_fun_(binary_fun), unary_fun_(unary_fun) {}
-  inline HOSTDEVICE T operator()(T x, T y) {
-    return binary_fun_(x, unary_fun_(y));
-  }
- private:
-  BinaryFun binary_fun_;
-  UnaryFun unary_fun_;
-};
-// For example: Z = Unary(Binary(X, Y))
-template <typename T, typename UnaryFun, typename BinaryFun>
-struct UnaryCompoundFunctor {
-  UnaryCompoundFunctor(const UnaryFun &unary_fun, const BinaryFun &binary_fun)
-      : unary_fun_(unary_fun), binary_fun_(binary_fun) {}
-  inline HOSTDEVICE T operator()(T x, T y) {
-    return unary_fun_(binary_fun_(x, y));
-  }
- private:
-  UnaryFun unary_fun_;
-  BinaryFun binary_fun_;
-};
-// FIXME(zcd): DBinaryFun and DUnaryFun have to method to get
-// the dx, one is to use the 'out', and the other is not to use it.
-// the former method will save the time of recomputing the
-// 'out', but it must occupy the memory to store the 'out'.
-// While the later method can avoid occupying this memory,
-// but it must recompute the 'out'.
-template <typename T, typename DBinaryFun, typename UnaryFun,
-          bool Recomputation = true>
-struct BinaryCompoundGradDxFunctor {
-  BinaryCompoundGradDxFunctor(const DBinaryFun &d_binary_fun,
-                              const UnaryFun &unary_fun)
-      : d_binary_fun_(d_binary_fun), unary_fun_(unary_fun) {}
-  inline HOSTDEVICE T operator()(T x, T y, T out, T dout) {
-    if (Recomputation) {
-      return dout * d_binary_fun_(x, unary_fun_(y));
-    } else {
-      return dout * d_binary_fun_(x, unary_fun_(y), out);
-    }
-  }
- private:
-  DBinaryFun d_binary_fun_;
-  UnaryFun unary_fun_;
-};
-template <typename T, typename DBinaryFun, typename UnaryFun,
-          typename DUnaryFun, bool Recomputation = true>
-struct BinaryCompoundGradDyFunctor {
-  BinaryCompoundGradDyFunctor(const DBinaryFun &d_binary_fun,
-                              const UnaryFun &unary_fun,
-                              const DUnaryFun &d_unary_fun)
-      : d_binary_fun_(d_binary_fun),
-        unary_fun_(unary_fun),
-        d_unary_fun_(d_unary_fun) {}
-  inline HOSTDEVICE T operator()(T x, T y, T out, T dout) {
-    if (Recomputation) {
-      return dout * d_binary_fun_(unary_fun_(y), x) * d_unary_fun_(y);
-    } else {
-      return dout * d_binary_fun_(unary_fun_(y), x, out) * d_unary_fun_(y);
-    }
-  }
- private:
-  DBinaryFun d_binary_fun_;
-  UnaryFun unary_fun_;
-  DUnaryFun d_unary_fun_;
-};
-template <typename T, typename DUnaryFun, typename BinaryFun,
-          typename DBinaryFun, bool Recomputation = true>
-struct UnaryCompoundGradDxFunctor {
-  UnaryCompoundGradDxFunctor(const DUnaryFun &d_unary_fun,
-                             const BinaryFun &binary_fun,
-                             const DBinaryFun &d_binary_fun)
-      : d_unary_fun_(d_unary_fun),
-        binary_fun_(binary_fun),
-        d_binary_fun_(d_binary_fun) {}
-  inline HOSTDEVICE T operator()(T x, T y, T out, T dout) {
-    T base;
-    if (Recomputation) {
-      base = dout * d_unary_fun_(binary_fun_(x, y));
-    } else {
-      base = dout * d_unary_fun_(binary_fun_(x, y), out);
-    }
-    return base * d_binary_fun_(x, y);
-  }
- private:
-  DUnaryFun d_unary_fun_;
-  BinaryFun binary_fun_;
-  DBinaryFun d_binary_fun_;
-};
-template <typename T, typename DUnaryFun, typename BinaryFun,
-          typename DBinaryFun, bool Recomputation = true>
-struct UnaryCompoundGradDyFunctor {
-  UnaryCompoundGradDyFunctor(const DUnaryFun &d_unary_fun,
-                             const BinaryFun &binary_fun,
-                             const DBinaryFun &d_binary_fun)
-      : d_unary_fun_(d_unary_fun),
-        binary_fun_(binary_fun),
-        d_binary_fun_(d_binary_fun) {}
-  inline HOSTDEVICE T operator()(T x, T y, T out, T dout) {
-    T base;
-    if (Recomputation) {
-      base = dout * d_unary_fun_(binary_fun_(x, y));
-    } else {
-      base = dout * d_unary_fun_(binary_fun_(x, y), out);
-    }
-    return base * d_binary_fun_(y, x);
-  }
- private:
-  DUnaryFun d_unary_fun_;
-  BinaryFun binary_fun_;
-  DBinaryFun d_binary_fun_;
-};
 template <typename DeviceContext, typename T, typename BinaryFunctor,
          typename UnaryFunctor>
-static void RunBinaryCompoundFunctor(const framework::ExecutionContext &ctx,
+static void RunBinaryCompoundFunctor(
-                                     const BinaryFunctor &binary_functor,
+    const framework::ExecutionContext &ctx, const BinaryFunctor &binary_functor,
-                                     const UnaryFunctor &unary_functor,
+    const UnaryFunctor &unary_functor, const framework::Tensor &in_x,
-                                     const framework::Tensor *in_x,
+    const framework::Tensor &in_y, std::vector<framework::Tensor *> *outputs) {
-                                     const framework::Tensor *in_y,
+  // Z = Binary(X, Unary(Y))
-                                     framework::Tensor *output) {
+  // intermediate_out = Unary(Y)
+  // out = Binary(X, Unary(Y))
+  // In this case, the shapes of intermediate_out and out can differ.
+  paddle::operators::math::BinaryCompoundFunctor<T, BinaryFunctor, UnaryFunctor>
+      compound_func(binary_functor, unary_functor);
  int axis = ctx.Attr<int>("axis");
-  using BinaryCompoundFunctor =
+  if (ctx.Attr<bool>("keep_intermediate_value")) {
-      BinaryCompoundFunctor<T, BinaryFunctor, UnaryFunctor>;
+    FusedElemwiseAndActComputeEx<DeviceContext, T,
+                                 paddle::operators::math::BinaryCompoundFunctor<
-  ElementwiseComputeEx<BinaryCompoundFunctor, DeviceContext, T>(
+                                     T, BinaryFunctor, UnaryFunctor>,
-      ctx, in_x, in_y, axis,
+                                 true /*KeepIntermediateValue*/,
-      BinaryCompoundFunctor(binary_functor, unary_functor), output);
+                                 false /*SameShapeOfIntermediateOutAndOut*/>(
+        ctx, in_x, in_y, axis, compound_func, (*outputs)[0], (*outputs)[1]);
+  } else {
+    FusedElemwiseAndActComputeEx<DeviceContext, T,
+                                 paddle::operators::math::BinaryCompoundFunctor<
+                                     T, BinaryFunctor, UnaryFunctor>,
+                                 false /*KeepIntermediateValue*/,
+                                 false /*SameShapeOfIntermediateOutAndOut*/>(
+        ctx, in_x, in_y, axis, compound_func, (*outputs)[0], (*outputs)[1]);
+  }
 }
 template <typename DeviceContext, typename T, typename UnaryFunctor,
          typename BinaryFunctor>
-static void RunUnaryCompoundFunctors(const framework::ExecutionContext &ctx,
+static void RunUnaryCompoundFunctors(
-                                     const UnaryFunctor &unary_functor,
+    const framework::ExecutionContext &ctx, const UnaryFunctor &unary_functor,
-                                     const BinaryFunctor &binary_functor,
+    const BinaryFunctor &binary_functor, const framework::Tensor &in_x,
-                                     const framework::Tensor *in_x,
+    const framework::Tensor &in_y, std::vector<framework::Tensor *> *outputs) {
-                                     const framework::Tensor *in_y,
+  // Z = Unary(Binary(X, Y))
-                                     framework::Tensor *output) {
+  // intermediate_out = Binary(X, Y)
+  // out = Unary(Binary(X, Y))
+  // In this case, the shapes of intermediate_out and out are the same.
  int axis = ctx.Attr<int>("axis");
-  using UnaryCompoundFunctor =
+  paddle::operators::math::UnaryCompoundFunctor<T, UnaryFunctor, BinaryFunctor>
-      UnaryCompoundFunctor<T, UnaryFunctor, BinaryFunctor>;
+      compound_func(unary_functor, binary_functor);
-  ElementwiseComputeEx<UnaryCompoundFunctor, DeviceContext, T>(
+  if (ctx.Attr<bool>("keep_intermediate_value")) {
-      ctx, in_x, in_y, axis,
+    FusedElemwiseAndActComputeEx<DeviceContext, T,
-      UnaryCompoundFunctor(unary_functor, binary_functor), output);
+                                 paddle::operators::math::UnaryCompoundFunctor<
+                                     T, UnaryFunctor, BinaryFunctor>,
+                                 true /*KeepIntermediateValue*/,
+                                 true /*SameShapeOfIntermediateOutAndOut*/>(
+        ctx, in_x, in_y, axis, compound_func, (*outputs)[0], (*outputs)[1]);
+  } else {
+    FusedElemwiseAndActComputeEx<DeviceContext, T,
+                                 paddle::operators::math::UnaryCompoundFunctor<
+                                     T, UnaryFunctor, BinaryFunctor>,
+                                 false /*KeepIntermediateValue*/,
+                                 true /*SameShapeOfIntermediateOutAndOut*/>(
+        ctx, in_x, in_y, axis, compound_func, (*outputs)[0], (*outputs)[1]);
+  }
 }
 template <typename DeviceContext, typename T, typename BinaryGradFunctor,
-          typename UnaryFunctor, typename UnaryGradFunctor,
+          typename UnaryFunctor, typename UnaryGradFunctor>
-          bool Recomputation = true>
 static void RunBinaryCompoundGradFunctors(
    const framework::ExecutionContext &ctx,
    const BinaryGradFunctor &binary_grad_functor,
    const UnaryFunctor &unary_functor,
    const UnaryGradFunctor &unary_grad_functor, const framework::Tensor *in_x,
    const framework::Tensor *in_y, const framework::Tensor *in_out,
+    const framework::Tensor *in_intermediate_out,
    const framework::Tensor *in_out_grad, framework::Tensor *x_grad,
    framework::Tensor *y_grad) {
+  // Z = Binary(X, Unary(Y))
  int axis = ctx.Attr<int>("axis");
  using BinaryCompoundDxFunctor =
-      BinaryCompoundGradDxFunctor<T, BinaryGradFunctor, UnaryFunctor,
+      paddle::operators::math::BinaryCompoundGradDxFunctor<T, BinaryGradFunctor,
-                                  Recomputation>;
+                                                           UnaryFunctor>;
  using BinaryCompoundDyFunctor =
-      BinaryCompoundGradDyFunctor<T, BinaryGradFunctor, UnaryFunctor,
+      paddle::operators::math::BinaryCompoundGradDyFunctor<
-                                  UnaryGradFunctor, Recomputation>;
+          T, BinaryGradFunctor, UnaryFunctor, UnaryGradFunctor>;
-  ElemwiseGradCompute<DeviceContext, T, BinaryCompoundDxFunctor,
+  if (in_intermediate_out) {
-                      BinaryCompoundDyFunctor>(
+    FusedElemwiseAndActGradComputeEx<
-      ctx, *in_x, *in_y, *in_out, *in_out_grad, axis, x_grad, y_grad,
+        DeviceContext, T, BinaryCompoundDxFunctor, BinaryCompoundDyFunctor,
-      BinaryCompoundDxFunctor(binary_grad_functor, unary_functor),
+        true /*UseIntermediateOut*/,
+        false /*SameShapeOfIntermediateOutAndOut*/>(
+        ctx, in_x, in_y, in_out, in_intermediate_out, in_out_grad, axis, x_grad,
+        y_grad, BinaryCompoundDxFunctor(binary_grad_functor, unary_functor),
+        BinaryCompoundDyFunctor(binary_grad_functor, unary_functor,
+                                unary_grad_functor));
+  } else {
+    FusedElemwiseAndActGradComputeEx<
+        DeviceContext, T, BinaryCompoundDxFunctor, BinaryCompoundDyFunctor,
+        false /*UseIntermediateOut*/,
+        false /*SameShapeOfIntermediateOutAndOut*/>(
+        ctx, in_x, in_y, in_out, in_intermediate_out, in_out_grad, axis, x_grad,
+        y_grad, BinaryCompoundDxFunctor(binary_grad_functor, unary_functor),
        BinaryCompoundDyFunctor(binary_grad_functor, unary_functor,
                                unary_grad_functor));
+  }
 }
 template <typename DeviceContext, typename T, typename UnaryGradFunctor,
@@ -233,143 +139,159 @@ static void RunUnaryCompoundGradFunctors(
    const BinaryFunctor &binary_functor,
    const BinaryGradFunctor &binary_grad_functor, const framework::Tensor *in_x,
    const framework::Tensor *in_y, const framework::Tensor *in_out,
+    const framework::Tensor *in_intermediate_out,
    const framework::Tensor *in_out_grad, framework::Tensor *x_grad,
    framework::Tensor *y_grad) {
+  // Z = Unary(Binary(X, Y))
  int axis = ctx.Attr<int>("axis");
  using UnaryCompoundDxFunctor =
-      UnaryCompoundGradDxFunctor<T, UnaryGradFunctor, BinaryFunctor,
-                                 BinaryGradFunctor, Recomputation>;
+      paddle::operators::math::UnaryCompoundGradDxFunctor<
+          T, UnaryGradFunctor, BinaryFunctor, BinaryGradFunctor, Recomputation>;
  using UnaryCompoundDyFunctor =
-      UnaryCompoundGradDyFunctor<T, UnaryGradFunctor, BinaryFunctor,
-                                 BinaryGradFunctor, Recomputation>;
+      paddle::operators::math::UnaryCompoundGradDyFunctor<
+          T, UnaryGradFunctor, BinaryFunctor, BinaryGradFunctor, Recomputation>;
-  ElemwiseGradCompute<DeviceContext, T, UnaryCompoundDxFunctor,
-                      UnaryCompoundDyFunctor>(
-      ctx, *in_x, *in_y, *in_out, *in_out_grad, axis, x_grad, y_grad,
-      UnaryCompoundDxFunctor(unary_grad_functor, binary_functor,
+  if (in_intermediate_out) {
+    FusedElemwiseAndActGradComputeEx<
+        DeviceContext, T, UnaryCompoundDxFunctor, UnaryCompoundDyFunctor,
+        true /*UseIntermediateOut*/, true /*SameShapeOfIntermediateOutAndOut*/>(
+        ctx, in_x, in_y, in_out, in_intermediate_out, in_out_grad, axis, x_grad,
+        y_grad, UnaryCompoundDxFunctor(unary_grad_functor, binary_functor,
+                                       binary_grad_functor),
+        UnaryCompoundDyFunctor(unary_grad_functor, binary_functor,
+                               binary_grad_functor));
+  } else {
+    FusedElemwiseAndActGradComputeEx<DeviceContext, T, UnaryCompoundDxFunctor,
+                                     UnaryCompoundDyFunctor,
+                                     false /*UseIntermediateOut*/,
+                                     true /*SameShapeOfIntermediateOutAndOut*/>(
+        ctx, in_x, in_y, in_out, in_intermediate_out, in_out_grad, axis, x_grad,
+        y_grad, UnaryCompoundDxFunctor(unary_grad_functor, binary_functor,
                                       binary_grad_functor),
        UnaryCompoundDyFunctor(unary_grad_functor, binary_functor,
                               binary_grad_functor));
+  }
 }
 template <typename DeviceContext, typename T>
 static void RunFunctors(const framework::ExecutionContext &ctx,
-                        const framework::Tensor *in_x,
-                        const framework::Tensor *in_y,
-                        framework::Tensor *output) {
+                        const framework::Tensor &in_x,
+                        const framework::Tensor &in_y,
+                        std::vector<framework::Tensor *> *outputs) {
  auto &functors = ctx.Attr<std::vector<std::string>>("functor_list");
-  auto funcs_str = functors[0] + "," + functors[1];
  // TODO(zcd): The following code can be refined.
+  auto funcs_str = functors[0] + "," + functors[1];
  if (funcs_str == "elementwise_add,scale") {
    // Z = Binary(X, Unary(Y))
    T scale = static_cast<T>(ctx.Attr<float>("scale"));
-    RunBinaryCompoundFunctor<DeviceContext, T, math::AddFunctor<T>,
-                             math::ScaleFunctor<T>>(
-        ctx, math::AddFunctor<T>(), math::ScaleFunctor<T>(scale), in_x, in_y,
-        output);
+    RunBinaryCompoundFunctor<DeviceContext, T,
+                             paddle::operators::math::AddFunctor<T>,
+                             paddle::operators::math::ScaleFunctor<T>>(
+        ctx, paddle::operators::math::AddFunctor<T>(),
+        paddle::operators::math::ScaleFunctor<T>(scale), in_x, in_y, outputs);
  } else if (funcs_str == "scale,elementwise_add") {
    // Z = Unary(Binary(X, Y))
    T scale = static_cast<T>(ctx.Attr<float>("scale"));
-    RunUnaryCompoundFunctors<DeviceContext, T, math::ScaleFunctor<T>,
-                             math::AddFunctor<T>>(
-        ctx, math::ScaleFunctor<T>(scale), math::AddFunctor<T>(), in_x, in_y,
-        output);
+    RunUnaryCompoundFunctors<DeviceContext, T,
+                             paddle::operators::math::ScaleFunctor<T>,
+                             paddle::operators::math::AddFunctor<T>>(
+        ctx, paddle::operators::math::ScaleFunctor<T>(scale),
+        paddle::operators::math::AddFunctor<T>(), in_x, in_y, outputs);
  } else if (funcs_str == "elementwise_add,relu") {
-    RunBinaryCompoundFunctor<DeviceContext, T, math::AddFunctor<T>,
-                             math::ReluFunctor<T>>(
-        ctx, math::AddFunctor<T>(), math::ReluFunctor<T>(), in_x, in_y, output);
+    // Z = Binary(X, Unary(Y))
+    RunBinaryCompoundFunctor<DeviceContext, T,
+                             paddle::operators::math::AddFunctor<T>,
+                             paddle::operators::math::ReluFunctor<T>>(
+        ctx, paddle::operators::math::AddFunctor<T>(),
+        paddle::operators::math::ReluFunctor<T>(), in_x, in_y, outputs);
  } else if (funcs_str == "relu,elementwise_add") {
-    RunUnaryCompoundFunctors<DeviceContext, T, math::ReluFunctor<T>,
-                             math::AddFunctor<T>>(
-        ctx, math::ReluFunctor<T>(), math::AddFunctor<T>(), in_x, in_y, output);
+    // Z = Unary(Binary(X, Y))
+    RunUnaryCompoundFunctors<DeviceContext, T,
+                             paddle::operators::math::ReluFunctor<T>,
+                             paddle::operators::math::AddFunctor<T>>(
+        ctx, paddle::operators::math::ReluFunctor<T>(),
+        paddle::operators::math::AddFunctor<T>(), in_x, in_y, outputs);
+  } else if (funcs_str == "elementwise_mul,scale") {
+    // Z = Binary(X, Unary(Y))
+    T scale = static_cast<T>(ctx.Attr<float>("scale"));
+    RunBinaryCompoundFunctor<DeviceContext, T,
+                             paddle::operators::math::MulFunctor<T>,
+                             paddle::operators::math::ScaleFunctor<T>>(
+        ctx, paddle::operators::math::MulFunctor<T>(),
+        paddle::operators::math::ScaleFunctor<T>(scale), in_x, in_y, outputs);
  } else {
    PADDLE_THROW("%s has not been implemented.", funcs_str);
  }
 }
-template <typename DeviceContext, typename T>
+template <typename DeviceContext, typename T, bool ReComputation>
 static void RunGradFunctors(const framework::ExecutionContext &ctx,
                            const framework::Tensor *in_x,
                            const framework::Tensor *in_y,
                            const framework::Tensor *in_out,
+                            const framework::Tensor *in_intermediate_out,
                            const framework::Tensor *in_out_grad,
                            framework::Tensor *x_grad,
                            framework::Tensor *y_grad) {
  auto &functors = ctx.Attr<std::vector<std::string>>("functor_list");
  auto funcs_str = functors[0] + "," + functors[1];
-  bool recomputation = ctx.Attr<bool>("recomputation");
-  // TODO(zcd): The following code can be refined. for example, use registion
+  // TODO(zcd): The following code can be refined, for example, by using
+  // registration.
  if (funcs_str == "elementwise_add_grad,scale_grad") {
    // The backward of Z = Binary(X, Unary(Y))
    T scale = static_cast<T>(ctx.Attr<float>("scale"));
-    if (recomputation) {
-      RunBinaryCompoundGradFunctors<DeviceContext, T, math::AddGradFunctor<T>,
-                                    math::ScaleFunctor<T>,
-                                    math::ScaleGradFunctor<T>, true>(
-          ctx, math::AddGradFunctor<T>(), math::ScaleFunctor<T>(scale),
-          math::ScaleGradFunctor<T>(scale), in_x, in_y, in_out, in_out_grad,
-          x_grad, y_grad);
-    } else {
-      RunBinaryCompoundGradFunctors<DeviceContext, T, math::AddGradFunctor<T>,
-                                    math::ScaleFunctor<T>,
-                                    math::ScaleGradFunctor<T>, false>(
-          ctx, math::AddGradFunctor<T>(), math::ScaleFunctor<T>(scale),
-          math::ScaleGradFunctor<T>(scale), in_x, in_y, in_out, in_out_grad,
-          x_grad, y_grad);
-    }
+    RunBinaryCompoundGradFunctors<DeviceContext, T,
+                                  paddle::operators::math::AddGradFunctor<T>,
+                                  paddle::operators::math::ScaleFunctor<T>,
+                                  paddle::operators::math::ScaleGradFunctor<T>>(
+        ctx, paddle::operators::math::AddGradFunctor<T>(),
+        paddle::operators::math::ScaleFunctor<T>(scale),
+        paddle::operators::math::ScaleGradFunctor<T>(scale), in_x, in_y, in_out,
+        in_intermediate_out, in_out_grad, x_grad, y_grad);
  } else if (funcs_str == "scale_grad,elementwise_add_grad") {
    // The backward of Z = Unary(Binary(X, Y))
    T scale = static_cast<T>(ctx.Attr<float>("scale"));
-    if (recomputation) {
-      RunUnaryCompoundGradFunctors<DeviceContext, T, math::ScaleGradFunctor<T>,
-                                   math::AddFunctor<T>, math::AddGradFunctor<T>,
-                                   true>(ctx, math::ScaleGradFunctor<T>(scale),
-                                         math::AddFunctor<T>(),
-                                         math::AddGradFunctor<T>(), in_x, in_y,
-                                         in_out, in_out_grad, x_grad, y_grad);
-    } else {
-      RunUnaryCompoundGradFunctors<DeviceContext, T, math::ScaleGradFunctor<T>,
-                                   math::AddFunctor<T>, math::AddGradFunctor<T>,
-                                   false>(ctx, math::ScaleGradFunctor<T>(scale),
-                                          math::AddFunctor<T>(),
-                                          math::AddGradFunctor<T>(), in_x, in_y,
-                                          in_out, in_out_grad, x_grad, y_grad);
-    }
+    RunUnaryCompoundGradFunctors<DeviceContext, T,
+                                 paddle::operators::math::ScaleGradFunctor<T>,
+                                 paddle::operators::math::AddFunctor<T>,
+                                 paddle::operators::math::AddGradFunctor<T>,
+                                 ReComputation /*Recomputation*/>(
+        ctx, paddle::operators::math::ScaleGradFunctor<T>(scale),
+        paddle::operators::math::AddFunctor<T>(),
+        paddle::operators::math::AddGradFunctor<T>(), in_x, in_y, in_out,
+        in_intermediate_out, in_out_grad, x_grad, y_grad);
  } else if (funcs_str == "elementwise_add_grad,relu_grad") {
-    if (recomputation) {
-      RunBinaryCompoundGradFunctors<DeviceContext, T, math::AddGradFunctor<T>,
-                                    math::ReluFunctor<T>,
-                                    math::ReluGradFunctor<T>, true>(
-          ctx, math::AddGradFunctor<T>(), math::ReluFunctor<T>(),
-          math::ReluGradFunctor<T>(), in_x, in_y, in_out, in_out_grad, x_grad,
-          y_grad);
-    } else {
-      RunBinaryCompoundGradFunctors<DeviceContext, T, math::AddGradFunctor<T>,
-                                    math::ReluFunctor<T>,
-                                    math::ReluGradFunctor<T>, false>(
-          ctx, math::AddGradFunctor<T>(), math::ReluFunctor<T>(),
-          math::ReluGradFunctor<T>(), in_x, in_y, in_out, in_out_grad, x_grad,
-          y_grad);
-    }
+    RunBinaryCompoundGradFunctors<DeviceContext, T,
+                                  paddle::operators::math::AddGradFunctor<T>,
+                                  paddle::operators::math::ReluFunctor<T>,
+                                  paddle::operators::math::ReluGradFunctor<T>>(
+        ctx, paddle::operators::math::AddGradFunctor<T>(),
+        paddle::operators::math::ReluFunctor<T>(),
+        paddle::operators::math::ReluGradFunctor<T>(), in_x, in_y, in_out,
+        in_intermediate_out, in_out_grad, x_grad, y_grad);
  } else if (funcs_str == "relu_grad,elementwise_add_grad") {
-    if (recomputation) {
-      RunUnaryCompoundGradFunctors<DeviceContext, T, math::ReluGradFunctor<T>,
-                                   math::AddFunctor<T>, math::AddGradFunctor<T>,
-                                   true>(ctx, math::ReluGradFunctor<T>(),
-                                         math::AddFunctor<T>(),
-                                         math::AddGradFunctor<T>(), in_x, in_y,
-                                         in_out, in_out_grad, x_grad, y_grad);
-    } else {
-      RunUnaryCompoundGradFunctors<DeviceContext, T, math::ReluGradFunctor<T>,
-                                   math::AddFunctor<T>, math::AddGradFunctor<T>,
-                                   false>(ctx, math::ReluGradFunctor<T>(),
-                                          math::AddFunctor<T>(),
-                                          math::AddGradFunctor<T>(), in_x, in_y,
-                                          in_out, in_out_grad, x_grad, y_grad);
-    }
+    RunUnaryCompoundGradFunctors<DeviceContext, T,
+                                 paddle::operators::math::ReluGradFunctor<T>,
+                                 paddle::operators::math::AddFunctor<T>,
+                                 paddle::operators::math::AddGradFunctor<T>,
+                                 ReComputation /*Recomputation*/>(
+        ctx, paddle::operators::math::ReluGradFunctor<T>(),
+        paddle::operators::math::AddFunctor<T>(),
+        paddle::operators::math::AddGradFunctor<T>(), in_x, in_y, in_out,
+        in_intermediate_out, in_out_grad, x_grad, y_grad);
+  } else if (funcs_str == "elementwise_mul_grad,scale_grad") {
+    // The backward of Z = Binary(X, Unary(Y))
+    T scale = static_cast<T>(ctx.Attr<float>("scale"));
+    RunBinaryCompoundGradFunctors<DeviceContext, T,
+                                  paddle::operators::math::MulGradFunctor<T>,
+                                  paddle::operators::math::ScaleFunctor<T>,
+                                  paddle::operators::math::ScaleGradFunctor<T>>(
+        ctx, paddle::operators::math::MulGradFunctor<T>(),
+        paddle::operators::math::ScaleFunctor<T>(scale),
+        paddle::operators::math::ScaleGradFunctor<T>(scale), in_x, in_y, in_out,
+        in_intermediate_out, in_out_grad, x_grad, y_grad);
  } else {
    PADDLE_THROW("%s has not been implemented.", funcs_str);
  }
@@ -385,11 +307,23 @@ class FusedElemwiseActivationKernel : public framework::OpKernel<T> {
    auto &in_y = detail::Ref(ctx.Input<framework::Tensor>("Y"),
                             "Cannot get input tensor %s, variable name = %s",
                             "Y", ctx.op().Input("Y"));
-    auto &output = detail::Ref(ctx.Output<framework::Tensor>("Out"),
-                               "Cannot get input tensor %s, variable name = %s",
-                               "Out", ctx.op().Output("Out"));
+    PADDLE_ENFORCE(ctx.HasOutput("Out"), "The output(Out) should not be empty");
+    auto output = ctx.Output<framework::Tensor>("Out");
+    std::vector<framework::Tensor *> outputs;
+    outputs.emplace_back(output);
+    if (ctx.Attr<bool>("keep_intermediate_value")) {
+      PADDLE_ENFORCE(ctx.HasOutput("IntermediateOut"),
+                     "keep_intermediate_value is enabled, so "
+                     "IntermediateOut should not be empty.");
+      auto intermediate_out = ctx.Output<framework::Tensor>("IntermediateOut");
+      outputs.emplace_back(intermediate_out);
+    } else {
+      outputs.emplace_back(nullptr);
+    }
-    RunFunctors<DeviceContext, T>(ctx, &in_x, &in_y, &output);
+    RunFunctors<DeviceContext, T>(ctx, in_x, in_y, &outputs);
  }
 };
@@ -397,28 +331,66 @@ template <typename DeviceContext, typename T>
 class FusedElemwiseActivationGradKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext &ctx) const override {
-    auto &in_x = detail::Ref(ctx.Input<framework::Tensor>("X"),
-                             "Cannot get input tensor %s, variable name = %s",
-                             "X", ctx.op().Input("X"));
-    auto &in_y = detail::Ref(ctx.Input<framework::Tensor>("Y"),
-                             "Cannot get input tensor %s, variable name = %s",
-                             "Y", ctx.op().Input("Y"));
-    auto &in_out = detail::Ref(ctx.Input<framework::Tensor>("Out"),
-                               "Cannot get input tensor %s, variable name = %s",
-                               "Out", ctx.op().Input("Out"));
-    auto &in_out_grad =
-        detail::Ref(ctx.Input<framework::Tensor>(framework::GradVarName("Out")),
-                    "Cannot get input tensor %s, variable name = %s",
-                    framework::GradVarName("Out"),
-                    ctx.op().Input(framework::GradVarName("Out")));
+    auto x = ctx.Input<framework::Tensor>("X");
+    auto y = ctx.Input<framework::Tensor>("Y");
+    auto in_out = ctx.Input<framework::Tensor>("Out");
+    auto in_out_grad =
+        ctx.Input<framework::Tensor>(framework::GradVarName("Out"));
    framework::Tensor *x_grad =
        ctx.Output<framework::Tensor>(framework::GradVarName("X"));
    framework::Tensor *y_grad =
        ctx.Output<framework::Tensor>(framework::GradVarName("Y"));
-    RunGradFunctors<DeviceContext, T>(ctx, &in_x, &in_y, &in_out, &in_out_grad,
-                                      x_grad, y_grad);
+    PADDLE_ENFORCE(y != nullptr, "Input(Y) should not be nullptr.");
+    if (ctx.Attr<bool>("recomputation")) {
+      PADDLE_ENFORCE(
+          x != nullptr,
+          "Recomputation is enabled, so Input(X) should not be absent.");
+    } else {
+      PADDLE_ENFORCE(in_out != nullptr,
+                     "The recomputation is disabled, so the Input('Out') "
+                     "should not be empty.");
+    }
+    framework::Tensor *in_x;
+    auto functor_list = ctx.Attr<std::vector<std::string>>("functor_list");
+    // If functor_list contains elementwise_add_grad, the backward pass does
+    // not use in_x or in_out.
+    if (x == nullptr) {
+      PADDLE_ENFORCE(functor_list[0] == "elementwise_add_grad" ||
+                         functor_list[1] == "elementwise_add_grad",
+                     "Only when the compoundfunctor contains "
+                     "elementwise_add_grad, the 'X' could be absent.");
+      in_x = const_cast<framework::Tensor *>(in_out_grad);
+      in_out = const_cast<framework::Tensor *>(in_out_grad);
+    } else {
+      in_x = const_cast<framework::Tensor *>(x);
+    }
+    framework::Tensor *in_intermediate_out;
+    if (ctx.Attr<bool>("keep_intermediate_value")) {
+      in_intermediate_out = const_cast<framework::Tensor *>(
+          ctx.Input<framework::Tensor>("IntermediateOut"));
+      PADDLE_ENFORCE(in_intermediate_out != nullptr,
+                     "'keep_intermediate_value' is enabled, so "
+                     "Input('IntermediateOut') should not be empty.");
+    } else {
+      in_intermediate_out = nullptr;
+    }
+    if (ctx.Attr<bool>("recomputation")) {
+      RunGradFunctors<DeviceContext, T, true /*Recomputation*/>(
+          ctx, in_x, y, in_out, in_intermediate_out, in_out_grad, x_grad,
+          y_grad);
+    } else {
+      RunGradFunctors<DeviceContext, T, false /*Recomputation*/>(
+          ctx, in_x, y, in_out, in_intermediate_out, in_out_grad, x_grad,
+          y_grad);
+    }
  }
 };
 }  // namespace operators
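The TODO in `RunFunctors` notes that the string-matching dispatch could be refined with registration. A minimal Python sketch of that idea follows, assuming scalar inputs and that `scale` multiplies by a coefficient while `relu` clamps at zero (neither functor body appears in this diff). The names `BINARY_FUNCTORS`, `make_unary`, and `run_functors` are hypothetical and not part of Paddle. Unlike the C++ code, this sketch accepts any binary/unary pairing rather than only the five listed combinations.

```python
import operator

# Hypothetical registry; the keys reuse the functor names from the diff.
BINARY_FUNCTORS = {
    "elementwise_add": operator.add,
    "elementwise_mul": operator.mul,
}


def make_unary(name, scale=1.0):
    # Assumed semantics: scale(v) = scale * v, relu(v) = max(v, 0).
    if name == "scale":
        return lambda v: scale * v
    if name == "relu":
        return lambda v: max(v, 0.0)
    raise NotImplementedError("%s has not been implemented." % name)


def run_functors(functor_list, x, y, scale=1.0):
    """Return (out, intermediate_out) for a two-element functor_list."""
    first, second = functor_list
    if first in BINARY_FUNCTORS:
        # Z = Binary(X, Unary(Y)); the intermediate value is Unary(Y).
        intermediate = make_unary(second, scale)(y)
        return BINARY_FUNCTORS[first](x, intermediate), intermediate
    if second in BINARY_FUNCTORS:
        # Z = Unary(Binary(X, Y)); the intermediate value is Binary(X, Y).
        intermediate = BINARY_FUNCTORS[second](x, y)
        return make_unary(first, scale)(intermediate), intermediate
    raise NotImplementedError(
        "%s has not been implemented." % ",".join(functor_list))
```

The second return value plays the role of `IntermediateOut`: a caller that does not need it can drop it, which corresponds to `keep_intermediate_value` being false.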

--- a/paddle/fluid/operators/gather_op.cc
+++ b/paddle/fluid/operators/gather_op.cc
@@ -101,5 +101,8 @@ namespace ops = paddle::operators;
 REGISTER_OPERATOR(gather, ops::GatherOp, ops::GatherOpMaker,
                  paddle::framework::DefaultGradOpDescMaker<true>);
 REGISTER_OPERATOR(gather_grad, ops::GatherGradOp);
-REGISTER_OP_CPU_KERNEL(gather, ops::GatherOpKernel<float>);
-REGISTER_OP_CPU_KERNEL(gather_grad, ops::GatherGradientOpKernel<float>);
+REGISTER_OP_CPU_KERNEL(gather, ops::GatherOpKernel<float>,
+                       ops::GatherOpKernel<int>, ops::GatherOpKernel<double>);
+REGISTER_OP_CPU_KERNEL(gather_grad, ops::GatherGradientOpKernel<float>,
+                       ops::GatherGradientOpKernel<int>,
+                       ops::GatherGradientOpKernel<double>);
--- a/paddle/fluid/operators/math/compound_functors.h
+++ b/paddle/fluid/operators/math/compound_functors.h
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#pragma once
+#include <string>
+#include <unordered_set>
+#include <vector>
+namespace paddle {
+namespace operators {
+namespace math {
+template <typename T, typename BinaryFunctor, typename UnaryFunctor>
+struct BinaryCompoundFunctor {
+  BinaryCompoundFunctor(const BinaryFunctor func1, const UnaryFunctor func2)
+      : func1_(func1), func2_(func2) {}
+  // Z = BinaryFunctor(X, UnaryFunctor(Y))
+  inline HOSTDEVICE T GetOut(T x, T y) { return func1_(x, func2_(y)); }
+  inline HOSTDEVICE T GetOutUseIntermediateOut(T x, T intermediat_out) {
+    return func1_(x, intermediat_out);
+  }
+  inline HOSTDEVICE T GetIntermediateOut(T x, T y) { return func2_(y); }
+  BinaryFunctor func1_;
+  UnaryFunctor func2_;
+};
+template <typename T, typename UnaryFunctor, typename BinaryFunctor>
+struct UnaryCompoundFunctor {
+  UnaryCompoundFunctor(const UnaryFunctor func1, const BinaryFunctor func2)
+      : func1_(func1), func2_(func2) {}
+  // Z = UnaryFunctor(BinaryFunctor(X, Y))
+  inline HOSTDEVICE T GetOut(T x, T y) { return func1_(func2_(x, y)); }
+  inline HOSTDEVICE T GetOutUseIntermediateOut(T x, T intermediat_out) {
+    return func1_(intermediat_out);
+  }
+  inline HOSTDEVICE T GetIntermediateOut(T x, T y) { return func2_(x, y); }
+  UnaryFunctor func1_;
+  BinaryFunctor func2_;
+};
+// FIXME(zcd): DBinaryFun and DUnaryFun have two methods to compute
+// dx: one uses 'out' and the other does not.
+// The former saves the time of recomputing 'out', but it needs
+// memory to store 'out'.
+// The latter avoids occupying that memory, but it must
+// recompute 'out'.
+template <typename T, typename DBinaryFun, typename UnaryFun>
+struct BinaryCompoundGradDxFunctor {
+  BinaryCompoundGradDxFunctor(const DBinaryFun &d_binary_fun,
+                              const UnaryFun &unary_fun)
+      : d_binary_fun_(d_binary_fun), unary_fun_(unary_fun) {}
+  inline HOSTDEVICE T operator()(T x, T y, T out, T dout) {
+    return dout * d_binary_fun_.Dx(x, unary_fun_(y));
+  }
+  inline HOSTDEVICE T operator()(T x, T y, T intermediate_out, T out, T dout) {
+    return dout * d_binary_fun_.Dx(x, intermediate_out);
+  }
+ private:
+  DBinaryFun d_binary_fun_;
+  UnaryFun unary_fun_;
+};
+template <typename T, typename DBinaryFun, typename UnaryFun,
+          typename DUnaryFun>
+struct BinaryCompoundGradDyFunctor {
+  BinaryCompoundGradDyFunctor(const DBinaryFun &d_binary_fun,
+                              const UnaryFun &unary_fun,
+                              const DUnaryFun &d_unary_fun)
+      : d_binary_fun_(d_binary_fun),
+        unary_fun_(unary_fun),
+        d_unary_fun_(d_unary_fun) {}
+  inline HOSTDEVICE T operator()(T x, T y, T out, T dout) {
+    return dout * d_binary_fun_.Dy(x, unary_fun_(y)) * d_unary_fun_(y);
+  }
+  inline HOSTDEVICE T operator()(T x, T y, T intermediate_out, T out, T dout) {
+    return dout * d_binary_fun_.Dy(x, intermediate_out) *
+           d_unary_fun_(y, intermediate_out);
+  }
+ private:
+  DBinaryFun d_binary_fun_;
+  UnaryFun unary_fun_;
+  DUnaryFun d_unary_fun_;
+};
+template <typename T, typename DUnaryFun, typename BinaryFun,
+          typename DBinaryFun, bool Recomputation = true>
+struct UnaryCompoundGradDxFunctor {
+  UnaryCompoundGradDxFunctor(const DUnaryFun &d_unary_fun,
+                             const BinaryFun &binary_fun,
+                             const DBinaryFun &d_binary_fun)
+      : d_unary_fun_(d_unary_fun),
+        binary_fun_(binary_fun),
+        d_binary_fun_(d_binary_fun) {}
+  inline HOSTDEVICE T operator()(T x, T y, T out, T dout) {
+    T base;
+    if (Recomputation) {
+      base = dout * d_unary_fun_(binary_fun_(x, y));
+    } else {
+      base = dout * d_unary_fun_(binary_fun_(x, y), out);
+    }
+    return base * d_binary_fun_.Dx(x, y);
+  }
+  inline HOSTDEVICE T operator()(T x, T y, T intermediate_out, T out, T dout) {
+    T base;
+    if (Recomputation) {
+      base = dout * d_unary_fun_(intermediate_out);
+    } else {
+      base = dout * d_unary_fun_(intermediate_out, out);
+    }
+    return base * d_binary_fun_.Dx(x, y);
+  }
+ private:
+  DUnaryFun d_unary_fun_;
+  BinaryFun binary_fun_;
+  DBinaryFun d_binary_fun_;
+};
+template <typename T, typename DUnaryFun, typename BinaryFun,
+          typename DBinaryFun, bool Recomputation = true>
+struct UnaryCompoundGradDyFunctor {
+  UnaryCompoundGradDyFunctor(const DUnaryFun &d_unary_fun,
+                             const BinaryFun &binary_fun,
+                             const DBinaryFun &d_binary_fun)
+      : d_unary_fun_(d_unary_fun),
+        binary_fun_(binary_fun),
+        d_binary_fun_(d_binary_fun) {}
+  inline HOSTDEVICE T operator()(T x, T y, T out, T dout) {
+    T base;
+    if (Recomputation) {
+      base = dout * d_unary_fun_(binary_fun_(x, y));
+    } else {
+      base = dout * d_unary_fun_(binary_fun_(x, y), out);
+    }
+    return base * d_binary_fun_.Dy(x, y);
+  }
+  inline HOSTDEVICE T operator()(T x, T y, T intermediate_out, T out, T dout) {
+    T base;
+    if (Recomputation) {
+      base = dout * d_unary_fun_(intermediate_out);
+    } else {
+      base = dout * d_unary_fun_(intermediate_out, out);
+    }
+    return base * d_binary_fun_.Dy(x, y);
+  }
+ private:
+  DUnaryFun d_unary_fun_;
+  BinaryFun binary_fun_;
+  DBinaryFun d_binary_fun_;
+};
+}  // namespace math
+}  // namespace operators
+}  // namespace paddle
--- a/paddle/fluid/operators/math/functors.h
+++ b/paddle/fluid/operators/math/functors.h
@@ -18,6 +18,19 @@ namespace paddle {
 namespace operators {
 namespace math {
+// MulFunctor
+template <typename T>
+struct MulFunctor {
+  // out = x * y;
+  inline HOSTDEVICE T operator()(T x, T y) { return x * y; }
+};
+template <typename T>
+struct MulGradFunctor {
+  inline HOSTDEVICE T Dx(T x, T y) { return y; }
+  inline HOSTDEVICE T Dy(T x, T y) { return x; }
+};
 // AddFunctor
 template <typename T>
 struct AddFunctor {
@@ -27,9 +40,8 @@ struct AddFunctor {
 template <typename T>
 struct AddGradFunctor {
-  inline HOSTDEVICE T operator()(T x, T y) { return 1; }
+  inline HOSTDEVICE T Dx(T x, T y) { return 1; }
+  inline HOSTDEVICE T Dy(T x, T y) { return 1; }
-  inline HOSTDEVICE T operator()(T x, T y, T out) const { return 1; }
 };
 template <typename T>
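The gradient functors in compound_functors.h apply the chain rule to Z = Binary(X, Unary(Y)). A minimal Python sketch of that rule for the `elementwise_mul,scale` pair is given below, using scalar inputs. It assumes the scale functor computes `coeff * y` with derivative `coeff`, since the bodies of `ScaleFunctor` and `ScaleGradFunctor` are not shown in this diff. All function names are illustrative only.

```python
def mul_dx(x, y):
    # d(x * y)/dx, mirroring MulGradFunctor::Dx
    return y


def mul_dy(x, y):
    # d(x * y)/dy, mirroring MulGradFunctor::Dy
    return x


def make_scale(coeff):
    # Assumed semantics: scale(y) = coeff * y, so its derivative is coeff.
    def scale(y):
        return coeff * y

    def scale_grad(y):
        return coeff

    return scale, scale_grad


def binary_compound_grads(x, y, dout, unary, d_unary):
    """Gradients of Z = x * unary(y) with respect to x and y."""
    u = unary(y)  # the intermediate output, recomputed here
    dx = dout * mul_dx(x, u)
    dy = dout * mul_dy(x, u) * d_unary(y)
    return dx, dy


scale, scale_grad = make_scale(3.0)
dx, dy = binary_compound_grads(2.0, 5.0, 1.0, scale, scale_grad)
```

With Z = x * 3y, the analytic gradients are dZ/dx = 3y and dZ/dy = 3x, which is what `dx` and `dy` hold for x = 2 and y = 5. Passing a stored `u` instead of recomputing it corresponds to the `intermediate_out` overloads in the header.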

--- a/paddle/fluid/platform/device_context.h
+++ b/paddle/fluid/platform/device_context.h
@@ -24,7 +24,7 @@ limitations under the License. */
 #endif
 #ifdef PADDLE_WITH_MKLDNN
-#include <mkldnn.hpp>
+#include "mkldnn.hpp"
 #endif
 #include <map>

--- a/paddle/fluid/pybind/const_value.cc
+++ b/paddle/fluid/pybind/const_value.cc
@@ -43,6 +43,9 @@ void BindConstValue(pybind11::module* m) {
  op_proto_and_checker_maker.def(
      "kOpRoleVarAttrName",
      framework::OpProtoAndCheckerMaker::OpRoleVarAttrName);
+  op_proto_and_checker_maker.def(
+      "kOpNameScopeAttrName",
+      framework::OpProtoAndCheckerMaker::OpNamescopeAttrName);
 }
 }  // namespace pybind

--- a/python/paddle/fluid/contrib/memory_usage_calc.py
+++ b/python/paddle/fluid/contrib/memory_usage_calc.py
@@ -70,17 +70,31 @@ def memory_usage(program, batch_size):
    if not isinstance(program, Program):
        raise TypeError(
            "Calculating Memory Usage requires Program as its Parameter."
-            "But you passed in %s" % (type(prgram)))
+            "But you passed in %s" % (type(program)))
    if batch_size <= 0:
        raise ValueError("The batch size need to be positive.")
    # Get the var_name list of first block and calculate
    total_memory = 0.0
-    for var in six.itervalues(program.global_block().vars):
+    processed_var_names = set()
+    for op in program.global_block().ops:
+        for var_name in op.output_arg_names:
+            if var_name in processed_var_names:
+                continue
+            processed_var_names.add(var_name)
+            var = program.global_block().vars[var_name]
+            if var.desc.type() != core.VarDesc.VarType.LOD_TENSOR:
+                continue
            data_count = 1
+            neg_dim_count = 0
            for x in var.shape:
-            if x == -1:
-                data_count *= batch_size
+                if x < 0:
+                    if neg_dim_count >= 1:
+                        raise ValueError("Var %s has more than one negative dim."
+                                         % (var_name))
+                    neg_dim_count += 1
+                    data_count *= batch_size * (-x)
                else:
                    data_count *= x
            var_memory = data_count * dtype_to_size[var.dtype]
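The updated loop treats any negative dimension as a batch placeholder and multiplies the element count by `batch_size * (-x)`, rejecting shapes with more than one negative dimension. The standalone sketch below isolates that arithmetic. The helper name `element_count` is hypothetical, and it takes a plain shape list rather than a Paddle variable.

```python
def element_count(shape, batch_size, var_name="var"):
    """Count elements, resolving one negative dim against batch_size."""
    data_count = 1
    neg_dim_count = 0
    for x in shape:
        if x < 0:
            if neg_dim_count >= 1:
                raise ValueError(
                    "Var %s has more than one negative dim." % var_name)
            neg_dim_count += 1
            # A dim of -k contributes batch_size * k elements.
            data_count *= batch_size * (-x)
        else:
            data_count *= x
    return data_count


cifar_like = element_count([-1, 3, 32, 32], batch_size=8)
doubled = element_count([-2, 10], batch_size=4)
```

Here `cifar_like` is 8 * 3 * 32 * 32 and `doubled` is 4 * 2 * 10. Multiplying the result by the dtype size in bytes gives the per-variable memory that the surrounding function accumulates.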

--- a/python/paddle/fluid/framework.py
+++ b/python/paddle/fluid/framework.py
@@ -43,6 +43,7 @@ __all__ = [
    'default_main_program',
    'program_guard',
    'get_var',
+    'name_scope',
 ]
 EMPTY_VAR_NAME = core.kEmptyVarName()
@@ -52,6 +53,70 @@ ZERO_VAR_SUFFIX = core.kZeroVarSuffix()
 CONTROL_DEP_VAR_PREFIX = core.kControlDepVarName()
+class NameScope(object):
+    def __init__(self, name="", parent=None):
+        self._children = dict()
+        self._name = name
+        self._parent = parent
+    def child(self, prefix):
+        if prefix not in self._children:
+            new_child = NameScope(prefix, self)
+            self._children[prefix] = [new_child]
+        else:
+            new_child = NameScope(prefix + "_%d" % len(self._children[prefix]),
+                                  self)
+            self._children[prefix].append(new_child)
+        return new_child
+    def parent(self):
+        return self._parent
+    def name(self):
+        return self._name
+_name_scope = NameScope()
+@contextlib.contextmanager
+def name_scope(prefix=None):
+    """
+    Generate hierarchical name prefix for the operators.
+    Note: This should only be used for debugging and visualization purposes.
+    Don't use it for serious analysis such as graph/program transformations.
+    Args:
+        prefix(str): prefix.
+    Examples:
+        .. code-block:: python
+          with name_scope("encoder"):
+             ...
+          with name_scope("decoder"):
+             ...
+             with name_scope("attention"):
+                ...
+    """
+    # TODO(panyx0718): Only [0-9a-z].
+    assert prefix, "namescope prefix cannot be empty."
+    global _name_scope
+    _name_scope = _name_scope.child(prefix)
+    yield
+    _name_scope = _name_scope.parent()
+def _full_name_scope():
+    global _name_scope
+    scope = _name_scope
+    name = ""
+    while scope:
+        name = scope.name() + "/" + name
+        scope = scope.parent()
+    return name
 def generate_control_dev_var_name():
    import random
    return CONTROL_DEP_VAR_PREFIX + "@" + str(random.random())
@@ -515,6 +580,9 @@ class Operator(object):
        self.desc.set_type(type)
        proto = OpProtoHolder.instance().get_op_proto(type)
+        namescope_var_name = op_maker.kOpNameScopeAttrName()
+        op_attrs[namescope_var_name] = _full_name_scope()
        def find_name(var_list, name):
            for var_name in var_list:
                if var_list[var_name] is not None and var_name == name:
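To see what names the `name_scope` context manager produces, here is a self-contained sketch that copies the `NameScope` logic from the diff. Two things differ and are assumptions of this sketch: the helper is named `full_name_scope` (no leading underscore), and the scope is restored in a `try`/`finally` so that an exception inside the `with` body does not leave the global scope pointing at a child. The diff itself restores the scope only on normal exit.

```python
import contextlib


class NameScope(object):
    def __init__(self, name="", parent=None):
        self._children = dict()
        self._name = name
        self._parent = parent

    def child(self, prefix):
        if prefix not in self._children:
            new_child = NameScope(prefix, self)
            self._children[prefix] = [new_child]
        else:
            # Repeated prefixes get a numeric suffix: encoder, encoder_1, ...
            new_child = NameScope(
                prefix + "_%d" % len(self._children[prefix]), self)
            self._children[prefix].append(new_child)
        return new_child

    def parent(self):
        return self._parent

    def name(self):
        return self._name


_name_scope = NameScope()


@contextlib.contextmanager
def name_scope(prefix):
    global _name_scope
    assert prefix, "namescope prefix cannot be empty."
    _name_scope = _name_scope.child(prefix)
    try:
        yield
    finally:
        # Added in this sketch: restore the parent even if the body raises.
        _name_scope = _name_scope.parent()


def full_name_scope():
    scope = _name_scope
    name = ""
    while scope:
        name = scope.name() + "/" + name
        scope = scope.parent()
    return name


names = []
with name_scope("encoder"):
    names.append(full_name_scope())
with name_scope("decoder"):
    with name_scope("attention"):
        names.append(full_name_scope())
with name_scope("encoder"):
    names.append(full_name_scope())
```

Because the root scope has an empty name, every full name starts with `/`; the three recorded names are `/encoder/`, `/decoder/attention/`, and `/encoder_1/`.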

--- a/python/paddle/fluid/layers/detection.py
+++ b/python/paddle/fluid/layers/detection.py
@@ -39,6 +39,7 @@ __all__ = [
    'detection_map',
    'rpn_target_assign',
    'anchor_generator',
+    'generate_proposal_labels',
    'generate_proposals',
 ]
@@ -57,6 +58,7 @@ for _OP in set(__auto__):
 def rpn_target_assign(loc,
                      scores,
                      anchor_box,
+                      anchor_var,
                      gt_box,
                      rpn_batch_size_per_im=256,
                      fg_fraction=0.25,
@@ -95,6 +97,8 @@ def rpn_target_assign(loc,
            if the input is image feature map, they are close to the origin
            of the coordinate system. [xmax, ymax] is the right bottom
            coordinate of the anchor box.
+        anchor_var(Variable): A 2-D Tensor with shape [M, 4] that holds the
+            expanded variances of the anchors.
        gt_box (Variable): The ground-truth bounding boxes (bboxes), a 2-D
            LoDTensor with shape [Ng, 4], where Ng is the total number of
            ground-truth bboxes of the mini-batch input.
@@ -144,30 +148,29 @@ def rpn_target_assign(loc,
    # 1. Compute the regression target bboxes
    target_bbox = box_coder(
        prior_box=anchor_box,
+        prior_box_var=anchor_var,
        target_box=gt_box,
        code_type='encode_center_size',
        box_normalized=False)
    # 2. Compute overlaps between the prior boxes and the gt boxes
    iou = iou_similarity(x=gt_box, y=anchor_box)
    # 3. Assign target label to anchors
    loc_index = helper.create_tmp_variable(dtype=anchor_box.dtype)
    score_index = helper.create_tmp_variable(dtype=anchor_box.dtype)
    target_label = helper.create_tmp_variable(dtype=anchor_box.dtype)
    helper.append_op(
        type="rpn_target_assign",
-        inputs={'Overlap': iou, },
+        inputs={'DistMat': iou},
        outputs={
            'LocationIndex': loc_index,
            'ScoreIndex': score_index,
-            'TargetLabel': target_label,
+            'TargetLabel': target_label
        },
        attrs={
            'rpn_batch_size_per_im': rpn_batch_size_per_im,
            'rpn_positive_overlap': rpn_positive_overlap,
            'rpn_negative_overlap': rpn_negative_overlap,
-            'fg_fraction': fg_fraction,
+            'fg_fraction': fg_fraction
        })
    # 4. Reshape and gather the target entry
@@ -180,7 +183,7 @@ def rpn_target_assign(loc,
    predicted_location = nn.gather(loc, loc_index)
    target_label = nn.gather(target_label, score_index)
    target_bbox = nn.gather(target_bbox, loc_index)
-    return predicted_scores, predicted_loc, target_label, target_bbox
+    return predicted_scores, predicted_location, target_label, target_bbox
 def detection_output(loc,
@@ -1256,6 +1259,64 @@ def anchor_generator(input,
    return anchor, var
+def generate_proposal_labels(rpn_rois,
+                             gt_classes,
+                             gt_boxes,
+                             im_scales,
+                             batch_size_per_im=256,
+                             fg_fraction=0.25,
+                             fg_thresh=0.25,
+                             bg_thresh_hi=0.5,
+                             bg_thresh_lo=0.0,
+                             bbox_reg_weights=[0.1, 0.1, 0.2, 0.2],
+                             class_nums=None):
+    """
+    **Generate proposal labels for Faster-RCNN**
+    TODO(buxingyuan): Add Document
+    """
+    helper = LayerHelper('generate_proposal_labels', **locals())
+    rois = helper.create_tmp_variable(dtype=rpn_rois.dtype)
+    labels_int32 = helper.create_tmp_variable(dtype=gt_classes.dtype)
+    bbox_targets = helper.create_tmp_variable(dtype=rpn_rois.dtype)
+    bbox_inside_weights = helper.create_tmp_variable(dtype=rpn_rois.dtype)
+    bbox_outside_weights = helper.create_tmp_variable(dtype=rpn_rois.dtype)
+    helper.append_op(
+        type="generate_proposal_labels",
+        inputs={
+            'RpnRois': rpn_rois,
+            'GtClasses': gt_classes,
+            'GtBoxes': gt_boxes,
+            'ImScales': im_scales
+        },
+        outputs={
+            'Rois': rois,
+            'LabelsInt32': labels_int32,
+            'BboxTargets': bbox_targets,
+            'BboxInsideWeights': bbox_inside_weights,
+            'BboxOutsideWeights': bbox_outside_weights
+        },
+        attrs={
+            'batch_size_per_im': batch_size_per_im,
+            'fg_fraction': fg_fraction,
+            'fg_thresh': fg_thresh,
+            'bg_thresh_hi': bg_thresh_hi,
+            'bg_thresh_lo': bg_thresh_lo,
+            'bbox_reg_weights': bbox_reg_weights,
+            'class_nums': class_nums
+        })
+    rois.stop_gradient = True
+    labels_int32.stop_gradient = True
+    bbox_targets.stop_gradient = True
+    bbox_inside_weights.stop_gradient = True
+    bbox_outside_weights.stop_gradient = True
+    return rois, labels_int32, bbox_targets, bbox_inside_weights, bbox_outside_weights
 def generate_proposals(scores,
                       bbox_deltas,
                       im_info,

--- a/python/paddle/fluid/layers/nn.py
+++ b/python/paddle/fluid/layers/nn.py
@@ -85,6 +85,8 @@ __all__ = [
    'one_hot',
    'autoincreased_step_counter',
    'reshape',
+    'squeeze',
+    'unsqueeze',
    'lod_reset',
    'lrn',
    'pad',
@@ -4532,6 +4534,89 @@ def reshape(x, shape, actual_shape=None, act=None, inplace=True, name=None):
    return helper.append_activation(out)
+def squeeze(input, axes, name=None):
+    """
+    Remove single-dimensional entries from the shape of a tensor. Takes a
+    parameter axes with a list of axes to squeeze. If axes is empty (as in
+    Case 2), all the single dimensions will be removed from the shape. If an
+    axis is selected with shape entry not equal to one, an error is raised.
+    Examples:
+    Case 1:
+      Given
+        X.shape = (1, 3, 1, 5)
+      and
+        axes = [0]
+      we get:
+        Out.shape = (3, 1, 5)
+    Case 2:
+      Given
+        X.shape = (1, 3, 1, 5)
+      and
+        axes = []
+      we get:
+        Out.shape = (3, 5)
+    Args:
+        input (Variable): The input variable to be squeezed.
+        axes (list): List of integers, indicating the dimensions to be squeezed.
+        name (str|None): Name for this layer.
+    Returns:
+        Variable: Output squeezed variable.
+    Examples:
+        .. code-block:: python
+            x = layers.data(name='x', shape=[5, 1, 10])
+            y = layers.squeeze(input=x, axes=[1])
+    """
+    helper = LayerHelper("squeeze", **locals())
+    out = helper.create_tmp_variable(dtype=input.dtype)
+    helper.append_op(
+        type="squeeze",
+        inputs={"X": input},
+        attrs={"axes": axes},
+        outputs={"Out": out})
+    return out
+def unsqueeze(input, axes, name=None):
+    """
+    Insert single-dimensional entries into the shape of a tensor. Takes one
+    required argument axes, a list of dimensions that will be inserted.
+    Dimension indices in axes are as seen in the output tensor.
+    For example:
+      Given a tensor with shape [3, 4, 5], the unsqueezed tensor with
+      axes=[0, 4] has shape [1, 3, 4, 5, 1].
+    Args:
+        input (Variable): The input variable to be unsqueezed.
+        axes (list): List of integers, indicating the dimensions to be inserted.
+        name (str|None): Name for this layer.
+    Returns:
+        Variable: Output unsqueezed variable.
+    Examples:
+        .. code-block:: python
+            x = layers.data(name='x', shape=[5, 10])
+            y = layers.unsqueeze(input=x, axes=[1])
+    """
+    helper = LayerHelper("unsqueeze", **locals())
+    out = helper.create_tmp_variable(dtype=input.dtype)
+    helper.append_op(
+        type="unsqueeze",
+        inputs={"X": input},
+        attrs={"axes": axes},
+        outputs={"Out": out})
+    return out
 def lod_reset(x, y=None, target_lod=None):
    """
    Set LoD of :attr:`x` to a new one specified by :attr:`y` or
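
The shape rules described in the `squeeze` and `unsqueeze` docstrings above can be checked with a pure-Python sketch that works on shape tuples only. The helper names `squeeze_shape` and `unsqueeze_shape` are hypothetical and not part of the patch; negative axes are deliberately not handled, since the docstrings do not say how they behave.

```python
def squeeze_shape(shape, axes):
    """Return the shape after removing size-1 dimensions.

    An empty `axes` removes every size-1 dimension; otherwise only the
    listed axes are removed, and each of them must have size 1.
    """
    if not axes:
        return tuple(d for d in shape if d != 1)
    for axis in axes:
        if shape[axis] != 1:
            raise ValueError("cannot squeeze axis %d with size %d" %
                             (axis, shape[axis]))
    return tuple(d for i, d in enumerate(shape) if i not in axes)


def unsqueeze_shape(shape, axes):
    """Return the shape after inserting size-1 dimensions.

    Indices in `axes` refer to positions in the output shape.
    """
    out_rank = len(shape) + len(axes)
    remaining = iter(shape)
    # Positions named in `axes` become 1; the rest keep the input order.
    return tuple(1 if i in axes else next(remaining) for i in range(out_rank))
```

Both docstring cases and the `unsqueeze` example reduce to one-line calls, for instance `unsqueeze_shape((3, 4, 5), [0, 4])`.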

--- a/python/paddle/fluid/layers/ops.py
+++ b/python/paddle/fluid/layers/ops.py
@@ -64,6 +64,7 @@ __all__ = [
    'logical_not',
    'uniform_random_batch_size_like',
    'gaussian_random',
+    'sampling_id',
    'gaussian_random_batch_size_like',
    'sum',
    'slice',

--- a/python/paddle/fluid/optimizer.py
+++ b/python/paddle/fluid/optimizer.py
@@ -15,7 +15,7 @@
 from __future__ import print_function
 import re
 from collections import defaultdict
-from paddle.fluid.framework import Program, Variable
+from paddle.fluid.framework import Program, Variable, name_scope
 from . import framework
 from . import layers
 from .backward import append_backward
@@ -237,7 +237,7 @@ class Optimizer(object):
                if param_and_grad[1] is None:
                    continue
                with param_and_grad[0].block.program.optimized_guard(
-                        param_and_grad):
+                        param_and_grad), name_scope("optimizer"):
                    if param_and_grad[0].trainable is True:
                        optimize_op = self._append_optimize_op(loss.block,
                                                               param_and_grad)

--- a/python/paddle/fluid/tests/test_detection.py
+++ b/python/paddle/fluid/tests/test_detection.py
@@ -146,6 +146,55 @@ class TestAnchorGenerator(unittest.TestCase):
        assert anchor.shape[3] == 4
+class TestGenerateProposalLabels(unittest.TestCase):
+    def test_generate_proposal_labels(self):
+        rpn_rois = layers.data(
+            name='rpn_rois',
+            shape=[4, 4],
+            dtype='float32',
+            lod_level=1,
+            append_batch_size=False)
+        gt_classes = layers.data(
+            name='gt_classes',
+            shape=[6],
+            dtype='int32',
+            lod_level=1,
+            append_batch_size=False)
+        gt_boxes = layers.data(
+            name='gt_boxes',
+            shape=[6, 4],
+            dtype='float32',
+            lod_level=1,
+            append_batch_size=False)
+        im_scales = layers.data(
+            name='im_scales',
+            shape=[1],
+            dtype='float32',
+            lod_level=1,
+            append_batch_size=False)
+        class_nums = 5
+        rois, labels_int32, bbox_targets, bbox_inside_weights, bbox_outside_weights = fluid.layers.generate_proposal_labels(
+            rpn_rois=rpn_rois,
+            gt_classes=gt_classes,
+            gt_boxes=gt_boxes,
+            im_scales=im_scales,
+            batch_size_per_im=2,
+            fg_fraction=0.5,
+            fg_thresh=0.5,
+            bg_thresh_hi=0.5,
+            bg_thresh_lo=0.0,
+            bbox_reg_weights=[0.1, 0.1, 0.2, 0.2],
+            class_nums=class_nums)
+        assert rois.shape[1] == 4
+        assert rois.shape[0] == labels_int32.shape[0]
+        assert rois.shape[0] == bbox_targets.shape[0]
+        assert rois.shape[0] == bbox_inside_weights.shape[0]
+        assert rois.shape[0] == bbox_outside_weights.shape[0]
+        assert bbox_targets.shape[1] == 4 * class_nums
+        assert bbox_inside_weights.shape[1] == 4 * class_nums
+        assert bbox_outside_weights.shape[1] == 4 * class_nums
 class TestMultiBoxHead(unittest.TestCase):
    def test_multi_box_head(self):
        data_shape = [3, 224, 224]
@@ -201,6 +250,59 @@ class TestDetectionMAP(unittest.TestCase):
        print(str(program))
+class TestRpnTargetAssign(unittest.TestCase):
+    def test_rpn_target_assign(self):
+        program = Program()
+        with program_guard(program):
+            loc_shape = [10, 50, 4]
+            score_shape = [10, 50, 2]
+            anchor_shape = [50, 4]
+            loc = layers.data(
+                name='loc',
+                shape=loc_shape,
+                append_batch_size=False,
+                dtype='float32')
+            scores = layers.data(
+                name='scores',
+                shape=score_shape,
+                append_batch_size=False,
+                dtype='float32')
+            anchor_box = layers.data(
+                name='anchor_box',
+                shape=anchor_shape,
+                append_batch_size=False,
+                dtype='float32')
+            anchor_var = layers.data(
+                name='anchor_var',
+                shape=anchor_shape,
+                append_batch_size=False,
+                dtype='float32')
+            gt_box = layers.data(
+                name='gt_box', shape=[4], lod_level=1, dtype='float32')
+            predicted_scores, predicted_location, target_label, target_bbox = layers.rpn_target_assign(
+                loc=loc,
+                scores=scores,
+                anchor_box=anchor_box,
+                anchor_var=anchor_var,
+                gt_box=gt_box,
+                rpn_batch_size_per_im=256,
+                fg_fraction=0.25,
+                rpn_positive_overlap=0.7,
+                rpn_negative_overlap=0.3)
+            self.assertIsNotNone(predicted_scores)
+            self.assertIsNotNone(predicted_location)
+            self.assertIsNotNone(target_label)
+            self.assertIsNotNone(target_bbox)
+            assert predicted_scores.shape[1] == 2
+            assert predicted_location.shape[1] == 4
+            assert predicted_location.shape[1] == target_bbox.shape[1]
+        print(str(program))
 class TestGenerateProposals(unittest.TestCase):
    def test_generate_proposals(self):
        data_shape = [20, 64, 64]

--- a/python/paddle/fluid/tests/unittests/dist_transformer.py
+++ b/python/paddle/fluid/tests/unittests/dist_transformer.py
@@ -1667,16 +1667,6 @@ def get_model(is_dist, is_async):
    return sum_cost, avg_cost, predict, token_num, local_lr_scheduler
-def get_transpiler(trainer_id, main_program, pserver_endpoints, trainers):
-    t = fluid.DistributeTranspiler()
-    t.transpile(
-        trainer_id=trainer_id,
-        program=main_program,
-        pservers=pserver_endpoints,
-        trainers=trainers)
-    return t
 def update_args():
    src_dict = DataReader.load_dict(TrainTaskConfig.src_vocab_fpath)
    trg_dict = DataReader.load_dict(TrainTaskConfig.trg_vocab_fpath)
@@ -1691,69 +1681,46 @@ def update_args():
 class DistTransformer2x2(TestDistRunnerBase):
-    def run_pserver(self, pserver_endpoints, trainers, current_endpoint,
-                    trainer_id, sync_mode):
-        get_model(True, not sync_mode)
-        t = get_transpiler(trainer_id,
-                           fluid.default_main_program(), pserver_endpoints,
-                           trainers)
-        pserver_prog = t.get_pserver_program(current_endpoint)
-        startup_prog = t.get_startup_program(current_endpoint, pserver_prog)
+    def run_pserver(self, args):
+        get_model(True, not args.sync_mode)
+        t = self.get_transpiler(args.trainer_id,
+                                fluid.default_main_program(), args.endpoints,
+                                args.trainers, args.sync_mode)
+        pserver_prog = t.get_pserver_program(args.current_endpoint)
+        startup_prog = t.get_startup_program(args.current_endpoint,
+                                             pserver_prog)
        place = fluid.CPUPlace()
        exe = fluid.Executor(place)
        exe.run(startup_prog)
        exe.run(pserver_prog)
-    def _wait_ps_ready(self, pid):
-        retry_times = 20
-        while True:
-            assert retry_times >= 0, "wait ps ready failed"
-            time.sleep(3)
-            try:
-                # the listen_and_serv_op would touch a file which contains the listen port
-                # on the /tmp directory until it was ready to process all the RPC call.
-                os.stat("/tmp/paddle.%d.port" % pid)
-                return
-            except os.error:
-                retry_times -= 1
-    def run_trainer(self,
-                    place,
-                    endpoints,
-                    trainer_id,
-                    trainers,
-                    is_dist=True,
-                    sync_mode=True):
+    def run_trainer(self, place, args):
        sum_cost, avg_cost, predict, token_num, local_lr_scheduler = get_model(
-            is_dist, not sync_mode)
+            args.is_dist, not args.sync_mode)
-        if is_dist:
-            t = get_transpiler(trainer_id,
-                               fluid.default_main_program(), endpoints,
-                               trainers)
+        if args.is_dist:
+            t = self.get_transpiler(args.trainer_id,
+                                    fluid.default_main_program(),
+                                    args.endpoints, args.trainers,
+                                    args.sync_mode)
            trainer_prog = t.get_trainer_program()
            TrainTaskConfig.batch_size = 10
            TrainTaskConfig.train_file_pattern = TrainTaskConfig.data_path + "train.tok.clean.bpe.32000.en-de.train_{}".format(
-                trainer_id)
+                args.trainer_id)
        else:
            TrainTaskConfig.batch_size = 20
            trainer_prog = fluid.default_main_program()
        startup_exe = fluid.Executor(place)
-        TrainTaskConfig.local = not is_dist
+        TrainTaskConfig.local = not args.is_dist
        train_loop(startup_exe, trainer_prog, 1, sum_cost, avg_cost,
                   local_lr_scheduler, token_num, predict)
 if __name__ == "__main__":
-    if len(sys.argv) != 8:
-        print(
-            "Usage: python dist_transformer.py [pserver/trainer] [endpoints] [trainer_id] [current_endpoint] [trainers] [is_dist] [sync_mode]"
-        )
    update_args()
    runtime_main(DistTransformer2x2)
--- a/python/paddle/fluid/tests/unittests/op_test.py
+++ b/python/paddle/fluid/tests/unittests/op_test.py
@@ -47,7 +47,8 @@ def get_numeric_gradient(place,
                         input_to_check,
                         output_names,
                         delta=0.005,
-                         in_place=False):
+                         in_place=False,
+                         sum_outputs=None):
    # FIXME: change this method by compile time concepts
    set_input(scope, op, inputs, place)
@@ -58,9 +59,11 @@ def get_numeric_gradient(place,
        sum = []
        op.run(scope, place)
        for output_name in output_names:
+            if sum_outputs and output_name not in sum_outputs:
+                continue
            sum.append(
                np.array(scope.find_var(output_name).get_tensor()).mean())
-        return np.array(sum).mean()
+        return np.array(sum).sum() / len(output_names)
    tensor_to_check = scope.find_var(input_to_check).get_tensor()
    tensor_size = product(tensor_to_check.shape())
@@ -396,13 +399,14 @@ class OpTest(unittest.TestCase):
                   numeric_grad_delta=0.005,
                   in_place=False,
                   max_relative_error=0.005,
-                   user_defined_grads=None):
+                   user_defined_grads=None,
+                   sum_outputs=None):
        places = self._get_places()
        for place in places:
            self.check_grad_with_place(place, inputs_to_check, output_names,
                                       no_grad_set, numeric_grad_delta,
                                       in_place, max_relative_error,
-                                       user_defined_grads)
+                                       user_defined_grads, sum_outputs)
    def check_grad_with_place(self,
                              place,
@@ -412,7 +416,8 @@ class OpTest(unittest.TestCase):
                              numeric_grad_delta=0.005,
                              in_place=False,
                              max_relative_error=0.005,
-                              user_defined_grads=None):
+                              user_defined_grads=None,
+                              sum_outputs=None):
        self.scope = core.Scope()
        op_inputs = self.inputs if hasattr(self, "inputs") else dict()
        op_outputs = self.outputs if hasattr(self, "outputs") else dict()
@@ -435,7 +440,8 @@ class OpTest(unittest.TestCase):
                input_to_check,
                output_names,
                delta=numeric_grad_delta,
-                in_place=in_place) for input_to_check in inputs_to_check
+                in_place=in_place,
+                sum_outputs=sum_outputs) for input_to_check in inputs_to_check
        ]
        analytic_grads = self._get_gradient(inputs_to_check, place,
                                            output_names, no_grad_set)
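
The `sum_outputs` change to `get_numeric_gradient` above alters the scalar that the numeric gradient is taken of. The following is a minimal sketch of that reduction only, using plain Python lists instead of tensors; the function name `reduce_outputs` and the dict-based `outputs` argument are assumptions for illustration.

```python
def reduce_outputs(outputs, output_names, sum_outputs=None):
    """Mirror the reduction in the patched get_numeric_gradient.

    Outputs not listed in `sum_outputs` are skipped, but the divisor
    stays len(output_names), as in the diff.
    """
    means = []
    for output_name in output_names:
        if sum_outputs and output_name not in sum_outputs:
            continue
        values = outputs[output_name]
        means.append(sum(values) / float(len(values)))
    return sum(means) / float(len(output_names))
```

Note that a skipped output still counts in the divisor, so filtering is not the same as calling the function with a shorter `output_names` list.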

--- a/python/paddle/fluid/tests/unittests/test_dist_base.py
+++ b/python/paddle/fluid/tests/unittests/test_dist_base.py
@@ -21,7 +21,7 @@ import sys
 import six
 import signal
 import subprocess
-import six
+import argparse
 class TestDistRunnerBase(object):
@@ -43,40 +43,35 @@ class TestDistRunnerBase(object):
            sync_mode=sync_mode)
        return t
-    def run_pserver(self,
-                    pserver_endpoints,
-                    trainers,
-                    current_endpoint,
-                    trainer_id,
-                    sync_mode=True):
+    def run_pserver(self, args):
        import paddle
        import paddle.fluid as fluid
        self.get_model(batch_size=2)
-        t = self.get_transpiler(trainer_id,
-                                fluid.default_main_program(), pserver_endpoints,
-                                trainers, sync_mode)
-        pserver_prog = t.get_pserver_program(current_endpoint)
-        startup_prog = t.get_startup_program(current_endpoint, pserver_prog)
+        if args.mem_opt:
+            fluid.memory_optimize(fluid.default_main_program())
+        t = self.get_transpiler(args.trainer_id,
+                                fluid.default_main_program(), args.endpoints,
+                                args.trainers, args.sync_mode)
+        pserver_prog = t.get_pserver_program(args.current_endpoint)
+        startup_prog = t.get_startup_program(args.current_endpoint,
+                                             pserver_prog)
        place = fluid.CPUPlace()
        exe = fluid.Executor(place)
        exe.run(startup_prog)
        exe.run(pserver_prog)
-    def run_trainer(self,
-                    place,
-                    endpoints,
-                    trainer_id,
-                    trainers,
-                    is_dist=True,
-                    sync_mode=True):
+    def run_trainer(self, place, args):
        import paddle
        import paddle.fluid as fluid
        test_program, avg_cost, train_reader, test_reader, batch_acc, predict = \
            self.get_model(batch_size=2)
-        if is_dist:
-            t = self.get_transpiler(trainer_id,
-                                    fluid.default_main_program(), endpoints,
-                                    trainers, sync_mode)
+        if args.mem_opt:
+            fluid.memory_optimize(fluid.default_main_program())
+        if args.is_dist:
+            t = self.get_transpiler(args.trainer_id,
+                                    fluid.default_main_program(),
+                                    args.endpoints, args.trainers,
+                                    args.sync_mode)
            trainer_prog = t.get_trainer_program()
        else:
            trainer_prog = fluid.default_main_program()
@@ -87,8 +82,18 @@ class TestDistRunnerBase(object):
        strategy = fluid.ExecutionStrategy()
        strategy.num_threads = 1
        strategy.allow_op_delay = False
+        build_stra = fluid.BuildStrategy()
+        if args.use_reduce:
+            build_stra.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.Reduce
+        else:
+            build_stra.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.AllReduce
        exe = fluid.ParallelExecutor(
-            True, loss_name=avg_cost.name, exec_strategy=strategy)
+            True,
+            loss_name=avg_cost.name,
+            exec_strategy=strategy,
+            build_strategy=build_stra)
        feed_var_list = [
            var for var in trainer_prog.global_block().vars.values()
@@ -117,27 +122,28 @@ def runtime_main(test_class):
    import paddle.fluid as fluid
    import paddle.fluid.core as core
-    if len(sys.argv) != 8:
-        print(
-            "Usage: python dist_se_resnext.py [pserver/trainer] [endpoints] [trainer_id] [current_endpoint] [trainers] [is_dist] [sync_mode]"
-        )
-    role = sys.argv[1]
-    endpoints = sys.argv[2]
-    trainer_id = int(sys.argv[3])
-    current_endpoint = sys.argv[4]
-    trainers = int(sys.argv[5])
-    is_dist = True if sys.argv[6] == "TRUE" else False
-    sync_mode = True if sys.argv[7] == "TRUE" else False
+    parser = argparse.ArgumentParser(description='Run dist test.')
+    parser.add_argument(
+        '--role', type=str, required=True, choices=['pserver', 'trainer'])
+    parser.add_argument('--endpoints', type=str, required=False, default="")
+    parser.add_argument('--is_dist', action='store_true')
+    parser.add_argument('--trainer_id', type=int, required=False, default=0)
+    parser.add_argument('--trainers', type=int, required=False, default=1)
+    parser.add_argument(
+        '--current_endpoint', type=str, required=False, default="")
+    parser.add_argument('--sync_mode', action='store_true')
+    parser.add_argument('--mem_opt', action='store_true')
+    parser.add_argument('--use_reduce', action='store_true')
+    args = parser.parse_args()
    model = test_class()
-    if role == "pserver":
-        model.run_pserver(endpoints, trainers, current_endpoint, trainer_id,
-                          sync_mode)
+    if args.role == "pserver" and args.is_dist:
+        model.run_pserver(args)
    else:
        p = fluid.CUDAPlace(0) if core.is_compiled_with_cuda(
        ) else fluid.CPUPlace()
-        model.run_trainer(p, endpoints, trainer_id, trainers, is_dist,
-                          sync_mode)
+        model.run_trainer(p, args)
 import paddle.compat as cpt
@@ -153,30 +159,39 @@ class TestDistBase(unittest.TestCase):
        self._ps_endpoints = "127.0.0.1:9123,127.0.0.1:9124"
        self._python_interp = "python"
        self._sync_mode = True
+        self._mem_opt = False
+        self._use_reduce = False
        self._setup_config()
    def start_pserver(self, model_file, check_error_log):
-        sync_mode_str = "TRUE" if self._sync_mode else "FALSE"
        ps0_ep, ps1_ep = self._ps_endpoints.split(",")
-        ps0_cmd = "%s %s pserver %s 0 %s %d TRUE %s" % \
+        ps_cmd = "%s %s --role pserver --endpoints %s --trainer_id 0 --current_endpoint %s --trainers %d --is_dist"
+        ps0_cmd = ps_cmd % \
            (self._python_interp, model_file, self._ps_endpoints, ps0_ep,
-             self._trainers, sync_mode_str)
+             self._trainers)
-        ps1_cmd = "%s %s pserver %s 0 %s %d TRUE %s" % \
+        ps1_cmd = ps_cmd % \
            (self._python_interp, model_file, self._ps_endpoints, ps1_ep,
-             self._trainers, sync_mode_str)
+             self._trainers)
+        if self._sync_mode:
+            ps0_cmd += " --sync_mode"
+            ps1_cmd += " --sync_mode"
+        if self._mem_opt:
+            ps0_cmd += " --mem_opt"
+            ps1_cmd += " --mem_opt"
        ps0_pipe = subprocess.PIPE
        ps1_pipe = subprocess.PIPE
        if check_error_log:
-            print("ps0_cmd:", ps0_cmd)
+            print(ps0_cmd)
-            print("ps1_cmd:", ps1_cmd)
+            print(ps1_cmd)
            ps0_pipe = open("/tmp/ps0_err.log", "wb")
            ps1_pipe = open("/tmp/ps1_err.log", "wb")
        ps0_proc = subprocess.Popen(
-            ps0_cmd.split(" "), stdout=subprocess.PIPE, stderr=ps0_pipe)
+            ps0_cmd.strip().split(" "), stdout=subprocess.PIPE, stderr=ps0_pipe)
        ps1_proc = subprocess.Popen(
-            ps1_cmd.split(" "), stdout=subprocess.PIPE, stderr=ps1_pipe)
+            ps1_cmd.strip().split(" "), stdout=subprocess.PIPE, stderr=ps1_pipe)
        if not check_error_log:
            return ps0_proc, ps1_proc, None, None
@@ -199,7 +214,7 @@ class TestDistBase(unittest.TestCase):
                retry_times -= 1
    def check_with_place(self, model_file, delta=1e-3, check_error_log=False):
-        # *ATTENTION* THIS TEST NEEDS AT LEAST 2GPUS TO RUN
+        # TODO(typhoonzero): should auto adapt GPU count on the machine.
        required_envs = {
            "PATH": os.getenv("PATH"),
            "PYTHONPATH": os.getenv("PYTHONPATH"),
@@ -215,10 +230,7 @@ class TestDistBase(unittest.TestCase):
        # Run local to get a base line
        env_local = {"CUDA_VISIBLE_DEVICES": "0"}
        env_local.update(required_envs)
-        sync_mode_str = "TRUE" if self._sync_mode else "FALSE"
-        local_cmd = "%s %s trainer %s 0 %s %d FLASE %s" % \
-            (self._python_interp, model_file,
-             "127.0.0.1:1234", "127.0.0.1:1234", 1, sync_mode_str)
+        local_cmd = "%s %s --role trainer" % (self._python_interp, model_file)
        if not check_error_log:
            local_proc = subprocess.Popen(
                local_cmd.split(" "),
@@ -226,7 +238,6 @@ class TestDistBase(unittest.TestCase):
                stderr=subprocess.PIPE,
                env=env_local)
        else:
-            print("trainer cmd:", local_cmd)
            err_log = open("/tmp/trainer.err.log", "wb")
            local_proc = subprocess.Popen(
                local_cmd.split(" "),
@@ -247,12 +258,23 @@ class TestDistBase(unittest.TestCase):
        self._wait_ps_ready(ps1.pid)
        ps0_ep, ps1_ep = self._ps_endpoints.split(",")
-        tr0_cmd = "%s %s trainer %s 0 %s %d TRUE %s" % \
-            (self._python_interp, model_file, self._ps_endpoints, ps0_ep,
-             self._trainers, sync_mode_str)
-        tr1_cmd = "%s %s trainer %s 1 %s %d TRUE %s" % \
-            (self._python_interp, model_file, self._ps_endpoints, ps1_ep,
-             self._trainers, sync_mode_str)
+        tr_cmd = "%s %s --role trainer --endpoints %s --trainer_id %d --current_endpoint %s --trainers %d --is_dist"
+        tr0_cmd = tr_cmd % \
+            (self._python_interp, model_file, self._ps_endpoints,
+             0, ps0_ep, self._trainers)
+        tr1_cmd = tr_cmd % \
+            (self._python_interp, model_file, self._ps_endpoints,
+             1, ps1_ep, self._trainers)
+        if self._sync_mode:
+            tr0_cmd += " --sync_mode"
+            tr1_cmd += " --sync_mode"
+        if self._mem_opt:
+            tr0_cmd += " --mem_opt"
+            tr1_cmd += " --mem_opt"
+        if self._use_reduce:
+            tr0_cmd += " --use_reduce"
+            tr1_cmd += " --use_reduce"
        env0 = {"CUDA_VISIBLE_DEVICES": "0"}
        env1 = {"CUDA_VISIBLE_DEVICES": "1"}
@@ -269,12 +291,12 @@ class TestDistBase(unittest.TestCase):
            tr1_pipe = open("/tmp/tr1_err.log", "wb")
        tr0_proc = subprocess.Popen(
-            tr0_cmd.split(" "),
+            tr0_cmd.strip().split(" "),
            stdout=subprocess.PIPE,
            stderr=tr0_pipe,
            env=env0)
        tr1_proc = subprocess.Popen(
-            tr1_cmd.split(" "),
+            tr1_cmd.strip().split(" "),
            stdout=subprocess.PIPE,
            stderr=tr1_pipe,
            env=env1)
@@ -303,6 +325,10 @@ class TestDistBase(unittest.TestCase):
        # FIXME: use terminate() instead of sigkill.
        os.kill(ps0.pid, signal.SIGKILL)
        os.kill(ps1.pid, signal.SIGKILL)
+        ps0.terminate()
+        ps1.terminate()
+        ps0.wait()
+        ps1.wait()
        FNULL.close()
        self.assertAlmostEqual(local_first_loss, dist_first_loss, delta=delta)
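
The switch from positional `sys.argv` parsing to `argparse` in `test_dist_base.py` above can be exercised without Paddle. The sketch below reproduces the parser from the diff and builds a trainer argument list the way `check_with_place` builds its command string; the helper names `build_parser` and `build_trainer_argv` are hypothetical and only wrap code shown in the diff.

```python
import argparse


def build_parser():
    # Same arguments as the parser added to runtime_main in the diff.
    parser = argparse.ArgumentParser(description='Run dist test.')
    parser.add_argument(
        '--role', type=str, required=True, choices=['pserver', 'trainer'])
    parser.add_argument('--endpoints', type=str, required=False, default="")
    parser.add_argument('--is_dist', action='store_true')
    parser.add_argument('--trainer_id', type=int, required=False, default=0)
    parser.add_argument('--trainers', type=int, required=False, default=1)
    parser.add_argument(
        '--current_endpoint', type=str, required=False, default="")
    parser.add_argument('--sync_mode', action='store_true')
    parser.add_argument('--mem_opt', action='store_true')
    parser.add_argument('--use_reduce', action='store_true')
    return parser


def build_trainer_argv(endpoints, trainer_id, current_endpoint, trainers,
                       sync_mode=True, mem_opt=False, use_reduce=False):
    # Boolean options are appended as bare flags, as in check_with_place.
    cmd = ("--role trainer --endpoints %s --trainer_id %d "
           "--current_endpoint %s --trainers %d --is_dist") % (
               endpoints, trainer_id, current_endpoint, trainers)
    if sync_mode:
        cmd += " --sync_mode"
    if mem_opt:
        cmd += " --mem_opt"
    if use_reduce:
        cmd += " --use_reduce"
    return cmd.strip().split(" ")


argv = build_trainer_argv("127.0.0.1:9123,127.0.0.1:9124", 1,
                          "127.0.0.1:9124", 2)
args = build_parser().parse_args(argv)
local_args = build_parser().parse_args(["--role", "trainer"])
```

Because every `store_true` flag defaults to `False`, the local baseline run needs only `--role trainer`, which is why the old `"TRUE"`/`"FALSE"` string arguments could be dropped.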

--- a/python/paddle/fluid/tests/unittests/test_dist_mnist.py
+++ b/python/paddle/fluid/tests/unittests/test_dist_mnist.py
@@ -20,6 +20,16 @@ from test_dist_base import TestDistBase
 class TestDistMnist2x2(TestDistBase):
    def _setup_config(self):
        self._sync_mode = True
+        self._use_reduce = False
+    def test_se_resnext(self):
+        self.check_with_place("dist_mnist.py", delta=1e-7)
+class TestDistMnist2x2WithMemopt(TestDistBase):
+    def _setup_config(self):
+        self._sync_mode = True
+        self._mem_opt = True
    def test_se_resnext(self):
        self.check_with_place("dist_mnist.py", delta=1e-7)
@@ -28,10 +38,30 @@ class TestDistMnist2x2(TestDistBase):
 class TestDistMnistAsync(TestDistBase):
    def _setup_config(self):
        self._sync_mode = False
+        self._use_reduce = False
    def test_se_resnext(self):
        self.check_with_place("dist_mnist.py", delta=200)
+# FIXME(typhoonzero): enable these tests once we have
+# 4 GPUs on the CI machine, and the base class should be updated.
+#
+# class TestDistMnist2x2ReduceMode(TestDistBase):
+#     def _setup_config(self):
+#         self._sync_mode = True
+#         self._use_reduce = True
+#     def test_se_resnext(self):
+#         self.check_with_place("dist_mnist.py", delta=1e-7)
+# class TestDistMnistAsyncReduceMode(TestDistBase):
+#     def _setup_config(self):
+#         self._sync_mode = False
+#         self._use_reduce = True
+#     def test_se_resnext(self):
+#         self.check_with_place("dist_mnist.py", delta=200)
 if __name__ == "__main__":
    unittest.main()
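Review note: the tests above pass `delta=1e-7` for the synchronous configurations and `delta=200` for the asynchronous one, and the base class presumably forwards that value to `assertAlmostEqual(..., delta=delta)` as seen in the `test_dist_base.py` hunk. The sketch below illustrates how the two tolerances behave; the class name and the loss values are made up for illustration.

```python
import unittest


class ToleranceExample(unittest.TestCase):
    def test_sync_like_tolerance(self):
        # Hypothetical losses: a tight delta demands near-identical results.
        local_loss, dist_loss = 2.3025851, 2.30258512
        self.assertAlmostEqual(local_loss, dist_loss, delta=1e-7)

    def test_async_like_tolerance(self):
        # A loose delta only guards against wildly divergent results.
        local_loss, dist_loss = 2.3, 57.0
        self.assertAlmostEqual(local_loss, dist_loss, delta=200)

    def test_tight_delta_rejects_drift(self):
        # The same tight delta fails once the losses differ by 0.1.
        with self.assertRaises(AssertionError):
            self.assertAlmostEqual(2.3, 2.4, delta=1e-7)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToleranceExample)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With `delta` set, `assertAlmostEqual` passes when the absolute difference is at most `delta`, so `delta=200` is effectively a sanity check rather than an equality check.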
--- a/python/paddle/fluid/tests/unittests/test_fused_elemwise_activation_op.py
+++ b/python/paddle/fluid/tests/unittests/test_fused_elemwise_activation_op.py
@@ -15,32 +15,31 @@
 from __future__ import print_function
 import unittest
 import numpy as np
+from functools import partial
 import paddle.fluid.core as core
 from op_test import OpTest
-# scale + add
+#   TestFusedElementwiseActivationOp
-#   TestElementwiseAddOp
+#   TestFusedElementwiseActivationOp_scalar
-#   TestFusedOperatorsOp_scalar
+#   TestFusedElementwiseActivationOp_scalar2
-#   TestFusedOperatorsOp_scalar2
+#   TestFusedElementwiseActivationOp_Vector
-#   TestFusedOperatorsOp_Vector
+#   TestFusedElementwiseActivationOp_broadcast_0
-#   TestFusedOperatorsOp_broadcast_0
+#   TestFusedElementwiseActivationOp_broadcast_1
-#   TestFusedOperatorsOp_broadcast_1
+#   TestFusedElementwiseActivationOp_broadcast_2
-#   TestFusedOperatorsOp_broadcast_2
+#   TestFusedElementwiseActivationOp_broadcast_3
-#   TestFusedOperatorsOp_broadcast_3
+#   TestFusedElementwiseActivationOp_broadcast_4
-#   TestFusedOperatorsOp_broadcast_4
+#   TestFusedElementwiseActivationOp_rowwise_add_0
-#   TestFusedOperatorsOp_rowwise_add_0
+#   TestFusedElementwiseActivationOp_rowwise_add_1
-#   TestFusedOperatorsOp_rowwise_add_1
+#   TestFusedElementwiseActivationOp_channelwise_add
-#   TestFusedOperatorsOp_channelwise_add
+def create_test_class(test_case, callback, attrs):
-class TestElementwiseAddOp(OpTest):
+    class TestFusedElementwiseActivationOp_base(OpTest):
        def setUp(self):
            self.op_type = "fused_elemwise_activation"
            self.dtype = np.float32
            self.axis = -1
-        self.init_axis()
-        self.init_dtype()
            self.init_input()
            self.init_output()
            self.init_attr()
@@ -49,772 +48,294 @@ class TestElementwiseAddOp(OpTest):
                'X': OpTest.np_dtype_to_fluid_dtype(self.x),
                'Y': OpTest.np_dtype_to_fluid_dtype(self.y)
            }
+            if self.attrs["keep_intermediate_value"]:
+                self.outputs = {
+                    'Out': self.out,
+                    "IntermediateOut": self.intermediate_out
+                }
+            else:
                self.outputs = {'Out': self.out}
        def init_input(self):
            self.x = np.random.uniform(0.1, 1, [13, 17]).astype(self.dtype)
            self.y = np.random.uniform(0.1, 1, [13, 17]).astype(self.dtype)
+            self.axis = -1
        def init_output(self):
-        self.scale = 0.1
+            self.x, self.y, self.intermediate_out, self.out = \
-        self.out = (self.x + self.y) * self.scale
+                callback(self.x, self.y, self.x, self.y)
        def init_attr(self):
-        self.attrs = {
+            self.attrs = {'axis': self.axis, }
-            'axis': self.axis,
+            for key in attrs.keys():
-            'scale': self.scale,
+                self.attrs[key] = attrs[key]
-            'functor_list': ["scale", "elementwise_add"]
-        }
-    def init_dtype(self):
-        pass
-    def init_axis(self):
-        pass
        def test_check_output(self):
            self.check_output()
        def test_check_grad_normal(self):
-        self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.005)
+            if self.attrs["keep_intermediate_value"]:
+                self.check_grad(
+                    ['X', 'Y'], ['Out', 'IntermediateOut'],
+                    max_relative_error=0.005,
+                    sum_outputs=['Out'])
+            else:
+                self.check_grad(['X', 'Y'], ['Out'], max_relative_error=0.005)
        def test_check_grad_ingore_x(self):
+            if self.attrs["keep_intermediate_value"]:
+                self.check_grad(
+                    ['Y'], ['Out', 'IntermediateOut'],
+                    max_relative_error=0.005,
+                    no_grad_set=set("X"),
+                    sum_outputs=['Out'])
+            else:
                self.check_grad(
-            ['Y'], 'Out', max_relative_error=0.005, no_grad_set=set("X"))
+                    ['Y'], ['Out'],
+                    max_relative_error=0.005,
+                    no_grad_set=set("X"))
        def test_check_grad_ingore_y(self):
+            if self.attrs["keep_intermediate_value"]:
                self.check_grad(
-            ['X'], 'Out', max_relative_error=0.005, no_grad_set=set('Y'))
+                    ['X'], ['Out', 'IntermediateOut'],
+                    max_relative_error=0.005,
+                    no_grad_set=set("Y"),
+                    sum_outputs=['Out'])
+            else:
+                self.check_grad(
+                    ['X'], ['Out'],
+                    max_relative_error=0.005,
+                    no_grad_set=set("Y"))
-class TestFusedOperatorsOp_scalar(TestElementwiseAddOp):
+    class TestFusedElementwiseActivationOp_scalar(
+            TestFusedElementwiseActivationOp_base):
        def init_input(self):
            self.x = np.random.rand(2, 3, 4).astype(self.dtype)
            self.y = np.random.rand(1).astype(self.dtype)
-    def init_output(self):
+    class TestFusedElementwiseActivationOp_scalar2(
-        self.scale = 0.1
+            TestFusedElementwiseActivationOp_base):
-        self.out = (self.x + self.y) * self.scale
-class TestFusedOperatorsOp_scalar2(TestElementwiseAddOp):
        def init_input(self):
            self.x = np.random.rand(2, 3, 4).astype(self.dtype)
            self.y = np.random.rand(1, 1).astype(self.dtype)
-    def init_output(self):
+    class TestFusedElementwiseActivationOp_Vector(
-        self.scale = 0.1
+            TestFusedElementwiseActivationOp_base):
-        self.out = (self.x + self.y) * self.scale
-class TestFusedOperatorsOp_Vector(TestElementwiseAddOp):
        def init_input(self):
            self.x = np.random.random((32, )).astype(self.dtype)
            self.y = np.random.random((32, )).astype(self.dtype)
-    def init_output(self):
+    class TestFusedElementwiseActivationOp_broadcast_0(
-        self.scale = 0.1
+            TestFusedElementwiseActivationOp_base):
-        self.out = (self.x + self.y) * self.scale
-class TestFusedOperatorsOp_broadcast_0(TestElementwiseAddOp):
        def init_input(self):
            self.x = np.random.rand(2, 3, 4).astype(self.dtype)
            self.y = np.random.rand(2).astype(self.dtype)
-    def init_axis(self):
            self.axis = 0
        def init_output(self):
-        self.scale = 0.1
+            self.x, self.y, self.intermediate_out, self.out = \
-        self.out = (self.x + self.y.reshape(2, 1, 1)) * self.scale
+                callback(self.x, self.y, self.x, self.y.reshape(2, 1, 1))
+    class TestFusedElementwiseActivationOp_broadcast_1(
-class TestFusedOperatorsOp_broadcast_1(TestElementwiseAddOp):
+            TestFusedElementwiseActivationOp_base):
        def init_input(self):
            self.x = np.random.rand(2, 3, 4).astype(self.dtype)
            self.y = np.random.rand(3).astype(self.dtype)
-    def init_axis(self):
            self.axis = 1
        def init_output(self):
-        self.scale = 0.1
+            self.x, self.y, self.intermediate_out, self.out = \
-        self.out = (self.x + self.y.reshape(1, 3, 1)) * self.scale
+                callback(self.x, self.y, self.x, self.y.reshape(1, 3, 1))
-class TestFusedOperatorsOp_broadcast_2(TestElementwiseAddOp):
+    class TestFusedElementwiseActivationOp_broadcast_2(
+            TestFusedElementwiseActivationOp_base):
        def init_input(self):
            self.x = np.random.rand(2, 3, 4).astype(self.dtype)
            self.y = np.random.rand(4).astype(self.dtype)
        def init_output(self):
-        self.scale = 0.1
+            self.x, self.y, self.intermediate_out, self.out = \
-        self.out = (self.x + self.y.reshape(1, 1, 4)) * self.scale
+                callback(self.x, self.y, self.x, self.y.reshape(1, 1, 4))
+    class TestFusedElementwiseActivationOp_broadcast_3(
-class TestFusedOperatorsOp_broadcast_3(TestElementwiseAddOp):
+            TestFusedElementwiseActivationOp_base):
        def init_input(self):
            self.x = np.random.rand(2, 3, 4, 5).astype(self.dtype)
            self.y = np.random.rand(3, 4).astype(self.dtype)
-    def init_axis(self):
            self.axis = 1
        def init_output(self):
-        self.scale = 0.1
+            self.x, self.y, self.intermediate_out, self.out = \
-        self.out = (self.x + self.y.reshape(1, 3, 4, 1)) * self.scale
+                callback(self.x, self.y, self.x, self.y.reshape(1, 3, 4, 1))
+    class TestFusedElementwiseActivationOp_broadcast_4(
-class TestFusedOperatorsOp_broadcast_4(TestElementwiseAddOp):
+            TestFusedElementwiseActivationOp_base):
        def init_input(self):
            self.x = np.random.rand(2, 3, 4, 5).astype(self.dtype)
            self.y = np.random.rand(2, 1).astype(self.dtype)
-    def init_axis(self):
            self.axis = 0
        def init_output(self):
-        self.scale = 0.1
+            self.x, self.y, self.intermediate_out, self.out = \
-        self.out = (self.x + self.y.reshape(2, 1, 1, 1)) * self.scale
+                callback(self.x, self.y, self.x, self.y.reshape(2, 1, 1, 1))
-class TestFusedOperatorsOp_rowwise_add_0(TestElementwiseAddOp):
+    class TestFusedElementwiseActivationOp_rowwise_add_0(
+            TestFusedElementwiseActivationOp_base):
        def init_input(self):
            self.x = np.random.rand(2, 3, 4).astype(self.dtype)
            self.y = np.random.rand(3, 4).astype(self.dtype)
-    def init_axis(self):
            self.axis = 1
        def init_output(self):
-        self.scale = 0.1
+            self.x, self.y, self.intermediate_out, self.out = \
-        self.out = (self.x + self.y.reshape(1, 3, 4)) * self.scale
+                callback(self.x, self.y, self.x, self.y.reshape(1, 3, 4))
-class TestFusedOperatorsOp_rowwise_add_1(TestElementwiseAddOp):
+    class TestFusedElementwiseActivationOp_rowwise_add_1(
+            TestFusedElementwiseActivationOp_base):
        def init_input(self):
            self.x = np.random.rand(2, 1).astype(self.dtype)
            self.y = np.random.rand(1).astype(self.dtype)
-    def init_axis(self):
            self.axis = 1
        def init_output(self):
-        self.scale = 0.1
+            self.x, self.y, self.intermediate_out, self.out = \
-        self.out = (self.x + self.y.reshape(1, 1)) * self.scale
+                callback(self.x, self.y, self.x, self.y.reshape(1, 1))
-class TestFusedOperatorsOp_channelwise_add(TestElementwiseAddOp):
+    class TestFusedElementwiseActivationOp_channelwise_add(
+            TestFusedElementwiseActivationOp_base):
        def init_input(self):
            self.x = np.random.rand(3, 20, 20).astype(self.dtype)
            self.y = np.random.rand(3, 1, 1).astype(self.dtype)
-    def init_axis(self):
+    TestFusedElementwiseActivationOp_base.__name__ = test_case + "_base"
-        self.axis = -1
+    TestFusedElementwiseActivationOp_scalar.__name__ = test_case + "_scalar"
+    TestFusedElementwiseActivationOp_scalar2.__name__ = test_case + "_scalar2"
-    def init_output(self):
+    TestFusedElementwiseActivationOp_Vector.__name__ = test_case + "_Vector"
-        self.scale = 0.1
+    TestFusedElementwiseActivationOp_broadcast_0.__name__ = test_case + "_broadcast_0"
-        self.out = (self.x + self.y) * self.scale
+    TestFusedElementwiseActivationOp_broadcast_1.__name__ = test_case + "_broadcast_1"
+    TestFusedElementwiseActivationOp_broadcast_2.__name__ = test_case + "_broadcast_2"
+    TestFusedElementwiseActivationOp_broadcast_3.__name__ = test_case + "_broadcast_3"
-# add + scale
+    TestFusedElementwiseActivationOp_broadcast_4.__name__ = test_case + "_broadcast_4"
-#   TestElementwiseAddOp_f_add_scale
+    TestFusedElementwiseActivationOp_rowwise_add_0.__name__ = test_case + "_rowwise_add_0"
-#   TestFusedOperatorsOp_scalar_f_add_scale
+    TestFusedElementwiseActivationOp_rowwise_add_1.__name__ = test_case + "_rowwise_add_1"
-#   TestFusedOperatorsOp_scalar2_f_add_scale
+    TestFusedElementwiseActivationOp_channelwise_add.__name__ = test_case + "_channelwise_add"
-#   TestFusedOperatorsOp_Vector_f_add_scale
-#   TestFusedOperatorsOp_broadcast_0_f_add_scale
+    globals()[test_case + "_base"] = TestFusedElementwiseActivationOp_base
-#   TestFusedOperatorsOp_broadcast_1_f_add_scale
+    globals()[test_case + "_scalar"] = TestFusedElementwiseActivationOp_scalar
-#   TestFusedOperatorsOp_broadcast_2_f_add_scale
+    globals()[test_case + "_scalar2"] = TestFusedElementwiseActivationOp_scalar2
-#   TestFusedOperatorsOp_broadcast_3_f_add_scale
+    globals()[test_case + "_Vector"] = TestFusedElementwiseActivationOp_Vector
-#   TestFusedOperatorsOp_broadcast_4_f_add_scale
+    globals()[test_case +
-#   TestFusedOperatorsOp_rowwise_add_0_f_add_scale
+              "_broadcast_0"] = TestFusedElementwiseActivationOp_broadcast_0
-#   TestFusedOperatorsOp_rowwise_add_1_f_add_scale
+    globals()[test_case +
-#   TestFusedOperatorsOp_channelwise_add_f_add_scale
+              "_broadcast_1"] = TestFusedElementwiseActivationOp_broadcast_1
+    globals()[test_case +
+              "_broadcast_2"] = TestFusedElementwiseActivationOp_broadcast_2
-class TestFusedOperatorsOp_f_add_scale(TestElementwiseAddOp):
+    globals()[test_case +
-    def init_output(self):
+              "_broadcast_3"] = TestFusedElementwiseActivationOp_broadcast_3
-        self.scale = 0.1
+    globals()[test_case +
-        self.out = self.x + self.y * self.scale
+              "_broadcast_4"] = TestFusedElementwiseActivationOp_broadcast_4
+    globals()[test_case +
-    def init_attr(self):
+              "_rowwise_add_0"] = TestFusedElementwiseActivationOp_rowwise_add_0
-        self.attrs = {
+    globals()[test_case +
-            'axis': self.axis,
+              "_rowwise_add_1"] = TestFusedElementwiseActivationOp_rowwise_add_1
-            'scale': self.scale,
+    globals(
-            'functor_list': ["elementwise_add", "scale"]
+    )[test_case +
-        }
+      "_channelwise_add"] = TestFusedElementwiseActivationOp_channelwise_add
-class TestFusedOperatorsOp_scalar_f_add_scale(TestFusedOperatorsOp_scalar):
+def scale_add_func(x, y, x_bcast, y_bcast, scale, mode=0):
-    def init_output(self):
+    if mode == 0:
-        self.scale = 0.1
+        return x, y, (x_bcast + y_bcast), (x_bcast + y_bcast) * scale
-        self.out = self.x + self.y * self.scale
+    else:
+        return y, x, (x_bcast + y_bcast), (x_bcast + y_bcast) * scale
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
+def add_scale_func(x, y, x_bcast, y_bcast, scale, mode=0):
-            'scale': self.scale,
+    if mode == 0:
-            'functor_list': ["elementwise_add", "scale"]
+        return x, y, y * scale, x_bcast + y_bcast * scale
-        }
+    else:
+        return y, x, x * scale, y_bcast + x_bcast * scale
-class TestFusedOperatorsOp_scalar2_f_add_scale(TestFusedOperatorsOp_scalar2):
-    def init_output(self):
+def add_relu_func(x, y, x_bcast, y_bcast, mode=0):
-        self.scale = 0.1
-        self.out = self.x + self.y * self.scale
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'scale': self.scale,
-            'functor_list': ["elementwise_add", "scale"]
-        }
-class TestFusedOperatorsOp_Vector_f_add_scale(TestFusedOperatorsOp_Vector):
-    def init_output(self):
-        self.scale = 0.1
-        self.out = self.x + self.y * self.scale
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'scale': self.scale,
-            'functor_list': ["elementwise_add", "scale"]
-        }
-class TestFusedOperatorsOp_broadcast_0_f_add_scale(
-        TestFusedOperatorsOp_broadcast_0):
-    def init_axis(self):
-        self.axis = 0
-    def init_output(self):
-        self.scale = 0.1
-        self.out = self.x + self.y.reshape(2, 1, 1) * self.scale
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'scale': self.scale,
-            'functor_list': ["elementwise_add", "scale"]
-        }
-class TestFusedOperatorsOp_broadcast_1_f_add_scale(
-        TestFusedOperatorsOp_broadcast_1):
-    def init_axis(self):
-        self.axis = 1
-    def init_output(self):
-        self.scale = 0.1
-        self.out = self.x + self.y.reshape(1, 3, 1) * self.scale
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'scale': self.scale,
-            'functor_list': ["elementwise_add", "scale"]
-        }
-class TestFusedOperatorsOp_broadcast_2_f_add_scale(
-        TestFusedOperatorsOp_broadcast_2):
-    def init_output(self):
-        self.scale = 0.1
-        self.out = self.x + self.y.reshape(1, 1, 4) * self.scale
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'scale': self.scale,
-            'functor_list': ["elementwise_add", "scale"]
-        }
-class TestFusedOperatorsOp_broadcast_3_f_add_scale(
-        TestFusedOperatorsOp_broadcast_3):
-    def init_axis(self):
-        self.axis = 1
-    def init_output(self):
-        self.scale = 0.1
-        self.out = self.x + self.y.reshape(1, 3, 4, 1) * self.scale
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'scale': self.scale,
-            'functor_list': ["elementwise_add", "scale"]
-        }
-class TestFusedOperatorsOp_broadcast_4_f_add_scale(
-        TestFusedOperatorsOp_broadcast_4):
-    def init_axis(self):
-        self.axis = 0
-    def init_output(self):
-        self.scale = 0.2
-        self.out = self.x + self.y.reshape(2, 1, 1, 1) * self.scale
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'scale': self.scale,
-            'functor_list': ["elementwise_add", "scale"]
-        }
-class TestFusedOperatorsOp_rowwise_add_0_f_add_scale(
-        TestFusedOperatorsOp_rowwise_add_0):
-    def init_axis(self):
-        self.axis = 1
-    def init_output(self):
-        self.scale = 0.1
-        self.out = self.x + self.y.reshape(1, 3, 4) * self.scale
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'scale': self.scale,
-            'functor_list': ["elementwise_add", "scale"]
-        }
-class TestFusedOperatorsOp_rowwise_add_1_f_add_scale(
-        TestFusedOperatorsOp_rowwise_add_1):
-    def init_axis(self):
-        self.axis = 1
-    def init_output(self):
-        self.scale = 0.2
-        self.out = self.x + self.y.reshape(1, 1) * self.scale
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'scale': self.scale,
-            'functor_list': ["elementwise_add", "scale"]
-        }
-class TestFusedOperatorsOp_channelwise_add_f_add_scale(
-        TestFusedOperatorsOp_channelwise_add):
-    def init_axis(self):
-        self.axis = -1
-    def init_output(self):
-        self.scale = 0.2
-        self.out = self.x + self.y * self.scale
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'scale': self.scale,
-            'functor_list': ["elementwise_add", "scale"]
-        }
-# add + relu
-#   TestElementwiseAddOp_f_add_relu
-#   TestFusedOperatorsOp_scalar_f_add_relu
-#   TestFusedOperatorsOp_scalar2_f_add_relu
-#   TestFusedOperatorsOp_Vector_f_add_relu
-#   TestFusedOperatorsOp_broadcast_0_f_add_relu
-#   TestFusedOperatorsOp_broadcast_1_f_add_relu
-#   TestFusedOperatorsOp_broadcast_2_f_add_relu
-#   TestFusedOperatorsOp_broadcast_3_f_add_relu
-#   TestFusedOperatorsOp_broadcast_4_f_add_relu
-#   TestFusedOperatorsOp_rowwise_add_0_f_add_relu
-#   TestFusedOperatorsOp_rowwise_add_1_f_add_relu
-#   TestFusedOperatorsOp_channelwise_add_f_add_relu
-class TestFusedOperatorsOp_f_add_relu(TestElementwiseAddOp):
-    def init_output(self):
-        # Copy from test_activation_op.py
-        # Because we set delta = 0.005 in calculating numeric gradient,
-        # if x is too small, such as 0.002, x_neg will be -0.003
-        # x_pos will be 0.007, so the numeric gradient is inaccurate.
-        # we should avoid this
-        self.y[np.abs(self.y) < 0.005] = 0.02
-        self.out = self.x + np.maximum(self.y, 0)
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["elementwise_add", "relu"]
-        }
-class TestFusedOperatorsOp_scalar_f_add_relu(TestFusedOperatorsOp_scalar):
-    def init_output(self):
-        self.y[np.abs(self.y) < 0.005] = 0.02
-        self.out = self.x + np.maximum(self.y, 0)
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["elementwise_add", "relu"]
-        }
-class TestFusedOperatorsOp_scalar2_f_add_relu(TestFusedOperatorsOp_scalar2):
-    def init_output(self):
-        self.y[np.abs(self.y) < 0.005] = 0.02
-        self.out = self.x + np.maximum(self.y, 0)
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["elementwise_add", "relu"]
-        }
-class TestFusedOperatorsOp_Vector_f_add_relu(TestFusedOperatorsOp_Vector):
-    def init_output(self):
-        self.y[np.abs(self.y) < 0.005] = 0.02
-        self.out = self.x + np.maximum(self.y, 0)
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["elementwise_add", "relu"]
-        }
-class TestFusedOperatorsOp_broadcast_0_f_add_relu(
-        TestFusedOperatorsOp_broadcast_0):
-    def init_axis(self):
-        self.axis = 0
-    def init_output(self):
-        self.y[np.abs(self.y) < 0.005] = 0.02
-        self.out = self.x + np.maximum(self.y.reshape(2, 1, 1), 0)
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["elementwise_add", "relu"]
-        }
-class TestFusedOperatorsOp_broadcast_1_f_add_relu(
-        TestFusedOperatorsOp_broadcast_1):
-    def init_axis(self):
-        self.axis = 1
-    def init_output(self):
-        self.y[np.abs(self.y) < 0.005] = 0.02
-        self.out = self.x + np.maximum(self.y.reshape(1, 3, 1), 0)
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["elementwise_add", "relu"]
-        }
-class TestFusedOperatorsOp_broadcast_2_f_add_relu(
-        TestFusedOperatorsOp_broadcast_2):
-    def init_output(self):
-        self.y[np.abs(self.y) < 0.005] = 0.02
-        self.out = self.x + np.maximum(self.y.reshape(1, 1, 4), 0)
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["elementwise_add", "relu"]
-        }
-class TestFusedOperatorsOp_broadcast_3_f_add_relu(
-        TestFusedOperatorsOp_broadcast_3):
-    def init_axis(self):
-        self.axis = 1
-    def init_output(self):
-        self.y[np.abs(self.y) < 0.005] = 0.02
-        self.out = self.x + np.maximum(self.y.reshape(1, 3, 4, 1), 0)
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["elementwise_add", "relu"]
-        }
-class TestFusedOperatorsOp_broadcast_4_f_add_relu(
-        TestFusedOperatorsOp_broadcast_4):
-    def init_axis(self):
-        self.axis = 0
-    def init_output(self):
-        self.y[np.abs(self.y) < 0.005] = 0.02
-        self.out = self.x + np.maximum(self.y.reshape(2, 1, 1, 1), 0)
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["elementwise_add", "relu"]
-        }
-class TestFusedOperatorsOp_rowwise_add_0_f_add_relu(
-        TestFusedOperatorsOp_rowwise_add_0):
-    def init_axis(self):
-        self.axis = 1
-    def init_output(self):
-        self.y[np.abs(self.y) < 0.005] = 0.02
-        self.out = self.x + np.maximum(self.y.reshape(1, 3, 4), 0)
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["elementwise_add", "relu"]
-        }
-class TestFusedOperatorsOp_rowwise_add_1_f_add_relu(
-        TestFusedOperatorsOp_rowwise_add_1):
-    def init_axis(self):
-        self.axis = 1
-    def init_output(self):
-        self.y[np.abs(self.y) < 0.005] = 0.02
-        self.out = self.x + np.maximum(self.y.reshape(1, 1), 0)
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["elementwise_add", "relu"]
-        }
-class TestFusedOperatorsOp_channelwise_add_f_add_relu(
-        TestFusedOperatorsOp_channelwise_add):
-    def init_axis(self):
-        self.axis = -1
-    def init_output(self):
-        self.y[np.abs(self.y) < 0.005] = 0.02
-        self.out = self.x + np.maximum(self.y, 0)
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["elementwise_add", "relu"]
-        }
-# relu + add
-#   TestElementwiseAddOp_f_relu_add
-#   TestFusedOperatorsOp_scalar_f_relu_add
-#   TestFusedOperatorsOp_scalar2_f_relu_add
-#   TestFusedOperatorsOp_Vector_f_relu_add
-#   TestFusedOperatorsOp_broadcast_0_f_relu_add
-#   TestFusedOperatorsOp_broadcast_1_f_relu_add
-#   TestFusedOperatorsOp_broadcast_2_f_relu_add
-#   TestFusedOperatorsOp_broadcast_3_f_relu_add
-#   TestFusedOperatorsOp_broadcast_4_f_relu_add
-#   TestFusedOperatorsOp_rowwise_add_0_f_relu_add
-#   TestFusedOperatorsOp_rowwise_add_1_f_relu_add
-#   TestFusedOperatorsOp_channelwise_add_f_relu_add
-class TestFusedOperatorsOp_f_relu_add(TestElementwiseAddOp):
-    def init_output(self):
    # Copy from test_activation_op.py
    # Because we set delta = 0.005 in calculating numeric gradient,
    # if x is too small, such as 0.002, x_neg will be -0.003
    # x_pos will be 0.007, so the numeric gradient is inaccurate.
    # we should avoid this
-        self.out = self.x + self.y
+    if mode == 0:
-        self.out = np.maximum(self.out, 0)
+        y[np.abs(y) < 0.005] = 0.02
-        self.out[np.abs(self.out) < 0.005] = 0.02
+        y_bcast[np.abs(y_bcast) < 0.005] = 0.02
+        return x, y, np.maximum(y, 0), x_bcast + np.maximum(y_bcast, 0)
-    def init_attr(self):
+    else:
-        self.attrs = {
+        x[np.abs(x) < 0.005] = 0.02
-            'axis': self.axis,
+        x_bcast[np.abs(x_bcast) < 0.005] = 0.02
-            'functor_list': ["relu", "elementwise_add"]
+        return y, x, np.maximum(x, 0), y_bcast + np.maximum(x_bcast, 0)
-        }
+def relu_add_func(x, y, x_bcast, y_bcast, mode=0):
-class TestFusedOperatorsOp_scalar_f_relu_add(TestFusedOperatorsOp_scalar):
+    intermediate_out = x_bcast + y_bcast
-    def init_output(self):
+    out = np.maximum(intermediate_out, 0)
-        self.out = self.x + self.y
+    out[np.abs(out) < 0.005] = 0.02
-        self.out = np.maximum(self.out, 0)
+    if mode == 0:
-        self.out[np.abs(self.out) < 0.005] = 0.02
+        return x, y, intermediate_out, out
+    else:
-    def init_attr(self):
+        return y, x, intermediate_out, out
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["relu", "elementwise_add"]
+def mul_scale_func(x, y, x_bcast, y_bcast, scale, mode=0):
-        }
+    if mode == 0:
+        return x, y, y * scale, x_bcast * (y_bcast * scale)
+    else:
-class TestFusedOperatorsOp_scalar2_f_relu_add(TestFusedOperatorsOp_scalar2):
+        return y, x, x * scale, y_bcast * (x_bcast * scale)
-    def init_output(self):
-        self.out = self.x + self.y
-        self.out = np.maximum(self.out, 0)
+scale = 0.1
-        self.out[np.abs(self.out) < 0.005] = 0.02
+scale_add_func = partial(scale_add_func, scale=scale)
+add_scale_func = partial(add_scale_func, scale=scale)
-    def init_attr(self):
+mul_scale_func = partial(mul_scale_func, scale=scale)
-        self.attrs = {
-            'axis': self.axis,
+for mode in {0, 1}:
-            'functor_list': ["relu", "elementwise_add"]
+    scale_add_func = partial(scale_add_func, mode=mode)
-        }
+    add_scale_func = partial(add_scale_func, mode=mode)
+    mul_scale_func = partial(mul_scale_func, mode=mode)
+    relu_add_func = partial(relu_add_func, mode=mode)
-class TestFusedOperatorsOp_Vector_f_relu_add(TestFusedOperatorsOp_Vector):
+    add_relu_func = partial(add_relu_func, mode=mode)
-    def init_output(self):
-        self.out = self.x + self.y
+    for recomputation in {True, False}:
-        self.out = np.maximum(self.out, 0)
+        for keep_intermediate_value in {True, False}:
-        self.out[np.abs(self.out) < 0.005] = 0.02
+            suffix = ("_keep_intermediate_value" if keep_intermediate_value else "") \
+                     + ("_recomputation" if recomputation else "") \
-    def init_attr(self):
+                     + ("_mode_"+ str(mode))
-        self.attrs = {
+            create_test_class('scale_add' + suffix, scale_add_func, {
-            'axis': self.axis,
+                'scale': scale,
-            'functor_list': ["relu", "elementwise_add"]
+                'functor_list': ["scale", "elementwise_add"],
-        }
+                'keep_intermediate_value': keep_intermediate_value,
+                'recomputation': recomputation
+            })
-class TestFusedOperatorsOp_broadcast_0_f_relu_add(
+            create_test_class('add_scale' + suffix, add_scale_func, {
-        TestFusedOperatorsOp_broadcast_0):
+                'scale': scale,
-    def init_axis(self):
+                'functor_list': ["elementwise_add", "scale"],
-        self.axis = 0
+                'keep_intermediate_value': keep_intermediate_value,
+                'recomputation': recomputation
-    def init_output(self):
+            })
-        self.out = self.x + self.y.reshape(2, 1, 1)
+            create_test_class('add_relu' + suffix, add_relu_func, {
-        self.out = np.maximum(self.out, 0)
+                'functor_list': ["elementwise_add", "relu"],
-        self.out[np.abs(self.out) < 0.005] = 0.02
+                'keep_intermediate_value': keep_intermediate_value,
+                'recomputation': recomputation
-    def init_attr(self):
+            })
-        self.attrs = {
+            create_test_class('relu_add' + suffix, relu_add_func, {
-            'axis': self.axis,
+                'functor_list': ["relu", "elementwise_add"],
-            'functor_list': ["relu", "elementwise_add"]
+                'keep_intermediate_value': keep_intermediate_value,
-        }
+                'recomputation': recomputation
+            })
+            create_test_class('mul_scale' + suffix, mul_scale_func, {
-class TestFusedOperatorsOp_broadcast_1_f_relu_add(
+                'scale': scale,
-        TestFusedOperatorsOp_broadcast_1):
+                'functor_list': ["elementwise_mul", "scale"],
-    def init_axis(self):
+                'keep_intermediate_value': keep_intermediate_value,
-        self.axis = 1
+                'recomputation': recomputation
+            })
-    def init_output(self):
-        self.out = self.x + self.y.reshape(1, 3, 1)
-        self.out = np.maximum(self.out, 0)
-        self.out[np.abs(self.out) < 0.005] = 0.02
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["relu", "elementwise_add"]
-        }
-class TestFusedOperatorsOp_broadcast_2_f_relu_add(
-        TestFusedOperatorsOp_broadcast_2):
-    def init_output(self):
-        self.out = self.x + self.y.reshape(1, 1, 4)
-        self.out = np.maximum(self.out, 0)
-        self.out[np.abs(self.out) < 0.005] = 0.02
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["relu", "elementwise_add"]
-        }
-class TestFusedOperatorsOp_broadcast_3_f_relu_add(
-        TestFusedOperatorsOp_broadcast_3):
-    def init_axis(self):
-        self.axis = 1
-    def init_output(self):
-        self.out = self.x + self.y.reshape(1, 3, 4, 1)
-        self.out = np.maximum(self.out, 0)
-        self.out[np.abs(self.out) < 0.005] = 0.02
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["relu", "elementwise_add"]
-        }
-class TestFusedOperatorsOp_broadcast_4_f_relu_add(
-        TestFusedOperatorsOp_broadcast_4):
-    def init_axis(self):
-        self.axis = 0
-    def init_output(self):
-        self.out = self.x + self.y.reshape(2, 1, 1, 1)
-        self.out = np.maximum(self.out, 0)
-        self.out[np.abs(self.out) < 0.005] = 0.02
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["relu", "elementwise_add"]
-        }
-class TestFusedOperatorsOp_rowwise_add_0_f_relu_add(
-        TestFusedOperatorsOp_rowwise_add_0):
-    def init_axis(self):
-        self.axis = 1
-    def init_output(self):
-        self.out = self.x + self.y.reshape(1, 3, 4)
-        self.out = np.maximum(self.out, 0)
-        self.out[np.abs(self.out) < 0.005] = 0.02
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["relu", "elementwise_add"]
-        }
-class TestFusedOperatorsOp_rowwise_add_1_f_relu_add(
-        TestFusedOperatorsOp_rowwise_add_1):
-    def init_axis(self):
-        self.axis = 1
-    def init_output(self):
-        self.out = self.x + self.y.reshape(1, 1)
-        self.out = np.maximum(self.out, 0)
-        self.out[np.abs(self.out) < 0.005] = 0.02
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["relu", "elementwise_add"]
-        }
-class TestFusedOperatorsOp_channelwise_add_f_relu_add(
-        TestFusedOperatorsOp_channelwise_add):
-    def init_axis(self):
-        self.axis = -1
-    def init_output(self):
-        self.out = self.x + self.y
-        self.out = np.maximum(self.out, 0)
-        self.out[np.abs(self.out) < 0.005] = 0.02
-    def init_attr(self):
-        self.attrs = {
-            'axis': self.axis,
-            'functor_list': ["relu", "elementwise_add"]
-        }
 if __name__ == '__main__':
    unittest.main()
--- a/python/paddle/fluid/tests/unittests/test_generate_proposal_labels.py
+++ b/python/paddle/fluid/tests/unittests/test_generate_proposal_labels.py
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import unittest
+import numpy as np
+import sys
+import math
+import paddle.fluid as fluid
+from op_test import OpTest
+def generate_proposal_labels_in_python(
+        rpn_rois, gt_classes, gt_boxes, im_scales, batch_size_per_im,
+        fg_fraction, fg_thresh, bg_thresh_hi, bg_thresh_lo, bbox_reg_weights,
+        class_nums):
+    rois = []
+    labels_int32 = []
+    bbox_targets = []
+    bbox_inside_weights = []
+    bbox_outside_weights = []
+    lod = []
+    assert len(rpn_rois) == len(
+        im_scales), 'batch size of rpn_rois and ground_truth is not matched'
+    for im_i in range(len(im_scales)):
+        frcn_blobs = _sample_rois(
+            rpn_rois[im_i], gt_classes[im_i], gt_boxes[im_i], im_scales[im_i],
+            batch_size_per_im, fg_fraction, fg_thresh, bg_thresh_hi,
+            bg_thresh_lo, bbox_reg_weights, class_nums)
+        lod.append(frcn_blobs['rois'].shape[0])
+        rois.append(frcn_blobs['rois'])
+        labels_int32.append(frcn_blobs['labels_int32'])
+        bbox_targets.append(frcn_blobs['bbox_targets'])
+        bbox_inside_weights.append(frcn_blobs['bbox_inside_weights'])
+        bbox_outside_weights.append(frcn_blobs['bbox_outside_weights'])
+    return rois, labels_int32, bbox_targets, bbox_inside_weights, bbox_outside_weights, lod
+def _sample_rois(rpn_rois, gt_classes, gt_boxes, im_scale, batch_size_per_im,
+                 fg_fraction, fg_thresh, bg_thresh_hi, bg_thresh_lo,
+                 bbox_reg_weights, class_nums):
+    rois_per_image = int(batch_size_per_im)
+    fg_rois_per_im = int(np.round(fg_fraction * rois_per_image))
+    # Roidb
+    inv_im_scale = 1. / im_scale
+    rpn_rois = rpn_rois * inv_im_scale
+    boxes = np.vstack([gt_boxes, rpn_rois])
+    gt_overlaps = np.zeros((boxes.shape[0], class_nums))
+    box_to_gt_ind_map = np.zeros((boxes.shape[0]), dtype=np.int32)
+    if len(gt_boxes) > 0:
+        proposal_to_gt_overlaps = _bbox_overlaps(boxes, gt_boxes)
+        overlaps_argmax = proposal_to_gt_overlaps.argmax(axis=1)
+        overlaps_max = proposal_to_gt_overlaps.max(axis=1)
+        # Boxes with non-zero overlap with gt boxes
+        overlapped_boxes_ind = np.where(overlaps_max > 0)[0]
+        overlapped_boxes_gt_classes = gt_classes[overlaps_argmax[
+            overlapped_boxes_ind]]
+        gt_overlaps[overlapped_boxes_ind,
+                    overlapped_boxes_gt_classes] = overlaps_max[
+                        overlapped_boxes_ind]
+        box_to_gt_ind_map[overlapped_boxes_ind] = overlaps_argmax[
+            overlapped_boxes_ind]
+    max_overlaps = gt_overlaps.max(axis=1)
+    max_classes = gt_overlaps.argmax(axis=1)
+    # Foreground
+    fg_inds = np.where(max_overlaps >= fg_thresh)[0]
+    fg_rois_per_this_image = np.minimum(fg_rois_per_im, fg_inds.shape[0])
+    # Sample foreground if there are too many
+    if fg_inds.shape[0] > fg_rois_per_this_image:
+        fg_inds = np.random.choice(
+            fg_inds, size=fg_rois_per_this_image, replace=False)
+    # Background
+    bg_inds = np.where((max_overlaps < bg_thresh_hi) & (max_overlaps >=
+                                                        bg_thresh_lo))[0]
+    bg_rois_per_this_image = rois_per_image - fg_rois_per_this_image
+    bg_rois_per_this_image = np.minimum(bg_rois_per_this_image,
+                                        bg_inds.shape[0])
+    # Sample background if there are too many
+    if bg_inds.shape[0] > bg_rois_per_this_image:
+        bg_inds = np.random.choice(
+            bg_inds, size=bg_rois_per_this_image, replace=False)
+    keep_inds = np.append(fg_inds, bg_inds)
+    sampled_labels = max_classes[keep_inds]
+    sampled_labels[fg_rois_per_this_image:] = 0
+    sampled_boxes = boxes[keep_inds]
+    sampled_gts = gt_boxes[box_to_gt_ind_map[keep_inds]]
+    sampled_gts[fg_rois_per_this_image:, :] = gt_boxes[0]
+    bbox_label_targets = _compute_targets(sampled_boxes, sampled_gts,
+                                          sampled_labels, bbox_reg_weights)
+    bbox_targets, bbox_inside_weights = _expand_bbox_targets(bbox_label_targets,
+                                                             class_nums)
+    bbox_outside_weights = np.array(
+        bbox_inside_weights > 0, dtype=bbox_inside_weights.dtype)
+    # Scale rois
+    sampled_rois = sampled_boxes * im_scale
+    # Faster RCNN blobs
+    frcn_blobs = dict(
+        rois=sampled_rois,
+        labels_int32=sampled_labels,
+        bbox_targets=bbox_targets,
+        bbox_inside_weights=bbox_inside_weights,
+        bbox_outside_weights=bbox_outside_weights)
+    return frcn_blobs
+def _bbox_overlaps(roi_boxes, gt_boxes):
+    w1 = np.maximum(roi_boxes[:, 2] - roi_boxes[:, 0] + 1, 0)
+    h1 = np.maximum(roi_boxes[:, 3] - roi_boxes[:, 1] + 1, 0)
+    w2 = np.maximum(gt_boxes[:, 2] - gt_boxes[:, 0] + 1, 0)
+    h2 = np.maximum(gt_boxes[:, 3] - gt_boxes[:, 1] + 1, 0)
+    area1 = w1 * h1
+    area2 = w2 * h2
+    overlaps = np.zeros((roi_boxes.shape[0], gt_boxes.shape[0]))
+    for ind1 in range(roi_boxes.shape[0]):
+        for ind2 in range(gt_boxes.shape[0]):
+            inter_x1 = np.maximum(roi_boxes[ind1, 0], gt_boxes[ind2, 0])
+            inter_y1 = np.maximum(roi_boxes[ind1, 1], gt_boxes[ind2, 1])
+            inter_x2 = np.minimum(roi_boxes[ind1, 2], gt_boxes[ind2, 2])
+            inter_y2 = np.minimum(roi_boxes[ind1, 3], gt_boxes[ind2, 3])
+            inter_w = np.maximum(inter_x2 - inter_x1 + 1, 0)
+            inter_h = np.maximum(inter_y2 - inter_y1 + 1, 0)
+            inter_area = inter_w * inter_h
+            iou = inter_area / (area1[ind1] + area2[ind2] - inter_area)
+            overlaps[ind1, ind2] = iou
+    return overlaps
+def _compute_targets(roi_boxes, gt_boxes, labels, bbox_reg_weights):
+    assert roi_boxes.shape[0] == gt_boxes.shape[0]
+    assert roi_boxes.shape[1] == 4
+    assert gt_boxes.shape[1] == 4
+    targets = np.zeros(roi_boxes.shape)
+    bbox_reg_weights = np.asarray(bbox_reg_weights)
+    targets = _box_to_delta(
+        ex_boxes=roi_boxes, gt_boxes=gt_boxes, weights=bbox_reg_weights)
+    return np.hstack([labels[:, np.newaxis], targets]).astype(
+        np.float32, copy=False)
+def _box_to_delta(ex_boxes, gt_boxes, weights):
+    ex_w = ex_boxes[:, 2] - ex_boxes[:, 0] + 1
+    ex_h = ex_boxes[:, 3] - ex_boxes[:, 1] + 1
+    ex_ctr_x = ex_boxes[:, 0] + 0.5 * ex_w
+    ex_ctr_y = ex_boxes[:, 1] + 0.5 * ex_h
+    gt_w = gt_boxes[:, 2] - gt_boxes[:, 0] + 1
+    gt_h = gt_boxes[:, 3] - gt_boxes[:, 1] + 1
+    gt_ctr_x = gt_boxes[:, 0] + 0.5 * gt_w
+    gt_ctr_y = gt_boxes[:, 1] + 0.5 * gt_h
+    dx = (gt_ctr_x - ex_ctr_x) / ex_w / weights[0]
+    dy = (gt_ctr_y - ex_ctr_y) / ex_h / weights[1]
+    dw = (np.log(gt_w / ex_w)) / ex_w / weights[2]
+    dh = (np.log(gt_h / ex_h)) / ex_h / weights[3]
+    targets = np.vstack([dx, dy, dw, dh]).transpose()
+    return targets
+def _expand_bbox_targets(bbox_targets_input, class_nums):
+    class_labels = bbox_targets_input[:, 0]
+    fg_inds = np.where(class_labels > 0)[0]
+    bbox_targets = np.zeros((class_labels.shape[0], 4 * class_nums))
+    bbox_inside_weights = np.zeros(bbox_targets.shape)
+    for ind in fg_inds:
+        class_label = int(class_labels[ind])
+        start_ind = class_label * 4
+        end_ind = class_label * 4 + 4
+        bbox_targets[ind, start_ind:end_ind] = bbox_targets_input[ind, 1:]
+        bbox_inside_weights[ind, start_ind:end_ind] = (1.0, 1.0, 1.0, 1.0)
+    return bbox_targets, bbox_inside_weights
+class TestGenerateProposalLabelsOp(OpTest):
+    def set_data(self):
+        self.init_test_params()
+        self.init_test_input()
+        self.init_test_output()
+        self.inputs = {
+            'RpnRois': (self.rpn_rois[0], self.rpn_rois_lod),
+            'GtClasses': (self.gt_classes[0], self.gts_lod),
+            'GtBoxes': (self.gt_boxes[0], self.gts_lod),
+            'ImScales': self.im_scales[0]
+        }
+        self.attrs = {
+            'batch_size_per_im': self.batch_size_per_im,
+            'fg_fraction': self.fg_fraction,
+            'fg_thresh': self.fg_thresh,
+            'bg_thresh_hi': self.bg_thresh_hi,
+            'bg_thresh_lo': self.bg_thresh_lo,
+            'bbox_reg_weights': self.bbox_reg_weights,
+            'class_nums': self.class_nums
+        }
+        self.outputs = {
+            'Rois': (self.rois[0], [self.lod]),
+            'LabelsInt32': (self.labels_int32[0], [self.lod]),
+            'BboxTargets': (self.bbox_targets[0], [self.lod]),
+            'BboxInsideWeights': (self.bbox_inside_weights[0], [self.lod]),
+            'BboxOutsideWeights': (self.bbox_outside_weights[0], [self.lod]),
+        }
+    def test_check_output(self):
+        self.check_output()
+    def setUp(self):
+        self.op_type = 'generate_proposal_labels'
+        self.set_data()
+    def init_test_params(self):
+        self.batch_size_per_im = 10
+        self.fg_fraction = 1.0
+        self.fg_thresh = 0.5
+        self.bg_thresh_hi = 0.5
+        self.bg_thresh_lo = 0.0
+        self.bbox_reg_weights = [0.1, 0.1, 0.2, 0.2]
+        self.class_nums = 81
+    def init_test_input(self):
+        np.random.seed(0)
+        image_nums = 1
+        gt_nums = 6  # Keep same with batch_size_per_im for unittest
+        proposal_nums = self.batch_size_per_im - gt_nums
+        images_shape = []
+        self.im_scales = []
+        for i in range(image_nums):
+            images_shape.append(np.random.randint(200, size=2))
+            self.im_scales.append(np.ones((1)).astype(np.float32))
+        self.rpn_rois, self.rpn_rois_lod = _generate_proposals(images_shape,
+                                                               proposal_nums)
+        ground_truth, self.gts_lod = _generate_groundtruth(
+            images_shape, self.class_nums, gt_nums)
+        self.gt_classes = [gt['gt_classes'] for gt in ground_truth]
+        self.gt_boxes = [gt['boxes'] for gt in ground_truth]
+    def init_test_output(self):
+        self.rois, self.labels_int32, self.bbox_targets, \
+        self.bbox_inside_weights, self.bbox_outside_weights, \
+        self.lod = generate_proposal_labels_in_python(
+                self.rpn_rois, self.gt_classes, self.gt_boxes, self.im_scales,
+                self.batch_size_per_im, self.fg_fraction,
+                self.fg_thresh, self.bg_thresh_hi, self.bg_thresh_lo,
+                self.bbox_reg_weights, self.class_nums
+            )
+def _generate_proposals(images_shape, proposal_nums):
+    rpn_rois = []
+    rpn_rois_lod = []
+    num_proposals = 0
+    for i, image_shape in enumerate(images_shape):
+        proposals = _generate_boxes(image_shape, proposal_nums)
+        rpn_rois.append(proposals)
+        num_proposals += len(proposals)
+        rpn_rois_lod.append(num_proposals)
+    return rpn_rois, [rpn_rois_lod]
+def _generate_groundtruth(images_shape, class_nums, gt_nums):
+    ground_truth = []
+    gts_lod = []
+    num_gts = 0
+    for i, image_shape in enumerate(images_shape):
+        # Avoid background
+        gt_classes = np.random.randint(
+            low=1, high=class_nums, size=gt_nums).astype(np.int32)
+        gt_boxes = _generate_boxes(image_shape, gt_nums)
+        ground_truth.append(dict(gt_classes=gt_classes, boxes=gt_boxes))
+        num_gts += len(gt_classes)
+        gts_lod.append(num_gts)
+    return ground_truth, [gts_lod]
+def _generate_boxes(image_size, box_nums):
+    width = image_size[0]
+    height = image_size[1]
+    xywh = np.random.rand(box_nums, 4)
+    xy1 = xywh[:, [0, 1]] * image_size
+    wh = xywh[:, [2, 3]] * (image_size - xy1)
+    xy2 = xy1 + wh
+    boxes = np.hstack([xy1, xy2])
+    boxes[:, [0, 2]] = np.minimum(width - 1., np.maximum(0., boxes[:, [0, 2]]))
+    boxes[:, [1, 3]] = np.minimum(height - 1., np.maximum(0., boxes[:, [1, 3]]))
+    return boxes.astype(np.float32)
+if __name__ == '__main__':
+    unittest.main()
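
Review note: the nested loops in `_bbox_overlaps` above could be cross-checked against a broadcast-based version. The following is only a minimal sketch, assuming the same inclusive-pixel (`+ 1`) width and height convention as the test helper. The name `bbox_overlaps_vectorized` is hypothetical and is not part of this change.

```python
import numpy as np


def bbox_overlaps_vectorized(roi_boxes, gt_boxes):
    """Pairwise IoU between (N, 4) and (M, 4) boxes given as [x1, y1, x2, y2].

    Hypothetical helper: mirrors the inclusive (+1) convention used by the
    loop-based _bbox_overlaps in the test, but relies on NumPy broadcasting.
    """
    roi_boxes = np.asarray(roi_boxes, dtype=np.float64)
    gt_boxes = np.asarray(gt_boxes, dtype=np.float64)

    # Per-box areas, clipped at zero for degenerate boxes.
    area1 = (np.maximum(roi_boxes[:, 2] - roi_boxes[:, 0] + 1, 0) *
             np.maximum(roi_boxes[:, 3] - roi_boxes[:, 1] + 1, 0))
    area2 = (np.maximum(gt_boxes[:, 2] - gt_boxes[:, 0] + 1, 0) *
             np.maximum(gt_boxes[:, 3] - gt_boxes[:, 1] + 1, 0))

    # Broadcast (N, 1) against (1, M) to get every pairwise intersection.
    inter_x1 = np.maximum(roi_boxes[:, None, 0], gt_boxes[None, :, 0])
    inter_y1 = np.maximum(roi_boxes[:, None, 1], gt_boxes[None, :, 1])
    inter_x2 = np.minimum(roi_boxes[:, None, 2], gt_boxes[None, :, 2])
    inter_y2 = np.minimum(roi_boxes[:, None, 3], gt_boxes[None, :, 3])
    inter_w = np.maximum(inter_x2 - inter_x1 + 1, 0)
    inter_h = np.maximum(inter_y2 - inter_y1 + 1, 0)
    inter_area = inter_w * inter_h

    union = area1[:, None] + area2[None, :] - inter_area
    return inter_area / union
```

For the small box counts used in this unit test the loop version is perfectly adequate; the sketch is mainly useful as an independent check of the expected overlaps.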
--- a/python/paddle/fluid/tests/unittests/test_layers.py
+++ b/python/paddle/fluid/tests/unittests/test_layers.py
@@ -240,6 +240,22 @@ class TestBook(unittest.TestCase):
            self.assertIsNotNone(layers.softmax(hid))
        print(str(program))
+    def test_sequence_unsqueeze(self):
+        program = Program()
+        with program_guard(program):
+            x = layers.data(name='x', shape=[8, 2], dtype='float32')
+            out = layers.unsqueeze(input=x, axes=[1])
+            self.assertIsNotNone(out)
+        print(str(program))
+    def test_squeeze(self):
+        program = Program()
+        with program_guard(program):
+            x = layers.data(name='x', shape=[1, 1, 4], dtype='float32')
+            out = layers.squeeze(input=x, axes=[2])
+            self.assertIsNotNone(out)
+        print(str(program))
    def test_lrn(self):
        program = Program()
        with program_guard(program):

--- a/python/paddle/fluid/tests/unittests/test_name_scope.py
+++ b/python/paddle/fluid/tests/unittests/test_name_scope.py
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import print_function
+import unittest
+import paddle.fluid as fluid
+class TestNameScope(unittest.TestCase):
+    def test_name_scope(self):
+        with fluid.name_scope("s1"):
+            a = fluid.layers.data(name='data', shape=[1], dtype='int32')
+            b = a + 1
+            with fluid.name_scope("s2"):
+                c = b * 1
+            with fluid.name_scope("s3"):
+                d = c / 1
+        with fluid.name_scope("s1"):
+            f = fluid.layers.pow(d, 2.0)
+        with fluid.name_scope("s4"):
+            g = f - 1
+        for op in fluid.default_main_program().block(0).ops:
+            if op.type == 'elementwise_add':
+                self.assertEqual(op.desc.attr("op_namescope"), '/s1/')
+            elif op.type == 'elementwise_mul':
+                self.assertEqual(op.desc.attr("op_namescope"), '/s1/s2/')
+            elif op.type == 'elementwise_div':
+                self.assertEqual(op.desc.attr("op_namescope"), '/s1/s3/')
+            elif op.type == 'elementwise_sub':
+                self.assertEqual(op.desc.attr("op_namescope"), '/s4/')
+            elif op.type == 'pow':
+                self.assertEqual(op.desc.attr("op_namescope"), '/s1_1/')
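
Review note: the assertions in `test_name_scope` imply that nested scopes are joined into a `/`-delimited path and that re-entering a name already used at the same level gets a numeric suffix (`/s1_1/`). The toy model below is a sketch of that naming rule only, inferred from the expected strings in the test. `NameScopeTracker` and `_Scope` are hypothetical names, and this is not the `fluid.name_scope` implementation.

```python
import contextlib


class _Scope(object):
    """One node in a tree of name scopes (hypothetical model)."""

    def __init__(self, name="", parent=None):
        self.name = name
        self.parent = parent
        self._child_counts = {}

    def child(self, prefix):
        # The first use of a prefix keeps the bare name; later uses at the
        # same level get a "_<n>" suffix.
        n = self._child_counts.get(prefix, 0)
        self._child_counts[prefix] = n + 1
        name = prefix if n == 0 else "%s_%d" % (prefix, n)
        return _Scope(name, self)

    def path(self):
        # Walk up to the root and join the names with "/".
        parts = []
        scope = self
        while scope.parent is not None:
            parts.append(scope.name)
            scope = scope.parent
        return "/" + "".join(p + "/" for p in reversed(parts))


class NameScopeTracker(object):
    """Tracks the current scope and hands out scope paths."""

    def __init__(self):
        self._current = _Scope()

    @contextlib.contextmanager
    def scope(self, prefix):
        self._current = self._current.child(prefix)
        try:
            yield self._current.path()
        finally:
            self._current = self._current.parent
```

Entering `s1`, then `s2` and `s3` inside it, then `s1` again and `s4` at the top level yields the same five paths that the test asserts on `op_namescope`.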
--- a/python/paddle/fluid/tests/unittests/test_operator_desc.py
+++ b/python/paddle/fluid/tests/unittests/test_operator_desc.py
@@ -67,7 +67,10 @@ class TestOperator(unittest.TestCase):
        self.assertEqual(mul_op.output("Out"), ["mul.out"])
        self.assertEqual(
            set(mul_op.attr_names),
-            set(["x_num_col_dims", "y_num_col_dims", "op_role", "op_role_var"]))
+            set([
+                "x_num_col_dims", "y_num_col_dims", "op_role", "op_role_var",
+                "op_namescope"
+            ]))
        self.assertEqual(mul_op.has_attr("x_num_col_dims"), True)
        self.assertEqual(mul_op.attr_type("x_num_col_dims"), core.AttrType.INT)
        self.assertEqual(mul_op.attr("x_num_col_dims"), 1)

--- a/python/paddle/fluid/tests/unittests/test_parallel_executor_mnist.py
+++ b/python/paddle/fluid/tests/unittests/test_parallel_executor_mnist.py
@@ -67,6 +67,7 @@ def fc_with_batchnorm(use_feed):
    hidden = img
    for _ in range(1):
+        with fluid.name_scope("hidden"):
            hidden = fluid.layers.fc(
                hidden,
                size=200,
@@ -75,8 +76,9 @@ def fc_with_batchnorm(use_feed):
                    initializer=fluid.initializer.Constant(value=1.0)))
            hidden = fluid.layers.batch_norm(input=hidden)
+    with fluid.name_scope("fc_layer"):
        prediction = fluid.layers.fc(hidden, size=10, act='softmax')
+    with fluid.name_scope("loss"):
        loss = fluid.layers.cross_entropy(input=prediction, label=label)
        loss = fluid.layers.mean(loss)
    return loss

--- a/python/paddle/fluid/transpiler/details/program_utils.py
+++ b/python/paddle/fluid/transpiler/details/program_utils.py
@@ -62,9 +62,12 @@ def variable_to_code(var):
    Returns:
        string: The formatted string.
    """
+    if var.type == core.VarDesc.VarType.SELECTED_ROWS or var.type == core.VarDesc.VarType.LOD_TENSOR:
        var_str = "{name} : fluid.{type}.shape{shape}.astype({dtype})".\
            format(i="{", e="}", name=var.name, type=var.type, shape=var.shape, dtype=var.dtype)
+    else:
+        var_str = "{name} : fluid.{type}".\
+            format(i="{", e="}", name=var.name, type=var.type)
    if type(var) == paddle.fluid.framework.Parameter:
        if var.trainable:
@@ -142,27 +145,17 @@ def op_to_code(op):
    return op_str
-def program_to_code(prog):
+def block_to_code(block, block_idx):
-    """
-    Print readable codes of fluid program.
-    Args:
-        prog : A fluid program.
-    An example result like bellow:
-    https://github.com/PaddlePaddle/Paddle/pull/12673
-    """
    indent = 0
-    block_idx = 0
-    for block in prog.blocks:
    print("{0}{1} // block {2}".format(
        get_indent_space(indent), '{', block_idx))
    indent += 1
    # sort all vars
-        all_vars = sorted(six.iteritems(block.vars), key=lambda x: x[0])
+    all_vars = sorted(block.vars.iteritems(), key=lambda x: x[0])
    for var in all_vars:
-            print("{}{}".format(
+        print("{}{}".format(get_indent_space(indent), variable_to_code(var[1])))
-                get_indent_space(indent), variable_to_code(var[1])))
    if len(all_vars) > 0:
        print("")
@@ -170,5 +163,21 @@ def program_to_code(prog):
    for op in block.ops:
        print("{}{}".format(get_indent_space(indent), op_to_code(op)))
    indent -= 1
    print("{0}{1}".format(get_indent_space(indent), '}'))
+def program_to_code(prog):
+    """
+    Print readable codes of fluid program.
+    Args:
+        prog : A fluid program.
+    An example result is shown below:
+    https://github.com/PaddlePaddle/Paddle/pull/12673
+    """
+    block_idx = 0
+    for block in prog.blocks:
+        block_to_code(block, block_idx)
        block_idx += 1
--- a/python/paddle/fluid/transpiler/distribute_transpiler.py
+++ b/python/paddle/fluid/transpiler/distribute_transpiler.py
@@ -31,6 +31,7 @@ Steps to transpile pserver:
 """
 import math
+import sys
 import numpy as np
 import collections
 import six
@@ -181,7 +182,8 @@ class DistributeTranspiler(object):
                  program=None,
                  pservers="127.0.0.1:6174",
                  trainers=1,
-                  sync_mode=True):
+                  sync_mode=True,
+                  startup_program=None):
        """
        Run the transpiler.
@@ -194,13 +196,17 @@ class DistributeTranspiler(object):
                list.
            trainers (int): number of trainers in the distributed job.
            sync_mode (bool): Do sync training or not, default is True.
+            startup_program (Program|None): startup_program to transpile,
+                default is fluid.default_startup_program().
        """
        if program is None:
            program = default_main_program()
+        if startup_program is None:
+            startup_program = default_startup_program()
        self.origin_program = program
-        self.origin_startup_program = default_startup_program().clone()
+        self.startup_program = startup_program
+        self.origin_startup_program = self.startup_program.clone()
-        self.startup_program = default_startup_program()
        self.trainer_num = trainers
        self.sync_mode = sync_mode
        self.trainer_id = trainer_id
@@ -267,6 +273,10 @@ class DistributeTranspiler(object):
                name=framework.generate_control_dev_var_name())
            grad_name_to_send_dummy_out[grad_varname] = dummy_output
+            # Get the send op_role_var. If the grad is not split, it should have
+            # the .trainer suffix; if it is split, use the original grad var name
+            # (split_by_ref and send will be on the same place). ParallelExecutor
+            # will use op_role_var to get the expected device place to run this op.
            program.global_block()._insert_op(
                index=index + 1,
                type="send",
@@ -275,8 +285,10 @@ class DistributeTranspiler(object):
                attrs={
                    "epmap": eplist,
                    RPC_OP_ROLE_ATTR_NAME: RPC_OP_ROLE_ATTR_VALUE,
-                    OP_ROLE_VAR_ATTR_NAME:
+                    OP_ROLE_VAR_ATTR_NAME: [
-                    [self.grad_name_to_param_name[grad_varname], grad_varname],
+                        self.grad_name_to_param_name[grad_varname],
+                        splited_grad_varname
+                    ],
                    "sync_mode": not self.sync_mode,
                })
            for _, var in enumerate(splited_vars):
@@ -320,6 +332,15 @@ class DistributeTranspiler(object):
                recv_dep_in = grad_name_to_send_dummy_out[
                    self.param_name_to_grad_name[param_varname]]
            all_recv_outputs.extend(splited_var)
+            # Get the recv op_role_var. If the grad is not split, it should have
+            # the .trainer suffix; if it is split, use the original grad var name.
+            # ParallelExecutor will use op_role_var to get the expected device
+            # place to run this op.
+            orig_grad_name = self.param_name_to_grad_name[param_varname]
+            recv_op_role_var_name = orig_grad_name
+            splited_trainer_grad = self.grad_var_mapping[orig_grad_name]
+            if len(splited_trainer_grad) == 1:
+                recv_op_role_var_name = splited_trainer_grad[0].name
            program.global_block().append_op(
                type="recv",
                inputs={"X": [recv_dep_in]},
@@ -327,10 +348,8 @@ class DistributeTranspiler(object):
                attrs={
                    "epmap": eps,
                    RPC_OP_ROLE_ATTR_NAME: RPC_OP_ROLE_ATTR_VALUE,
-                    OP_ROLE_VAR_ATTR_NAME: [
+                    OP_ROLE_VAR_ATTR_NAME:
-                        param_varname,
+                    [param_varname, recv_op_role_var_name],
-                        self.param_name_to_grad_name[param_varname]
-                    ],
                    "sync_mode": not self.sync_mode
                })
@@ -376,20 +395,17 @@ class DistributeTranspiler(object):
        return self.origin_program
-    def _get_trainer_startup_program(self,
+    def _get_trainer_startup_program(self, recv_vars, eplist):
-                                     recv_vars,
-                                     eplist,
-                                     startup_program=None):
        """
        Get transpiled trainer side startup program.
        Args:
-            startup_program(Program): Startup program.
+            recv_vars (list): Variable list to recv for current trainer_id
+            eplist (list): A list of strings indicating 
        Returns:
            Program: trainer side startup program.
        """
-        if startup_program is None:
        startup_program = self.startup_program
        # FIXME(gongwb): delete not need ops.
@@ -438,7 +454,18 @@ class DistributeTranspiler(object):
            #add concat ops to merge splited parameters received from parameter servers.
            if len(splited_var) <= 1:
                continue
+            # NOTE: if memory optimization is enabled, the original vars may have been removed.
+            if startup_program.global_block().vars.has_key(varname):
                orig_param = startup_program.global_block().vars[varname]
+            else:
+                origin_param_var = self.origin_program.global_block().vars[
+                    varname]
+                orig_param = startup_program.global_block().create_var(
+                    name=varname,
+                    persistable=origin_param_var.persistable,
+                    type=origin_param_var.type,
+                    dtype=origin_param_var.dtype,
+                    shape=origin_param_var.shape)
            startup_program.global_block().append_op(
                type="concat",
                inputs={"X": splited_var},
@@ -461,7 +488,9 @@ class DistributeTranspiler(object):
        # NOTE: assume blocks of the same variable is not distributed
        # on the same pserver, only change param/grad varnames for
        # trainers to fetch.
+        sys.stderr.write("get_pserver_program() is deprecated, call\
+            get_pserver_programs() to get pserver main and startup\
+            in a single call.")
        # step1
        pserver_program = Program()
        pserver_program.random_seed = self.origin_program.random_seed
@@ -651,32 +680,58 @@ class DistributeTranspiler(object):
            endpoint)
        pserver_program._sync_with_cpp()
+        # Save the pserver program so the pserver-side startup program can be generated from it later.
+        self.pserver_program = pserver_program
        return pserver_program
+    def get_pserver_programs(self, endpoint):
+        """
+        Get pserver side main program and startup program for distributed training.
+        Args:
+            endpoint (str): current pserver endpoint.
+        Returns:
+            tuple: (main_program, startup_program), of type "Program"
+        """
+        pserver_prog = self.get_pserver_program(endpoint)
+        pserver_startup = self.get_startup_program(endpoint)
+        return pserver_prog, pserver_startup
    def get_startup_program(self,
                            endpoint,
-                            pserver_program,
+                            pserver_program=None,
                            startup_program=None):
        """
+        **Deprecated**
        Get startup program for current parameter server.
        Modify operator input variables if there are variables that
        were split to several blocks.
        Args:
            endpoint (str): current pserver endpoint.
-            pserver_program (Program): call get_pserver_program first and
+            pserver_program (Program): deprecated, call get_pserver_program first.
-                pass the result here.
+            startup_program (Program): deprecated, should pass startup_program
-            startup_program (Program): if pass None, will use
+                when initializing 
-                default_startup_program
        Returns:
            Program: parameter server side startup program.
        """
+        sys.stderr.write("get_startup_program() is deprecated, call\
+            get_pserver_programs() to get pserver main and startup\
+            in a single call.")
+        if pserver_program != None:
+            sys.stderr.write("passing pserver_program to get_startup_program()\
+                is deprecated, you can use new API get_pserver_programs() to\
+                get both pserver main program and startup program.")
+        if startup_program != None:
+            sys.stderr.write("passing startup_program to get_startup_program()\
+                is deprecated, use fluid.program_guard() or pass this argument\
+                to transpile() call.")
        s_prog = Program()
-        if not startup_program:
+        orig_s_prog = self.startup_program
-            orig_s_prog = default_startup_program()
-        else:
-            orig_s_prog = startup_program
        s_prog.random_seed = orig_s_prog.random_seed
        params = self.param_grad_ep_mapping[endpoint]["params"]