Commit a39e6077 authored by: Q qiaolongfei

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add-async-listen-and-serv-op

......@@ -5,17 +5,24 @@
evaluator
=========
Accuracy
--------
ChunkEvaluator
--------------
.. autoclass:: paddle.fluid.evaluator.Accuracy
.. autoclass:: paddle.fluid.evaluator.ChunkEvaluator
:members:
:noindex:
ChunkEvaluator
EditDistance
--------------
.. autoclass:: paddle.fluid.evaluator.ChunkEvaluator
.. autoclass:: paddle.fluid.evaluator.EditDistance
:members:
:noindex:
DetectionMAP
--------------
.. autoclass:: paddle.fluid.evaluator.DetectionMAP
:members:
:noindex:
......@@ -67,8 +67,7 @@ XavierInitializer
.. autoclass:: paddle.fluid.initializer.XavierInitializer
:members:
:noindex:
MSRA
------
MSRAInitializer
-----------------
......
......@@ -815,3 +815,8 @@ zeros
.. autofunction:: paddle.fluid.layers.zeros
:noindex:
topk
----
.. autofunction:: paddle.fluid.layers.topk
:noindex:
......@@ -47,10 +47,51 @@ DecayedAdagrad
:members:
:noindex:
SGDOptimizer
------------
.. autoclass:: paddle.fluid.optimizer.SGDOptimizer
:members:
:noindex:
MomentumOptimizer
-----------------
.. autoclass:: paddle.fluid.optimizer.MomentumOptimizer
:members:
:noindex:
AdagradOptimizer
----------------
.. autoclass:: paddle.fluid.optimizer.AdagradOptimizer
:members:
:noindex:
AdamOptimizer
-------------
.. autoclass:: paddle.fluid.optimizer.AdamOptimizer
:members:
:noindex:
AdamaxOptimizer
---------------
.. autoclass:: paddle.fluid.optimizer.AdamaxOptimizer
:members:
:noindex:
DecayedAdagradOptimizer
-----------------------
.. autoclass:: paddle.fluid.optimizer.DecayedAdagradOptimizer
:members:
:noindex:
Adadelta
--------------
.. autoclass:: paddle.fluid.optimizer.AdadeltaOptimizer
:members:
:noindex:
......@@ -25,3 +25,16 @@ L2Decay
:members:
:noindex:
L1DecayRegularizer
---------------------
.. autoclass:: paddle.fluid.regularizer.L1DecayRegularizer
:members:
:noindex:
L2DecayRegularizer
---------------------
.. autoclass:: paddle.fluid.regularizer.L2DecayRegularizer
:members:
:noindex:
../../v2/dev/contribute_to_paddle_cn.md
\ No newline at end of file
../../v2/dev/contribute_to_paddle_en.md
\ No newline at end of file
......@@ -4,6 +4,8 @@
.. toctree::
:maxdepth: 1
contribute_to_paddle_cn.md
write_docs_cn.md
api_doc_std_cn.md
new_op_cn.md
new_op_kernel.md
......
......@@ -4,6 +4,8 @@ Development
.. toctree::
:maxdepth: 1
contribute_to_paddle_en.md
write_docs_en.md
api_doc_std_en.md
new_op_en.md
new_op_kernel.md
......
......@@ -54,10 +54,10 @@
</table>
New op implementations are added under the directory [paddle/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators); file names end with `*_op.h` (if applicable), `*_op.cc` and `*_op.cu` (if applicable). **The system automatically builds the op and its corresponding Python extension based on the file names.**
New op implementations are added under the directory [paddle/fluid/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators); file names end with `*_op.h` (if applicable), `*_op.cc` and `*_op.cu` (if applicable). **The system automatically builds the op and its corresponding Python extension based on the file names.**
The following uses the matrix multiplication operation, [MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/mul_op.cc), as an example to show how to write an Operator with a Kernel.
The following uses the matrix multiplication operation, [MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc), as an example to show how to write an Operator with a Kernel.
## Implementing the C++ Classes
......@@ -85,17 +85,17 @@ The equation is: Out = X * Y
};
```
[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/mul_op.cc#L43) inherits from `framework::OpProtoAndCheckerMaker`; its constructor takes 2 parameters:
[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L76-L127) inherits from `framework::OpProtoAndCheckerMaker`; its constructor takes 2 parameters:
- `framework::OpProto` : stores the op's inputs, outputs and attribute descriptions, which are used to generate the Python API.
- `framework::OpAttrChecker` : checks the validity of attribute values.
In the constructor, `AddInput` adds an input parameter, `AddOutput` adds an output parameter, and `AddComment` adds the op's documentation. These calls append the corresponding content to `OpProto`.
The code above adds two inputs `X` and `Y` and one output `Out` to `MulOp` and explains their meanings; please follow the [naming convention](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/name_convention.md) when naming them.
The code above adds two inputs `X` and `Y` and one output `Out` to `MulOp` and explains their meanings; please follow the [naming convention](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/dev/name_convention.md) when naming them.
Take [`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/scale_op.cc#L37) as another example:
Take [`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/scale_op.cc#L38-L55) as another example:
```cpp
template <typename AttrType>
......@@ -103,21 +103,21 @@ class ScaleOpMaker : public framework::OpProtoAndCheckerMaker {
public:
ScaleOpMaker(OpProto *proto, OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "The input tensor of scale operator.").NotInGradient();
AddOutput("Out", "The output tensor of scale operator.").NotInGradient();
AddComment(R"DOC(Scale operator
The equation is: Out = scale*X
AddInput("X", "(Tensor) Input tensor of scale operator.");
AddOutput("Out", "(Tensor) Output tensor of scale operator.");
AddComment(R"DOC(
Scale operator
$$Out = scale*X$$
)DOC");
AddAttr<AttrType>("scale", "scale of scale operator.").SetDefault(1.0);
AddAttr<AttrType>("scale",
"(float, default 1.0)"
"The scaling factor of the scale operator.")
.SetDefault(1.0);
}
};
```
This example differs in two ways:
- `AddInput("X","...").NotInGradient()` : indicates that the input `X` does not take part in the computation of `ScaleOp`'s gradient op. If an input of an op is not involved in the backward gradient computation, call `.NotInGradient()` explicitly.
- `AddAttr<AttrType>("scale", "...").SetDefault(1.0);` : adds the `scale` coefficient as an attribute and sets its default value to 1.0.
This example uses `AddAttr<AttrType>("scale", "...").SetDefault(1.0);` to add the `scale` coefficient as an attribute and set its default value to 1.0.
### Defining the Operator Class
......@@ -205,7 +205,6 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
To make the `Compute` implementation simpler to write and to reuse code between CPU and CUDA, we usually implement the `Compute` interface with the Eigen unsupported Tensor module. For how to use the Eigen library in PaddlePaddle, please refer to the [usage document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/use_eigen_cn.md).
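As a rough, illustrative sketch only (modeled on a scale-style element-wise op rather than `MulOp` itself; the kernel and attribute names here are hypothetical), a `Compute` written with Eigen typically looks like this, and the same body can be instantiated for both the CPU and the CUDA device context:
```cpp
// Hypothetical sketch: an element-wise OpKernel whose Compute uses Eigen,
// so the identical code can be registered for CPU and CUDA device contexts.
template <typename DeviceContext, typename T>
class MyScaleKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
    auto* in = ctx.Input<framework::Tensor>("X");
    auto* out = ctx.Output<framework::Tensor>("Out");
    out->mutable_data<T>(ctx.GetPlace());
    auto scale = static_cast<T>(ctx.Attr<float>("scale"));

    auto x = framework::EigenVector<T>::Flatten(*in);
    auto y = framework::EigenVector<T>::Flatten(*out);
    auto& place =
        *ctx.template device_context<DeviceContext>().eigen_device();
    y.device(place) = x * scale;  // Eigen dispatches to the CPU or GPU device
  }
};
```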
At this point, the forward op is complete. Next, the op and its kernel need to be registered in the `.cc` file.
The backward op class and the backward OpKernel are defined in the same way as the forward op, so they are not repeated here. **Note, however, that the backward op has no `ProtoMaker`.**
......@@ -215,7 +214,9 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
```cpp
namespace ops = paddle::operators;
REGISTER_OP(mul, ops::MulOp, ops::MulOpMaker, mul_grad, ops::MulOpGrad);
REGISTER_OPERATOR(mul, ops::MulOp, ops::MulOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>)
REGISTER_OPERATOR(mul_grad, ops::MulGradOp)
REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel<paddle::platform::CPUDeviceContext, float>);
REGISTER_OP_CPU_KERNEL(mul_grad,
ops::MulGradKernel<paddle::platform::CPUDeviceContext, float>);
......@@ -223,8 +224,7 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
In the code above:
- `REGISTER_OP` : registers the `ops::MulOp` class under the type name `mul`, with `ops::MulOpMaker` as its `ProtoMaker`, and registers `ops::MulOpGrad` under the type name `mul_grad`.
- `REGISTER_OP_WITHOUT_GRADIENT` : registers an op that has no backward op.
- `REGISTER_OPERATOR` : registers the `ops::MulOp` class under the type name `mul`, with `ops::MulOpMaker` as its `ProtoMaker`, and registers `ops::MulOpGrad` under the type name `mul_grad`.
- `REGISTER_OP_CPU_KERNEL` : registers the `ops::MulKernel` class specialized with the template parameters `paddle::platform::CPUPlace` and `float`; `ops::MulGradKernel` is registered in the same way (the corresponding CUDA-side registration is sketched below).
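For reference, a hedged sketch of the matching CUDA-side registration, which usually lives in `mul_op.cu` (assuming `MulKernel` and `MulGradKernel` are the templated kernels declared in `mul_op.h`; the file in the repository is the authoritative version):
```cpp
// Hypothetical sketch of the GPU kernel registration in mul_op.cu.
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(
    mul, ops::MulKernel<paddle::platform::CUDADeviceContext, float>);
REGISTER_OP_CUDA_KERNEL(
    mul_grad, ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>);
```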
......@@ -255,7 +255,7 @@ make mul_op
## Implementing Unit Tests
Unit tests cover comparing the forward op's implementations on different devices (CPU, CUDA), comparing the backward op's implementations on different devices (CPU, CUDA), and the gradient test of the backward op. The following introduces the [unit tests for `MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/framework/tests/test_mul_op.py).
Unit tests cover comparing the forward op's implementations on different devices (CPU, CUDA), comparing the backward op's implementations on different devices (CPU, CUDA), and the gradient test of the backward op. The following introduces the [unit tests for `MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_mul_op.py).
### Testing the Forward Operator
......@@ -315,7 +315,7 @@ Op unit tests inherit from `OpTest`. The more specific unit tests are implemented in `TestMulOp
### Compiling and Running
New `test_*.py` unit tests added under the `python/paddle/v2/framework/tests` directory are automatically added to the project and compiled.
New `test_*.py` unit tests added under the `python/paddle/fluid/tests/unittests/` directory are automatically added to the project and compiled.
Note that **unlike the build test for a single op, running unit tests requires compiling the entire project**, and the build must enable `WITH_TESTING`, i.e. `cmake paddle_dir -DWITH_TESTING=ON`. After a successful build, run the following command to execute the unit tests:
......@@ -331,7 +331,6 @@ ctest -R test_mul_op
## Notes
- Create a separate `*_op.h` (if applicable), `*_op.cc` and `*_op.cu` (if applicable) for each op. A single file containing multiple ops is not allowed; this will cause compilation errors.
- The type name used when registering an op must be identical to the op's name. That is, registering `REGISTER_OP(B, ...)` inside `A_op.cc` is not allowed; this will cause unit test failures.
- The type name used when registering an op must be identical to the op's name. That is, registering `REGISTER_OPERATOR(B, ...)` inside `A_op.cc` is not allowed; this will cause unit test failures.
- If the op does not implement a CUDA kernel, please do not create an empty `*_op.cu`; this will cause unit test failures.
- If multiple ops depend on some shared functions, these can be placed in a file not named in the `*_op.*` format, e.g. `gather.h`.
......@@ -61,10 +61,10 @@ Registering the Op | Ops are registered in `.cc` files; For Kernel reg
</table>
New Operator implementations are added under the directory [paddle/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators), with file names in the format `*_op.h` (if applicable), `*_op.cc`, `*_op.cu` (if applicable). **The system will use the naming scheme to automatically build operators and their corresponding Python extensions.**
New Operator implementations are added under the directory [paddle/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators), with file names in the format `*_op.h` (if applicable), `*_op.cc`, `*_op.cu` (if applicable). **The system will use the naming scheme to automatically build operators and their corresponding Python extensions.**
Let's take the matrix multiplication operator, [MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/mul_op.cc), as an example to introduce the writing of an Operator with a Kernel.
Let's take the matrix multiplication operator, [MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc), as an example to introduce the writing of an Operator with a Kernel.
## Implementing C++ Types
......@@ -92,17 +92,17 @@ The equation is: Out = X * Y
};
```
[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/mul_op.cc#L43) is inherited from `framework::OpProtoAndCheckerMaker`, whose constructor takes 2 parameters:
[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L76-L127) is inherited from `framework::OpProtoAndCheckerMaker`, whose constructor takes 2 parameters:
- `framework::OpProto` stores the operator's inputs, outputs and attribute descriptions, used for generating Python API interfaces.
- `framework::OpAttrChecker` is used to validate attribute values.
The constructor utilizes `AddInput`, `AddOutput`, and `AddComment`, so that the corresponding information will be added to `OpProto`.
The code above adds two inputs `X` and `Y` to `MulOp`, an output `Out`, and their corresponding descriptions, in accordance with Paddle's [naming convention](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/name_convention.md).
The code above adds two inputs `X` and `Y` to `MulOp`, an output `Out`, and their corresponding descriptions, in accordance with Paddle's [naming convention](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/dev/name_convention.md).
An additional example [`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/scale_op.cc#L37) is implemented as follows:
An additional example [`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/scale_op.cc#L38-L55) is implemented as follows:
```cpp
template <typename AttrType>
......@@ -120,11 +120,7 @@ The equation is: Out = scale*X
};
```
There are two changes in this example:
- `AddInput("X","...").NotInGradient()` expresses that input `X` is not involved in the computation of `ScaleOp`'s corresponding gradient op. If an input to an operator does not participate in back-propagation, please set `.NotInGradient()` explicitly.
- `AddAttr<AttrType>("scale", "...").SetDefault(1.0);` adds the `scale` coefficient as an attribute, and sets its default value to 1.0.
Note `AddAttr<AttrType>("scale", "...").SetDefault(1.0);` adds the `scale` coefficient as an attribute, and sets its default value to 1.0.
### Defining Operator
......@@ -154,7 +150,7 @@ class MulOp : public framework::OperatorWithKernel {
};
```
[`MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/mul_op.cc#L22) is inherited from `OperatorWithKernel`. Its `public` member
[`MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L24) is inherited from `OperatorWithKernel`. Its `public` member
```cpp
using framework::OperatorWithKernel::OperatorWithKernel;
......@@ -209,7 +205,7 @@ Usually `OpProtoMaker` and `Op`'s type definitions are written in `.cc` files, w
Note that **different devices (CPU, CUDA) share one Op definition; whether or not they share the same `OpKernel` depends on whether the functions called by `Compute` support both devices.**
`MulOp`'s CPU and CUDA share the same `Kernel`. A non-sharing `OpKernel` example can be seen in [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.h#L43).
`MulOp`'s CPU and CUDA share the same `Kernel`. A non-sharing `OpKernel` example can be seen in [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.cc).
To ease the writing of an `OpKernel`'s `Compute`, and to reuse code across devices, the [`Eigen-unsupported Tensor`](https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md?fileviewer=file-view-default) module is used to implement the `Compute` interface. To learn how the Eigen library is used in PaddlePaddle, please see the [usage document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/use_eigen_cn.md).
......@@ -224,7 +220,9 @@ The definition of its corresponding backward operator, if applicable, is similar
```cpp
namespace ops = paddle::operators;
REGISTER_OP(mul, ops::MulOp, ops::MulOpMaker, mul_grad, ops::MulOpGrad);
REGISTER_OPERATOR(mul, ops::MulOp, ops::MulOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>)
REGISTER_OPERATOR(mul_grad, ops::MulGradOp)
REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel<paddle::platform::CPUDeviceContext, float>);
REGISTER_OP_CPU_KERNEL(mul_grad,
......@@ -233,9 +231,8 @@ The definition of its corresponding backward operator, if applicable, is similar
In that code block,
- `REGISTER_OP` registers the `ops::MulOp` class under the type name `mul`, with `ops::MulOpMaker` as its `ProtoMaker`, and registers `ops::MulOpGrad` as `mul_grad`.
- `REGISTER_OPERATOR` registers the `ops::MulOp` class under the type name `mul`, with `ops::MulOpMaker` as its `ProtoMaker`, and registers `ops::MulOpGrad` as `mul_grad`.
- `REGISTER_OP_WITHOUT_GRADIENT` registers an operator without a gradient.
- `REGISTER_OP_CPU_KERNEL` registers the `ops::MulKernel` class specialized with the template parameters `paddle::platform::CPUPlace` and `float`; `ops::MulGradKernel` is registered in the same way.
......@@ -275,7 +272,7 @@ Unit tests for an operator include
3. a scaling test for the backward operator.
Here, we introduce the [unit tests for `MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/framework/tests/test_mul_op.py).
Here, we introduce the [unit tests for `MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_mul_op.py).
### Testing Forward Operators
......@@ -339,7 +336,7 @@ Some key points in checking gradient above include:
### Compiling and Running
Any new unit testing file of the format `test_*.py` added to the directory `python/paddle/v2/framework/tests` is automatically added to the project to compile.
Any new unit testing file of the format `test_*.py` added to the directory `python/paddle/fluid/tests/unittests/` is automatically added to the project to compile.
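As a rough sketch only (the actual `test_mul_op.py` in the repository is more thorough), such a `test_*.py` built on the `OpTest` harness typically looks like the following:
```python
# Hypothetical minimal test in the OpTest style; see test_mul_op.py for the
# authoritative version.
import unittest

import numpy as np
from op_test import OpTest  # harness provided alongside the unit tests


class TestMulOp(OpTest):
    def setUp(self):
        self.op_type = "mul"  # must match the registered op type name
        self.inputs = {
            'X': np.random.random((32, 84)).astype("float32"),
            'Y': np.random.random((84, 100)).astype("float32")
        }
        self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])}

    def test_check_output(self):
        self.check_output()

    def test_check_grad(self):
        self.check_grad(['X', 'Y'], 'Out')


if __name__ == "__main__":
    unittest.main()
```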
Note that **unlike the compile test for Ops, running unit tests requires compiling the entire project** and requires compiling with the flag `WITH_TESTING` on, i.e. `cmake paddle_dir -DWITH_TESTING=ON`.
......@@ -357,7 +354,6 @@ ctest -R test_mul_op
## Remarks
- Every `*_op.h` (if applicable), `*_op.cc`, and `*_op.cu` (if applicable) must be created for a unique Op. Compiling will fail if multiple operators are included per file.
- The type with which an operator is registered needs to be identical to the Op's name. Registering `REGISTER_OP(B, ...)` in `A_op.cc` will cause unit testing failures.
- The type with which an operator is registered needs to be identical to the Op's name. Registering `REGISTER_OPERATOR(B, ...)` in `A_op.cc` will cause unit testing failures.
- If the operator does not implement a CUDA kernel, please refrain from creating an empty `*_op.cu` file, or else unit tests will fail.
- If multiple operators rely on some shared methods, a file NOT named `*_op.*` can be created to store them, such as `gather.h`.
../../v2/dev/write_docs_cn.rst
\ No newline at end of file
../../v2/dev/write_docs_en.rst
\ No newline at end of file
......@@ -6,7 +6,43 @@ Data Reader Interface
DataTypes
=========
.. automodule:: paddle.v2.data_type
.. autofunction:: paddle.v2.data_type.dense_array
:noindex:
.. autofunction:: paddle.v2.data_type.integer_value
:noindex:
.. autofunction:: paddle.v2.data_type.integer_value_sequence
:noindex:
.. autofunction:: paddle.v2.data_type.integer_value_sub_sequence
:noindex:
.. autofunction:: paddle.v2.data_type.sparse_binary_vector
:noindex:
.. autofunction:: paddle.v2.data_type.sparse_binary_vector_sequence
:noindex:
.. autofunction:: paddle.v2.data_type.sparse_binary_vector_sub_sequence
:noindex:
.. autofunction:: paddle.v2.data_type.sparse_float_vector
:noindex:
.. autofunction:: paddle.v2.data_type.sparse_float_vector_sequence
:noindex:
.. autofunction:: paddle.v2.data_type.sparse_float_vector_sub_sequence
:noindex:
.. autofunction:: paddle.v2.data_type.sparse_non_value_slot
:noindex:
.. autofunction:: paddle.v2.data_type.sparse_value_slot
:noindex:
.. autoclass:: paddle.v2.data_type.InputType
:members:
:noindex:
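A small usage sketch (hypothetical; it assumes the v2 layer API and a placeholder ``dict_size``) of how these data types are typically attached to data layers:

.. code-block:: python

    import paddle.v2 as paddle

    dict_size = 1024  # hypothetical vocabulary size
    # A sequence of word ids and a single integer class label.
    words = paddle.layer.data(
        name='words',
        type=paddle.data_type.integer_value_sequence(dict_size))
    label = paddle.layer.data(
        name='label', type=paddle.data_type.integer_value(2))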
......
......@@ -134,7 +134,7 @@
**Unequal-length inputs** means that the multiple input sequences of recurrent_group can have subsequences of different lengths at each time step. When outputting a sequence, however, the output has to be declared to share the sequence information of one particular input. \ :red:`targetInlink`\ specifies which input the output sequence is consistent with; by default it is the first input.
The configurations of Example 3 are \ `the single-layer unequal-length RNN <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/gserver/tests/sequence_rnn_multi_unequalength_inputs.conf>`_\ and \ `the hierarchical unequal-length RNN <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/gserver/tests/sequence_nest_rnn_multi_unequalength_inputs.conf>`_\ .
The configurations of Example 3 are \ `the single-layer unequal-length RNN <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/gserver/tests/sequence_rnn_multi_unequalength_inputs.py>`_\ and \ `the hierarchical unequal-length RNN <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/gserver/tests/sequence_nest_rnn_multi_unequalength_inputs.py>`_\ .
In Example 3, the data for the single-layer RNN and the hierarchical RNN are exactly the same.
......
.. _algo_hrnn_rnn_api_compare:
#################################################
API comparison between RNN and hierarchical RNN
#################################################
This article takes PaddlePaddle's hierarchical RNN unit tests as an example. We will use several examples to illustrate the usage of single-layer and hierarchical RNNs. Each example has two model configurations, one for the single-layer RNN and the other for the hierarchical RNN. Although the implementations differ, the two configurations are equivalent in effect. All of the examples in this article only describe the API of the hierarchical RNN; we do not use the hierarchical RNN to solve practical problems here. If you want to understand the use of hierarchical RNNs in specific problems, please refer to \ :ref:`algo_hrnn_demo`\ . The unit test file used in this article's examples is \ `test_RecurrentGradientMachine.cpp <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/gserver/tests/test_RecurrentGradientMachine.cpp>`_\ .
Example 1: Hierarchical RNN without Memory between subsequences
================================================================
A classical use case of the hierarchical RNN is to perform sequence operations separately on each inner-layer time series. The inner-layer sequence operations are independent of each other, that is, they do not need to use Memory.
In this example, both the single-layer RNN and the hierarchical RNN configurations use an LSTM as an encoder to compress a word-segmented sentence into a vector. The difference is that the hierarchical RNN treats multiple sentences as a whole and encodes them simultaneously. The two are completely consistent in their semantics. This pair of semantically identical example configurations is as follows:
* RNN\: `sequence_layer_group.conf <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/gserver/tests/sequence_layer_group.conf>`_
* Hierarchical RNN\: `sequence_nest_layer_group.conf <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/gserver/tests/sequence_nest_layer_group.conf>`_
Reading hierarchical sequence data
----------------------------------
Firstly, the original data in this example is as follows \:
- The original data in this example has 10 samples. Each sample includes two components: a label (all 2 here) and a word-segmented sentence. This data is also used by the single-layer RNN.
.. literalinclude:: ../../../../paddle/gserver/tests/Sequence/tour_train_wdseg
:language: text
- The data for the hierarchical RNN has 4 samples. The samples are separated by blank lines, while the content is the same as the original data. For the hierarchical LSTM, the first sample encodes two sentences into two vectors simultaneously. The numbers of sentences processed simultaneously by these 4 samples are \ :code:`[2, 3, 2, 3]`\ .
.. literalinclude:: ../../../../paddle/gserver/tests/Sequence/tour_train_wdseg.nest
:language: text
Secondly, for these two different input data formats, the corresponding DataProviders are contrasted as follows (`sequenceGen.py <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/gserver/tests/sequenceGen.py>`_)\:
.. literalinclude:: ../../../../paddle/gserver/tests/sequenceGen.py
:language: python
:lines: 21-39
:linenos:
- This is the DataProvider code for an ordinary single-layer time series. Its description is as follows:
* The DataProvider returns two parts, "words" and "label", as shown at line 19 in the code above.
- "words" is a list of word table indices corresponding to each word of the sentence in the original data. Its data type is integer_value_sequence, i.e. a list of integers, so "words" is a single-layer time series in the data.
- "label" is the categorical label of each sentence, whose data type is integer_value.
.. literalinclude:: ../../../../paddle/gserver/tests/sequenceGen.py
:language: python
:lines: 42-71
:linenos:
- For the same data, this is the DataProvider code for the hierarchical time series. Its description is as follows:
- The DataProvider returns two lists of data, "sentences" and "labels", corresponding to the sentences and labels of each group in the original hierarchical time series data.
- "sentences" comes from the original hierarchical time series data. It contains every sentence of each group, and each sentence is represented by a list of word table indices, so its data type is integer_value_sub_sequence, i.e. a hierarchical time series.
- "labels" is the categorical label of each sentence, so it is a single-layer time series.
Model configuration
------------------------------------------
Firstly, let's look at the single-layer RNN configuration. The highlighted part from line 9 to line 15 is the usage of the single-layer RNN. Here we use the pre-defined RNN step function of PaddlePaddle; in this function, the RNN passes through an LSTM network at each time step.
.. literalinclude:: ../../../../paddle/gserver/tests/sequence_layer_group.conf
:language: python
:lines: 38-63
:linenos:
:emphasize-lines: 9-15
Secondly, let's look at the hierarchical RNN configuration that has the same semantics\:
* Most layers in PaddlePaddle do not care about whether the input is time series or not, e.g. \ :code:`embedding_layer`\ . In these layers, every operation is processed on each time step.
* In the highlighted part from line 7 to line 26 of this configuration, we transform the hierarchical time series data into single-layer time series data, and then process each single-layer time series.
* The transformation is done with the function \ :code:`recurrent_group`\ . Input sequences need to be passed in when transforming. As we want to transform the hierarchical time series into single-layer sequences, we need to label the input data as \ :code:`SubsequenceInput`\ .
* In this example, we disassemble every group of the original data into sentences using \ :code:`recurrent_group`\ . Each of the disassembled sentences passes through an LSTM network. This is equivalent to the single-layer RNN configuration.
* Similar to the single-layer RNN configuration, we only use the last vector of the LSTM encoding. So we apply the \ :code:`last_seq`\ operation to the output of \ :code:`recurrent_group`\ . But unlike the single-layer RNN, we use the last element of every subsequence, so we need to set \ :code:`agg_level=AggregateLevel.TO_SEQUENCE`\ .
* At this point, \ :code:`lstm_last`\ has the same result as \ :code:`lstm_last`\ in the single-layer RNN configuration.
.. literalinclude:: ../../../../paddle/gserver/tests/sequence_nest_layer_group.conf
:language: python
:lines: 38-64
:linenos:
:emphasize-lines: 7-26
Example 2: Hierarchical RNN with Memory between subsequences
=============================================================
This example is intended to implement two fully-equivalent fully-connected RNNs using single-layer RNN and hierarchical RNN.
* For the single-layer RNN, the input is a full time series, e.g. \ :code:`[4, 5, 2, 0, 9, 8, 1, 4]`\ .
* For the hierarchical RNN, the input is a hierarchical time series whose elements are an arbitrary grouping of the single-layer RNN's data, e.g. \ :code:`[ [4, 5, 2], [0, 9], [8, 1, 4]]`.
model configuration
-------------------
We select the parts that differ between the single-layer RNN and hierarchical RNN configurations, and compare them to analyze why they have the same semantics.
- single-layer RNN: passes through a simple recurrent_group. At each time step, the current input y and the previous time step's output rnn_state pass through a fully-connected layer.
.. literalinclude:: ../../../../paddle/gserver/tests/sequence_rnn.conf
:language: python
:lines: 36-48
- hierarchical RNN, in which the outer layer's memory is a single element.
- The recurrent_group of the inner layer's inner_step is nearly the same as in the single-layer sequence configuration, except for boot_layer=outer_mem, which uses the outer layer's outer_mem as the initial state of the inner layer's memory. In the outer layer's out_step, outer_mem is the last vector of a subsequence, that is, the whole hierarchical group uses the last vector of the previous subsequence as the initial state of the next subsequence's memory.
- From the aspect of the input data, the sentences for the single-layer and hierarchical RNNs are the same. The only difference is that the hierarchical RNN disassembles the sequence into subsequences. So in the hierarchical RNN configuration, we must use the last element of the previous subsequence as the boot_layer for the memory of the next subsequence, making it equivalent to "every time step uses the output of the last time step" in the single-layer RNN configuration.
.. literalinclude:: ../../../../paddle/gserver/tests/sequence_nest_rnn.conf
:language: python
:lines: 39-66
.. warning::
Currently PaddlePaddle only supports the case that the lengths of the time series of Memory in each time step are the same.
Example 3: Hierarchical RNN with unequal-length inputs
=======================================================
.. role:: red
.. raw:: html
<style> .red {color:red} </style>
**Unequal-length inputs** means that, among the multiple input sequences of recurrent_group, the lengths of the subsequences can differ. The output sequence, however, needs to be consistent with one of the input sequences. \ :red:`targetInlink`\ specifies which input sequence the output sequence is consistent with; by default it is the first input.
The configurations of Example 3 are \ `sequence_rnn_multi_unequalength_inputs <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/gserver/tests/sequence_rnn_multi_unequalength_inputs.py>`_ \ and \ `sequence_nest_rnn_multi_unequalength_inputs <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/gserver/tests/sequence_nest_rnn_multi_unequalength_inputs.py>`_\ .
The data for the configurations of Example 3's single-layer RNN and hierarchical RNN are exactly the same.
* For the single-layer RNN, the data has two samples, which are \ :code:`[1, 2, 4, 5, 2], [5, 4, 1, 3, 1]`\ and \ :code:`[0, 2, 2, 5, 0, 1, 2], [1, 5, 4, 2, 3, 6, 1]`\ . Each sample of the single-layer RNN data has two groups of features.
* On the basis of the single-layer data, the hierarchical RNN's data randomly adds some partitions. For example, the first sample is transformed to \ :code:`[[0, 2], [2, 5], [0, 1, 2]],[[1, 5], [4], [2, 3, 6, 1]]`\ .
* Note that PaddlePaddle currently only supports multi-input hierarchical RNNs whose inputs have the same number of subsequences. In this example, both features have 3 subsequences. Although the length of each subsequence can differ, the number of subsequences must be the same.
model configuration
-------------------
Similar to Example 2's configuration, Example 3's configuration uses single-layer and hierarchical RNN to implement 2 fully-equivalent fully-connected RNNs.
* single-layer RNN\:
.. literalinclude:: ../../../../paddle/gserver/tests/sequence_rnn_multi_unequalength_inputs.py
:language: python
:lines: 42-59
:linenos:
* hierarchical RNN\ \:
.. literalinclude:: ../../../../paddle/gserver/tests/sequence_nest_rnn_multi_unequalength_inputs.py
:language: python
:lines: 41-80
:linenos:
In the code above, the usage of the single-layer and hierarchical RNNs is similar to Example 2; the difference is that 2 inputs are processed simultaneously. For the hierarchical RNN, the subsequence lengths of the 2 inputs are not equal, but we use the parameter \ :code:`targetInlink` \ to set the output format of the outer layer's \ :code:`recurrent_group` \ , so the shape of the outer layer's output is the same as the shape of \ :code:`emb2`\ .
Glossary
========
.. _glossary_memory:
Memory
------
Memory is a concept used when PaddlePaddle implements an RNN. An RNN (recurrent neural network) usually requires some dependency between time steps, that is, the neural network at the current time step depends on one of the neurons of the neural network at a previous time step, as the following figure shows:
.. graphviz:: src/glossary_rnn.dot
The dotted connections in the figure are the network connections across time steps. When PaddlePaddle implements an RNN, this connection across time steps is realized by a special neural network unit called Memory. Memory caches the output of one of the neurons in the previous time step and passes it to another neuron in the next time step. An RNN implemented using Memory looks as follows:
.. graphviz:: src/glossary_rnn_with_memory.dot
With this method, PaddlePaddle can easily determine which outputs should cross time steps, and which should not.
.. _glossary_timestep:
time step
---------
refers to time series
.. _glossary_sequence:
time series
-----------
A time series is a series of feature data whose order is meaningful; it is a list of features, not a set of features. Each element of this list, i.e. each feature datum in the series, is called a time step. It must be noted that the concepts of time series and time step are not necessarily related to "time": as long as the "order" of a series of feature data is meaningful, it can serve as the input of a time series.
For example, in a text classification task, we regard a sentence as a time series. Each word in the sentence is mapped to its index in the word table, so the sentence can be represented as a list of these indices, e.g. :code:`[9, 2, 3, 5, 3]` .
For a more detailed and accurate definition of the time series, please refer to `Wikipedia of Time series <https://en.wikipedia.org/wiki/Time_series>`_ or `Chinese Wikipedia of time series <https://zh.wikipedia.org/wiki/%E6%99%82%E9%96%93%E5%BA%8F%E5%88%97>`_ .
In addition, Paddle also refers to a time series as a :code:`Sequence` . They are the same concept in Paddle's documentation and APIs.
.. _glossary_RNN:
RNN
---
In PaddlePaddle's documentations, RNN is usually represented as :code:`Recurrent neural network` . For more information, please refer to `Wikipedia Recurrent neural network <https://en.wikipedia.org/wiki/Recurrent_neural_network>`_ or `Chinese Wikipedia <https://zh.wikipedia.org/wiki/%E9%80%92%E5%BD%92%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C>`_ .
In PaddlePaddle, RNN usually means that, for time series input data, the neural networks at different time steps are related to each other. For example, the input of a certain neuron is the output of a certain neuron of the network at the previous time step. In other words, viewed per time step, the network structure contains a directed cycle.
.. _glossary_hierarchical_RNN:
hierarchical RNN
----------------
Hierarchical RNN, as the name suggests, means there is a nested relationship among RNNs. The input data is a time series, but each of its inner elements is also a time series, namely a 2-dimensional array, or an array of arrays. A hierarchical RNN is a neural network that can process this type of input data.
For example, consider the task of classifying a paragraph of sentences. We can treat the paragraph as an array of sentences, and each sentence as an array of words; this is one type of input data for a hierarchical RNN. We encode each sentence of the paragraph into a vector using an LSTM, then encode these sentence vectors into a paragraph vector using an LSTM, and finally use this paragraph vector to perform the classification. This is the neural network structure of this hierarchical RNN.
TBD
......@@ -102,7 +102,7 @@ cc_test(init_test SRCS init_test.cc DEPS init)
cc_test(op_kernel_type_test SRCS op_kernel_type_test.cc DEPS place device_context framework_proto)
cc_test(cow_ptr_tests SRCS details/cow_ptr_test.cc)
cc_test(channel_test SRCS channel_test.cc)
# cc_test(channel_test SRCS channel_test.cc)
cc_test(tuple_test SRCS tuple_test.cc )
cc_test(concurrency_test SRCS concurrency_test.cc DEPS go_op channel_close_op channel_create_op
channel_send_op channel_recv_op sum_op select_op elementwise_add_op compare_op
......
......@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <thread>
#include <thread> // NOLINT
#include "gtest/gtest.h"
#include "paddle/fluid/framework/block_desc.h"
......@@ -40,10 +40,10 @@ namespace paddle {
namespace framework {
template <typename T>
LoDTensor *CreateVariable(Scope &scope, p::CPUPlace &place, std::string name,
T value) {
LoDTensor *CreateVariable(Scope *scope, const p::CPUPlace &place,
std::string name, T value) {
// Create LoDTensor<int> of dim [1]
auto var = scope.Var(name);
auto var = scope->Var(name);
auto tensor = var->GetMutable<LoDTensor>();
tensor->Resize({1});
T *expect = tensor->mutable_data<T>(place);
......@@ -77,9 +77,9 @@ void AddCase(ProgramDesc *program, Scope *scope, p::CPUPlace *place,
BlockDesc *caseBlock = program->AppendBlock(*casesBlock);
func(caseBlock, scope);
CreateVariable(*scope, *place, caseCondName, false);
CreateVariable(*scope, *place, caseCondXVarName, caseId);
CreateVariable(*scope, *place, caseVarName, caseId);
CreateVariable(scope, *place, caseCondName, false);
CreateVariable(scope, *place, caseCondXVarName, caseId);
CreateVariable(scope, *place, caseVarName, caseId);
scope->Var("step_scope");
......@@ -96,21 +96,21 @@ void AddFibonacciSelect(Scope *scope, p::CPUPlace *place, ProgramDesc *program,
std::string quitChanName) {
BlockDesc *whileBlock = program->AppendBlock(*parentBlock);
CreateVariable(*scope, *place, "whileExitCond", true);
CreateVariable(*scope, *place, "caseToExecute", -1);
CreateVariable(*scope, *place, "case1var", 0);
CreateVariable(scope, *place, "whileExitCond", true);
CreateVariable(scope, *place, "caseToExecute", -1);
CreateVariable(scope, *place, "case1var", 0);
CreateVariable(*scope, *place, "xtemp", 0);
CreateVariable(scope, *place, "xtemp", 0);
// TODO(thuan): Need to create fibXToSend, since channel send moves the actual
// data,
// which causes the data to be no longer accessible to do the fib calculation
// TODO(abhinav): Change channel send to do a copy instead of a move!
CreateVariable(*scope, *place, "fibXToSend", 0);
CreateVariable(scope, *place, "fibXToSend", 0);
CreateVariable(*scope, *place, "fibX", 0);
CreateVariable(*scope, *place, "fibY", 1);
CreateVariable(*scope, *place, "quitVar", 0);
CreateVariable(scope, *place, "fibX", 0);
CreateVariable(scope, *place, "fibY", 1);
CreateVariable(scope, *place, "quitVar", 0);
BlockDesc *casesBlock = program->AppendBlock(*whileBlock);
std::function<void(BlockDesc * caseBlock)> f = [](BlockDesc *caseBlock) {};
......@@ -138,7 +138,7 @@ void AddFibonacciSelect(Scope *scope, p::CPUPlace *place, ProgramDesc *program,
// Exit the while loop after we receive from quit channel.
// We assign a false to "whileExitCond" variable, which will
// break out of while_op loop
CreateVariable(*scope, *place, "whileFalse", false);
CreateVariable(scope, *place, "whileFalse", false);
AddOp("assign", {{"X", {"whileFalse"}}}, {{"Out", {"whileExitCond"}}}, {},
caseBlock);
};
......@@ -174,9 +174,9 @@ TEST(Concurrency, Go_Op) {
// Create Variables, x0 will be put into channel,
// result will be pulled from channel
CreateVariable(scope, place, "Status", false);
CreateVariable(scope, place, "x0", 99);
CreateVariable(scope, place, "result", 0);
CreateVariable(&scope, place, "Status", false);
CreateVariable(&scope, place, "x0", 99);
CreateVariable(&scope, place, "result", 0);
framework::Executor executor(place);
ProgramDesc program;
......@@ -226,9 +226,9 @@ TEST(Concurrency, Select) {
// Initialize scope variables
p::CPUDeviceContext ctx(place);
CreateVariable(scope, place, "Status", false);
CreateVariable(scope, place, "result", 0);
CreateVariable(scope, place, "currentXFib", 0);
CreateVariable(&scope, place, "Status", false);
CreateVariable(&scope, place, "result", 0);
CreateVariable(&scope, place, "currentXFib", 0);
framework::Executor executor(place);
ProgramDesc program;
......@@ -246,7 +246,7 @@ TEST(Concurrency, Select) {
{{"capacity", 0}, {"data_type", f::proto::VarType::LOD_TENSOR}}, block);
// Create Go Op routine, which loops 10 times over fibonacci sequence
CreateVariable(scope, place, "xReceiveVar", 0);
CreateVariable(&scope, place, "xReceiveVar", 0);
BlockDesc *goOpBlock = program.AppendBlock(program.Block(0));
for (int i = 0; i < 10; ++i) {
......@@ -264,7 +264,7 @@ TEST(Concurrency, Select) {
goOpBlock);
}
CreateVariable(scope, place, "quitSignal", 0);
CreateVariable(&scope, place, "quitSignal", 0);
AddOp("channel_send", {{"Channel", {quitChanName}}, {"X", {"quitSignal"}}},
{{"Status", {"Status"}}}, {}, goOpBlock);
......
......@@ -16,6 +16,7 @@ limitations under the License. */
#include <cctype>
#include <ostream>
#include <string>
#include "paddle/fluid/platform/enforce.h"
......
......@@ -13,6 +13,7 @@
// limitations under the License.
#include "paddle/fluid/framework/data_layout_transform.h"
#include <vector>
#include "paddle/fluid/operators/math/math_function.h"
......
......@@ -14,6 +14,7 @@
#pragma once
#include <vector>
#include "paddle/fluid/framework/op_kernel_type.h"
#include "paddle/fluid/framework/tensor.h"
#include "paddle/fluid/framework/variable.h"
......
......@@ -14,6 +14,7 @@ limitations under the License. */
#pragma once
#include <utility>
#include "paddle/fluid/framework/op_kernel_type.h"
#include "paddle/fluid/framework/tensor.h"
#include "paddle/fluid/framework/variable.h"
......
......@@ -29,9 +29,7 @@ namespace framework {
namespace details {
struct BroadcastOpHandle : public OpHandleBase {
const std::vector<Scope *> &local_scopes_;
const std::vector<platform::Place> &places_;
public:
BroadcastOpHandle(const std::vector<Scope *> &local_scopes,
const std::vector<platform::Place> &places);
......@@ -41,6 +39,10 @@ struct BroadcastOpHandle : public OpHandleBase {
protected:
void RunImpl() override;
private:
const std::vector<Scope *> &local_scopes_;
const std::vector<platform::Place> &places_;
};
} // namespace details
......
......@@ -90,7 +90,7 @@ struct TestBroadcastOpHandle {
op_handle_->AddInput(dummy_var_handle);
for (size_t j = 0; j < gpu_list_.size(); ++j) {
op_handle_->dev_ctxes_[gpu_list_[j]] = ctxs_[j].get();
op_handle_->SetDeviceContext(gpu_list_[j], ctxs_[j].get());
VarHandle* out_var_handle = new VarHandle(2, j, "out", gpu_list_[j]);
vars_.emplace_back(out_var_handle);
op_handle_->AddOutput(out_var_handle);
......
......@@ -28,8 +28,8 @@ ComputationOpHandle::ComputationOpHandle(const OpDesc &op_desc, Scope *scope,
void ComputationOpHandle::RunImpl() {
auto *cur_ctx = dev_ctxes_[place_];
for (auto *in : inputs_) {
bool need_wait =
in->generated_op_ && in->generated_op_->dev_ctxes_[place_] != cur_ctx;
bool need_wait = in->generated_op_ &&
in->generated_op_->DeviceContext(place_) != cur_ctx;
if (need_wait) {
in->generated_op_->Wait(cur_ctx);
}
......
......@@ -14,6 +14,9 @@
#pragma once
#include <string>
#include <vector>
#include "paddle/fluid/framework/details/op_handle_base.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
......@@ -24,10 +27,7 @@ namespace paddle {
namespace framework {
namespace details {
struct ComputationOpHandle : public OpHandleBase {
std::unique_ptr<OperatorBase> op_;
Scope *scope_;
platform::Place place_;
public:
ComputationOpHandle(const OpDesc &op_desc, Scope *scope,
platform::Place place);
......@@ -35,6 +35,11 @@ struct ComputationOpHandle : public OpHandleBase {
protected:
void RunImpl() override;
private:
std::unique_ptr<OperatorBase> op_;
Scope *scope_;
platform::Place place_;
};
} // namespace details
} // namespace framework
......
......@@ -51,23 +51,23 @@ void FetchOpHandle::RunImpl() {
auto *var = static_cast<VarHandle *>(input);
var->generated_op_->Wait(cpu_ctx);
}
tensors_.resize(inputs_.size());
auto *var = static_cast<VarHandle *>(inputs_[0]);
auto &var_name = var->name_;
auto *var_handle = static_cast<VarHandle *>(inputs_[0]);
auto &var_name = var_handle->name_;
platform::CPUPlace cpu;
auto &scopes = *local_scopes_;
for (size_t i = 0; i < scopes.size(); ++i) {
auto &scope = scopes[i];
auto &t = scope->FindVar(kLocalExecScopeName)
->Get<Scope *>()
->FindVar(var_name)
->Get<framework::LoDTensor>();
if (platform::is_gpu_place(var->place_)) {
auto *var =
scope->FindVar(kLocalExecScopeName)->Get<Scope *>()->FindVar(var_name);
PADDLE_ENFORCE_NOT_NULL(var, "Cannot find variable %s in execution scope",
var_name);
auto &t = var->Get<framework::LoDTensor>();
if (platform::is_gpu_place(t.place())) {
#ifdef PADDLE_WITH_CUDA
TensorCopy(t, cpu, *dev_ctxes_[t.place()], &tensors_[i]);
dev_ctxes_[t.place()]->Wait();
dev_ctxes_.at(t.place())->Wait();
#endif
} else {
tensors_[i].ShareDataWith(t);
......
......@@ -14,6 +14,9 @@
#pragma once
#include <string>
#include <vector>
#include "paddle/fluid/framework/details/op_handle_base.h"
#include "paddle/fluid/framework/feed_fetch_type.h"
#include "paddle/fluid/framework/scope.h"
......@@ -24,11 +27,7 @@ namespace framework {
namespace details {
struct FetchOpHandle : public OpHandleBase {
FeedFetchList *data_;
size_t offset_;
std::vector<Scope *> *local_scopes_;
std::vector<LoDTensor> tensors_;
public:
FetchOpHandle(FeedFetchList *data, size_t offset,
std::vector<Scope *> *local_scopes);
......@@ -42,6 +41,12 @@ struct FetchOpHandle : public OpHandleBase {
protected:
void RunImpl() override;
private:
FeedFetchList *data_;
size_t offset_;
std::vector<Scope *> *local_scopes_;
std::vector<LoDTensor> tensors_;
};
} // namespace details
......
......@@ -29,9 +29,7 @@ namespace framework {
namespace details {
struct GatherOpHandle : public OpHandleBase {
const std::vector<Scope *> &local_scopes_;
const std::vector<platform::Place> &places_;
public:
GatherOpHandle(const std::vector<Scope *> &local_scopes,
const std::vector<platform::Place> &places);
......@@ -41,6 +39,10 @@ struct GatherOpHandle : public OpHandleBase {
protected:
void RunImpl() override;
private:
const std::vector<Scope *> &local_scopes_;
const std::vector<platform::Place> &places_;
};
} // namespace details
......
......@@ -78,7 +78,7 @@ struct TestGatherOpHandle {
op_handle_.reset(new GatherOpHandle(local_scopes_, gpu_list_));
// add input
for (size_t j = 0; j < gpu_list_.size(); ++j) {
op_handle_->dev_ctxes_[gpu_list_[j]] = ctxs_[j].get();
op_handle_->SetDeviceContext(gpu_list_[j], ctxs_[j].get());
auto* in_var_handle = new VarHandle(1, j, "input", gpu_list_[j]);
vars_.emplace_back(in_var_handle);
op_handle_->AddInput(in_var_handle);
......
......@@ -60,7 +60,8 @@ void MultiDevSSAGraphBuilder::CreateOpHandleIOs(SSAGraph *result,
const platform::Place &p,
const size_t &i) const {
auto *op_handle = result->ops_.back().get();
op_handle->dev_ctxes_[p] = platform::DeviceContextPool::Instance().Get(p);
op_handle->SetDeviceContext(p,
platform::DeviceContextPool::Instance().Get(p));
auto var_names = op.InputArgumentNames();
......@@ -89,101 +90,25 @@ std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
bool is_forwarding = true;
for (auto *op : program.Block(0).AllOps()) {
bool change_forward = false;
if (!is_forwarding) {
// FIXME(yy): Do not hard code like this
if (op->OutputArgumentNames().size() == 1 &&
op->OutputArgumentNames()[0] == GradVarName(loss_var_name_)) {
continue; // Drop fill 1. for backward coeff;
}
}
// append send op if program is distributed trainer main program.
// always use the first device
if (!is_forwarding && op->Type() == "send") {
auto &p = places_[0];
auto *s = local_scopes_[0];
// FIXME(wuyi): send op always copy from GPU 0
result.ops_.emplace_back(new SendOpHandle(*op, s, p));
// Create inputs for output on original place and no ssa output
// is created for send op.
CreateOpHandleIOs(&result, *op, p, 0);
continue;
}
for (size_t i = 0; i < places_.size(); ++i) {
auto &p = places_[i];
auto *s = local_scopes_[i];
result.ops_.emplace_back(new ComputationOpHandle(*op, s, p));
auto *op_handle = result.ops_.back().get();
CreateOpHandleIOs(&result, *op, p, i);
auto var_names = op->OutputArgumentNames();
if (is_forwarding) {
if (var_names.size() == 1 && var_names[0] == loss_var_name_) {
// Insert ScaleCost OpHandle
#ifdef PADDLE_WITH_CUDA
auto *communication_dev_ctx = nccl_ctxs_->DevCtx(p);
#else
auto *communication_dev_ctx =
platform::DeviceContextPool::Instance().Get(platform::CPUPlace());
#endif
op_handle = new ScaleLossGradOpHandle(local_scopes_.size(), s, p,
communication_dev_ctx);
result.ops_.emplace_back(op_handle);
// FIXME: Currently ScaleLossGradOp only use device_count as scale
// factor. So it does not depend on any other operators.
// VarHandle *loss = GetVarHandle(loss_var_name, place);
// loss->pending_ops_.emplace_back(op_handle);
// op_handle->inputs_.emplace_back(loss);
CreateOpOutput(&result, op_handle, GradVarName(loss_var_name_), p, i);
change_forward = true;
}
}
}
if (change_forward) {
if (op->Type() == "send") {
// append send op if program is distributed trainer main program.
// always use the first device
CreateSendOp(&result, *op);
} else if (IsScaleLossOp(*op)) {
CreateScaleLossGradOp(&result);
is_forwarding = false;
}
if (!is_forwarding) {
auto var_names = op->OutputArgumentNames();
// Currently, we assume that once gradient is generated, it can be
// broadcast, and each gradient is only broadcast once. But there are no
// other cases, for example, we need to adjust the gradient according to
// the input when we get the gradient, which is not considered at present.
for (auto &og : var_names) {
if (grad_names_.count(og) != 0 &&
og_has_been_broadcast.count(og) == 0) { // is param grad
// Insert NCCL AllReduce Op
og_has_been_broadcast.insert(og);
#ifdef PADDLE_WITH_CUDA
result.ops_.emplace_back(
new NCCLAllReduceOpHandle(local_scopes_, places_, *nccl_ctxs_));
auto *op_handle = result.ops_.back().get();
for (size_t i = 0; i < places_.size(); ++i) {
auto &p = places_[i];
auto &vars = result.vars_[i][og];
if (vars.empty()) { // This device has no data. continue.
continue;
}
auto &prev_grad = vars[vars.size() - 1];
op_handle->AddInput(prev_grad.get());
auto var = new VarHandle(vars.size() - 1, i, og, p);
vars.emplace_back(var);
op_handle->AddOutput(var);
} else {
CreateComputationalOps(&result, *op);
if (!is_forwarding) {
// Currently, we assume that once gradient is generated, it can be
// broadcast, and each gradient is only broadcast once. But there are no
// other cases, for example, we need to adjust the gradient according to
// the input when we get the gradient, which is not considered at
// present.
for (auto &og : op->OutputArgumentNames()) {
if (IsParameterGradientOnce(og, &og_has_been_broadcast)) {
InsertNCCLAllReduceOp(&result, og);
}
#else
PADDLE_ENFORCE("Not implemented");
#endif
}
}
}
......@@ -207,7 +132,95 @@ std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
}
return std::unique_ptr<SSAGraph>(graph);
} // namespace details
}
void MultiDevSSAGraphBuilder::InsertNCCLAllReduceOp(
SSAGraph *result, const std::string &og) const {
#ifdef PADDLE_WITH_CUDA
result->ops_.emplace_back(
new NCCLAllReduceOpHandle(local_scopes_, places_, *nccl_ctxs_));
auto *op_handle = result->ops_.back().get();
for (size_t i = 0; i < places_.size(); ++i) {
auto &p = places_[i];
auto &vars = result->vars_[i][og];
PADDLE_ENFORCE(!vars.empty());
auto &prev_grad = vars.back();
op_handle->AddInput(prev_grad.get());
auto var = new VarHandle(vars.size() - 1, i, og, p);
vars.emplace_back(var);
op_handle->AddOutput(var);
}
#else
PADDLE_ENFORCE("Not implemented");
#endif
}
bool MultiDevSSAGraphBuilder::IsParameterGradientOnce(
const std::string &og,
std::unordered_set<std::string> *og_has_been_broadcast) const {
bool is_pg_once =
grad_names_.count(og) != 0 && og_has_been_broadcast->count(og) == 0;
if (is_pg_once) {
// Insert NCCL AllReduce Op
og_has_been_broadcast->insert(og);
}
return is_pg_once;
}
void MultiDevSSAGraphBuilder::CreateScaleLossGradOp(SSAGraph *result) const {
for (size_t i = 0; i < places_.size(); ++i) {
// Insert ScaleCost OpHandle
#ifdef PADDLE_WITH_CUDA
auto *communication_dev_ctx = nccl_ctxs_->DevCtx(places_[i]);
#else
auto *communication_dev_ctx =
platform::DeviceContextPool::Instance().Get(platform::CPUPlace());
#endif
auto *op_handle =
new ScaleLossGradOpHandle(local_scopes_.size(), local_scopes_[i],
places_[i], communication_dev_ctx);
result->ops_.emplace_back(op_handle);
// FIXME: Currently ScaleLossGradOp only use device_count as scale
// factor. So it does not depend on any other operators.
// VarHandle *loss = GetVarHandle(loss_var_name, place);
// loss->pending_ops_.emplace_back(op_handle);
// op_handle->inputs_.emplace_back(loss);
CreateOpOutput(result, op_handle, GradVarName(loss_var_name_), places_[i],
i);
}
}
void MultiDevSSAGraphBuilder::CreateComputationalOps(SSAGraph *result,
const OpDesc &op) const {
for (size_t scope_idx = 0; scope_idx < places_.size(); ++scope_idx) {
auto p = places_[scope_idx];
auto s = local_scopes_[scope_idx];
result->ops_.emplace_back(new ComputationOpHandle(op, s, p));
CreateOpHandleIOs(result, op, p, scope_idx);
}
}
void MultiDevSSAGraphBuilder::CreateSendOp(SSAGraph *result,
const OpDesc &op) const {
auto &p = places_[0];
auto *s = local_scopes_[0];
// FIXME(wuyi): send op always copy from GPU 0
result->ops_.emplace_back(new SendOpHandle(op, s, p));
// Create inputs for output on original place and no ssa output
// is created for send op.
CreateOpHandleIOs(result, op, p, 0);
}
bool MultiDevSSAGraphBuilder::IsScaleLossOp(const OpDesc &op) const {
// FIXME(yy): Do not hard code like this
return op.OutputArgumentNames().size() == 1 &&
op.OutputArgumentNames()[0] == GradVarName(loss_var_name_);
}
} // namespace details
} // namespace framework
} // namespace paddle
......@@ -57,6 +57,20 @@ class MultiDevSSAGraphBuilder : public SSAGraphBuilder {
#ifdef PADDLE_WITH_CUDA
platform::NCCLContextMap *nccl_ctxs_;
#endif
bool IsScaleLossOp(const OpDesc &op) const;
void CreateSendOp(SSAGraph *result, const OpDesc &op) const;
void CreateComputationalOps(SSAGraph *result, const OpDesc &op) const;
void CreateScaleLossGradOp(SSAGraph *result) const;
bool IsParameterGradientOnce(
const std::string &og,
std::unordered_set<std::string> *og_has_been_broadcast) const;
void InsertNCCLAllReduceOp(SSAGraph *result, const std::string &og) const;
};
} // namespace details
} // namespace framework
......
......@@ -73,8 +73,9 @@ void NCCLAllReduceOpHandle::RunImpl() {
for (size_t i = 0; i < local_scopes_.size(); ++i) {
auto *s = local_scopes_[i];
auto &local_scope = *s->FindVar(kLocalExecScopeName)->Get<Scope *>();
auto &lod_tensor = s->FindVar(var_name)->Get<LoDTensor>();
auto &lod_tensor = local_scope.FindVar(var_name)->Get<LoDTensor>();
lod_tensors.emplace_back(lod_tensor);
}
......@@ -110,17 +111,21 @@ void NCCLAllReduceOpHandle::RunImpl() {
}
});
} else { // Special handle CPU only Operator's gradient. Like CRF
auto &trg =
*this->local_scopes_[0]->Var()->GetMutable<framework::LoDTensor>();
auto &trg = *this->local_scopes_[0]
->FindVar(kLocalExecScopeName)
->Get<Scope *>()
->Var()
->GetMutable<framework::LoDTensor>();
// Reduce All Tensor to trg in CPU
ReduceLoDTensor func(lod_tensors, &trg);
VisitDataType(ToDataType(lod_tensors[0].type()), func);
for (size_t i = 0; i < local_scopes_.size(); ++i) {
auto &scope = local_scopes_[i];
auto &scope =
*local_scopes_[i]->FindVar(kLocalExecScopeName)->Get<Scope *>();
auto &p = places_[i];
auto *var = scope->FindVar(var_name);
auto *var = scope.FindVar(var_name);
auto *dev_ctx = dev_ctxes_[p];
RunAndRecordEvent(p, [&trg, var, dev_ctx, p] {
......
......@@ -27,10 +27,6 @@ namespace framework {
namespace details {
struct NCCLAllReduceOpHandle : public OpHandleBase {
const std::vector<Scope *> &local_scopes_;
const std::vector<platform::Place> &places_;
const platform::NCCLContextMap &nccl_ctxs_;
NCCLAllReduceOpHandle(const std::vector<Scope *> &local_scopes,
const std::vector<platform::Place> &places,
const platform::NCCLContextMap &ctxs);
......@@ -43,6 +39,11 @@ struct NCCLAllReduceOpHandle : public OpHandleBase {
protected:
void RunImpl() override;
private:
const std::vector<Scope *> &local_scopes_;
const std::vector<platform::Place> &places_;
const platform::NCCLContextMap &nccl_ctxs_;
};
} // namespace details
......
......@@ -27,28 +27,15 @@ namespace details {
constexpr char kLocalExecScopeName[] = "@LCOAL_SCOPE@";
class OpHandleBase {
private:
DISABLE_COPY_AND_ASSIGN(OpHandleBase);
public:
std::vector<VarHandleBase *> inputs_;
std::vector<VarHandleBase *> outputs_;
std::unordered_map<platform::Place, platform::DeviceContext *,
platform::PlaceHash>
dev_ctxes_;
#ifdef PADDLE_WITH_CUDA
std::unordered_map<int, cudaEvent_t> events_;
#endif
OpHandleBase() {}
virtual ~OpHandleBase();
std::string DebugString() const;
virtual std::string Name() const = 0;
virtual ~OpHandleBase();
void Run(bool use_event);
virtual void Wait(platform::DeviceContext *waited_dev);
......@@ -61,6 +48,18 @@ class OpHandleBase {
// will likely block other computations.
virtual bool IsMultiDeviceTransfer() { return false; }
const platform::DeviceContext *DeviceContext(platform::Place place) {
return dev_ctxes_[place];
}
void SetDeviceContext(platform::Place place, platform::DeviceContext *ctx_) {
dev_ctxes_[place] = ctx_;
}
const std::vector<VarHandleBase *> &Inputs() const { return inputs_; }
const std::vector<VarHandleBase *> &Outputs() const { return outputs_; }
protected:
void RunAndRecordEvent(const std::function<void()> &callback);
......@@ -68,6 +67,18 @@ class OpHandleBase {
const std::function<void()> &callback);
virtual void RunImpl() = 0;
std::vector<VarHandleBase *> inputs_;
std::vector<VarHandleBase *> outputs_;
std::unordered_map<platform::Place, platform::DeviceContext *,
platform::PlaceHash>
dev_ctxes_;
#ifdef PADDLE_WITH_CUDA
std::unordered_map<int, cudaEvent_t> events_;
#endif
DISABLE_COPY_AND_ASSIGN(OpHandleBase);
};
} // namespace details
......
......@@ -30,10 +30,11 @@ ScaleLossGradOpHandle::~ScaleLossGradOpHandle() {}
void ScaleLossGradOpHandle::RunImpl() {
std::string var_name = static_cast<VarHandle *>(this->outputs_[0])->name_;
auto &local_scope = *scope_->FindVar(kLocalExecScopeName)->Get<Scope *>();
float *tmp =
scope_->FindVar(var_name)->GetMutable<LoDTensor>()->mutable_data<float>(
make_ddim({1}), place_);
float *tmp = local_scope.FindVar(var_name)
->GetMutable<LoDTensor>()
->mutable_data<float>(make_ddim({1}), place_);
if (platform::is_cpu_place(place_)) {
*tmp = coeff_;
......
......@@ -14,6 +14,8 @@
#pragma once
#include <string>
#include "paddle/fluid/framework/details/op_handle_base.h"
#include "paddle/fluid/framework/lod_tensor.h"
#include "paddle/fluid/framework/scope.h"
......@@ -23,10 +25,6 @@ namespace framework {
namespace details {
struct ScaleLossGradOpHandle : public OpHandleBase {
float coeff_;
Scope *scope_;
platform::Place place_;
ScaleLossGradOpHandle(size_t num_dev, Scope *scope, platform::Place place,
platform::DeviceContext *context);
......@@ -36,6 +34,11 @@ struct ScaleLossGradOpHandle : public OpHandleBase {
protected:
void RunImpl() override;
private:
float coeff_;
Scope *scope_;
platform::Place place_;
};
} // namespace details
......
......@@ -28,10 +28,6 @@ namespace framework {
namespace details {
struct SendOpHandle : public OpHandleBase {
std::unique_ptr<OperatorBase> op_;
const Scope* local_scope_;
const platform::Place& place_;
SendOpHandle(const framework::OpDesc& op_desc, const Scope* local_scope,
const platform::Place& place);
......@@ -43,6 +39,11 @@ struct SendOpHandle : public OpHandleBase {
protected:
void RunImpl() override;
private:
std::unique_ptr<OperatorBase> op_;
const Scope* local_scope_;
const platform::Place& place_;
};
} // namespace details
......
......@@ -117,12 +117,12 @@ void SSAGraphBuilder::PrintGraphviz(const SSAGraph &graph, std::ostream &sout) {
std::string op_name = "op_" + std::to_string(op_id++);
sout << op_name << " [label=\"" << op->Name() << "\", shape=rect]"
<< std::endl;
for (auto in : op->inputs_) {
for (auto in : op->Inputs()) {
std::string var_name = "var_" + std::to_string(vars[in]);
sout << var_name << " -> " << op_name << std::endl;
}
for (auto out : op->outputs_) {
for (auto out : op->Outputs()) {
std::string var_name = "var_" + std::to_string(vars[out]);
sout << op_name << " -> " << var_name << std::endl;
}
......@@ -133,7 +133,7 @@ void SSAGraphBuilder::PrintGraphviz(const SSAGraph &graph, std::ostream &sout) {
void SSAGraphBuilder::AddOutputToLeafOps(SSAGraph *graph) {
for (auto &op : graph->ops_) {
if (!op->outputs_.empty()) {
if (!op->Outputs().empty()) {
continue;
}
auto *dummy_leaf = new DummyVarHandle();
......
......@@ -53,7 +53,7 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(
};
auto InsertPendingOp = [&pending_ops](OpHandleBase &op_instance) {
pending_ops.insert({&op_instance, op_instance.inputs_.size()});
pending_ops.insert({&op_instance, op_instance.Inputs().size()});
};
// Transform SSAGraph to pending_ops & pending_vars
......@@ -69,7 +69,7 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(
}
for (auto &op : graph_->ops_) {
if (op->inputs_.empty()) { // Special case, Op has no input.
if (op->Inputs().empty()) { // Special case, Op has no input.
ready_ops.insert(op.get());
} else {
InsertPendingOp(*op);
......@@ -99,7 +99,7 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(
fetch_ops.emplace_back(op);
for (auto &p : places_) {
op->dev_ctxes_[p] = fetch_ctxs_.Get(p);
op->SetDeviceContext(p, fetch_ctxs_.Get(p));
}
for (auto *var : vars) {
......@@ -180,7 +180,7 @@ void ThreadedSSAGraphExecutor::RunOp(
op->Run(use_event_);
VLOG(10) << op << " " << op->Name() << " Done ";
running_ops_--;
ready_var_q->Extend(op->outputs_);
ready_var_q->Extend(op->Outputs());
VLOG(10) << op << " " << op->Name() << "Signal posted";
} catch (platform::EnforceNotMet ex) {
exception_.reset(new platform::EnforceNotMet(ex));
......
......@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/framework/feed_fetch_method.h"
#include <string>
#include <vector>
#include "glog/logging.h"
#include "paddle/fluid/framework/variable.h"
......
......@@ -14,6 +14,7 @@ limitations under the License. */
#pragma once
#include <string>
#include "paddle/fluid/framework/feed_fetch_type.h"
#include "paddle/fluid/framework/scope.h"
......
......@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>
......@@ -69,8 +70,7 @@ class GradOpDescMakerBase {
" for input argument with a list of variables, "
" drop_empty_grad is not allowed because it makes"
" the correspondence bewteen a variable and its gradient"
" ambiguous. Use REGISTER_OP_EX to register the op"
" or call InputGrad(?,false) in GradOpDescMaker."
" ambiguous."
" Op type %s",
fwd_op_.Type());
......
......@@ -14,6 +14,7 @@ limitations under the License. */
#pragma once
#include <iosfwd>
#include <vector>
#include "paddle/fluid/framework/lod_tensor.h"
namespace paddle {
......
......@@ -14,6 +14,7 @@
#pragma once
#include <algorithm>
#include <initializer_list>
#include <vector>
......
......@@ -13,8 +13,10 @@ See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/framework/op_desc.h"
#include <algorithm>
#include <functional>
#include <mutex>
#include <mutex> // NOLINT
#include <string>
#include <unordered_map>
#include "glog/logging.h"
#include "paddle/fluid/framework/block_desc.h"
......
......@@ -119,7 +119,7 @@ class OpDesc {
void InferVarType(BlockDesc *block) const;
void MarkAsTarget() { desc_.set_is_target(true); }
void SetIsTarget(bool is_target) { desc_.set_is_target(is_target); }
void Flush();
......
......@@ -14,6 +14,7 @@ limitations under the License. */
#pragma once
#include <string>
#include "paddle/fluid/framework/data_layout.h"
#include "paddle/fluid/framework/data_type.h"
#include "paddle/fluid/framework/library_type.h"
......
......@@ -12,6 +12,7 @@ See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/framework/op_proto_maker.h"
#include <string>
namespace paddle {
namespace framework {
......
......@@ -13,6 +13,7 @@ limitations under the License. */
#pragma once
#include <string>
#include "paddle/fluid/framework/attribute.h"
#include "paddle/fluid/framework/framework.pb.h"
......
......@@ -16,6 +16,8 @@ limitations under the License. */
#include <algorithm>
#include <atomic>
#include <string>
#include <tuple>
#include <type_traits>
#include <typeinfo>
#include <unordered_map>
......@@ -141,36 +143,6 @@ class OpKernelRegistrar : public Registrar {
return 0; \
}
/**
* Macro to register Operator. When the input is duplicable, you should
* use REGISTER_OP_EX with drop_empty_grad=false instead.
*/
#define REGISTER_OP(op_type, op_class, op_maker_class, grad_op_type, \
grad_op_class) \
REGISTER_OP_EX(op_type, op_class, op_maker_class, grad_op_type, \
grad_op_class, true)
// When an argument is duplicable, we need to use this version.
// Perhaps we can omit DropEmptyIG template parameter and
// only have one version of REGISTER_OP.
#define REGISTER_OP_EX(op_type, op_class, op_maker_class, grad_op_type, \
grad_op_class, drop_empty_grad) \
REGISTER_OPERATOR(grad_op_type, grad_op_class); \
class _GradOpDescMaker_##grad_op_type##_ \
: public ::paddle::framework::DefaultGradOpDescMaker<drop_empty_grad> { \
using ::paddle::framework::DefaultGradOpDescMaker< \
drop_empty_grad>::DefaultGradOpDescMaker; \
\
protected: \
virtual std::string GradOpType() const { return #grad_op_type; } \
}; \
REGISTER_OPERATOR(op_type, op_class, _GradOpDescMaker_##grad_op_type##_, \
op_maker_class);
#define REGISTER_OP_WITH_KERNEL(op_type, ...) \
REGISTER_OPERATOR(op_type, ::paddle::framework::OperatorWithKernel, \
##__VA_ARGS__)
#define REGISTER_OP_WITHOUT_GRADIENT(op_type, op_class, op_maker_class) \
REGISTER_OPERATOR(op_type, op_class, op_maker_class)
......
......@@ -44,6 +44,7 @@ class ParallelExecutorPrivate {
#endif
std::vector<std::tuple<std::string, proto::VarType::Type, bool>> var_types_;
bool own_local_scope;
};
std::vector<Scope *> &ParallelExecutor::GetLocalScopes() {
......@@ -63,13 +64,16 @@ ParallelExecutor::ParallelExecutor(
// Step 1. Bcast the params to devs.
// Create local scopes
if (local_scopes.empty()) {
for (size_t i = 0; i < member_->places_.size(); ++i) {
member_->local_scopes_.push_back(&scope->NewScope());
member_->own_local_scope = true;
member_->local_scopes_.emplace_back(member_->global_scope_);
for (size_t i = 1; i < member_->places_.size(); ++i) {
member_->local_scopes_.emplace_back(&scope->NewScope());
}
} else {
member_->own_local_scope = false;
PADDLE_ENFORCE_EQ(member_->places_.size(), local_scopes.size());
for (size_t i = 0; i < member_->places_.size(); ++i) {
member_->local_scopes_.push_back(local_scopes[i]);
member_->local_scopes_.emplace_back(local_scopes[i]);
}
}
......@@ -159,7 +163,9 @@ void ParallelExecutor::Run(const std::vector<std::string> &fetch_tensors,
const std::string &fetched_var_name) {
platform::RecordBlock b(0);
// Create local scopes.
for (auto &scope : member_->local_scopes_) {
for (auto it = member_->local_scopes_.rbegin();
it != member_->local_scopes_.rend(); ++it) {
auto &scope = *it;
Scope &local_scope = scope->NewScope();
*scope->Var(details::kLocalExecScopeName)->GetMutable<Scope *>() =
&local_scope;
......@@ -173,7 +179,7 @@ void ParallelExecutor::Run(const std::vector<std::string> &fetch_tensors,
InitializeVariable(scope->Var(std::get<0>(name_type_pair)),
std::get<1>(name_type_pair));
} else {
InitializeVariable(scope->Var(std::get<0>(name_type_pair)),
InitializeVariable(local_scope.Var(std::get<0>(name_type_pair)),
std::get<1>(name_type_pair));
}
}
......@@ -228,5 +234,13 @@ void ParallelExecutor::FeedAndSplitTensorIntoLocalScopes(
}
}
ParallelExecutor::~ParallelExecutor() {
if (member_->own_local_scope) {
for (size_t i = 1; i < member_->local_scopes_.size(); ++i) {
member_->global_scope_->DeleteScope(member_->local_scopes_[i]);
}
}
}
} // namespace framework
} // namespace paddle
......@@ -42,6 +42,8 @@ class ParallelExecutor {
const std::vector<Scope*>& local_scopes,
bool allow_op_delay);
~ParallelExecutor();
std::vector<Scope*>& GetLocalScopes();
/**
......
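The parallel_executor hunks above make scope ownership explicit: when the executor creates its own local scopes, index 0 reuses the global scope, indices 1..N-1 are newly created child scopes, and the new destructor hands those children back to the global scope. A standalone sketch of that ownership rule (Scope and Executor are simplified stand-ins, not the Paddle classes):

#include <algorithm>
#include <cstddef>
#include <vector>

struct Scope {  // stand-in for framework::Scope
  std::vector<Scope*> kids;
  ~Scope() { for (Scope* k : kids) delete k; }
  Scope& NewScope() { kids.push_back(new Scope()); return *kids.back(); }
  void DeleteScope(Scope* s) {
    auto it = std::find(kids.begin(), kids.end(), s);
    if (it != kids.end()) { kids.erase(it); delete s; }
  }
};

class Executor {  // stand-in for ParallelExecutor
 public:
  Executor(Scope* global, std::size_t num_places) : global_(global) {
    own_local_scope_ = true;
    local_scopes_.push_back(global_);                 // scope 0 aliases the global scope
    for (std::size_t i = 1; i < num_places; ++i) {
      local_scopes_.push_back(&global_->NewScope());  // owned child scopes
    }
  }
  ~Executor() {
    if (own_local_scope_) {
      for (std::size_t i = 1; i < local_scopes_.size(); ++i) {  // never delete scope 0
        global_->DeleteScope(local_scopes_[i]);
      }
    }
  }

 private:
  Scope* global_;
  std::vector<Scope*> local_scopes_;
  bool own_local_scope_ = false;
};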
......@@ -91,7 +91,7 @@ std::vector<std::string> Scope::LocalVarNames() const {
return known_vars;
}
void Scope::DeleteScope(Scope* scope) {
void Scope::DeleteScope(Scope* scope) const {
std::unique_lock<std::mutex> lock(mutex_);
auto it = std::find(this->kids_.begin(), this->kids_.end(), scope);
PADDLE_ENFORCE(it != this->kids_.end(), "Cannot find %p as kid scope", scope);
......
......@@ -63,7 +63,7 @@ class Scope {
/// Find the scope or an ancestor scope that contains the given variable.
const Scope* FindScope(const Variable* var) const;
void DeleteScope(Scope* scope);
void DeleteScope(Scope* scope) const;
/// Drop all kids scopes belonged to this scope.
void DropKids();
......
......@@ -17,6 +17,52 @@ limitations under the License. */
namespace paddle {
namespace framework {
struct ReAllocateVisitor {
ReAllocateVisitor(framework::Tensor* tensor, const framework::DDim& dims)
: tensor_(tensor), dims_(dims) {}
template <typename T>
void operator()() const {
framework::Tensor cpu_tensor;
platform::CPUPlace cpu;
T* ptr = cpu_tensor.mutable_data<T>(dims_, cpu);
const T* old_ptr =
tensor_->memory_size() == 0 ? nullptr : tensor_->data<T>();
if (old_ptr != nullptr) {
std::copy(old_ptr, old_ptr + tensor_->numel(), ptr);
}
tensor_->ShareDataWith(cpu_tensor);
}
framework::Tensor* tensor_;
framework::DDim dims_;
};
struct TensorCopyVisitor {
TensorCopyVisitor(framework::Tensor* dst, int64_t dst_offset,
const framework::Tensor src, int64_t src_offset,
int64_t size)
: dst_(dst),
dst_offset_(dst_offset),
src_(src),
src_offset_(src_offset),
size_(size) {}
template <typename T>
void operator()() const {
// TODO(Yancey1989): support other place
platform::CPUPlace cpu;
memory::Copy(cpu, dst_->mutable_data<T>(cpu) + dst_offset_, cpu,
src_.data<T>() + src_offset_, size_ * sizeof(T));
}
framework::Tensor* dst_;
int64_t dst_offset_;
framework::Tensor src_;
int64_t src_offset_;
int64_t size_;
};
void SerializeToStream(std::ostream& os, const SelectedRows& selected_rows,
const platform::DeviceContext& dev_ctx) {
{ // the 1st field, uint32_t version
......@@ -69,5 +115,66 @@ void DeserializeFromStream(std::istream& is, SelectedRows* selected_rows,
TensorFromStream(is, selected_rows->mutable_value(), dev_ctx);
}
bool SelectedRows::HasKey(int64_t key) const {
return std::find(rows_.begin(), rows_.end(), key) == rows_.end() ? false
: true;
}
std::vector<int64_t> SelectedRows::Get(std::vector<int64_t> keys,
framework::Tensor* value) const {
PADDLE_ENFORCE(value->IsInitialized(),
"The value tensor should be initialized.");
std::vector<int64_t> non_keys;
int64_t value_width = value_->numel() / value_->dims()[0];
PADDLE_ENFORCE_EQ(value_width, value->numel() / value->dims()[0],
"output tensor should have the same shape with table "
"execpt the dims[0].");
for (size_t i = 0; i < keys.size(); ++i) {
int64_t index = Index(keys[i]);
if (index == -1) {
non_keys.push_back(keys[i]);
} else {
framework::VisitDataType(
framework::ToDataType(value_->type()),
TensorCopyVisitor(value, i * value_width, *value_.get(),
index * value_width, value_width));
}
}
return non_keys;
}
bool SelectedRows::Set(int64_t key, const framework::Tensor& value) {
PADDLE_ENFORCE(value.IsInitialized(), "The value should be initialized.");
if (value_->IsInitialized()) {
PADDLE_ENFORCE_EQ(
value.type(), value_->type(),
"The type of the value should be same with the original value");
}
PADDLE_ENFORCE_EQ(value.dims()[0], static_cast<size_t>(1),
"The first dim of value should be 1.");
auto index = Index(key);
bool is_new_key = false;
if (index == -1) {
rows_.push_back(key);
index = rows_.size() - 1;
is_new_key = true;
// whether we need to resize the table
if (static_cast<int64_t>(rows_.size()) > value_->dims()[0]) {
auto dims = value_->dims();
dims[0] = (dims[0] + 1) << 1;
framework::VisitDataType(framework::ToDataType(value.type()),
ReAllocateVisitor(value_.get(), dims));
}
}
framework::VisitDataType(
framework::ToDataType(value.type()),
TensorCopyVisitor(value_.get(),
index * value_->numel() / value_->dims()[0], value,
static_cast<int64_t>(0), value.numel()));
return is_new_key;
}
} // namespace framework
} // namespace paddle
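ReAllocateVisitor and TensorCopyVisitor above are invoked through framework::VisitDataType, which maps the tensor's runtime data type onto a call to the visitor's templated operator()<T>(). A standalone sketch of that dispatch idiom (the DataType enum and the zero-fill visitor are illustrative, not Paddle code):

#include <cstddef>
#include <cstdint>

enum class DataType { FP32, FP64, INT64 };

// A runtime type tag selects a compile-time instantiation of the visitor.
template <typename Visitor>
void VisitDataType(DataType type, Visitor visitor) {
  switch (type) {
    case DataType::FP32:  visitor.template operator()<float>();   break;
    case DataType::FP64:  visitor.template operator()<double>();  break;
    case DataType::INT64: visitor.template operator()<int64_t>(); break;
  }
}

struct FillZeroVisitor {  // plays the same role as TensorCopyVisitor above
  void* data;
  std::size_t numel;
  template <typename T>
  void operator()() const {
    T* p = static_cast<T*>(data);
    for (std::size_t i = 0; i < numel; ++i) p[i] = T(0);
  }
};

With this shape, a call like VisitDataType(DataType::FP32, FillZeroVisitor{buf, n}) runs the float instantiation at runtime, which is how the Set and Get implementations above copy or reallocate without knowing the element type statically.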
......@@ -14,15 +14,33 @@ limitations under the License. */
#pragma once
#include <algorithm>
#include <vector>
#include "paddle/fluid/framework/lod_tensor.h"
#include "paddle/fluid/framework/tensor.h"
#include "paddle/fluid/memory/memcpy.h"
namespace paddle {
namespace framework {
class SelectedRows {
/*
 * @brief We can use the SelectedRows structure to reproduce a sparse table.
 * A sparse table is a key-value structure in which the key is an `int64_t`
 * number and the value is a Tensor whose first dimension is 0.
 * You can use the following interface to operate the sparse table; more
 * detailed information can be found in the comments of each interface:
 *
 * HasKey(key), whether the sparse table has the specified key.
 * Set(key, value), set a key-value pair into the sparse table.
 * Get(keys, value*), get values for the given key list and write them to the
 * given value pointer with the specified offset.
 *
 */
public:
SelectedRows(const std::vector<int64_t>& rows, const int64_t& height)
: rows_(rows), height_(height) {
......@@ -50,12 +68,45 @@ class SelectedRows {
void set_rows(const Vector<int64_t>& rows) { rows_ = rows; }
/**
* get the index of id in rows
/*
 * @brief whether the table has the specified key.
 *
 * @return true if the key exists.
*/
bool HasKey(int64_t key) const;
/*
 * @brief Get values by the given key list.
 *
 * @return a list of the keys that do not exist in the table
*/
std::vector<int64_t> Get(std::vector<int64_t> keys,
framework::Tensor* tensor) const;
/*
* @brief Set a key-value pair into the table.
 * This function will double the value memory if it is not enough.
 *
 * @note:
 * 1. The first dim of the value should be 1
 * 2. The value should be initialized and its data type
 * should be the same as the table's.
*
* @return true if the key is a new one, otherwise false
*
*/
bool Set(int64_t key, const Tensor& value);
/*
* @brief Get the index of key in rows
*
 * @return -1 if the key does not exist.
*/
int64_t index(int64_t id) const {
auto it = std::find(rows_.begin(), rows_.end(), id);
PADDLE_ENFORCE(it != rows_.end(), "id should be in rows");
int64_t Index(int64_t key) const {
auto it = std::find(rows_.begin(), rows_.end(), key);
if (it == rows_.end()) {
return static_cast<int64_t>(-1);
}
return static_cast<int64_t>(std::distance(rows_.begin(), it));
}
......
......@@ -17,7 +17,7 @@ namespace framework {
class SelectedRowsTester : public ::testing::Test {
public:
virtual void SetUp() override {
void SetUp() override {
std::vector<int64_t> rows{0, 4, 7};
int64_t height = 10;
int64_t row_numel = 100;
......@@ -59,5 +59,40 @@ TEST_F(SelectedRowsTester, SerializeAndDeseralize) {
ASSERT_EQ(selected_rows_->GetCompleteDims(), dst_tensor.GetCompleteDims());
}
TEST_F(SelectedRowsTester, Table) {
platform::CPUPlace cpu;
SelectedRows table;
// initialize a sparse table
table.mutable_value()->Resize(framework::make_ddim({1, 100}));
table.mutable_value()->mutable_data<float>(cpu);
table.mutable_rows()->push_back(1);
int64_t key = 10000;
int64_t non_key = 999;
framework::Tensor value;
value.Resize(framework::make_ddim({1, 100}));
auto ptr = value.mutable_data<float>(cpu);
ptr[0] = static_cast<float>(10);
ASSERT_EQ(table.rows().size(), static_cast<size_t>(1));
ASSERT_EQ(table.HasKey(key), false);
table.Set(key, value);
ASSERT_EQ(table.rows().size(), static_cast<size_t>(2));
ASSERT_EQ(table.HasKey(key), true);
// check re-allocate
ASSERT_EQ(table.value().dims()[0], static_cast<int64_t>(4));
framework::Tensor get_value;
get_value.mutable_data<float>(framework::make_ddim({2, 100}), cpu);
std::vector<int64_t> keys({non_key, key});
auto non_keys = table.Get(keys, &get_value);
ASSERT_EQ(get_value.data<float>()[100], static_cast<float>(10));
ASSERT_EQ(non_keys.size(), static_cast<size_t>(1));
ASSERT_EQ(non_keys[0], non_key);
}
} // namespace framework
} // namespace paddle
......@@ -11,8 +11,12 @@ distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/framework/shape_inference.h"
#include "grad_op_desc_maker.h"
#include <algorithm>
#include <string>
#include <vector>
#include "paddle/fluid/framework/grad_op_desc_maker.h"
#include "paddle/fluid/framework/operator.h"
namespace paddle {
......
......@@ -14,6 +14,8 @@ limitations under the License. */
#pragma once
#include <string>
#include <vector>
#include "paddle/fluid/framework/attribute.h"
#include "paddle/fluid/framework/ddim.h"
#include "paddle/fluid/framework/framework.pb.h"
......
......@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <vector>
#include "paddle/fluid/framework/data_type.h"
#include "paddle/fluid/framework/eigen.h"
#include "paddle/fluid/framework/framework.pb.h"
......
......@@ -14,6 +14,8 @@ limitations under the License. */
#pragma once
#include <algorithm>
#include <string>
#include <vector>
#include "glog/logging.h"
#include "paddle/fluid/framework/framework.pb.h"
......
......@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/framework/var_type_inference.h"
#include <string>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
......
......@@ -14,6 +14,7 @@
#pragma once
#include <memory>
#include <string>
#include <typeindex>
#include <typeinfo>
......@@ -67,7 +68,7 @@ class Variable {
// parameter of Variable.
template <typename T>
struct PlaceholderImpl : public Placeholder {
PlaceholderImpl(T* ptr) : ptr_(ptr), type_(typeid(T)) {}
explicit PlaceholderImpl(T* ptr) : ptr_(ptr), type_(typeid(T)) {}
virtual const std::type_info& Type() const { return type_; }
virtual void* Ptr() const { return static_cast<void*>(ptr_.get()); }
......
......@@ -110,12 +110,12 @@ function(op_library TARGET)
# Note that it's enough to just add one operator to pybind in a *_op.cc file.
# And for detailed pybind information, please see the generated paddle/pybind/pybind.h.
file(READ ${TARGET}.cc TARGET_CONTENT)
string(REGEX MATCH "REGISTER_OP\\(.*REGISTER_OP\\(" multi_register "${TARGET_CONTENT}")
string(REGEX MATCH "REGISTER_OP\\([a-z0-9_]*," one_register "${multi_register}")
string(REGEX MATCH "REGISTER_OPERATOR\\(.*REGISTER_OPERATOR\\(" multi_register "${TARGET_CONTENT}")
string(REGEX MATCH "REGISTER_OPERATOR\\([a-z0-9_]*," one_register "${multi_register}")
if (one_register STREQUAL "")
string(REPLACE "_op" "" TARGET "${TARGET}")
else ()
string(REPLACE "REGISTER_OP(" "" TARGET "${one_register}")
string(REPLACE "REGISTER_OPERATOR(" "" TARGET "${one_register}")
string(REPLACE "," "" TARGET "${TARGET}")
endif()
......
......@@ -558,95 +558,126 @@ $$out = \frac{x}{1 + e^{- \beta x}}$$
namespace ops = paddle::operators;
REGISTER_OP(sigmoid, ops::ActivationOp, ops::SigmoidOpMaker, sigmoid_grad,
ops::ActivationOpGrad);
REGISTER_OPERATOR(sigmoid, ops::ActivationOp, ops::SigmoidOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(sigmoid_grad, ops::ActivationOpGrad);
REGISTER_OP(logsigmoid, ops::ActivationOp, ops::LogSigmoidOpMaker,
logsigmoid_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(logsigmoid, ops::ActivationOp, ops::LogSigmoidOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(logsigmoid_grad, ops::ActivationOpGrad);
REGISTER_OP(exp, ops::ActivationOp, ops::ExpOpMaker, exp_grad,
ops::ActivationOpGrad);
REGISTER_OPERATOR(exp, ops::ActivationOp, ops::ExpOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(exp_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(relu, ops::ActivationWithMKLDNNOp, ops::ReluOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(relu_grad, ops::ActivationWithMKLDNNOpGrad);
REGISTER_OPERATOR(tanh, ops::ActivationWithMKLDNNOp, ops::TanhOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(tanh_grad, ops::ActivationWithMKLDNNOpGrad);
REGISTER_OPERATOR(tanh_shrink, ops::ActivationOp, ops::TanhShrinkOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(tanh_shrink_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(softshrink, ops::ActivationOp, ops::SoftShrinkOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(softshrink_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(sqrt, ops::ActivationWithMKLDNNOp, ops::SqrtOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(sqrt_grad, ops::ActivationWithMKLDNNOpGrad);
REGISTER_OPERATOR(abs, ops::ActivationWithMKLDNNOp, ops::AbsOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(abs_grad, ops::ActivationWithMKLDNNOpGrad);
REGISTER_OPERATOR(ceil, ops::ActivationOp, ops::CeilOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(ceil_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(floor, ops::ActivationOp, ops::FloorOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(floor_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(cos, ops::ActivationOp, ops::CosOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(cos_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(sin, ops::ActivationOp, ops::SinOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(sin_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(round, ops::ActivationOp, ops::RoundOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(round_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(reciprocal, ops::ActivationOp, ops::ReciprocalOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(reciprocal_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(log, ops::ActivationOp, ops::LogOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(log_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(square, ops::ActivationOp, ops::SquareOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(square_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(softplus, ops::ActivationOp, ops::SoftplusOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(softplus_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(softsign, ops::ActivationOp, ops::SoftsignOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(softsign_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(brelu, ops::ActivationOp, ops::BReluOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(brelu_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(leaky_relu, ops::ActivationOp, ops::LeakyReluOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(leaky_relu_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(soft_relu, ops::ActivationOp, ops::SoftReluOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(soft_relu_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(elu, ops::ActivationOp, ops::ELUOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(elu_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(relu6, ops::ActivationOp, ops::Relu6OpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(relu6_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(pow, ops::ActivationOp, ops::PowOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(pow_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(stanh, ops::ActivationOp, ops::STanhOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(stanh_grad, ops::ActivationOpGrad);
REGISTER_OP(relu, ops::ActivationWithMKLDNNOp, ops::ReluOpMaker, relu_grad,
ops::ActivationWithMKLDNNOpGrad);
REGISTER_OPERATOR(hard_shrink, ops::ActivationOp, ops::HardShrinkOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(hard_shrink_grad, ops::ActivationOpGrad);
REGISTER_OP(tanh, ops::ActivationWithMKLDNNOp, ops::TanhOpMaker, tanh_grad,
ops::ActivationWithMKLDNNOpGrad);
REGISTER_OP(tanh_shrink, ops::ActivationOp, ops::TanhShrinkOpMaker,
tanh_shrink_grad, ops::ActivationOpGrad);
REGISTER_OP(softshrink, ops::ActivationOp, ops::SoftShrinkOpMaker,
softshrink_grad, ops::ActivationOpGrad);
REGISTER_OP(sqrt, ops::ActivationWithMKLDNNOp, ops::SqrtOpMaker, sqrt_grad,
ops::ActivationWithMKLDNNOpGrad);
REGISTER_OP(abs, ops::ActivationWithMKLDNNOp, ops::AbsOpMaker, abs_grad,
ops::ActivationWithMKLDNNOpGrad);
REGISTER_OP(ceil, ops::ActivationOp, ops::CeilOpMaker, ceil_grad,
ops::ActivationOpGrad);
REGISTER_OP(floor, ops::ActivationOp, ops::FloorOpMaker, floor_grad,
ops::ActivationOpGrad);
REGISTER_OP(cos, ops::ActivationOp, ops::CosOpMaker, cos_grad,
ops::ActivationOpGrad);
REGISTER_OP(sin, ops::ActivationOp, ops::SinOpMaker, sin_grad,
ops::ActivationOpGrad);
REGISTER_OP(round, ops::ActivationOp, ops::RoundOpMaker, round_grad,
ops::ActivationOpGrad);
REGISTER_OP(reciprocal, ops::ActivationOp, ops::ReciprocalOpMaker,
reciprocal_grad, ops::ActivationOpGrad);
REGISTER_OP(log, ops::ActivationOp, ops::LogOpMaker, log_grad,
ops::ActivationOpGrad);
REGISTER_OP(square, ops::ActivationOp, ops::SquareOpMaker, square_grad,
ops::ActivationOpGrad);
REGISTER_OP(softplus, ops::ActivationOp, ops::SoftplusOpMaker, softplus_grad,
ops::ActivationOpGrad);
REGISTER_OP(softsign, ops::ActivationOp, ops::SoftsignOpMaker, softsign_grad,
ops::ActivationOpGrad);
REGISTER_OP(brelu, ops::ActivationOp, ops::BReluOpMaker, brelu_grad,
ops::ActivationOpGrad);
REGISTER_OP(leaky_relu, ops::ActivationOp, ops::LeakyReluOpMaker,
leaky_relu_grad, ops::ActivationOpGrad);
REGISTER_OP(soft_relu, ops::ActivationOp, ops::SoftReluOpMaker, soft_relu_grad,
ops::ActivationOpGrad);
REGISTER_OP(elu, ops::ActivationOp, ops::ELUOpMaker, elu_grad,
ops::ActivationOpGrad);
REGISTER_OP(relu6, ops::ActivationOp, ops::Relu6OpMaker, relu6_grad,
ops::ActivationOpGrad);
REGISTER_OP(pow, ops::ActivationOp, ops::PowOpMaker, pow_grad,
ops::ActivationOpGrad);
REGISTER_OP(stanh, ops::ActivationOp, ops::STanhOpMaker, stanh_grad,
ops::ActivationOpGrad);
REGISTER_OP(hard_shrink, ops::ActivationOp, ops::HardShrinkOpMaker,
hard_shrink_grad, ops::ActivationOpGrad);
REGISTER_OP(thresholded_relu, ops::ActivationOp, ops::ThresholdedReluOpMaker,
thresholded_relu_grad, ops::ActivationOpGrad);
REGISTER_OP(hard_sigmoid, ops::ActivationOp, ops::HardSigmoidOpMaker,
hard_sigmoid_grad, ops::ActivationOpGrad);
REGISTER_OP(swish, ops::ActivationOp, ops::SwishOpMaker, swish_grad,
ops::ActivationOpGrad);
REGISTER_OPERATOR(thresholded_relu, ops::ActivationOp,
ops::ThresholdedReluOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(thresholded_relu_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(hard_sigmoid, ops::ActivationOp, ops::HardSigmoidOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(hard_sigmoid_grad, ops::ActivationOpGrad);
REGISTER_OPERATOR(swish, ops::ActivationOp, ops::SwishOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(swish_grad, ops::ActivationOpGrad);
#define REGISTER_ACTIVATION_CPU_KERNEL(act_type, functor, grad_functor) \
REGISTER_OP_CPU_KERNEL( \
......
......@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/beam_search_decode_op.h"
#include <string>
#include "paddle/fluid/platform/device_context.h"
namespace paddle {
......
......@@ -14,6 +14,7 @@ limitations under the License. */
#pragma once
#include <vector>
#include "paddle/fluid/framework/lod_tensor_array.h"
#include "paddle/fluid/framework/op_registry.h"
......@@ -87,7 +88,7 @@ struct BeamSearchDecoder {
*/
std::vector<BeamNodeVector<T>> PackTwoSteps(
const LoDTensor& cur_ids, const LoDTensor& cur_scores,
std::vector<BeamNodeVector<T>>& prefixes_list,
std::vector<BeamNodeVector<T>>* prefixes_list,
std::vector<SentenceVector<T>>* sentence_vector_list) const;
/**
......@@ -140,7 +141,7 @@ Sentence<T> BeamSearchDecoder<T>::MakeSentence(const BeamNode<T>* node) const {
template <typename T>
std::vector<BeamNodeVector<T>> BeamSearchDecoder<T>::PackTwoSteps(
const LoDTensor& cur_ids, const LoDTensor& cur_scores,
std::vector<BeamNodeVector<T>>& prefixes_list,
std::vector<BeamNodeVector<T>>* prefixes_list,
std::vector<SentenceVector<T>>* sentence_vector_list) const {
std::vector<BeamNodeVector<T>> result;
......@@ -153,7 +154,7 @@ std::vector<BeamNodeVector<T>> BeamSearchDecoder<T>::PackTwoSteps(
// if prefixes size is 0, it means this is the first step. In this step,
// every candidate id is the start of a candidate sentence.
if (prefixes_list.empty()) {
if (prefixes_list->empty()) {
PADDLE_ENFORCE_EQ(cur_ids.lod().at(kSourceLevel).back(),
cur_ids.lod().at(kSentenceLevel).back(),
"in the first step");
......@@ -162,7 +163,7 @@ std::vector<BeamNodeVector<T>> BeamSearchDecoder<T>::PackTwoSteps(
cur_ids.data<int64_t>()[id_idx], cur_scores.data<T>()[id_idx])));
}
} else {
BeamNodeVector<T>& prefixes = prefixes_list[src_idx];
BeamNodeVector<T>& prefixes = prefixes_list->at(src_idx);
SentenceVector<T>& sentence_vector = (*sentence_vector_list)[src_idx];
PADDLE_ENFORCE_EQ(src_end - src_start, prefixes.size(),
......@@ -262,7 +263,7 @@ void BeamSearchDecoder<T>::PackAllSteps(const LoDTensorArray& step_ids,
for (size_t step_id = 0; step_id < step_num; ++step_id) {
beamnode_vector_list =
PackTwoSteps(step_ids.at(step_id), step_scores.at(step_id),
beamnode_vector_list, &sentence_vector_list);
&beamnode_vector_list, &sentence_vector_list);
}
// append last beam_node to result
for (size_t src_idx = 0; src_idx < src_num; ++src_idx) {
......
......@@ -125,7 +125,7 @@ TEST(BeamSearchDecodeOp, PackTwoStepsFistStep) {
BeamSearchDecoder<float> helper;
beamnode_vector_list = helper.PackTwoSteps(
ids[0], scores[0], beamnode_vector_list, &sentence_vector_list);
ids[0], scores[0], &beamnode_vector_list, &sentence_vector_list);
ASSERT_EQ(beamnode_vector_list.size(), 2UL);
ASSERT_EQ(beamnode_vector_list[0].size(), 2UL);
ASSERT_EQ(beamnode_vector_list[1].size(), 4UL);
......@@ -167,7 +167,7 @@ TEST(BeamSearchDecodeOp, PackTwoSteps) {
BeamSearchDecoder<float> helper1;
beamnode_vector_list = helper1.PackTwoSteps(
ids[0], scores[0], beamnode_vector_list, &sentence_vector_list);
ids[0], scores[0], &beamnode_vector_list, &sentence_vector_list);
ASSERT_EQ(sentence_vector_list[0].size(), 1UL);
ASSERT_EQ(sentence_vector_list[1].size(), 0UL);
......
......@@ -14,7 +14,10 @@ limitations under the License. */
#include "paddle/fluid/operators/beam_search_op.h"
#include <algorithm>
#include <map>
#include <string>
#include <vector>
#include "paddle/fluid/framework/lod_tensor.h"
#include "paddle/fluid/framework/op_registry.h"
......
......@@ -18,6 +18,8 @@ limitations under the License. */
#include "gtest/gtest.h"
#endif
#include <string>
#include <vector>
#include "paddle/fluid/framework/lod_tensor.h"
#include "paddle/fluid/framework/operator.h"
......
......@@ -153,9 +153,11 @@ class BilinearTensorProductOpGrad : public framework::OperatorWithKernel {
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP(bilinear_tensor_product, ops::BilinearTensorProductOp,
ops::BilinearTensorProductOpMaker, bilinear_tensor_product_grad,
ops::BilinearTensorProductOpGrad);
REGISTER_OPERATOR(bilinear_tensor_product, ops::BilinearTensorProductOp,
ops::BilinearTensorProductOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(bilinear_tensor_product_grad,
ops::BilinearTensorProductOpGrad);
REGISTER_OP_CPU_KERNEL(
bilinear_tensor_product,
ops::BilinearTensorProductKernel<paddle::platform::CPUDeviceContext, float>,
......
......@@ -29,11 +29,11 @@ namespace paddle {
namespace operators {
void SetReceiveStatus(const platform::Place &dev_place,
framework::Variable &status_var, bool status) {
framework::Variable *status_var, bool status) {
auto cpu = platform::CPUPlace();
auto status_tensor =
status_var.GetMutable<framework::LoDTensor>()->mutable_data<bool>({1},
cpu);
status_var->GetMutable<framework::LoDTensor>()->mutable_data<bool>({1},
cpu);
status_tensor[0] = status;
}
......@@ -66,7 +66,7 @@ class ChannelRecvOp : public framework::OperatorBase {
bool ok = concurrency::ChannelReceive(ch, output_var);
// Set the status output of the `ChannelReceive` call.
SetReceiveStatus(dev_place, *scope.FindVar(Output(Status)), ok);
SetReceiveStatus(dev_place, scope.FindVar(Output(Status)), ok);
}
};
......
......@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/chunk_eval_op.h"
#include <string>
#include <vector>
namespace paddle {
namespace operators {
......
......@@ -14,6 +14,9 @@ limitations under the License. */
#pragma once
#include <set>
#include <string>
#include <vector>
#include "paddle/fluid/framework/eigen.h"
#include "paddle/fluid/framework/op_registry.h"
......@@ -36,11 +39,11 @@ class ChunkEvalKernel : public framework::OpKernel<T> {
};
void GetSegments(const int64_t* label, int length,
std::vector<Segment>& segments, int num_chunk_types,
std::vector<Segment>* segments, int num_chunk_types,
int num_tag_types, int other_chunk_type, int tag_begin,
int tag_inside, int tag_end, int tag_single) const {
segments.clear();
segments.reserve(length);
segments->clear();
segments->reserve(length);
int chunk_start = 0;
bool in_chunk = false;
int tag = -1;
......@@ -58,7 +61,7 @@ class ChunkEvalKernel : public framework::OpKernel<T> {
i - 1, // end
prev_type,
};
segments.push_back(segment);
segments->push_back(segment);
in_chunk = false;
}
if (ChunkBegin(prev_tag, prev_type, tag, type, other_chunk_type,
......@@ -73,7 +76,7 @@ class ChunkEvalKernel : public framework::OpKernel<T> {
length - 1, // end
type,
};
segments.push_back(segment);
segments->push_back(segment);
}
}
......@@ -177,8 +180,8 @@ class ChunkEvalKernel : public framework::OpKernel<T> {
for (int i = 0; i < num_sequences; ++i) {
int seq_length = lod[0][i + 1] - lod[0][i];
EvalOneSeq(inference_data + lod[0][i], label_data + lod[0][i], seq_length,
output_segments, label_segments, *num_infer_chunks_data,
*num_label_chunks_data, *num_correct_chunks_data,
&output_segments, &label_segments, num_infer_chunks_data,
num_label_chunks_data, num_correct_chunks_data,
num_chunk_types, num_tag_types, other_chunk_type, tag_begin,
tag_inside, tag_end, tag_single, excluded_chunk_types);
}
......@@ -197,10 +200,10 @@ class ChunkEvalKernel : public framework::OpKernel<T> {
}
void EvalOneSeq(const int64_t* output, const int64_t* label, int length,
std::vector<Segment>& output_segments,
std::vector<Segment>& label_segments,
int64_t& num_output_segments, int64_t& num_label_segments,
int64_t& num_correct, int num_chunk_types, int num_tag_types,
std::vector<Segment>* output_segments,
std::vector<Segment>* label_segments,
int64_t* num_output_segments, int64_t* num_label_segments,
int64_t* num_correct, int num_chunk_types, int num_tag_types,
int other_chunk_type, int tag_begin, int tag_inside,
int tag_end, int tag_single,
const std::set<int>& excluded_chunk_types) const {
......@@ -209,25 +212,29 @@ class ChunkEvalKernel : public framework::OpKernel<T> {
GetSegments(label, length, label_segments, num_chunk_types, num_tag_types,
other_chunk_type, tag_begin, tag_inside, tag_end, tag_single);
size_t i = 0, j = 0;
while (i < output_segments.size() && j < label_segments.size()) {
if (output_segments[i] == label_segments[j] &&
excluded_chunk_types.count(output_segments[i].type) != 1) {
++num_correct;
while (i < output_segments->size() && j < label_segments->size()) {
if (output_segments->at(i) == label_segments->at(j) &&
excluded_chunk_types.count(output_segments->at(i).type) != 1) {
++(*num_correct);
}
if (output_segments[i].end < label_segments[j].end) {
if (output_segments->at(i).end < label_segments->at(j).end) {
++i;
} else if (output_segments[i].end > label_segments[j].end) {
} else if (output_segments->at(i).end > label_segments->at(j).end) {
++j;
} else {
++i;
++j;
}
}
for (auto& segment : label_segments) {
if (excluded_chunk_types.count(segment.type) != 1) ++num_label_segments;
for (auto& segment : (*label_segments)) {
if (excluded_chunk_types.count(segment.type) != 1) {
++(*num_label_segments);
}
}
for (auto& segment : output_segments) {
if (excluded_chunk_types.count(segment.type) != 1) ++num_output_segments;
for (auto& segment : (*output_segments)) {
if (excluded_chunk_types.count(segment.type) != 1) {
++(*num_output_segments);
}
}
}
};
......
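The chunk_eval hunks above (and the detection_map hunks below) systematically change mutable output parameters from non-const references to pointers, so each call site now passes an explicit &output. A standalone before/after sketch of that convention, using a made-up CollectEven helper:

#include <vector>

// Before (old style): void CollectEven(const std::vector<int>& in, std::vector<int>& out);
// After: the output parameter is a pointer, so the mutation is visible at the call site.
void CollectEven(const std::vector<int>& in, std::vector<int>* out) {
  out->clear();
  for (int v : in) {
    if (v % 2 == 0) out->push_back(v);
  }
}

int main() {
  std::vector<int> values{1, 2, 3, 4};
  std::vector<int> evens;
  CollectEven(values, &evens);  // the explicit & mirrors GetSegments(..., &segments, ...)
  return static_cast<int>(evens.size());
}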
......@@ -81,8 +81,9 @@ class ClipOpGrad : public framework::OperatorWithKernel {
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP(clip, ops::ClipOp, ops::ClipOpMaker<float>, clip_grad,
ops::ClipOpGrad);
REGISTER_OPERATOR(clip, ops::ClipOp, ops::ClipOpMaker<float>,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(clip_grad, ops::ClipOpGrad);
REGISTER_OP_CPU_KERNEL(
clip, ops::ClipKernel<paddle::platform::CPUDeviceContext, float>);
REGISTER_OP_CPU_KERNEL(
......
......@@ -103,10 +103,12 @@ class ConcatOpGrad : public framework::OperatorWithKernel {
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_EX(concat, ops::ConcatOp, ops::ConcatOpMaker, concat_grad,
ops::ConcatOpGrad, false)
REGISTER_OPERATOR(concat, ops::ConcatOp, ops::ConcatOpMaker,
paddle::framework::DefaultGradOpDescMaker<
false> /* set false to disable empty grad */);
REGISTER_OPERATOR(concat_grad, ops::ConcatOpGrad);
REGISTER_OP_CPU_KERNEL(
concat, ops::ConcatKernel<paddle::platform::CPUDeviceContext, float>)
concat, ops::ConcatKernel<paddle::platform::CPUDeviceContext, float>);
REGISTER_OP_CPU_KERNEL(
concat_grad,
ops::ConcatGradKernel<paddle::platform::CPUDeviceContext, float>)
ops::ConcatGradKernel<paddle::platform::CPUDeviceContext, float>);
......@@ -47,7 +47,7 @@ class ConditionalOp : public framework::OperatorBase {
if (!(ips.size() == 1UL && ips[0]->IsInitialized())) {
PADDLE_THROW("should have one initialized input as condition");
}
if (!(ips[0]->type().hash_code() == typeid(bool).hash_code() &&
if (!(ips[0]->type().hash_code() == typeid(bool).hash_code() && // NOLINT
ips[0]->numel() == 1)) {
PADDLE_THROW(
"condition input's data type should be bool, "
......
......@@ -73,9 +73,11 @@ class ConvMKLDNNOpKernel : public paddle::framework::OpKernel<T> {
dst_tz, mkldnn::memory::data_type::f32, mkldnn::memory::format::nchw);
auto src_memory =
mkldnn::memory({src_md, mkldnn_engine}, (void*)input_data);
mkldnn::memory({src_md, mkldnn_engine},
reinterpret_cast<void*>(const_cast<T*>(input_data)));
auto weights_memory =
mkldnn::memory({weights_md, mkldnn_engine}, (void*)filter_data);
mkldnn::memory({weights_md, mkldnn_engine},
reinterpret_cast<void*>(const_cast<T*>(filter_data)));
auto dst_memory = mkldnn::memory({dst_md, mkldnn_engine}, output_data);
std::shared_ptr<mkldnn::convolution_forward::primitive_desc> conv_pd =
......@@ -180,8 +182,9 @@ class ConvMKLDNNGradOpKernel : public paddle::framework::OpKernel<T> {
dst_tz, mkldnn::memory::data_type::f32, mkldnn::memory::format::nchw);
// create memory
auto diff_dst_memory = mkldnn::memory({diff_weights_md, mkldnn_engine},
(void*)output_grad_data);
auto diff_dst_memory = mkldnn::memory(
{diff_weights_md, mkldnn_engine},
reinterpret_cast<void*>(const_cast<T*>(output_grad_data)));
// Retrieve conv_pd from device context
auto conv_pd =
std::static_pointer_cast<mkldnn::convolution_forward::primitive_desc>(
......@@ -198,10 +201,12 @@ class ConvMKLDNNGradOpKernel : public paddle::framework::OpKernel<T> {
mkldnn_engine);
// create memory
auto diff_weights_memory = mkldnn::memory(
{diff_weights_md, mkldnn_engine}, (void*)filter_grad_data);
auto diff_weights_memory =
mkldnn::memory({diff_weights_md, mkldnn_engine},
reinterpret_cast<void*>(filter_grad_data));
auto src_memory =
mkldnn::memory({src_md, mkldnn_engine}, (void*)input_data);
mkldnn::memory({src_md, mkldnn_engine},
reinterpret_cast<void*>(const_cast<T*>(input_data)));
// create backward conv primitive for weights
auto conv_bwd_weights_prim = mkldnn::convolution_backward_weights(
......@@ -220,10 +225,12 @@ class ConvMKLDNNGradOpKernel : public paddle::framework::OpKernel<T> {
strides, paddings, *conv_pd, mkldnn_engine);
// create memory
auto diff_src_memory =
mkldnn::memory({diff_src_md, mkldnn_engine}, (void*)input_grad_data);
auto diff_src_memory = mkldnn::memory(
{diff_src_md, mkldnn_engine},
reinterpret_cast<void*>(const_cast<T*>(input_grad_data)));
auto weights_memory =
mkldnn::memory({weights_md, mkldnn_engine}, (void*)filter_data);
mkldnn::memory({weights_md, mkldnn_engine},
reinterpret_cast<void*>(const_cast<T*>(filter_data)));
// create backward conv primitive for data
auto conv_bwd_data_prim = mkldnn::convolution_backward_data(
......
......@@ -335,14 +335,17 @@ framework::OpKernelType ConvOpGrad::GetExpectedKernelType(
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP(conv2d, ops::ConvOp, ops::Conv2DOpMaker, conv2d_grad,
ops::ConvOpGrad);
REGISTER_OPERATOR(conv2d, ops::ConvOp, ops::Conv2DOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(conv2d_grad, ops::ConvOpGrad);
// depthwise convolution op
REGISTER_OP(depthwise_conv2d, ops::ConvOp, ops::Conv2DOpMaker,
depthwise_conv2d_grad, ops::ConvOpGrad);
REGISTER_OP(conv3d, ops::ConvOp, ops::Conv3DOpMaker, conv3d_grad,
ops::ConvOpGrad);
REGISTER_OPERATOR(depthwise_conv2d, ops::ConvOp, ops::Conv2DOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(depthwise_conv2d_grad, ops::ConvOpGrad);
REGISTER_OPERATOR(conv3d, ops::ConvOp, ops::Conv3DOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(conv3d_grad, ops::ConvOpGrad);
// depthwise conv kernel
// TODO(xingzhaolong): neon kernel for mobile
......
......@@ -14,6 +14,7 @@ limitations under the License. */
#pragma once
#include <vector>
#include "paddle/fluid/framework/eigen.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/operators/math/depthwise_conv.h"
......@@ -41,9 +42,10 @@ inline int ConvOutputSize(int input_size, int filter_size, int dilation,
return output_size;
}
inline bool IsExpand(std::vector<int64_t>& filter_dim,
std::vector<int>& strides, std::vector<int>& paddings,
std::vector<int>& dilations) {
inline bool IsExpand(const std::vector<int64_t>& filter_dim,
const std::vector<int>& strides,
const std::vector<int>& paddings,
const std::vector<int>& dilations) {
bool filter_1 = true, strides_1 = true, padding_0 = true, dilation_1 = true;
for (size_t j = 0; j < strides.size(); ++j) {
filter_1 = filter_1 && (static_cast<int>(filter_dim[j + 2]) == 1);
......
......@@ -193,8 +193,9 @@ class ConvShiftGradKernel<platform::CPUPlace, T>
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP(conv_shift, ops::ConvShiftOp, ops::ConvShiftOpMaker,
conv_shift_grad, ops::ConvShiftGradOp);
REGISTER_OPERATOR(conv_shift, ops::ConvShiftOp, ops::ConvShiftOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(conv_shift_grad, ops::ConvShiftGradOp);
REGISTER_OP_CPU_KERNEL(conv_shift,
ops::ConvShiftKernel<paddle::platform::CPUPlace, float>);
REGISTER_OP_CPU_KERNEL(
......
......@@ -298,8 +298,10 @@ framework::OpKernelType ConvTransposeOpGrad::GetExpectedKernelType(
namespace ops = paddle::operators;
REGISTER_OP(conv2d_transpose, ops::ConvTransposeOp, ops::Conv2DTransposeOpMaker,
conv2d_transpose_grad, ops::ConvTransposeOpGrad);
REGISTER_OPERATOR(conv2d_transpose, ops::ConvTransposeOp,
ops::Conv2DTransposeOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(conv2d_transpose_grad, ops::ConvTransposeOpGrad);
REGISTER_OP_CPU_KERNEL(
conv2d_transpose,
......@@ -311,8 +313,10 @@ REGISTER_OP_CPU_KERNEL(
ops::GemmConvTransposeGradKernel<paddle::platform::CPUDeviceContext,
double>);
REGISTER_OP(conv3d_transpose, ops::ConvTransposeOp, ops::Conv3DTransposeOpMaker,
conv3d_transpose_grad, ops::ConvTransposeOpGrad);
REGISTER_OPERATOR(conv3d_transpose, ops::ConvTransposeOp,
ops::Conv3DTransposeOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(conv3d_transpose_grad, ops::ConvTransposeOpGrad);
REGISTER_OP_CPU_KERNEL(
conv3d_transpose,
......
......@@ -153,8 +153,9 @@ class CosSimOpGrad : public framework::OperatorWithKernel {
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP(cos_sim, ops::CosSimOp, ops::CosSimOpMaker, cos_sim_grad,
ops::CosSimOpGrad);
REGISTER_OPERATOR(cos_sim, ops::CosSimOp, ops::CosSimOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(cos_sim_grad, ops::CosSimOpGrad);
REGISTER_OP_CPU_KERNEL(
cos_sim, ops::CosSimKernel<paddle::platform::CPUDeviceContext, float>);
REGISTER_OP_CPU_KERNEL(
......
......@@ -153,7 +153,9 @@ class CropOpGrad : public framework::OperatorWithKernel {
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP(crop, ops::CropOp, ops::CropOpMaker, crop_grad, ops::CropOpGrad);
REGISTER_OPERATOR(crop, ops::CropOp, ops::CropOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(crop_grad, ops::CropOpGrad);
REGISTER_OP_CPU_KERNEL(crop, ops::CropKernel<float>);
REGISTER_OP_CPU_KERNEL(
crop_grad, ops::CropGradKernel<paddle::platform::CPUDeviceContext, float>);
......@@ -164,8 +164,9 @@ or not. But the output only shares the LoD information with input X.
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP(cross_entropy, ops::CrossEntropyOp, ops::CrossEntropyOpMaker,
cross_entropy_grad, ops::CrossEntropyGradientOp);
REGISTER_OPERATOR(cross_entropy, ops::CrossEntropyOp, ops::CrossEntropyOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(cross_entropy_grad, ops::CrossEntropyGradientOp);
REGISTER_OP_CPU_KERNEL(cross_entropy, ops::CrossEntropyOpKernel<float>,
ops::CrossEntropyOpKernel<double>);
REGISTER_OP_CPU_KERNEL(cross_entropy_grad,
......
......@@ -79,4 +79,4 @@ using CPU = paddle::platform::CPUDeviceContext;
REGISTER_OPERATOR(cumsum, ops::CumOp, ops::CumsumOpMaker, ops::CumsumGradMaker);
REGISTER_OP_CPU_KERNEL(cumsum, ops::CumKernel<CPU, ops::CumsumFunctor<float>>,
ops::CumKernel<CPU, ops::CumsumFunctor<double>>,
ops::CumKernel<CPU, ops::CumsumFunctor<int>>)
ops::CumKernel<CPU, ops::CumsumFunctor<int>>);
......@@ -19,4 +19,4 @@ using CUDA = paddle::platform::CUDADeviceContext;
REGISTER_OP_CUDA_KERNEL(cumsum, ops::CumKernel<CUDA, ops::CumsumFunctor<float>>,
ops::CumKernel<CUDA, ops::CumsumFunctor<double>>,
ops::CumKernel<CUDA, ops::CumsumFunctor<int>>)
ops::CumKernel<CUDA, ops::CumsumFunctor<int>>);
......@@ -60,7 +60,7 @@ class RequestSend final : public RequestBase {
framework::Scope* scope, ReceivedQueue* queue,
const platform::DeviceContext* dev_ctx)
: RequestBase(service, cq, dev_ctx), queue_(queue), responder_(&ctx_) {
request_.reset(new VariableResponse(scope, dev_ctx_));
request_.reset(new VariableResponse(false, scope, dev_ctx_));
int method_id = static_cast<int>(detail::GrpcMethod::kSendVariable);
service_->RequestAsyncUnary(method_id, &ctx_, request_.get(), &responder_,
cq_, cq_, this);
......@@ -146,7 +146,7 @@ class RequestPrefetch final : public RequestBase {
executor_(executor),
program_(program),
prefetch_ctx_(prefetch_ctx) {
request_.reset(new VariableResponse(scope, dev_ctx_));
request_.reset(new VariableResponse(false, scope, dev_ctx_));
int method_id = static_cast<int>(detail::GrpcMethod::kPrefetchVariable);
service_->RequestAsyncUnary(method_id, &ctx_, request_.get(), &responder_,
cq_, cq_, this);
......
......@@ -82,7 +82,7 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var,
platform::CPUPlace cpu;
auto& gpu_dev_ctx =
static_cast<const platform::CUDADeviceContext&>(ctx);
auto copy_size = tensor.memory_size();
auto copy_size = tensor.numel() * framework::SizeOfType(tensor.type());
payload = memory::Alloc(cpu, copy_size);
memory::Copy(cpu, payload,
......@@ -99,7 +99,7 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var,
} else {
payload = tensor.data<void>();
}
payload_size = tensor.memory_size();
payload_size = tensor.numel() * framework::SizeOfType(tensor.type());
e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber, payload_size);
} break;
case framework::proto::VarType_Type_SELECTED_ROWS: {
......@@ -118,7 +118,8 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var,
platform::CPUPlace cpu;
auto& gpu_dev_ctx =
static_cast<const platform::CUDADeviceContext&>(ctx);
auto copy_size = tensor->memory_size();
auto copy_size =
tensor->numel() * framework::SizeOfType(tensor->type());
payload = memory::Alloc(cpu, copy_size);
memory::Copy(cpu, payload,
boost::get<platform::CUDAPlace>(tensor->place()),
......@@ -133,7 +134,7 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var,
} else {
payload = slr->mutable_value()->data<void>();
}
payload_size = tensor->memory_size();
payload_size = tensor->numel() * framework::SizeOfType(tensor->type());
e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber, payload_size);
} break;
default:
......@@ -185,7 +186,7 @@ void DeserializeFromByteBuffer(const ::grpc::ByteBuffer& msg,
const platform::DeviceContext& ctx,
const framework::Scope* scope,
framework::Variable** var) {
operators::detail::VariableResponse resp(scope, &ctx);
operators::detail::VariableResponse resp(false, scope, &ctx);
PADDLE_ENFORCE(resp.Parse(msg) == 0, "parse bytebuffer to tensor error!");
*var = resp.GetVar();
}
......
......@@ -84,7 +84,7 @@ void RunSerdeTestSelectedRows(platform::Place place) {
// operators::detail::DeserializeFromByteBuffer(msg, ctx, &var2);
framework::Scope scope;
scope.Var("myvar");
operators::detail::VariableResponse resp(&scope, &ctx);
operators::detail::VariableResponse resp(false, &scope, &ctx);
EXPECT_EQ(resp.Parse(msg), 0);
framework::Variable* var2 = resp.GetVar();
......@@ -171,7 +171,7 @@ void RunTestLodTensor(platform::Place place, int from_type = 0) {
// deserialize zero-copy
framework::Scope scope;
scope.Var("myvar");
operators::detail::VariableResponse resp(&scope, &ctx);
operators::detail::VariableResponse resp(false, &scope, &ctx);
if (from_type == 0) {
EXPECT_EQ(resp.Parse(msg), 0);
} else {
......
......@@ -114,8 +114,7 @@ bool VariableResponse::CopyLodTensorData(
::google::protobuf::io::CodedInputStream* input,
const platform::DeviceContext& ctx, const framework::DDim& dims,
int length) {
auto var = scope_->FindVar(meta_.varname());
auto* tensor = var->GetMutable<framework::LoDTensor>();
auto* tensor = InitVar()->GetMutable<framework::LoDTensor>();
tensor->Resize(dims);
framework::LoD lod;
......@@ -151,8 +150,7 @@ bool VariableResponse::CopySelectRowsTensorData(
::google::protobuf::io::CodedInputStream* input,
const platform::DeviceContext& ctx, const framework::DDim& dims,
int length) {
auto var = scope_->FindVar(meta_.varname());
auto* slr = var->GetMutable<framework::SelectedRows>();
auto* slr = InitVar()->GetMutable<framework::SelectedRows>();
slr->set_height(meta_.slr_height());
auto* tensor = slr->mutable_value();
tensor->Resize(dims);
......@@ -174,8 +172,7 @@ bool VariableResponse::CopySelectRowsTensorData(
bool VariableResponse::CopySelectRowsData(
::google::protobuf::io::CodedInputStream* input,
const platform::DeviceContext& ctx, int length) {
auto var = scope_->FindVar(meta_.varname());
auto* slr = var->GetMutable<framework::SelectedRows>();
auto* slr = InitVar()->GetMutable<framework::SelectedRows>();
slr->mutable_rows()->resize(length /
framework::SizeOfType(typeid(int64_t))); // int64
int64_t* rows_data = slr->mutable_rows()->data();
......
......@@ -36,11 +36,13 @@ namespace detail {
class VariableResponse {
public:
VariableResponse(const framework::Scope* scope,
VariableResponse(bool use_local_scope, const framework::Scope* scope,
const platform::DeviceContext* dev_ctx)
: scope_(scope), dev_ctx_(dev_ctx) {}
: use_local_scope_(use_local_scope), scope_(scope), dev_ctx_(dev_ctx) {
local_scope_ = &scope->NewScope();
}
virtual ~VariableResponse() {}
virtual ~VariableResponse() { scope_->DeleteScope(local_scope_); }
// return:
// 0:ok.
......@@ -54,11 +56,25 @@ class VariableResponse {
// other: number of error field.
int Parse(const ::grpc::ByteBuffer& byte_buffer);
const framework::Scope& GetLocalScope() const { return *local_scope_; }
inline std::string Varname() { return meta_.varname(); }
inline std::string OutVarname() { return meta_.out_varname(); }
// should call parse first.
framework::Variable* GetVar() { return scope_->FindVar(meta_.varname()); }
framework::Variable* GetVar() {
return local_scope_->FindVar(meta_.varname());
}
framework::Variable* InitVar() {
if (use_local_scope_) {
bool has_var = (scope_->FindVar(meta_.varname()) != nullptr);
PADDLE_ENFORCE(has_var);
return local_scope_->Var(meta_.varname());
} else {
return scope_->FindVar(meta_.varname());
}
}
private:
bool CopySelectRowsTensorData(::google::protobuf::io::CodedInputStream* input,
......@@ -73,7 +89,9 @@ class VariableResponse {
const framework::DDim& dims, int length);
private:
bool use_local_scope_ = false;
const framework::Scope* scope_;
framework::Scope* local_scope_ = nullptr;
const platform::DeviceContext* dev_ctx_;
// only Skeleton
sendrecv::VariableMessage meta_;
......
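VariableResponse above now builds a per-request child scope in its constructor, deserializes incoming variables into it (or into the shared scope when use_local_scope_ is false), and deletes the child in its destructor, so one RPC's variables cannot leak into the shared scope. A standalone sketch of that per-request lifetime (Scope and Response are simplified stand-ins, not the Paddle or gRPC types):

struct Scope {  // stand-in for framework::Scope
  Scope* NewScope() const { return new Scope(); }  // the real Scope records the child in its kids list
  void DeleteScope(Scope* s) const { delete s; }   // mirrors the const DeleteScope added earlier in this commit
};

class Response {  // stand-in for detail::VariableResponse (copying not handled in this sketch)
 public:
  explicit Response(const Scope* shared)
      : shared_(shared), local_(shared->NewScope()) {}
  ~Response() { shared_->DeleteScope(local_); }  // per-request cleanup
  Scope* local() const { return local_; }

 private:
  const Scope* shared_;
  Scope* local_;
};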
......@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/detection_map_op.h"
#include <string>
namespace paddle {
namespace operators {
......
......@@ -13,6 +13,11 @@ See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>
#include "paddle/fluid/framework/eigen.h"
#include "paddle/fluid/framework/op_registry.h"
......@@ -82,7 +87,7 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
std::vector<std::map<int, std::vector<Box>>> gt_boxes;
std::vector<std::map<int, std::vector<std::pair<T, Box>>>> detect_boxes;
GetBoxes(*in_label, *in_detect, gt_boxes, detect_boxes);
GetBoxes(*in_label, *in_detect, &gt_boxes, detect_boxes);
std::map<int, int> label_pos_count;
std::map<int, std::vector<std::pair<T, int>>> true_pos;
......@@ -95,20 +100,20 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
}
if (in_pos_count != nullptr && state) {
GetInputPos(*in_pos_count, *in_true_pos, *in_false_pos, label_pos_count,
true_pos, false_pos, class_num);
GetInputPos(*in_pos_count, *in_true_pos, *in_false_pos, &label_pos_count,
&true_pos, &false_pos, class_num);
}
CalcTrueAndFalsePositive(gt_boxes, detect_boxes, evaluate_difficult,
overlap_threshold, label_pos_count, true_pos,
false_pos);
overlap_threshold, &label_pos_count, &true_pos,
&false_pos);
int background_label = ctx.Attr<int>("background_label");
T map = CalcMAP(ap_type, label_pos_count, true_pos, false_pos,
background_label);
GetOutputPos(ctx, label_pos_count, true_pos, false_pos, *out_pos_count,
*out_true_pos, *out_false_pos, class_num);
GetOutputPos(ctx, label_pos_count, true_pos, false_pos, out_pos_count,
out_true_pos, out_false_pos, class_num);
T* map_data = out_map->mutable_data<T>(ctx.GetPlace());
map_data[0] = map;
......@@ -155,7 +160,7 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
void GetBoxes(const framework::LoDTensor& input_label,
const framework::LoDTensor& input_detect,
std::vector<std::map<int, std::vector<Box>>>& gt_boxes,
std::vector<std::map<int, std::vector<Box>>>* gt_boxes,
std::vector<std::map<int, std::vector<std::pair<T, Box>>>>&
detect_boxes) const {
auto labels = framework::EigenTensor<T, 2>::From(input_label);
......@@ -179,7 +184,7 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
box.is_difficult = true;
boxes[label].push_back(box);
}
gt_boxes.push_back(boxes);
gt_boxes->push_back(boxes);
}
auto detect_index = detect_lod[0];
......@@ -200,9 +205,9 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
const std::map<int, int>& label_pos_count,
const std::map<int, std::vector<std::pair<T, int>>>& true_pos,
const std::map<int, std::vector<std::pair<T, int>>>& false_pos,
framework::Tensor& output_pos_count,
framework::LoDTensor& output_true_pos,
framework::LoDTensor& output_false_pos, const int class_num) const {
framework::Tensor* output_pos_count,
framework::LoDTensor* output_true_pos,
framework::LoDTensor* output_false_pos, const int class_num) const {
int true_pos_count = 0;
int false_pos_count = 0;
for (auto it = true_pos.begin(); it != true_pos.end(); ++it) {
......@@ -214,12 +219,12 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
false_pos_count += fp.size();
}
int* pos_count_data = output_pos_count.mutable_data<int>(
int* pos_count_data = output_pos_count->mutable_data<int>(
framework::make_ddim({class_num, 1}), ctx.GetPlace());
T* true_pos_data = output_true_pos.mutable_data<T>(
T* true_pos_data = output_true_pos->mutable_data<T>(
framework::make_ddim({true_pos_count, 2}), ctx.GetPlace());
T* false_pos_data = output_false_pos.mutable_data<T>(
T* false_pos_data = output_false_pos->mutable_data<T>(
framework::make_ddim({false_pos_count, 2}), ctx.GetPlace());
true_pos_count = 0;
false_pos_count = 0;
......@@ -261,21 +266,21 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
framework::LoD false_pos_lod;
false_pos_lod.emplace_back(false_pos_starts);
output_true_pos.set_lod(true_pos_lod);
output_false_pos.set_lod(false_pos_lod);
output_true_pos->set_lod(true_pos_lod);
output_false_pos->set_lod(false_pos_lod);
return;
}
void GetInputPos(const framework::Tensor& input_pos_count,
const framework::LoDTensor& input_true_pos,
const framework::LoDTensor& input_false_pos,
std::map<int, int>& label_pos_count,
std::map<int, std::vector<std::pair<T, int>>>& true_pos,
std::map<int, std::vector<std::pair<T, int>>>& false_pos,
std::map<int, int>* label_pos_count,
std::map<int, std::vector<std::pair<T, int>>>* true_pos,
std::map<int, std::vector<std::pair<T, int>>>* false_pos,
const int class_num) const {
const int* pos_count_data = input_pos_count.data<int>();
for (int i = 0; i < class_num; ++i) {
label_pos_count[i] = pos_count_data[i];
(*label_pos_count)[i] = pos_count_data[i];
}
auto SetData = [](const framework::LoDTensor& pos_tensor,
......@@ -291,8 +296,8 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
}
};
SetData(input_true_pos, true_pos);
SetData(input_false_pos, false_pos);
SetData(input_true_pos, *true_pos);
SetData(input_false_pos, *false_pos);
return;
}
......@@ -301,9 +306,9 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
const std::vector<std::map<int, std::vector<std::pair<T, Box>>>>&
detect_boxes,
bool evaluate_difficult, float overlap_threshold,
std::map<int, int>& label_pos_count,
std::map<int, std::vector<std::pair<T, int>>>& true_pos,
std::map<int, std::vector<std::pair<T, int>>>& false_pos) const {
std::map<int, int>* label_pos_count,
std::map<int, std::vector<std::pair<T, int>>>* true_pos,
std::map<int, std::vector<std::pair<T, int>>>* false_pos) const {
int batch_size = gt_boxes.size();
for (int n = 0; n < batch_size; ++n) {
auto image_gt_boxes = gt_boxes[n];
......@@ -320,10 +325,10 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
continue;
}
int label = it->first;
if (label_pos_count.find(label) == label_pos_count.end()) {
label_pos_count[label] = count;
if (label_pos_count->find(label) == label_pos_count->end()) {
(*label_pos_count)[label] = count;
} else {
label_pos_count[label] += count;
(*label_pos_count)[label] += count;
}
}
}
......@@ -338,8 +343,8 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
int label = it->first;
for (size_t i = 0; i < pred_boxes.size(); ++i) {
auto score = pred_boxes[i].first;
true_pos[label].push_back(std::make_pair(score, 0));
false_pos[label].push_back(std::make_pair(score, 1));
(*true_pos)[label].push_back(std::make_pair(score, 0));
(*false_pos)[label].push_back(std::make_pair(score, 1));
}
}
continue;
......@@ -351,8 +356,8 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
if (image_gt_boxes.find(label) == image_gt_boxes.end()) {
for (size_t i = 0; i < pred_boxes.size(); ++i) {
auto score = pred_boxes[i].first;
true_pos[label].push_back(std::make_pair(score, 0));
false_pos[label].push_back(std::make_pair(score, 1));
(*true_pos)[label].push_back(std::make_pair(score, 0));
(*false_pos)[label].push_back(std::make_pair(score, 1));
}
continue;
}
......@@ -381,17 +386,17 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
(!evaluate_difficult && !matched_bboxes[max_idx].is_difficult);
if (match_evaluate_difficult) {
if (!visited[max_idx]) {
true_pos[label].push_back(std::make_pair(score, 1));
false_pos[label].push_back(std::make_pair(score, 0));
(*true_pos)[label].push_back(std::make_pair(score, 1));
(*false_pos)[label].push_back(std::make_pair(score, 0));
visited[max_idx] = true;
} else {
true_pos[label].push_back(std::make_pair(score, 0));
false_pos[label].push_back(std::make_pair(score, 1));
(*true_pos)[label].push_back(std::make_pair(score, 0));
(*false_pos)[label].push_back(std::make_pair(score, 1));
}
}
} else {
true_pos[label].push_back(std::make_pair(score, 0));
false_pos[label].push_back(std::make_pair(score, 1));
(*true_pos)[label].push_back(std::make_pair(score, 0));
(*false_pos)[label].push_back(std::make_pair(score, 1));
}
}
}
......
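The detection_map_op.h changes above are a mechanical refactor: every output argument that used to be a mutable reference (`std::map<...>&`, `framework::Tensor&`) is now a pointer, so call sites must pass `&foo` and the mutation is visible at a glance, in line with the Google C++ style guide the project follows. A minimal, self-contained illustration of the same convention (not code from this commit):

```cpp
#include <map>
#include <vector>

// Before: void Accumulate(const std::vector<int>& labels, std::map<int, int>& counts);
// After:  outputs are pointers and are dereferenced explicitly, as in
//         (*label_pos_count)[label] += count; above.
void Accumulate(const std::vector<int>& labels, std::map<int, int>* counts) {
  for (int label : labels) {
    (*counts)[label] += 1;
  }
}

int main() {
  std::map<int, int> counts;
  Accumulate({1, 1, 3}, &counts);  // the '&' marks the mutated argument
  return counts[1] == 2 ? 0 : 1;   // exits 0 on success
}
```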
......@@ -101,8 +101,9 @@ class DropoutOpGrad : public framework::OperatorWithKernel {
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP(dropout, ops::DropoutOp, ops::DropoutOpMaker, dropout_grad,
ops::DropoutOpGrad);
REGISTER_OPERATOR(dropout, ops::DropoutOp, ops::DropoutOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>);
REGISTER_OPERATOR(dropout_grad, ops::DropoutOpGrad);
REGISTER_OP_CPU_KERNEL(
dropout, ops::CPUDropoutKernel<paddle::platform::CPUDeviceContext, float>);
REGISTER_OP_CPU_KERNEL(
......
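The registration change above splits the old paired `REGISTER_OP` macro into two `REGISTER_OPERATOR` calls and lets `DefaultGradOpDescMaker<true>` generate the gradient op description. A hedged sketch of the same pattern applied to a made-up operator (the `my_op` names below are illustrative only):

```cpp
namespace ops = paddle::operators;
// Forward op, its maker, and an auto-generated grad-op descriptor.
REGISTER_OPERATOR(my_op, ops::MyOp, ops::MyOpMaker,
                  paddle::framework::DefaultGradOpDescMaker<true>);
// Gradient op registered separately.
REGISTER_OPERATOR(my_op_grad, ops::MyOpGrad);
```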
......@@ -24,21 +24,11 @@ namespace paddle {
namespace operators {
template <typename T>
__global__ void RandomGenerator(const size_t n, const int seed,
const float dropout_prob, const T* src,
T* mask_data, T* dst) {
thrust::minstd_rand rng;
rng.seed(seed);
thrust::uniform_real_distribution<float> dist(0, 1);
__global__ void RandomGenerator(const size_t n, const T* src,
const T* cpu_mask_data, T* mask_data, T* dst) {
int idx = blockDim.x * blockIdx.x + threadIdx.x;
for (; idx < n; idx += blockDim.x * gridDim.x) {
rng.discard(idx);
if (dist(rng) < dropout_prob) {
mask_data[idx] = static_cast<T>(0);
} else {
mask_data[idx] = static_cast<T>(1);
}
mask_data[idx] = cpu_mask_data[idx];
dst[idx] = mask_data[idx] * src[idx];
}
}
......@@ -66,15 +56,27 @@ class GPUDropoutKernel : public framework::OpKernel<T> {
std::random_device rnd;
int seed =
context.Attr<bool>("fix_seed") ? context.Attr<int>("seed") : rnd();
std::minstd_rand engine;
engine.seed(seed);
std::uniform_real_distribution<float> dist(0, 1);
framework::Vector<T> cpu_mask(size);
for (size_t i = 0; i < size; ++i) {
if (dist(engine) < dropout_prob) {
cpu_mask[i] = static_cast<T>(0);
} else {
cpu_mask[i] = static_cast<T>(1);
}
}
int threads = 512;
int grid = (x->numel() + threads - 1) / threads;
RandomGenerator<
T><<<grid, threads, 0, context.cuda_device_context().stream()>>>(
size, seed, dropout_prob, x_data, mask_data, y_data);
size, x_data, cpu_mask.CUDAData(context.GetPlace()), mask_data,
y_data);
} else {
auto X = EigenMatrix<T>::Reshape(*x, 1);
auto Y = EigenMatrix<T>::Reshape(*y, 1);
auto X = EigenVector<T>::Flatten(*x);
auto Y = EigenVector<T>::Flatten(*y);
Y.device(place) = X * static_cast<T>(1.0f - dropout_prob);
}
}
......@@ -87,6 +89,8 @@ namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_CUDA_KERNEL(
dropout, ops::GPUDropoutKernel<plat::CUDADeviceContext, float>,
ops::GPUDropoutKernel<plat::CUDADeviceContext, double>,
ops::GPUDropoutKernel<plat::CUDADeviceContext, plat::float16>);
REGISTER_OP_CUDA_KERNEL(dropout_grad,
ops::DropoutGradKernel<plat::CUDADeviceContext, double>,
ops::DropoutGradKernel<plat::CUDADeviceContext, float>);
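In the dropout_op.cu rewrite above, the per-thread Thrust RNG is removed: the 0/1 mask is drawn on the host with `std::minstd_rand` and handed to the kernel through `cpu_mask.CUDAData(...)`, so the result for a fixed seed no longer depends on the CUDA launch configuration. A standalone sketch of the host-side mask generation (plain C++, no Paddle or CUDA dependencies):

```cpp
#include <random>
#include <vector>

// Each element is dropped with probability dropout_prob, kept otherwise.
std::vector<float> MakeDropoutMask(size_t n, float dropout_prob, int seed) {
  std::minstd_rand engine(seed);                     // same engine as the op
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);
  std::vector<float> mask(n);
  for (size_t i = 0; i < n; ++i) {
    mask[i] = (dist(engine) < dropout_prob) ? 0.0f : 1.0f;
  }
  return mask;  // the op copies this buffer to the device before the kernel runs
}
```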
......@@ -24,7 +24,7 @@ namespace operators {
using Tensor = framework::Tensor;
template <typename T, int MajorType = Eigen::RowMajor,
typename IndexType = Eigen::DenseIndex>
using EigenMatrix = framework::EigenMatrix<T, MajorType, IndexType>;
using EigenVector = framework::EigenVector<T, MajorType, IndexType>;
template <typename DeviceContext, typename T>
class CPUDropoutKernel : public framework::OpKernel<T> {
......@@ -60,8 +60,8 @@ class CPUDropoutKernel : public framework::OpKernel<T> {
}
}
} else {
auto X = EigenMatrix<T>::Reshape(*x, 1);
auto Y = EigenMatrix<T>::Reshape(*y, 1);
auto X = EigenVector<T>::Flatten(*x);
auto Y = EigenVector<T>::Flatten(*y);
auto& place =
*context.template device_context<DeviceContext>().eigen_device();
Y.device(place) = X * (1.0f - dropout_prob);
......@@ -81,9 +81,9 @@ class DropoutGradKernel : public framework::OpKernel<T> {
auto* mask = context.Input<Tensor>("Mask");
grad_x->mutable_data<T>(context.GetPlace());
auto M = EigenMatrix<T>::Reshape(*mask, 1);
auto dX = EigenMatrix<T>::Reshape(*grad_x, 1);
auto dY = EigenMatrix<T>::Reshape(*grad_y, 1);
auto M = EigenVector<T>::Flatten(*mask);
auto dX = EigenVector<T>::Flatten(*grad_x);
auto dY = EigenVector<T>::Flatten(*grad_y);
auto& place =
*context.template device_context<DeviceContext>().eigen_device();
......
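The dropout_op.h hunks above only change how the tensors are viewed: `EigenVector<T>::Flatten` treats every tensor as 1-D, which is all the elementwise dropout math needs. A plain-C++ sketch of that math (training vs. inference forward, plus the masked backward), written without Eigen for clarity; it is an illustration, not the kernel itself:

```cpp
#include <vector>

void DropoutRef(const std::vector<float>& x, const std::vector<float>& mask,
                const std::vector<float>& dy, float dropout_prob, bool is_test,
                std::vector<float>* y, std::vector<float>* dx) {
  for (size_t i = 0; i < x.size(); ++i) {
    // Training: multiply by the 0/1 mask. Inference: scale by keep probability.
    (*y)[i] = is_test ? x[i] * (1.0f - dropout_prob) : x[i] * mask[i];
    // Gradient flows only through elements that were kept.
    (*dx)[i] = dy[i] * mask[i];
  }
}
```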
(Diffs for the remaining changed files are collapsed in this view; several of those files also changed mode from 100755 to 100644.)