diff --git a/doc/api/v2/config/networks.rst b/doc/api/v2/config/networks.rst
index 6e813ab1a820d068ea3e54cad6178f1cf928eadc..048379cf01f4aec5e73e2fe3ddfa728f3c17a5d1 100644
--- a/doc/api/v2/config/networks.rst
+++ b/doc/api/v2/config/networks.rst
@@ -125,3 +125,8 @@ simple_attention
     :members: simple_attention
     :noindex:
 
+dot_product_attention
+---------------------
+..  automodule:: paddle.v2.networks
+    :members: dot_product_attention
+    :noindex:
diff --git a/doc/design/executor.md b/doc/design/executor.md
new file mode 100644
index 0000000000000000000000000000000000000000..b5fb6c5c3c1da3c112ce63878322083dd5c42b70
--- /dev/null
+++ b/doc/design/executor.md
@@ -0,0 +1,23 @@
+# Executor Design Doc
+
+## Motivation
+
+We use executor to do the runtime evaluation of a `ProgramDesc`.
+
+## Overview
+
+An executor takes a `ProgramDesc`, a `block_id` and a `Scope`.  The `ProgramDesc` is a list of blocks and each block contains the protobuf definition of all the parameters and operators. The `block_id` specifies the entrance block. And the `Scope` is the container of all the variable instance, which is persistent throughout different runs.
+
+### What does executor do?
+
+It evaluates all the operators in the `block_id`th block of a `ProgramDesc`.
+
+### What does executor NOT do?
+
+It does not do runtime optimization, meaning intelligently parse the dependency of each op a choose which one to be run and in which order they should be run.
+
+It does not do graph partitioning, meaning dividing the `ProgramDesc` into several small pieces and executing them on different devices.
+
+## Implementation
+
+`Executor` evaluates a `ProgramDesc`. Essentially, it instantiates Variables and Operators, then run all the operators in sequence. [[code]](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.cc)
diff --git a/doc/design/images/feed_forward.png b/doc/design/images/feed_forward.png
new file mode 100644
index 0000000000000000000000000000000000000000..d312371a04c26aa6cd196e0bd1f51becb425180b
Binary files /dev/null and b/doc/design/images/feed_forward.png differ
diff --git a/doc/design/images/feed_forward_regularized.png b/doc/design/images/feed_forward_regularized.png
new file mode 100644
index 0000000000000000000000000000000000000000..677e99bfd9f8e72ed9fe4b27127af2ced202f447
Binary files /dev/null and b/doc/design/images/feed_forward_regularized.png differ
diff --git a/doc/design/images/l1_regularization.png b/doc/design/images/l1_regularization.png
new file mode 100644
index 0000000000000000000000000000000000000000..e1b9c7a44f94dc027598a98da93ddb8133190972
Binary files /dev/null and b/doc/design/images/l1_regularization.png differ
diff --git a/doc/design/images/l2_regularization.png b/doc/design/images/l2_regularization.png
new file mode 100644
index 0000000000000000000000000000000000000000..d5c2fcbc2ccae75ad083162e5a2dceb0210be298
Binary files /dev/null and b/doc/design/images/l2_regularization.png differ
diff --git a/doc/design/images/loss_equation.png b/doc/design/images/loss_equation.png
new file mode 100644
index 0000000000000000000000000000000000000000..14212ec8d36c803de96bde8a9a4b5591bd20434e
Binary files /dev/null and b/doc/design/images/loss_equation.png differ
diff --git a/doc/design/infer_var_type.md b/doc/design/infer_var_type.md
new file mode 100644
index 0000000000000000000000000000000000000000..d9d5397becba2ef1806d9341cd49cd9aabbf4a6a
--- /dev/null
+++ b/doc/design/infer_var_type.md
@@ -0,0 +1,78 @@
+# Design Doc: InferVarType
+
+## The Problem Posed
+
+The variable in our design can hold variant types. Such as `LoDTensor` and `SelectedRows`. An operator should be able to inference the variable types of its output.
+
+For example, a `lookup table` operator takes two `LoDTensor`; one is a float tensor as the embedding table, the other is an int tensor as word ID. The gradient operator of `lookup table` will generate a `SelectedRows` as its output. A `sum` operator can take both `LoDTensor` and `SelectedRows` as its inputs and will generate a `LoDTensor` if any of its inputs is `LoDTensor`, otherwise, the `sum` operator will generate `SelectedRows` as its output.
+
+The variable type will be constant at runtime. Every variable's type can either be set by the user (input data and parameter) or be inferred by the operator in compile time.
+
+## Proposed Solution
+
+The `InferVarType` is a compile-time function which is registered to each operator. The inferface of that function is:
+
+
+```c++
+using InferVarTypeFN = std::function<
+    void (const OpDescBind& /*op_desc*/, BlockDescBind* /*block*/)>;
+```
+
+It takes an operator description as its input and will write the output variable type and store them in block description.
+
+The `InferVarTypeFN` will be registered in `OpInfo`, to replace `infer_var_type_` field. The `OpInfo` should be
+
+```cpp
+struct OpInfo {
+  InferVarTypeFN infer_var_type_;
+  ...
+};
+```
+
+The default `InferVarType` will set output type as `LoDTensor`. It can be done by `GetInferVarType()`.
+
+```cpp
+void DefaultInferVarType(const OpDescBind& op_desc, BlockDescBind* block) {
+  // set the output type of variable as `LoDTensor`.
+  // ...
+}
+
+struct OpInfo {
+  InferVarTypeFN infer_var_type_;
+  InferVarTypeFN GetInferVarType() const {
+    if (infer_var_type_) {
+      return infer_var_type_;
+    } else {
+      return DefaultInferVarType;
+    }
+  }
+};
+```
+
+## Register InferVarType
+
+We provide a thin base class for registering an `InferVarTypeFN`. To use a base class will ease the implementation of registry since we can detect the registry entry is an `InferVarTypeFN` or not.
+
+```cpp
+class VarTypeInferer {
+public:
+  virtual void operator()(const OpDescBind& op_desc, BlockDescBind* block) const = 0;
+}
+```
+
+Operator developers can write the specialize `VarTypeInferer` as follow.
+
+```cpp
+class SpecialVarTypeInferer : public VarTypeInferer {
+public:
+  virtual void operator()(const OpDescBind& op_desc, BlockDescBind* block) const {
+    // .. own logic
+  }
+}
+```
+
+Then user can register the `InferVarType` just like `GradOpDescMaker` and `OpInfoMaker`.
+
+```
+REGISTER_OPERATOR(some_op, OpType, SpecialVarTypeInferer, ...);
+```
diff --git a/doc/design/python_api.md b/doc/design/python_api.md
index 56ae1d925a96622b5576013f38e33e5f92cbbb90..cb5fdc765b7126fc66a1c8978d4b96c0dc5a9f2c 100644
--- a/doc/design/python_api.md
+++ b/doc/design/python_api.md
@@ -179,40 +179,104 @@ init_attr={
 
 `optimize_op_attrs` is not in the `VarDesc` message, but kept in the Python instance, as it will be used in the Python space when creating the optimize operator's `OpDesc`, and will be in the `OpDesc` message.
 
-## Layer Functions
+## Layer Function
 
-A layer is a Python function that creates some operators and variables.  Layers simplify the work of application programmers.
+A layer is a Python function that creates some operators and variables. Layers simplify the work of application programmers.
 
-### Data Layer
+Layer functions take `Variable` and configuration parameters as its input and return the output variable(s).
+
+For example, `FullyConnected` take one or more variable as its input. The input could be input data or another layer's output. There are many configuration options for a `FullyConnected` layer, such as layer size, activation, parameter names, initialization strategies of parameters, and so on. The `FullyConnected` layer will return an output variable.
+
+
+### Necessity for reusing code between layer functions
+
+There are a lot of code that can be reused. Such as
+
+* Give the default value of configuration. e.g., default initialize strategy for parameters is uniform random with `min = -1.0`, `max = 1.0`. and default initialize strategy for bias is to fill zero.
+* Append the activation operator.
+* Create a temporary variable.
+* Create parameter.
+* Generate a unique name.
+* Add a bias.
+* ...
+
+A mechanism to reuse code between layer functions is necessary. It will be around [150 lines of code](https://github.com/PaddlePaddle/Paddle/pull/4724/files#diff-823b27e07e93914ada859232ae23f846R12) if we write a `FullyConnected` layer without any helper functions.
+
+
+
+### Comparision between global functions and helper class
+
+The `FullyConnected` layer will be as follow when we provide global functions:
 
 ```python
-def data_layer(name, type, column_name):
-    block = the_current_program.glolal_block()
-    var = block.create_global_var(
-            name=name,
-            shape=[None] + type.dims(),
-            dtype=type.dtype)
-    block.prepend_operator(block,
-                           type="Feed",
-                           inputs = None,
-                           outputs = [var],
-                           {column_name: column_name})
-    return var
+def fc_layer(input, size, param_attr=None, bias_attr=None, act=None, name=None):
+  if name is None:
+    name = unique_name("fc")
+  input = multiple_input(input)
+  param_attr = default_param_attr(param_attr)
+  param_attr = multiple_param_attr(param_attr, len(input))
+
+  # mul
+  mul_results = []
+  for ipt, attr in zip(input, param_attr):
+    shape = ipt.shape[1:] + [size]
+    w = g_program.global_block().create_parameter(shape, ipt.dtype, name, attr)
+    tmp = create_tmp_var(name)
+    g_program.current_block().append_op("mul", {ipt, w}, {tmp})
+  mul_results.append(tmp)
+
+  # add sum
+  ...
+  # add bias
+  ...
+  # add activation
+  ...
+  return out
 ```
 
-The input to the feed operator is a special variable in the global scope, which is the output of [Python readers](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/reader/README.md).
+We can provide many helpers functions for layer developers. However, there are several disadvantages for global helper functions:
+
+1. We need a namespace for these methods, then layer developers can quickly figure out what method they can use.
+2. Global functions will force layer developers to pass its parameter time by time.
+
+So we provide a helper class, `LayerHelper`, to share code between layer functions. The `FullyConnected` Layer will be as follow.
+
+```python
+def fc_layer(input, size, param_attr=None, bias_attr=None, act=None, name=None):
+  helper = LayerHelper(locals())  # pass all parameter to LayerHelper
+
+  mul_results = []
+  for ipt, param in helper.iter_multiple_input_and_param():
+    w = helper.create_parameter(shape=ipt.shape[1:] + [size], dtype = ipt.dtype)
+    tmp = helper.create_tmp_variable()
+    helper.append_op('mul', {ipt, w}, {tmp})
+    mul_results.append(tmp)
+
+  pre_bias = helper.add_sum(mul_results)
+  pre_activation = helper.add_bias(pre_bias)
+  return helper.add_activation(pre_activation)
+```
+
+We not only use the fewer lines of code to write `fc_layer` but also make the code clearer to understand. At the same time, layer developers can figure out what function they can invoke by typing `helper.` in a python editor.
+
+
+### Implementation of layer helper
 
-### FC Layer
+We just keep all parameters of a layer function as a dictionary in layer helper as a private data member. Every method of layer helper will look up the dictionary after it is invoked. In that way, we can implement a layer helper for all layer functions even some layer does not contain some operator. For example, The `activation` is used by the FullyConnected layer or convolution layers, but a cross-entropy layer does not use it. The example code of `add_activation` are:
 
 ```python
-def fc_layer(input, size, ...):
-    block = program.current_block()
-    w = block.create_parameter(...)
-    b = block.create_parameter(...)
-    out = block.create_var()
-    op = block.append_operator("FC", X=input, W=w, b=b, out=out)
-    out.writer = op
-    return out
+class LayerHelper(object):
+  def __init__(self, **kwargs):  # kwargs is short for `keyword arguments`
+    self.kwargs = kwargs
+
+  def add_activation(self, input_var):
+    act = self.kwargs.get("act", None)  # default value is None
+    if act is None:  # do nothing if no act
+      return input_var
+
+    tmp = self.create_tmp_var(self)
+    self.append_op(type=act, input=input_var, output=tmp)
+    return tmp
 ```
 
 ## Optimizer
diff --git a/doc/design/regularization.md b/doc/design/regularization.md
new file mode 100644
index 0000000000000000000000000000000000000000..703a9fbdd4392aa7f44733cce2da19caa1b51e4a
--- /dev/null
+++ b/doc/design/regularization.md
@@ -0,0 +1,103 @@
+# Regularization in PaddlePaddle
+
+## Introduction to Regularization
+A central problem in machine learning is how to design an algorithm that will perform well not just on the training data, but also on new data. Many strategies are used by machine learning practitioners to reduce the test error, possibly at the expense of increased training error. These strategies are collectively known as **regularization**. 
+
+### Parameter Norm Penalties
+Most common regularization approaches in deep learning are based on limiting the capacity of the models by adding a parameter norm penalty to the objective function `J`. This is given as follows:
+
+<img src="./images/loss_equation.png" align="center"/><br/>
+
+The parameter `alpha` is a hyperparameter that weights the relative contribution of the norm penalty term, `omega`, relative to the standard objective function `J`.
+
+The most commonly used norm penalties are the L2 norm penalty and the L1 norm penalty. These are given as follows:
+
+##### L2 Regularization:
+<img src="./images/l2_regularization.png" align="center"/><br/>
+
+##### L1 Regularization
+<img src="./images/l1_regularization.png" align="center"/><br/>
+
+A much more detailed mathematical background of reguilarization can be found [here](http://www.deeplearningbook.org/contents/regularization.html).
+
+
+## How to do Regularization in PaddlePaddle
+
+On surveying existing frameworks like Tensorflow, PyTorch, Caffe, etc, it can be seen that there are 2 common approaches of doing regularization:
+
+1. Making regularization a part of the optimizer using an attribute like `weight_decay` that is used to control the scale of the L2 Penalty. This approach is used in PyTorch as follows:
+	```python
+	opt =  torch.optim.SGD(params, lr=0.2, weight_decay=0.2)
+	```
+    At every optimization step, this code will add the gradient of the L2 Norm of the params to the gradient of the params with respect to the loss function. This can seen in the following code snippet:
+    ```python
+    if weight_decay != 0:
+    	d_p.add_(weight_decay, p.data)
+    ```
+    This is a very restyrictive way of doing regularization and does not give the users enough flexibility. 
+    
+    **Advantages**:
+    -  It is easy to implement for us.
+    -  Faster execution of backward. However, it can be done manually by advanced users too.
+
+	**Disadvantages**:
+    - Not flexible for other regularizations such as L1/L0 regularization.
+    - Does not allow for different regularization coefficient for different parameters. For example, in most models, ony the weight matrices are regularized and the bias vectors are unregularized.
+    - Tightly coupled optimizer and regularization implementation. 
+
+
+2. Adding regularization ops to the graph through Python API. This approach is used by Tensorflow and Caffe. Using this approach, we manually add regularization ops to the graph and then add the regularization loss to the final loss function before sending them to the optimizer.
+
+	**Advantages**:
+    - Allows for greater flexibility to the users of Paddle. Using this approach, the users can put different regularization to different parameters and also choose parameters that are not a part of regularization.
+    - Makes it easy for the users to customize and extend the framework. 
+
+	**Disadvantages**:
+    - Implementation requires comprehensive design and time. 
+
+## Proposal for Regularization in PaddlePaddle
+
+### Low-Level implementation
+
+In the new design, we propose to create new operations for regularization. For now, we can add 2 ops thgat correspond to the most frequently used regularizations:
+- L2_regularization_op
+- L1_regularization_op
+
+These ops can be like any other ops with their own CPU/GPU implementations either using Eigen or separate Cpu and GPU kernels. As the initial implementation, we can implement their kernels using Eigen following the abstraction pattern implemented for [Activation Ops](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/accuracy_op.h). This abstraction pattern can make it very easy to implement new regularization schemes. other than L1 and L2 norm penalties. 
+
+The idea of building ops for regularization is in sync with the refactored Paddle philosophy of using operators to represent any computation unit. The way these ops will be added to the computation graph, will be decided by the [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) in Python API. 
+
+### Computation Graph
+
+Below is an example of a really simple feed forward neural network.
+
+<img src="./images/feed_forward.png" align="center"/><br/>
+
+The Python API will modify this computation graph to add regularization operators. The modified computation graph will look as follows:
+
+<img src="./images/feed_forward_regularized.png" align="center"/><br/>
+   
+### Python API implementation for Regularization
+
+Using the low level ops, `L2_regularization_op` and `L1_regularization_op`, any user can add regularization to their computation graphs. However, this will require a lot of lines of code and we should design Python APIs that support regularization. An example of such an API can be seen in [Keras](https://keras.io/regularizers/). As per the PaddlePaddle [Python API design](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md), the layer functions are responsible for creating operators, operator parameters and variables. Since regularization is a property of parameters, it makes sense to create these in the layer functions. 
+
+#### Creation of Regularization ops
+There are two possibilities for creating the regularization ops:
+1. We create these ops immediately while building the computation graph. 
+2. We add these ops in a lazy manner, just before the backward, similar to the way the optimization ops are added. 
+
+The proposal is to add these ops in a lazy manner just before the backward pass. 
+
+#### Storage of Regularization attributes
+
+Since we want to create the regularization ops in a lazy manner, the regularization attributes (type of regularization and weight of regularization penalty) can be stored as attributes of the [`Parameter`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/framework/framework.py#L421) class. This is because regularization is a property of the parameters and storing regularization properties with Parameters also allows for shared parameters. 
+
+#### High-level API
+
+In PaddlePaddle Python API, users will primarily rely on [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) to create neural network layers. Hence, we lso need to provide regularization functionality in layer functions. The design of these APIs can be postponed for later right now. A good reference for these APIs can be found in [Keras](https://keras.io/regularizers/) and also by looking at Tensorflow in [`tf.contrib.layers`](https://www.tensorflow.org/api_guides/python/contrib.layers).
+
+
+
+
+
+    
diff --git a/doc/design/selected_rows.md b/doc/design/selected_rows.md
index 9e6f3b20cbcdc55e481fbe7bf5fa555d8b3c3d45..1a98839a957612b91b2276b58818623ecc62d1d5 100644
--- a/doc/design/selected_rows.md
+++ b/doc/design/selected_rows.md
@@ -1,6 +1,6 @@
 # Design Doc: Selected Rows
 
-`SelectedRows` is a kind of sparse tensor data type, which is designed to support `embedding` operators. The gradient of embedding table is a sparse tensor. Only a few rows are non-zero values in that tensor. It is straightforward to represent the sparse tensor by the following sparse tensor data structure:
+`SelectedRows` is a type of sparse tensor data type, which is designed to support `embedding` operators. The gradient of embedding table is a sparse tensor. Only a few rows are non-zero values in this tensor. It is straight-forward to represent a sparse tensor by the following sparse tensor data structure:
 
 ```cpp
 class SelectedRows {
@@ -11,7 +11,7 @@ class SelectedRows {
 };
 ```
 
-The field `height_` shows the first dimension of `SelectedRows`. The `rows` are the indices of which rows of `SelectedRows` are non-zeros. The `value_` field is an N-dim tensor and shape is `[rows.size() /* NUM_ROWS */, ...]`, which supplies values for each row. The dimension of `SelectedRows` satisfies `[height_] + value_.shape[1:]`.
+The field `height_` is the first dimension of `SelectedRows`. The `rows` are the indices of the non-zero rows of `SelectedRows`. The `value_` field is an N-dim tensor of shape `[rows.size() /* NUM_ROWS */, ...]`, which supplies values for each row. The dimension of `SelectedRows` satisfies `[height_] + value_.shape[1:]`.
 
 Suppose that a SelectedRows-typed variable `x` has many rows, but only two of them have values -- row 73 is `[1, 2]` and row 84 is `[3, 4]`, the `SelectedRows` representation would be:
 
@@ -25,7 +25,7 @@ x = SelectedRow {
 
 ## SelectedRows in Protobuf
 
-`SelectedRows` is a kind of `Variable`. `VarDesc` in protobuf should describe the `SelectedRows` information. Only the tensor dimension of a `SelectedRows` will be described in compile-time since the `rows_` and `value_` are related to training data. 
+`SelectedRows` is a type of `Variable`. `VarDesc` in protobuf should describe the `SelectedRows` information. Only the tensor dimension of a `SelectedRows` will be described in compile-time because the `rows_` and `value_` are dependent on the training data. 
 So we use `TensorDesc` to unify `data_type` and `dims`. A LodTensorDesc contains a `TensorDesc` and `lod_level`. The description of `SelectedRows` is a Tensor description.
 
 ```proto
@@ -54,7 +54,7 @@ message VarDesc {
 
 ## InferShape for Selected Rows
 
-Just like `LoD` information, `InferShape` method will inference output tensor type as well. The operator should decide whether its output is a `SelectedRows` or `Dense` tensor.
+Just like `LoD` information, `InferShape` method will infer the output tensor type as well. The operator should decide whether its output is a `SelectedRows` or `Dense` tensor.
 
 For example, the gradient operator of `TableLookup` will always generate `SelectedRows`. Its `InferShape` method should be like following
 
@@ -68,7 +68,7 @@ void TableLookupGrad::InferShape(context) {
 
 ## Sparse Operators
 
-There are several operators should be written to support `SelectedRows`. They are:
+There are several operators that need to be written to support `SelectedRows`. These are:
 
-1. Operators which generates `SelectedRows` gradient. e.g. Gradient of `TableLookupOp`.
+1. Operators which generate `SelectedRows` gradient. e.g. Gradient of `TableLookupOp`.
 2. Optimize operators which support `SelectedRows` gradient. e.g. `SGD` or `AdaGrad` for `SelectedRows`. However, there should be only one `SGD` operator. `OpWithKernel::Run` should select a suitable kernel for both `dense` tensor or `SelectedRows`.
diff --git a/doc/howto/cross_compiling/cross_compiling_for_android_cn.md b/doc/howto/cross_compiling/cross_compiling_for_android_cn.md
index 90dc84718c9ce1374cda6022de177afeeb60279d..1fc58c37cc9151d5e4d99b939e30c29aa99e04f1 100644
--- a/doc/howto/cross_compiling/cross_compiling_for_android_cn.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_android_cn.md
@@ -1,9 +1,46 @@
 # 构建Android平台上的PaddlePaddle库
 
-用户可通过交叉编译的方式，在用户熟悉的开发平台（Linux，Mac OS X和Windows）上编译Android平台上适用的PaddlePaddle库。
+用户可通过如下两种方式，交叉编译Android平台上适用的PaddlePaddle库：
+- 基于Docker容器的编译方式
+- 基于Linux交叉编译环境的编译方式
+
+## 基于Docker容器的编译方式
+Docker能在所有主要操作系统（包括Linux，Mac OS X和Windows）上运行，因此，使用基于Docker容器的编译方式，用户可在自己熟悉的开发平台上编译Android平台上适用的PaddlePaddle库。
+
+### 构建PaddlePaddle的Android开发镜像
+我们把PaddlePaddle的交叉编译环境打包成一个镜像，称为开发镜像，里面涵盖了交叉编译Android版PaddlePaddle库需要的所有编译工具。
+
+```bash
+$ git clone https://github.com/PaddlePaddle/Paddle.git
+$ cd Paddle
+$ docker build -t username/paddle-android:dev . -f Dockerfile.android
+```
+
+### 编译PaddlePaddle C-API库
+构建好开发镜像后，即可使用开发镜像来编译Android版PaddlePaddle C-API库。
+Android的Docker开发镜像向用户提供两个可配置的参数：
+
+| Argument        | Optional Values         | Default |
+|-----------------|-------------------------|---------|
+|`ANDROID_ABI`    |`armeabi-v7a, arm64-v8a` | `armeabi-v7a` |
+|`ANDROID_API`    |`>= 21` | `21` |
+
+- 编译`armeabi-v7a`，`Android API 21`的PaddlePaddle库
+```bash
+$ docker run -it --rm -v $PWD:/paddle -e "ANDROID_ABI=armeabi-v7a" -e "ANDROID_API=21" username/paddle-android:dev
+```
+
+- 编译`arm64-v8a`，`Android API 21`的PaddlePaddle库
+```bash
+$ docker run -it --rm -v $PWD:/paddle -e "ANDROID_ABI=arm64-v8a" -e "ANDROID_API=21" username/paddle-android:dev
+```
+
+执行上述`docker run`命令时，容器默认执行[paddle/scripts/docker/build_android.sh](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build_android.sh)脚本。该脚本中记录了交叉编译Android版PaddlePaddle库常用的CMake配置，并且会根据`ANDROID_ABI`和`ANDROID_API`自动构建独立工具链、进行编译和安装。由于arm64架构要求Android API不小于21。因此当`ANDROID_ABI=arm64-v8a`，`ANDROID_API<21`时，Docker容器中将默认使用`Android API 21`的编译工具链。用户可以参考下文**配置交叉编译参数**章节，根据个人的需求修改定制Docker容器所执行的脚本。编译安装结束之后，PaddlePaddle的C-API库将被安装到`$PWD/install_android`目录，所依赖的第三方库同时也被安装到`$PWD/install_android/third_party`目录。
+
+## 基于Linux交叉编译环境的编译方式
 本文档将以Linux x86-64平台为例，介绍交叉编译Android平台上适用的PaddlePaddle库的方法和步骤。
 
-## 准备交叉编译环境
+### 准备交叉编译环境
 
 从源码交叉编译PaddlePaddle，用户需要提前准备好交叉编译环境。Android平台上使用的C/C++交叉编译工具链为[Android NDK](https://developer.android.com/ndk/downloads/index.html?hl=zh-cn)，用户可自行前往下载预编译好的版本，也可通过以下命令获取：
 
@@ -13,18 +50,27 @@ unzip -q android-ndk-r14b-linux-x86_64.zip
 ```
 
 Android NDK中包含了所有Android API级别、所有架构（arm/arm64/x86/mips）需要用到的编译工具和系统库。用户可根据自己的编译目标架构、所需支持的最低Android API级别，构建[独立工具链](https://developer.android.google.cn/ndk/guides/standalone_toolchain.html?hl=zh-cn)。
-比如：
+
+- 构建`armeabi-v7a`、 `Android API 21`的独立工具链：
 
 ```bash
 your/path/to/android-ndk-r14b-linux-x86_64/build/tools/make-standalone-toolchain.sh \
-        --arch=arm --platform=android-21 --install-dir=your/path/to/my_standalone_toolchain
+        --arch=arm --platform=android-21 --install-dir=your/path/to/arm_standalone_toolchain
 ```
 
-此命令将在your/path/to/my_standalone_toolchain目录生成一套编译工具链，面向架构为32位ARM架构，支持的最小的Android API级别为21，使用的编译器为arm-linux-androideabi-gcc (GCC) 4.9。
+此命令将在`your/path/to/arm_standalone_toolchain`目录生成一套独立编译工具链，面向架构为32位ARM架构，支持的最小的Android API级别为21，支持编译器`arm-linux-androideabi-gcc (GCC) 4.9`和`clang 3.8`。
 
-注意：**PaddlePaddle要求使用的编译工具链所支持的Andoid API级别不小于21**。
+- 构建`arm64-v8a`、 `Android API 21`的独立工具链：
+```bash
+your/path/to/android-ndk-r14b-linux-x86_64/build/tools/make-standalone-toolchain.sh \
+        --arch=arm64 --platform=android-21 --install-dir=your/path/to/arm64_standalone_toolchain
+```
 
-## 配置交叉编译参数
+此命令将在`your/path/to/arm64_standalone_toolchain`目录生成一套独立编译工具链，面向架构为64位ARM64架构，支持的最小Android API级别为21，支持编译器`arm-linux-androideabi-gcc (GCC) 4.9`和`clang 3.8`。
+
+注意：**PaddlePaddle要求使用的编译工具链所支持的Android API级别不小于21**。
+
+### 配置交叉编译参数
 
 CMake系统对交叉编译提供了支持[cmake-toolchains](https://cmake.org/cmake/help/v3.0/manual/cmake-toolchains.7.html#cross-compiling)。为了简化cmake配置，PaddlePaddle为交叉编译提供了工具链配置文档[cmake/cross_compiling/android.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/android.cmake)，以提供一些默认的编译器和编译参数相关配置。注意，从CMake 3.7版本开始，CMake官方对Android平台的交叉编译提供了通用的支持。PaddlePaddle若检测到用户使用的CMake版本不低于3.7时，将会将用户传进来的配置参数传递CMake系统，交由CMake系统本身来处理。有关参数配置的详细说明见[cmake-toolchains](https://cmake.org/cmake/help/v3.7/manual/cmake-toolchains.7.html#cross-compiling)。
 
@@ -36,32 +82,57 @@ CMake系统对交叉编译提供了支持[cmake-toolchains](https://cmake.org/cm
 Android平台可选配置参数：
 
 - `ANDROID_STANDALONE_TOOLCHAIN`，独立工具链所在的绝对路径，或者相对于构建目录的相对路径。PaddlePaddle的CMake系统将根据该值自动推导和设置需要使用的交叉编译器、sysroot、以及Android API级别；否则，用户需要在cmake时手动设置这些值。无默认值。
-- `ANDROID_ABI`，目标架构ABI。目前只支持`armeabi-v7a`，默认值为`armeabi-v7a`。
+- `ANDROID_TOOLCHAIN`，目标工具链。可设置`gcc/clang`，默认值为`clang`。
+	- CMake 3.7以上，将会始终使用`clang`工具链；CMake 3.7以下，可设置`ANDROID_TOOLCHAIN=gcc`以使用`gcc`工具链。
+	- Android官方提供的`clang`编译器要求系统支持`GLIBC 2.15`以上。
+- `ANDROID_ABI`，目标架构ABI。目前支持`armeabi-v7a`和`arm64-v8a`，默认值为`armeabi-v7a`。
 - `ANDROID_NATIVE_API_LEVEL`，工具链的Android API级别。若没有显式设置，PaddlePaddle将根据`ANDROID_STANDALONE_TOOLCHAIN`的值自动推导得到。
-- `ANROID_ARM_MODE`，是否使用ARM模式。可设置`ON/OFF`，默认值为`ON`。
-- `ANDROID_ARM_NEON`，是否使用NEON指令。目前必须设置成`ON`，默认值为`ON`。
+- `ANROID_ARM_MODE`，是否使用ARM模式。
+	- `ANDROID_ABI=armeabi-v7a`时，可设置`ON/OFF`，默认值为`ON`；
+	- `ANDROID_ABI=arm64-v8a`时，不需要设置。
+- `ANDROID_ARM_NEON`，是否使用NEON指令。
+	- `ANDROID_ABI=armeabi-v7a`时，可设置`ON/OFF`，默认值为`ON`；
+	- `ANDROID_ABI=arm64-v8a`时，不需要设置。
 
 其他配置参数：
 
+- `USE_EIGEN_FOR_BLAS`，是否使用Eigen库进行矩阵计算。可设置`ON/OFF`，默认值为`OFF`。
 - `HOST_C/CXX_COMPILER`，宿主机的C/C++编译器。在编译宿主机版protoc可执行文件和目标机版OpenBLAS库时需要用到。默认设置成环境变量`CC`的值；若环境变量`CC`没有设置，则设置成`cc`编译器。
 
-一种常用的cmake配置如下：
+常用的cmake配置如下：
 
 ```bash
 cmake -DCMAKE_SYSTEM_NAME=Android \
-      -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/my_standalone_toolchain \
+      -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm_standalone_toolchain \
       -DANDROID_ABI=armeabi-v7a \
       -DANDROID_ARM_NEON=ON \
       -DANDROID_ARM_MODE=ON \
+      -DUSE_EIGEN_FOR_BLAS=ON \
       -DCMAKE_INSTALL_PREFIX=your/path/to/install \
       -DWITH_C_API=ON \
       -DWITH_SWIG_PY=OFF \
       ..
 ```
 
+```
+cmake -DCMAKE_SYSTEM_NAME=Android \
+      -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm64_standalone_toolchain \
+      -DANDROID_ABI=arm64-v8a \
+      -DUSE_EIGEN_FOR_BLAS=OFF \
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \  
+      -DWITH_C_API=ON \
+      -DWITH_SWIG_PY=OFF \
+      ..
+```
+
 用户还可根据自己的需求设置其他编译参数。比如希望最小化生成的库的大小，可以设置`CMAKE_BUILD_TYPE`为`MinSizeRel`；若希望最快的执行速度，则可设置`CMAKE_BUILD_TYPE`为`Release`。亦可以通过手动设置`CMAKE_C/CXX_FLAGS_MINSIZEREL/RELEASE`来影响PaddlePaddle的编译过程。
 
-## 编译和安装
+**性能TIPS**，为了达到最快的计算速度，在CMake参数配置上，有以下建议：
+- 设置`CMAKE_BUILD_TYPE`为`Release`
+- 使用`clang`编译工具链
+- `armeabi-v7a`时，设置`USE_EIGEN_BLAS=ON`，使用Eigen进行矩阵计算；`arm64-v8a`时，设置`USE_EIGEN_FOR_BLAS=OFF`，使用OpenBLAS进行矩阵计算
+
+### 编译和安装
 
 CMake配置完成后，执行以下命令，PaddlePaddle将自动下载和编译所有第三方依赖库、编译和安装PaddlePaddle预测库。
 
@@ -72,4 +143,4 @@ make install
 
 注意：如果你曾经在源码目录下编译过其他平台的PaddlePaddle库，请先使用`rm -rf`命令删除`third_party`目录和`build`目录，以确保所有的第三方依赖库和PaddlePaddle代码都是针对新的CMake配置重新编译的。
 
-执行完安装命令后，`your/path/to/install`目录中会包含`include`和`lib`目录，其中`include`中包含C-API的头文件，`lib`中包含一个Android版本的库。自此，PaddlePaddle的已经安装完成，用户可将`your/path/to/install`目录下的生成文件用于深度学习相关Android App中，调用方法见C-API文档。
+执行完安装命令后，`your/path/to/install`目录中会包含`include`、`lib`和`third_party`目录，其中`include`中包含C-API的头文件，`lib`中包含若干个不同Android ABI的PaddlePaddle库，`third_party`中包含所依赖的所有第三方库。自此，PaddlePaddle的已经安装完成，用户可将`your/path/to/install`目录下的生成文件用于深度学习相关Android App中，调用方法见C-API文档。
diff --git a/go/pserver/client/client.go b/go/pserver/client/client.go
index 20d91e77034e1a0c6825bc401175e6dc1afec52f..e5187ce3df77cb983e070508230c51c078f1e07b 100644
--- a/go/pserver/client/client.go
+++ b/go/pserver/client/client.go
@@ -137,7 +137,7 @@ func (c *Client) FinishInitParams() error {
 			return err
 		}
 	}
-	return nil
+	return c.sel.Done()
 }
 
 // SendGrads sends gradients to parameter servers for updating
diff --git a/paddle/framework/CMakeLists.txt b/paddle/framework/CMakeLists.txt
index 9d039a54d65e3b9fb49813f3ded699b3b5d6d7d5..8df92c4ad415443eb458fc2e04b954acf6c33fb7 100644
--- a/paddle/framework/CMakeLists.txt
+++ b/paddle/framework/CMakeLists.txt
@@ -43,13 +43,6 @@ cc_library(backward SRCS backward.cc DEPS net_op)
 cc_test(backward_test SRCS backward_test.cc DEPS backward recurrent_op device_context)
 
 cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto backward)
-set(EXECUTOR_TEST_OP elementwise_add_op gaussian_random_op feed_op fetch_op
-    mul_op sum_op squared_l2_distance_op fill_constant_op sgd_op mean_op)
-if(WITH_GPU)
-    nv_test(executor_test SRCS executor_test.cc DEPS executor ${EXECUTOR_TEST_OP})
-else()
-    cc_test(executor_test SRCS executor_test.cc DEPS executor ${EXECUTOR_TEST_OP})
-endif()
 
 cc_library(prune SRCS prune.cc DEPS framework_proto)
 cc_test(prune_test SRCS prune_test.cc DEPS op_info prune recurrent_op device_context)
@@ -57,5 +50,7 @@ cc_test(prune_test SRCS prune_test.cc DEPS op_info prune recurrent_op device_con
 cc_library(tensor_array SRCS tensor_array.cc DEPS lod_tensor)
 cc_test(tensor_array_test SRCS tensor_array_test.cc DEPS tensor_array place)
 
+cc_test(var_type_inference_test SRCS var_type_inference_test.cc DEPS op_registry
+        proto_desc)
 cc_library(selected_rows SRCS selected_rows.cc DEPS tensor)
 cc_test(selected_rows_test SRCS selected_rows_test.cc DEPS selected_rows)
diff --git a/paddle/framework/attribute.cc b/paddle/framework/attribute.cc
index d6a2975aaa419406aef7b228e78381dbce78890d..29fe352ca450740e55ee87b63392e3aabac8aa40 100644
--- a/paddle/framework/attribute.cc
+++ b/paddle/framework/attribute.cc
@@ -19,19 +19,7 @@ limitations under the License. */
 namespace paddle {
 namespace framework {
 
-static ProgramDesc* g_program_desc = nullptr;
-
-ProgramDesc& GetProgramDesc() {
-  if (g_program_desc == nullptr) {
-    g_program_desc = new ProgramDesc();
-    auto root_block = g_program_desc->mutable_blocks()->Add();
-    root_block->set_idx(0);
-    root_block->set_parent_idx(-1);
-  }
-  return *g_program_desc;
-}
-
-Attribute GetAttrValue(const OpDesc::Attr& attr_desc) {
+Attribute GetAttrValue(const OpDesc::Attr& attr_desc, ProgramDesc* program) {
   switch (attr_desc.type()) {
     case framework::AttrType::BOOLEAN: {
       return attr_desc.b();
@@ -74,7 +62,9 @@ Attribute GetAttrValue(const OpDesc::Attr& attr_desc) {
       return val;
     }
     case framework::AttrType::BLOCK: {
-      return GetProgramDesc().mutable_blocks(attr_desc.block_idx());
+      PADDLE_ENFORCE(program != nullptr,
+                     "Need to specify ProgramDesc when get a block attr");
+      return program->mutable_blocks(attr_desc.block_idx());
     }
   }
   PADDLE_ENFORCE(false, "Unknown OpDesc::AttrDesc::type !");
diff --git a/paddle/framework/attribute.h b/paddle/framework/attribute.h
index d13530e3408a54c7ecab87c3bd9e6288e342f9af..9744662b8f7229b0b17e910ae5cd997fa7d31e06 100644
--- a/paddle/framework/attribute.h
+++ b/paddle/framework/attribute.h
@@ -26,16 +26,13 @@ limitations under the License. */
 
 namespace paddle {
 namespace framework {
-
-ProgramDesc& GetProgramDesc();
-
 template <typename T>
 inline AttrType AttrTypeID() {
   Attribute tmp = T();
   return static_cast<AttrType>(tmp.which() - 1);
 }
 
-Attribute GetAttrValue(const OpDesc::Attr& attr_desc);
+Attribute GetAttrValue(const OpDesc::Attr& attr_desc, ProgramDesc* desc);
 
 class AttrReader {
  public:
@@ -120,6 +117,57 @@ class EnumInContainer {
   std::unordered_set<T> container_;
 };
 
+template <typename T>
+struct ExtractAttribute {
+  explicit ExtractAttribute(const std::string& attr_name)
+      : attr_name_(attr_name) {}
+
+  T* operator()(Attribute& attr) const {
+    T* attr_value = nullptr;
+    try {
+      attr_value = &boost::get<T>(attr);
+    } catch (boost::bad_get& bad_get) {
+      PADDLE_THROW("Cannot get attribute %s by type %s, its type is %s",
+                   attr_name_, typeid(T).name(), attr.type().name());
+    }
+    return attr_value;
+  }
+
+  const std::string& attr_name_;
+};
+
+// special handle bool
+// FIXME(yuyang18): Currently we cast bool into int in python binding. It is
+// hard to change the logic there. In another way, we should correct handle
+// if the user set `some_flag=1`.
+//
+// FIX ME anytime if there is a better solution.
+template <>
+struct ExtractAttribute<bool> {
+  explicit ExtractAttribute(const std::string& attr_name)
+      : attr_name_(attr_name) {}
+
+  bool* operator()(Attribute& attr) const {
+    if (attr.type() == typeid(int)) {  // NOLINT
+      int val = boost::get<int>(attr);
+      attr = static_cast<bool>(val);
+    } else if (attr.type() == typeid(float)) {  // NOLINT
+      float val = boost::get<float>(attr);
+      attr = static_cast<bool>(val);
+    }
+    bool* attr_value = nullptr;
+    try {
+      attr_value = &boost::get<bool>(attr);
+    } catch (boost::bad_get& bad_get) {
+      PADDLE_THROW("Cannot get attribute %s by type bool, its type is %s",
+                   attr_name_, attr.type().name());
+    }
+    return attr_value;
+  }
+
+  const std::string& attr_name_;
+};
+
 // check whether a certain attribute fit its limits
 // an attribute can have more than one limits
 template <typename T>
@@ -171,9 +219,10 @@ class TypedAttrChecker {
       attr_map[attr_name_] = val;
     }
     Attribute& attr = attr_map.at(attr_name_);
-    T& attr_value = boost::get<T>(attr);
+    ExtractAttribute<T> extract_attr(attr_name_);
+    T* attr_value = extract_attr(attr);
     for (const auto& checker : value_checkers_) {
-      checker(attr_value);
+      checker(*attr_value);
     }
   }
 
diff --git a/paddle/framework/backward.cc b/paddle/framework/backward.cc
index e3d7dacd7f0ad61a606e8b3e8f6a84b98deac729..fb552fe3448b3f17e97e1262b5c9a0842f68f8b9 100644
--- a/paddle/framework/backward.cc
+++ b/paddle/framework/backward.cc
@@ -281,12 +281,16 @@ static void CreateGradVarInBlock(
   auto ops = block_desc->AllOps();
   for (size_t op_index = grad_op_start_index; op_index < ops.size();
        ++op_index) {
+    bool need_infer_shape = false;
     ForEachVarName(ops[op_index]->Outputs(),
                    [&](const std::string& grad_var_name) {
                      if (block_desc->HasVar(grad_var_name)) {
                        return false;
                      }
-                     block_desc->Var(grad_var_name);
+                     need_infer_shape = true;
+                     auto var = block_desc->Var(grad_var_name);
+                     // FIXME(qiao) infer the datatype
+                     var->SetDataType(framework::DataType::FP32);
                      auto it = param_name_map.find(grad_var_name);
                      if (it == param_name_map.end()) {
                        return false;
@@ -298,12 +302,14 @@ static void CreateGradVarInBlock(
                      grad_record.op_idx_ = static_cast<int>(op_index);
                      return false; /* not break */
                    });
+    if (need_infer_shape) {
+      ops[op_index]->InferShape(*block_desc);
+    }
   }
 }
 
 std::vector<std::unique_ptr<OpDescBind>> MakeOpGrad(
-    const std::unique_ptr<OpDescBind>& op_desc,
-    std::unordered_set<std::string>* no_grad_vars,
+    const OpDescBind* op_desc, std::unordered_set<std::string>* no_grad_vars,
     std::unordered_map<std::string, std::string>* grad_to_var) {
   std::vector<std::unique_ptr<OpDescBind>> grad_op_descs;
   // All input gradients of forwarding operator do not need to calculate.
@@ -350,7 +356,7 @@ std::vector<std::unique_ptr<OpDescBind>> MakeBlockBackward(
     std::unordered_set<std::string>* no_grad_vars,
     std::unordered_map<std::string, std::string>* grad_to_var) {
   BlockDescBind* cur_block = program_desc.Block(block_idx);
-  std::deque<std::unique_ptr<OpDescBind>>& op_descs = cur_block->ops_;
+  std::vector<OpDescBind*> op_descs = cur_block->AllOps();
   std::unordered_map<std::string, std::vector<size_t>> dup_out_ops;
   size_t grad_desc_idx = 0;
   std::vector<std::unique_ptr<OpDescBind>> backward_descs;
@@ -368,7 +374,7 @@ std::vector<std::unique_ptr<OpDescBind>> MakeBlockBackward(
           program_desc, step_block_idx, no_grad_vars, grad_to_var);
       BlockDescBind* backward_block = program_desc.AppendBlock(*cur_block);
       for (auto& ptr : backward_block_op_descs) {
-        backward_block->ops_.push_back(std::move(ptr));
+        backward_block->AppendAllocatedOp(std::move(ptr));
       }
       op_grads[0]->SetBlockAttr("step_block", *backward_block);
     }
@@ -425,17 +431,22 @@ ParamGradInfoMap AppendBackward(
 
   const int root_block_idx = 0;
   auto root_block = program_desc.Block(root_block_idx);
-  auto& all_ops = root_block->ops_;
 
   // insert fill one op for target
+  // TODO(qiao) add some check to the target.
   std::string fill_one_op_out = GradVarName(target.Name());
+  std::vector<int64_t> target_shape_desc = target.Shape();
+  std::vector<int> target_shape;
+  std::transform(target_shape_desc.begin(), target_shape_desc.end(),
+                 std::back_inserter(target_shape),
+                 [](int64_t dim) { return static_cast<int>(dim); });
   std::unique_ptr<OpDescBind> fill_one_op(
       new OpDescBind("fill_constant", {}, {{"Out", {fill_one_op_out}}},
-                     {{"shape", std::vector<int>{1}},
+                     {{"shape", target_shape},
                       {"value", static_cast<float>(1.0)},
-                      {"dataType", framework::DataType::FP32}}));
-  all_ops.push_back(std::move(fill_one_op));
-  size_t forward_op_num = all_ops.size();
+                      {"data_type", framework::DataType::FP32}}));
+  root_block->AppendAllocatedOp(std::move(fill_one_op));
+  size_t forward_op_num = root_block->OpSize();
   size_t forward_block_num = program_desc.Size();
 
   // Insert backward operators
@@ -443,13 +454,22 @@ ParamGradInfoMap AppendBackward(
   auto backward_op_descs = MakeBlockBackward(program_desc, root_block_idx,
                                              &no_grad_var_names, &grad_to_var);
 
-  std::unordered_map<std::string, GradVarInfo> retv;
-
-  // Create Variable
   for (auto& ptr : backward_op_descs) {
-    all_ops.push_back(std::move(ptr));
+    root_block->AppendAllocatedOp(std::move(ptr));
   }
-  root_block->Var(fill_one_op_out);
+  // Create Variable
+
+  // Create target gradient variable
+  std::unordered_map<std::string, GradVarInfo> retv;
+
+  auto var = root_block->Var(fill_one_op_out);
+  // FIXME(qiao) infer the data type
+  var->SetDataType(framework::DataType::FP32);
+  var->SetShape(target.Shape());
+  auto& target_grad = retv[target.Name()];
+  target_grad.name_ = fill_one_op_out;
+  target_grad.block_idx_ = root_block_idx;
+  target_grad.op_idx_ = static_cast<int>(forward_op_num);
 
   // create grad_var for all blocks in this program
   CreateGradVarInBlock(forward_op_num, grad_to_var, root_block, &retv);
diff --git a/paddle/framework/backward_test.cc b/paddle/framework/backward_test.cc
index 5302afcafb5c0e1c057302dac174be935649ef11..10301f7e39423c8ff0eba33277edecab14c119bf 100644
--- a/paddle/framework/backward_test.cc
+++ b/paddle/framework/backward_test.cc
@@ -26,6 +26,20 @@ namespace framework {
 
 using DeviceContext = platform::DeviceContext;
 
+class NoneOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(framework::InferShapeContext *ctx) const override {}
+};
+
+template <typename Place, typename T>
+class NoneKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext &context) const override {}
+};
+
 class RowWiseAddOpMaker : public OpProtoAndCheckerMaker {
  public:
   RowWiseAddOpMaker(OpProto *proto, OpAttrChecker *op_checker)
@@ -215,19 +229,51 @@ class MinusOpMaker : public OpProtoAndCheckerMaker {
 namespace f = paddle::framework;
 namespace ops = paddle::operators;
 using EnforceNotMet = paddle::platform::EnforceNotMet;
-REGISTER_OPERATOR(rowwise_add, f::NOP, f::RowWiseAddOpMaker,
+// rowwise_add
+REGISTER_OPERATOR(rowwise_add, f::NoneOp, f::RowWiseAddOpMaker,
                   f::RowWiseAddGradMaker);
-REGISTER_OPERATOR(rowwise_add_grad, f::NOP);
-REGISTER_OP(mul, f::NOP, f::MulOpMaker, mul_grad, f::NOP);
-REGISTER_OP(sigmoid, f::NOP, f::SigmoidOpMaker, sigmoid_grad, f::NOP);
-REGISTER_OP_WITHOUT_GRADIENT(nograd, f::NOP, f::NoGradOpMaker);
-REGISTER_OP_WITHOUT_GRADIENT(fill_zeros_like, f::NOP, f::FillZeroOpMaker);
-REGISTER_OP(sum, f::NOP, f::SumOpMaker, sum_grad, f::NOP);
+REGISTER_OP_CPU_KERNEL(rowwise_add,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OPERATOR(rowwise_add_grad, f::NoneOp);
+REGISTER_OP_CPU_KERNEL(rowwise_add_grad,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+// mul
+REGISTER_OP(mul, f::NoneOp, f::MulOpMaker, mul_grad, f::NoneOp);
+REGISTER_OP_CPU_KERNEL(mul, f::NoneKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_CPU_KERNEL(mul_grad,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+// sigmoid
+REGISTER_OP(sigmoid, f::NoneOp, f::SigmoidOpMaker, sigmoid_grad, f::NoneOp);
+REGISTER_OP_CPU_KERNEL(sigmoid,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_WITHOUT_GRADIENT(nograd, f::NoneOp, f::NoGradOpMaker);
+// fill_zeros_like
+REGISTER_OP_WITHOUT_GRADIENT(fill_zeros_like, f::NoneOp, f::FillZeroOpMaker);
+REGISTER_OP_CPU_KERNEL(fill_zeros_like,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+// sum
+REGISTER_OP(sum, f::NoneOp, f::SumOpMaker, sum_grad, f::NoneOp);
+REGISTER_OP_CPU_KERNEL(sum, f::NoneKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_CPU_KERNEL(sum_grad,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+// fc
 REGISTER_OP_WITHOUT_GRADIENT(fc, f::FcOp, f::FcOpMaker);
-REGISTER_OP(many_output_op, f::NOP, f::ManyOutputOpMaker, many_output_op_grad,
-            f::NOP);
-REGISTER_OP(mult_in_out, f::NOP, f::MultInOutOpMaker, mult_in_out_grad, f::NOP);
-REGISTER_OPERATOR(minus, f::NOP, f::MinusOpMaker, f::MinusGradOpDescMaker);
+// many_output_op
+REGISTER_OP(many_output_op, f::NoneOp, f::ManyOutputOpMaker,
+            many_output_op_grad, f::NoneOp);
+// mult_in_out
+REGISTER_OP(mult_in_out, f::NoneOp, f::MultInOutOpMaker, mult_in_out_grad,
+            f::NoneOp);
+REGISTER_OP_CPU_KERNEL(mult_in_out,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_CPU_KERNEL(mult_in_out_grad,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+// minus
+REGISTER_OPERATOR(minus, f::NoneOp, f::MinusOpMaker, f::MinusGradOpDescMaker);
+REGISTER_OP_CPU_KERNEL(minus, f::NoneKernel<paddle::platform::CPUPlace, float>);
+// scale
+REGISTER_OPERATOR(scale, f::NoneOp);
+REGISTER_OP_CPU_KERNEL(scale, f::NoneKernel<paddle::platform::CPUPlace, float>);
 
 TEST(Backward, simple_op_not_need_grad) {
   auto fwd = f::OpRegistry::CreateOp(
@@ -449,20 +495,10 @@ TEST(Backward, linear_net_intermediate_variable_has_no_grad) {
   EXPECT_EQ(bwd_net->ops_[2]->Outputs(all).size(), 0UL);
 }
 
-// =================================== //
-
-f::ProgramDesc *GetNewProgramDesc() {
-  auto *program_desc = new f::ProgramDesc();
-  auto *root_block = program_desc->add_blocks();
-  root_block->set_idx(0);
-  root_block->set_parent_idx(-1);
-  return program_desc;
-}
-
 TEST(Backward, simple_single_op) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
+
   f::OpDescBind *op = block->AppendOp();
   op->SetType("rowwise_add");
   op->SetInput("X", {"x"});
@@ -487,7 +523,7 @@ TEST(Backward, simple_single_op) {
   EXPECT_EQ(grad_op->Output(f::GradVarName("b")),
             std::vector<std::string>({f::GradVarName("b")}));
 
-  EXPECT_EQ(var_to_grad.size(), 2UL);
+  EXPECT_EQ(var_to_grad.size(), 3UL);
   EXPECT_EQ(var_to_grad.at("b"), f::GradVarInfo(f::GradVarName("b"), 0, 2));
   EXPECT_EQ(var_to_grad.at("x"), f::GradVarInfo(f::GradVarName("x"), 0, 2));
 
@@ -496,8 +532,7 @@ TEST(Backward, simple_single_op) {
 }
 
 TEST(Backward, default_attribute) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
   f::OpDescBind *op = block->AppendOp();
   op->SetType("mul");
@@ -523,8 +558,7 @@ TEST(Backward, default_attribute) {
 }
 
 TEST(Backward, simple_mult_op) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
   f::OpDescBind *op1 = block->AppendOp();
   op1->SetType("rowwise_add");
@@ -588,7 +622,7 @@ TEST(Backward, simple_mult_op) {
   EXPECT_EQ(grad_op3->Output(f::GradVarName("b")),
             std::vector<std::string>({f::GradVarName("b3")}));
 
-  EXPECT_EQ(var_to_grad.size(), 6UL);
+  EXPECT_EQ(var_to_grad.size(), 7UL);
   EXPECT_EQ(var_to_grad.at("x1"), f::GradVarInfo(f::GradVarName("x1"), 0, 6));
   EXPECT_EQ(var_to_grad.at("b1"), f::GradVarInfo(f::GradVarName("b1"), 0, 6));
   EXPECT_EQ(var_to_grad.at("out1"),
@@ -607,8 +641,7 @@ TEST(Backward, simple_mult_op) {
 }
 
 TEST(Backward, intermedia_var_no_grad) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
   f::OpDescBind *op1 = block->AppendOp();
   op1->SetType("rowwise_add");
@@ -666,7 +699,7 @@ TEST(Backward, intermedia_var_no_grad) {
             std::vector<std::string>({f::GradVarName("out1")}));
   EXPECT_EQ(grad_op4->Output(f::GradVarName("Y")), std::vector<std::string>());
 
-  EXPECT_EQ(var_to_grad.size(), 3UL);
+  EXPECT_EQ(var_to_grad.size(), 4UL);
   EXPECT_EQ(var_to_grad.at("x1"), f::GradVarInfo(f::GradVarName("x1"), 0, 6));
   EXPECT_EQ(var_to_grad.at("b1"), f::GradVarInfo(f::GradVarName("b1"), 0, 6));
   EXPECT_EQ(var_to_grad.at("out1"),
@@ -678,8 +711,7 @@ TEST(Backward, intermedia_var_no_grad) {
 }
 
 TEST(Backward, var_no_grad) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
   f::OpDescBind *op1 = block->AppendOp();
   op1->SetType("mult_in_out");
@@ -744,7 +776,7 @@ TEST(Backward, var_no_grad) {
   EXPECT_EQ(grad_op1->Output(f::GradVarName("H")),
             std::vector<std::string>({f::GradVarName("h1")}));
 
-  EXPECT_EQ(var_to_grad.size(), 3UL);
+  EXPECT_EQ(var_to_grad.size(), 4UL);
   EXPECT_EQ(var_to_grad.at("y1"), f::GradVarInfo(f::GradVarName("y1"), 0, 3));
   EXPECT_EQ(var_to_grad.at("x1"), f::GradVarInfo(f::GradVarName("x1"), 0, 5));
   EXPECT_EQ(var_to_grad.at("h1"), f::GradVarInfo(f::GradVarName("h1"), 0, 5));
@@ -755,8 +787,7 @@ TEST(Backward, var_no_grad) {
 }
 
 TEST(Backward, shared_var) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
   f::OpDescBind *op1 = block->AppendOp();
   op1->SetType("rowwise_add");
@@ -830,7 +861,7 @@ TEST(Backward, shared_var) {
   EXPECT_EQ(grad_op1->Output(f::GradVarName("b")),
             std::vector<std::string>({f::GradVarName("b1")}));
 
-  EXPECT_EQ(var_to_grad.size(), 5UL);
+  EXPECT_EQ(var_to_grad.size(), 6UL);
   EXPECT_EQ(var_to_grad.at("b3"), f::GradVarInfo(f::GradVarName("b3"), 0, 4));
   EXPECT_EQ(var_to_grad.at("y2"), f::GradVarInfo(f::GradVarName("y2"), 0, 5));
   EXPECT_EQ(var_to_grad.at("out1"),
@@ -846,8 +877,7 @@ TEST(Backward, shared_var) {
 }
 
 TEST(Backward, half_backward) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
   auto *op1 = block->AppendOp();
   op1->SetType("minus");
@@ -863,7 +893,7 @@ TEST(Backward, half_backward) {
   auto ops = block->AllOps();
   ASSERT_EQ(3UL, ops.size());
 
-  EXPECT_EQ(var_to_grad.size(), 1UL);
+  EXPECT_EQ(var_to_grad.size(), 2UL);
   EXPECT_EQ(var_to_grad.at("a"),
             f::GradVarInfo(f::GradVarName("a"), 0, forward_len + 1));
 }
diff --git a/paddle/framework/block_desc.cc b/paddle/framework/block_desc.cc
index 47b75228cdbd2a8b4f0c5ad33aa82f5e43044606..92ac302e46ae1a5c90bec8522a2974298311284a 100644
--- a/paddle/framework/block_desc.cc
+++ b/paddle/framework/block_desc.cc
@@ -19,11 +19,11 @@ namespace paddle {
 namespace framework {
 
 VarDescBind *BlockDescBind::Var(const std::string &name) {
-  need_update_ = true;
   auto it = vars_.find(name);
   if (it != vars_.end()) {
     return it->second.get();
   }
+  need_update_ = true;
   auto *var = new VarDescBind(name);
   vars_[name].reset(var);
   return var;
@@ -55,6 +55,11 @@ OpDescBind *BlockDescBind::AppendOp() {
   return ops_.back().get();
 }
 
+void BlockDescBind::AppendAllocatedOp(std::unique_ptr<OpDescBind> &&op_desc) {
+  need_update_ = true;
+  ops_.emplace_back(std::move(op_desc));
+}
+
 OpDescBind *BlockDescBind::PrependOp() {
   need_update_ = true;
   ops_.emplace_front(new OpDescBind());
@@ -70,15 +75,19 @@ std::vector<OpDescBind *> BlockDescBind::AllOps() const {
 }
 
 void BlockDescBind::Flush() {
+  for (auto &op_desc : ops_) {
+    op_desc->Flush();
+  }
+
   if (need_update_) {
     auto &op_field = *this->desc_->mutable_ops();
-    op_field.Clear();
+    this->ClearPBOps();
     op_field.Reserve(static_cast<int>(ops_.size()));
     for (auto &op_desc : ops_) {
       op_field.AddAllocated(op_desc->Proto());
     }
     auto &var_field = *this->desc_->mutable_vars();
-    var_field.Clear();
+    this->ClearPBVars();
     var_field.Reserve(static_cast<int>(vars_.size()));
     for (auto &var_desc : vars_) {
       var_field.AddAllocated(var_desc.second->Proto());
@@ -99,5 +108,21 @@ BlockDesc *BlockDescBind::Proto() {
   return desc_;
 }
 
+void BlockDescBind::ClearPBOps() {
+  auto ops = this->desc_->mutable_ops();
+  while (!ops->empty()) {
+    // we do not own the OpDesc, so release the ownership.
+    ops->ReleaseLast();
+  }
+}
+
+void BlockDescBind::ClearPBVars() {
+  auto vars = this->desc_->mutable_vars();
+  while (!vars->empty()) {
+    // we do not own the VarDesc, so release the ownership.
+    vars->ReleaseLast();
+  }
+}
+
 }  // namespace framework
 }  // namespace paddle
diff --git a/paddle/framework/block_desc.h b/paddle/framework/block_desc.h
index 9fb88f963283c72e1ec389b72dd2d98049c74f6d..5e1f10c1aef8bb7b13d75baad49167052dd01e49 100644
--- a/paddle/framework/block_desc.h
+++ b/paddle/framework/block_desc.h
@@ -36,6 +36,11 @@ class BlockDescBind {
   BlockDescBind(ProgramDescBind *prog, BlockDesc *desc)
       : prog_(prog), desc_(desc), need_update_(false) {}
 
+  ~BlockDescBind() {
+    this->ClearPBVars();
+    this->ClearPBOps();
+  }
+
   int32_t ID() const { return desc_->idx(); }
 
   int32_t Parent() const { return desc_->parent_idx(); }
@@ -52,17 +57,25 @@ class BlockDescBind {
 
   OpDescBind *AppendOp();
 
+  void AppendAllocatedOp(std::unique_ptr<OpDescBind> &&op_desc);
+
   OpDescBind *PrependOp();
 
   std::vector<OpDescBind *> AllOps() const;
 
+  size_t OpSize() const { return ops_.size(); }
+
+  OpDescBind *Op(int idx) { return ops_.at(idx).get(); }
+
   void Flush();
 
   BlockDesc *Proto();
 
-  // FIXME(yuyang18): backward will access private data of BlockDesc.
-  // Mark it public temporary. We can fix it later.
- public:
+ private:
+  void ClearPBOps();
+  void ClearPBVars();
+
+ private:
   ProgramDescBind *prog_;  // not_own
   BlockDesc *desc_;        // not_own
   bool need_update_;
diff --git a/paddle/framework/details/op_registry.h b/paddle/framework/details/op_registry.h
index ed7c5f17b0854809bde923276f36440cce193a88..357ad21f39f3b1f6dbdb98063f8fb24ec6800ec6 100644
--- a/paddle/framework/details/op_registry.h
+++ b/paddle/framework/details/op_registry.h
@@ -18,6 +18,7 @@
 #include "paddle/framework/op_info.h"
 #include "paddle/framework/op_proto_maker.h"
 #include "paddle/framework/operator.h"
+#include "paddle/framework/var_type_inference.h"
 
 namespace paddle {
 namespace framework {
@@ -26,7 +27,8 @@ namespace details {
 enum OpInfoFillType {
   kOperator = 0,
   kOpProtoAndCheckerMaker = 1,
-  kGradOpDescMaker = 2
+  kGradOpDescMaker = 2,
+  kVarTypeInference = 3
 };
 
 template <typename T>
@@ -38,7 +40,9 @@ struct OpInfoFillTypeID {
                       ? kOpProtoAndCheckerMaker
                       : (std::is_base_of<GradOpDescMakerBase, T>::value
                              ? kGradOpDescMaker
-                             : static_cast<OpInfoFillType>(-1)));
+                             : (std::is_base_of<VarTypeInference, T>::value
+                                    ? kVarTypeInference
+                                    : static_cast<OpInfoFillType>(-1))));
   }
 };
 
@@ -106,6 +110,17 @@ struct OpInfoFiller<T, kGradOpDescMaker> {
     };
   }
 };
+
+template <typename T>
+struct OpInfoFiller<T, kVarTypeInference> {
+  void operator()(const char* op_type, OpInfo* info) const {
+    info->infer_var_type_ = [](const OpDescBind& fwd_op, BlockDescBind* block) {
+      T inference;
+      inference(fwd_op, block);
+    };
+  }
+};
+
 }  // namespace details
 
 }  // namespace framework
diff --git a/paddle/framework/executor.cc b/paddle/framework/executor.cc
index 8e82e28bac478ad93ece3fcec9730c6cbabc392a..00caa6e1d53a4bcfae56c4459413bc1622321960 100644
--- a/paddle/framework/executor.cc
+++ b/paddle/framework/executor.cc
@@ -64,99 +64,24 @@ void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id) {
   auto& block = pdesc.blocks(block_id);
   auto& device = device_contexts_[0];
 
-  // Instantiate all the vars in the global scope
-  for (auto& var : block.vars()) {
-    scope->Var(var.name());
-  }
-
   Scope& local_scope = scope->NewScope();
 
-  std::vector<bool> should_run = Prune(pdesc, block_id);
-  PADDLE_ENFORCE_EQ(should_run.size(), static_cast<size_t>(block.ops_size()));
-  for (size_t i = 0; i < should_run.size(); ++i) {
-    if (should_run[i]) {
-      for (auto& var : block.ops(i).outputs()) {
-        for (auto& argu : var.arguments()) {
-          if (local_scope.FindVar(argu) == nullptr) {
-            local_scope.Var(argu);
-          }
-        }
-      }
-      auto op = paddle::framework::OpRegistry::CreateOp(block.ops(i));
-      op->Run(local_scope, *device);
+  for (auto& var : block.vars()) {
+    if (var.persistable()) {
+      scope->Var(var.name());
+    } else {
+      local_scope.Var(var.name());
     }
   }
 
-  // TODO(tonyyang-svail):
-  //  - Destroy local_scope
-}
-
-std::vector<bool> Prune(const ProgramDesc& pdesc, int block_id) {
-  // TODO(tonyyang-svail):
-  //    - will change to use multiple blocks for RNN op and Cond Op
-
-  auto& block = pdesc.blocks(block_id);
-  auto& ops = block.ops();
-
-  bool expect_feed = true;
-  for (auto& op_desc : ops) {
-    PADDLE_ENFORCE(op_desc.type() != kFeedOpType || expect_feed,
-                   "All FeedOps are at the beginning of the ProgramDesc");
-    expect_feed = (op_desc.type() == kFeedOpType);
-  }
-
-  bool expect_fetch = true;
-  for (auto op_iter = ops.rbegin(); op_iter != ops.rend(); ++op_iter) {
-    auto& op_desc = *op_iter;
-    PADDLE_ENFORCE(op_desc.type() != kFetchOpType || expect_fetch,
-                   "All FetchOps must at the end of the ProgramDesc");
-    expect_fetch = (op_desc.type() == kFetchOpType);
-  }
-
-  std::set<std::string> dependent_vars;
-  std::vector<bool> should_run;
-  for (auto op_iter = ops.rbegin(); op_iter != ops.rend(); ++op_iter) {
-    auto& op_desc = *op_iter;
-
-    bool found_dependent_vars = false;
-    for (auto& var : op_desc.outputs()) {
-      for (auto& argu : var.arguments()) {
-        if (dependent_vars.count(argu) != 0) {
-          found_dependent_vars = true;
-        }
-      }
-    }
-
-    if (op_desc.type() == kFetchOpType || found_dependent_vars) {
-      // erase its output to the dependency graph
-      for (auto& var : op_desc.outputs()) {
-        for (auto& argu : var.arguments()) {
-          dependent_vars.erase(argu);
-        }
-      }
-
-      // insert its input to the dependency graph
-      for (auto& var : op_desc.inputs()) {
-        for (auto& argu : var.arguments()) {
-          dependent_vars.insert(argu);
-        }
-      }
-
-      should_run.push_back(true);
-    } else {
-      should_run.push_back(false);
-    }
+  for (auto& op_desc : block.ops()) {
+    auto op = paddle::framework::OpRegistry::CreateOp(
+        op_desc, const_cast<ProgramDesc*>(&pdesc));
+    op->Run(local_scope, *device);
   }
 
   // TODO(tonyyang-svail):
-  //    - check this after integration of Init
-  // PADDLE_ENFORCE(dependent_vars.empty());
-
-  // since we are traversing the ProgramDesc in reverse order
-  // we reverse the should_run vector
-  std::reverse(should_run.begin(), should_run.end());
-
-  return should_run;
+  //  - Destroy local_scope
 }
 
 }  // namespace framework
diff --git a/paddle/framework/executor.h b/paddle/framework/executor.h
index 4e3bc2c0a59dfee5b9993037671f14a109dc8a74..793ee954e25f7da6c9d04ea6acc2ad78812e8329 100644
--- a/paddle/framework/executor.h
+++ b/paddle/framework/executor.h
@@ -40,16 +40,5 @@ class Executor {
   std::vector<platform::DeviceContext*> device_contexts_;
 };
 
-/* @Brief
- * Pruning the graph
- *
- * @param
- *  ProgramDesc
- *
- * @return
- *  vector<bool> Same size as ops. Indicates whether an op should be run.
- */
-std::vector<bool> Prune(const ProgramDesc& pdesc, int block_id);
-
 }  // namespace framework
 }  // namespace paddle
diff --git a/paddle/framework/executor_test.cc b/paddle/framework/executor_test.cc
deleted file mode 100644
index e08d31e361412c309b441d9381d06efe6d8b9724..0000000000000000000000000000000000000000
--- a/paddle/framework/executor_test.cc
+++ /dev/null
@@ -1,348 +0,0 @@
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-
-#include "paddle/framework/executor.h"
-
-#include <memory>
-#include <vector>
-
-#include "gflags/gflags.h"
-#include "gtest/gtest.h"
-#include "paddle/framework/attribute.h"
-#include "paddle/framework/backward.h"
-#include "paddle/framework/block_desc.h"
-#include "paddle/framework/op_desc.h"
-#include "paddle/framework/op_registry.h"
-#include "paddle/framework/operator.h"
-
-USE_OP(elementwise_add);
-USE_OP(gaussian_random);
-USE_NO_KERNEL_OP(feed);
-USE_NO_KERNEL_OP(fetch);
-USE_OP(mul);
-USE_OP(sum);
-USE_OP(squared_l2_distance);
-USE_OP(fill_constant);
-USE_OP(mean);
-USE_OP(sgd);
-
-constexpr auto kFeedValueName = "feed_value";
-constexpr auto kFetchValueName = "fetch_value";
-
-using namespace paddle::platform;
-using namespace paddle::framework;
-
-void AddOp(const std::string& type, const VariableNameMap& inputs,
-           const VariableNameMap& outputs, AttributeMap attrs,
-           paddle::framework::BlockDescBind* block) {
-  // insert output
-  for (auto kv : outputs) {
-    for (auto v : kv.second) {
-      // <<<<<<< HEAD
-      //       auto var = block->Var(v);
-      //       var->SetType(VarDesc::LOD_TENSOR);
-      //       var->SetDataType(paddle::framework::DataType::FP32);
-      // =======
-      if (!block->HasVar(v)) {
-        auto var = block->Var(v);
-        var->SetDataType(paddle::framework::DataType::FP32);
-      }
-      // >>>>>>> origin/develop
-    }
-  }
-
-  // insert op
-  auto op = block->AppendOp();
-  op->SetType(type);
-  for (auto& kv : inputs) {
-    op->SetInput(kv.first, kv.second);
-  }
-  for (auto& kv : outputs) {
-    op->SetOutput(kv.first, kv.second);
-  }
-  op->SetAttrMap(attrs);
-  op->CheckAttrs();
-}
-
-// Tensors in feed value variable will only be in CPUPlace
-// So we can memcpy the data from vector<T> to feed_value
-template <typename T>
-void SetFeedVariable(const std::vector<std::vector<T>>& inputs,
-                     const std::vector<std::vector<int64_t>>& dims) {
-  Variable* g_feed_value = GetGlobalScope().FindVar(kFeedValueName);
-  auto& feed_inputs =
-      *(g_feed_value->GetMutable<std::vector<paddle::framework::LoDTensor>>());
-  size_t size = inputs.size();
-  feed_inputs.resize(size);
-  for (size_t i = 0; i < size; i++) {
-    T* dst = feed_inputs[i].mutable_data<T>(make_ddim(dims[i]), CPUPlace());
-    memcpy(dst, inputs[i].data(), inputs[i].size() * sizeof(T));
-  }
-}
-
-// Tensors in fetch value variable will only be in CPUPlace
-// So we can memcpy the data from fetch_value to vector<T>
-template <typename T>
-std::vector<std::vector<T>> GetFetchVariable() {
-  Variable* g_fetch_value = GetGlobalScope().FindVar(kFetchValueName);
-  auto& fetch_outputs =
-      *(g_fetch_value->GetMutable<std::vector<paddle::framework::LoDTensor>>());
-
-  size_t size = fetch_outputs.size();
-  std::vector<std::vector<T>> result;
-  result.reserve(size);
-  for (size_t i = 0; i < size; i++) {
-    std::vector<T> tmp;
-    tmp.resize(fetch_outputs[i].numel());
-    memcpy(tmp.data(), fetch_outputs[i].data<T>(),
-           fetch_outputs[i].numel() * sizeof(T));
-    result.push_back(tmp);
-  }
-
-  return result;
-}
-
-class ExecutorTesterRandom : public ::testing::Test {
- public:
-  virtual void SetUp() override {
-    int input_dim = 3, batch_size = 2, embed_dim = 5;
-
-    auto temp_init_root_block = init_pdesc_.add_blocks();
-    temp_init_root_block->set_idx(0);
-    temp_init_root_block->set_parent_idx(-1);
-    paddle::framework::ProgramDescBind& init_program =
-        paddle::framework::ProgramDescBind::Instance(&init_pdesc_);
-    paddle::framework::BlockDescBind* init_root_block = init_program.Block(0);
-
-    AddOp("gaussian_random", {}, {{"Out", {"w1"}}},
-          {{"dims", std::vector<int>{input_dim, embed_dim}}}, init_root_block);
-    AddOp("gaussian_random", {}, {{"Out", {"w2"}}},
-          {{"dims", std::vector<int>{embed_dim, input_dim}}}, init_root_block);
-    AddOp("fetch", {{"Input", {"w1"}}}, {{"Out", {kFetchValueName}}},
-          {{"col", 0}}, init_root_block);
-    AddOp("fetch", {{"Input", {"w2"}}}, {{"Out", {kFetchValueName}}},
-          {{"col", 1}}, init_root_block);
-
-    // flush
-    init_program.Proto();
-
-    // run block
-    auto temp_root_block = pdesc_.add_blocks();
-    temp_root_block->set_idx(0);
-    temp_root_block->set_parent_idx(-1);
-    paddle::framework::ProgramDescBind& program =
-        paddle::framework::ProgramDescBind::Instance(&pdesc_);
-    paddle::framework::BlockDescBind* root_block = program.Block(0);
-
-    // feed data
-    inputs_.push_back({1.0, 1.0, 1.0, 1.0, 1.0, 1.0});
-    dims_.push_back({batch_size, input_dim});
-    AddOp("feed", {{"Input", {kFeedValueName}}}, {{"Out", {"a"}}},
-          {{"dims", std::vector<int>{batch_size, input_dim}}, {"col", 0}},
-          root_block);
-
-    // forward
-    AddOp("mul", {{"X", {"a"}}, {"Y", {"w1"}}}, {{"Out", {"b"}}}, {},
-          root_block);
-    AddOp("mul", {{"X", {"b"}}, {"Y", {"w2"}}}, {{"Out", {"a_out"}}}, {},
-          root_block);
-    AddOp("squared_l2_distance", {{"X", {"a"}}, {"Y", {"a_out"}}},
-          {{"Out", {"l2_distance"}}, {"sub_result", {"l2_distance_sub"}}}, {},
-          root_block);
-    AddOp("mean", {{"X", {"l2_distance"}}}, {{"Out", {"mean_out"}}}, {},
-          root_block);
-
-    // backward
-    auto target = VarDescBind("mean_out");
-    AppendBackward(program, target, {});
-
-    // update
-    AddOp("fill_constant", {}, {{"Out", {"learning_rate"}}},
-          {{"shape", std::vector<int>{1}}, {"value", float(0.001)}},
-          root_block);
-    AddOp("sgd", {{"Param", {"w1"}},
-                  {"LearningRate", {"learning_rate"}},
-                  {"Grad", {"w1@GRAD"}}},
-          {{"ParamOut", {"w1"}}}, {}, root_block);
-    AddOp("sgd", {{"Param", {"w2"}},
-                  {"LearningRate", {"learning_rate"}},
-                  {"Grad", {"w2@GRAD"}}},
-          {{"ParamOut", {"w2"}}}, {}, root_block);
-
-    AddOp("fetch", {{"Input", {"w1"}}}, {{"Out", {kFetchValueName}}},
-          {{"col", 0}}, root_block);
-    AddOp("fetch", {{"Input", {"w2"}}}, {{"Out", {kFetchValueName}}},
-          {{"col", 1}}, root_block);
-    AddOp("fetch", {{"Input", {"l2_distance"}}}, {{"Out", {kFetchValueName}}},
-          {{"col", 0}}, root_block);
-
-    // flush
-    program.Proto();
-  }
-
- protected:
-  ProgramDesc init_pdesc_;
-  ProgramDesc pdesc_;
-  std::vector<std::vector<float>> inputs_;
-  std::vector<std::vector<int64_t>> dims_;
-};
-
-class ExecutorTesterFeedAndFetch : public ::testing::Test {
- public:
-  virtual void SetUp() override {
-    auto temp_root_block = pdesc_.add_blocks();
-    temp_root_block->set_idx(0);
-    temp_root_block->set_parent_idx(-1);
-
-    // wrap to BlockDescBind
-    paddle::framework::ProgramDescBind& program =
-        paddle::framework::ProgramDescBind::Instance(&pdesc_);
-    paddle::framework::BlockDescBind* root_block = program.Block(0);
-
-    std::vector<int> dim{6};
-
-    AddOp("feed", {{"Input", {kFeedValueName}}}, {{"Out", {"a"}}},
-          {{"dims", dim}, {"col", 0}}, root_block);
-    AddOp("feed", {{"Input", {kFeedValueName}}}, {{"Out", {"b"}}},
-          {{"dims", dim}, {"col", 1}}, root_block);
-    AddOp("fetch", {{"Input", {"a"}}}, {{"Out", {kFetchValueName}}},
-          {{"col", 0}}, root_block);
-    AddOp("fetch", {{"Input", {"b"}}}, {{"Out", {kFetchValueName}}},
-          {{"col", 1}}, root_block);
-
-    // flush
-    program.Proto();
-
-    std::vector<float> vec1 = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
-    std::vector<float> vec2 = {4.0, 5.0, 6.0, 7.0, 8.0, 9.0};
-    inputs_.push_back(vec1);
-    inputs_.push_back(vec2);
-    dims_.push_back({static_cast<int64_t>(vec1.size())});
-    dims_.push_back({static_cast<int64_t>(vec2.size())});
-  }
-
- protected:
-  ProgramDesc pdesc_;
-  std::vector<std::vector<float>> inputs_;
-  std::vector<std::vector<int64_t>> dims_;
-};
-
-#ifndef PADDLE_WITH_CUDA
-TEST_F(ExecutorTesterRandom, CPU) {
-  std::vector<Place> places;
-  CPUPlace cpu_place;
-  places.push_back(cpu_place);
-
-  // We have a global Scope and BuddyAllocator, and we must ensure
-  // global BuddyAllocator is initialized before global Scope. Thus,
-  // global Scope will deconstruct before BuddyAllocator. Otherwise,
-  // "pointer being freed was not allocated" error will appear.
-  paddle::memory::Used(cpu_place);
-
-  std::unique_ptr<Executor> executor(new Executor(places));
-  executor->Run(init_pdesc_, &GetGlobalScope(), 0);
-  SetFeedVariable<float>(inputs_, dims_);
-  executor->Run(pdesc_, &GetGlobalScope(), 0);
-  std::vector<std::vector<float>> result = GetFetchVariable<float>();
-}
-
-TEST_F(ExecutorTesterFeedAndFetch, CPU) {
-  std::vector<Place> places;
-  CPUPlace cpu_place;
-  places.emplace_back(cpu_place);
-
-  // We have a global Scope and BuddyAllocator, and we must ensure
-  // global BuddyAllocator is initialized before global Scope. Thus,
-  // global Scope will deconstruct before BuddyAllocator. Otherwise,
-  // "pointer being freed was not allocated" error will appear.
-  paddle::memory::Used(cpu_place);
-
-  std::unique_ptr<Executor> executor(new Executor(places));
-
-  for (int batch_id = 0; batch_id < 3; batch_id++) {
-    SetFeedVariable<float>(inputs_, dims_);
-    executor->Run(pdesc_, &GetGlobalScope(), 0);
-    std::vector<std::vector<float>> result = GetFetchVariable<float>();
-    ASSERT_EQ(result.size(), inputs_.size());
-    for (size_t i = 0; i < result.size(); ++i) {
-      ASSERT_EQ(result[i].size(), inputs_[i].size());
-      for (size_t j = 0; j < result[i].size(); ++j) {
-        ASSERT_EQ(result[i][j], inputs_[i][j]);
-      }
-    }
-  }
-}
-#else
-TEST_F(ExecutorTesterRandom, GPU) {
-  std::vector<Place> places;
-  GPUPlace gpu_place(0);
-  places.push_back(gpu_place);
-
-  // We have a global Scope and BuddyAllocator, and we must ensure
-  // global BuddyAllocator is initialized before global Scope. Thus,
-  // global Scope will deconstruct before BuddyAllocator. Otherwise,
-  // "pointer being freed was not allocated" error will appear.
-  // If paddle is compiled with GPU, both CPU and GPU BuddyAllocator
-  // need to be used at first.
-  paddle::memory::Used(CPUPlace());
-  paddle::memory::Used(gpu_place);
-
-  std::unique_ptr<Executor> executor(new Executor(places));
-
-  executor->Run(init_pdesc_, &GetGlobalScope(), 0);
-  for (int batch_id = 0; batch_id < 3; batch_id++) {
-    SetFeedVariable<float>(inputs_, dims_);
-    executor->Run(pdesc_, &GetGlobalScope(), 0);
-  }
-}
-
-TEST_F(ExecutorTesterFeedAndFetch, GPU) {
-  std::vector<Place> places;
-  GPUPlace gpu_place(0);
-  places.push_back(gpu_place);
-  // We have a global Scope and BuddyAllocator, and we must ensure
-  // global BuddyAllocator is initialized before global Scope. Thus,
-  // global Scope will deconstruct before BuddyAllocator. Otherwise,
-  // "pointer being freed was not allocated" error will appear.
-  // If paddle is compiled with GPU, both CPU and GPU BuddyAllocator
-  // need to be used at first.
-  paddle::memory::Used(CPUPlace());
-  paddle::memory::Used(gpu_place);
-
-  std::unique_ptr<Executor> executor(new Executor(places));
-
-  for (int batch_id = 0; batch_id < 3; batch_id++) {
-    SetFeedVariable<float>(inputs_, dims_);
-    executor->Run(pdesc_, &GetGlobalScope(), 0);
-    std::vector<std::vector<float>> result = GetFetchVariable<float>();
-    PADDLE_ENFORCE_EQ(result.size(), inputs_.size());
-    for (size_t i = 0; i < result.size(); ++i) {
-      PADDLE_ENFORCE_EQ(result[i].size(), inputs_[i].size());
-      for (size_t j = 0; j < result[i].size(); ++j) {
-        PADDLE_ENFORCE_EQ(result[i][j], inputs_[i][j]);
-      }
-    }
-  }
-}
-
-DECLARE_double(fraction_of_gpu_memory_to_use);
-
-int main(int argc, char** argv) {
-  testing::InitGoogleTest(&argc, argv);
-  // Use less GPU memory for unittest.
-  FLAGS_fraction_of_gpu_memory_to_use = 0.25;
-  return RUN_ALL_TESTS();
-}
-
-#endif
diff --git a/paddle/framework/feed_fetch_method.h b/paddle/framework/feed_fetch_method.h
new file mode 100644
index 0000000000000000000000000000000000000000..826d180bfc5445224a8d9292f06eeb58d9a46b29
--- /dev/null
+++ b/paddle/framework/feed_fetch_method.h
@@ -0,0 +1,50 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "paddle/framework/scope.h"
+#include "paddle/framework/variable.h"
+
+namespace paddle {
+namespace framework {
+
+template <typename T>
+void SetFeedVariable(const LoDTensor& input, const std::string& var_name,
+                     size_t index) {
+  // If var_name Variable is not found in GlobalScope, a new variable will
+  // be created.
+  Variable* g_feed_value = GetGlobalScope().Var(var_name);
+  auto& feed_inputs =
+      *(g_feed_value->GetMutable<std::vector<paddle::framework::LoDTensor>>());
+  if (index >= feed_inputs.size()) {
+    feed_inputs.resize(index + 1);
+  }
+  // shared data with input tensor
+  feed_inputs[index].ShareDataWith<T>(input);
+  // set lod
+  feed_inputs[index].set_lod(input.lod());
+}
+
+LoDTensor& GetFetchVariable(const std::string& var_name, size_t index) {
+  // Since we want to fetch LodTensor from a variable, the variable must
+  // be created alreadly.
+  Variable* g_fetch_value = GetGlobalScope().FindVar(var_name);
+  auto& fetch_outputs =
+      *(g_fetch_value->GetMutable<std::vector<paddle::framework::LoDTensor>>());
+  PADDLE_ENFORCE_LT(index, fetch_outputs.size());
+  return fetch_outputs[index];
+}
+
+}  // namespace framework
+}  // namespace paddle
diff --git a/paddle/framework/op_desc.cc b/paddle/framework/op_desc.cc
index 7f7cebb026703d53220b3b4d6dc092d5bcbbe039..18fabe481dac9c1b70e7c30cb83ec5ee8ac47026 100644
--- a/paddle/framework/op_desc.cc
+++ b/paddle/framework/op_desc.cc
@@ -239,5 +239,19 @@ void OpDescBind::InferShape(const BlockDescBind &block) const {
   it->second(&ctx);
 }
 
+void OpDescBind::InferVarType(BlockDescBind *block) const {
+  auto &info = OpInfoMap::Instance().Get(this->Type());
+  if (info.infer_var_type_) {
+    info.infer_var_type_(*this, block);
+  } else {
+    // all output type is LoDTensor by default
+    for (auto &out_pair : this->outputs_) {
+      for (auto &out_var_name : out_pair.second) {
+        block->Var(out_var_name)->SetType(VarDesc::LOD_TENSOR);
+      }
+    }
+  }
+}
+
 }  // namespace framework
 }  // namespace paddle
diff --git a/paddle/framework/op_desc.h b/paddle/framework/op_desc.h
index 73b5cf846f702fe21277ae139156ec9784aa79b3..313bf538ac7c947c5e77ca0ead6bb53e6a156478 100644
--- a/paddle/framework/op_desc.h
+++ b/paddle/framework/op_desc.h
@@ -102,6 +102,8 @@ class OpDescBind {
 
   void InferShape(const BlockDescBind &block) const;
 
+  void InferVarType(BlockDescBind *block) const;
+
   void Flush();
 
  private:
diff --git a/paddle/framework/op_info.h b/paddle/framework/op_info.h
index c504f69e30bb899c183bd4281d2eadb50fd3b376..e926180780609c0a8ffc6270627835c50bbce782 100644
--- a/paddle/framework/op_info.h
+++ b/paddle/framework/op_info.h
@@ -19,7 +19,6 @@
 #include <unordered_map>
 
 #include "paddle/framework/attribute.h"
-#include "paddle/framework/op_desc.h"
 #include "paddle/framework/type_defs.h"
 #include "paddle/platform/macros.h"
 
@@ -31,6 +30,7 @@ struct OpInfo {
   GradOpMakerFN grad_op_maker_;
   OpProto* proto_{nullptr};
   OpAttrChecker* checker_{nullptr};
+  InferVarTypeFN infer_var_type_;
 
   bool HasOpProtoAndChecker() const {
     return proto_ != nullptr && checker_ != nullptr;
diff --git a/paddle/framework/op_registry.cc b/paddle/framework/op_registry.cc
index 504afbd5dbacf7185f92e0000d19666230e2fb42..c2f2438edf6daadf26cbc6db37f6668739ab1726 100644
--- a/paddle/framework/op_registry.cc
+++ b/paddle/framework/op_registry.cc
@@ -43,12 +43,13 @@ static VariableNameMap ConvertOpDescVarsToVarNameMap(
   return ret_val;
 }
 
-std::unique_ptr<OperatorBase> OpRegistry::CreateOp(const OpDesc& op_desc) {
+std::unique_ptr<OperatorBase> OpRegistry::CreateOp(const OpDesc& op_desc,
+                                                   ProgramDesc* program) {
   VariableNameMap inputs = ConvertOpDescVarsToVarNameMap(op_desc.inputs());
   VariableNameMap outputs = ConvertOpDescVarsToVarNameMap(op_desc.outputs());
   AttributeMap attrs;
   for (auto& attr : op_desc.attrs()) {
-    attrs[attr.name()] = GetAttrValue(attr);
+    attrs[attr.name()] = GetAttrValue(attr, program);
   }
 
   return CreateOp(op_desc.type(), inputs, outputs, attrs);
diff --git a/paddle/framework/op_registry.h b/paddle/framework/op_registry.h
index 226e8ddcd4b1a2630e0eea00ee6c9f6af6bd5d20..d25b4abccb11bfe4f208ffc5a3113f3624fee914 100644
--- a/paddle/framework/op_registry.h
+++ b/paddle/framework/op_registry.h
@@ -45,18 +45,15 @@ class Registrar {
 
 template <typename... ARGS>
 struct OperatorRegistrar : public Registrar {
-  explicit OperatorRegistrar(const char* op_type) : op_type(op_type) {
+  explicit OperatorRegistrar(const char* op_type) {
     PADDLE_ENFORCE(!OpInfoMap::Instance().Has(op_type),
                    "'%s' is registered more than once.", op_type);
     static_assert(sizeof...(ARGS) != 0,
                   "OperatorRegistrar should be invoked at least by OpClass");
+    OpInfo info;
     details::OperatorRegistrarRecursive<0, false, ARGS...>(op_type, &info);
     OpInfoMap::Instance().Insert(op_type, info);
   }
-
-  const char* op_type;
-
-  OpInfo info;
 };
 
 class OpRegistry {
@@ -77,21 +74,12 @@ class OpRegistry {
                                                 const VariableNameMap& outputs,
                                                 AttributeMap attrs);
 
-  static std::unique_ptr<OperatorBase> CreateOp(const OpDesc& op_desc);
+  static std::unique_ptr<OperatorBase> CreateOp(const OpDesc& op_desc,
+                                                ProgramDesc* program);
 
   static std::unique_ptr<OperatorBase> CreateOp(const OpDescBind& op_desc);
 };
 
-template <typename OpType, typename ProtoMakerType, typename GradOpType>
-class OpRegistrar : public Registrar {
- public:
-  explicit OpRegistrar(const char* op_type) { OpRegistrar(op_type, ""); }
-  OpRegistrar(const char* op_type, const char* grad_op_type) {
-    OpRegistry::RegisterOp<OpType, ProtoMakerType, GradOpType>(op_type,
-                                                               grad_op_type);
-  }
-};
-
 template <typename PlaceType, bool at_end, size_t I, typename... KernelType>
 struct OpKernelRegistrarFunctor;
 
diff --git a/paddle/framework/op_registry_test.cc b/paddle/framework/op_registry_test.cc
index b860fe6cac773d1e85adecc43f5dfec42b6c7661..6289125d7c782e542e5c55e1d4403836351b7e05 100644
--- a/paddle/framework/op_registry_test.cc
+++ b/paddle/framework/op_registry_test.cc
@@ -74,7 +74,7 @@ TEST(OpRegistry, CreateOp) {
   attr->set_type(paddle::framework::AttrType::FLOAT);
   attr->set_f(scale);
 
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   paddle::framework::Scope scope;
   paddle::platform::CPUDeviceContext dev_ctx;
   op->Run(scope, dev_ctx);
@@ -95,7 +95,7 @@ TEST(OpRegistry, IllegalAttr) {
 
   bool caught = false;
   try {
-    paddle::framework::OpRegistry::CreateOp(op_desc);
+    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   } catch (paddle::platform::EnforceNotMet err) {
     caught = true;
     std::string msg = "larger_than check fail";
@@ -115,7 +115,7 @@ TEST(OpRegistry, DefaultValue) {
 
   ASSERT_TRUE(op_desc.IsInitialized());
 
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   paddle::framework::Scope scope;
   paddle::platform::CPUDeviceContext dev_ctx;
   op->Run(scope, dev_ctx);
@@ -131,7 +131,7 @@ TEST(OpRegistry, CustomChecker) {
   // attr 'test_attr' is not set
   bool caught = false;
   try {
-    paddle::framework::OpRegistry::CreateOp(op_desc);
+    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   } catch (paddle::platform::EnforceNotMet err) {
     caught = true;
     std::string msg = "Attribute 'test_attr' is required!";
@@ -149,7 +149,7 @@ TEST(OpRegistry, CustomChecker) {
   attr->set_i(3);
   caught = false;
   try {
-    paddle::framework::OpRegistry::CreateOp(op_desc);
+    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   } catch (paddle::platform::EnforceNotMet err) {
     caught = true;
     std::string msg = "'test_attr' must be even!";
@@ -166,7 +166,7 @@ TEST(OpRegistry, CustomChecker) {
   attr->set_name("test_attr");
   attr->set_type(paddle::framework::AttrType::INT);
   attr->set_i(4);
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   paddle::platform::CPUDeviceContext dev_ctx;
   paddle::framework::Scope scope;
   op->Run(scope, dev_ctx);
diff --git a/paddle/framework/operator_test.cc b/paddle/framework/operator_test.cc
index d7890ac8d0af2171271a0cfccd356563c7604e72..c358f1a2b6ee3174b8c336ba1d212be7c5aa15c6 100644
--- a/paddle/framework/operator_test.cc
+++ b/paddle/framework/operator_test.cc
@@ -83,7 +83,7 @@ TEST(OperatorBase, all) {
   paddle::platform::CPUDeviceContext device_context;
   paddle::framework::Scope scope;
 
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   scope.Var("OUT1");
   ASSERT_EQ(paddle::framework::op_run_num, 0);
   op->Run(scope, device_context);
@@ -208,7 +208,7 @@ TEST(OpKernel, all) {
   paddle::platform::CPUDeviceContext cpu_device_context;
   paddle::framework::Scope scope;
 
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 0);
   op->Run(scope, cpu_device_context);
   ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 1);
@@ -244,7 +244,7 @@ TEST(OpKernel, multi_inputs) {
   scope.Var("y0")->GetMutable<Tensor>();
   scope.Var("y1")->GetMutable<Tensor>();
 
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   op->Run(scope, cpu_device_context);
 }
 
diff --git a/paddle/framework/program_desc.cc b/paddle/framework/program_desc.cc
index fcb7292884275d972377983cb3ba1bcd86fb8348..df846f115a5f6dc202b872349d258ac33366e518 100644
--- a/paddle/framework/program_desc.cc
+++ b/paddle/framework/program_desc.cc
@@ -18,27 +18,10 @@ limitations under the License. */
 namespace paddle {
 namespace framework {
 
-using ProgDescMap =
-    std::unordered_map<ProgramDesc *, std::unique_ptr<ProgramDescBind>>;
-static ProgDescMap *g_bind_map = nullptr;
-
-ProgramDescBind &ProgramDescBind::Instance(ProgramDesc *prog) {
-  if (g_bind_map == nullptr) {
-    g_bind_map = new ProgDescMap();
-  }
-  auto &map = *g_bind_map;
-  auto &ptr = map[prog];
-
-  if (ptr == nullptr) {
-    ptr.reset(new ProgramDescBind(prog));
-  }
-  return *ptr;
-}
-
 BlockDescBind *ProgramDescBind::AppendBlock(const BlockDescBind &parent) {
-  auto *b = prog_->add_blocks();
+  auto *b = prog_.add_blocks();
   b->set_parent_idx(parent.ID());
-  b->set_idx(prog_->blocks_size() - 1);
+  b->set_idx(prog_.blocks_size() - 1);
   blocks_.emplace_back(new BlockDescBind(this, b));
   return blocks_.back().get();
 }
@@ -47,14 +30,14 @@ ProgramDesc *ProgramDescBind::Proto() {
   for (auto &block : blocks_) {
     block->Flush();
   }
-  return prog_;
+  return &prog_;
 }
 
-ProgramDescBind::ProgramDescBind(ProgramDesc *prog) {
-  prog_ = prog;
-  for (auto &block : *prog->mutable_blocks()) {
-    blocks_.emplace_back(new BlockDescBind(this, &block));
-  }
+ProgramDescBind::ProgramDescBind() {
+  auto *block = prog_.mutable_blocks()->Add();
+  block->set_idx(0);
+  block->set_parent_idx(-1);
+  blocks_.emplace_back(new BlockDescBind(this, block));
 }
 }  // namespace framework
 }  // namespace paddle
diff --git a/paddle/framework/program_desc.h b/paddle/framework/program_desc.h
index f29b1c54e7160ac477229f64e5471939131a2d8f..514b62654df8f2179ab06e57074b1857d1200ed8 100644
--- a/paddle/framework/program_desc.h
+++ b/paddle/framework/program_desc.h
@@ -26,7 +26,7 @@ class BlockDescBind;
 
 class ProgramDescBind {
  public:
-  static ProgramDescBind &Instance(ProgramDesc *prog);
+  ProgramDescBind();
 
   BlockDescBind *AppendBlock(const BlockDescBind &parent);
 
@@ -37,10 +37,7 @@ class ProgramDescBind {
   ProgramDesc *Proto();
 
  private:
-  explicit ProgramDescBind(ProgramDesc *prog);
-
-  // Not owned
-  ProgramDesc *prog_;
+  ProgramDesc prog_;
 
   std::vector<std::unique_ptr<BlockDescBind>> blocks_;
 
diff --git a/paddle/framework/scope.cc b/paddle/framework/scope.cc
index 8f8a53eec8f947b088124a3f034fedb17fd86a48..5bf5e91f25ab1d920ae368eaf2000fce77d2eb07 100644
--- a/paddle/framework/scope.cc
+++ b/paddle/framework/scope.cc
@@ -65,16 +65,12 @@ void Scope::DropKids() {
   kids_.clear();
 }
 
-std::once_flag feed_variable_flag;
-
 framework::Scope& GetGlobalScope() {
-  static std::unique_ptr<framework::Scope> g_scope{nullptr};
-  std::call_once(feed_variable_flag, [&]() {
-    g_scope.reset(new framework::Scope());
-    g_scope->Var("feed_value");
-    g_scope->Var("fetch_value");
-  });
-  return *(g_scope.get());
+  static framework::Scope* g_scope = nullptr;
+  if (g_scope == nullptr) {
+    g_scope = new framework::Scope();
+  }
+  return *g_scope;
 }
 
 }  // namespace framework
diff --git a/paddle/framework/selected_rows.h b/paddle/framework/selected_rows.h
index f9f563051e264ae7ed7cf3c07c0065522b2bbe2e..cd9078137132669c7265ce3972f2c6df996fa366 100644
--- a/paddle/framework/selected_rows.h
+++ b/paddle/framework/selected_rows.h
@@ -10,6 +10,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */
 
 #pragma once
+#include "paddle/framework/lod_tensor.h"
 #include "paddle/framework/tensor.h"
 
 namespace paddle {
@@ -34,9 +35,9 @@ class SelectedRows {
 
   void set_height(int64_t height) { height_ = height; }
 
-  const std::vector<int64_t>& rows() const { return rows_; }
+  const Vector<int64_t>& rows() const { return rows_; }
 
-  void set_rows(const std::vector<int64_t>& rows) { rows_ = rows; }
+  void set_rows(const Vector<int64_t>& rows) { rows_ = rows; }
 
   DDim GetCompleteDims() const {
     std::vector<int64_t> dims = vectorize(value_->dims());
@@ -45,7 +46,10 @@ class SelectedRows {
   }
 
  private:
-  std::vector<int64_t> rows_;
+  // Notice: rows can be duplicate. We can have {0, 4, 7, 0, 5, 7, 9} here.
+  // SelectedRows are simplely concated when adding together. Until a
+  // SelectedRows add a Tensor, will the duplicate rows be handled.
+  Vector<int64_t> rows_;
   std::unique_ptr<Tensor> value_{nullptr};
   int64_t height_;
 };
diff --git a/paddle/framework/type_defs.h b/paddle/framework/type_defs.h
index 0d1564a7510ddf0106ff417fb0b487ddbde1ac2e..00da7289394cf18e013220a4bedde2c182f6a4a4 100644
--- a/paddle/framework/type_defs.h
+++ b/paddle/framework/type_defs.h
@@ -16,12 +16,18 @@
 #include <functional>
 #include <map>
 #include <memory>
+#include <string>
+#include <unordered_map>
+#include <unordered_set>
+#include <vector>
 #include "paddle/platform/variant.h"
 
 namespace paddle {
 namespace framework {
 class OperatorBase;
 class OpDescBind;
+class BlockDescBind;
+class BlockDesc;
 using VariableNameMap = std::map<std::string, std::vector<std::string>>;
 
 // The order should be as same as framework.proto
@@ -40,5 +46,8 @@ using GradOpMakerFN = std::function<std::vector<std::unique_ptr<OpDescBind>>(
     const OpDescBind&, const std::unordered_set<std::string>& /*no_grad_set*/,
     std::unordered_map<std::string, std::string>* /*grad_to_var*/)>;
 
+using InferVarTypeFN = std::function<void(const OpDescBind& /*op_desc*/,
+                                          BlockDescBind* /*block*/)>;
+
 }  // namespace framework
 }  // namespace paddle
diff --git a/paddle/framework/var_type_inference.h b/paddle/framework/var_type_inference.h
new file mode 100644
index 0000000000000000000000000000000000000000..32abbeb33479444c5e7a9889f4211f59af07f98f
--- /dev/null
+++ b/paddle/framework/var_type_inference.h
@@ -0,0 +1,29 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#pragma once
+#include "paddle/framework/type_defs.h"
+
+namespace paddle {
+namespace framework {
+
+class VarTypeInference {
+ public:
+  virtual ~VarTypeInference() {}
+  virtual void operator()(const OpDescBind& op_desc,
+                          BlockDescBind* block) const = 0;
+};
+
+}  // namespace framework
+}  // namespace paddle
diff --git a/paddle/framework/var_type_inference_test.cc b/paddle/framework/var_type_inference_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..918de1fd055e32888f71ffea1f33993ba1210e86
--- /dev/null
+++ b/paddle/framework/var_type_inference_test.cc
@@ -0,0 +1,104 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#include "paddle/framework/var_type_inference.h"
+#include "gtest/gtest.h"
+#include "paddle/framework/op_registry.h"
+#include "paddle/framework/operator.h"
+#include "paddle/framework/program_desc.h"
+
+namespace paddle {
+namespace framework {
+
+class SumOpMaker : public OpProtoAndCheckerMaker {
+ public:
+  SumOpMaker(OpProto *proto, OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "").AsDuplicable();
+    AddOutput("Out", "");
+    AddComment("");
+  }
+};
+
+class SumOpVarTypeInference : public VarTypeInference {
+ public:
+  void operator()(const OpDescBind &op_desc,
+                  BlockDescBind *block) const override {
+    auto &inputs = op_desc.Input("X");
+    auto default_var_type = VarDesc::SELECTED_ROWS;
+
+    bool any_input_is_lod_tensor = std::any_of(
+        inputs.begin(), inputs.end(), [block](const std::string &name) {
+          return block->Var(name)->GetType() == VarDesc::LOD_TENSOR;
+        });
+    if (any_input_is_lod_tensor) {
+      default_var_type = VarDesc::LOD_TENSOR;
+    }
+
+    auto out_var_name = op_desc.Output("Out").front();
+    block->Var(out_var_name)->SetType(default_var_type);
+  }
+};
+}  // namespace framework
+}  // namespace paddle
+
+REGISTER_OPERATOR(sum, paddle::framework::NOP, paddle::framework::SumOpMaker,
+                  paddle::framework::SumOpVarTypeInference);
+REGISTER_OPERATOR(sum_without_infer_var_type, paddle::framework::NOP,
+                  paddle::framework::SumOpMaker);
+
+namespace paddle {
+namespace framework {
+
+TEST(InferVarType, sum_op) {
+  ProgramDescBind prog;
+  auto *op = prog.Block(0)->AppendOp();
+  op->SetType("sum");
+  op->SetInput("X", {"test_a", "test_b", "test_c"});
+  op->SetOutput("Out", {"test_out"});
+
+  prog.Block(0)->Var("test_a")->SetType(VarDesc::SELECTED_ROWS);
+  prog.Block(0)->Var("test_b")->SetType(VarDesc::SELECTED_ROWS);
+  prog.Block(0)->Var("test_c")->SetType(VarDesc::SELECTED_ROWS);
+  prog.Block(0)->Var("test_out");
+
+  op->InferVarType(prog.Block(0));
+
+  ASSERT_EQ(VarDesc::SELECTED_ROWS, prog.Block(0)->Var("test_out")->GetType());
+
+  prog.Block(0)->Var("test_b")->SetType(VarDesc::LOD_TENSOR);
+  op->InferVarType(prog.Block(0));
+  ASSERT_EQ(VarDesc::LOD_TENSOR, prog.Block(0)->Var("test_out")->GetType());
+}
+
+TEST(InferVarType, sum_op_without_infer_var_type) {
+  ProgramDescBind prog;
+  auto *op = prog.Block(0)->AppendOp();
+  op->SetType("sum_without_infer_var_type");
+  op->SetInput("X", {"test2_a", "test2_b", "test2_c"});
+  op->SetOutput("Out", {"test2_out"});
+
+  prog.Block(0)->Var("test2_a")->SetType(VarDesc::SELECTED_ROWS);
+  prog.Block(0)->Var("test2_b")->SetType(VarDesc::SELECTED_ROWS);
+  prog.Block(0)->Var("test2_c")->SetType(VarDesc::SELECTED_ROWS);
+  prog.Block(0)->Var("test2_out");
+
+  op->InferVarType(prog.Block(0));
+
+  ASSERT_EQ(VarDesc_VarType_LOD_TENSOR,
+            prog.Block(0)->Var("test2_out")->GetType());
+}
+
+}  // namespace framework
+}  // namespace paddle
\ No newline at end of file
diff --git a/paddle/gserver/gradientmachines/NeuralNetwork.cpp b/paddle/gserver/gradientmachines/NeuralNetwork.cpp
index dcf0acb5a2cc7698625a2e254d4c12a32bc9631d..dbadc352a4ccd7483bf67e1025c212f514e32a24 100644
--- a/paddle/gserver/gradientmachines/NeuralNetwork.cpp
+++ b/paddle/gserver/gradientmachines/NeuralNetwork.cpp
@@ -21,6 +21,10 @@ limitations under the License. */
 #include "paddle/utils/Logging.h"
 #include "paddle/utils/Stat.h"
 
+#ifdef PADDLE_USE_MKLDNN
+#include "paddle/gserver/layers/MKLDNNLayer.h"
+#endif
+
 #ifndef PADDLE_MOBILE_INFERENCE
 #include "MultiNetwork.h"
 #include "RecurrentGradientMachine.h"
@@ -300,6 +304,17 @@ void NeuralNetwork::backward(const UpdateCallback& callback) {
   }
 }
 
+void NeuralNetwork::finish() {
+#ifdef PADDLE_USE_MKLDNN
+  FOR_EACH_R(layer, layers_) {
+    MKLDNNLayerPtr dnnLayer = std::dynamic_pointer_cast<MKLDNNLayer>(*layer);
+    if (dnnLayer) {
+      dnnLayer->convertWeightsToPaddle();
+    }
+  }
+#endif
+}
+
 Argument NeuralNetwork::getLayerOutput(const std::string& layerName) {
   return getLayer(layerName)->getOutput();
 }
diff --git a/paddle/gserver/gradientmachines/NeuralNetwork.h b/paddle/gserver/gradientmachines/NeuralNetwork.h
index 56a1ec78460731554c9b47cf3f517f7654dc314f..6888380290074318fe7f94d168b2931e776dda47 100644
--- a/paddle/gserver/gradientmachines/NeuralNetwork.h
+++ b/paddle/gserver/gradientmachines/NeuralNetwork.h
@@ -134,6 +134,9 @@ public:
 
   const std::string& getName() const { return subModelName_; }
 
+  /// some finish work, like convert the weight format of MKLDNNLayers
+  void finish();
+
 protected:
   /**
    * The constructor of NeuralNetwork.
diff --git a/paddle/gserver/layers/MKLDNNConvLayer.cpp b/paddle/gserver/layers/MKLDNNConvLayer.cpp
index 8b67a1ef4ffdd42559f8078873ed135751d56674..26810a648343d6203f7937740641325ae8ea6879 100644
--- a/paddle/gserver/layers/MKLDNNConvLayer.cpp
+++ b/paddle/gserver/layers/MKLDNNConvLayer.cpp
@@ -313,6 +313,7 @@ void MKLDNNConvLayer::resetOutValue(
       cvtOutVal_ = MKLDNNMatrix::createReorder(out, cpuOutVal_);
       CHECK(cvtOutVal_) << "should not be empty";
     } else {
+      cpuOut->setData(output_.value->getData());
       cpuOutVal_ = out;
     }
     // when output is cpu device, change the mkldnn output value and make them
@@ -456,17 +457,18 @@ void MKLDNNConvLayer::resetOutGrad(
     MKLDNNLayer::resetOutGrad(out, outVal_->getPrimitiveDesc());
   } else {
     const MatrixPtr& cpuOut = getOutput(CPU_DEVICE).grad;
+    // always share the same grad data of CPU output
+    // then the activation can get the right grad from output_.grad
+    output_.grad->setData(cpuOut->getData());
     // same PrimitiveDesc with cpuInVal_
     CHECK(cpuOutVal_);
     cpuOutGrad_ = MKLDNNMatrix::create(cpuOut, cpuOutVal_->getPrimitiveDesc());
     // create reorder if primitive desc does not match
     if (cpuOutGrad_->getPrimitiveDesc() != outVal_->getPrimitiveDesc()) {
-      out = MKLDNNMatrix::create(output_.grad, outVal_->getPrimitiveDesc());
+      out = MKLDNNMatrix::create(nullptr, outVal_->getPrimitiveDesc());
       cvtOutGrad_ = MKLDNNMatrix::createReorder(cpuOutGrad_, out);
       CHECK(cvtOutGrad_);
     } else {
-      // share the same data of CPU output
-      output_.grad->setData(cpuOut->getData());
       out = cpuOutGrad_;
     }
   }
diff --git a/paddle/gserver/layers/MKLDNNLayer.h b/paddle/gserver/layers/MKLDNNLayer.h
index 5f9923da769781287e39a3aaaf92248dfe09f225..4e2753eba2350d2c3df81b57fe98270a3c38cb24 100644
--- a/paddle/gserver/layers/MKLDNNLayer.h
+++ b/paddle/gserver/layers/MKLDNNLayer.h
@@ -46,6 +46,9 @@ protected:
   // backward also need reset after reset forward handle
   bool needResetBwd_;
 
+  // is output only mkldnn
+  bool outputOnlyMKLDNN_;
+
   // mkldnn engine, stream and primivtives
   mkldnn::engine engine_;
   std::shared_ptr<MKLDNNStream> stream_;
@@ -141,6 +144,9 @@ public:
         updateInputData();
       }
 
+      if (!outputOnlyMKLDNN_) {
+        clearGrads();
+      }
       stream_->submit(pipelineFwd_);
     }
 
@@ -389,7 +395,8 @@ protected:
       CHECK_EQ(outputOtherDevice_[i].deviceId, CPU_DEVICE)
           << "Only support other device is CPU yet";
     }
-    return outputOtherDevice_.size() == 0;
+    outputOnlyMKLDNN_ = outputOtherDevice_.size() == 0;
+    return outputOnlyMKLDNN_;
   }
 
   /**
@@ -398,6 +405,16 @@ protected:
   void setDevice(int id) { deviceId_ = id; }
 
 private:
+  /**
+   * clear all grad
+   */
+  void clearGrads() {
+    output_.grad->zeroMem();
+    for (size_t i = 0; i < outputOtherDevice_.size(); i++) {
+      outputOtherDevice_[i].grad->zeroMem();
+    }
+  }
+
   /**
    * Set deviceId of the params used in this layer.
    */
diff --git a/paddle/gserver/layers/MKLDNNPoolLayer.cpp b/paddle/gserver/layers/MKLDNNPoolLayer.cpp
index 5606aae80ce8e9a1e571d3c057c471b26a59d032..0e53e2d1b7e6691909955eeacd345981a9960ec6 100644
--- a/paddle/gserver/layers/MKLDNNPoolLayer.cpp
+++ b/paddle/gserver/layers/MKLDNNPoolLayer.cpp
@@ -146,6 +146,7 @@ void MKLDNNPoolLayer::resetOutValue(MKLDNNMatrixPtr& out) {
       cvtOutVal_ = MKLDNNMatrix::createReorder(out, cpuOutVal_);
       CHECK(cvtOutVal_) << "should not be emptry";
     } else {
+      cpuOut->setData(output_.value->getData());
       cpuOutVal_ = out;
     }
     output_.value = std::dynamic_pointer_cast<Matrix>(cpuOutVal_);
@@ -213,15 +214,16 @@ void MKLDNNPoolLayer::resetOutGrad(MKLDNNMatrixPtr& out) {
     MKLDNNLayer::resetOutGrad(out, outVal_->getPrimitiveDesc());
   } else {
     const MatrixPtr& cpuOut = getOutput(CPU_DEVICE).grad;
+    // always share the same grad data of CPU output
+    // then the activation can get the right grad from output_.grad
+    output_.grad->setData(cpuOut->getData());
     cpuOutGrad_ = MKLDNNMatrix::create(
         cpuOut, memory::dims{bs_, oc_, oh_, ow_}, format::nchw, engine_);
     if (cpuOutGrad_->getPrimitiveDesc() != outVal_->getPrimitiveDesc()) {
-      out = MKLDNNMatrix::create(output_.grad, outVal_->getPrimitiveDesc());
+      out = MKLDNNMatrix::create(nullptr, outVal_->getPrimitiveDesc());
       cvtOutGrad_ = MKLDNNMatrix::createReorder(cpuOutGrad_, out);
       CHECK(cvtOutGrad_) << "should not be emptry";
     } else {
-      // share the same data of CPU output
-      output_.grad->setData(cpuOut->getData());
       out = cpuOutGrad_;
     }
   }
diff --git a/paddle/gserver/tests/CMakeLists.txt b/paddle/gserver/tests/CMakeLists.txt
index fcee19415c13e9731bd47eb53bbff9b52cf6450b..329536afaf6d69676e8c39fdf8b6b8cb87ade5fa 100644
--- a/paddle/gserver/tests/CMakeLists.txt
+++ b/paddle/gserver/tests/CMakeLists.txt
@@ -26,7 +26,10 @@ if(WITH_MKLDNN)
         test_MKLDNN.cpp
         MKLDNNTester.cpp
         LayerGradUtil.cpp)
-    add_test(NAME test_MKLDNN COMMAND test_MKLDNN)
+    add_test(NAME test_MKLDNN
+        COMMAND .set_python_path.sh -d ${PADDLE_SOURCE_DIR}/python
+            ${CMAKE_CURRENT_BINARY_DIR}/test_MKLDNN
+            WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}/paddle)
 endif()
 
 ################ test_CRFLayerGrad ####################
diff --git a/paddle/gserver/tests/MKLDNNTester.cpp b/paddle/gserver/tests/MKLDNNTester.cpp
index eaebdd671cfa1b37e5efe149588ca23fdc402a8e..3bf6a9e176cc1235aa5ddefcedd4253e6afc1342 100644
--- a/paddle/gserver/tests/MKLDNNTester.cpp
+++ b/paddle/gserver/tests/MKLDNNTester.cpp
@@ -15,6 +15,7 @@ limitations under the License. */
 #include "MKLDNNTester.h"
 #include "paddle/gserver/layers/MKLDNNBase.h"
 #include "paddle/gserver/layers/MKLDNNLayer.h"
+#include "paddle/trainer/Trainer.h"
 
 namespace paddle {
 
@@ -315,6 +316,7 @@ void MKLDNNTester::runOnce() {
     auto& value = para->getBuf(PARAMETER_VALUE);
     real lr = 1e-3;
     value->add(*grad, lr);
+    grad->zeroMem();
   };
   randomTopDiffs();
   dnnLayer_->backward(updateCallback);
@@ -411,4 +413,143 @@ void MKLDNNTester::run(const TestConfig& dnn,
   }
 }
 
+void MKLDNNTester::initArgument(DataIn& data,
+                                const std::string& configPath,
+                                const size_t iter) {
+  TrainerConfigHelper config(configPath);
+  size_t batchSize = config.getOptConfig().batch_size();
+  data.inArgs.resize(iter);
+  data.outGrads.resize(iter);
+  data.paraValues.clear();
+  for (const auto& layer_name : config.getModelConfig().input_layer_names()) {
+    auto layer_config = std::find_if(config.getModelConfig().layers().begin(),
+                                     config.getModelConfig().layers().end(),
+                                     [=](const LayerConfig& layer_config) {
+                                       return layer_config.name() == layer_name;
+                                     });
+    CHECK(layer_config != config.getModelConfig().layers().end());
+
+    size_t layerSize = layer_config->size();
+    for (size_t i = 0; i < iter; ++i) {
+      Argument arg;
+      arg.value = Matrix::create(batchSize, layerSize, false, false);
+      arg.grad = Matrix::create(batchSize, layerSize, false, false);
+      arg.value->randomizeUniform();
+      arg.value->add(-0.5);
+      arg.value->sigmoid(*arg.value);
+      arg.grad->zeroMem();
+      arg.ids = VectorT<int>::create(batchSize, false);
+      arg.ids->rand(layerSize);
+      generateSequenceStartPositions(batchSize, arg.sequenceStartPositions);
+      data.inArgs[i].push_back(arg);
+    }
+  }
+
+  for (const auto& layer_name : config.getModelConfig().output_layer_names()) {
+    auto layer_config = std::find_if(config.getModelConfig().layers().begin(),
+                                     config.getModelConfig().layers().end(),
+                                     [=](const LayerConfig& layer_config) {
+                                       return layer_config.name() == layer_name;
+                                     });
+    CHECK(layer_config != config.getModelConfig().layers().end());
+
+    size_t layerSize = layer_config->size();
+    for (size_t i = 0; i < iter; ++i) {
+      MatrixPtr grad = Matrix::create(batchSize, layerSize, false, false);
+      grad->randomizeUniform();
+      data.outGrads[i].push_back(grad);
+    }
+  }
+
+  for (const auto& para_config : config.getModelConfig().parameters()) {
+    VectorPtr value = Vector::create(para_config.size(), false);
+    value->randnorm(0, 2);
+    data.paraValues.push_back(value);
+  }
+}
+
+void MKLDNNTester::getOutResult(const std::string& configPath,
+                                DataIn& in,
+                                DataOut& out,
+                                bool use_mkldnn,
+                                size_t iter) {
+  FLAGS_use_gpu = false;
+  FLAGS_use_mkldnn = use_mkldnn;
+  *ThreadLocalRand::getSeed() = 1;
+  srand(1);
+
+  Trainer trainer;
+  auto config = std::make_shared<TrainerConfigHelper>(configPath);
+  trainer.init(config, false);
+  auto gradientMachine = trainer.getGradientMachine();
+  std::vector<ParameterPtr> parameters = gradientMachine->getParameters();
+  for (size_t i = 0; i < in.paraValues.size(); i++) {
+    parameters[i]->getBuf(PARAMETER_VALUE)->copyFrom(*in.paraValues[i]);
+  }
+  UpdateCallback simpleUpdate = [](Parameter* para) {
+    auto& grad = para->getBuf(PARAMETER_GRADIENT);
+    auto& value = para->getBuf(PARAMETER_VALUE);
+    real lr = 1e-2;
+    value->add(*grad, lr);
+    grad->zeroMem();
+  };
+
+  vector<Argument> outArgs;
+  gradientMachine->start();
+  out.outValues.clear();
+  out.paraValues.clear();
+  for (size_t i = 0; i < iter; ++i) {
+    VLOG(MKLDNN_TESTS) << "runing iteration " << i;
+    gradientMachine->forward(in.inArgs[i], &outArgs, PASS_TRAIN);
+    // save forward result
+    for (size_t k = 0; k < outArgs.size(); k++) {
+      MatrixPtr value = Matrix::create(outArgs[k].value->getHeight(),
+                                       outArgs[k].value->getWidth(),
+                                       false,
+                                       false);
+      value->copyFrom(*outArgs[k].value);
+      out.outValues.push_back(value);
+    }
+
+    // random backward input
+    for (size_t k = 0; k < outArgs.size(); k++) {
+      outArgs[k].grad->copyFrom(*in.outGrads[i][k]);
+    }
+    gradientMachine->backward(simpleUpdate);
+  }
+  gradientMachine->finish();
+
+  // save param value
+  for (size_t i = 0; i < in.paraValues.size(); i++) {
+    VectorPtr val = Vector::create(
+        parameters[i]->getBuf(PARAMETER_VALUE)->getSize(), false);
+    val->copyFrom(*parameters[i]->getBuf(PARAMETER_VALUE));
+    out.paraValues.push_back(val);
+  }
+}
+
+void MKLDNNTester::compareResult(DataOut& ref, DataOut& dnn, float eps) {
+  CHECK_EQ(ref.outValues.size(), dnn.outValues.size());
+  CHECK_EQ(ref.paraValues.size(), dnn.paraValues.size());
+  for (size_t i = 0; i < ref.outValues.size(); i++) {
+    EXPECT_LE(fabs(compareMatrix(ref.outValues[i], dnn.outValues[i])), eps);
+  }
+  for (size_t i = 0; i < ref.paraValues.size(); i++) {
+    EXPECT_LE(fabs(compareVector(ref.paraValues[i], dnn.paraValues[i])), eps);
+  }
+}
+
+void MKLDNNTester::runBranchesTest(const std::string& configPath,
+                                   size_t iter,
+                                   float eps) {
+  DataIn in;
+  initArgument(in, configPath, iter);
+
+  DataOut outCpu, outDnn;
+  getOutResult(configPath, in, outCpu, false, iter);
+  getOutResult(configPath, in, outDnn, true, iter);
+
+  compareResult(outCpu, outDnn, eps);
+}
+
 }  //  namespace paddle
diff --git a/paddle/gserver/tests/MKLDNNTester.h b/paddle/gserver/tests/MKLDNNTester.h
index 171d176ee757f1164c38d86273bdf9e5aefeda06..51abfcb67e2ec35fe1b0179e742a7d18f08f8a2c 100644
--- a/paddle/gserver/tests/MKLDNNTester.h
+++ b/paddle/gserver/tests/MKLDNNTester.h
@@ -33,6 +33,17 @@ class MKLDNNTester {
     NUM = 2,  // Number of total
   };
 
+  struct DataIn {
+    std::vector<std::vector<Argument>> inArgs;
+    std::vector<std::vector<MatrixPtr>> outGrads;
+    std::vector<VectorPtr> paraValues;
+  };
+
+  struct DataOut {
+    std::vector<MatrixPtr> outValues;
+    std::vector<VectorPtr> paraValues;
+  };
+
 protected:
   std::vector<TestConfig> configs_;
   vector<string> layerNames_;
@@ -74,7 +85,17 @@ public:
            float epsilon = 1e-4,
            bool log = false,
            int level = MKLDNN_ALL);
-  void setLogLevel(int lvl) { lvl_ = lvl; }
+  static void runBranchesTest(const std::string& configPath,
+                              size_t iter = 3,
+                              float eps = 1e-4);
+  static void initArgument(DataIn& data,
+                           const std::string& configPath,
+                           size_t iter = 3);
+  static void getOutResult(const std::string& configPath,
+                           DataIn& in,
+                           DataOut& out,
+                           bool use_mkldnn,
+                           size_t iter = 3);
 
 private:
   void reset(const TestConfig& dnn, const TestConfig& ref, size_t batchSize);
@@ -101,8 +122,9 @@ private:
   void saveWgt(const vector<ParameterPtr>& from, vector<VectorPtr>& to);
   void restoreWgt(const vector<VectorPtr>& from, vector<ParameterPtr>& to);
 
-  double compareMatrix(const MatrixPtr& m1, const MatrixPtr& m2);
-  double compareVector(const VectorPtr& v1, const VectorPtr& v2);
+  static double compareMatrix(const MatrixPtr& m1, const MatrixPtr& m2);
+  static double compareVector(const VectorPtr& v1, const VectorPtr& v2);
+  static void compareResult(DataOut& ref, DataOut& dnn, float eps = 1e-4);
 
   /**
    * Get delta percent
@@ -111,11 +133,11 @@ private:
    * else return sum(abs(a-b)) / sum(abs(b))
    * The return value should be smaller than eps when passing.
    */
-  double getDelta(const real* d1,
-                  const real* d2,
-                  size_t len,
-                  const float failRate = 1e-3,
-                  const float thres = 0.1);
+  static double getDelta(const real* d1,
+                         const real* d2,
+                         size_t len,
+                         const float failRate = 1e-3,
+                         const float thres = 0.1);
 };
 
 }  //  namespace paddle
diff --git a/paddle/gserver/tests/mkldnn_branches_conv.conf b/paddle/gserver/tests/mkldnn_branches_conv.conf
new file mode 100644
index 0000000000000000000000000000000000000000..2628509db43e6a5f69a4f5ea956bffdc2837e32a
--- /dev/null
+++ b/paddle/gserver/tests/mkldnn_branches_conv.conf
@@ -0,0 +1,56 @@
+# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddle.trainer_config_helpers import *
+
+settings(batch_size=16)
+channels = get_config_arg("channels", int, 2)
+
+def two_conv(input, group_name):
+  out1 = img_conv_layer(input=input,
+            name=group_name+'_conv1',
+            filter_size=1,
+            num_filters=channels,
+            padding=0,
+            shared_biases=True,
+            act=ReluActivation())
+
+  out2 = img_conv_layer(input=input,
+            name=group_name+'_conv2',
+            filter_size=3,
+            num_filters=channels,
+            padding=1,
+            shared_biases=True,
+            act=ReluActivation())
+  return out1, out2
+
+data = data_layer(name ="input", size=channels*16*16)
+
+conv = img_conv_layer(input=data,
+            num_channels=channels,
+            filter_size=3,
+            num_filters=channels,
+            padding=1,
+            shared_biases=True,
+            act=ReluActivation())
+
+a1, a2 = two_conv(input=conv, group_name='a')
+
+concat = concat_layer(input=[a1, a2])
+
+b1, b2 = two_conv(input=conv, group_name='b')
+
+addto = addto_layer(input=[b1, b2])
+
+outputs([concat, addto])
diff --git a/paddle/gserver/tests/test_MKLDNN.cpp b/paddle/gserver/tests/test_MKLDNN.cpp
index a70b2f17f4f1130322f3c50d244f70fdcf34468b..3571fbb9e335fc6652bdbfc3f9e35beabda5044f 100644
--- a/paddle/gserver/tests/test_MKLDNN.cpp
+++ b/paddle/gserver/tests/test_MKLDNN.cpp
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */
 
 #include <gtest/gtest.h>
+#include <paddle/utils/PythonUtil.h>
 #include <string>
 #include <vector>
 #include "MKLDNNTester.h"
@@ -40,12 +41,13 @@ DECLARE_bool(use_mkldnn);
 struct testFcDesc {
   int bs;
   int ic;
-  int oc;
   int ih, iw;  // oh == ow == 1
+  int oc;
 };
 
 static void getMKLDNNFcConfig(TestConfig& cfg, const testFcDesc& pm) {
   cfg.layerConfig.set_type("mkldnn_fc");
+  cfg.layerConfig.set_active_type("relu");
   cfg.layerConfig.set_size(pm.oc);
   cfg.inputDefs.push_back(
       {INPUT_DATA,
@@ -86,6 +88,7 @@ struct testConvDesc {
 
 static void getMKLDNNConvConfig(TestConfig& cfg, const testConvDesc& pm) {
   cfg.layerConfig.set_type("mkldnn_conv");
+  cfg.layerConfig.set_active_type("relu");
   cfg.layerConfig.set_num_filters(pm.oc);
   cfg.layerConfig.set_size(pm.oc * pm.oh * pm.ow);
   cfg.layerConfig.set_shared_biases(true);
@@ -158,6 +161,7 @@ struct testPoolDesc {
 
 static void getMKLDNNPoolConfig(TestConfig& cfg, const testPoolDesc& pm) {
   cfg.layerConfig.set_type("mkldnn_pool");
+  cfg.layerConfig.set_active_type("relu");
   cfg.layerConfig.set_size(pm.ic * pm.oh * pm.ow);
   cfg.inputDefs.push_back(
       {INPUT_DATA,
@@ -244,13 +248,26 @@ TEST(MKLDNNActivation, Activations) {
   }
 }
 
-// TODO(TJ): add branch test
+DECLARE_string(config_args);
+TEST(MKLDNNLayer, branches) {
+  std::vector<std::string> cases = {"conv"};
+  for (auto name : cases) {
+    std::string config = "./gserver/tests/mkldnn_branches_" + name + ".conf";
+    for (auto channels : {2, 32}) {
+      std::ostringstream oss;
+      oss << "channels=" << channels;
+      FLAGS_config_args = oss.str();
+      MKLDNNTester::runBranchesTest(config);
+    }
+  }
+}
 
 int main(int argc, char** argv) {
   testing::InitGoogleTest(&argc, argv);
   FLAGS_use_gpu = false;
   FLAGS_use_mkldnn = true;
   initMain(argc, argv);
+  initPython(argc, argv);
   FLAGS_thread_local_rand_use_global_seed = true;
   srand(1);
   return RUN_ALL_TESTS();
diff --git a/paddle/memory/memory.cc b/paddle/memory/memory.cc
index 5087c02385f7f37d78d134b739f3f22522977fb8..8e561528f0e7e6ff524fc51b4776efc4e5bd28cd 100644
--- a/paddle/memory/memory.cc
+++ b/paddle/memory/memory.cc
@@ -14,11 +14,6 @@ limitations under the License. */
 
 #include "paddle/memory/memory.h"
 
-#include <algorithm>  // for transform
-#include <cstring>    // for memcpy
-#include <memory>     // for unique_ptr
-#include <mutex>      // for call_once
-
 #include "glog/logging.h"
 
 #include "paddle/memory/detail/buddy_allocator.h"
@@ -32,19 +27,14 @@ namespace memory {
 
 using BuddyAllocator = detail::BuddyAllocator;
 
-std::once_flag cpu_allocator_flag;
-std::once_flag gpu_allocator_flag;
-
 BuddyAllocator* GetCPUBuddyAllocator() {
-  static std::unique_ptr<BuddyAllocator> a{nullptr};
-
-  std::call_once(cpu_allocator_flag, [&]() {
-    a.reset(new BuddyAllocator(new detail::CPUAllocator,
-                               platform::CpuMinChunkSize(),
-                               platform::CpuMaxChunkSize()));
-  });
-
-  return a.get();
+  static detail::BuddyAllocator* a = nullptr;
+  if (a == nullptr) {
+    a = new detail::BuddyAllocator(new detail::CPUAllocator,
+                                   platform::CpuMinChunkSize(),
+                                   platform::CpuMaxChunkSize());
+  }
+  return a;
 }
 
 template <>
@@ -65,35 +55,24 @@ size_t Used<platform::CPUPlace>(platform::CPUPlace place) {
 #ifdef PADDLE_WITH_CUDA
 
 BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) {
-  using BuddyAllocVec = std::vector<BuddyAllocator*>;
-  static std::unique_ptr<BuddyAllocVec, void (*)(BuddyAllocVec * p)> as{
-      new BuddyAllocVec, [](BuddyAllocVec* p) {
-        std::for_each(p->begin(), p->end(),
-                      [](BuddyAllocator* p) { delete p; });
-      }};
-
-  // GPU buddy allocators
-  auto& allocators = *as.get();
-
-  // GPU buddy allocator initialization
-  std::call_once(gpu_allocator_flag, [&]() {
+  static BuddyAllocator** as = NULL;
+  if (as == NULL) {
     int gpu_num = platform::GetCUDADeviceCount();
-    allocators.reserve(gpu_num);
+    as = new BuddyAllocator*[gpu_num];
     for (int gpu = 0; gpu < gpu_num; gpu++) {
       platform::SetDeviceId(gpu);
-      allocators.emplace_back(new BuddyAllocator(new detail::GPUAllocator,
-                                                 platform::GpuMinChunkSize(),
-                                                 platform::GpuMaxChunkSize()));
+      as[gpu] = new BuddyAllocator(new detail::GPUAllocator,
+                                   platform::GpuMinChunkSize(),
+                                   platform::GpuMaxChunkSize());
     }
     VLOG(3) << "\n\nNOTE: each GPU device use "
             << FLAGS_fraction_of_gpu_memory_to_use * 100 << "% of GPU memory.\n"
             << "You can set environment variable '"
             << platform::kEnvFractionGpuMemoryToUse
             << "' to change the fraction of GPU usage.\n\n";
-  });
-
+  }
   platform::SetDeviceId(gpu_id);
-  return allocators[gpu_id];
+  return as[gpu_id];
 }
 
 template <>
diff --git a/paddle/operators/accuracy_op.cc b/paddle/operators/accuracy_op.cc
index c5fb113e0f4455ec852fd48542ea2917398df5f3..037bb49abc6c272eed2d27ea5d8425866ef9a1d5 100644
--- a/paddle/operators/accuracy_op.cc
+++ b/paddle/operators/accuracy_op.cc
@@ -21,7 +21,6 @@ class AccuracyOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Inference"),
                    "Input(Inference) of AccuracyOp should not be null.");
diff --git a/paddle/operators/activation_op.cc b/paddle/operators/activation_op.cc
index 84c3775b4fc2602e5df9bb454d21b318b8fda493..ee4f9b0ef29cc73907bc09fb6014850cb4e58a67 100644
--- a/paddle/operators/activation_op.cc
+++ b/paddle/operators/activation_op.cc
@@ -21,7 +21,6 @@ class ActivationOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     ctx->SetOutputDim("Y", ctx->GetInputDim("X"));
     ctx->ShareLoD("X", /*->*/ "Y");
@@ -32,7 +31,6 @@ class ActivationOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("Y"));
   }
diff --git a/paddle/operators/adadelta_op.cc b/paddle/operators/adadelta_op.cc
index cf1bca1658f3fb4cd14465791a0f94865871a120..24e419b532d97bc16ab96dad418d6e73c03f30a0 100644
--- a/paddle/operators/adadelta_op.cc
+++ b/paddle/operators/adadelta_op.cc
@@ -21,7 +21,6 @@ class AdadeltaOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Param"),
                    "Input(Param) of AdadeltaOp should not be null.");
diff --git a/paddle/operators/adagrad_op.cc b/paddle/operators/adagrad_op.cc
index a17747efb79d76a6beffb27bfc03ee1b62c5618d..bc081f87dcab0dcd8ef329dcb1f66b627c82b4a2 100644
--- a/paddle/operators/adagrad_op.cc
+++ b/paddle/operators/adagrad_op.cc
@@ -21,7 +21,6 @@ class AdagradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Param"),
                    "Input(Param) of AdagradOp should not be null.");
diff --git a/paddle/operators/adam_op.cc b/paddle/operators/adam_op.cc
index 293b37b7750427cb88efb6dfd5a02dcf7ede24ac..e3db70ea129880434add21e71d15e5129c4551bd 100644
--- a/paddle/operators/adam_op.cc
+++ b/paddle/operators/adam_op.cc
@@ -21,7 +21,6 @@ class AdamOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Param"),
                    "Input(Param) of AdamOp should not be null.");
diff --git a/paddle/operators/adamax_op.cc b/paddle/operators/adamax_op.cc
index 5cf727742c770f8f0d59cd49345c5c055722a56c..e848333ef8a819648cc3056ae2f4a0e33fc58405 100644
--- a/paddle/operators/adamax_op.cc
+++ b/paddle/operators/adamax_op.cc
@@ -21,7 +21,6 @@ class AdamaxOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Param"),
                    "Input(Param) of AdamaxOp should not be null.");
diff --git a/paddle/operators/batch_norm_op.md b/paddle/operators/batch_norm_op.md
new file mode 100644
index 0000000000000000000000000000000000000000..80948adf2b9047a9685dbdd90b2296b5a955f9c1
--- /dev/null
+++ b/paddle/operators/batch_norm_op.md
@@ -0,0 +1,134 @@
+# Batch Normalization
+
+## What is batch normalization
+
+Batch normalization is a frequently-used method in deep network training. It adjusts the mean and variance of a layer's output, and make the data distribution easier for next layer's training. 
+
+The principle of batch normalization can be summarized into a simple function:
+
+```
+y = (x - E[x]) / STD[x]) * scale + bias
+```
+
+`x` is a batch of output data of a certain layer. `E[x]` and `STD[x]` is the mean and standard deviation of `x`, respectively。 `scale` and `bias` are two trainable parameters. The training of batch normalization layer equals to the learning of best values of `scale` and `bias`.
+
+In our design, we use a single operator(`batch_norm_op`) to implement the whole batch normalization in C++, and wrap it as a layer in Python.
+
+## Differences with normal operators
+
+`batch_norm_op` is a single operator. However, there are a few differences between `BatchNormOp` and normal operators, which we shall take into consideration in our design.
+
+1. `batch_norm_op` shall behave differently in training and inferencing. For example, during inferencing, there is no batch data and it's impossible to compute `E[x]` and `STD[x]`, so we have to use an `estimated_mean` and an `estimated_variance` instead of them. These require our framework to be able to inform operators current running type (training/inferencing), then operators can switch their behaviors.
+
+2. `batch_norm_op` shall have the ability to maintain `estimated_mean` and `estimated_variance` across mini-batch. In each mini-batch, `estimated_mean` is iterated by the following equations:
+
+```
+if batch_id == 0
+  estimated_mean = E[x]
+else
+  estimated_mean = estimated_mean * momentum + (1.0 - momentum_) * E[x]
+```
+
+The iterating of `estimated_variance` is similar. `momentum` is an attribute, which controls estimated_mean updating speed.
+
+## Implementation
+
+Batch normalization is designed as a single operator is C++, and then wrapped as a layer in Python.
+
+### C++
+
+As most C++ operators do, `batch_norm_op` is defined by inputs, outputs, attributes and compute kernels.
+
+#### Inputs
+
+- `x`: The inputs data, which is generated by the previous layer.
+- `estimated_mean`: The estimated mean of all previous data batches. It is updated in each forward propagation and will be used in inferencing to take the role of `E[x]`.
+- `estimated_var`: The estimated standard deviation of all previous data batches. It is updated in each forward propagation and will be used in inferencing to take the role of `STD[x]`.
+- `scale`: trainable parameter 'scale'
+- `bias`: trainable parameter 'bias'
+
+#### Outputs
+
+- `y`: The output data.
+- `batch_mean`: The mean value of batch data.
+- `batch_var`: The standard deviation value of batch data.
+- `saved_mean`: Updated `estimated_mean` with current batch data. It's supposed to share the memory with input `estimated_mean`.
+- `saved_var`: Updated `estimated_var` with current batch data. It's supposed to share the memory with input `estimated_var`.
+
+#### Attributes
+
+- `is_infer`: *bool*. If true, run `batch_norm_op` in inferencing mode.
+- `use_global_est`: *bool*. If true, use `saved_mean` and `saved_var` instead of `E[x]` and `STD[x]` in trainning.
+- `epsilon`: *float*. The epsilon value to avoid division by zero.
+- `momentum`: *float*. Factor used in `estimated_mean` and `estimated_var` updating. The usage is shown above.
+
+#### Kernels
+
+The following graph showes the training computational process of `batch_norm_op`:
+
+<img src="./images/batch_norm_op_kernel.png" width="800"/>
+
+cudnn provides APIs to finish the whole series of computation, we can use them in our GPU kernel.
+
+### Python
+
+`batch_norm_op` is warpped as a layer in Python:
+
+```python 
+def batch_norm_layer(net, 
+                     input,
+                     output, 
+                     scale, 
+                     bias, 
+                     use_global_est = False, 
+                     epsilon = 1e-6,
+                     momentum = 0.99):
+	mean_cache = scope.new_var(name = 'estimated_mean', trainable = False)
+	var_cache = scop.new_var(name = 'estimated_var', trainable = False)
+	batch_mean = scope.new_var(name = 'batch_mean')
+	batch_var = scope.new_var(name = 'batch_var')
+	batch_norm_op = Operator('batch_norm_op',
+	                         x = input,
+	                         estimated_mean = mean_cache,
+	                         estimated_mean = var_cache,
+	                         scale = scale,
+	                         bias = bias,
+	                         y = output,
+	                         batch_mean = batch_mean,
+	                         batch_var = batch_var,
+	                         saved_mean = mean_cache,
+	                         saved_var = var_cache,
+	                         is_infer = False,
+	                         use_global_est = use_global_est,
+	                         epsilon = epsilon,
+	                         momentum = momentum)
+	net.append_op(batch_norm_op)
+	return output
+```
+
+Because Python API has not been finally decided, the code above can be regarded as pseudo code. There are a few key points we shall note:
+
+1. `estimated_mean` and `estimated_var` are assigned the same variables with `saved_mean` and `saved_var` respectively. So they share same the memories. The output mean and variance values(`saved_mean` and `saved_var`) of a certain batch will be the inputs(`estimated_mean` and `estimated_var`) of the next batch.
+
+2. `is_infer` decided whether `batch_norm_op` will run in training mode or inferencing mode. However, a network may contains both training and inferencing parts. And user may switch `batch_norm_op`'s running mode in Python `for` loop like this:
+
+```python
+for pass_id in range(PASS_NUM):
+    # ...
+    net.train()  # run training model
+    if pass_id % 100 == 0:
+        net.infer(test_image)    # run inferencing model
+    # ...
+``` 
+
+`is_infer` is an attribute. Once an operator is created, its attributes can not be changed. It suggests us that we shall maintain two `batch_norm_op` in the model, one's `is_infer` is `True`(we call it `infer_batch_norm_op`) and the other one's is `False`(we call it `train_batch_norm_op`). They share all parameters and variables, but be placed in two different branches. That is to say, if a network contains a `batch_norm_op`, it will fork into two branches, one go through `train_batch_norm_op` and the other one go through `infer_batch_norm_op`:
+
+<div align=center>
+<img src="./images/batch_norm_fork.png" width="500"/>
+</div>
+
+Just like what is shown in the above graph, the net forks before `batch_norm_op` and will never merge again. All the operators after `batch_norm_op` will duplicate. 
+
+When the net runs in training mode, the end of the left branch will be set as the running target, so the dependency tracking process will ignore right branch automatically. When the net runs in inferencing mode, the process is reversed.
+
+How to set a target is related to Python API design, so I will leave it here waiting for more discussions.
diff --git a/paddle/operators/clip_op.cc b/paddle/operators/clip_op.cc
index 3e9b0d82ba06a918d52124e791715277640fd4ff..2d029394dd97a9c33c9c57fd3565345139cdff92 100644
--- a/paddle/operators/clip_op.cc
+++ b/paddle/operators/clip_op.cc
@@ -21,7 +21,6 @@ class ClipOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of ClipOp should not be null.");
@@ -60,7 +59,6 @@ class ClipOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
diff --git a/paddle/operators/concat_op.cc b/paddle/operators/concat_op.cc
index 235c4449ace79e87d209615199d6628f269961a9..e11e51b4583817ef50cd447dbcf4c7202a152422 100644
--- a/paddle/operators/concat_op.cc
+++ b/paddle/operators/concat_op.cc
@@ -23,7 +23,6 @@ class ConcatOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE_GE(ctx->Inputs("X").size(), 1UL,
                       "Inputs(X) of ConcatOp should be empty.")
@@ -82,7 +81,6 @@ class ConcatOpGrad : public framework::OperatorWithKernel {
                const framework::AttributeMap &attrs)
       : OperatorWithKernel(type, inputs, outputs, attrs) {}
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     ctx->SetOutputsDim(framework::GradVarName("X"), ctx->GetInputsDim("X"));
   }
diff --git a/paddle/operators/conv2d_op.h b/paddle/operators/conv2d_op.h
index 7ebdbe81cbbaf59a60eb3dac0f570d70fc85d6ef..bd1734879ef2569bfc7c3bef21677d3b0dc49a78 100644
--- a/paddle/operators/conv2d_op.h
+++ b/paddle/operators/conv2d_op.h
@@ -44,7 +44,6 @@ class Conv2DOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override;
 };
 
@@ -52,7 +51,6 @@ class Conv2DOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override;
 };
 
diff --git a/paddle/operators/conv_shift_op.cc b/paddle/operators/conv_shift_op.cc
index e1e321ed5fce6ce4e4089cc5c5e488a2cbad6c82..6156a2d6af9a010240449a7c944ec0caffc85189 100644
--- a/paddle/operators/conv_shift_op.cc
+++ b/paddle/operators/conv_shift_op.cc
@@ -27,7 +27,6 @@ class ConvShiftOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
     PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should be not null.");
@@ -54,7 +53,6 @@ class ConvShiftGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
     PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should be not null.");
diff --git a/paddle/operators/cos_sim_op.cc b/paddle/operators/cos_sim_op.cc
index 2b4c4b9c45d22e9d0b124145b55cfaeeaf1583d5..55f69fb03ad69c94dc4ebb8edd651d84e06a5f46 100644
--- a/paddle/operators/cos_sim_op.cc
+++ b/paddle/operators/cos_sim_op.cc
@@ -23,7 +23,6 @@ class CosSimOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     // notnull check
     PADDLE_ENFORCE(ctx->HasInput("X"),
@@ -97,7 +96,6 @@ class CosSimOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     // notnull check
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) must not be null.");
diff --git a/paddle/operators/crop_op.cc b/paddle/operators/crop_op.cc
index a1424993cc3076f2755164393c5429d79fc1ee78..a994d916761da3b44cb60bd4c1c767cd1987522f 100644
--- a/paddle/operators/crop_op.cc
+++ b/paddle/operators/crop_op.cc
@@ -24,7 +24,6 @@ class CropOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of CropOp should not be null.");
@@ -114,7 +113,6 @@ class CropOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
diff --git a/paddle/operators/cross_entropy_op.cc b/paddle/operators/cross_entropy_op.cc
index 708e80e96a0f007be1594d8b4781d1296d6b398d..a865991db3111d2a7cec9f7731b3c34876864299 100644
--- a/paddle/operators/cross_entropy_op.cc
+++ b/paddle/operators/cross_entropy_op.cc
@@ -21,7 +21,6 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
     PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) should be not null.");
@@ -34,13 +33,13 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
     PADDLE_ENFORCE_EQ(x_dims[0], label_dims[0],
                       "The 1st dimension of Input(X) and Input(Label) should "
                       "be equal.");
-    if (ctx->Attrs().Get<bool>("softLabel")) {
+    if (ctx->Attrs().Get<bool>("soft_label")) {
       PADDLE_ENFORCE_EQ(x_dims[1], label_dims[1],
-                        "If Attr(softLabel) == true, the 2nd dimension of "
+                        "If Attr(soft_label) == true, the 2nd dimension of "
                         "Input(X) and Input(Label) should be equal.");
     } else {
       PADDLE_ENFORCE_EQ(label_dims[1], 1,
-                        "If Attr(softLabel) == false, the 2nd dimension of "
+                        "If Attr(soft_label) == false, the 2nd dimension of "
                         "Input(Label) should be 1.");
     }
 
@@ -48,6 +47,7 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
     ctx->ShareLoD("X", /*->*/ "Y");
   }
 
+ protected:
   // CrossEntropy's data type just determined by "X"
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
@@ -59,7 +59,6 @@ class CrossEntropyGradientOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
     PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) should be not null.");
@@ -82,18 +81,19 @@ class CrossEntropyGradientOp : public framework::OperatorWithKernel {
                       "be equal.");
     PADDLE_ENFORCE_EQ(dy_dims[1], 1,
                       "The 2nd dimension of Input(Y@Grad) should be 1.");
-    if (ctx->Attrs().Get<bool>("softLabel")) {
+    if (ctx->Attrs().Get<bool>("soft_label")) {
       PADDLE_ENFORCE_EQ(x_dims[1], label_dims[1],
-                        "When Attr(softLabel) == true, the 2nd dimension of "
+                        "When Attr(soft_label) == true, the 2nd dimension of "
                         "Input(X) and Input(Label) should be equal.");
     } else {
       PADDLE_ENFORCE_EQ(label_dims[1], 1,
-                        "When Attr(softLabel) == false, the 2nd dimension of "
+                        "When Attr(soft_label) == false, the 2nd dimension of "
                         "Input(Label) should be 1.");
     }
     ctx->SetOutputDim(framework::GradVarName("X"), x_dims);
   }
 
+ protected:
   // CrossEntropy's data type just determined by "X"
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
@@ -115,15 +115,15 @@ class CrossEntropyOpMaker : public framework::OpProtoAndCheckerMaker {
         "Label",
         "(Tensor, default Tensor<int>), the ground truth which is "
         "a 2-D tensor. "
-        "When softLabel is set to false, `Label` is a Tensor<int> with shape "
+        "When soft_label is set to false, `Label` is a Tensor<int> with shape "
         "[N x 1]. "
-        "When softLabel is set to true, `Label` is a Tensor<float/double> "
+        "When soft_label is set to true, `Label` is a Tensor<float/double> "
         "with shape [N x K].");
     AddOutput("Y",
               "(Tensor, default Tensor<float>), a 2-D tensor "
               "with shape [N x 1]. The cross entropy loss.");
     AddAttr<bool>(
-        "softLabel",
+        "soft_label",
         "(bool, default false), a flag to indicate whether to interpretate "
         "the given labels as soft labels.")
         .SetDefault(false);
@@ -133,12 +133,12 @@ CrossEntropy Operator.
 It supports both standard cross-entropy and soft-label cross-entropy loss
 computation.
 1) One-hot cross-entropy:
-    softLabel = false, Label[i, 0] indicates the class index for sample i:
+    soft_label = false, Label[i, 0] indicates the class index for sample i:
 
                 Y[i] = -log(X[i, Label[i]])
 
 2) Soft-label cross-entropy:
-    softLabel = true, Label[i, j] indicates the soft label of class j
+    soft_label = true, Label[i, j] indicates the soft label of class j
     for sample i:
 
                 Y[i] = \sum_j{-Label[i, j] * log(X[i, j])}
diff --git a/paddle/operators/cross_entropy_op.cu b/paddle/operators/cross_entropy_op.cu
index 5e2024e0ea9040b758e1cec4dbaa4b329bbb727e..c492dddb09a41e3731a211b4fa083e57ad780f42 100644
--- a/paddle/operators/cross_entropy_op.cu
+++ b/paddle/operators/cross_entropy_op.cu
@@ -56,7 +56,7 @@ class CrossEntropyOpCUDAKernel : public framework::OpKernel<T> {
     y->mutable_data<T>(ctx.GetPlace());
 
     math::CrossEntropyFunctor<platform::GPUPlace, T>()(
-        ctx.device_context(), y, x, label, ctx.Attr<bool>("softLabel"));
+        ctx.device_context(), y, x, label, ctx.Attr<bool>("soft_label"));
   }
 };
 
@@ -83,7 +83,7 @@ class CrossEntropyGradientOpCUDAKernel : public framework::OpKernel<T> {
     int block = 512;
     int grid = (batch_size * class_num + block - 1) / block;
 
-    if (ctx.Attr<bool>("softLabel")) {
+    if (ctx.Attr<bool>("soft_label")) {
       auto* label_data = label->data<T>();
       SoftCrossEntropyGradientKernel<T><<<
           grid, block, 0, reinterpret_cast<const platform::CUDADeviceContext&>(
@@ -91,7 +91,8 @@ class CrossEntropyGradientOpCUDAKernel : public framework::OpKernel<T> {
                               .stream()>>>(dx_data, dy_data, x_data, label_data,
                                            batch_size, class_num);
     } else {
-      math::SetConstant<platform::GPUPlace, T>(ctx.device_context(), dx, 0);
+      math::SetConstant<platform::GPUPlace, T> functor;
+      functor(ctx.device_context(), dx, 0);
       auto* label_data = label->data<int>();
       grid = (batch_size + block - 1) / block;
       CrossEntropyGradientKernel<T><<<
diff --git a/paddle/operators/cross_entropy_op.h b/paddle/operators/cross_entropy_op.h
index d2d321aa7ed8e32cc19d5a171beea34d36195b10..42f282103b5609e3c987fc4a83113f86532f74d6 100644
--- a/paddle/operators/cross_entropy_op.h
+++ b/paddle/operators/cross_entropy_op.h
@@ -38,7 +38,7 @@ class CrossEntropyOpKernel : public framework::OpKernel<T> {
     y->mutable_data<T>(ctx.GetPlace());
 
     math::CrossEntropyFunctor<platform::CPUPlace, T>()(
-        ctx.device_context(), y, x, labels, ctx.Attr<bool>("softLabel"));
+        ctx.device_context(), y, x, labels, ctx.Attr<bool>("soft_label"));
   }
 };
 
@@ -55,7 +55,7 @@ class CrossEntropyGradientOpKernel : public framework::OpKernel<T> {
     T* dx_data = dx->mutable_data<T>(ctx.GetPlace());
 
     int class_num = x->dims()[1];
-    if (ctx.Attr<bool>("softLabel")) {
+    if (ctx.Attr<bool>("soft_label")) {
       auto x_mat = EigenMatrix<T>::From(*x);
       auto dy_mat = EigenMatrix<T>::From(*dy);
       auto lbl_mat = EigenMatrix<T>::From(*label);
@@ -70,7 +70,8 @@ class CrossEntropyGradientOpKernel : public framework::OpKernel<T> {
       const T* x_data = x->data<T>();
       const int* label_data = label->data<int>();
 
-      math::SetConstant<platform::CPUPlace, T>(ctx.device_context(), dx, 0);
+      math::SetConstant<platform::CPUPlace, T> functor;
+      functor(ctx.device_context(), dx, 0);
 
       for (int i = 0; i < batch_size; ++i) {
         PADDLE_ASSERT(label_data[i] >= 0 || label_data[i] < class_num);
diff --git a/paddle/operators/decayed_adagrad_op.cc b/paddle/operators/decayed_adagrad_op.cc
index 7f583f18c8c6ee5025f6525306f9323fb329b030..17b394aa07cb0c7ca6e085b61590ff052221b22c 100644
--- a/paddle/operators/decayed_adagrad_op.cc
+++ b/paddle/operators/decayed_adagrad_op.cc
@@ -21,7 +21,6 @@ class DecayedAdagradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Param"),
                    "Input(Param) of DecayedAdagradOp should not be null.");
diff --git a/paddle/operators/dropout_op.cc b/paddle/operators/dropout_op.cc
index 708ccfa0bfc163ea75fcd9dbcb5ef6140b4b5f36..29858c90832bf116d07e43825eda5775a94beafb 100644
--- a/paddle/operators/dropout_op.cc
+++ b/paddle/operators/dropout_op.cc
@@ -23,7 +23,6 @@ class DropoutOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) must not be null.");
     PADDLE_ENFORCE_GE(ctx->Attrs().Get<float>("dropout_prob"), 0);
@@ -69,7 +68,6 @@ class DropoutOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE_EQ(ctx->Attrs().Get<bool>("is_training"), 1,
                       "GradOp is only callable when is_training is true");
diff --git a/paddle/operators/dynamic_recurrent_op_test.cc b/paddle/operators/dynamic_recurrent_op_test.cc
index 83a5ba36d9af2ef81ebcbb33e056de2e0b98cbc1..36f405568d7e4ed9a469c3af7a80192b83142b7a 100644
--- a/paddle/operators/dynamic_recurrent_op_test.cc
+++ b/paddle/operators/dynamic_recurrent_op_test.cc
@@ -51,7 +51,7 @@ class DynamicRecurrentOpTestHelper : public ::testing::Test {
     CreateGlobalVariables();
 
     auto op_desc = CreateOpDesc();
-    op = paddle::framework::OpRegistry::CreateOp(op_desc);
+    op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
     dop = dynamic_cast<DynamicRecurrentOp*>(op.get());
     InitCacheManually();
     InitStepNet();
diff --git a/paddle/operators/elementwise_op.h b/paddle/operators/elementwise_op.h
index 66f1910a4738f737c7581e17729c201a7e3eaeae..fce4b24a22f40c9cc57738273a758d0d48ff5e91 100644
--- a/paddle/operators/elementwise_op.h
+++ b/paddle/operators/elementwise_op.h
@@ -23,7 +23,6 @@ class ElementwiseOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   using Tensor = framework::Tensor;
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
@@ -105,7 +104,6 @@ class ElementwiseOpGrad : public framework::OperatorWithKernel {
   using framework::OperatorWithKernel::OperatorWithKernel;
   using Tensor = framework::Tensor;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
     PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should not be null");
diff --git a/paddle/operators/fill_constant_op.cc b/paddle/operators/fill_constant_op.cc
index 65d03d5fa4426ef4229c1155753c67e59ce98857..0438d4d085f81d463253605b3aeca640a433a3b3 100644
--- a/paddle/operators/fill_constant_op.cc
+++ b/paddle/operators/fill_constant_op.cc
@@ -21,7 +21,6 @@ class FillConstantOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasOutput("Out"),
                    "Output(Out) of FillConstantOp should not be null.");
@@ -33,9 +32,10 @@ class FillConstantOp : public framework::OperatorWithKernel {
     ctx->SetOutputDim("Out", dims);
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext &ctx) const override {
-    return static_cast<framework::DataType>(ctx.Attr<int>("dataType"));
+    return static_cast<framework::DataType>(ctx.Attr<int>("data_type"));
   }
 };
 
@@ -44,7 +44,7 @@ class FillConstantOpMaker : public framework::OpProtoAndCheckerMaker {
   FillConstantOpMaker(framework::OpProto *proto,
                       framework::OpAttrChecker *op_checker)
       : framework::OpProtoAndCheckerMaker(proto, op_checker) {
-    AddAttr<int>("dataType",
+    AddAttr<int>("data_type",
                  "(int, default 5 (FP32)) "
                  "Output data type")
         .SetDefault(framework::DataType::FP32);
diff --git a/paddle/operators/fill_zeros_like_op.cc b/paddle/operators/fill_zeros_like_op.cc
index 4c70b9a36bb8f3a2184a8f2ac738378e9d3271e2..ed529ac40aaf179b35a9ab32e11ed7dbbe9289ba 100644
--- a/paddle/operators/fill_zeros_like_op.cc
+++ b/paddle/operators/fill_zeros_like_op.cc
@@ -21,7 +21,6 @@ class FillZerosLikeOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of FillZerosLikeOp should not be null.");
diff --git a/paddle/operators/gather_op.cc b/paddle/operators/gather_op.cc
index fb99c6c01670fcbacd012573b0679ac755638c57..f6c7f472da24a1a60c0d2538ae643bdc8e55b10f 100644
--- a/paddle/operators/gather_op.cc
+++ b/paddle/operators/gather_op.cc
@@ -22,7 +22,6 @@ class GatherOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of GatherOp should not be null.");
@@ -40,6 +39,7 @@ class GatherOp : public framework::OperatorWithKernel {
     ctx->SetOutputDim("Out", output_dims);
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
     return framework::ToDataType(ctx.Input<Tensor>("X")->type());
@@ -50,11 +50,11 @@ class GatherGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X"));
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
     return framework::ToDataType(ctx.Input<Tensor>("X")->type());
diff --git a/paddle/operators/gaussian_random_op.cc b/paddle/operators/gaussian_random_op.cc
index ca7fb385057dda4ee6f7dea7edc21d9ebbf6b48e..f59f497d9f32069b764a9f777c7e9d6da9cdb108 100644
--- a/paddle/operators/gaussian_random_op.cc
+++ b/paddle/operators/gaussian_random_op.cc
@@ -42,7 +42,6 @@ class GaussianRandomOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasOutput("Out"),
                    "Output(Out) of GaussianRandomOp should not be null.");
@@ -57,6 +56,7 @@ class GaussianRandomOp : public framework::OperatorWithKernel {
     ctx->SetOutputDim("Out", framework::make_ddim(temp));
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
     return static_cast<framework::DataType>(Attr<int>("data_type"));
diff --git a/paddle/operators/gru_unit_op.cc b/paddle/operators/gru_unit_op.cc
index 24f84597cd7301af6521b8c1032e69569ba6f03a..72dd841c85ce9934a57263d10c366e679693c471 100644
--- a/paddle/operators/gru_unit_op.cc
+++ b/paddle/operators/gru_unit_op.cc
@@ -23,7 +23,6 @@ class GRUUnitOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Input"),
                    "Input(%s) of GRUUnitOp should not be null.", "Input");
@@ -131,7 +130,6 @@ class GRUUnitGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Input"),
                    "Input(%s) of GRUUnitGradOp should not be null.", "Input");
diff --git a/paddle/operators/images/batch_norm_fork.dot b/paddle/operators/images/batch_norm_fork.dot
new file mode 100644
index 0000000000000000000000000000000000000000..4bc47713cba2cb23f1b34fffe6426ef10ac3a9df
--- /dev/null
+++ b/paddle/operators/images/batch_norm_fork.dot
@@ -0,0 +1,25 @@
+digraph ImageBatchNormForkGragh {
+  subgraph cluster_before {
+    Prev [label="...", shape=plaintext];
+    Rnn [label="rnn_op", shape=box];
+    BatchNorm [label="batch_norm_op", shape=box];
+    Fc [label="fc_op", shape=box];
+    After [label="...", shape=plaintext];
+    Prev -> Rnn -> BatchNorm -> Fc -> After;
+    label="original";
+  }
+
+  subgraph cluster_after {
+    Prev2 [label="...", shape=plaintext];
+    Rnn2 [label="rnn_op", shape=box];
+    BatchNorm2_1 [label="train_batch_norm_op", shape=box];
+    BatchNorm2_2 [label="infer_batch_norm_op", shape=box];
+    Fc2_1 [label="fc_op", shape=box];
+    Fc2_2 [label="fc_op", shape=box];
+    After2_1 [label="...", shape=plaintext];
+    After2_2 [label="...", shape=plaintext];
+    Prev2 -> Rnn2 -> BatchNorm2_1 -> Fc2_1 -> After2_1;
+    Rnn2 -> BatchNorm2_2 ->Fc2_2 ->After2_2
+    label="forked";
+  }
+}
diff --git a/paddle/operators/images/batch_norm_fork.png b/paddle/operators/images/batch_norm_fork.png
new file mode 100644
index 0000000000000000000000000000000000000000..aded62bce5bc268b7a3ef4dc96c89fe21d6ea955
Binary files /dev/null and b/paddle/operators/images/batch_norm_fork.png differ
diff --git a/paddle/operators/images/batch_norm_op_kernel.png b/paddle/operators/images/batch_norm_op_kernel.png
new file mode 100644
index 0000000000000000000000000000000000000000..a99ce81ff3bf42880ebbd6a1297de3bf038e09b2
Binary files /dev/null and b/paddle/operators/images/batch_norm_op_kernel.png differ
diff --git a/paddle/operators/lookup_table_op.cc b/paddle/operators/lookup_table_op.cc
index 3f8d4ab85756ec007004e397a55ea60e2fa704ae..b88cd14d78f616b0e57386ab891dad1a872bfe65 100644
--- a/paddle/operators/lookup_table_op.cc
+++ b/paddle/operators/lookup_table_op.cc
@@ -21,7 +21,6 @@ class LookupTableOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("W"),
                    "Input(W) of LookupTableOp should not be null.");
@@ -37,6 +36,7 @@ class LookupTableOp : public framework::OperatorWithKernel {
     ctx->ShareLoD("Ids", /*->*/ "Out");
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
     return framework::ToDataType(ctx.Input<Tensor>("W")->type());
@@ -69,12 +69,12 @@ class LookupTableOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     auto table_dims = ctx->GetInputDim("W");
     ctx->SetOutputDim(framework::GradVarName("W"), table_dims);
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
     return framework::ToDataType(ctx.Input<Tensor>("W")->type());
diff --git a/paddle/operators/lstm_unit_op.cc b/paddle/operators/lstm_unit_op.cc
index 13a45ec246988558b0c090ab14e627b5b0e44532..5d63017208a55ec4bcc2e8d66f1ca2e1b84d4593 100644
--- a/paddle/operators/lstm_unit_op.cc
+++ b/paddle/operators/lstm_unit_op.cc
@@ -21,7 +21,6 @@ class LstmUnitOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) of LSTM should not be null.");
     PADDLE_ENFORCE(ctx->HasInput("C_prev"),
@@ -76,7 +75,6 @@ class LstmUnitGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("C")),
                    "Input(C@GRAD) should not be null");
diff --git a/paddle/operators/margin_rank_loss_op.cc b/paddle/operators/margin_rank_loss_op.cc
index 5be61dfec3bb58ab9b658cb59ab0dd49bb67d8cb..638a99addc2119e8f44648cc54b97bd8a892d2bc 100644
--- a/paddle/operators/margin_rank_loss_op.cc
+++ b/paddle/operators/margin_rank_loss_op.cc
@@ -21,7 +21,6 @@ class MarginRankLossOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     // input check
     PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) shouldn't be null.");
@@ -94,7 +93,6 @@ class MarginRankLossGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) shouldn't be null.");
     PADDLE_ENFORCE(ctx->HasInput("X1"), "Input(X1) shouldn't be null.");
diff --git a/paddle/operators/math/CMakeLists.txt b/paddle/operators/math/CMakeLists.txt
index 1a2f623ce7917b1e60656743e699271eab9c7011..72ce8585045b5166df424a401442db39b47ab098 100644
--- a/paddle/operators/math/CMakeLists.txt
+++ b/paddle/operators/math/CMakeLists.txt
@@ -1,18 +1,22 @@
 if(WITH_GPU)
     nv_library(math_function SRCS math_function.cc math_function.cu im2col.cc im2col.cu DEPS cblas device_context operator)
-    nv_test(math_function_test SRCS math_function_test.cc DEPS math_function tensor)
+    nv_test(math_function_gpu_test SRCS math_function_test.cu DEPS math_function tensor)
+    nv_library(selected_rows_functor SRCS selected_rows_functor.cc selected_rows_functor.cu DEPS selected_rows math_function)
+    nv_test(selected_rows_functor_gpu_test SRCS selected_rows_functor_test.cu DEPS selected_rows_functor)
     nv_library(softmax SRCS softmax.cc softmax.cu DEPS operator)
     nv_library(cross_entropy SRCS cross_entropy.cc cross_entropy.cu DEPS operator)
     nv_library(pooling SRCS pooling.cc pooling.cu DEPS device_context)
     nv_library(vol2col SRCS vol2col.cc vol2col.cu DEPS device_context)
 else()
     cc_library(math_function SRCS math_function.cc im2col.cc DEPS cblas device_context operator)
-    cc_test(math_function_test SRCS math_function_test.cc DEPS math_function tensor)
+    cc_library(selected_rows_functor SRCS selected_rows_functor.cc DEPS selected_rows math_function)
     cc_library(softmax SRCS softmax.cc DEPS operator)
     cc_library(cross_entropy SRCS cross_entropy.cc DEPS operator)
     cc_library(pooling SRCS pooling.cc DEPS device_context)
     cc_library(vol2col SRCS vol2col.cc DEPS device_context)
 endif()
 
+cc_test(math_function_test SRCS math_function_test.cc DEPS math_function tensor)
+cc_test(selected_rows_functor_test SRCS selected_rows_functor_test.cc DEPS selected_rows_functor)
 cc_test(im2col_test SRCS im2col_test.cc DEPS math_function tensor)
 cc_test(vol2col_test SRCS vol2col_test.cc DEPS vol2col tensor)
diff --git a/paddle/operators/math/math_function.cc b/paddle/operators/math/math_function.cc
index ba653afa2cb175ae2e5e21088b6dc7ba76a6018f..aad1357598c629a4edfe0ad9b23f0241093a2522 100644
--- a/paddle/operators/math/math_function.cc
+++ b/paddle/operators/math/math_function.cc
@@ -130,6 +130,89 @@ void matmul<platform::CPUPlace, double>(
       matrix_b.data<double>(), beta, matrix_out->data<double>());
 }
 
+#ifdef PADDLE_USE_MKLML
+// Use cblas_{s,d}gemm_batched if available: Run with 1 group of size batchSize.
+template <>
+void batched_gemm<platform::CPUPlace, float>(
+    const platform::DeviceContext& context, const CBLAS_TRANSPOSE transA,
+    const CBLAS_TRANSPOSE transB, const int M, const int N, const int K,
+    const float alpha, const float* A, const float* B, const float beta,
+    float* C, const int batchCount, const int strideA, const int strideB) {
+  int lda = (transA == CblasNoTrans) ? K : M;
+  int ldb = (transB == CblasNoTrans) ? N : K;
+  int ldc = N;
+  auto a_array = std::vector<const float*>(batchCount);
+  auto b_array = std::vector<const float*>(batchCount);
+  auto c_array = std::vector<float*>(batchCount);
+  for (int k = 0; k < batchCount; ++k) {
+    a_array[k] = &A[k * strideA];
+    b_array[k] = &B[k * strideB];
+    c_array[k] = &C[k * M * N];
+  }
+  cblas_sgemm_batch(CblasRowMajor, &transA, &transB, &M, &N, &K, &alpha,
+                    a_array.data(), &lda, b_array.data(), &ldb, &beta,
+                    c_array.data(), &ldc, 1 /* group_count */, &batchCount);
+}
+
+template <>
+void batched_gemm<platform::CPUPlace, double>(
+    const platform::DeviceContext& context, const CBLAS_TRANSPOSE transA,
+    const CBLAS_TRANSPOSE transB, const int M, const int N, const int K,
+    const double alpha, const double* A, const double* B, const double beta,
+    double* C, const int batchCount, const int strideA, const int strideB) {
+  int lda = (transA == CblasNoTrans) ? K : M;
+  int ldb = (transB == CblasNoTrans) ? N : K;
+  int ldc = N;
+  auto a_array = std::vector<const double*>(batchCount);
+  auto b_array = std::vector<const double*>(batchCount);
+  auto c_array = std::vector<double*>(batchCount);
+  for (int k = 0; k < batchCount; ++k) {
+    a_array[k] = &A[k * strideA];
+    b_array[k] = &B[k * strideB];
+    c_array[k] = &C[k * M * N];
+  }
+  cblas_dgemm_batch(CblasRowMajor, &transA, &transB, &M, &N, &K, &alpha,
+                    a_array.data(), &lda, b_array.data(), &ldb, &beta,
+                    c_array.data(), &ldc, 1 /* group_count */, &batchCount);
+}
+#else
+// The below is a naive but correct serial implementation that just loops
+// over the batch dimension. This is a fallback for when the batched gemm
+// functions of Intel MKL are not available. In the future, this computation
+// should be parallelized.
+template <>
+void batched_gemm<platform::CPUPlace, float>(
+    const platform::DeviceContext& context, const CBLAS_TRANSPOSE transA,
+    const CBLAS_TRANSPOSE transB, const int M, const int N, const int K,
+    const float alpha, const float* A, const float* B, const float beta,
+    float* C, const int batchCount, const int strideA, const int strideB) {
+  for (int k = 0; k < batchCount; ++k) {
+    const float* Ak = &A[k * strideA];
+    const float* Bk = &B[k * strideB];
+    float* Ck = &C[k * M * N];
+    gemm<platform::CPUPlace, float>(context, transA, transB, M, N, K, alpha, Ak,
+                                    Bk, beta, Ck);
+  }
+}
+
+template <>
+void batched_gemm<platform::CPUPlace, double>(
+    const platform::DeviceContext& context, const CBLAS_TRANSPOSE transA,
+    const CBLAS_TRANSPOSE transB, const int M, const int N, const int K,
+    const double alpha, const double* A, const double* B, const double beta,
+    double* C, const int batchCount, const int strideA, const int strideB) {
+  for (int k = 0; k < batchCount; ++k) {
+    const double* Ak = &A[k * strideA];
+    const double* Bk = &B[k * strideB];
+    double* Ck = &C[k * M * N];
+    gemm<platform::CPUPlace, double>(context, transA, transB, M, N, K, alpha,
+                                     Ak, Bk, beta, Ck);
+  }
+}
+#endif
+
+template struct SetConstant<platform::CPUPlace, float>;
+
 }  // namespace math
 }  // namespace operators
 }  // namespace paddle
diff --git a/paddle/operators/math/math_function.cu b/paddle/operators/math/math_function.cu
index 649f1f352c2a4a5ebaa0cb00ffb2e4de8aa4961a..5583683c6e12b88ba81015aef9161913de261ef2 100644
--- a/paddle/operators/math/math_function.cu
+++ b/paddle/operators/math/math_function.cu
@@ -155,6 +155,56 @@ void matmul<platform::GPUPlace, double>(
       matrix_b.data<double>(), beta, matrix_out->data<double>());
 }
 
+template <>
+void batched_gemm<platform::GPUPlace, float>(
+    const platform::DeviceContext& context, const CBLAS_TRANSPOSE transA,
+    const CBLAS_TRANSPOSE transB, const int M, const int N, const int K,
+    const float alpha, const float* A, const float* B, const float beta,
+    float* C, const int batchCount, const int strideA, const int strideB) {
+  // Note that cublas follows fortran order, so the order is different from
+  // the cblas convention.
+  int lda = (transA == CblasNoTrans) ? K : M;
+  int ldb = (transB == CblasNoTrans) ? N : K;
+  int ldc = N;
+  cublasOperation_t cuTransA =
+      (transA == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
+  cublasOperation_t cuTransB =
+      (transB == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
+  const int strideC = M * N;
+
+  PADDLE_ENFORCE(platform::dynload::cublasSgemmStridedBatched(
+      reinterpret_cast<const platform::CUDADeviceContext&>(context)
+          .cublas_handle(),
+      cuTransB, cuTransA, N, M, K, &alpha, B, ldb, strideB, A, lda, strideA,
+      &beta, C, ldc, strideC, batchCount));
+}
+
+template <>
+void batched_gemm<platform::GPUPlace, double>(
+    const platform::DeviceContext& context, const CBLAS_TRANSPOSE transA,
+    const CBLAS_TRANSPOSE transB, const int M, const int N, const int K,
+    const double alpha, const double* A, const double* B, const double beta,
+    double* C, const int batchCount, const int strideA, const int strideB) {
+  // Note that cublas follows fortran order, so the order is different from
+  // the cblas convention.
+  int lda = (transA == CblasNoTrans) ? K : M;
+  int ldb = (transB == CblasNoTrans) ? N : K;
+  int ldc = N;
+  cublasOperation_t cuTransA =
+      (transA == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
+  cublasOperation_t cuTransB =
+      (transB == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
+  const int strideC = M * N;
+
+  PADDLE_ENFORCE(platform::dynload::cublasDgemmStridedBatched(
+      reinterpret_cast<const platform::CUDADeviceContext&>(context)
+          .cublas_handle(),
+      cuTransB, cuTransA, N, M, K, &alpha, B, ldb, strideB, A, lda, strideA,
+      &beta, C, ldc, strideC, batchCount));
+}
+
+template struct SetConstant<platform::GPUPlace, float>;
+
 }  // namespace math
 }  // namespace operators
 }  // namespace paddle
diff --git a/paddle/operators/math/math_function.h b/paddle/operators/math/math_function.h
index 473eff4d198ca9b17b6af8eebd6dfe39d49d138d..9777ebfd156709a370be2cb4ba0077ac7c6735fb 100644
--- a/paddle/operators/math/math_function.h
+++ b/paddle/operators/math/math_function.h
@@ -63,7 +63,7 @@ namespace math {
 
 // Support continuous memory now
 // If transA = N, and transB = N
-// Then matrixA: M * K, matrixB: K * N matrixC : M * N
+// Then matrixA: M * K, matrixB: K * N, matrixC : M * N
 // For more detailed info, please refer to
 // http://www.netlib.org/lapack/explore-html/d4/de2/sgemm_8f.html
 template <typename Place, typename T>
@@ -85,12 +85,23 @@ void matmul(const platform::DeviceContext& context,
             const framework::Tensor& matrix_b, bool trans_b, T alpha,
             framework::Tensor* matrix_out, T beta);
 
+// Batched gemm
 template <typename Place, typename T>
-void SetConstant(const platform::DeviceContext& context,
-                 framework::Tensor* tensor, T num) {
-  auto t = framework::EigenVector<T>::Flatten(*tensor);
-  t.device(*context.GetEigenDevice<Place>()) = t.constant(static_cast<T>(num));
-}
+void batched_gemm(const platform::DeviceContext& context,
+                  const CBLAS_TRANSPOSE transA, const CBLAS_TRANSPOSE transB,
+                  const int M, const int N, const int K, const T alpha,
+                  const T* A, const T* B, const T beta, T* C,
+                  const int batchCount, const int strideA, const int strideB);
+
+template <typename Place, typename T>
+struct SetConstant {
+  void operator()(const platform::DeviceContext& context,
+                  framework::Tensor* tensor, T num) {
+    auto t = framework::EigenVector<T>::Flatten(*tensor);
+    t.device(*context.GetEigenDevice<Place>()) =
+        t.constant(static_cast<T>(num));
+  }
+};
 
 }  // namespace math
 }  // namespace operators
diff --git a/paddle/operators/math/math_function_test.cc b/paddle/operators/math/math_function_test.cc
index c87d200c3aa5a9336c0f73d3a8bb88d2e9eafbab..3b9f92e7ae5f34dd0fb1ba8fb0c67ff5ae1628c4 100644
--- a/paddle/operators/math/math_function_test.cc
+++ b/paddle/operators/math/math_function_test.cc
@@ -1,185 +1,6 @@
 #include "paddle/operators/math/math_function.h"
 #include "gtest/gtest.h"
 
-#ifdef PADDLE_WITH_CUDA
-TEST(math_function, notrans_mul_trans) {
-  paddle::framework::Tensor input1;
-  paddle::framework::Tensor input1_gpu;
-  paddle::framework::Tensor input2_gpu;
-  paddle::framework::Tensor out_gpu;
-  paddle::framework::Tensor out;
-
-  auto* cpu_place = new paddle::platform::CPUPlace();
-  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
-  float arr[6] = {0, 1, 2, 3, 4, 5};
-  memcpy(input1_ptr, arr, 6 * sizeof(float));
-
-  auto* gpu_place = new paddle::platform::GPUPlace(0);
-  paddle::platform::CUDADeviceContext context(*gpu_place);
-
-  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  input2_gpu.CopyFrom<float>(input1, *gpu_place, context);
-
-  out_gpu.mutable_data<float>({2, 2}, *gpu_place);
-
-  paddle::operators::math::matmul<paddle::platform::GPUPlace, float>(
-      context, input1_gpu, false, input2_gpu, true, 1, &out_gpu, 0);
-
-  out.CopyFrom<float>(out_gpu, *cpu_place, context);
-
-  float* out_ptr = out.data<float>();
-  context.Wait();
-  EXPECT_EQ(out_ptr[0], 5);
-  EXPECT_EQ(out_ptr[1], 14);
-  EXPECT_EQ(out_ptr[2], 14);
-  EXPECT_EQ(out_ptr[3], 50);
-  delete gpu_place;
-}
-
-TEST(math_function, trans_mul_notrans) {
-  paddle::framework::Tensor input1;
-  paddle::framework::Tensor input1_gpu;
-  paddle::framework::Tensor input2_gpu;
-  paddle::framework::Tensor out_gpu;
-  paddle::framework::Tensor out;
-
-  auto* cpu_place = new paddle::platform::CPUPlace();
-  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
-  float arr[6] = {0, 1, 2, 3, 4, 5};
-  memcpy(input1_ptr, arr, 6 * sizeof(float));
-
-  auto* gpu_place = new paddle::platform::GPUPlace(0);
-  paddle::platform::CUDADeviceContext context(*gpu_place);
-
-  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  input2_gpu.CopyFrom<float>(input1, *gpu_place, context);
-
-  out_gpu.mutable_data<float>({3, 3}, *gpu_place);
-
-  paddle::operators::math::matmul<paddle::platform::GPUPlace, float>(
-      context, input1_gpu, true, input2_gpu, false, 1, &out_gpu, 0);
-
-  out.CopyFrom<float>(out_gpu, *cpu_place, context);
-
-  float* out_ptr = out.data<float>();
-  context.Wait();
-  EXPECT_EQ(out_ptr[0], 9);
-  EXPECT_EQ(out_ptr[1], 12);
-  EXPECT_EQ(out_ptr[2], 15);
-  EXPECT_EQ(out_ptr[3], 12);
-  EXPECT_EQ(out_ptr[4], 17);
-  EXPECT_EQ(out_ptr[5], 22);
-  EXPECT_EQ(out_ptr[6], 15);
-  EXPECT_EQ(out_ptr[7], 22);
-  EXPECT_EQ(out_ptr[8], 29);
-  delete gpu_place;
-}
-
-TEST(math_function, gemm_notrans_cublas) {
-  paddle::framework::Tensor input1;
-  paddle::framework::Tensor input2;
-  paddle::framework::Tensor input3;
-  paddle::framework::Tensor input1_gpu;
-  paddle::framework::Tensor input2_gpu;
-  paddle::framework::Tensor input3_gpu;
-
-  int m = 2;
-  int n = 3;
-  int k = 3;
-  auto* cpu_place = new paddle::platform::CPUPlace();
-  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
-  float arr1[6] = {0, 1, 2, 3, 4, 5};
-  memcpy(input1_ptr, arr1, 6 * sizeof(float));
-  float* input2_ptr = input2.mutable_data<float>({3, 4}, *cpu_place);
-  float arr2[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
-  memcpy(input2_ptr, arr2, 12 * sizeof(float));
-  float* input3_ptr = input3.mutable_data<float>({2, 4}, *cpu_place);
-  float arr3[8] = {0, 1, 2, 3, 4, 5, 6, 7};
-  memcpy(input3_ptr, arr3, 8 * sizeof(float));
-
-  auto* gpu_place = new paddle::platform::GPUPlace(0);
-  paddle::platform::CUDADeviceContext context(*gpu_place);
-
-  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  input2_gpu.CopyFrom<float>(input2, *gpu_place, context);
-  input3_gpu.CopyFrom<float>(input3, *gpu_place, context);
-  float* a = input1_gpu.data<float>();
-  float* b = input2_gpu.data<float>();
-  float* c = input3_gpu.mutable_data<float>(*gpu_place);
-
-  paddle::operators::math::gemm<paddle::platform::GPUPlace, float>(
-      context, false, false, m, n, k, 1, a, 3, b + 1, 4, 1, c + 1, 4);
-
-  input3.CopyFrom<float>(input3_gpu, *cpu_place, context);
-
-  // numpy code:
-  // a = np.arange(6).reshape(2, 3)
-  // b = np.arange(12).reshape(3, 4)[:, 1:]
-  // c = np.arange(8).reshape(2, 4)[:, 1:]
-  // out = np.arange(8).reshape(2, 4)
-  // out[:, 1:] = np.dot(a, b) + c
-  context.Wait();
-  EXPECT_EQ(input3_ptr[0], 0);
-  EXPECT_EQ(input3_ptr[1], 24);
-  EXPECT_EQ(input3_ptr[2], 28);
-  EXPECT_EQ(input3_ptr[3], 32);
-  EXPECT_EQ(input3_ptr[4], 4);
-  EXPECT_EQ(input3_ptr[5], 73);
-  EXPECT_EQ(input3_ptr[6], 86);
-  EXPECT_EQ(input3_ptr[7], 99);
-  delete gpu_place;
-}
-
-TEST(math_function, gemm_trans_cublas) {
-  paddle::framework::Tensor input1;
-  paddle::framework::Tensor input2;
-  paddle::framework::Tensor input3;
-  paddle::framework::Tensor input1_gpu;
-  paddle::framework::Tensor input2_gpu;
-  paddle::framework::Tensor input3_gpu;
-
-  int m = 2;
-  int n = 3;
-  int k = 3;
-  auto* cpu_place = new paddle::platform::CPUPlace();
-  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
-  float arr1[6] = {0, 1, 2, 3, 4, 5};
-  memcpy(input1_ptr, arr1, 6 * sizeof(float));
-  float* input2_ptr = input2.mutable_data<float>({4, 3}, *cpu_place);
-  float arr2[12] = {0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11};
-  memcpy(input2_ptr, arr2, 12 * sizeof(float));
-  float* input3_ptr = input3.mutable_data<float>({2, 4}, *cpu_place);
-  float arr3[8] = {0, 1, 2, 3, 4, 5, 6, 7};
-  memcpy(input3_ptr, arr3, 8 * sizeof(float));
-
-  auto* gpu_place = new paddle::platform::GPUPlace(0);
-  paddle::platform::CUDADeviceContext context(*gpu_place);
-
-  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  input2_gpu.CopyFrom<float>(input2, *gpu_place, context);
-  input3_gpu.CopyFrom<float>(input3, *gpu_place, context);
-  float* a = input1_gpu.data<float>();
-  float* b = input2_gpu.data<float>();
-  float* c = input3_gpu.mutable_data<float>(*gpu_place);
-
-  paddle::operators::math::gemm<paddle::platform::GPUPlace, float>(
-      context, false, true, m, n, k, 1, a, 3, b + 3, 3, 1, c + 1, 4);
-
-  input3.CopyFrom<float>(input3_gpu, *cpu_place, context);
-  context.Wait();
-
-  EXPECT_EQ(input3_ptr[0], 0);
-  EXPECT_EQ(input3_ptr[1], 24);
-  EXPECT_EQ(input3_ptr[2], 28);
-  EXPECT_EQ(input3_ptr[3], 32);
-  EXPECT_EQ(input3_ptr[4], 4);
-  EXPECT_EQ(input3_ptr[5], 73);
-  EXPECT_EQ(input3_ptr[6], 86);
-  EXPECT_EQ(input3_ptr[7], 99);
-  delete gpu_place;
-}
-#endif
-
 TEST(math_function, gemm_notrans_cblas) {
   paddle::framework::Tensor input1;
   paddle::framework::Tensor input2;
@@ -253,15 +74,15 @@ TEST(math_function, zero) {
   auto* cpu_place = new paddle::platform::CPUPlace();
   float* t = tensor.mutable_data<float>({2, 2}, *cpu_place);
   paddle::platform::CPUDeviceContext context(*cpu_place);
-  paddle::operators::math::SetConstant<paddle::platform::CPUPlace, float>(
-      context, &tensor, 0);
+  paddle::operators::math::SetConstant<paddle::platform::CPUPlace, float>
+      functor;
+  functor(context, &tensor, 0);
   EXPECT_EQ(t[0], 0);
   EXPECT_EQ(t[1], 0);
   EXPECT_EQ(t[2], 0);
   EXPECT_EQ(t[3], 0);
 
-  paddle::operators::math::SetConstant<paddle::platform::CPUPlace, float>(
-      context, &tensor, 1);
+  functor(context, &tensor, 1);
 
   EXPECT_EQ(t[0], 1);
   EXPECT_EQ(t[1], 1);
diff --git a/paddle/operators/math/math_function_test.cu b/paddle/operators/math/math_function_test.cu
new file mode 100644
index 0000000000000000000000000000000000000000..14359d835bba794703a313d70f34082868474b20
--- /dev/null
+++ b/paddle/operators/math/math_function_test.cu
@@ -0,0 +1,179 @@
+#include "gtest/gtest.h"
+#include "paddle/operators/math/math_function.h"
+
+TEST(math_function, notrans_mul_trans) {
+  paddle::framework::Tensor input1;
+  paddle::framework::Tensor input1_gpu;
+  paddle::framework::Tensor input2_gpu;
+  paddle::framework::Tensor out_gpu;
+  paddle::framework::Tensor out;
+
+  auto* cpu_place = new paddle::platform::CPUPlace();
+  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
+  float arr[6] = {0, 1, 2, 3, 4, 5};
+  memcpy(input1_ptr, arr, 6 * sizeof(float));
+
+  auto* gpu_place = new paddle::platform::GPUPlace(0);
+  paddle::platform::CUDADeviceContext context(*gpu_place);
+
+  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
+  input2_gpu.CopyFrom<float>(input1, *gpu_place, context);
+
+  out_gpu.mutable_data<float>({2, 2}, *gpu_place);
+
+  paddle::operators::math::matmul<paddle::platform::GPUPlace, float>(
+      context, input1_gpu, false, input2_gpu, true, 1, &out_gpu, 0);
+
+  out.CopyFrom<float>(out_gpu, *cpu_place, context);
+
+  float* out_ptr = out.data<float>();
+  context.Wait();
+  EXPECT_EQ(out_ptr[0], 5);
+  EXPECT_EQ(out_ptr[1], 14);
+  EXPECT_EQ(out_ptr[2], 14);
+  EXPECT_EQ(out_ptr[3], 50);
+  delete gpu_place;
+}
+
+TEST(math_function, trans_mul_notrans) {
+  paddle::framework::Tensor input1;
+  paddle::framework::Tensor input1_gpu;
+  paddle::framework::Tensor input2_gpu;
+  paddle::framework::Tensor out_gpu;
+  paddle::framework::Tensor out;
+
+  auto* cpu_place = new paddle::platform::CPUPlace();
+  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
+  float arr[6] = {0, 1, 2, 3, 4, 5};
+  memcpy(input1_ptr, arr, 6 * sizeof(float));
+
+  auto* gpu_place = new paddle::platform::GPUPlace(0);
+  paddle::platform::CUDADeviceContext context(*gpu_place);
+
+  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
+  input2_gpu.CopyFrom<float>(input1, *gpu_place, context);
+
+  out_gpu.mutable_data<float>({3, 3}, *gpu_place);
+
+  paddle::operators::math::matmul<paddle::platform::GPUPlace, float>(
+      context, input1_gpu, true, input2_gpu, false, 1, &out_gpu, 0);
+
+  out.CopyFrom<float>(out_gpu, *cpu_place, context);
+
+  float* out_ptr = out.data<float>();
+  context.Wait();
+  EXPECT_EQ(out_ptr[0], 9);
+  EXPECT_EQ(out_ptr[1], 12);
+  EXPECT_EQ(out_ptr[2], 15);
+  EXPECT_EQ(out_ptr[3], 12);
+  EXPECT_EQ(out_ptr[4], 17);
+  EXPECT_EQ(out_ptr[5], 22);
+  EXPECT_EQ(out_ptr[6], 15);
+  EXPECT_EQ(out_ptr[7], 22);
+  EXPECT_EQ(out_ptr[8], 29);
+  delete gpu_place;
+}
+
+TEST(math_function, gemm_notrans_cublas) {
+  paddle::framework::Tensor input1;
+  paddle::framework::Tensor input2;
+  paddle::framework::Tensor input3;
+  paddle::framework::Tensor input1_gpu;
+  paddle::framework::Tensor input2_gpu;
+  paddle::framework::Tensor input3_gpu;
+
+  int m = 2;
+  int n = 3;
+  int k = 3;
+  auto* cpu_place = new paddle::platform::CPUPlace();
+  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
+  float arr1[6] = {0, 1, 2, 3, 4, 5};
+  memcpy(input1_ptr, arr1, 6 * sizeof(float));
+  float* input2_ptr = input2.mutable_data<float>({3, 4}, *cpu_place);
+  float arr2[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
+  memcpy(input2_ptr, arr2, 12 * sizeof(float));
+  float* input3_ptr = input3.mutable_data<float>({2, 4}, *cpu_place);
+  float arr3[8] = {0, 1, 2, 3, 4, 5, 6, 7};
+  memcpy(input3_ptr, arr3, 8 * sizeof(float));
+
+  auto* gpu_place = new paddle::platform::GPUPlace(0);
+  paddle::platform::CUDADeviceContext context(*gpu_place);
+
+  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
+  input2_gpu.CopyFrom<float>(input2, *gpu_place, context);
+  input3_gpu.CopyFrom<float>(input3, *gpu_place, context);
+  float* a = input1_gpu.data<float>();
+  float* b = input2_gpu.data<float>();
+  float* c = input3_gpu.mutable_data<float>(*gpu_place);
+
+  paddle::operators::math::gemm<paddle::platform::GPUPlace, float>(
+      context, false, false, m, n, k, 1, a, 3, b + 1, 4, 1, c + 1, 4);
+
+  input3.CopyFrom<float>(input3_gpu, *cpu_place, context);
+
+  // numpy code:
+  // a = np.arange(6).reshape(2, 3)
+  // b = np.arange(12).reshape(3, 4)[:, 1:]
+  // c = np.arange(8).reshape(2, 4)[:, 1:]
+  // out = np.arange(8).reshape(2, 4)
+  // out[:, 1:] = np.dot(a, b) + c
+  context.Wait();
+  EXPECT_EQ(input3_ptr[0], 0);
+  EXPECT_EQ(input3_ptr[1], 24);
+  EXPECT_EQ(input3_ptr[2], 28);
+  EXPECT_EQ(input3_ptr[3], 32);
+  EXPECT_EQ(input3_ptr[4], 4);
+  EXPECT_EQ(input3_ptr[5], 73);
+  EXPECT_EQ(input3_ptr[6], 86);
+  EXPECT_EQ(input3_ptr[7], 99);
+  delete gpu_place;
+}
+
+TEST(math_function, gemm_trans_cublas) {
+  paddle::framework::Tensor input1;
+  paddle::framework::Tensor input2;
+  paddle::framework::Tensor input3;
+  paddle::framework::Tensor input1_gpu;
+  paddle::framework::Tensor input2_gpu;
+  paddle::framework::Tensor input3_gpu;
+
+  int m = 2;
+  int n = 3;
+  int k = 3;
+  auto* cpu_place = new paddle::platform::CPUPlace();
+  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
+  float arr1[6] = {0, 1, 2, 3, 4, 5};
+  memcpy(input1_ptr, arr1, 6 * sizeof(float));
+  float* input2_ptr = input2.mutable_data<float>({4, 3}, *cpu_place);
+  float arr2[12] = {0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11};
+  memcpy(input2_ptr, arr2, 12 * sizeof(float));
+  float* input3_ptr = input3.mutable_data<float>({2, 4}, *cpu_place);
+  float arr3[8] = {0, 1, 2, 3, 4, 5, 6, 7};
+  memcpy(input3_ptr, arr3, 8 * sizeof(float));
+
+  auto* gpu_place = new paddle::platform::GPUPlace(0);
+  paddle::platform::CUDADeviceContext context(*gpu_place);
+
+  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
+  input2_gpu.CopyFrom<float>(input2, *gpu_place, context);
+  input3_gpu.CopyFrom<float>(input3, *gpu_place, context);
+  float* a = input1_gpu.data<float>();
+  float* b = input2_gpu.data<float>();
+  float* c = input3_gpu.mutable_data<float>(*gpu_place);
+
+  paddle::operators::math::gemm<paddle::platform::GPUPlace, float>(
+      context, false, true, m, n, k, 1, a, 3, b + 3, 3, 1, c + 1, 4);
+
+  input3.CopyFrom<float>(input3_gpu, *cpu_place, context);
+  context.Wait();
+
+  EXPECT_EQ(input3_ptr[0], 0);
+  EXPECT_EQ(input3_ptr[1], 24);
+  EXPECT_EQ(input3_ptr[2], 28);
+  EXPECT_EQ(input3_ptr[3], 32);
+  EXPECT_EQ(input3_ptr[4], 4);
+  EXPECT_EQ(input3_ptr[5], 73);
+  EXPECT_EQ(input3_ptr[6], 86);
+  EXPECT_EQ(input3_ptr[7], 99);
+  delete gpu_place;
+}
diff --git a/paddle/operators/math/matmul.h b/paddle/operators/math/matmul.h
new file mode 100644
index 0000000000000000000000000000000000000000..6ba9a0ba9a70bd938f9362179990ab68fa3186ba
--- /dev/null
+++ b/paddle/operators/math/matmul.h
@@ -0,0 +1,124 @@
+/* Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "paddle/operators/math/math_function.h"
+
+namespace paddle {
+namespace operators {
+namespace math {
+
+// Implements the logic of numpy matmul:
+// https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.matmul.html
+//
+// but allowing also for a, b to be transposed
+//
+// Both a & b can be 1- to 3-dimensional. Higher rank tensors are not supported
+// yet.
+template <typename Place, typename T>
+class MatMulFunctor {
+ public:
+  void operator()(const platform::DeviceContext& context,
+                  const framework::Tensor& a, bool trans_a,
+                  const framework::Tensor& b, bool trans_b, T alpha,
+                  framework::Tensor* out, T beta) {
+    auto dim_a = a.dims();
+    auto dim_b = b.dims();
+
+    PADDLE_ENFORCE(a.place() == b.place() && b.place() == out->place(),
+                   "Tensors must all be in the same place.");
+    PADDLE_ENFORCE_GE(dim_a.size(), 1,
+                      "Input tensor a must be at least 1-dimensional.");
+    PADDLE_ENFORCE_GE(dim_b.size(), 1,
+                      "Input tensor b must be at least 1-dimensional.");
+    PADDLE_ENFORCE_LE(dim_a.size(), 3,
+                      "Input tensor a must be at most 3-dimensional.");
+    PADDLE_ENFORCE_LE(dim_b.size(), 3,
+                      "Input tensor b must be at most 3-dimensional.");
+
+    int M = 0, N = 0, kA = 0, kB = 0, batchCountA = 0, batchCountB = 0,
+        strideA = 0, strideB = 0;
+
+    switch (dim_a.size()) {
+      case 1:
+        // similar to np.matmul:
+        // prepend dimension 1 (no transpose) or append dimension 1 (transpose)
+        M = trans_a ? dim_a[0] : 1;
+        kA = trans_a ? 1 : dim_a[0];
+        break;
+      case 2:
+        M = trans_a ? dim_a[1] : dim_a[0];
+        kA = trans_a ? dim_a[0] : dim_a[1];
+        break;
+      case 3:
+        batchCountA = dim_a[0];
+        M = trans_a ? dim_a[2] : dim_a[1];
+        kA = trans_a ? dim_a[1] : dim_a[2];
+        strideA = M * kA;
+        break;
+      default:
+        assert(false);
+    }
+
+    switch (dim_b.size()) {
+      case 1:
+        // similar to np.matmul:
+        // append dimension 1 (no transpose) or prepend dimension 1 (transpose)
+        kB = trans_b ? 1 : dim_b[0];
+        N = trans_b ? dim_b[0] : 1;
+        break;
+      case 2:
+        kB = trans_b ? dim_b[1] : dim_b[0];
+        N = trans_b ? dim_b[0] : dim_b[1];
+        break;
+      case 3:
+        batchCountB = dim_b[0];
+        kB = trans_b ? dim_b[2] : dim_b[1];
+        N = trans_b ? dim_b[1] : dim_b[2];
+        strideB = kB * N;
+        break;
+      default:
+        assert(false);
+    }
+
+    PADDLE_ENFORCE_EQ(
+        kA, kB,
+        "First matrix's width must be equal with second matrix's height.");
+    if (batchCountA && batchCountB) {
+      PADDLE_ENFORCE_EQ(
+          batchCountA, batchCountB,
+          "When input tensors a and b are both batched, they must have the "
+          "same batch dimension.");
+    }
+    int batchCount = std::max(batchCountA, batchCountB);
+
+    CBLAS_TRANSPOSE transA = (trans_a == false) ? CblasNoTrans : CblasTrans;
+    CBLAS_TRANSPOSE transB = (trans_b == false) ? CblasNoTrans : CblasTrans;
+
+    if (!batchCount) {
+      // regular matrix multiplication
+      gemm<Place, T>(context, transA, transB, M, N, kA, alpha, a.data<T>(),
+                     b.data<T>(), beta, out->data<T>());
+    } else {
+      // batched matrix multiplication
+      batched_gemm<Place, T>(context, transA, transB, M, N, kA, alpha,
+                             a.data<T>(), b.data<T>(), beta, out->data<T>(),
+                             batchCount, strideA, strideB);
+    }
+  }
+};
+
+}  // namespace math
+}  // namespace operators
+}  // namespace paddle
diff --git a/paddle/operators/math/selected_rows_functor.cc b/paddle/operators/math/selected_rows_functor.cc
new file mode 100644
index 0000000000000000000000000000000000000000..f2305ea16913e927dca17e5a80201368f03ca253
--- /dev/null
+++ b/paddle/operators/math/selected_rows_functor.cc
@@ -0,0 +1,114 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/math/selected_rows_functor.h"
+#include "paddle/operators/math/math_function.h"
+
+namespace paddle {
+namespace operators {
+namespace math {
+template <typename T>
+struct SelectedRowsAdd<platform::CPUPlace, T> {
+  void operator()(const platform::DeviceContext& context,
+                  const framework::SelectedRows& input1,
+                  const framework::SelectedRows& input2,
+                  framework::SelectedRows* output) {
+    auto in1_height = input1.height();
+    PADDLE_ENFORCE_EQ(in1_height, input2.height());
+    output->set_height(in1_height);
+
+    auto& in1_rows = input1.rows();
+    auto& in2_rows = input2.rows();
+    std::vector<int64_t> out_rows;
+    out_rows.reserve(in1_rows.size() + in2_rows.size());
+
+    // concat rows
+    out_rows.insert(out_rows.end(), in1_rows.begin(), in1_rows.end());
+    out_rows.insert(out_rows.end(), in2_rows.begin(), in2_rows.end());
+    output->set_rows(out_rows);
+
+    auto* out_value = output->mutable_value();
+    auto& in1_value = input1.value();
+    auto& in2_value = input2.value();
+
+    auto in1_row_numel = in1_value.numel() / in1_rows.size();
+    PADDLE_ENFORCE_EQ(in1_row_numel, in2_value.numel() / in2_rows.size());
+    PADDLE_ENFORCE_EQ(in1_row_numel, out_value->numel() / out_rows.size());
+
+    auto in1_place = input1.place();
+    PADDLE_ENFORCE(platform::is_cpu_place(in1_place));
+    auto in2_place = input2.place();
+    PADDLE_ENFORCE(platform::is_cpu_place(in2_place));
+    auto out_place = context.GetPlace();
+    PADDLE_ENFORCE(platform::is_cpu_place(out_place));
+
+    auto* out_data = out_value->data<T>();
+    auto* in1_data = in1_value.data<T>();
+    memory::Copy(boost::get<platform::CPUPlace>(out_place), out_data,
+                 boost::get<platform::CPUPlace>(in1_place), in1_data,
+                 in1_value.numel() * sizeof(T));
+
+    auto* in2_data = in2_value.data<T>();
+    memory::Copy(boost::get<platform::CPUPlace>(out_place),
+                 out_data + in1_value.numel(),
+                 boost::get<platform::CPUPlace>(in2_place), in2_data,
+                 in2_value.numel() * sizeof(T));
+  }
+};
+
+template struct SelectedRowsAdd<platform::CPUPlace, float>;
+
+template <typename T>
+struct SelectedRowsAddTensor<platform::CPUPlace, T> {
+  void operator()(const platform::DeviceContext& context,
+                  const framework::SelectedRows& input1,
+                  const framework::Tensor& input2, framework::Tensor* output) {
+    auto in1_height = input1.height();
+    auto in2_dims = input2.dims();
+    auto out_dims = output->dims();
+    PADDLE_ENFORCE_EQ(in1_height, in2_dims[0]);
+    PADDLE_ENFORCE_EQ(in1_height, out_dims[0]);
+
+    auto& in1_value = input1.value();
+    auto& in1_rows = input1.rows();
+
+    int64_t in1_row_numel = in1_value.numel() / in1_rows.size();
+    PADDLE_ENFORCE_EQ(in1_row_numel, input2.numel() / in1_height);
+    PADDLE_ENFORCE_EQ(in1_row_numel, output->numel() / in1_height);
+
+    SetConstant<platform::CPUPlace, T> functor;
+    functor(context, output, 0.0);
+
+    auto* in1_data = in1_value.data<T>();
+    auto* out_data = output->data<T>();
+
+    for (size_t i = 0; i < in1_rows.size(); i++) {
+      for (int64_t j = 0; j < in1_row_numel; j++) {
+        out_data[in1_rows[i] * in1_row_numel + j] +=
+            in1_data[i * in1_row_numel + j];
+      }
+    }
+
+    auto out_eigen = framework::EigenVector<T>::Flatten(*output);
+    auto in2_eigen = framework::EigenVector<T>::Flatten(input2);
+    out_eigen.device(*context.GetEigenDevice<platform::CPUPlace>()) =
+        out_eigen + in2_eigen;
+  }
+};
+
+template struct SelectedRowsAddTensor<platform::CPUPlace, float>;
+
+}  // namespace math
+}  // namespace operators
+}  // namespace paddle
diff --git a/paddle/operators/math/selected_rows_functor.cu b/paddle/operators/math/selected_rows_functor.cu
new file mode 100644
index 0000000000000000000000000000000000000000..ea149ebbc12beeab43a2047372352ba769959307
--- /dev/null
+++ b/paddle/operators/math/selected_rows_functor.cu
@@ -0,0 +1,142 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/math/math_function.h"
+#include "paddle/operators/math/selected_rows_functor.h"
+#include "paddle/platform/cuda_helper.h"
+
+namespace paddle {
+namespace operators {
+namespace math {
+template <typename T>
+struct SelectedRowsAdd<platform::GPUPlace, T> {
+  void operator()(const platform::DeviceContext& context,
+                  const framework::SelectedRows& input1,
+                  const framework::SelectedRows& input2,
+                  framework::SelectedRows* output) {
+    auto in1_height = input1.height();
+    PADDLE_ENFORCE_EQ(in1_height, input2.height());
+    output->set_height(in1_height);
+
+    auto& in1_rows = input1.rows();
+    auto& in2_rows = input2.rows();
+    std::vector<int64_t> out_rows;
+    out_rows.reserve(in1_rows.size() + in2_rows.size());
+
+    // concat rows
+    out_rows.insert(out_rows.end(), in1_rows.begin(), in1_rows.end());
+    out_rows.insert(out_rows.end(), in2_rows.begin(), in2_rows.end());
+    output->set_rows(out_rows);
+
+    auto* out_value = output->mutable_value();
+    auto& in1_value = input1.value();
+    auto& in2_value = input2.value();
+
+    auto in1_row_numel = in1_value.numel() / in1_rows.size();
+    PADDLE_ENFORCE_EQ(in1_row_numel, in2_value.numel() / in2_rows.size());
+    PADDLE_ENFORCE_EQ(in1_row_numel, out_value->numel() / out_rows.size());
+
+    auto* out_data = out_value->data<T>();
+    auto* in1_data = in1_value.data<T>();
+
+    auto in1_place = input1.place();
+    PADDLE_ENFORCE(platform::is_gpu_place(in1_place));
+    auto in2_place = input2.place();
+    PADDLE_ENFORCE(platform::is_gpu_place(in2_place));
+    auto out_place = context.GetPlace();
+    PADDLE_ENFORCE(platform::is_gpu_place(out_place));
+
+    memory::Copy(
+        boost::get<platform::GPUPlace>(out_place), out_data,
+        boost::get<platform::GPUPlace>(in1_place), in1_data,
+        in1_value.numel() * sizeof(T),
+        reinterpret_cast<const platform::CUDADeviceContext&>(context).stream());
+
+    auto* in2_data = in2_value.data<T>();
+    memory::Copy(
+        boost::get<platform::GPUPlace>(out_place), out_data + in1_value.numel(),
+        boost::get<platform::GPUPlace>(in2_place), in2_data,
+        in2_value.numel() * sizeof(T),
+        reinterpret_cast<const platform::CUDADeviceContext&>(context).stream());
+  }
+};
+
+template struct SelectedRowsAdd<platform::GPUPlace, float>;
+
+namespace {
+template <typename T>
+__global__ void SelectedRowsAddTensorKernel(const T* selected_rows,
+                                            const int64_t* rows, T* tensor_out,
+                                            int64_t row_numel, int block_size) {
+  const int ty = blockIdx.y;
+  int tid = threadIdx.x;
+
+  selected_rows += ty * row_numel;
+  tensor_out += rows[ty] * row_numel;
+
+  for (int index = tid; index < row_numel; index += block_size) {
+    // Since index in rows of SelectedRows can be duplicate, we can not use
+    // tensor_out[index] += selected_rows[index]; Instead, we have to use
+    // AtomicAdd to avoid concurrent write error.
+    paddle::platform::CudaAtomicAdd(tensor_out + index, selected_rows[index]);
+  }
+}
+}  // namespace
+
+template <typename T>
+struct SelectedRowsAddTensor<platform::GPUPlace, T> {
+  void operator()(const platform::DeviceContext& context,
+                  const framework::SelectedRows& input1,
+                  const framework::Tensor& input2, framework::Tensor* output) {
+    auto in1_height = input1.height();
+    auto in2_dims = input2.dims();
+    auto out_dims = output->dims();
+    PADDLE_ENFORCE_EQ(in1_height, in2_dims[0]);
+    PADDLE_ENFORCE_EQ(in1_height, out_dims[0]);
+
+    auto& in1_value = input1.value();
+    auto& in1_rows = input1.rows();
+
+    int64_t in1_row_numel = in1_value.numel() / in1_rows.size();
+    PADDLE_ENFORCE_EQ(in1_row_numel, input2.numel() / in1_height);
+    PADDLE_ENFORCE_EQ(in1_row_numel, output->numel() / in1_height);
+
+    auto* in1_data = in1_value.data<T>();
+    auto* in2_data = input2.data<T>();
+    auto* out_data = output->data<T>();
+
+    SetConstant<platform::GPUPlace, T> functor;
+    functor(context, output, 0.0);
+
+    int block_size = 256;
+    dim3 threads(block_size, 1);
+    dim3 grid(1, in1_rows.size());
+    SelectedRowsAddTensorKernel<
+        T><<<grid, threads, 0,
+             reinterpret_cast<const platform::CUDADeviceContext&>(context)
+                 .stream()>>>(in1_data, in1_rows.data(), out_data,
+                              in1_row_numel, block_size);
+
+    auto out_eigen = framework::EigenVector<T>::Flatten(*output);
+    auto in2_eigen = framework::EigenVector<T>::Flatten(input2);
+    out_eigen.device(*context.GetEigenDevice<platform::GPUPlace>()) =
+        out_eigen + in2_eigen;
+  }
+};
+
+template struct SelectedRowsAddTensor<platform::GPUPlace, float>;
+
+}  // namespace math
+}  // namespace operators
+}  // namespace paddle
diff --git a/paddle/operators/math/selected_rows_functor.h b/paddle/operators/math/selected_rows_functor.h
new file mode 100644
index 0000000000000000000000000000000000000000..53ab240ca600cd4a817afa2c19fb8d9427c6f3da
--- /dev/null
+++ b/paddle/operators/math/selected_rows_functor.h
@@ -0,0 +1,41 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#pragma once
+#include "paddle/framework/selected_rows.h"
+#include "paddle/platform/device_context.h"
+
+namespace paddle {
+namespace operators {
+namespace math {
+
+// SelectedRows + SelectedRows will simplely concat value and rows.
+// The real computation happens in dealing with LoDTensor.
+template <typename Place, typename T>
+struct SelectedRowsAdd {
+  void operator()(const platform::DeviceContext& context,
+                  const framework::SelectedRows& input1,
+                  const framework::SelectedRows& input2,
+                  framework::SelectedRows* output);
+};
+
+template <typename Place, typename T>
+struct SelectedRowsAddTensor {
+  void operator()(const platform::DeviceContext& context,
+                  const framework::SelectedRows& input1,
+                  const framework::Tensor& input2, framework::Tensor* output);
+};
+
+}  // namespace math
+}  // namespace operators
+}  // namespace paddle
diff --git a/paddle/operators/math/selected_rows_functor_test.cc b/paddle/operators/math/selected_rows_functor_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..4f7760cb713b6bf58c82f38fb043d7d53d82710a
--- /dev/null
+++ b/paddle/operators/math/selected_rows_functor_test.cc
@@ -0,0 +1,106 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/math/selected_rows_functor.h"
+#include "gtest/gtest.h"
+#include "paddle/operators/math/math_function.h"
+
+TEST(selected_rows_functor, cpu_add) {
+  using namespace paddle::framework;
+  using namespace paddle::platform;
+  using namespace paddle::operators::math;
+
+  CPUPlace cpu_place;
+  CPUDeviceContext ctx(cpu_place);
+  SetConstant<CPUPlace, float> functor;
+  int64_t height = 10;
+  int64_t row_numel = 10;
+
+  std::vector<int64_t> rows1{0, 4, 7};
+  std::unique_ptr<SelectedRows> selected_rows1{new SelectedRows(rows1, height)};
+  auto* in1_value = selected_rows1->mutable_value();
+  in1_value->mutable_data<float>(
+      make_ddim({static_cast<int64_t>(rows1.size()), row_numel}), cpu_place);
+  functor(ctx, in1_value, 1.0);
+
+  std::vector<int64_t> rows2{0, 5, 7, 9};
+  std::unique_ptr<SelectedRows> selected_rows2{new SelectedRows(rows2, height)};
+  auto* in2_value = selected_rows2->mutable_value();
+  in2_value->mutable_data<float>(
+      make_ddim({static_cast<int64_t>(rows2.size()), row_numel}), cpu_place);
+  functor(ctx, in2_value, 2.0);
+
+  std::unique_ptr<SelectedRows> output{new SelectedRows()};
+  auto* out_value = output->mutable_value();
+
+  // simplely concat two SelectedRows
+  out_value->mutable_data<float>(make_ddim({7, 10}), cpu_place);
+
+  SelectedRowsAdd<CPUPlace, float> add_functor;
+  add_functor(ctx, *selected_rows1, *selected_rows2, output.get());
+
+  auto out_height = output->height();
+  EXPECT_EQ(out_height, height);
+
+  auto& out_rows = output->rows();
+
+  // input1 rows
+  EXPECT_EQ(out_rows[0], 0);
+  EXPECT_EQ(out_rows[1], 4);
+  EXPECT_EQ(out_rows[2], 7);
+  // input2 rows
+  EXPECT_EQ(out_rows[3], 0);
+  EXPECT_EQ(out_rows[4], 5);
+  EXPECT_EQ(out_rows[5], 7);
+  EXPECT_EQ(out_rows[6], 9);
+
+  auto* out_data = output->value().data<float>();
+  // input1 value
+  EXPECT_EQ(out_data[0 * row_numel + 0], 1.0);
+  EXPECT_EQ(out_data[0 * row_numel + 8], 1.0);
+  EXPECT_EQ(out_data[1 * row_numel + 1], 1.0);
+  EXPECT_EQ(out_data[2 * row_numel + 6], 1.0);
+  // input2 value
+  EXPECT_EQ(out_data[3 * row_numel + 3], 2.0);
+  EXPECT_EQ(out_data[3 * row_numel + 8], 2.0);
+  EXPECT_EQ(out_data[4 * row_numel + 4], 2.0);
+  EXPECT_EQ(out_data[5 * row_numel + 7], 2.0);
+  EXPECT_EQ(out_data[6 * row_numel + 9], 2.0);
+
+  std::unique_ptr<Tensor> tensor1{new Tensor()};
+  tensor1->mutable_data<float>(make_ddim({height, row_numel}), cpu_place);
+  functor(ctx, tensor1.get(), 3.0);
+
+  std::unique_ptr<Tensor> tensor2{new Tensor()};
+  tensor2->mutable_data<float>(make_ddim({height, row_numel}), cpu_place);
+
+  SelectedRowsAddTensor<CPUPlace, float> add_tensor_functor;
+  add_tensor_functor(ctx, *output, *tensor1, tensor2.get());
+
+  auto* tensor2_data = tensor2->data<float>();
+  // row0: 1.0 + 2.0 + 3.0
+  EXPECT_EQ(tensor2_data[0 * row_numel + 0], 6.0);
+  // row1: 3.0
+  EXPECT_EQ(tensor2_data[1 * row_numel + 1], 3.0);
+  // row4 : 1.0 + 3.0
+  EXPECT_EQ(tensor2_data[4 * row_numel + 6], 4.0);
+  // row5: 2.0 + 3.0
+  EXPECT_EQ(tensor2_data[5 * row_numel + 7], 5.0);
+  // row6: 3.0
+  EXPECT_EQ(tensor2_data[6 * row_numel + 1], 3.0);
+  // row7: 1.0 + 2.0 + 3.0
+  EXPECT_EQ(tensor2_data[7 * row_numel + 3], 6.0);
+  // row9: 2.0 + 3.0
+  EXPECT_EQ(tensor2_data[9 * row_numel + 6], 5.0);
+}
diff --git a/paddle/operators/math/selected_rows_functor_test.cu b/paddle/operators/math/selected_rows_functor_test.cu
new file mode 100644
index 0000000000000000000000000000000000000000..8a9f25b98263c3bef50c38f358a20ea98ebe6324
--- /dev/null
+++ b/paddle/operators/math/selected_rows_functor_test.cu
@@ -0,0 +1,115 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "gtest/gtest.h"
+#include "paddle/operators/math/math_function.h"
+#include "paddle/operators/math/selected_rows_functor.h"
+
+TEST(selected_rows_functor, gpu_add) {
+  using namespace paddle::framework;
+  using namespace paddle::platform;
+  using namespace paddle::operators::math;
+
+  GPUPlace gpu_place(0);
+  CPUPlace cpu_place;
+  CUDADeviceContext ctx(gpu_place);
+  SetConstant<GPUPlace, float> functor;
+  int64_t height = 10;
+  int64_t row_numel = 10;
+
+  std::vector<int64_t> rows1{0, 4, 7};
+  std::unique_ptr<SelectedRows> selected_rows1{new SelectedRows(rows1, height)};
+  auto* in1_value = selected_rows1->mutable_value();
+  in1_value->mutable_data<float>(
+      make_ddim({static_cast<int64_t>(rows1.size()), row_numel}), gpu_place);
+  functor(ctx, in1_value, 1.0);
+
+  std::vector<int64_t> rows2{0, 5, 7, 9};
+  std::unique_ptr<SelectedRows> selected_rows2{new SelectedRows(rows2, height)};
+  auto* in2_value = selected_rows2->mutable_value();
+  in2_value->mutable_data<float>(
+      make_ddim({static_cast<int64_t>(rows2.size()), row_numel}), gpu_place);
+  functor(ctx, in2_value, 2.0);
+
+  std::unique_ptr<SelectedRows> output{new SelectedRows()};
+  auto* out_value = output->mutable_value();
+
+  // simplely concat two SelectedRows
+  out_value->mutable_data<float>(make_ddim({7, 10}), gpu_place);
+
+  SelectedRowsAdd<GPUPlace, float> add_functor;
+  add_functor(ctx, *selected_rows1, *selected_rows2, output.get());
+
+  auto out_height = output->height();
+  EXPECT_EQ(out_height, height);
+
+  auto& out_rows = output->rows();
+
+  // input1 rows
+  EXPECT_EQ(out_rows[0], 0);
+  EXPECT_EQ(out_rows[1], 4);
+  EXPECT_EQ(out_rows[2], 7);
+  // input2 rows
+  EXPECT_EQ(out_rows[3], 0);
+  EXPECT_EQ(out_rows[4], 5);
+  EXPECT_EQ(out_rows[5], 7);
+  EXPECT_EQ(out_rows[6], 9);
+
+  Tensor out_cpu;
+  out_cpu.CopyFrom<float>(*out_value, cpu_place, ctx);
+  ctx.Wait();
+
+  auto* out_cpu_data = out_cpu.data<float>();
+  // input1 value
+  EXPECT_EQ(out_cpu_data[0 * row_numel + 0], 1.0);
+  EXPECT_EQ(out_cpu_data[0 * row_numel + 8], 1.0);
+  EXPECT_EQ(out_cpu_data[1 * row_numel + 1], 1.0);
+  EXPECT_EQ(out_cpu_data[2 * row_numel + 6], 1.0);
+  // input2 value
+  EXPECT_EQ(out_cpu_data[3 * row_numel + 3], 2.0);
+  EXPECT_EQ(out_cpu_data[3 * row_numel + 8], 2.0);
+  EXPECT_EQ(out_cpu_data[4 * row_numel + 4], 2.0);
+  EXPECT_EQ(out_cpu_data[5 * row_numel + 7], 2.0);
+  EXPECT_EQ(out_cpu_data[6 * row_numel + 9], 2.0);
+
+  std::unique_ptr<Tensor> tensor1{new Tensor()};
+  tensor1->mutable_data<float>(make_ddim({height, row_numel}), gpu_place);
+  functor(ctx, tensor1.get(), 3.0);
+
+  std::unique_ptr<Tensor> tensor2{new Tensor()};
+  tensor2->mutable_data<float>(make_ddim({height, row_numel}), gpu_place);
+
+  SelectedRowsAddTensor<GPUPlace, float> add_tensor_functor;
+  add_tensor_functor(ctx, *output, *tensor1, tensor2.get());
+
+  Tensor tensor2_cpu;
+  tensor2_cpu.CopyFrom<float>(*tensor2, cpu_place, ctx);
+  ctx.Wait();
+
+  auto* tensor2_cpu_data = tensor2_cpu.data<float>();
+  // row0: 1.0 + 2.0 + 3.0
+  EXPECT_EQ(tensor2_cpu_data[0 * row_numel + 0], 6.0);
+  // row1: 3.0
+  EXPECT_EQ(tensor2_cpu_data[1 * row_numel + 1], 3.0);
+  // row4 : 1.0 + 3.0
+  EXPECT_EQ(tensor2_cpu_data[4 * row_numel + 6], 4.0);
+  // row5: 2.0 + 3.0
+  EXPECT_EQ(tensor2_cpu_data[5 * row_numel + 7], 5.0);
+  // row6: 3.0
+  EXPECT_EQ(tensor2_cpu_data[6 * row_numel + 1], 3.0);
+  // row7: 1.0 + 2.0 + 3.0
+  EXPECT_EQ(tensor2_cpu_data[7 * row_numel + 3], 6.0);
+  // row9: 2.0 + 3.0
+  EXPECT_EQ(tensor2_cpu_data[9 * row_numel + 6], 5.0);
+}
diff --git a/paddle/operators/matmul_op.cc b/paddle/operators/matmul_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..5ecbee3b413617e3a5523d9a32e72bc08bd316c5
--- /dev/null
+++ b/paddle/operators/matmul_op.cc
@@ -0,0 +1,208 @@
+/* Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/matmul_op.h"
+
+namespace paddle {
+namespace operators {
+
+using framework::Tensor;
+
+class MatMulOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(framework::InferShapeContext* context) const override {
+    PADDLE_ENFORCE(context->HasInput("X"),
+                   "Input(X) of MatMulOp should not be null.");
+    PADDLE_ENFORCE(context->HasInput("Y"),
+                   "Input(Y) of MatMulOp should not be null.");
+    PADDLE_ENFORCE(context->HasOutput("Out"),
+                   "Output(Out) of MatMulOp should not be null.");
+
+    auto dim_x = context->GetInputDim("X");
+    auto dim_y = context->GetInputDim("Y");
+    bool transpose_x = context->Attrs().Get<bool>("transpose_X");
+    bool transpose_y = context->Attrs().Get<bool>("transpose_Y");
+
+    PADDLE_ENFORCE_GE(dim_x.size(), 1,
+                      "Input tensor X must be at least 1-dimensional.");
+    PADDLE_ENFORCE_GE(dim_y.size(), 1,
+                      "Input tensor Y must be at least 1-dimensional.");
+    PADDLE_ENFORCE_LE(dim_x.size(), 3,
+                      "Input tensor X must be at most 3-dimensional.");
+    PADDLE_ENFORCE_LE(dim_y.size(), 3,
+                      "Input tensor Y must be at most 3-dimensional.");
+
+    int M = 0, N = 0, KX = 0, KY = 0, batchCountX = 0, batchCountY = 0;
+    bool remove_initial_dim = false, remove_final_dim = false;
+
+    switch (dim_x.size()) {
+      case 1:
+        if (transpose_x) {
+          M = dim_x[0];
+          KX = 1;
+        } else {
+          M = 1;
+          KX = dim_x[0];
+          remove_initial_dim = true;
+        }
+        break;
+      case 2:
+        M = transpose_x ? dim_x[1] : dim_x[0];
+        KX = transpose_x ? dim_x[0] : dim_x[1];
+        break;
+      case 3:
+        batchCountX = dim_x[0];
+        M = transpose_x ? dim_x[2] : dim_x[1];
+        KX = transpose_x ? dim_x[1] : dim_x[2];
+        break;
+      default:
+        assert(false);
+    }
+
+    switch (dim_y.size()) {
+      case 1:
+        if (transpose_y) {
+          N = dim_y[0];
+          KY = 1;
+        } else {
+          N = 1;
+          KY = dim_y[0];
+          remove_final_dim = true;
+        }
+        break;
+      case 2:
+        KY = transpose_y ? dim_y[1] : dim_y[0];
+        N = transpose_y ? dim_y[0] : dim_y[1];
+        break;
+      case 3:
+        batchCountY = dim_y[0];
+        KY = transpose_y ? dim_y[2] : dim_y[1];
+        N = transpose_y ? dim_y[1] : dim_y[2];
+        break;
+      default:
+        assert(false);
+    }
+
+    PADDLE_ENFORCE_EQ(
+        KX, KY,
+        "First matrix's width must be equal with second matrix's height.");
+    if (batchCountX && batchCountY) {
+      PADDLE_ENFORCE_EQ(
+          batchCountX, batchCountY,
+          "When Input(X) and Input(Y) are both three dimensional, they "
+          "must have the same batch dimension.");
+    }
+    int batchCount = std::max(batchCountX, batchCountY);
+
+    std::vector<int64_t> dim_out;
+    if (batchCount) {
+      dim_out.push_back(batchCount);
+    }
+    if (!remove_initial_dim) {
+      dim_out.push_back(M);
+    }
+    if (!remove_final_dim) {
+      dim_out.push_back(N);
+    }
+    if (dim_out.size() == 0) {
+      // We don't support 0-dimensional Tensors (scalars), so instead
+      // treat the output as a Tensor of shape (1, ) in this case.
+      dim_out.push_back(1);
+    }
+    context->SetOutputDim("Out", framework::make_ddim(dim_out));
+    context->ShareLoD("X", /*->*/ "Out");
+  }
+};
+
+class MatMulOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  MatMulOpMaker(framework::OpProto* proto, framework::OpAttrChecker* op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "The first input of MatMul op");
+    AddInput("Y", "The second input of MatMul op");
+    AddOutput("Out", "The output of MatMul op");
+    AddAttr<bool>("transpose_X",
+                  R"DOC(If true, use the transpose of `X`.
+        )DOC")
+        .SetDefault(false);
+    AddAttr<bool>("transpose_Y",
+                  R"DOC(If true, use the transpose of `Y`.
+        )DOC")
+        .SetDefault(false);
+    AddComment(R"DOC(
+The MatMul operator is used to perform (batched) matrix multiplication
+over the last two dimensions of the input tensors `X` and `Y`.
+
+If a transpose flag is specified, the last two dimensions of the
+tensor are transposed. If the tensor is rank-1 of shape [D], then
+for `X` it is treated as [1, D] in nontransposed form and as [D, 1]
+in transposed form, whereas for `Y` it is the opposite: It is treated
+as [D, 1] in nontransposed form and as [1, D] in transposed form.
+
+Examples without transpose:
+- X: [K], Y: [K] => Out: [1]
+- X: [K], Y: [K, N] => Out: [N]
+- X: [B, M, K], Y: [K] => Out: [B, M]
+- X: [M, K], Y: [B, K, N] => Out: [B, M, N]
+- X: [B, M, K], Y: [B, K, N] => Out: [B, M, N]
+
+The behavior is designed to be similar to the `numpy.matmul` function.
+The differences are:
+- Currently only rank 1 to rank 3 input tensors are supported.
+- We add `transpose_X` and `transpose_Y` flags.
+
+Both the input `X` and `Y` can carry the LoD (Level of Details) information,
+or not. But the output only shares the LoD with input `X`.
+)DOC");
+  }
+};
+
+class MatMulOpGrad : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(framework::InferShapeContext* context) const override {
+    PADDLE_ENFORCE(context->HasInput("X"), "Input(X) should not be null");
+    PADDLE_ENFORCE(context->HasInput("Y"), "Input(Y) should not be null");
+    PADDLE_ENFORCE(context->HasInput(framework::GradVarName("Out")),
+                   "Input(Out@GRAD) should not be null");
+    auto x_dims = context->GetInputDim("X");
+    auto y_dims = context->GetInputDim("Y");
+
+    auto x_grad_name = framework::GradVarName("X");
+    auto y_grad_name = framework::GradVarName("Y");
+
+    if (context->HasOutput(x_grad_name)) {
+      context->SetOutputDim(x_grad_name, x_dims);
+    }
+    if (context->HasOutput(y_grad_name)) {
+      context->SetOutputDim(y_grad_name, y_dims);
+    }
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP(matmul, ops::MatMulOp, ops::MatMulOpMaker, matmul_grad,
+            ops::MatMulOpGrad);
+REGISTER_OP_CPU_KERNEL(matmul,
+                       ops::MatMulKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_CPU_KERNEL(
+    matmul_grad, ops::MatMulGradKernel<paddle::platform::CPUPlace, float>);
diff --git a/paddle/operators/matmul_op.cu b/paddle/operators/matmul_op.cu
new file mode 100644
index 0000000000000000000000000000000000000000..b7e66382f00445b087e14103e7a148d450b37405
--- /dev/null
+++ b/paddle/operators/matmul_op.cu
@@ -0,0 +1,21 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#include "paddle/operators/matmul_op.h"
+
+namespace ops = paddle::operators;
+REGISTER_OP_GPU_KERNEL(matmul,
+                       ops::MatMulKernel<paddle::platform::GPUPlace, float>);
+REGISTER_OP_GPU_KERNEL(
+    matmul_grad, ops::MatMulGradKernel<paddle::platform::GPUPlace, float>);
diff --git a/paddle/operators/matmul_op.h b/paddle/operators/matmul_op.h
new file mode 100644
index 0000000000000000000000000000000000000000..8ae54e1eec33c4bce563f697bafbdc68f97ab746
--- /dev/null
+++ b/paddle/operators/matmul_op.h
@@ -0,0 +1,228 @@
+/* Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserve.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   You may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#pragma once
+
+#include "paddle/framework/op_registry.h"
+#include "paddle/operators/math/matmul.h"
+#include "paddle/operators/transpose_op.h"
+
+namespace paddle {
+namespace operators {
+namespace matmul_detail {
+
+using Tensor = framework::Tensor;
+using DDim = framework::DDim;
+using framework::make_ddim;
+using framework::vectorize;
+
+template <typename Place, typename T>
+class MatMulKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& context) const override {
+    const Tensor& x = *context.Input<Tensor>("X");
+    const Tensor& y = *context.Input<Tensor>("Y");
+    Tensor* out = context.Output<Tensor>("Out");
+    out->mutable_data<T>(context.GetPlace());
+    bool transpose_x = context.Attr<bool>("transpose_X");
+    bool transpose_y = context.Attr<bool>("transpose_Y");
+
+    math::MatMulFunctor<Place, T>()(context.device_context(), x, transpose_x, y,
+                                    transpose_y, T(1), out, T(0));
+  }
+};
+
+template <typename T>
+inline Tensor Reshape(const Tensor& input, const DDim& dims) {
+  Tensor output;
+  output.ShareDataWith<T>(input);
+  output.Resize(dims);
+  return output;
+}
+
+// Reshape a rank-3 tensor from P x M x N to (P * M) x N.
+// Identity op if the tensor is not of rank 3.
+template <typename T>
+Tensor CombineBatchAndM(const Tensor& input) {
+  Tensor output;
+  output.ShareDataWith<T>(input);
+  auto in_dims = input.dims();
+  if (in_dims.size() == 3) {
+    std::vector<int64_t> out_dims = {in_dims[0] * in_dims[1], in_dims[2]};
+    output.Resize(make_ddim(out_dims));
+  }
+  return output;
+}
+
+// Reshape a rank-3 tensor from P x M x N to M x (P * N).
+// (Warning: This requires transposing data and writes into new memory.)
+// Identity op if the tensor is not of rank 3.
+template <typename Place, typename T>
+Tensor CombineBatchAndN(const framework::ExecutionContext& context,
+                        const Tensor& input) {
+  Tensor output;
+  auto in_dims = input.dims();
+  if (in_dims.size() == 3) {
+    output.Resize(in_dims);
+    output.mutable_data<T>(context.GetPlace());
+    EigenTranspose<Place, T, 3>(context, input, output, {1, 0, 2});
+    std::vector<int64_t> out_dims = {in_dims[1], in_dims[0] * in_dims[2]};
+    output.Resize(make_ddim(out_dims));
+  } else {
+    output.ShareDataWith<T>(input);
+  }
+  return output;
+}
+
+// Using dimensional constraints on matrix multiplication, it is
+// straight-forward to check the following table for when X and Y
+// are both matrices.
+//
+// transpose_X | False    | True     | False    | True
+// transpose_Y | False    | False    | True     | True
+// -----------+----------+----------+----------+-----------
+//        dX = | dOut Y^T | Y dOut^T | dOut Y   | Y^T dOut^T
+//        dY = | X^T dOut | X dOut   | dOut^T X | dOut^T X^T
+//
+// When X is a vector of size K, we treat it instead as a matrix of shape
+// (1, K). Similarly, when Y is a vector of size K, we treat it instead as
+// a matrix of shape (K, 1).
+//
+// When X and Y are both 3-dimensional tensors, then the first dimension
+// the batch dimension can be ignored and the exact same formulas apply
+// as for two matrices.
+//
+// Finally, when, e.g., X is a 3-dimensional tensor but Y is a matrix, we end
+// up with formulas like
+//
+//   dY_{ij} = \sum_{p, m} X_{pmi} dOut_{pmj}
+//
+// To handle this sort of scenario, we reshape X : P x M x K, dOut: P x M x N
+// to X: (P * M) x K, dOut: (P * M) x N.
+template <typename Place, typename T>
+class MatMulGradKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& context) const override {
+    const Tensor& x = *context.Input<Tensor>("X");
+    const Tensor& y = *context.Input<Tensor>("Y");
+    const Tensor& dout = *context.Input<Tensor>(framework::GradVarName("Out"));
+    Tensor* dx = context.Output<Tensor>(framework::GradVarName("X"));
+    Tensor* dy = context.Output<Tensor>(framework::GradVarName("Y"));
+    bool transpose_x = context.Attr<bool>("transpose_X");
+    bool transpose_y = context.Attr<bool>("transpose_Y");
+
+    std::vector<int64_t> x_dims = vectorize(x.dims());
+    std::vector<int64_t> y_dims = vectorize(y.dims());
+
+    // If X is a vector, reshape it to a matrix.
+    if (x_dims.size() == 1) {
+      x_dims.insert(x_dims.begin(), 1);
+    }
+
+    // If Y is a vector, reshape it to a matrix.
+    if (y_dims.size() == 1) {
+      y_dims.push_back(1);
+    }
+
+    // Fix the dOut dimensions.
+    int M = 0, N = 0, batchCountX = 0, batchCountY = 0;
+
+    switch (x_dims.size()) {
+      case 2:
+        M = transpose_x ? x_dims[1] : x_dims[0];
+        break;
+      case 3:
+        batchCountX = x_dims[0];
+        M = transpose_x ? x_dims[2] : x_dims[1];
+        break;
+      default:
+        assert(false);
+    }
+
+    switch (y_dims.size()) {
+      case 2:
+        N = transpose_y ? y_dims[0] : y_dims[1];
+        break;
+      case 3:
+        batchCountY = y_dims[0];
+        N = transpose_y ? y_dims[1] : y_dims[2];
+        break;
+      default:
+        assert(false);
+    }
+    if (batchCountX && batchCountY) {
+      PADDLE_ENFORCE_EQ(
+          batchCountX, batchCountY,
+          "When Input(X) and Input(Y) are both three dimensional, they "
+          "must have the same batch dimension.");
+    }
+    int batchCount = std::max(batchCountX, batchCountY);
+    std::vector<int64_t> dout_dims = {M, N};
+    if (batchCount) {
+      dout_dims.insert(dout_dims.begin(), batchCount);
+    }
+    Tensor X = Reshape<T>(x, make_ddim(x_dims));
+    Tensor Y = Reshape<T>(y, make_ddim(y_dims));
+    Tensor dOut = Reshape<T>(dout, make_ddim(dout_dims));
+
+    if (dx) {
+      dx->mutable_data<T>(context.GetPlace());
+      const Tensor& dOut_for_dX =
+          (x_dims.size() == 2 && y_dims.size() == 3)
+              ? CombineBatchAndN<Place, T>(context, dOut)
+              : dOut;
+      if (x_dims.size() == 2 && y_dims.size() == 3) {
+        Y = transpose_y ? CombineBatchAndM<T>(Y)
+                        : CombineBatchAndN<Place, T>(context, Y);
+      }
+      if (transpose_x) {
+        math::MatMulFunctor<Place, T>()(context.device_context(), Y,
+                                        transpose_y, dOut_for_dX, transpose_x,
+                                        T(1), dx, T(0));
+      } else {
+        math::MatMulFunctor<Place, T>()(context.device_context(), dOut_for_dX,
+                                        transpose_x, Y, !transpose_y, T(1), dx,
+                                        T(0));
+      }
+    }
+
+    if (dy) {
+      dy->mutable_data<T>(context.GetPlace());
+      const Tensor& dOut_for_dY = (y_dims.size() == 2 && x_dims.size() == 3)
+                                      ? CombineBatchAndM<T>(dOut)
+                                      : dOut;
+      if (y_dims.size() == 2 && x_dims.size() == 3) {
+        X = transpose_x ? CombineBatchAndN<Place, T>(context, X)
+                        : CombineBatchAndM<T>(X);
+        dOut = CombineBatchAndM<T>(dOut);
+      }
+      if (transpose_y) {
+        math::MatMulFunctor<Place, T>()(context.device_context(), dOut_for_dY,
+                                        transpose_y, X, transpose_x, T(1), dy,
+                                        T(0));
+      } else {
+        math::MatMulFunctor<Place, T>()(context.device_context(), X,
+                                        !transpose_x, dOut_for_dY, transpose_y,
+                                        T(1), dy, T(0));
+      }
+    }
+  }
+};
+}  // namespace matmul_detail
+
+using matmul_detail::MatMulKernel;
+using matmul_detail::MatMulGradKernel;
+
+}  // namespace operators
+}  // namespace paddle
diff --git a/paddle/operators/mean_op.cc b/paddle/operators/mean_op.cc
index 441543049faf96c8fbc14e84448a348c3287c1e2..9556fdf73151eeb947b4f1aee63e131ac6aa76e6 100644
--- a/paddle/operators/mean_op.cc
+++ b/paddle/operators/mean_op.cc
@@ -21,7 +21,6 @@ class MeanOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of MeanOp should not be null.");
@@ -46,7 +45,6 @@ class MeanGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X"));
   }
diff --git a/paddle/operators/minus_op.cc b/paddle/operators/minus_op.cc
index d7fd2f901b766c3105a5ea8847fb8a3fe456a945..f7943e99acc5975d077f2319b6f678cfc693c1f3 100644
--- a/paddle/operators/minus_op.cc
+++ b/paddle/operators/minus_op.cc
@@ -25,7 +25,6 @@ class MinusOp : public framework::OperatorWithKernel {
           const framework::AttributeMap &attrs)
       : OperatorWithKernel(type, inputs, outputs, attrs) {}
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of MinusOp should not be null.");
diff --git a/paddle/operators/modified_huber_loss_op.cc b/paddle/operators/modified_huber_loss_op.cc
index 6522327fdcf646a60df76909d049341b5a9021c9..7b9e9528952d552a69ffe6a628672901c5c1a7fd 100644
--- a/paddle/operators/modified_huber_loss_op.cc
+++ b/paddle/operators/modified_huber_loss_op.cc
@@ -21,7 +21,6 @@ class ModifiedHuberLossOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "X must be initialized.");
     PADDLE_ENFORCE(ctx->HasInput("Y"), "Y must be initialized.");
@@ -73,7 +72,6 @@ class ModifiedHuberLossGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "X must be initialized.");
     PADDLE_ENFORCE(ctx->HasInput("Y"), "Y must be initialized.");
diff --git a/paddle/operators/momentum_op.cc b/paddle/operators/momentum_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..9be4d15a43d87ae1a27c81498e8b19b0049a3bfa
--- /dev/null
+++ b/paddle/operators/momentum_op.cc
@@ -0,0 +1,94 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/momentum_op.h"
+
+namespace paddle {
+namespace operators {
+
+class MomentumOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(framework::InferShapeContext *ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("Param"),
+                   "Input(param) of Momentum should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Grad"),
+                   "Input(grad) of Momentum should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Velocity"),
+                   "Input(velocity) of Momentum should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("LearningRate"),
+                   "Input(LearningRate) of Momentum should not be null.");
+
+    PADDLE_ENFORCE(ctx->HasOutput("ParamOut"),
+                   "Output(ParamOut) of Momentum should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("VelocityOut"),
+                   "Output(VelocityOut) of Momentum should not be null.");
+
+    auto param_dim = ctx->GetInputDim("Param");
+    PADDLE_ENFORCE_EQ(
+        param_dim, ctx->GetInputDim("Grad"),
+        "Param and Grad input of MomentumOp should have the same dimension.");
+    PADDLE_ENFORCE_EQ(
+        param_dim, ctx->GetInputDim("Velocity"),
+        "Param and Velocity of MomentumOp should have the same dimension.");
+    PADDLE_ENFORCE_EQ(framework::product(ctx->GetInputDim("LearningRate")), 1,
+                      "Learning_rate should be a scalar");
+
+    ctx->SetOutputDim("ParamOut", param_dim);
+    ctx->SetOutputDim("VelocityOut", param_dim);
+  }
+};
+
+class MomentumOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  MomentumOpMaker(framework::OpProto *proto,
+                  framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("Param",
+             "(Tensor, default Tensor<float>) "
+             "Input parameter that has to be updated");
+    AddInput("Grad",
+             "(Tensor, default Tensor<float>) "
+             "Input gradient of the parameter");
+    AddInput("Velocity",
+             "(Tensor, default Tensor<float>) "
+             "Input velocity (corresponding to the parameter) "
+             "that has to be updated");
+    AddInput("LearningRate",
+             "(Tensor, default Tensor<float>) "
+             "Input learning rate");
+
+    AddOutput("ParamOut", "(Tensor) Output updated parameter");
+    AddOutput("VelocityOut", "(Tensor) Output updated velocity");
+
+    AddAttr<float>("mu", "(float) Momentum coefficient");
+    AddComment(R"DOC(
+
+Momentum Algorithm (momentum).
+
+velocity = mu * velocity + gradient
+param = param - learning_rate * velocity
+
+)DOC");
+  }
+};
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP_WITHOUT_GRADIENT(momentum, ops::MomentumOp, ops::MomentumOpMaker);
+REGISTER_OP_CPU_KERNEL(
+    momentum, ops::MomentumOpKernel<paddle::platform::CPUPlace, float>);
diff --git a/paddle/operators/momentum_op.cu b/paddle/operators/momentum_op.cu
new file mode 100644
index 0000000000000000000000000000000000000000..efc24e795e05951024009f0b3258769c352df344
--- /dev/null
+++ b/paddle/operators/momentum_op.cu
@@ -0,0 +1,20 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#define EIGEN_USE_GPU
+#include "paddle/operators/momentum_op.h"
+
+namespace ops = paddle::operators;
+REGISTER_OP_GPU_KERNEL(
+    momentum, ops::MomentumOpKernel<paddle::platform::GPUPlace, float>);
diff --git a/paddle/operators/momentum_op.h b/paddle/operators/momentum_op.h
new file mode 100644
index 0000000000000000000000000000000000000000..f7a724f048782ceee8509ddafcb4834fd8dbba8a
--- /dev/null
+++ b/paddle/operators/momentum_op.h
@@ -0,0 +1,55 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "paddle/framework/eigen.h"
+#include "paddle/framework/op_registry.h"
+
+namespace paddle {
+namespace operators {
+
+template <typename Place, typename T>
+class MomentumOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto param_out = ctx.Output<framework::Tensor>("ParamOut");
+    auto velocity_out = ctx.Output<framework::Tensor>("VelocityOut");
+    auto param = ctx.Input<framework::Tensor>("Param");
+    auto velocity = ctx.Input<framework::Tensor>("Velocity");
+    auto grad = ctx.Input<framework::Tensor>("Grad");
+    auto learning_rate = ctx.Input<framework::Tensor>("LearningRate");
+
+    param_out->mutable_data<T>(ctx.GetPlace());
+    velocity_out->mutable_data<T>(ctx.GetPlace());
+
+    float mu = ctx.Attr<float>("mu");
+
+    auto p_out = framework::EigenVector<T>::Flatten(*param_out);
+    auto v_out = framework::EigenVector<T>::Flatten(*velocity_out);
+
+    auto p = framework::EigenVector<T>::Flatten(*param);
+    auto v = framework::EigenVector<T>::Flatten(*velocity);
+    auto g = framework::EigenVector<T>::Flatten(*grad);
+    auto lr = framework::EigenVector<T>::Flatten(*learning_rate);
+
+    auto place = ctx.GetEigenDevice<Place>();
+
+    Eigen::DSizes<int, 1> grad_dsize(grad->numel());
+    v_out.device(place) = v * mu + g;
+    p_out.device(place) = p - lr.broadcast(grad_dsize) * v_out;
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
diff --git a/paddle/operators/mul_op.cc b/paddle/operators/mul_op.cc
index ec0683d8875a945746b676a1f859e614af557441..065800f250d8b35a626060bac271e1bce6bb784b 100644
--- a/paddle/operators/mul_op.cc
+++ b/paddle/operators/mul_op.cc
@@ -23,7 +23,6 @@ class MulOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) of MulOp should not be null.");
     PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) of MulOp should not be null.");
@@ -96,7 +95,6 @@ class MulOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
     PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should not be null");
@@ -106,10 +104,10 @@ class MulOpGrad : public framework::OperatorWithKernel {
     auto y_dims = ctx->GetInputDim("Y");
     auto out_dims = ctx->GetInputDim(framework::GradVarName("Out"));
 
-    auto x_mat_dims =
-        framework::flatten_to_2d(x_dims, Attr<int>("x_num_col_dims"));
-    auto y_mat_dims =
-        framework::flatten_to_2d(y_dims, Attr<int>("y_num_col_dims"));
+    auto x_mat_dims = framework::flatten_to_2d(
+        x_dims, ctx->Attrs().Get<int>("x_num_col_dims"));
+    auto y_mat_dims = framework::flatten_to_2d(
+        y_dims, ctx->Attrs().Get<int>("y_num_col_dims"));
 
     PADDLE_ENFORCE_EQ(
         x_mat_dims[0], out_dims[0],
diff --git a/paddle/operators/multiplex_op.cc b/paddle/operators/multiplex_op.cc
index 051051b051961c6da064bd9319460b3f41cea3e8..4d86769026e4b3e3040bdcb3bc6dc2edea58b4b0 100644
--- a/paddle/operators/multiplex_op.cc
+++ b/paddle/operators/multiplex_op.cc
@@ -23,7 +23,6 @@ class MultiplexOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Ids"), "Input(Ids) shouldn't be null.");
     PADDLE_ENFORCE(!ctx->Inputs("X").empty(),
@@ -51,6 +50,7 @@ class MultiplexOp : public framework::OperatorWithKernel {
     ctx->SetOutputDim("Out", in_dim);
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
     return framework::ToDataType(ctx.MultiInput<Tensor>("X")[0]->type());
@@ -89,7 +89,6 @@ class MultiplexGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(!ctx->Inputs("X").empty(), "Input(X) should not be null.");
     PADDLE_ENFORCE(!ctx->Outputs(framework::GradVarName("X")).empty(),
@@ -105,6 +104,7 @@ class MultiplexGradOp : public framework::OperatorWithKernel {
     ctx->SetOutputsDim(framework::GradVarName("X"), d_ins);
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
     return framework::ToDataType(ctx.MultiInput<Tensor>("X")[0]->type());
diff --git a/paddle/operators/name_convention.md b/paddle/operators/name_convention.md
index 379385dc5d914101c7b5c9494f9383b6cf6a9b79..5a216907950100070ba57176c382eb659effb293 100644
--- a/paddle/operators/name_convention.md
+++ b/paddle/operators/name_convention.md
@@ -11,7 +11,7 @@ When defining an operator in Paddle, a corresponding [OpProtoMaker](https://gith
   - If an operator's Input/Output are tensors in math, not match to any meaningful words, input name should starts from `X`. e.g. `X`, `Y`, and output name should starts from `Out`. e.g. `Out`. This rule intends making operators which have few inputs/outputs unified.
 
 - Attribute.
-  - Attribute name follows the **camelCase**. e.g. `x`, `y`, `axis`, `rowwiseMatrix`. Also, attribute name prefers to meaningful English words.
+  - Attribute name follows the **snake_case**. e.g. `x`, `y`, `axis`, `rowwise_matrix`. Also, attribute name prefers to meaningful English words.
 
 - Comments.
   - Input/Output/Attr comment follow the format of **(type,default value) usage**, corresponding to which type it can be and how it will be used in the operator. e.g.  Attribute in Accumulator`"gamma" `,`(float, default 1.0) Accumulation multiplier`.
diff --git a/paddle/operators/pad_op.cc b/paddle/operators/pad_op.cc
index 2f26ada85e477a6e5cd24c881f6369548331e8d0..73a0b8baff530840ddd0d4c65cd4c060ab18e401 100644
--- a/paddle/operators/pad_op.cc
+++ b/paddle/operators/pad_op.cc
@@ -23,7 +23,6 @@ class PadOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) of PadOp should not be null.");
     PADDLE_ENFORCE(ctx->HasOutput("Out"),
@@ -97,7 +96,6 @@ class PadOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
diff --git a/paddle/operators/pool_op.cc b/paddle/operators/pool_op.cc
index c6d9aae13322ebcc9ebbe394d9b9831bd67fe632..a326839c0f9ad14b8fd2aac596f21c7dd2539cd7 100644
--- a/paddle/operators/pool_op.cc
+++ b/paddle/operators/pool_op.cc
@@ -29,7 +29,7 @@ void PoolOp::InferShape(framework::InferShapeContext *ctx) const {
 
   auto in_x_dims = ctx->GetInputDim("X");
 
-  std::string pooling_type = ctx->Attrs().Get<std::string>("poolingType");
+  std::string pooling_type = ctx->Attrs().Get<std::string>("pooling_type");
   std::vector<int> ksize = ctx->Attrs().Get<std::vector<int>>("ksize");
   std::vector<int> strides = ctx->Attrs().Get<std::vector<int>>("strides");
   std::vector<int> paddings = ctx->Attrs().Get<std::vector<int>>("paddings");
@@ -37,7 +37,7 @@ void PoolOp::InferShape(framework::InferShapeContext *ctx) const {
   PADDLE_ENFORCE(in_x_dims.size() == 4 || in_x_dims.size() == 5,
                  "Pooling intput should be 4-D or 5-D tensor.");
 
-  if (ctx->Attrs().Get<bool>("globalPooling")) {
+  if (ctx->Attrs().Get<bool>("global_pooling")) {
     ksize.resize(static_cast<size_t>(in_x_dims.size()) - 2);
     for (size_t i = 0; i < ksize.size(); ++i)
       ksize[i] = static_cast<int>(in_x_dims[i + 2]);
@@ -80,23 +80,23 @@ Pool2dOpMaker::Pool2dOpMaker(framework::OpProto *proto,
             "the number of channels, H and W is the height and "
             "width of feature.");
 
-  AddAttr<std::string>("poolingType",
-                       "PoolingType of pooling operator."
+  AddAttr<std::string>("pooling_type",
+                       "Pooling_type of pooling operator."
                        "Str constant equal to 'max' or 'avg'.")
       .InEnum({"max", "avg"});
 
   AddAttr<std::vector<int>>(
       "ksize",
       "The pooling window size(height, width) of pooling operator."
-      "If globalPooling = true, ksize is ignored and need not be "
+      "If global_pooling = true, ksize is ignored and need not be "
       "specified.");  // TODO(Chengduo): Add checker. (Currently,
                       // TypedAttrChecker don't support vector type.)
   AddAttr<bool>(
-      "globalPooling",
-      "Whether to use the globalPooling."
+      "global_pooling",
+      "Whether to use the global_pooling."
       "Bool constant equal to false or true."
       "Default false."
-      "If globalPooling = true, ksize is ignored and need not be specified.")
+      "If global_pooling = true, ksize is ignored and need not be specified.")
       .SetDefault(false);
   AddAttr<std::vector<int>>("strides",
                             "The strides(height, width) of pooling window."
@@ -146,7 +146,7 @@ Pool3dOpMaker::Pool3dOpMaker(framework::OpProto *proto,
             "the number of channels, D, H and W is the depth, height and "
             "width of feature.");
 
-  AddAttr<std::string>("poolingType",
+  AddAttr<std::string>("pooling_type",
                        "PoolingType of pooling operator."
                        "Str constant equal to 'max' or 'avg'.")
       .InEnum({"max", "avg"});
@@ -154,15 +154,15 @@ Pool3dOpMaker::Pool3dOpMaker(framework::OpProto *proto,
   AddAttr<std::vector<int>>(
       "ksize",
       "The pooling window size(depth, height, width) of pooling operator."
-      "If globalPooling = true, ksize is ignored and need not be "
+      "If global_pooling = true, ksize is ignored and need not be "
       "specified.");  // TODO(Chengduo): Add checker. (Currently,
                       // TypedAttrChecker don't support vector type.)
   AddAttr<bool>(
-      "globalPooling",
-      "Whether to use the globalPooling."
+      "global_pooling",
+      "Whether to use the global_pooling."
       "Bool constant equal to false or true."
       "Default false."
-      "If globalPooling = true, ksize is ignored and need not be specified.")
+      "If global_pooling = true, ksize is ignored and need not be specified.")
       .SetDefault(false);
   AddAttr<std::vector<int>>("strides",
                             "Strides(depth, height, width) of pooling operator."
diff --git a/paddle/operators/pool_op.h b/paddle/operators/pool_op.h
index e5016d573dde0a9c8a90cddf14f68706b69fade5..ada956501918cc92a2d30ebb8d0c42453acd2839 100644
--- a/paddle/operators/pool_op.h
+++ b/paddle/operators/pool_op.h
@@ -28,7 +28,6 @@ class PoolOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override;
 };
 
@@ -36,7 +35,6 @@ class PoolOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override;
 };
 
@@ -59,11 +57,11 @@ class PoolKernel : public framework::OpKernel<T> {
     const Tensor* in_x = context.Input<Tensor>("X");
     Tensor* out = context.Output<Tensor>("Out");
 
-    std::string pooling_type = context.Attr<std::string>("poolingType");
+    std::string pooling_type = context.Attr<std::string>("pooling_type");
     std::vector<int> ksize = context.Attr<std::vector<int>>("ksize");
     std::vector<int> strides = context.Attr<std::vector<int>>("strides");
     std::vector<int> paddings = context.Attr<std::vector<int>>("paddings");
-    if (context.Attr<bool>("globalPooling")) {
+    if (context.Attr<bool>("global_pooling")) {
       for (size_t i = 0; i < ksize.size(); ++i) {
         ksize[i] = static_cast<int>(in_x->dims()[i + 2]);
       }
@@ -119,12 +117,12 @@ class PoolGradKernel : public framework::OpKernel<T> {
         context.Input<Tensor>(framework::GradVarName("Out"));
     Tensor* in_x_grad = context.Output<Tensor>(framework::GradVarName("X"));
 
-    std::string pooling_type = context.Attr<std::string>("poolingType");
+    std::string pooling_type = context.Attr<std::string>("pooling_type");
     std::vector<int> ksize = context.Attr<std::vector<int>>("ksize");
     std::vector<int> strides = context.Attr<std::vector<int>>("strides");
     std::vector<int> paddings = context.Attr<std::vector<int>>("paddings");
 
-    if (context.Attr<bool>("globalPooling")) {
+    if (context.Attr<bool>("global_pooling")) {
       for (size_t i = 0; i < ksize.size(); ++i)
         ksize[i] = static_cast<int>(in_x->dims()[i + 2]);
     }
diff --git a/paddle/operators/pool_with_index_op.cc b/paddle/operators/pool_with_index_op.cc
index 005ee886934b193064cc739638398b3535db9274..29d0322a27b71fe8d335703e228969c084f5139f 100644
--- a/paddle/operators/pool_with_index_op.cc
+++ b/paddle/operators/pool_with_index_op.cc
@@ -27,7 +27,6 @@ class MaxPoolWithIndexOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "X(Input) of Pooling should not be null.");
@@ -45,7 +44,7 @@ class MaxPoolWithIndexOp : public framework::OperatorWithKernel {
     PADDLE_ENFORCE(in_x_dims.size() == 4 || in_x_dims.size() == 5,
                    "Pooling intput should be 4-D or 5-D tensor.");
 
-    if (ctx->Attrs().Get<bool>("globalPooling")) {
+    if (ctx->Attrs().Get<bool>("global_pooling")) {
       ksize.resize(static_cast<size_t>(in_x_dims.size()) - 2);
       for (size_t i = 0; i < ksize.size(); ++i)
         ksize[i] = static_cast<int>(in_x_dims[i + 2]);
@@ -72,7 +71,6 @@ class MaxPoolWithIndexOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Mask"), "Input(Mask) must not be null.");
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) must not be null.");
@@ -108,15 +106,15 @@ class MaxPool2dWithIndexOpMaker : public framework::OpProtoAndCheckerMaker {
     AddAttr<std::vector<int>>(
         "ksize",
         "The pooling window size(height, width) of pooling operator."
-        "If globalPooling = true, ksize is ignored and need not be "
+        "If global_pooling = true, ksize is ignored and need not be "
         "specified.");  // TODO(Chengduo): Add checker. (Currently,
                         // TypedAttrChecker don't support vector type.)
     AddAttr<bool>(
-        "globalPooling",
-        "Whether to use the globalPooling."
+        "global_pooling",
+        "Whether to use the global_pooling."
         "Bool constant equal to false or true."
         "Default false."
-        "If globalPooling = true, ksize is ignored and need not be specified.")
+        "If global_pooling = true, ksize is ignored and need not be specified.")
         .SetDefault(false);
     AddAttr<std::vector<int>>("strides",
                               "The strides(height, width) of pooling window."
@@ -179,15 +177,15 @@ class MaxPool3dWithIndexOpMaker : public framework::OpProtoAndCheckerMaker {
     AddAttr<std::vector<int>>(
         "ksize",
         "The pooling window size(depth, height, width) of pooling operator."
-        "If globalPooling = true, ksize is ignored and need not be "
+        "If global_pooling = true, ksize is ignored and need not be "
         "specified.");  // TODO(Chengduo): Add checker. (Currently,
                         // TypedAttrChecker don't support vector type.)
     AddAttr<bool>(
-        "globalPooling",
-        "Whether to use the globalPooling."
+        "global_pooling",
+        "Whether to use the global_pooling."
         "Bool constant equal to false or true."
         "Default false."
-        "If globalPooling = true, ksize is ignored and need not be specified.")
+        "If global_pooling = true, ksize is ignored and need not be specified.")
         .SetDefault(false);
     AddAttr<std::vector<int>>(
         "strides",
diff --git a/paddle/operators/pool_with_index_op.h b/paddle/operators/pool_with_index_op.h
index 01b961ca8295f723bea7335e43ec5ab100dfc65c..455c453efcd15bf0150bbd3de83d50729f338b4b 100644
--- a/paddle/operators/pool_with_index_op.h
+++ b/paddle/operators/pool_with_index_op.h
@@ -35,7 +35,7 @@ class MaxPoolWithIndexKernel : public framework::OpKernel<T> {
     std::vector<int> ksize = context.Attr<std::vector<int>>("ksize");
     std::vector<int> strides = context.Attr<std::vector<int>>("strides");
     std::vector<int> paddings = context.Attr<std::vector<int>>("paddings");
-    if (context.Attr<bool>("globalPooling")) {
+    if (context.Attr<bool>("global_pooling")) {
       for (size_t i = 0; i < ksize.size(); ++i) {
         ksize[i] = static_cast<int>(in_x->dims()[i + 2]);
       }
@@ -70,7 +70,7 @@ class MaxPoolWithIndexGradKernel : public framework::OpKernel<T> {
     std::vector<int> ksize = context.Attr<std::vector<int>>("ksize");
     std::vector<int> strides = context.Attr<std::vector<int>>("strides");
     std::vector<int> paddings = context.Attr<std::vector<int>>("paddings");
-    if (context.Attr<bool>("globalPooling")) {
+    if (context.Attr<bool>("global_pooling")) {
       for (size_t i = 0; i < ksize.size(); ++i) {
         ksize[i] = static_cast<int>(in_x_grad->dims()[i + 2]);
       }
diff --git a/paddle/operators/prelu_op.cc b/paddle/operators/prelu_op.cc
index 166fe2682422c61a264c88b8683752ee323e940e..eef2e34eaacf59b9adacb343e9a0091ebabeaea3 100644
--- a/paddle/operators/prelu_op.cc
+++ b/paddle/operators/prelu_op.cc
@@ -25,7 +25,6 @@ class PReluOp : public framework::OperatorWithKernel {
           const framework::AttributeMap &attrs)
       : OperatorWithKernel(type, inputs, outputs, attrs) {}
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
     PADDLE_ENFORCE(ctx->HasInput("Alpha"), "Input(Alpha) should not be null");
@@ -62,7 +61,6 @@ class PReluGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) must not be null.");
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
diff --git a/paddle/operators/proximal_gd_op.cc b/paddle/operators/proximal_gd_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..e4b014b9f5866ec0791cba9b3998b1734066eeeb
--- /dev/null
+++ b/paddle/operators/proximal_gd_op.cc
@@ -0,0 +1,93 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/proximal_gd_op.h"
+
+namespace paddle {
+namespace operators {
+
+class ProximalGDOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(framework::InferShapeContext *ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("Param"),
+                   "Input(Param) of ProximalGDOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Grad"),
+                   "Input(Grad) of ProximalGDOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("LearningRate"),
+                   "Input(LearningRate) of ProximalGDOp should not be null.");
+
+    PADDLE_ENFORCE(ctx->HasOutput("ParamOut"),
+                   "Output(ParamOut) of ProximalGDOp should not be null.");
+
+    auto param_dim = ctx->GetInputDim("Param");
+    PADDLE_ENFORCE_EQ(param_dim, ctx->GetInputDim("Grad"),
+                      "Two input of ProximalGD Op's dimension must be same.");
+
+    auto lr_dim = ctx->GetInputDim("LearningRate");
+    PADDLE_ENFORCE_EQ(framework::product(lr_dim), 1,
+                      "Learning Rate should be a scalar.");
+
+    ctx->SetOutputDim("ParamOut", param_dim);
+  }
+};
+
+class ProximalGDOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  ProximalGDOpMaker(framework::OpProto *proto,
+                    framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("Param",
+             "(Tensor, default Tensor<float>) "
+             "Input parameter value that has to be updated.");
+    AddInput("Grad",
+             "(Tensor, default Tensor<float>) "
+             "Input gradient of the parameter.");
+    AddInput("LearningRate",
+             "(Tensor, default Tensor<float>) "
+             "The learning rate should be a tensor of size 1.");
+
+    AddOutput("ParamOut", "(Tensor) Output updated parameter value.");
+
+    AddAttr<float>("l1",
+                   "(float, default 0.0) "
+                   "L1 regularization strength.")
+        .SetDefault(0.0f);
+    AddAttr<float>("l2",
+                   "(float, default 0.0)"
+                   "L2 regularization strength.")
+        .SetDefault(0.0f);
+    AddComment(R"DOC(
+
+Optimizer that implements the proximal gradient descent algorithm.
+
+prox_param = param - learning_rate * grad
+param = sign(prox_param) / (1 + learning_rate * l2) *
+        max { |prox_param| - learning_rate * l1 , 0 }
+
+The paper that proposed Proximal Gradient Descent:
+(http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf)
+)DOC");
+  }
+};
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP_WITHOUT_GRADIENT(proximal_gd, ops::ProximalGDOp,
+                             ops::ProximalGDOpMaker);
+REGISTER_OP_CPU_KERNEL(
+    proximal_gd, ops::ProximalGDOpKernel<paddle::platform::CPUPlace, float>);
diff --git a/paddle/operators/proximal_gd_op.cu b/paddle/operators/proximal_gd_op.cu
new file mode 100644
index 0000000000000000000000000000000000000000..26f4ebaa0f43620fee7ece2d71755be94a0e01a5
--- /dev/null
+++ b/paddle/operators/proximal_gd_op.cu
@@ -0,0 +1,19 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed
+under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
+CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License. */
+
+#define EIGEN_USE_GPU
+#include "paddle/operators/proximal_gd_op.h"
+
+namespace ops = paddle::operators;
+REGISTER_OP_GPU_KERNEL(
+    proximal_gd, ops::ProximalGDOpKernel<paddle::platform::GPUPlace, float>);
diff --git a/paddle/operators/proximal_gd_op.h b/paddle/operators/proximal_gd_op.h
new file mode 100644
index 0000000000000000000000000000000000000000..bebda0204173ec5c3ec9a7a9da6fb623171f4cea
--- /dev/null
+++ b/paddle/operators/proximal_gd_op.h
@@ -0,0 +1,64 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "paddle/framework/eigen.h"
+#include "paddle/framework/op_registry.h"
+
+namespace paddle {
+namespace operators {
+
+using Tensor = framework::Tensor;
+template <typename T, int MajorType = Eigen::RowMajor,
+          typename IndexType = Eigen::DenseIndex>
+using EigenVector = framework::EigenVector<T, MajorType, IndexType>;
+
+template <typename Place, typename T>
+class ProximalGDOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto* param_out = ctx.Output<Tensor>("ParamOut");
+
+    param_out->mutable_data<T>(ctx.GetPlace());
+
+    auto grad = ctx.Input<Tensor>("Grad");
+
+    auto l1 = static_cast<T>(ctx.Attr<float>("l1"));
+    auto l2 = static_cast<T>(ctx.Attr<float>("l2"));
+
+    auto p = EigenVector<T>::Flatten(*ctx.Input<Tensor>("Param"));
+    auto g = EigenVector<T>::Flatten(*grad);
+    auto lr = EigenVector<T>::Flatten(*ctx.Input<Tensor>("LearningRate"));
+
+    auto p_out = EigenVector<T>::Flatten(*param_out);
+    auto place = ctx.GetEigenDevice<Place>();
+
+    Eigen::DSizes<int, 1> grad_dsize(grad->numel());
+
+    auto prox_param = p - lr.broadcast(grad_dsize) * g;
+    if (l1 > 0) {
+      p_out.device(place) =
+          prox_param.sign() *
+          (((prox_param.abs() - (lr * l1).broadcast(grad_dsize))
+                .cwiseMax(T(0.0))) /
+           (1.0 + (lr * l2).broadcast(grad_dsize)));
+    } else {
+      p_out.device(place) =
+          prox_param / (1.0 + (lr * l2).broadcast(grad_dsize));
+    }
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
diff --git a/paddle/operators/rank_loss_op.cc b/paddle/operators/rank_loss_op.cc
index e0abbc4db1723e297c313d50fc1f9cbb77acb9f3..17ef2b1d01bd37abf2ece97ed0a307c2f1bf7e6f 100644
--- a/paddle/operators/rank_loss_op.cc
+++ b/paddle/operators/rank_loss_op.cc
@@ -24,7 +24,6 @@ class RankLossOp : public framework::OperatorWithKernel {
              const framework::AttributeMap &attrs)
       : OperatorWithKernel(type, inputs, outputs, attrs) {}
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     // input check
     PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) shouldn't be null");
@@ -89,7 +88,6 @@ class RankLossGradOp : public framework::OperatorWithKernel {
                  const framework::AttributeMap &attrs)
       : OperatorWithKernel(type, inputs, outputs, attrs) {}
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) shouldn't be null.");
     PADDLE_ENFORCE(ctx->HasInput("Left"), "Input(Left) shouldn't be null.");
diff --git a/paddle/operators/reduce_op.cc b/paddle/operators/reduce_op.cc
index 005f88b57cb47b2c00f9a81da8afa49439665a22..5e878353ce978830ede03ca6284719615ed39718 100644
--- a/paddle/operators/reduce_op.cc
+++ b/paddle/operators/reduce_op.cc
@@ -23,7 +23,6 @@ class ReduceOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of ReduceOp should not be null.");
@@ -57,7 +56,6 @@ class ReduceGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null.");
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
diff --git a/paddle/operators/reshape_op.cc b/paddle/operators/reshape_op.cc
index 3cd54930a0919b7c6aad624b442eb15167b4c354..a8eb8d45eec214842ee756a260127b9d0aacb0f4 100644
--- a/paddle/operators/reshape_op.cc
+++ b/paddle/operators/reshape_op.cc
@@ -25,7 +25,6 @@ class ReshapeOp : public framework::OperatorWithKernel {
             const framework::AttributeMap &attrs)
       : OperatorWithKernel(type, inputs, outputs, attrs) {}
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     // input check
     PADDLE_ENFORCE(ctx->HasInput("X"),
@@ -93,7 +92,6 @@ class ReshapeGradOp : public framework::OperatorWithKernel {
                 const framework::AttributeMap &attrs)
       : OperatorWithKernel(type, inputs, outputs, attrs) {}
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) shouldn't be null.");
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
diff --git a/paddle/operators/rmsprop_op.cc b/paddle/operators/rmsprop_op.cc
index ada6f2bc3cd5de0424139144d8c0696158c319d9..fd5567a365c4c843de3b8aec7fa77164f16644a4 100644
--- a/paddle/operators/rmsprop_op.cc
+++ b/paddle/operators/rmsprop_op.cc
@@ -21,7 +21,6 @@ class RmspropOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Param"),
                    "Input(Param) of RmspropOp should not be null.");
diff --git a/paddle/operators/scale_op.cc b/paddle/operators/scale_op.cc
index ac297da6b72250eaa5eb6ea2177fbcc7a6c1ec56..7f1a21bea72992307a05d50e7a0600ee763dd813 100644
--- a/paddle/operators/scale_op.cc
+++ b/paddle/operators/scale_op.cc
@@ -25,7 +25,6 @@ class ScaleOp : public framework::OperatorWithKernel {
           const framework::AttributeMap &attrs)
       : OperatorWithKernel(type, inputs, outputs, attrs) {}
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of ScaleOp should not be null.");
@@ -56,7 +55,6 @@ class ScaleGradMaker : public framework::SingleGradOpDescMaker {
  public:
   using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
 
- protected:
   std::unique_ptr<framework::OpDescBind> Apply() const override {
     auto *grad_op = new framework::OpDescBind();
     grad_op->SetType("scale");
diff --git a/paddle/operators/scatter_op.cc b/paddle/operators/scatter_op.cc
index fbea01a8db007a5b8cfe17f545f1cce6ecff3afb..62e6c70b4513fdfab1c563b6b23f36292fb6486a 100644
--- a/paddle/operators/scatter_op.cc
+++ b/paddle/operators/scatter_op.cc
@@ -22,7 +22,6 @@ class ScatterOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Ref"),
                    "Input(Ref) of ScatterOp should not be null.");
@@ -49,6 +48,7 @@ class ScatterOp : public framework::OperatorWithKernel {
     ctx->SetOutputDim("Out", ref_dims);
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
     return framework::ToDataType(ctx.Input<Tensor>("Ref")->type());
@@ -59,13 +59,13 @@ class ScatterGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     ctx->SetOutputDim(framework::GradVarName("Updates"),
                       ctx->GetInputDim("Updates"));
     ctx->SetOutputDim(framework::GradVarName("Ref"), ctx->GetInputDim("Ref"));
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
     return framework::ToDataType(ctx.Input<Tensor>("Ref")->type());
diff --git a/paddle/operators/sequence_concat_op.cc b/paddle/operators/sequence_concat_op.cc
index 287fb1942e4a2b17f6d51c9a6b7f6fb71fbaa601..1fce96cdfe20fc3ab33a3cd00e9a03833c9b94f8 100644
--- a/paddle/operators/sequence_concat_op.cc
+++ b/paddle/operators/sequence_concat_op.cc
@@ -21,7 +21,6 @@ class SequenceConcatOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInputs("X"),
                    "Inputs(X) of SequenceConcatOp should not be null.");
@@ -105,7 +104,6 @@ class SequenceConcatGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
                    "The gradient of Out should not be null.");
diff --git a/paddle/operators/sequence_pool_op.cc b/paddle/operators/sequence_pool_op.cc
index 8dc4a59ba861237680b99c0d686e3dcb21b8071a..e3f5d509a85537669237b8fd0ed44efe8abb6874 100644
--- a/paddle/operators/sequence_pool_op.cc
+++ b/paddle/operators/sequence_pool_op.cc
@@ -21,7 +21,6 @@ class SequencePoolOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of SequencePoolOp should not be null.");
@@ -72,7 +71,6 @@ class SequencePoolGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
                    "Gradient of Out should not be null.");
diff --git a/paddle/operators/sequence_pool_op.h b/paddle/operators/sequence_pool_op.h
index ce68204d41607ed6618a3d61a77cab391960083a..a5569d1aace215c848de43dd9c3dcb414b709083 100644
--- a/paddle/operators/sequence_pool_op.h
+++ b/paddle/operators/sequence_pool_op.h
@@ -111,7 +111,8 @@ class SequencePoolGradKernel : public framework::OpKernel<T> {
     in_g->mutable_data<T>(context.GetPlace());
     if (strategy == LAST || strategy == FIRST) {
       // set X@Grad be zero at first when strategy is LAST/FIRST
-      math::SetConstant<Place, T>(context.device_context(), in_g, 0);
+      math::SetConstant<Place, T> functor;
+      functor(context.device_context(), in_g, 0);
     }
     auto place = context.GetEigenDevice<Place>();
     for (int i = 0; i < static_cast<int>(lod.size()) - 1; ++i) {
diff --git a/paddle/operators/sequence_softmax_op.cc b/paddle/operators/sequence_softmax_op.cc
index ea217ba4597c24ec323c6509f097874d977fcf81..c891ab1fdcbb167453462c45b00b4632e663dd0e 100644
--- a/paddle/operators/sequence_softmax_op.cc
+++ b/paddle/operators/sequence_softmax_op.cc
@@ -21,7 +21,6 @@ class SequenceSoftmaxOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of SequenceSoftmaxOp should not be null.");
@@ -66,7 +65,6 @@ class SequenceSoftmaxGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Out"),
                    "Input(Out) of SequenceSoftmaxGradOp should not be null.");
diff --git a/paddle/operators/sgd_op.cc b/paddle/operators/sgd_op.cc
index 2a6a162a02507fa42c5ecbba59384f9c32aba7a9..0f78eeab9bc643a1a70c4b6ab02a160bbeda2b33 100644
--- a/paddle/operators/sgd_op.cc
+++ b/paddle/operators/sgd_op.cc
@@ -21,7 +21,6 @@ class SGDOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Param"),
                    "Input(Param) of SGDOp should not be null.");
diff --git a/paddle/operators/sigmoid_cross_entropy_with_logits_op.cc b/paddle/operators/sigmoid_cross_entropy_with_logits_op.cc
index b6653e1cc754bd626f04c28c42ce59f83c2fc87f..e781c8db208464cb94d94d1914e50f5aba3db2c6 100644
--- a/paddle/operators/sigmoid_cross_entropy_with_logits_op.cc
+++ b/paddle/operators/sigmoid_cross_entropy_with_logits_op.cc
@@ -23,7 +23,6 @@ class SigmoidCrossEntropyWithLogitsOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
     PADDLE_ENFORCE(ctx->HasInput("Labels"),
@@ -52,7 +51,6 @@ class SigmoidCrossEntropyWithLogitsGradOp
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
     PADDLE_ENFORCE(ctx->HasInput("Labels"),
diff --git a/paddle/operators/smooth_l1_loss_op.cc b/paddle/operators/smooth_l1_loss_op.cc
index 91391dc945be3d3d94966339bf9232eb21dcc38f..a4f0f37764667c43d48c6aa7646d61cdf4f3fd2d 100644
--- a/paddle/operators/smooth_l1_loss_op.cc
+++ b/paddle/operators/smooth_l1_loss_op.cc
@@ -21,7 +21,6 @@ class SmoothL1LossOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "X must be initialized.");
     PADDLE_ENFORCE(ctx->HasInput("Y"), "Y must be initialized.");
@@ -93,7 +92,6 @@ class SmoothL1LossGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     auto in_dims = ctx->GetInputDim("X");
     auto out_dims = ctx->GetInputDim(framework::GradVarName("Out"));
diff --git a/paddle/operators/softmax_op.cc b/paddle/operators/softmax_op.cc
index 4c131ed44de1c1567193c7cabff6ddf2dc63d9be..00fd0b32a9b3c0dd9fedf7b7621b1f15e5c4ce93 100644
--- a/paddle/operators/softmax_op.cc
+++ b/paddle/operators/softmax_op.cc
@@ -21,7 +21,6 @@ class SoftmaxOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of SoftmaxOp should not be null.");
@@ -68,7 +67,6 @@ class SoftmaxOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should be not null.");
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Y")),
diff --git a/paddle/operators/softmax_with_cross_entropy_op.cc b/paddle/operators/softmax_with_cross_entropy_op.cc
index 5431a1657c300d9c60100616ca7ea2e1196824ec..942fbb42df8bb90b86bd097832a15b320a857750 100644
--- a/paddle/operators/softmax_with_cross_entropy_op.cc
+++ b/paddle/operators/softmax_with_cross_entropy_op.cc
@@ -46,7 +46,7 @@ class SoftmaxWithCrossEntropyOpMaker
               "(Tensor, default: Tensor<float>), A 2-D tensor. The cross "
               "entropy loss with shape [N x 1].");
     AddAttr<bool>(
-        "softLabel",
+        "soft_label",
         "(bool, default: false), A flag to indicate whether to interpretate "
         "the given labels as soft labels.")
         .SetDefault(false);
@@ -82,7 +82,6 @@ class SoftmaxWithCrossEntropyOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("Logits"),
                    "Input(Logits) should be not null.");
@@ -100,13 +99,13 @@ class SoftmaxWithCrossEntropyOp : public framework::OperatorWithKernel {
     PADDLE_ENFORCE_EQ(labels_dims.size(), 2UL,
                       "The labels should be a 2-D tensor.");
 
-    if (ctx->Attrs().Get<bool>("softLabel")) {
+    if (ctx->Attrs().Get<bool>("soft_label")) {
       PADDLE_ENFORCE_EQ(logits_dims[1], labels_dims[1],
-                        "If Attr(softLabel) == true, the 2nd dimension of "
+                        "If Attr(soft_label) == true, the 2nd dimension of "
                         "Input(X) and Input(Label) should be equal.");
     } else {
       PADDLE_ENFORCE_EQ(labels_dims[1], 1UL,
-                        "If Attr(softLabel) == false, the 2nd dimension of "
+                        "If Attr(soft_label) == false, the 2nd dimension of "
                         "Input(Label) should be 1.");
     }
 
@@ -117,6 +116,7 @@ class SoftmaxWithCrossEntropyOp : public framework::OperatorWithKernel {
     ctx->ShareLoD("Logits", /*->*/ "Loss");
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
     return framework::ToDataType(ctx.Input<Tensor>("Logits")->type());
@@ -127,7 +127,6 @@ class SoftmaxWithCrossEntropyOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Loss")),
                    "Input(Loss@Grad) should not be null.");
@@ -142,13 +141,13 @@ class SoftmaxWithCrossEntropyOpGrad : public framework::OperatorWithKernel {
     PADDLE_ENFORCE_EQ(labels_dims.size(), 2UL,
                       "The labels should be a 2-D tensor.");
 
-    if (ctx->Attrs().Get<bool>("softLabel")) {
+    if (ctx->Attrs().Get<bool>("soft_label")) {
       PADDLE_ENFORCE_EQ(softmax_dims[1], labels_dims[1],
-                        "When Attr(softLabel) == true, the 2nd dimension of "
+                        "When Attr(soft_label) == true, the 2nd dimension of "
                         "Input(X) and Input(Label) should be equal.");
     } else {
       PADDLE_ENFORCE_EQ(labels_dims[1], 1UL,
-                        "When Attr(softLabel) == false, the 2nd dimension of "
+                        "When Attr(soft_label) == false, the 2nd dimension of "
                         "Input(Label) should be 1.");
     }
 
@@ -156,6 +155,7 @@ class SoftmaxWithCrossEntropyOpGrad : public framework::OperatorWithKernel {
                       ctx->GetInputDim("Softmax"));
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
     return framework::ToDataType(
diff --git a/paddle/operators/softmax_with_cross_entropy_op.cu b/paddle/operators/softmax_with_cross_entropy_op.cu
index 2bc53ecf871eb1800a920ba85e8eac31d7037efe..d03a1a76585bc79633d089b776ca07ba908085ba 100644
--- a/paddle/operators/softmax_with_cross_entropy_op.cu
+++ b/paddle/operators/softmax_with_cross_entropy_op.cu
@@ -70,7 +70,7 @@ class SoftmaxWithCrossEntropyCUDAKernel : public framework::OpKernel<T> {
                                                   logits, softmax);
     math::CrossEntropyFunctor<platform::GPUPlace, T>()(
         context.device_context(), loss, softmax, labels,
-        context.Attr<bool>("softLabel"));
+        context.Attr<bool>("soft_label"));
   }
 };
 
@@ -93,7 +93,7 @@ class SoftmaxWithCrossEntropyGradCUDAKernel : public framework::OpKernel<T> {
     int block = 512;
     int grid = (batch_size * class_num + block - 1) / block;
 
-    if (context.Attr<bool>("softLabel")) {
+    if (context.Attr<bool>("soft_label")) {
       const T* label_data = labels->data<T>();
       SoftCrossEntropyGradientKernel<T><<<
           grid, block, 0, reinterpret_cast<const platform::CUDADeviceContext&>(
diff --git a/paddle/operators/softmax_with_cross_entropy_op.h b/paddle/operators/softmax_with_cross_entropy_op.h
index cffd422f1827b646a8abcd881fdcb5455e6a663a..66d7bc1569e124096f30b6cd91fe22189506e4a5 100644
--- a/paddle/operators/softmax_with_cross_entropy_op.h
+++ b/paddle/operators/softmax_with_cross_entropy_op.h
@@ -44,7 +44,7 @@ class SoftmaxWithCrossEntropyKernel : public framework::OpKernel<T> {
                                                   logits, softmax);
     math::CrossEntropyFunctor<platform::CPUPlace, T>()(
         context.device_context(), loss, softmax, labels,
-        context.Attr<bool>("softLabel"));
+        context.Attr<bool>("soft_label"));
   }
 };
 
@@ -60,7 +60,7 @@ class SoftmaxWithCrossEntropyGradKernel : public framework::OpKernel<T> {
     logit_grad->ShareDataWith<T>(*context.Input<Tensor>("Softmax"));
 
     const int class_num = logit_grad->dims()[1];
-    if (context.Attr<bool>("softLabel")) {
+    if (context.Attr<bool>("soft_label")) {
       auto out_grad_mat = EigenMatrix<T>::From(*out_grad);
       auto logit_grad_mat = EigenMatrix<T>::From(*logit_grad);
       auto lbl_mat = EigenMatrix<T>::From(*labels);
diff --git a/paddle/operators/split_op.cc b/paddle/operators/split_op.cc
index d5dd4df2a2a6f424421ae7f117c26acff08d37d0..4a6c50f7970208b0f4141aa057bd0db715fb6152 100644
--- a/paddle/operators/split_op.cc
+++ b/paddle/operators/split_op.cc
@@ -23,7 +23,6 @@ class SplitOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of SplitOp should not be null.");
diff --git a/paddle/operators/squared_l2_distance_op.cc b/paddle/operators/squared_l2_distance_op.cc
index cce4e527c36bc0f4735165ff88ebba85b00eb898..e360c19b47eae7fc32ae66f9e4e3873bff211b04 100644
--- a/paddle/operators/squared_l2_distance_op.cc
+++ b/paddle/operators/squared_l2_distance_op.cc
@@ -21,7 +21,6 @@ class SquaredL2DistanceOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of SquaredL2DistanceOp should not be null.");
@@ -85,7 +84,6 @@ class SquaredL2DistanceGradOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
                    "Gradient of Out should not be null");
diff --git a/paddle/operators/sum_op.cc b/paddle/operators/sum_op.cc
index 573487b83590c132d5a4379a4b2762fbc16c04bc..5214a8413e8f7b957015985496fe8fb4b4f8b323 100644
--- a/paddle/operators/sum_op.cc
+++ b/paddle/operators/sum_op.cc
@@ -21,7 +21,6 @@ class SumOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInputs("X"), "Inputs(X) should not be null");
     auto x_dims = ctx->GetInputsDim("X");
diff --git a/paddle/operators/top_k_op.cc b/paddle/operators/top_k_op.cc
index c95481991229dd40e162094ccbf15de8ab627c8b..d5c2c91a5fb0f639ea84d13e27de8271218da54f 100644
--- a/paddle/operators/top_k_op.cc
+++ b/paddle/operators/top_k_op.cc
@@ -21,7 +21,6 @@ class TopkOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext *ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of TopkOp should not be null.");
diff --git a/paddle/operators/transpose_op.cc b/paddle/operators/transpose_op.cc
index 1101bbe3efe3d313400663ade561063da4e34dfd..d785e57c830439ad80005d9a3d4bb77faf1ae1b9 100644
--- a/paddle/operators/transpose_op.cc
+++ b/paddle/operators/transpose_op.cc
@@ -23,7 +23,6 @@ class TransposeOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
     PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output(Out) should not be null");
@@ -92,7 +91,6 @@ class TransposeOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
diff --git a/paddle/operators/uniform_random_op.cc b/paddle/operators/uniform_random_op.cc
index 75928f1ec818ab028ea06cfa72273fb99430c3c8..612bdd70db28f2c1fbeb66456fae4ca865530f1f 100644
--- a/paddle/operators/uniform_random_op.cc
+++ b/paddle/operators/uniform_random_op.cc
@@ -46,7 +46,6 @@ class UniformRandomOp : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
 
- protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
     PADDLE_ENFORCE(ctx->HasOutput("Out"),
                    "Output(Out) of UniformRandomOp should not be null.");
@@ -63,6 +62,7 @@ class UniformRandomOp : public framework::OperatorWithKernel {
     ctx->SetOutputDim("Out", framework::make_ddim(temp));
   }
 
+ protected:
   framework::DataType IndicateDataType(
       const framework::ExecutionContext& ctx) const override {
     return static_cast<framework::DataType>(Attr<int>("data_type"));
diff --git a/paddle/platform/dynload/cublas.h b/paddle/platform/dynload/cublas.h
index 9d8343c0b5e200b390ccda760f09816959952e9d..6b64539b0a9a4d535a53447fbcc0e458f3ac9129 100644
--- a/paddle/platform/dynload/cublas.h
+++ b/paddle/platform/dynload/cublas.h
@@ -77,6 +77,10 @@ extern void *cublas_dso_handle;
   __macro(cublasDgemmBatched);            \
   __macro(cublasCgemmBatched);            \
   __macro(cublasZgemmBatched);            \
+  __macro(cublasSgemmStridedBatched);     \
+  __macro(cublasDgemmStridedBatched);     \
+  __macro(cublasCgemmStridedBatched);     \
+  __macro(cublasZgemmStridedBatched);     \
   __macro(cublasSgetrfBatched);           \
   __macro(cublasSgetriBatched);           \
   __macro(cublasDgetrfBatched);           \
diff --git a/paddle/pybind/protobuf.cc b/paddle/pybind/protobuf.cc
index b360b05d16c9a1c135fa56cb37919dece8f16788..d9647717d2a397a972d29e93545424431c21d495 100644
--- a/paddle/pybind/protobuf.cc
+++ b/paddle/pybind/protobuf.cc
@@ -100,21 +100,7 @@ using namespace paddle::framework;  // NOLINT
 // Bind Methods
 void BindProgramDesc(py::module &m) {
   py::class_<ProgramDescBind>(m, "ProgramDesc", "")
-      .def_static("instance",
-                  []() -> ProgramDescBind * {
-                    return &ProgramDescBind::Instance(&GetProgramDesc());
-                  },
-                  py::return_value_policy::reference)
-      .def_static("__create_program_desc__",
-                  []() -> ProgramDescBind * {
-                    // Only used for unit-test
-                    auto *prog_desc = new ProgramDesc;
-                    auto *block = prog_desc->mutable_blocks()->Add();
-                    block->set_idx(0);
-                    block->set_parent_idx(-1);
-                    return &ProgramDescBind::Instance(prog_desc);
-                  },
-                  py::return_value_policy::reference)
+      .def(py::init<>())
       .def("append_block", &ProgramDescBind::AppendBlock,
            py::return_value_policy::reference)
       .def("append_backward",
@@ -163,6 +149,11 @@ void BindBlockDesc(py::module &m) {
              return self.Var(name);
            },
            py::return_value_policy::reference)
+      .def("has_var",
+           [](BlockDescBind &self, py::bytes byte_name) {
+             std::string name = byte_name;
+             return self.HasVar(name);
+           })
       .def("find_var",
            [](BlockDescBind &self, py::bytes byte_name) {
              std::string name = byte_name;
@@ -171,8 +162,8 @@ void BindBlockDesc(py::module &m) {
            py::return_value_policy::reference)
       .def("all_vars", &BlockDescBind::AllVars,
            py::return_value_policy::reference)
-      .def("all_ops", &BlockDescBind::AllOps,
-           py::return_value_policy::reference)
+      .def("op_size", &BlockDescBind::OpSize)
+      .def("op", &BlockDescBind::Op, py::return_value_policy::reference)
       .def("serialize_to_string", [](BlockDescBind &block_desc) -> py::bytes {
         const BlockDesc *desc = block_desc.Proto();
         PADDLE_ENFORCE(desc->IsInitialized(),
diff --git a/paddle/pybind/pybind.cc b/paddle/pybind/pybind.cc
index afc80b25b185509739edcd0d06817aa1d507a8ec..9eb1bf4a16ef40bb3044f46db9777fd2f6c341d2 100644
--- a/paddle/pybind/pybind.cc
+++ b/paddle/pybind/pybind.cc
@@ -16,7 +16,10 @@ limitations under the License. */
 
 #include "paddle/framework/backward.h"
 #include "paddle/framework/executor.h"
+#include "paddle/framework/feed_fetch_method.h"
+#include "paddle/framework/framework.pb.h"
 #include "paddle/framework/lod_tensor.h"
+#include "paddle/framework/selected_rows.h"
 #include "paddle/framework/tensor_array.h"
 #include "paddle/operators/cond_op.h"
 #include "paddle/operators/dynamic_recurrent_op.h"
@@ -138,6 +141,32 @@ PYBIND11_PLUGIN(core) {
 #endif
       });
 
+  py::class_<SelectedRows>(m, "SelectedRows")
+      .def("__init__",
+           [](SelectedRows &instance) { new (&instance) SelectedRows(); })
+      .def("__init__",
+           [](SelectedRows &instance, const std::vector<int64_t> rows,
+              const int64_t &height) {
+             new (&instance) SelectedRows(rows, height);
+           })
+      .def("get_tensor",
+           [](SelectedRows &self) { return self.mutable_value(); },
+           py::return_value_policy::reference)
+      .def("set_height", &SelectedRows::set_height)
+      .def("height", &SelectedRows::height)
+      .def("set_rows", &SelectedRows::set_rows)
+      .def("rows", [](SelectedRows &self) {
+#ifndef PADDLE_WITH_CUDA
+        return self.rows();
+#else
+         auto rows = self.rows();
+         std::vector<int64_t> new_rows;
+         new_rows.reserve(rows.size());
+         std::copy(rows.begin(), rows.end(), std::back_inserter(new_rows));
+         return new_rows;
+#endif
+      });
+
   py::class_<Variable>(m, "Variable", R"DOC(Variable Class.
 
 All parameter, weight, gradient are variables in Paddle.
@@ -231,7 +260,7 @@ All parameter, weight, gradient are variables in Paddle.
                     PADDLE_ENFORCE(desc.IsInitialized(),
                                    "User OpDesc is not initialized, reason %s",
                                    desc.InitializationErrorString());
-                    return OpRegistry::CreateOp(desc);
+                    return OpRegistry::CreateOp(desc, nullptr);
                   })
       .def("backward",
            [](const OperatorBase &forwardOp,
@@ -335,7 +364,7 @@ All parameter, weight, gradient are variables in Paddle.
             PADDLE_ENFORCE(desc.IsInitialized(),
                            "User OpDesc is not initialized, reason %s",
                            desc.InitializationErrorString());
-            auto rnn_op = OpRegistry::CreateOp(desc);
+            auto rnn_op = OpRegistry::CreateOp(desc, nullptr);
             return static_cast<operators::RecurrentOp *>(rnn_op.release());
           })
       .def("set_stepnet", [](operators::RecurrentOp &self,
@@ -353,7 +382,7 @@ All parameter, weight, gradient are variables in Paddle.
                     PADDLE_ENFORCE(desc.IsInitialized(),
                                    "User OpDesc is not initialized, reason %s",
                                    desc.InitializationErrorString());
-                    auto rnn_op = OpRegistry::CreateOp(desc);
+                    auto rnn_op = OpRegistry::CreateOp(desc, nullptr);
                     return static_cast<operators::DynamicRecurrentOp *>(
                         rnn_op.release());
                   })
@@ -380,7 +409,7 @@ All parameter, weight, gradient are variables in Paddle.
                     PADDLE_ENFORCE(desc.IsInitialized(),
                                    "User OpDesc is not initialized, reason %s",
                                    desc.InitializationErrorString());
-                    auto cond_op = OpRegistry::CreateOp(desc);
+                    auto cond_op = OpRegistry::CreateOp(desc, nullptr);
                     return static_cast<operators::CondOp *>(cond_op.release());
                   })
       .def("set_truenet",
@@ -403,6 +432,10 @@ All parameter, weight, gradient are variables in Paddle.
   m.def("unique_integer", UniqueIntegerGenerator);
 
   m.def("is_compile_gpu", IsCompileGPU);
+  m.def("set_feed_variable_float", framework::SetFeedVariable<float>);
+  m.def("set_feed_variable_double", framework::SetFeedVariable<double>);
+  m.def("set_feed_variable_int", framework::SetFeedVariable<int>);
+  m.def("get_fetch_variable", framework::GetFetchVariable);
 
   BindProgramDesc(m);
   BindBlockDesc(m);
diff --git a/python/paddle/trainer_config_helpers/networks.py b/python/paddle/trainer_config_helpers/networks.py
index 93e8ac173e721d9623fce91f30ac4642d273caba..120c9d11a5ebaa72b94590e596fd4362c552f979 100644
--- a/python/paddle/trainer_config_helpers/networks.py
+++ b/python/paddle/trainer_config_helpers/networks.py
@@ -26,8 +26,9 @@ __all__ = [
     'sequence_conv_pool', 'simple_lstm', "simple_img_conv_pool",
     "img_conv_bn_pool", 'lstmemory_group', 'lstmemory_unit', 'small_vgg',
     'img_conv_group', 'vgg_16_network', 'gru_unit', 'gru_group', 'simple_gru',
-    'simple_attention', 'simple_gru2', 'bidirectional_gru', 'text_conv_pool',
-    'bidirectional_lstm', 'inputs', 'outputs'
+    'simple_attention', 'dot_product_attention', 'simple_gru2',
+    'bidirectional_gru', 'text_conv_pool', 'bidirectional_lstm', 'inputs',
+    'outputs'
 ]
 
 ######################################################
@@ -1361,6 +1362,7 @@ def simple_attention(encoded_sequence,
                                 compute attention weight.
     :type transform_param_attr: ParameterAttribute
     :return: a context vector
+    :rtype: LayerOutput
     """
     assert encoded_proj.size == decoder_state.size
     proj_size = encoded_proj.size
@@ -1396,6 +1398,88 @@ def simple_attention(encoded_sequence,
         input=scaled, pooling_type=SumPooling(), name="%s_pooling" % name)
 
 
+@wrap_name_default()
+def dot_product_attention(encoded_sequence,
+                          attended_sequence,
+                          transformed_state,
+                          softmax_param_attr=None,
+                          name=None):
+    """
+    Calculate and return a context vector with dot-product attention mechanism.
+    The dimension of the context vector equals to that of the attended_sequence.
+
+    ..  math::
+
+        a(s_{i-1},h_{j}) & = s_{i-1}^\mathrm{T} h_{j}
+
+        e_{i,j} & = a(s_{i-1}, h_{j})
+
+        a_{i,j} & = \\frac{exp(e_{i,j})}{\\sum_{k=1}^{T_x}{exp(e_{i,k})}}
+
+        c_{i} & = \\sum_{j=1}^{T_{x}}a_{i,j}z_{j}
+
+    where :math:`h_{j}` is the jth element of encoded_sequence,
+    :math:`z_{j}` is the jth element of attended_sequence,
+    :math:`s_{i-1}` is transformed_state.
+
+    The example usage is:
+
+    ..  code-block:: python
+
+        context = dot_product_attention(encoded_sequence=enc_seq,
+                                        attended_sequence=att_seq,
+                                        transformed_state=state,)
+
+    :param name: A prefix attached to the name of each layer that defined inside
+                 the dot_product_attention.
+    :type name: basestring
+    :param softmax_param_attr: The parameter attribute of sequence softmax
+                               that is used to produce attention weight.
+    :type softmax_param_attr: ParameterAttribute
+    :param encoded_sequence: The output hidden vectors of the encoder.
+    :type encoded_sequence: LayerOutput
+    :param attended_sequence: The attention weight is computed by a feed forward neural
+                              network which has two inputs : decoder's transformed hidden
+                              state of previous time step and encoder's output.
+                              attended_sequence is the sequence to be attended.
+    :type attended_sequence: LayerOutput
+    :param transformed_state: The transformed hidden state of decoder in previous time step.
+                              Since the dot-product operation will be performed on it and the
+                              encoded_sequence, their dimensions must be equal. For flexibility,
+                              we suppose transformations of the decoder's hidden state have been
+                              done outside dot_product_attention and no more will be performed
+                              inside. Then users can use either the original or transformed one.
+    :type transformed_state: LayerOutput
+    :return: The context vector.
+    :rtype: LayerOutput
+    """
+    assert transformed_state.size == encoded_sequence.size
+
+    expanded = expand_layer(
+        input=transformed_state,
+        expanded_as=encoded_sequence,
+        name='%s_expand' % name)
+
+    m = linear_comb_layer(
+        weights=expanded, vectors=encoded_sequence, name='%s_dot-product')
+
+    attention_weight = fc_layer(
+        input=m,
+        size=1,
+        act=SequenceSoftmaxActivation(),
+        param_attr=softmax_param_attr,
+        name="%s_softmax" % name,
+        bias_attr=False)
+
+    scaled = scaling_layer(
+        weight=attention_weight,
+        input=attended_sequence,
+        name='%s_scaling' % name)
+
+    return pooling_layer(
+        input=scaled, pooling_type=SumPooling(), name="%s_pooling" % name)
+
+
 def inputs(layers, *args):
     """
     Declare the inputs of network. The order of input should be as same as
diff --git a/python/paddle/v2/framework/framework.py b/python/paddle/v2/framework/framework.py
index d649e69d58961dcc43e3bf7325b0b06c832245dc..5a8ded46ea4277f3e2032932e1774ba3ad056db9 100644
--- a/python/paddle/v2/framework/framework.py
+++ b/python/paddle/v2/framework/framework.py
@@ -153,7 +153,8 @@ class OpProtoHolder(object):
             self.op_proto_map[proto.type] = proto
 
     def get_op_proto(self, type):
-        assert type in self.op_proto_map, "Operator \"%s\" has not been registered." % type
+        if type not in self.op_proto_map:
+            raise ValueError("Operator \"%s\" has not been registered." % type)
         return self.op_proto_map[type]
 
 
@@ -231,7 +232,7 @@ class Operator(object):
         if attrs is not None:
             for attr in proto.attrs:
                 attr_name = attr.name
-                if not attr_name in attrs:
+                if (not attr_name in attrs) or (attrs[attr_name] is None):
                     continue
                 if not isinstance(attrs[attr_name], Block):
                     self.desc.set_attr(attr_name, attrs[attr_name])
@@ -305,6 +306,14 @@ class Block(object):
     def idx(self):
         return self.desc.id
 
+    def var(self, name):
+        if name not in self.vars:
+            raise ValueError("var %s not in this block" % name)
+        return self.vars[name]
+
+    def all_parameters(self):
+        return {v for k, v in self.vars.iteritems() if isinstance(v, Parameter)}
+
     def create_var(self, *args, **kwargs):
         return Variable(self, *args, **kwargs)
 
@@ -313,7 +322,8 @@ class Block(object):
 
     def create_parameter(self, *args, **kwargs):
         global_block = self.program.global_block()
-        return Parameter(global_block, *args, **kwargs)
+        param = Parameter(global_block, *args, **kwargs)
+        return param
 
     def append_op(self, *args, **kwargs):
         op_desc = self.desc.append_op()
@@ -334,7 +344,10 @@ class Block(object):
                 self.create_var(name=var.name(), desc=var, type=var.type())
 
         # sync operators from cpp
-        ops_in_cpp = self.desc.all_ops()
+        ops_in_cpp = []
+        for op_idx in range(0, self.desc.op_size()):
+            ops_in_cpp.append(self.desc.op(op_idx))
+
         first_op_in_python = self.ops[0].desc
         last_op_in_python = self.ops[len(self.ops) - 1].desc
         start_index = None
@@ -375,9 +388,7 @@ class Program(object):
         return cls._instance
 
     def __init__(self):
-        assert not hasattr(self.__class__,
-                           '_instance'), 'Do not call constructor directly!'
-        self.desc = core.ProgramDesc.instance()
+        self.desc = core.ProgramDesc()
         self.blocks = [Block(self, 0)]
         self.current_block_idx = 0
 
@@ -391,10 +402,16 @@ class Program(object):
     def global_block(self):
         return self.blocks[0]
 
+    def block(self, index):
+        return self.blocks[index]
+
     def current_block(self):
         return self.blocks[self.current_block_idx]
 
     def append_backward(self, target, no_grad_set):
+        """
+        return map(param_name -> (grad_name, block_index, op_index))
+        """
         assert isinstance(target, Variable)
         param_to_grad_info = self.desc.append_backward(target.desc, no_grad_set)
         self.sync_with_cpp()
@@ -428,7 +445,6 @@ class Parameter(Variable):
             if each < 0:
                 raise ValueError("Parameter shape should not be related with "
                                  "batch-size")
-
         Variable.__init__(self, block, shape=shape, dtype=dtype, **kwargs)
         self.trainable = kwargs.get('trainable', True)
         self.init_attr = kwargs.get('initialize_attr', {
@@ -441,7 +457,7 @@ class Parameter(Variable):
         self._append_initialize_ops_()
 
     def _append_initialize_ops_(self):
-        attr = copy.deepcopy(self.init_attr)
+        attr = self.init_attr
         op_type = attr.pop('type', None)
         block = self.block
         assert isinstance(block, Block)
diff --git a/python/paddle/v2/framework/layer_helper.py b/python/paddle/v2/framework/layer_helper.py
new file mode 100644
index 0000000000000000000000000000000000000000..6615bdcd3b1afa493c9ad05c789664818e64d2f2
--- /dev/null
+++ b/python/paddle/v2/framework/layer_helper.py
@@ -0,0 +1,158 @@
+from paddle.v2.framework.framework import Variable, OpProtoHolder, g_program
+import paddle.v2.framework.core as core
+import copy
+import itertools
+
+
+def unique_name(prefix):
+    uid = core.unique_integer()  # unique during whole process.
+    return "_".join([prefix, str(uid)])
+
+
+class LayerHelper(object):
+    def __init__(self, layer_type, **kwargs):
+        self.kwargs = kwargs
+        self.layer_type = layer_type
+        name = self.kwargs.get('name', None)
+        if name is None:
+            self.kwargs['name'] = unique_name(self.layer_type)
+
+    @property
+    def name(self):
+        return self.kwargs['name']
+
+    @property
+    def program(self):
+        prog = self.kwargs.get('program', None)
+        if prog is None:
+            return g_program
+        else:
+            return prog
+
+    def append_op(self, *args, **kwargs):
+        return self.program.current_block().append_op(*args, **kwargs)
+
+    def multiple_input(self, input_param_name='input'):
+        inputs = self.kwargs.get(input_param_name, [])
+        type_error = TypeError(
+            "Input of {0} layer should be Variable or sequence of Variable".
+            format(self.layer_type))
+        if isinstance(inputs, Variable):
+            inputs = [inputs]
+        elif not isinstance(inputs, list) and not isinstance(inputs, tuple):
+            raise type_error
+        else:
+            for each in inputs:
+                if not isinstance(each, Variable):
+                    raise type_error
+        return inputs
+
+    def input(self, input_param_name='input'):
+        inputs = self.multiple_input(input_param_name)
+        if len(inputs) != 1:
+            raise "{0} layer only takes one input".format(self.layer_type)
+        return inputs[0]
+
+    @property
+    def param_attr(self):
+        default = {
+            'name': None,
+            'init_attr': {
+                'type': 'uniform_random',
+                'min': -1.0,
+                'max': 1.0
+            }
+        }
+        actual = self.kwargs.get('param_attr', None)
+        return actual if actual is not None else default
+
+    def bias_attr(self, shape, dtype):
+        bias_attr = self.kwargs.get('bias_attr', None)
+        if bias_attr is True:
+            bias_attr = {
+                'name': None,
+                'init_attr': {
+                    'type': 'fill_constant',
+                    'value': 0.0,
+                    'shape': shape,
+                    'dataType': dtype
+                }
+            }
+        return bias_attr
+
+    def multiple_param_attr(self, length):
+        param_attr = self.param_attr
+        if isinstance(param_attr, dict):
+            param_attr = [param_attr]
+
+        if len(param_attr) != 1 and len(param_attr) != length:
+            raise ValueError("parameter number mismatch")
+        elif len(param_attr) == 1 and length != 1:
+            tmp = [None] * length
+            for i in xrange(length):
+                tmp[i] = copy.deepcopy(param_attr[0])
+            param_attr = tmp
+        return param_attr
+
+    def iter_inputs_and_params(self, input_param_name='input'):
+        inputs = self.multiple_input(input_param_name)
+        param_attrs = self.multiple_param_attr(len(inputs))
+        for ipt, param_attr in itertools.izip(inputs, param_attrs):
+            yield ipt, param_attr
+
+    def input_dtype(self, input_param_name='input'):
+        inputs = self.multiple_input(input_param_name)
+        dtype = None
+        for each in inputs:
+            if dtype is None:
+                dtype = each.data_type
+            elif dtype != each.data_type:
+                raise ValueError("Data Type mismatch")
+        return dtype
+
+    def create_parameter(self, attr, shape, dtype, suffix='w'):
+        if attr['name'] is None:
+            attr['name'] = unique_name(".".join([self.name, suffix]))
+        return self.program.global_block().create_parameter(
+            name=attr['name'],
+            dtype=dtype,
+            shape=shape,
+            initialize_attr=attr['init_attr'])
+
+    def create_tmp_variable(self, dtype):
+        return self.program.current_block().create_var(
+            name=unique_name(".".join([self.name, 'tmp'])), dtype=dtype)
+
+    def create_global_variable(self, *args, **kwargs):
+        return self.program.global_block().create_var(*args, **kwargs)
+
+    def append_bias_op(self, input_var):
+        size = list(input_var.shape[1:])
+        bias_attr = self.bias_attr(size, dtype=input_var.data_type)
+        if not bias_attr:
+            return input_var
+
+        b = self.create_parameter(
+            attr=bias_attr, shape=size, dtype=input_var.data_type, suffix='b')
+        tmp = self.create_tmp_variable(dtype=input_var.data_type)
+        self.append_op(
+            type='elementwise_add',
+            inputs={'X': [input_var],
+                    'Y': [b]},
+            outputs={'Out': [tmp]})
+        return tmp
+
+    def append_activation(self, input_var):
+        act = self.kwargs.get('act', None)
+        if act is None:
+            return input_var
+        if isinstance(act, basestring):
+            act = {'type': act}
+        tmp = self.create_tmp_variable(dtype=input_var.data_type)
+        act_type = act.pop('type')
+        self.append_op(
+            type=act_type,
+            inputs={"X": [input_var]},
+            outputs={"Y": [tmp]},
+            attrs=act)
+        return tmp
diff --git a/python/paddle/v2/framework/layers.py b/python/paddle/v2/framework/layers.py
new file mode 100644
index 0000000000000000000000000000000000000000..c7397716c47dd7088d840edb00d96dda2fe88f1d
--- /dev/null
+++ b/python/paddle/v2/framework/layers.py
@@ -0,0 +1,189 @@
+from paddle.v2.framework.layer_helper import LayerHelper
+import paddle.v2.framework.core as core
+from paddle.v2.framework.framework import OpProtoHolder, Variable
+import re
+
+__all__ = ['fc', 'data', 'cross_entropy', 'conv2d']
+
+
+def fc(input,
+       size,
+       param_attr=None,
+       bias_attr=True,
+       name=None,
+       act=None,
+       num_flatten_dims=1,
+       program=None):
+    # create helper
+    helper = LayerHelper('fc', **locals())
+
+    dtype = helper.input_dtype()
+
+    # mul
+    mul_results = []
+    for input_var, param_attr in helper.iter_inputs_and_params():
+        input_shape = input_var.shape
+        param_shape = list(input_shape[num_flatten_dims:]) + [size]
+
+        w = helper.create_parameter(
+            attr=param_attr, shape=param_shape, dtype=dtype)
+        tmp = helper.create_tmp_variable(dtype)
+        helper.append_op(
+            type="mul",
+            inputs={
+                "X": input_var,
+                "Y": w,
+            },
+            outputs={"Out": tmp},
+            attrs={'x_num_col_dims': num_flatten_dims})
+        mul_results.append(tmp)
+
+    # sum
+    if len(mul_results) == 1:
+        pre_bias = mul_results[0]
+    else:
+        pre_bias = helper.create_tmp_variable(dtype)
+        helper.append_op(
+            type="sum", inputs={"X": mul_results}, outputs={"Out": pre_bias})
+    # add bias
+    pre_activation = helper.append_bias_op(pre_bias)
+    # add activation
+    return helper.append_activation(pre_activation)
+
+
+def data(name,
+         shape,
+         data_type='float32',
+         type=core.VarDesc.VarType.LOD_TENSOR,
+         program=None):
+    helper = LayerHelper('data', **locals())
+    shape = [-1] + shape  # append batch size as -1
+    return helper.create_global_variable(
+        name=name, shape=shape, dtype=data_type, type=type)
+
+
+def _convert_(name):
+    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
+    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()
+
+
+def _create_op_func_(op_type):
+    op_proto = OpProtoHolder.instance().get_op_proto(op_type)
+    if len(op_proto.outputs) != 1:
+        raise ValueError(
+            "Only one output operator can be automatically generated")
+
+    if op_proto.outputs[0].duplicable:
+        raise ValueError(
+            "Only not duplicable op can be automatically generated")
+
+    o_name = op_proto.outputs[0].name
+
+    def func(**kwargs):
+        helper = LayerHelper(op_type, **kwargs)
+        inputs = dict()
+        dtype = None
+        for ipt in op_proto.inputs:
+            name = _convert_(ipt.name)
+            val = kwargs.pop(name, [])
+            if not isinstance(val, list) and not isinstance(val, tuple):
+                val = [val]
+            for each in val:
+                if not isinstance(each, Variable):
+                    raise ValueError("input of {0} must be variable".format(
+                        op_type))
+
+                if dtype is None:
+                    dtype = each.data_type
+                elif dtype != each.data_type:
+                    raise ValueError(
+                        "operator {0} must input same dtype".format(op_type))
+            inputs[ipt.name] = val
+
+        out = helper.create_tmp_variable(dtype=dtype)
+        helper.append_op(
+            type=op_type, inputs=inputs, outputs={o_name: [out]}, attrs=kwargs)
+        return out
+
+    func.__name__ = op_type
+    globals()[op_type] = func
+    global __all__
+    __all__.append(op_type)
+
+
+_create_op_func_('mean')
+_create_op_func_('pool2d')
+
+
+def cross_entropy(input, label, **kwargs):
+    helper = LayerHelper('cross_entropy', **kwargs)
+    out = helper.create_tmp_variable(dtype=input.data_type)
+    helper.append_op(
+        type='cross_entropy',
+        inputs={'X': [input],
+                'Label': [label]},
+        outputs={'Y': [out]},
+        attrs=kwargs)
+    return out
+
+
+def square_error_cost(input, label, **kwargs):
+    helper = LayerHelper('square_error_cost', **kwargs)
+    minus_out = helper.create_tmp_variable(dtype=input.data_type)
+    helper.append_op(
+        type='elementwise_sub',
+        inputs={'X': [input],
+                'Y': [label]},
+        outputs={'Out': [minus_out]})
+
+    square_out = helper.create_tmp_variable(dtype=input.data_type)
+    helper.append_op(
+        type='pow',
+        inputs={'X': [minus_out]},
+        outputs={'Y': [square_out]},
+        attrs={'factor': 2.0})
+    return square_out
+
+
+def conv2d(input,
+           num_filters,
+           name=None,
+           filter_size=[1, 1],
+           act=None,
+           groups=None,
+           stride=[1, 1],
+           padding=None,
+           bias_attr=None,
+           param_attr=None,
+           program=None):
+    helper = LayerHelper('conv2d', **locals())
+    dtype = helper.input_dtype()
+
+    num_channels = input.shape[1]
+    if groups is None:
+        num_filter_channels = num_channels
+    else:
+        if num_channels % groups is not 0:
+            raise ValueError("num_channels must be divisible by groups.")
+        num_filter_channels = num_channels / groups
+
+    input_shape = input.shape
+    filter_shape = [num_filters, num_filter_channels] + filter_size
+    filter = helper.create_parameter(
+        attr=helper.param_attr, shape=filter_shape, dtype=dtype)
+    pre_bias = helper.create_tmp_variable(dtype)
+
+    helper.append_op(
+        type='conv2d',
+        inputs={
+            'Input': input,
+            'Filter': filter,
+        },
+        outputs={"Output": pre_bias},
+        attrs={'strides': stride,
+               'paddings': padding,
+               'groups': groups})
+
+    pre_act = helper.append_bias_op(pre_bias)
+
+    return helper.append_activation(pre_act)
diff --git a/python/paddle/v2/framework/optimizer.py b/python/paddle/v2/framework/optimizer.py
new file mode 100644
index 0000000000000000000000000000000000000000..e356a7aadb8d6a87d0fe54a5dd2a11fea0d80a74
--- /dev/null
+++ b/python/paddle/v2/framework/optimizer.py
@@ -0,0 +1,124 @@
+import paddle.v2.framework.framework as framework
+
+__all__ = ['SGDOptimizer']
+
+
+class Optimizer(object):
+    """Optimizer Base class.
+
+    Define the common interface of an optimizer.
+    User should not use this class directly, but need to use one of it's implementation.
+    """
+
+    def __init__(self):
+        pass
+
+    def _append_optimize_op(self, block, param_and_grad):
+        """ append optimize operator to block and return all the added optimize_op
+        """
+        raise NotImplementedError()
+
+    def create_backward_pass(self, loss, parameter_list=None, no_grad_set=None):
+        """
+        create and add gradient Operators in BlockDesc to Compute gradients of `loss`
+        for parameters in parameter_list
+
+        Args:
+          loss: an variable generated by cost function.
+          no_grad_set: variable that should not create gradient
+          parameter_list: parameters that need to compute gradient and update to optimize the lost.
+
+        Returns:
+          list of (parameters, gradients) pair.
+        """
+        assert isinstance(loss, framework.Variable)
+        param_grad_map = loss.block.program.append_backward(loss, no_grad_set or
+                                                            set())
+        if parameter_list is not None:
+            parameters = parameter_list
+        else:
+            params = loss.block.program.global_block().all_parameters()
+            parameters = [param.name for param in params]
+        params_and_grads = []
+        for param in parameters:
+            if param not in param_grad_map:
+                raise Exception("param %s is not in map" % param)
+            grad_info = param_grad_map[param]
+            grad_block = loss.block.program.block(grad_info[1])
+            if not grad_block.has_var(grad_info[0]):
+                raise Exception("grad block[%d] did not have grad var %s" %
+                                grad_info[1], grad_info[0])
+            param_var = loss.block.var(param)
+            grad_var = grad_block.var(grad_info[0])
+            if loss.block.has_var(grad_info[0]):
+                params_and_grads.append((param_var, grad_var))
+            else:
+                params_and_grads.append((param_var, None))
+        return params_and_grads
+
+    def create_optimization_pass(self, parameters_and_grads, loss):
+        """Add optimization operators to update gradients to variables.
+
+        Args:
+          loss: the target that this optimization is for.
+          parameters_and_grads: a list of (variable, gradient) pair to update.
+
+        Returns:
+          optmization_op_list: a list of optimization operator that will update parameter using gradient.
+        """
+        optimize_ops = []
+        for param_and_grad in parameters_and_grads:
+            if param_and_grad[1] is not None:
+                optimize_op = self._append_optimize_op(loss.block,
+                                                       param_and_grad)
+                optimize_ops.append(optimize_op)
+        return optimize_ops
+
+    def minimize(self, loss, parameter_list=None, no_grad_set=None):
+        """Add operations to minimize `loss` by updating `parameter_list`.
+
+        This method combines interface `create_backward_pass()` and
+        `create_optimization_pass()` into one.
+        """
+        params_grads = self.create_backward_pass(loss, parameter_list,
+                                                 no_grad_set or set())
+        optimize_ops = self.create_optimization_pass(params_grads, loss)
+        return optimize_ops
+
+
+class SGDOptimizer(Optimizer):
+    """ Simple SGD optimizer without any state.
+    """
+
+    def __init__(self, learning_rate):
+        assert learning_rate is not None
+        super(Optimizer, self).__init__()
+        self.type = "sgd"
+        self._learning_rate = learning_rate
+
+    def _append_optimize_op(self, block, param_and_grad):
+        assert isinstance(block, framework.Block)
+        lr_shape = [1]
+        # create a var for learning_rate
+        lr = block.create_var(dtype="float32", shape=lr_shape, lod_level=0)
+
+        # create an op to init the learning_rate
+        init_op = block.append_op(
+            type="fill_constant",
+            outputs={"Out": lr},
+            attrs={"shape": lr_shape,
+                   "value": self._learning_rate})
+
+        # create the optimize op
+        sgd_op = block.append_op(
+            type=self.type,
+            inputs={
+                "Param": param_and_grad[0],
+                "Grad": param_and_grad[1],
+                "LearningRate": lr
+            },
+            outputs={"ParamOut": param_and_grad[0]},
+            attrs={"shape": [1],
+                   "value": self._learning_rate})
+
+        return sgd_op
diff --git a/python/paddle/v2/framework/tests/test_cross_entropy_op.py b/python/paddle/v2/framework/tests/test_cross_entropy_op.py
index 4ea14da7fd3d84870965d62514d6a79b4926a6ec..919b6c3f6745a9c6115e7af857c1a30354305f89 100644
--- a/python/paddle/v2/framework/tests/test_cross_entropy_op.py
+++ b/python/paddle/v2/framework/tests/test_cross_entropy_op.py
@@ -49,7 +49,7 @@ class TestCrossEntropyOp2(OpTest):
 
         self.inputs = {"X": X, "Label": label}
         self.outputs = {"Y": cross_entropy}
-        self.attrs = {"softLabel": True}
+        self.attrs = {"soft_label": True}
 
     def test_check_output(self):
         self.check_output()
@@ -82,7 +82,7 @@ class TestCrossEntropyOp3(OpTest):
 
         self.inputs = {"X": X, "Label": label.astype(np.float32)}
         self.outputs = {"Y": cross_entropy}
-        self.attrs = {"softLabel": True}
+        self.attrs = {"soft_label": True}
 
     def test_check_output(self):
         self.check_output()
diff --git a/python/paddle/v2/framework/tests/test_feed_fetch_method.py b/python/paddle/v2/framework/tests/test_feed_fetch_method.py
new file mode 100644
index 0000000000000000000000000000000000000000..47eedddcb6f47927ea3918d7f6c379c5710592c6
--- /dev/null
+++ b/python/paddle/v2/framework/tests/test_feed_fetch_method.py
@@ -0,0 +1,30 @@
+import paddle.v2.framework.core as core
+import unittest
+import numpy as np
+
+
+class TestFeedFetch(unittest.TestCase):
+    def test_feed_fetch(self):
+        place = core.CPUPlace()
+        input_array = np.ones((4, 4, 6)).astype("float32")
+        input_array[0, 0, 0] = 3
+        input_array[3, 3, 5] = 10
+        input_tensor = core.LoDTensor([[0, 2, 4]])
+        input_tensor.set(input_array, place)
+
+        core.set_feed_variable_float(input_tensor, "feed", 0)
+
+        output_tensor = core.get_fetch_variable("feed", 0)
+
+        output_lod = output_tensor.lod()
+        self.assertEqual(0, output_lod[0][0])
+        self.assertEqual(2, output_lod[0][1])
+        self.assertEqual(4, output_lod[0][2])
+
+        output_array = np.array(output_tensor)
+        self.assertEqual(3, output_array[0, 0, 0])
+        self.assertEqual(10, output_array[3, 3, 5])
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/python/paddle/v2/framework/tests/test_infer_shape.py b/python/paddle/v2/framework/tests/test_infer_shape.py
index 19bb45acef9a7443a974bf5f11afab5d067321f7..5cfb9e6687f733353cfdbfbd1ad830c2bed8463b 100644
--- a/python/paddle/v2/framework/tests/test_infer_shape.py
+++ b/python/paddle/v2/framework/tests/test_infer_shape.py
@@ -5,7 +5,7 @@ import paddle.v2.framework.core as core
 
 class TestInferShape(unittest.TestCase):
     def test_sum_op(self):
-        prog = core.ProgramDesc.__create_program_desc__()
+        prog = core.ProgramDesc()
         self.assertIsNotNone(prog)
         block = prog.block(0)
         self.assertIsNotNone(block)
@@ -33,7 +33,7 @@ class TestInferShape(unittest.TestCase):
         self.assertEqual(out.shape(), shape)
 
     def test_mul_op(self):
-        prog = core.ProgramDesc.__create_program_desc__()
+        prog = core.ProgramDesc()
         self.assertIsNotNone(prog)
         block = prog.block(0)
         self.assertIsNotNone(block)
diff --git a/python/paddle/v2/framework/tests/test_layers.py b/python/paddle/v2/framework/tests/test_layers.py
new file mode 100644
index 0000000000000000000000000000000000000000..dbbb6535389e2336d156734cc672e0cc7bba175c
--- /dev/null
+++ b/python/paddle/v2/framework/tests/test_layers.py
@@ -0,0 +1,64 @@
+import paddle.v2.framework.layers as layers
+from paddle.v2.framework.framework import Program, g_program
+import paddle.v2.framework.core as core
+import unittest
+
+
+class TestBook(unittest.TestCase):
+    def test_fit_a_line(self):
+        program = Program()
+        x = layers.data(
+            name='x', shape=[13], data_type='float32', program=program)
+        y_predict = layers.fc(input=x, size=1, act=None, program=program)
+
+        y = layers.data(
+            name='y', shape=[1], data_type='float32', program=program)
+        cost = layers.square_error_cost(
+            input=y_predict, label=y, program=program)
+
+        avg_cost = layers.mean(x=cost, program=program)
+        self.assertIsNotNone(avg_cost)
+        program.append_backward(avg_cost, set())
+        print str(program)
+
+    def test_recognize_digits_mlp(self):
+        program = Program()
+
+        # Change g_program, so the rest layers use `g_program`
+        images = layers.data(
+            name='pixel', shape=[784], data_type='float32', program=program)
+        label = layers.data(
+            name='label', shape=[1], data_type='int32', program=program)
+        hidden1 = layers.fc(input=images, size=128, act='relu', program=program)
+        hidden2 = layers.fc(input=hidden1, size=64, act='relu', program=program)
+        predict = layers.fc(input=hidden2,
+                            size=10,
+                            act='softmax',
+                            program=program)
+        cost = layers.cross_entropy(input=predict, label=label, program=program)
+        avg_cost = layers.mean(x=cost, program=program)
+        self.assertIsNotNone(avg_cost)
+        # print str(program)
+
+    def test_simple_conv2d(self):
+        pd = core.ProgramDesc.__create_program_desc__()
+        program = Program(desc=pd)
+        images = data_layer(
+            name='pixel', shape=[3, 48, 48], data_type='int32', program=program)
+        conv2d_layer(
+            input=images, num_filters=3, filter_size=[4, 4], program=program)
+
+        # print str(program)
+
+    def test_simple_conv2d(self):
+        program = Program()
+        images = layers.data(
+            name='pixel', shape=[3, 48, 48], data_type='int32', program=program)
+        layers.conv2d(
+            input=images, num_filters=3, filter_size=[4, 4], program=program)
+
+        print str(program)
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/python/paddle/v2/framework/tests/test_matmul_op.py b/python/paddle/v2/framework/tests/test_matmul_op.py
new file mode 100644
index 0000000000000000000000000000000000000000..d51572c8ab7c44fa0c6e83e50b56f05780530c61
--- /dev/null
+++ b/python/paddle/v2/framework/tests/test_matmul_op.py
@@ -0,0 +1,119 @@
+import unittest
+import numpy as np
+from op_test import OpTest
+
+
+def generate_compatible_shapes(dim_X, dim_Y, transpose_X, transpose_Y):
+    BATCH_SIZE = 2
+    M = 3
+    N = 4
+    K = 5
+    if (dim_X == 1 and transpose_X) or (dim_Y == 1 and transpose_Y):
+        K = 1
+    if dim_X == 1:
+        if transpose_X:
+            shape_X = [M]
+        else:
+            shape_X = [K]
+    if dim_Y == 1:
+        if transpose_Y:
+            shape_Y = [N]
+        else:
+            shape_Y = [K]
+    if dim_X >= 2:
+        if transpose_X:
+            shape_X = [K, M]
+        else:
+            shape_X = [M, K]
+    if dim_X == 3:
+        shape_X = [BATCH_SIZE] + shape_X
+    if dim_Y >= 2:
+        if transpose_Y:
+            shape_Y = [N, K]
+        else:
+            shape_Y = [K, N]
+    if dim_Y == 3:
+        shape_Y = [BATCH_SIZE] + shape_Y
+    return shape_X, shape_Y
+
+
+def reference_matmul(X, Y, transpose_X=False, transpose_Y=False):
+    """Reference forward implementation using np.matmul."""
+    # np.matmul does not support the transpose flags, so we manually
+    # transpose X and Y appropriately.
+    if transpose_X:
+        if X.ndim == 1:
+            X = X.reshape((X.size, 1))
+        elif X.ndim == 2:
+            X = X.T
+        elif X.ndim == 3:
+            X = np.transpose(X, (0, 2, 1))
+        else:
+            raise ValueError('X must have between 1 and 3 dimensions')
+    if transpose_Y:
+        if Y.ndim == 1:
+            Y = Y.reshape((1, Y.size))
+        elif Y.ndim == 2:
+            Y = Y.T
+        elif Y.ndim == 3:
+            Y = np.transpose(Y, (0, 2, 1))
+        else:
+            raise ValueError('Y must have between 1 and 3 dimensions')
+    Out = np.matmul(X, Y)
+    if not Out.shape:
+        # We do not support 0-dimensional Tensors (scalars). So where
+        # np.matmul outputs a scalar, we must convert to a Tensor of
+        # shape (1, ) instead.
+        # Everywhere else, we are compatible with np.matmul.
+        Out = np.array([Out], dtype="float32")
+    return Out
+
+
+class Generator(object):
+    def setUp(self):
+        self.op_type = "matmul"
+        X = np.random.random(self.shape_X).astype("float32")
+        Y = np.random.random(self.shape_Y).astype("float32")
+        Out = reference_matmul(X, Y, self.transpose_X, self.transpose_Y)
+        self.inputs = {'X': X, 'Y': Y}
+        self.attrs = {
+            'transpose_X': self.transpose_X,
+            'transpose_Y': self.transpose_Y
+        }
+        self.outputs = {'Out': Out}
+
+    def test_check_output(self):
+        self.check_output(atol=1e-2)
+
+    def test_check_grad_normal(self):
+        self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
+
+    def test_check_grad_ignore_x(self):
+        self.check_grad(
+            ['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
+
+    def test_check_grad_ignore_y(self):
+        self.check_grad(
+            ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
+
+
+# Generate test cases for all possibilities
+for dim_X in [1, 2, 3]:
+    for dim_Y in [1, 2, 3]:
+        for transpose_X in [False, True]:
+            for transpose_Y in [False, True]:
+                test_name = (
+                    'TestMatMulOp_dimX_{}_dim_Y_{}_transX_{}_transY_{}'.format(
+                        dim_X, dim_Y, transpose_X, transpose_Y))
+                shape_X, shape_Y = generate_compatible_shapes(
+                    dim_X, dim_Y, transpose_X, transpose_Y)
+                test_class = type(test_name, (Generator, OpTest), {
+                    'shape_X': shape_X,
+                    'shape_Y': shape_Y,
+                    'transpose_X': transpose_X,
+                    'transpose_Y': transpose_Y,
+                })
+                globals()[test_name] = test_class
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/python/paddle/v2/framework/tests/test_momentum_op.py b/python/paddle/v2/framework/tests/test_momentum_op.py
new file mode 100644
index 0000000000000000000000000000000000000000..d3353ff6e4f4da32eaefdd4e816a621ddac8bece
--- /dev/null
+++ b/python/paddle/v2/framework/tests/test_momentum_op.py
@@ -0,0 +1,35 @@
+import unittest
+import numpy as np
+from op_test import OpTest
+
+
+class TestMomentumOp(OpTest):
+    def setUp(self):
+        self.op_type = "momentum"
+
+        param = np.random.random((123, 321)).astype("float32")
+        grad = np.random.random((123, 321)).astype("float32")
+        velocity = np.zeros((123, 321)).astype("float32")
+        learning_rate = np.array([0.001]).astype("float32")
+        mu = 0.0001
+
+        self.inputs = {
+            'Param': param,
+            'Grad': grad,
+            'Velocity': velocity,
+            'LearningRate': learning_rate
+        }
+
+        self.attrs = {'mu': mu}
+
+        velocity_out = mu * velocity + grad
+        param_out = param - learning_rate * velocity_out
+
+        self.outputs = {'ParamOut': param_out, 'VelocityOut': velocity_out}
+
+    def test_check_output(self):
+        self.check_output()
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/python/paddle/v2/framework/tests/test_operator_desc.py b/python/paddle/v2/framework/tests/test_operator_desc.py
index dfe39c98f7f4fe266d5ec0c4a9ed14ab02e40e3a..af4e980b8ed6db6cb9b76de49d8dc0860f07ec80 100644
--- a/python/paddle/v2/framework/tests/test_operator_desc.py
+++ b/python/paddle/v2/framework/tests/test_operator_desc.py
@@ -16,7 +16,7 @@ class TestOperator(unittest.TestCase):
         try:
             block.append_op(type="no_such_op")
             self.assertFail()
-        except AssertionError as a_err:
+        except ValueError as a_err:
             self.assertEqual(a_err.message,
                              "Operator \"no_such_op\" has not been registered.")
 
diff --git a/python/paddle/v2/framework/tests/test_optimizer.py b/python/paddle/v2/framework/tests/test_optimizer.py
new file mode 100644
index 0000000000000000000000000000000000000000..3d6fa70737bf360df53785dc602feceda471ee70
--- /dev/null
+++ b/python/paddle/v2/framework/tests/test_optimizer.py
@@ -0,0 +1,31 @@
+import unittest
+
+import paddle.v2.framework.framework as framework
+import paddle.v2.framework.optimizer as optimizer
+
+
+class TestOptimizer(unittest.TestCase):
+    def test_sgd_optimizer(self):
+        program = framework.g_program
+        block = program.global_block()
+        mul_x = block.create_parameter(
+            dtype="float32", shape=[5, 10], lod_level=0, name="mul.x")
+        mul_y = block.create_var(
+            dtype="float32", shape=[10, 8], lod_level=0, name="mul.y")
+        mul_out = block.create_var(
+            dtype="float32", shape=[5, 8], lod_level=0, name="mul.out")
+        mul_op = block.append_op(
+            type="mul",
+            inputs={"X": mul_x,
+                    "Y": mul_y},
+            outputs={"Out": mul_out},
+            attrs={"x_num_col_dims": 1})
+        sgd_optimizer = optimizer.SGDOptimizer(learning_rate=0.01)
+        opts = sgd_optimizer.minimize(mul_out)
+        self.assertEqual(len(opts), 1)
+        sgd_op = opts[0]
+        self.assertEqual(sgd_op.type, "sgd")
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/python/paddle/v2/framework/tests/test_pool2d_op.py b/python/paddle/v2/framework/tests/test_pool2d_op.py
index 2941fda81b23998072810d8c6f6597a6f3db7e30..3fcd8941d4f8a8638db0009b368734c234e702f6 100644
--- a/python/paddle/v2/framework/tests/test_pool2d_op.py
+++ b/python/paddle/v2/framework/tests/test_pool2d_op.py
@@ -56,8 +56,8 @@ class TestPool2d_Op(OpTest):
             'strides': self.strides,
             'paddings': self.paddings,
             'ksize': self.ksize,
-            'poolingType': self.pool_type,
-            'globalPooling': self.global_pool,
+            'pooling_type': self.pool_type,
+            'global_pooling': self.global_pool,
         }
 
         self.outputs = {'Out': output}
diff --git a/python/paddle/v2/framework/tests/test_pool3d_op.py b/python/paddle/v2/framework/tests/test_pool3d_op.py
index 8792b492e3da6541f71185be82b8bfc4f52d821d..f4e938041fa0ae9d0760023afdbf2f3052b244ea 100644
--- a/python/paddle/v2/framework/tests/test_pool3d_op.py
+++ b/python/paddle/v2/framework/tests/test_pool3d_op.py
@@ -64,8 +64,8 @@ class TestPool3d_Op(OpTest):
             'strides': self.strides,
             'paddings': self.paddings,
             'ksize': self.ksize,
-            'poolingType': self.pool_type,
-            'globalPooling': self.global_pool,
+            'pooling_type': self.pool_type,
+            'global_pooling': self.global_pool,
         }
 
         self.outputs = {'Out': output}
diff --git a/python/paddle/v2/framework/tests/test_pool_max_op.py b/python/paddle/v2/framework/tests/test_pool_max_op.py
index f0f8aa6089c74d31702a6a5d37362099205d96b2..b78f9bba05c5af38806f6cabb0e53379f8aa0526 100644
--- a/python/paddle/v2/framework/tests/test_pool_max_op.py
+++ b/python/paddle/v2/framework/tests/test_pool_max_op.py
@@ -86,7 +86,7 @@ class TestMaxPoolWithIndex_Op(OpTest):
             'strides': self.strides,
             'paddings': self.paddings,
             'ksize': self.ksize,
-            'globalPooling': self.global_pool,
+            'global_pooling': self.global_pool,
         }
 
         self.inputs = {'X': input}
diff --git a/python/paddle/v2/framework/tests/test_program.py b/python/paddle/v2/framework/tests/test_program.py
index d06f86c09fe4edf8364e7d124cb7b8b1ae6bcc64..c98dc3492b9506f5a1639d26e3f0f1c2c9ad75d7 100644
--- a/python/paddle/v2/framework/tests/test_program.py
+++ b/python/paddle/v2/framework/tests/test_program.py
@@ -34,49 +34,11 @@ class TestProgram(unittest.TestCase):
         self.assertEqual(1, b.idx)
         self.assertEqual(0, b.parent_idx)
 
-    def test_desc_append_backward(self):
-        prog = core.ProgramDesc.__create_program_desc__()
-        self.assertIsNotNone(prog)
-        block = prog.block(0)
-        self.assertIsNotNone(block)
-
-        mul_op_desc = block.append_op()
-        mul_op_desc.set_type("mul")
-        mul_op_desc.set_input("X", ["x1"])
-        mul_op_desc.set_input("Y", ["y1"])
-        mul_op_desc.set_output("Out", ["out1"])
-
-        sum_op_desc = block.append_op()
-        sum_op_desc.set_type("elementwise_add")
-        sum_op_desc.set_input("X", ["out1"])
-        sum_op_desc.set_input("Y", ["b1"])
-        sum_op_desc.set_output("Out", ["out2"])
-
-        target = block.var("out2")
-
-        expect_ops = [
-            "mul", "elementwise_add", "fill_constant", "elementwise_add_grad",
-            "mul_grad"
-        ]
-
-        def grad_name(name):
-            return name + "@GRAD"
-
-        actual_ops = []
-        param_to_grad = prog.append_backward(target, set())
-        for var_name in ("x1", "y1", "out1", "b1"):
-            self.assertEqual(param_to_grad[var_name][0], grad_name(var_name))
-            self.assertEqual(param_to_grad[var_name][1], 0)
-
-        for op in block.all_ops():
-            actual_ops.append(op.type())
-        self.assertEqual(actual_ops, expect_ops)
-
     def test_append_backward(self):
         prog = Program.instance()
         block = prog.global_block()
 
-        mul_x = block.create_parameter(
+        mul_x = block.create_var(
             dtype="float32", shape=[5, 10], lod_level=0, name="mul.x")
         mul_y = block.create_var(
             dtype="float32", shape=[10, 8], lod_level=0, name="mul.y")
@@ -88,7 +50,35 @@ class TestProgram(unittest.TestCase):
                     "Y": mul_y},
             outputs={"Out": [mul_out]},
             attrs={"x_num_col_dims": 1})
-        param_to_grad = prog.append_backward(mul_out, set())
+
+        add_y = block.create_var(
+            dtype="float32", shape=[5, 8], lod_level=0, name="add.y")
+        add_out = block.create_var(
+            dtype="float32", shape=[5, 8], lod_level=0, name="add.out")
+        add_op = block.append_op(
+            type="elementwise_add",
+            inputs={"X": mul_out,
+                    "Y": add_y},
+            outputs={"Out": add_out},
+            attrs={"x_num_col_dims": 1})
+
+        param_to_grad = prog.append_backward(add_out, set())
+
+        def grad_name(name):
+            return name + "@GRAD"
+
+        for var_name in ("mul.x", "mul.y", "mul.out", "add.y", "add.out"):
+            self.assertEqual(param_to_grad[var_name][0], grad_name(var_name))
+            self.assertEqual(param_to_grad[var_name][1], 0)
+
+        expect_ops = [
+            "mul", "elementwise_add", "fill_constant", "elementwise_add_grad",
+            "mul_grad"
+        ]
+        actual_ops = []
+        for op in block.ops:
+            actual_ops.append(op.type)
+        self.assertEqual(actual_ops, expect_ops)
 
 
 if __name__ == '__main__':
diff --git a/python/paddle/v2/framework/tests/test_protobuf_descs.py b/python/paddle/v2/framework/tests/test_protobuf_descs.py
index c775b1a398dabb096845b4a8730152c682b2f0dd..2fd3d5d165ada5026510e0dc3e2c55b6e0596ff3 100644
--- a/python/paddle/v2/framework/tests/test_protobuf_descs.py
+++ b/python/paddle/v2/framework/tests/test_protobuf_descs.py
@@ -4,7 +4,7 @@ import paddle.v2.framework.core as core
 
 class TestOpDesc(unittest.TestCase):
     def test_op_desc(self):
-        prog = core.ProgramDesc.__create_program_desc__()
+        prog = core.ProgramDesc()
         self.assertIsNotNone(prog)
         block = prog.block(0)
         self.assertIsNotNone(block)
@@ -64,16 +64,16 @@ class TestOpDesc(unittest.TestCase):
 
 class TestProgramDesc(unittest.TestCase):
     def test_instance(self):
-        program_desc = core.ProgramDesc.__create_program_desc__()
+        program_desc = core.ProgramDesc()
         self.assertIsNotNone(program_desc)
         del program_desc
-        program_desc = core.ProgramDesc.instance()
+        program_desc = core.ProgramDesc()
         self.assertIsNotNone(program_desc)
         self.assertIsNotNone(program_desc.block(0))
         del program_desc
 
     def test_append_block(self):
-        prog_desc = core.ProgramDesc.__create_program_desc__()
+        prog_desc = core.ProgramDesc()
         self.assertIsNotNone(prog_desc)
         block_root = prog_desc.block(0)
         self.assertIsNotNone(block_root)
@@ -91,7 +91,7 @@ class TestProgramDesc(unittest.TestCase):
 
 class TestVarDesc(unittest.TestCase):
     def test_shape(self):
-        program_desc = core.ProgramDesc.__create_program_desc__()
+        program_desc = core.ProgramDesc()
         block = program_desc.block(0)
         var = block.var('my_var')
         var.set_type(core.VarDesc.VarType.SELECTED_ROWS)
@@ -102,7 +102,7 @@ class TestVarDesc(unittest.TestCase):
         self.assertEqual(core.VarDesc.VarType.SELECTED_ROWS, var.type())
 
     def test_data_type(self):
-        program_desc = core.ProgramDesc.__create_program_desc__()
+        program_desc = core.ProgramDesc()
         block = program_desc.block(0)
         var = block.var('my_var')
         var.set_type(core.VarDesc.VarType.LOD_TENSOR)
@@ -113,7 +113,7 @@ class TestVarDesc(unittest.TestCase):
 
 class TestBlockDesc(unittest.TestCase):
     def test_add_var(self):
-        prog = core.ProgramDesc.__create_program_desc__()
+        prog = core.ProgramDesc()
         self.assertIsNotNone(prog)
         block = prog.block(0)
         self.assertIsNotNone(block)
@@ -121,19 +121,21 @@ class TestBlockDesc(unittest.TestCase):
         var2 = block.var("var2")
         var3 = block.var("var3")
         all_vars = block.all_vars()
-        self.assertEqual(set(all_vars), set([var1, var2, var3]))
+        self.assertEqual(set(all_vars), {var1, var2, var3})
         var2_re = block.find_var("var2")
         self.assertEqual(var2_re, var2)
 
     def test_add_op(self):
-        prog = core.ProgramDesc.__create_program_desc__()
+        prog = core.ProgramDesc()
         self.assertIsNotNone(prog)
         block = prog.block(0)
         self.assertIsNotNone(block)
         op1 = block.append_op()
         op2 = block.append_op()
         op0 = block.prepend_op()
-        all_ops = block.all_ops()
+        all_ops = []
+        for idx in xrange(0, block.op_size()):
+            all_ops.append(block.op(idx))
         self.assertEqual(all_ops, [op0, op1, op2])
 
 
diff --git a/python/paddle/v2/framework/tests/test_proximal_gd_op.py b/python/paddle/v2/framework/tests/test_proximal_gd_op.py
new file mode 100644
index 0000000000000000000000000000000000000000..9ca79ce6b3b710244e4f65db70b305231a9f3fcf
--- /dev/null
+++ b/python/paddle/v2/framework/tests/test_proximal_gd_op.py
@@ -0,0 +1,33 @@
+import unittest
+import numpy as np
+from op_test import OpTest
+
+
+class TestProximalGDOp(OpTest):
+    def setUp(self):
+        self.op_type = "proximal_gd"
+        w = np.random.random((102, 105)).astype("float32")
+        g = np.random.random((102, 105)).astype("float32")
+        lr = np.array([0.1]).astype("float32")
+        l1 = 0.1
+        l2 = 0.2
+
+        self.inputs = {'Param': w, 'Grad': g, 'LearningRate': lr}
+        self.attrs = {'l1': l1, 'l2': l2}
+        prox_param = w - lr * g
+        param_out = 0.0
+        if l1 > 0.0:
+            x = np.abs(prox_param) - lr * l1
+            x[x < 0] = 0
+            param_out = np.sign(prox_param) * (x / (1.0 + lr * l2))
+        else:
+            param_out = prox_param / (1.0 + lr * l2)
+
+        self.outputs = {'ParamOut': param_out}
+
+    def test_check_output(self):
+        self.check_output()
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/python/paddle/v2/framework/tests/test_selected_rows.py b/python/paddle/v2/framework/tests/test_selected_rows.py
new file mode 100644
index 0000000000000000000000000000000000000000..661e81817951f5605ba3ca7fb0cc667074b1e37c
--- /dev/null
+++ b/python/paddle/v2/framework/tests/test_selected_rows.py
@@ -0,0 +1,37 @@
+import paddle.v2.framework.core as core
+import unittest
+import numpy as np
+
+
+class TestSelectedRows(unittest.TestCase):
+    def test_selected_rows(self):
+        place = core.CPUPlace()
+        height = 10
+        rows = [0, 4, 7]
+        row_numel = 10
+        selcted_rows = core.SelectedRows(rows, row_numel)
+        np_array = np.ones((len(rows), height)).astype("float32")
+        np_array[0, 0] = 2.0
+        np_array[2, 8] = 4.0
+        tensor = selcted_rows.get_tensor()
+        tensor.set(np_array, place)
+
+        # compare rows
+        self.assertEqual(0, selcted_rows.rows()[0])
+        self.assertEqual(4, selcted_rows.rows()[1])
+        self.assertEqual(7, selcted_rows.rows()[2])
+
+        # compare height
+        self.assertEqual(10, selcted_rows.height())
+
+        # compare tensor
+        self.assertAlmostEqual(2.0,
+                               selcted_rows.get_tensor().get_float_element(0))
+        self.assertAlmostEqual(1.0,
+                               selcted_rows.get_tensor().get_float_element(1))
+        self.assertAlmostEqual(
+            4.0, selcted_rows.get_tensor().get_float_element(2 * row_numel + 8))
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/python/paddle/v2/framework/tests/test_softmax_with_cross_entropy_op.py b/python/paddle/v2/framework/tests/test_softmax_with_cross_entropy_op.py
index 377d07fb5927a108e9bd39ab227da4f40a9cd447..05ba954c0b8655b92b12f9cc686ef048c4d84bbc 100644
--- a/python/paddle/v2/framework/tests/test_softmax_with_cross_entropy_op.py
+++ b/python/paddle/v2/framework/tests/test_softmax_with_cross_entropy_op.py
@@ -57,7 +57,7 @@ class TestSoftmaxWithCrossEntropyOp2(OpTest):
 
         self.inputs = {"Logits": logits, "Label": labels}
         self.outputs = {"Softmax": softmax, "Loss": cross_entropy}
-        self.attrs = {"softLabel": True}
+        self.attrs = {"soft_label": True}
 
     def test_check_output(self):
         self.check_output()