Merge branch 'develop' into crf

c8d0d37c · caoying03 · 80a5ee00 · c91de280 · c8d0d37c · c8d0d37c
187 changed file
--- a/Dockerfile
+++ b/Dockerfile
@@ -22,7 +22,7 @@ COPY ./paddle/scripts/docker/root/ /root/

 RUN apt-get update && \
    apt-get install -y \
-    git python-pip python-dev openssh-server bison  \
+    git python-pip python-dev openssh-server bison libnccl-dev \
    wget unzip unrar tar xz-utils bzip2 gzip coreutils ntp \
    curl sed grep graphviz libjpeg-dev zlib1g-dev  \
    python-matplotlib gcc-4.8 g++-4.8 \

--- a/doc/api/v2/config/networks.rst
+++ b/doc/api/v2/config/networks.rst
@@ -125,3 +125,8 @@ simple_attention
    :members: simple_attention
    :noindex:

+dot_product_attention
+---------------------
+..  automodule:: paddle.v2.networks
+    :members: dot_product_attention
+    :noindex:
--- a/doc/design/block.md
+++ b/doc/design/block.md
@@ -189,7 +189,7 @@ OpDesc {
  inputs = {0} // the index of x in vars of BlockDesc above
  outputs = {5, 3} // indices of act and hidden_out in vars of BlockDesc above
  attrs {
-    "memories" : {1} // the index of h
+    "states" : {1} // the index of h
    "step_net" : <above step net>
  }
 };

--- a/doc/design/cluster_train/src/trainer.graffle
+++ b/doc/design/cluster_train/src/trainer.graffle
--- a/doc/design/images/feed_forward.png
+++ b/doc/design/images/feed_forward.png
--- a/doc/design/images/feed_forward_regularized.png
+++ b/doc/design/images/feed_forward_regularized.png
--- a/doc/design/images/l1_regularization.png
+++ b/doc/design/images/l1_regularization.png
--- a/doc/design/images/l2_regularization.png
+++ b/doc/design/images/l2_regularization.png
--- a/doc/design/images/loss_equation.png
+++ b/doc/design/images/loss_equation.png
--- a/doc/design/prune.md
+++ b/doc/design/prune.md
+# Prune
+
+## Motivation
+
+We want to support running inference, training and checkpointing in one `ProgramDesc`. We implement 
+`void Prune(const ProgramDesc* input, ProgramDesc* output)` function, which takes a `ProgramDesc`
+and generate a pruned `ProgramDesc`.
+
+## Challenge
+
+Pruning need to support both variables and operators being evaluation targets. Consider the following
+different situations.
+
+```python
+# Case 1: run foward pass.
+cost_np = session.run(target=cost)
+# Case 2: run backward passing.
+opts_np, _ = session.run(target=[cost, opt])
+# Case 3: run checkpointing
+_ = session.run(target=checkpoint)
+```
+
+## Solution
+
+To support evaluation of operators, we add `is_target` field in the `OpDesc`.
+
+```c++
+message OpDesc {
+  required string type = 3;
+  repeated Var inputs = 1;
+  repeated Var outputs = 2;
+  repeated Attr attrs = 4;
+  optional bool is_target = 5 [ default = false ];
+};
+```
+
+To support evaluation of variables, we add [fetch_op](https://github.com/PaddlePaddle/Paddle/pull/4599).
+For each variable in the `target`, we insert a `fetch_op` into the `ProgramDesc` with `variable` being
+`fetch_op`'s input. Then we also set `fetch_op` is a target.
+
+### Algorithm
+
+If an operator needs to be run, it must fall into one of the following cases:
+
+1. It is the target.
+2. It is depended by some other ops, meaning its output is some other op's input.
+
+The first case can be checked by `op_desc.is_traget()` . The second case can be implement as
+
+```c++
+bool HasDependentVar(const OpDesc& op_desc, const std::set<string>& dependent_vars) {
+  for (auto& var : op_desc.outputs()) {
+    for (auto& argu : var.arguments()) {
+      if (dependent_vars.count(argu) != 0) {
+        return true;
+      }
+    }
+  }
+  return false;
+}
+```
+
+Then the whole algorithm can be implemented as the following [code](https://github.com/tonyyang-svail/Paddle/blob/prune_impl/paddle/framework/prune.cc).
--- a/doc/design/refactorization.md
+++ b/doc/design/refactorization.md
@@ -177,9 +177,6 @@ REGISTER_OP(op_type, op_class, op_maker_class, grad_op_type, grad_op_class)
 REGISTER_OP_WITHOUT_GRADIENT(op_type, op_class, op_maker_class)
 ```

-### USE Macros
-Make sure the registration process is executed and linked.
-
 ---
 # Registration Process
 1. Write an Op class and its gradient Op class, if required.
@@ -188,8 +185,6 @@ Make sure the registration process is executed and linked.
 	1. Call maker class to complete `proto` and `checker`
 	2. Using the completed `proto` and `checker`, it will add a new key-value pair to the `OpInfoMap`

-4. Invoke the `USE` macro in which the Op is used to make sure that it is linked.
-
 ---
 # Backward Module (1/2)
 ### Create Backward Operator

--- a/doc/design/register_grad_op.md
+++ b/doc/design/register_grad_op.md
@@ -3,17 +3,17 @@

 ## The Problem Posed

-Currently, for each C++ operator class definition, there registers a *gradient operator creator* function, which takes a C++ operator instance and returns the corresponding gradient operator instance.
+Currently, for each C++ operator class definition, a *gradient operator creator* function is registered, which takes as input a C++ operator instance and returns the corresponding gradient operator instance.

-However, we noticed two problems with the current deisgn:
+However, we noticed two problems with the current design:

-1. As we decided to separate the *compilation* and *execution* phases, we need to change the creator to take an `OpDesc` protobuf message in a `ProgramDesc` and inserts corresponding `OpDesc` messages into the `ProgramDesc` message.
+1. As we decided to separate the *compilation* and the *execution* phases, we need to change the creator to take an `OpDesc` protobuf message in a `ProgramDesc` and inserts corresponding `OpDesc` messages into the `ProgramDesc` message.

-1. Some operator's gradient computation requires more than one gradient operators.  For example, the gradient of *minus* consists of two operators -- an identity operaotr and a scale operator.  So we need to make the registration mechanism to support the mapping from an operator to a set of operators for gradient computation.
+1. For some operators, the gradient computation can be written in terms of existing operators.  For example, the gradient of *minus* operator consists of two operators -- an *identity* operator followed by a *scale* operator.  Hence the registration mechanism needs to support mapping from an operator to a set of operators for the gradient computation.

 ## The Current Implementation

-The C++ class `OpInfos` store in a association map which key is the operator type. The `grad_op_type` indicate associated gradient operator type. Operator can create gradient operator by `OpInfo::creator_` of gradient. The pseudo code is
+Instances of the C++ class `OpInfo` are stored an associative map whose key is the operator type. The `grad_op_type` indicates the associated gradient operator type. An operator can create the gradient operator by invoking `OpInfo::creator_` of the gradient operator. The pseudo code is as follows

 ```cpp
 struct OpInfo {
@@ -31,16 +31,16 @@ OperatorBase* CreateGradientOperator(const OperatorBase& op) {

 ## Proposed Solution

-The mapping relationship between an operator and its gradient operators is a function. The interface of that function is:
+The mapping relationship between an operator and its gradient operators is a function. The interface of this function is:

 ```cpp
 // (OpDesc) --> vector<OpDesc>
 std::function<std::vector<OpDescBind>(const OpDescBind&)>;
 ```

-The function takes an `OpDescBind` of the forward operator and returns one or many gradient operator descriptions. `OpDescBind` is a C++ wrapper for protobuf message `OpDesc` to manipulate `OpDesc` fast.
+The function takes an `OpDescBind` of the forward operator and returns one or many gradient operator descriptions. `OpDescBind` is a C++ wrapper for  the protobuf message `OpDesc` for rapid manipulation of `OpDesc`.

-The `GradOpDescMaker` will be registered in `OpInfo`, to replace `grad_op_type_` field. The `OpInfo` should be
+The `GradOpDescMaker` will be registered in `OpInfo` and will replace the `grad_op_type_` field. The `OpInfo` should look like 

 ```cpp
 struct OpInfo {
@@ -49,7 +49,7 @@ struct OpInfo {
 };
 ```

-The `grad_op_maker_ ` is `nullptr` if the operator does not have associated gradient operators.
+The `grad_op_maker_ ` is a `nullptr` if the operator does not have any associated gradient operators.

 We propose a base class called `GradOpDescMakerBase` to let operator developers generate `Gradient Operators` easily. The public interface of that class is

@@ -74,7 +74,7 @@ func = [] (const OpDescBind& fwd_op) {

 We can write many helper functions since the `GradOpDescMakerBase` is a class now. The basic helper functions get the variables of `Input`, `Output`, `InputGradient` and `OutputGradient` in the forwarding operator.

-We should chagne register macros at the same time. In the current solution, there is no difference between forwarding operators and backward operators. So `REGISTER_OP` just register one operator. If the `REGISTER_OPERATOR ` contains `OpProtoAndCheckerMaker` and `GradOpDescMaker`, we just list them in the same macro. It can be done by a macro contains `__VA_ARGS__`.
+We should change register macros at the same time. In the current solution, there is no difference between forwarding operators and backward operators. So `REGISTER_OP` just register one operator. If the `REGISTER_OPERATOR ` contains `OpProtoAndCheckerMaker` and `GradOpDescMaker`, we just list them in the same macro. It can be done by a macro contains `__VA_ARGS__`.

 The user interface should be


--- a/doc/design/regularization.md
+++ b/doc/design/regularization.md
+# Regularization in PaddlePaddle
+
+## Introduction to Regularization
+A central problem in machine learning is how to design an algorithm that will perform well not just on the training data, but also on new data. Many strategies are used by machine learning practitioners to reduce the test error, possibly at the expense of increased training error. These strategies are collectively known as **regularization**. 
+
+### Parameter Norm Penalties
+Most common regularization approaches in deep learning are based on limiting the capacity of the models by adding a parameter norm penalty to the objective function `J`. This is given as follows:
+
+<img src="./images/loss_equation.png" align="center"/><br/>
+
+The parameter `alpha` is a hyperparameter that weights the relative contribution of the norm penalty term, `omega`, relative to the standard objective function `J`.
+
+The most commonly used norm penalties are the L2 norm penalty and the L1 norm penalty. These are given as follows:
+
+##### L2 Regularization:
+<img src="./images/l2_regularization.png" align="center"/><br/>
+
+##### L1 Regularization
+<img src="./images/l1_regularization.png" align="center"/><br/>
+
+A much more detailed mathematical background of reguilarization can be found [here](http://www.deeplearningbook.org/contents/regularization.html).
+
+
+## How to do Regularization in PaddlePaddle
+
+On surveying existing frameworks like Tensorflow, PyTorch, Caffe, etc, it can be seen that there are 2 common approaches of doing regularization:
+
+1. Making regularization a part of the optimizer using an attribute like `weight_decay` that is used to control the scale of the L2 Penalty. This approach is used in PyTorch as follows:
+	```python
+	opt =  torch.optim.SGD(params, lr=0.2, weight_decay=0.2)
+	```
+    At every optimization step, this code will add the gradient of the L2 Norm of the params to the gradient of the params with respect to the loss function. This can seen in the following code snippet:
+    ```python
+    if weight_decay != 0:
+    	d_p.add_(weight_decay, p.data)
+    ```
+    This is a very restyrictive way of doing regularization and does not give the users enough flexibility. 
+    
+    **Advantages**:
+    -  It is easy to implement for us.
+    -  Faster execution of backward. However, it can be done manually by advanced users too.
+
+	**Disadvantages**:
+    - Not flexible for other regularizations such as L1/L0 regularization.
+    - Does not allow for different regularization coefficient for different parameters. For example, in most models, ony the weight matrices are regularized and the bias vectors are unregularized.
+    - Tightly coupled optimizer and regularization implementation. 
+
+
+2. Adding regularization ops to the graph through Python API. This approach is used by Tensorflow and Caffe. Using this approach, we manually add regularization ops to the graph and then add the regularization loss to the final loss function before sending them to the optimizer.
+
+	**Advantages**:
+    - Allows for greater flexibility to the users of Paddle. Using this approach, the users can put different regularization to different parameters and also choose parameters that are not a part of regularization.
+    - Makes it easy for the users to customize and extend the framework. 
+
+	**Disadvantages**:
+    - Implementation requires comprehensive design and time. 
+
+## Proposal for Regularization in PaddlePaddle
+
+### Low-Level implementation
+
+In the new design, we propose to create new operations for regularization. For now, we can add 2 ops thgat correspond to the most frequently used regularizations:
+- L2_regularization_op
+- L1_regularization_op
+
+These ops can be like any other ops with their own CPU/GPU implementations either using Eigen or separate Cpu and GPU kernels. As the initial implementation, we can implement their kernels using Eigen following the abstraction pattern implemented for [Activation Ops](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/accuracy_op.h). This abstraction pattern can make it very easy to implement new regularization schemes. other than L1 and L2 norm penalties. 
+
+The idea of building ops for regularization is in sync with the refactored Paddle philosophy of using operators to represent any computation unit. The way these ops will be added to the computation graph, will be decided by the [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) in Python API. 
+
+### Computation Graph
+
+Below is an example of a really simple feed forward neural network.
+
+<img src="./images/feed_forward.png" align="center"/><br/>
+
+The Python API will modify this computation graph to add regularization operators. The modified computation graph will look as follows:
+
+<img src="./images/feed_forward_regularized.png" align="center"/><br/>
+   
+### Python API implementation for Regularization
+
+Using the low level ops, `L2_regularization_op` and `L1_regularization_op`, any user can add regularization to their computation graphs. However, this will require a lot of lines of code and we should design Python APIs that support regularization. An example of such an API can be seen in [Keras](https://keras.io/regularizers/). As per the PaddlePaddle [Python API design](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md), the layer functions are responsible for creating operators, operator parameters and variables. Since regularization is a property of parameters, it makes sense to create these in the layer functions. 
+
+#### Creation of Regularization ops
+There are two possibilities for creating the regularization ops:
+1. We create these ops immediately while building the computation graph. 
+2. We add these ops in a lazy manner, just before the backward, similar to the way the optimization ops are added. 
+
+The proposal is to add these ops in a lazy manner just before the backward pass. 
+
+#### Storage of Regularization attributes
+
+Since we want to create the regularization ops in a lazy manner, the regularization attributes (type of regularization and weight of regularization penalty) can be stored as attributes of the [`Parameter`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/framework/framework.py#L421) class. This is because regularization is a property of the parameters and storing regularization properties with Parameters also allows for shared parameters. 
+
+#### High-level API
+
+In PaddlePaddle Python API, users will primarily rely on [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) to create neural network layers. Hence, we lso need to provide regularization functionality in layer functions. The design of these APIs can be postponed for later right now. A good reference for these APIs can be found in [Keras](https://keras.io/regularizers/) and also by looking at Tensorflow in [`tf.contrib.layers`](https://www.tensorflow.org/api_guides/python/contrib.layers).
+
+
+
+
+
+    
--- a/doc/design/selected_rows.md
+++ b/doc/design/selected_rows.md
 # Design Doc: Selected Rows

-`SelectedRows` is a kind of sparse tensor data type, which is designed to support `embedding` operators. The gradient of embedding table is a sparse tensor. Only a few rows are non-zero values in that tensor. It is straightforward to represent the sparse tensor by the following sparse tensor data structure:
+`SelectedRows` is a type of sparse tensor data type, which is designed to support `embedding` operators. The gradient of embedding table is a sparse tensor. Only a few rows are non-zero values in this tensor. It is straight-forward to represent a sparse tensor by the following sparse tensor data structure:

 ```cpp
 class SelectedRows {
@@ -11,7 +11,7 @@ class SelectedRows {
 };
 ```

-The field `height_` shows the first dimension of `SelectedRows`. The `rows` are the indices of which rows of `SelectedRows` are non-zeros. The `value_` field is an N-dim tensor and shape is `[rows.size() /* NUM_ROWS */, ...]`, which supplies values for each row. The dimension of `SelectedRows` satisfies `[height_] + value_.shape[1:]`.
+The field `height_` is the first dimension of `SelectedRows`. The `rows` are the indices of the non-zero rows of `SelectedRows`. The `value_` field is an N-dim tensor of shape `[rows.size() /* NUM_ROWS */, ...]`, which supplies values for each row. The dimension of `SelectedRows` satisfies `[height_] + value_.shape[1:]`.

 Suppose that a SelectedRows-typed variable `x` has many rows, but only two of them have values -- row 73 is `[1, 2]` and row 84 is `[3, 4]`, the `SelectedRows` representation would be:

@@ -25,7 +25,7 @@ x = SelectedRow {

 ## SelectedRows in Protobuf

-`SelectedRows` is a kind of `Variable`. `VarDesc` in protobuf should describe the `SelectedRows` information. Only the tensor dimension of a `SelectedRows` will be described in compile-time since the `rows_` and `value_` are related to training data. 
+`SelectedRows` is a type of `Variable`. `VarDesc` in protobuf should describe the `SelectedRows` information. Only the tensor dimension of a `SelectedRows` will be described in compile-time because the `rows_` and `value_` are dependent on the training data. 
 So we use `TensorDesc` to unify `data_type` and `dims`. A LodTensorDesc contains a `TensorDesc` and `lod_level`. The description of `SelectedRows` is a Tensor description.

 ```proto
@@ -54,7 +54,7 @@ message VarDesc {

 ## InferShape for Selected Rows

-Just like `LoD` information, `InferShape` method will inference output tensor type as well. The operator should decide whether its output is a `SelectedRows` or `Dense` tensor.
+Just like `LoD` information, `InferShape` method will infer the output tensor type as well. The operator should decide whether its output is a `SelectedRows` or `Dense` tensor.

 For example, the gradient operator of `TableLookup` will always generate `SelectedRows`. Its `InferShape` method should be like following

@@ -68,7 +68,7 @@ void TableLookupGrad::InferShape(context) {

 ## Sparse Operators

-There are several operators should be written to support `SelectedRows`. They are:
+There are several operators that need to be written to support `SelectedRows`. These are:

-1. Operators which generates `SelectedRows` gradient. e.g. Gradient of `TableLookupOp`.
+1. Operators which generate `SelectedRows` gradient. e.g. Gradient of `TableLookupOp`.
 2. Optimize operators which support `SelectedRows` gradient. e.g. `SGD` or `AdaGrad` for `SelectedRows`. However, there should be only one `SGD` operator. `OpWithKernel::Run` should select a suitable kernel for both `dense` tensor or `SelectedRows`.
--- a/doc/howto/cross_compiling/cross_compiling_for_android_cn.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_android_cn.md
 # 构建Android平台上的PaddlePaddle库

-用户可通过交叉编译的方式，在用户熟悉的开发平台（Linux，Mac OS X和Windows）上编译Android平台上适用的PaddlePaddle库。
+用户可通过如下两种方式，交叉编译Android平台上适用的PaddlePaddle库：
+- 基于Docker容器的编译方式
+- 基于Linux交叉编译环境的编译方式
+
+## 基于Docker容器的编译方式
+Docker能在所有主要操作系统（包括Linux，Mac OS X和Windows）上运行，因此，使用基于Docker容器的编译方式，用户可在自己熟悉的开发平台上编译Android平台上适用的PaddlePaddle库。
+
+### 构建PaddlePaddle的Android开发镜像
+我们把PaddlePaddle的交叉编译环境打包成一个镜像，称为开发镜像，里面涵盖了交叉编译Android版PaddlePaddle库需要的所有编译工具。
+
+```bash
+$ git clone https://github.com/PaddlePaddle/Paddle.git
+$ cd Paddle
+$ docker build -t username/paddle-android:dev . -f Dockerfile.android
+```
+
+### 编译PaddlePaddle C-API库
+构建好开发镜像后，即可使用开发镜像来编译Android版PaddlePaddle C-API库。
+Android的Docker开发镜像向用户提供两个可配置的参数：
+
+| Argument        | Optional Values         | Default |
+|-----------------|-------------------------|---------|
+|`ANDROID_ABI`    |`armeabi-v7a, arm64-v8a` | `armeabi-v7a` |
+|`ANDROID_API`    |`>= 21` | `21` |
+
+- 编译`armeabi-v7a`，`Android API 21`的PaddlePaddle库
+```bash
+$ docker run -it --rm -v $PWD:/paddle -e "ANDROID_ABI=armeabi-v7a" -e "ANDROID_API=21" username/paddle-android:dev
+```
+
+- 编译`arm64-v8a`，`Android API 21`的PaddlePaddle库
+```bash
+$ docker run -it --rm -v $PWD:/paddle -e "ANDROID_ABI=arm64-v8a" -e "ANDROID_API=21" username/paddle-android:dev
+```
+
+执行上述`docker run`命令时，容器默认执行[paddle/scripts/docker/build_android.sh](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build_android.sh)脚本。该脚本中记录了交叉编译Android版PaddlePaddle库常用的CMake配置，并且会根据`ANDROID_ABI`和`ANDROID_API`自动构建独立工具链、进行编译和安装。由于arm64架构要求Android API不小于21。因此当`ANDROID_ABI=arm64-v8a`，`ANDROID_API<21`时，Docker容器中将默认使用`Android API 21`的编译工具链。用户可以参考下文**配置交叉编译参数**章节，根据个人的需求修改定制Docker容器所执行的脚本。编译安装结束之后，PaddlePaddle的C-API库将被安装到`$PWD/install_android`目录，所依赖的第三方库同时也被安装到`$PWD/install_android/third_party`目录。
+
+## 基于Linux交叉编译环境的编译方式
 本文档将以Linux x86-64平台为例，介绍交叉编译Android平台上适用的PaddlePaddle库的方法和步骤。

-## 准备交叉编译环境
+### 准备交叉编译环境

 从源码交叉编译PaddlePaddle，用户需要提前准备好交叉编译环境。Android平台上使用的C/C++交叉编译工具链为[Android NDK](https://developer.android.com/ndk/downloads/index.html?hl=zh-cn)，用户可自行前往下载预编译好的版本，也可通过以下命令获取：

@@ -13,18 +50,27 @@ unzip -q android-ndk-r14b-linux-x86_64.zip
 ```

 Android NDK中包含了所有Android API级别、所有架构（arm/arm64/x86/mips）需要用到的编译工具和系统库。用户可根据自己的编译目标架构、所需支持的最低Android API级别，构建[独立工具链](https://developer.android.google.cn/ndk/guides/standalone_toolchain.html?hl=zh-cn)。
-比如：
+
+- 构建`armeabi-v7a`、 `Android API 21`的独立工具链：

 ```bash
 your/path/to/android-ndk-r14b-linux-x86_64/build/tools/make-standalone-toolchain.sh \
-        --arch=arm --platform=android-21 --install-dir=your/path/to/my_standalone_toolchain
+        --arch=arm --platform=android-21 --install-dir=your/path/to/arm_standalone_toolchain
 ```

-此命令将在your/path/to/my_standalone_toolchain目录生成一套编译工具链，面向架构为32位ARM架构，支持的最小的Android API级别为21，使用的编译器为arm-linux-androideabi-gcc (GCC) 4.9。
+此命令将在`your/path/to/arm_standalone_toolchain`目录生成一套独立编译工具链，面向架构为32位ARM架构，支持的最小的Android API级别为21，支持编译器`arm-linux-androideabi-gcc (GCC) 4.9`和`clang 3.8`。

-注意：**PaddlePaddle要求使用的编译工具链所支持的Andoid API级别不小于21**。
+- 构建`arm64-v8a`、 `Android API 21`的独立工具链：
+```bash
+your/path/to/android-ndk-r14b-linux-x86_64/build/tools/make-standalone-toolchain.sh \
+        --arch=arm64 --platform=android-21 --install-dir=your/path/to/arm64_standalone_toolchain
+```

-## 配置交叉编译参数
+此命令将在`your/path/to/arm64_standalone_toolchain`目录生成一套独立编译工具链，面向架构为64位ARM64架构，支持的最小Android API级别为21，支持编译器`arm-linux-androideabi-gcc (GCC) 4.9`和`clang 3.8`。
+
+注意：**PaddlePaddle要求使用的编译工具链所支持的Android API级别不小于21**。
+
+### 配置交叉编译参数

 CMake系统对交叉编译提供了支持[cmake-toolchains](https://cmake.org/cmake/help/v3.0/manual/cmake-toolchains.7.html#cross-compiling)。为了简化cmake配置，PaddlePaddle为交叉编译提供了工具链配置文档[cmake/cross_compiling/android.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/android.cmake)，以提供一些默认的编译器和编译参数相关配置。注意，从CMake 3.7版本开始，CMake官方对Android平台的交叉编译提供了通用的支持。PaddlePaddle若检测到用户使用的CMake版本不低于3.7时，将会将用户传进来的配置参数传递CMake系统，交由CMake系统本身来处理。有关参数配置的详细说明见[cmake-toolchains](https://cmake.org/cmake/help/v3.7/manual/cmake-toolchains.7.html#cross-compiling)。

@@ -36,32 +82,57 @@ CMake系统对交叉编译提供了支持[cmake-toolchains](https://cmake.org/cm
 Android平台可选配置参数：

 - `ANDROID_STANDALONE_TOOLCHAIN`，独立工具链所在的绝对路径，或者相对于构建目录的相对路径。PaddlePaddle的CMake系统将根据该值自动推导和设置需要使用的交叉编译器、sysroot、以及Android API级别；否则，用户需要在cmake时手动设置这些值。无默认值。
- `ANDROID_ABI`，目标架构ABI。目前只支持`armeabi-v7a`，默认值为`armeabi-v7a`。
+- `ANDROID_TOOLCHAIN`，目标工具链。可设置`gcc/clang`，默认值为`clang`。
+	- CMake 3.7以上，将会始终使用`clang`工具链；CMake 3.7以下，可设置`ANDROID_TOOLCHAIN=gcc`以使用`gcc`工具链。
+	- Android官方提供的`clang`编译器要求系统支持`GLIBC 2.15`以上。
+- `ANDROID_ABI`，目标架构ABI。目前支持`armeabi-v7a`和`arm64-v8a`，默认值为`armeabi-v7a`。
 - `ANDROID_NATIVE_API_LEVEL`，工具链的Android API级别。若没有显式设置，PaddlePaddle将根据`ANDROID_STANDALONE_TOOLCHAIN`的值自动推导得到。
- `ANROID_ARM_MODE`，是否使用ARM模式。可设置`ON/OFF`，默认值为`ON`。
- `ANDROID_ARM_NEON`，是否使用NEON指令。目前必须设置成`ON`，默认值为`ON`。
+- `ANROID_ARM_MODE`，是否使用ARM模式。
+	- `ANDROID_ABI=armeabi-v7a`时，可设置`ON/OFF`，默认值为`ON`；
+	- `ANDROID_ABI=arm64-v8a`时，不需要设置。
+- `ANDROID_ARM_NEON`，是否使用NEON指令。
+	- `ANDROID_ABI=armeabi-v7a`时，可设置`ON/OFF`，默认值为`ON`；
+	- `ANDROID_ABI=arm64-v8a`时，不需要设置。

 其他配置参数：

+- `USE_EIGEN_FOR_BLAS`，是否使用Eigen库进行矩阵计算。可设置`ON/OFF`，默认值为`OFF`。
 - `HOST_C/CXX_COMPILER`，宿主机的C/C++编译器。在编译宿主机版protoc可执行文件和目标机版OpenBLAS库时需要用到。默认设置成环境变量`CC`的值；若环境变量`CC`没有设置，则设置成`cc`编译器。

-一种常用的cmake配置如下：
+常用的cmake配置如下：

 ```bash
 cmake -DCMAKE_SYSTEM_NAME=Android \
-      -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/my_standalone_toolchain \
+      -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm_standalone_toolchain \
      -DANDROID_ABI=armeabi-v7a \
      -DANDROID_ARM_NEON=ON \
      -DANDROID_ARM_MODE=ON \
+      -DUSE_EIGEN_FOR_BLAS=ON \
      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
      -DWITH_C_API=ON \
      -DWITH_SWIG_PY=OFF \
      ..
 ```

+```
+cmake -DCMAKE_SYSTEM_NAME=Android \
+      -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm64_standalone_toolchain \
+      -DANDROID_ABI=arm64-v8a \
+      -DUSE_EIGEN_FOR_BLAS=OFF \
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \  
+      -DWITH_C_API=ON \
+      -DWITH_SWIG_PY=OFF \
+      ..
+```
+
 用户还可根据自己的需求设置其他编译参数。比如希望最小化生成的库的大小，可以设置`CMAKE_BUILD_TYPE`为`MinSizeRel`；若希望最快的执行速度，则可设置`CMAKE_BUILD_TYPE`为`Release`。亦可以通过手动设置`CMAKE_C/CXX_FLAGS_MINSIZEREL/RELEASE`来影响PaddlePaddle的编译过程。

-## 编译和安装
+**性能TIPS**，为了达到最快的计算速度，在CMake参数配置上，有以下建议：
+- 设置`CMAKE_BUILD_TYPE`为`Release`
+- 使用`clang`编译工具链
+- `armeabi-v7a`时，设置`USE_EIGEN_BLAS=ON`，使用Eigen进行矩阵计算；`arm64-v8a`时，设置`USE_EIGEN_FOR_BLAS=OFF`，使用OpenBLAS进行矩阵计算
+
+### 编译和安装

 CMake配置完成后，执行以下命令，PaddlePaddle将自动下载和编译所有第三方依赖库、编译和安装PaddlePaddle预测库。

@@ -72,4 +143,4 @@ make install

 注意：如果你曾经在源码目录下编译过其他平台的PaddlePaddle库，请先使用`rm -rf`命令删除`third_party`目录和`build`目录，以确保所有的第三方依赖库和PaddlePaddle代码都是针对新的CMake配置重新编译的。

-执行完安装命令后，`your/path/to/install`目录中会包含`include`和`lib`目录，其中`include`中包含C-API的头文件，`lib`中包含一个Android版本的库。自此，PaddlePaddle的已经安装完成，用户可将`your/path/to/install`目录下的生成文件用于深度学习相关Android App中，调用方法见C-API文档。
+执行完安装命令后，`your/path/to/install`目录中会包含`include`、`lib`和`third_party`目录，其中`include`中包含C-API的头文件，`lib`中包含若干个不同Android ABI的PaddlePaddle库，`third_party`中包含所依赖的所有第三方库。自此，PaddlePaddle的已经安装完成，用户可将`your/path/to/install`目录下的生成文件用于深度学习相关Android App中，调用方法见C-API文档。
--- a/doc/howto/usage/cluster/cluster_train_cn.md
+++ b/doc/howto/usage/cluster/cluster_train_cn.md
--- a/doc/howto/usage/cluster/cluster_train_en.md
+++ b/doc/howto/usage/cluster/cluster_train_en.md
--- a/doc/howto/usage/cluster/src/trainer.png
+++ b/doc/howto/usage/cluster/src/trainer.png
--- a/doc/howto/usage/cluster/src/trainer_cn.png
+++ b/doc/howto/usage/cluster/src/trainer_cn.png
--- a/doc/howto/usage/cluster/src/word2vec/api_train_v2.py
+++ b/doc/howto/usage/cluster/src/word2vec/api_train_v2.py
+import gzip
+import math
+
+import paddle.v2 as paddle
+
+embsize = 32
+hiddensize = 256
+N = 5
+
+
+def wordemb(inlayer):
+    wordemb = paddle.layer.embedding(
+        input=inlayer,
+        size=embsize,
+        param_attr=paddle.attr.Param(
+            name="_proj",
+            initial_std=0.001,
+            learning_rate=1,
+            l2_rate=0,
+            sparse_update=True))
+    return wordemb
+
+
+def main():
+    # for local training
+    cluster_train = False
+
+    if not cluster_train:
+        paddle.init(use_gpu=False, trainer_count=1)
+    else:
+        paddle.init(
+            use_gpu=False,
+            trainer_count=2,
+            port=7164,
+            ports_num=1,
+            ports_num_for_sparse=1,
+            num_gradient_servers=1)
+    word_dict = paddle.dataset.imikolov.build_dict()
+    dict_size = len(word_dict)
+    firstword = paddle.layer.data(
+        name="firstw", type=paddle.data_type.integer_value(dict_size))
+    secondword = paddle.layer.data(
+        name="secondw", type=paddle.data_type.integer_value(dict_size))
+    thirdword = paddle.layer.data(
+        name="thirdw", type=paddle.data_type.integer_value(dict_size))
+    fourthword = paddle.layer.data(
+        name="fourthw", type=paddle.data_type.integer_value(dict_size))
+    nextword = paddle.layer.data(
+        name="fifthw", type=paddle.data_type.integer_value(dict_size))
+
+    Efirst = wordemb(firstword)
+    Esecond = wordemb(secondword)
+    Ethird = wordemb(thirdword)
+    Efourth = wordemb(fourthword)
+
+    contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
+    hidden1 = paddle.layer.fc(input=contextemb,
+                              size=hiddensize,
+                              act=paddle.activation.Sigmoid(),
+                              layer_attr=paddle.attr.Extra(drop_rate=0.5),
+                              bias_attr=paddle.attr.Param(learning_rate=2),
+                              param_attr=paddle.attr.Param(
+                                  initial_std=1. / math.sqrt(embsize * 8),
+                                  learning_rate=1))
+    predictword = paddle.layer.fc(input=hidden1,
+                                  size=dict_size,
+                                  bias_attr=paddle.attr.Param(learning_rate=2),
+                                  act=paddle.activation.Softmax())
+
+    def event_handler(event):
+        if isinstance(event, paddle.event.EndIteration):
+            if event.batch_id % 100 == 0:
+                with gzip.open("batch-" + str(event.batch_id) + ".tar.gz",
+                               'w') as f:
+                    trainer.save_parameter_to_tar(f)
+                result = trainer.test(
+                    paddle.batch(
+                        paddle.dataset.imikolov.test(word_dict, N), 32))
+                print "Pass %d, Batch %d, Cost %f, %s, Testing metrics %s" % (
+                    event.pass_id, event.batch_id, event.cost, event.metrics,
+                    result.metrics)
+
+    cost = paddle.layer.classification_cost(input=predictword, label=nextword)
+
+    parameters = paddle.parameters.create(cost)
+    adagrad = paddle.optimizer.AdaGrad(
+        learning_rate=3e-3,
+        regularization=paddle.optimizer.L2Regularization(8e-4))
+    trainer = paddle.trainer.SGD(cost,
+                                 parameters,
+                                 adagrad,
+                                 is_local=not cluster_train)
+    trainer.train(
+        paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
+        num_passes=30,
+        event_handler=event_handler)
+
+
+if __name__ == '__main__':
+    main()
--- a/doc/howto/usage/cluster/src/word2vec/api_train_v2_cluster.py
+++ b/doc/howto/usage/cluster/src/word2vec/api_train_v2_cluster.py
+import math
+import os
+import paddle.v2 as paddle
+import pickle
+
+embsize = 32
+hiddensize = 256
+N = 5
+cluster_train_file = "./train_data_dir/train/train.txt"
+cluster_test_file = "./test_data_dir/test/test.txt"
+node_id = os.getenv("OMPI_COMM_WORLD_RANK")
+if not node_id:
+    raise EnvironmentError("must provied OMPI_COMM_WORLD_RANK")
+
+
+def wordemb(inlayer):
+    wordemb = paddle.layer.embedding(
+        input=inlayer,
+        size=embsize,
+        param_attr=paddle.attr.Param(
+            name="_proj",
+            initial_std=0.001,
+            learning_rate=1,
+            l2_rate=0,
+            sparse_update=True))
+    return wordemb
+
+
+def cluster_reader_cluster(filename, node_id):
+    def cluster_reader():
+        with open("-".join([filename, "%05d" % int(node_id)]), "r") as f:
+            for l in f:
+                csv_data = [int(cell) for cell in l.split(",")]
+                yield tuple(csv_data)
+
+    return cluster_reader
+
+
+def main():
+    # get arguments from env
+
+    # for local training
+    TRUTH = ["true", "True", "TRUE", "1", "yes", "Yes", "YES"]
+    cluster_train = os.getenv('PADDLE_CLUSTER_TRAIN', "False") in TRUTH
+    use_gpu = os.getenv('PADDLE_INIT_USE_GPU', "False")
+
+    if not cluster_train:
+        paddle.init(
+            use_gpu=use_gpu,
+            trainer_count=int(os.getenv("PADDLE_INIT_TRAINER_COUNT", "1")))
+    else:
+        paddle.init(
+            use_gpu=use_gpu,
+            trainer_count=int(os.getenv("PADDLE_INIT_TRAINER_COUNT", "1")),
+            port=int(os.getenv("PADDLE_INIT_PORT", "7164")),
+            ports_num=int(os.getenv("PADDLE_INIT_PORTS_NUM", "1")),
+            ports_num_for_sparse=int(
+                os.getenv("PADDLE_INIT_PORTS_NUM_FOR_SPARSE", "1")),
+            num_gradient_servers=int(
+                os.getenv("PADDLE_INIT_NUM_GRADIENT_SERVERS", "1")),
+            trainer_id=int(os.getenv("PADDLE_INIT_TRAINER_ID", "0")),
+            pservers=os.getenv("PADDLE_INIT_PSERVERS", "127.0.0.1"))
+    fn = open("thirdparty/wuyi_train_thdpty/word_dict.pickle", "r")
+    word_dict = pickle.load(fn)
+    fn.close()
+    dict_size = len(word_dict)
+    firstword = paddle.layer.data(
+        name="firstw", type=paddle.data_type.integer_value(dict_size))
+    secondword = paddle.layer.data(
+        name="secondw", type=paddle.data_type.integer_value(dict_size))
+    thirdword = paddle.layer.data(
+        name="thirdw", type=paddle.data_type.integer_value(dict_size))
+    fourthword = paddle.layer.data(
+        name="fourthw", type=paddle.data_type.integer_value(dict_size))
+    nextword = paddle.layer.data(
+        name="fifthw", type=paddle.data_type.integer_value(dict_size))
+
+    Efirst = wordemb(firstword)
+    Esecond = wordemb(secondword)
+    Ethird = wordemb(thirdword)
+    Efourth = wordemb(fourthword)
+
+    contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
+    hidden1 = paddle.layer.fc(input=contextemb,
+                              size=hiddensize,
+                              act=paddle.activation.Sigmoid(),
+                              layer_attr=paddle.attr.Extra(drop_rate=0.5),
+                              bias_attr=paddle.attr.Param(learning_rate=2),
+                              param_attr=paddle.attr.Param(
+                                  initial_std=1. / math.sqrt(embsize * 8),
+                                  learning_rate=1))
+    predictword = paddle.layer.fc(input=hidden1,
+                                  size=dict_size,
+                                  bias_attr=paddle.attr.Param(learning_rate=2),
+                                  act=paddle.activation.Softmax())
+
+    def event_handler(event):
+        if isinstance(event, paddle.event.EndIteration):
+            if event.batch_id % 100 == 0:
+                result = trainer.test(
+                    paddle.batch(
+                        cluster_reader_cluster(cluster_test_file, node_id), 32))
+                print "Pass %d, Batch %d, Cost %f, %s, Testing metrics %s" % (
+                    event.pass_id, event.batch_id, event.cost, event.metrics,
+                    result.metrics)
+
+    cost = paddle.layer.classification_cost(input=predictword, label=nextword)
+    parameters = paddle.parameters.create(cost)
+    adagrad = paddle.optimizer.AdaGrad(
+        learning_rate=3e-3,
+        regularization=paddle.optimizer.L2Regularization(8e-4))
+    trainer = paddle.trainer.SGD(cost,
+                                 parameters,
+                                 adagrad,
+                                 is_local=not cluster_train)
+    trainer.train(
+        paddle.batch(cluster_reader_cluster(cluster_train_file, node_id), 32),
+        num_passes=30,
+        event_handler=event_handler)
+
+
+if __name__ == '__main__':
+    main()
--- a/doc/howto/usage/cluster/src/word2vec/prepare.py
+++ b/doc/howto/usage/cluster/src/word2vec/prepare.py
+import paddle.v2 as paddle
+import tarfile
+import os
+import pickle
+
+SPLIT_COUNT = 3
+N = 5
+
+
+def file_len(fd):
+    for i, l in enumerate(fd):
+        pass
+    return i + 1
+
+
+def split_from_reader_by_line(filename, reader, split_count):
+    fn = open(filename, "w")
+    for batch_id, batch_data in enumerate(reader()):
+        batch_data_str = [str(d) for d in batch_data]
+        fn.write(",".join(batch_data_str))
+        fn.write("\n")
+    fn.close()
+
+    fn = open(filename, "r")
+    total_line_count = file_len(fn)
+    fn.close()
+    per_file_lines = total_line_count / split_count + 1
+    cmd = "split -d -a 5 -l %d %s %s-" % (per_file_lines, filename, filename)
+    os.system(cmd)
+
+
+word_dict = paddle.dataset.imikolov.build_dict()
+with open("word_dict.pickle", "w") as dict_f:
+    pickle.dump(word_dict, dict_f)
+
+split_from_reader_by_line("train.txt",
+                          paddle.dataset.imikolov.train(word_dict, N),
+                          SPLIT_COUNT)
+split_from_reader_by_line("test.txt",
+                          paddle.dataset.imikolov.test(word_dict, N),
+                          SPLIT_COUNT)
--- a/go/pserver/client/client.go
+++ b/go/pserver/client/client.go
@@ -137,7 +137,7 @@ func (c *Client) FinishInitParams() error {
 			return err
 		}
 	}
-	return nil
+	return c.sel.Done()
 }

 // SendGrads sends gradients to parameter servers for updating

--- a/paddle/capi/CMakeLists.txt
+++ b/paddle/capi/CMakeLists.txt
@@ -28,23 +28,37 @@ add_style_check_target(paddle_capi ${CAPI_SOURCES} ${CAPI_HEADER}

 add_dependencies(paddle_capi paddle_proto)

-# combine all paddle static libraries together, into libpaddle_capi_whole.a
-# user should use PaddleCAPI as -lpaddle_capi_whole
-set(PADDLE_CAPI_INFER_LIBS
-    paddle_utils
-    paddle_parameter
-    paddle_math
-    paddle_cuda
-    paddle_function
-    paddle_gserver
-    paddle_proto)
-
+# TODO: paddle_capi_whole will be removed.
+if(MOBILE_INFERENCE)
+    set(PADDLE_CAPI_INFER_LIBS
+        paddle_utils
+        paddle_parameter
+        paddle_math
+        paddle_cuda
+        paddle_function
+        paddle_gserver
+        paddle_proto)
+else()
+    set(PADDLE_CAPI_INFER_LIBS
+        paddle_utils
+        paddle_parameter
+        paddle_math
+        paddle_cuda
+        paddle_function
+        paddle_gserver
+        paddle_proto
+        paddle_pserver
+        paddle_network)
+endif()
 cc_library(paddle_capi_whole DEPS paddle_capi ${PADDLE_CAPI_INFER_LIBS})

-# No shared library for iOS
+# Link the static library for inference
+cc_library(paddle_capi_engine DEPS paddle_capi paddle_utils paddle_parameter paddle_math paddle_cuda paddle_proto)
+cc_library(paddle_capi_layers DEPS paddle_function paddle_gserver)
+
+# Link the shared library for inference
 if(NOT IOS)
-  set(LINK_FLAGS " -Wl,--retain-symbols-file ${CMAKE_CURRENT_SOURCE_DIR}/export.sym -Wl,--version-script ${CMAKE_CURRENT_SOURCE_DIR}/export.map")
-  # TODO: merge mkl into paddle_capi_shared
+  set(LINK_FLAGS "-Wl,--version-script ${CMAKE_CURRENT_SOURCE_DIR}/paddle_capi.map")
  add_library(paddle_capi_shared SHARED ${CAPI_SOURCES})
  set_target_properties(paddle_capi_shared	PROPERTIES LINK_FLAGS "${LINK_FLAGS}")
  target_include_directories(paddle_capi_shared PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
@@ -53,9 +67,10 @@ endif()

 # install library & headers.
 install(FILES ${CAPI_HEADERS} DESTINATION include/paddle)
+install(FILES paddle_capi.map DESTINATION include/paddle)
 install(FILES ${CMAKE_CURRENT_BINARY_DIR}/config.h DESTINATION include/paddle)
 if(ANDROID)
-  install(TARGETS paddle_capi_whole paddle_capi_shared
+  install(TARGETS paddle_capi_whole paddle_capi_engine paddle_capi_layers paddle_capi_shared
          ARCHIVE DESTINATION lib/${ANDROID_ABI}
          LIBRARY DESTINATION lib/${ANDROID_ABI})
  execute_process(
@@ -80,7 +95,7 @@ if(ANDROID)
      )"
  )
 else(ANDROID)
-  install(TARGETS paddle_capi_whole ARCHIVE DESTINATION lib)
+  install(TARGETS paddle_capi_whole paddle_capi_engine paddle_capi_layers ARCHIVE DESTINATION lib)
  if(NOT IOS)
    install(TARGETS paddle_capi_shared DESTINATION lib)
  endif()

--- a/paddle/capi/export.sym
+++ b/paddle/capi/export.sym
--- a/paddle/capi/export.map
+++ b/paddle/capi/export.map
--- a/paddle/framework/CMakeLists.txt
+++ b/paddle/framework/CMakeLists.txt
@@ -19,14 +19,15 @@ cc_test(scope_test SRCS scope_test.cc DEPS scope)
 proto_library(framework_proto SRCS framework.proto)

 cc_library(attribute SRCS attribute.cc DEPS framework_proto)
-cc_library(proto_desc SRCS var_desc.cc op_desc.cc block_desc.cc program_desc.cc DEPS attribute ddim op_info)
+cc_test(program_desc_test SRCS program_desc_test.cc DEPS proto_desc)
 cc_library(op_proto_maker SRCS op_proto_maker.cc DEPS framework_proto attribute)
 cc_test(op_proto_maker_test SRCS op_proto_maker_test.cc DEPS op_proto_maker)
 cc_library(op_info SRCS op_info.cc DEPS attribute framework_proto)
-cc_library(operator SRCS operator.cc DEPS op_info device_context tensor scope proto_desc)
+cc_library(operator SRCS operator.cc DEPS op_info device_context tensor scope glog)
 cc_test(operator_test SRCS operator_test.cc DEPS operator op_registry)
+cc_library(proto_desc SRCS var_desc.cc op_desc.cc block_desc.cc program_desc.cc DEPS attribute ddim op_info operator)

-cc_library(op_registry SRCS op_registry.cc DEPS op_proto_maker op_info operator)
+cc_library(op_registry SRCS op_registry.cc DEPS op_proto_maker op_info operator glog proto_desc)
 cc_test(op_registry_test SRCS op_registry_test.cc DEPS op_registry)

 py_proto_compile(framework_py_proto SRCS framework.proto)
@@ -42,7 +43,10 @@ add_custom_command(TARGET framework_py_proto POST_BUILD
 cc_library(backward SRCS backward.cc DEPS net_op)
 cc_test(backward_test SRCS backward_test.cc DEPS backward recurrent_op device_context)

-cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto backward)
+cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto backward glog)
+
+cc_library(prune SRCS prune.cc DEPS framework_proto)
+cc_test(prune_test SRCS prune_test.cc DEPS op_info prune recurrent_op device_context)

 cc_library(tensor_array SRCS tensor_array.cc DEPS lod_tensor)
 cc_test(tensor_array_test SRCS tensor_array_test.cc DEPS tensor_array place)

--- a/paddle/framework/attribute.cc
+++ b/paddle/framework/attribute.cc
@@ -19,19 +19,7 @@ limitations under the License. */
 namespace paddle {
 namespace framework {

-static ProgramDesc* g_program_desc = nullptr;
-
-ProgramDesc& GetProgramDesc() {
-  if (g_program_desc == nullptr) {
-    g_program_desc = new ProgramDesc();
-    auto root_block = g_program_desc->mutable_blocks()->Add();
-    root_block->set_idx(0);
-    root_block->set_parent_idx(-1);
-  }
-  return *g_program_desc;
-}
-
-Attribute GetAttrValue(const OpDesc::Attr& attr_desc) {
+Attribute GetAttrValue(const OpDesc::Attr& attr_desc, ProgramDesc* program) {
  switch (attr_desc.type()) {
    case framework::AttrType::BOOLEAN: {
      return attr_desc.b();
@@ -74,7 +62,9 @@ Attribute GetAttrValue(const OpDesc::Attr& attr_desc) {
      return val;
    }
    case framework::AttrType::BLOCK: {
-      return GetProgramDesc().mutable_blocks(attr_desc.block_idx());
+      PADDLE_ENFORCE(program != nullptr,
+                     "Need to specify ProgramDesc when get a block attr");
+      return program->mutable_blocks(attr_desc.block_idx());
    }
  }
  PADDLE_ENFORCE(false, "Unknown OpDesc::AttrDesc::type !");

--- a/paddle/framework/attribute.h
+++ b/paddle/framework/attribute.h
@@ -26,16 +26,13 @@ limitations under the License. */

 namespace paddle {
 namespace framework {
-
-ProgramDesc& GetProgramDesc();
-
 template <typename T>
 inline AttrType AttrTypeID() {
  Attribute tmp = T();
  return static_cast<AttrType>(tmp.which() - 1);
 }

-Attribute GetAttrValue(const OpDesc::Attr& attr_desc);
+Attribute GetAttrValue(const OpDesc::Attr& attr_desc, ProgramDesc* desc);

 class AttrReader {
 public:

--- a/paddle/framework/backward.cc
+++ b/paddle/framework/backward.cc
@@ -21,6 +21,7 @@

 #include "paddle/framework/block_desc.h"
 #include "paddle/framework/op_registry.h"
+#include "paddle/operators/dynamic_recurrent_op.h"
 #include "paddle/operators/net_op.h"
 #include "paddle/operators/recurrent_op.h"

@@ -220,8 +221,7 @@ static std::unique_ptr<OperatorBase> BackwardRecursive(
    // process recurrent gradient op as a special operator.
    if (forwardOp.Type() == "recurrent") {
      // NOTE clean up cycle call somewhere (RNN's stepnet constains itself),
-      // or
-      // this will result in infinite loop.
+      // or this will result in infinite loop.
      const auto& rnnop =
          *static_cast<const operators::RecurrentOp*>(&forwardOp);
      auto rnn_grad_op =
@@ -231,6 +231,18 @@ static std::unique_ptr<OperatorBase> BackwardRecursive(
      // create stepnet's gradient op
      rnn_grad_op->set_stepnet(
          BackwardRecursive(stepnet_op, no_grad_names, grad_to_var, uniq_id));
+    } else if (forwardOp.Type() == "dynamic_recurrent") {
+      // NOTE clean up cycle call somewhere (RNN's stepnet constains itself),
+      // or this will result in infinite loop.
+      const auto& rnnop =
+          *static_cast<const operators::DynamicRecurrentOp*>(&forwardOp);
+      auto rnn_grad_op =
+          static_cast<operators::DynamicRecurrentGradientOp*>(grad_op.get());
+      const auto& stepnet_op =
+          *static_cast<const OperatorBase*>(&rnnop.rnn.GetStepUnit());
+      // create stepnet's gradient op
+      rnn_grad_op->rnn.SetStepUnit(
+          BackwardRecursive(stepnet_op, no_grad_names, grad_to_var, uniq_id));
    }

    if (net->ops_.empty()) {  // Current no aux op is added to network
@@ -281,12 +293,16 @@ static void CreateGradVarInBlock(
  auto ops = block_desc->AllOps();
  for (size_t op_index = grad_op_start_index; op_index < ops.size();
       ++op_index) {
+    bool need_infer_shape = false;
    ForEachVarName(ops[op_index]->Outputs(),
                   [&](const std::string& grad_var_name) {
                     if (block_desc->HasVar(grad_var_name)) {
                       return false;
                     }
-                     block_desc->Var(grad_var_name);
+                     need_infer_shape = true;
+                     auto var = block_desc->Var(grad_var_name);
+                     // FIXME(qiao) infer the datatype
+                     var->SetDataType(framework::DataType::FP32);
                     auto it = param_name_map.find(grad_var_name);
                     if (it == param_name_map.end()) {
                       return false;
@@ -298,12 +314,14 @@ static void CreateGradVarInBlock(
                     grad_record.op_idx_ = static_cast<int>(op_index);
                     return false; /* not break */
                   });
+    if (need_infer_shape) {
+      ops[op_index]->InferShape(*block_desc);
+    }
  }
 }

 std::vector<std::unique_ptr<OpDescBind>> MakeOpGrad(
-    const std::unique_ptr<OpDescBind>& op_desc,
-    std::unordered_set<std::string>* no_grad_vars,
+    const OpDescBind* op_desc, std::unordered_set<std::string>* no_grad_vars,
    std::unordered_map<std::string, std::string>* grad_to_var) {
  std::vector<std::unique_ptr<OpDescBind>> grad_op_descs;
  // All input gradients of forwarding operator do not need to calculate.
@@ -350,7 +368,7 @@ std::vector<std::unique_ptr<OpDescBind>> MakeBlockBackward(
    std::unordered_set<std::string>* no_grad_vars,
    std::unordered_map<std::string, std::string>* grad_to_var) {
  BlockDescBind* cur_block = program_desc.Block(block_idx);
-  std::deque<std::unique_ptr<OpDescBind>>& op_descs = cur_block->ops_;
+  std::vector<OpDescBind*> op_descs = cur_block->AllOps();
  std::unordered_map<std::string, std::vector<size_t>> dup_out_ops;
  size_t grad_desc_idx = 0;
  std::vector<std::unique_ptr<OpDescBind>> backward_descs;
@@ -368,7 +386,7 @@ std::vector<std::unique_ptr<OpDescBind>> MakeBlockBackward(
          program_desc, step_block_idx, no_grad_vars, grad_to_var);
      BlockDescBind* backward_block = program_desc.AppendBlock(*cur_block);
      for (auto& ptr : backward_block_op_descs) {
-        backward_block->ops_.push_back(std::move(ptr));
+        backward_block->AppendAllocatedOp(std::move(ptr));
      }
      op_grads[0]->SetBlockAttr("step_block", *backward_block);
    }
@@ -425,17 +443,22 @@ ParamGradInfoMap AppendBackward(

  const int root_block_idx = 0;
  auto root_block = program_desc.Block(root_block_idx);
-  auto& all_ops = root_block->ops_;

  // insert fill one op for target
+  // TODO(qiao) add some check to the target.
  std::string fill_one_op_out = GradVarName(target.Name());
+  std::vector<int64_t> target_shape_desc = target.Shape();
+  std::vector<int> target_shape;
+  std::transform(target_shape_desc.begin(), target_shape_desc.end(),
+                 std::back_inserter(target_shape),
+                 [](int64_t dim) { return static_cast<int>(dim); });
  std::unique_ptr<OpDescBind> fill_one_op(
      new OpDescBind("fill_constant", {}, {{"Out", {fill_one_op_out}}},
-                     {{"shape", std::vector<int>{1}},
+                     {{"shape", target_shape},
                      {"value", static_cast<float>(1.0)},
                      {"data_type", framework::DataType::FP32}}));
-  all_ops.push_back(std::move(fill_one_op));
-  size_t forward_op_num = all_ops.size();
+  root_block->AppendAllocatedOp(std::move(fill_one_op));
+  size_t forward_op_num = root_block->OpSize();
  size_t forward_block_num = program_desc.Size();

  // Insert backward operators
@@ -443,13 +466,22 @@ ParamGradInfoMap AppendBackward(
  auto backward_op_descs = MakeBlockBackward(program_desc, root_block_idx,
                                             &no_grad_var_names, &grad_to_var);

-  std::unordered_map<std::string, GradVarInfo> retv;
-
-  // Create Variable
  for (auto& ptr : backward_op_descs) {
-    all_ops.push_back(std::move(ptr));
+    root_block->AppendAllocatedOp(std::move(ptr));
  }
-  root_block->Var(fill_one_op_out);
+  // Create Variable
+
+  // Create target gradient variable
+  std::unordered_map<std::string, GradVarInfo> retv;
+
+  auto var = root_block->Var(fill_one_op_out);
+  // FIXME(qiao) infer the data type
+  var->SetDataType(framework::DataType::FP32);
+  var->SetShape(target.Shape());
+  auto& target_grad = retv[target.Name()];
+  target_grad.name_ = fill_one_op_out;
+  target_grad.block_idx_ = root_block_idx;
+  target_grad.op_idx_ = static_cast<int>(forward_op_num);

  // create grad_var for all blocks in this program
  CreateGradVarInBlock(forward_op_num, grad_to_var, root_block, &retv);

--- a/paddle/framework/backward_test.cc
+++ b/paddle/framework/backward_test.cc
@@ -26,6 +26,20 @@ namespace framework {

 using DeviceContext = platform::DeviceContext;

+class NoneOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(framework::InferShapeContext *ctx) const override {}
+};
+
+template <typename Place, typename T>
+class NoneKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext &context) const override {}
+};
+
 class RowWiseAddOpMaker : public OpProtoAndCheckerMaker {
 public:
  RowWiseAddOpMaker(OpProto *proto, OpAttrChecker *op_checker)
@@ -215,19 +229,51 @@ class MinusOpMaker : public OpProtoAndCheckerMaker {
 namespace f = paddle::framework;
 namespace ops = paddle::operators;
 using EnforceNotMet = paddle::platform::EnforceNotMet;
-REGISTER_OPERATOR(rowwise_add, f::NOP, f::RowWiseAddOpMaker,
+// rowwise_add
+REGISTER_OPERATOR(rowwise_add, f::NoneOp, f::RowWiseAddOpMaker,
                  f::RowWiseAddGradMaker);
-REGISTER_OPERATOR(rowwise_add_grad, f::NOP);
-REGISTER_OP(mul, f::NOP, f::MulOpMaker, mul_grad, f::NOP);
-REGISTER_OP(sigmoid, f::NOP, f::SigmoidOpMaker, sigmoid_grad, f::NOP);
-REGISTER_OP_WITHOUT_GRADIENT(nograd, f::NOP, f::NoGradOpMaker);
-REGISTER_OP_WITHOUT_GRADIENT(fill_zeros_like, f::NOP, f::FillZeroOpMaker);
-REGISTER_OP(sum, f::NOP, f::SumOpMaker, sum_grad, f::NOP);
+REGISTER_OP_CPU_KERNEL(rowwise_add,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OPERATOR(rowwise_add_grad, f::NoneOp);
+REGISTER_OP_CPU_KERNEL(rowwise_add_grad,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+// mul
+REGISTER_OP(mul, f::NoneOp, f::MulOpMaker, mul_grad, f::NoneOp);
+REGISTER_OP_CPU_KERNEL(mul, f::NoneKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_CPU_KERNEL(mul_grad,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+// sigmoid
+REGISTER_OP(sigmoid, f::NoneOp, f::SigmoidOpMaker, sigmoid_grad, f::NoneOp);
+REGISTER_OP_CPU_KERNEL(sigmoid,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_WITHOUT_GRADIENT(nograd, f::NoneOp, f::NoGradOpMaker);
+// fill_zeros_like
+REGISTER_OP_WITHOUT_GRADIENT(fill_zeros_like, f::NoneOp, f::FillZeroOpMaker);
+REGISTER_OP_CPU_KERNEL(fill_zeros_like,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+// sum
+REGISTER_OP(sum, f::NoneOp, f::SumOpMaker, sum_grad, f::NoneOp);
+REGISTER_OP_CPU_KERNEL(sum, f::NoneKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_CPU_KERNEL(sum_grad,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+// fc
 REGISTER_OP_WITHOUT_GRADIENT(fc, f::FcOp, f::FcOpMaker);
-REGISTER_OP(many_output_op, f::NOP, f::ManyOutputOpMaker, many_output_op_grad,
-            f::NOP);
-REGISTER_OP(mult_in_out, f::NOP, f::MultInOutOpMaker, mult_in_out_grad, f::NOP);
-REGISTER_OPERATOR(minus, f::NOP, f::MinusOpMaker, f::MinusGradOpDescMaker);
+// many_output_op
+REGISTER_OP(many_output_op, f::NoneOp, f::ManyOutputOpMaker,
+            many_output_op_grad, f::NoneOp);
+// mult_in_out
+REGISTER_OP(mult_in_out, f::NoneOp, f::MultInOutOpMaker, mult_in_out_grad,
+            f::NoneOp);
+REGISTER_OP_CPU_KERNEL(mult_in_out,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_CPU_KERNEL(mult_in_out_grad,
+                       f::NoneKernel<paddle::platform::CPUPlace, float>);
+// minus
+REGISTER_OPERATOR(minus, f::NoneOp, f::MinusOpMaker, f::MinusGradOpDescMaker);
+REGISTER_OP_CPU_KERNEL(minus, f::NoneKernel<paddle::platform::CPUPlace, float>);
+// scale
+REGISTER_OPERATOR(scale, f::NoneOp);
+REGISTER_OP_CPU_KERNEL(scale, f::NoneKernel<paddle::platform::CPUPlace, float>);

 TEST(Backward, simple_op_not_need_grad) {
  auto fwd = f::OpRegistry::CreateOp(
@@ -449,20 +495,10 @@ TEST(Backward, linear_net_intermediate_variable_has_no_grad) {
  EXPECT_EQ(bwd_net->ops_[2]->Outputs(all).size(), 0UL);
 }

-// =================================== //
-
-f::ProgramDesc *GetNewProgramDesc() {
-  auto *program_desc = new f::ProgramDesc();
-  auto *root_block = program_desc->add_blocks();
-  root_block->set_idx(0);
-  root_block->set_parent_idx(-1);
-  return program_desc;
-}
-
 TEST(Backward, simple_single_op) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
  f::BlockDescBind *block = program.Block(0);
+
  f::OpDescBind *op = block->AppendOp();
  op->SetType("rowwise_add");
  op->SetInput("X", {"x"});
@@ -487,7 +523,7 @@ TEST(Backward, simple_single_op) {
  EXPECT_EQ(grad_op->Output(f::GradVarName("b")),
            std::vector<std::string>({f::GradVarName("b")}));

-  EXPECT_EQ(var_to_grad.size(), 2UL);
+  EXPECT_EQ(var_to_grad.size(), 3UL);
  EXPECT_EQ(var_to_grad.at("b"), f::GradVarInfo(f::GradVarName("b"), 0, 2));
  EXPECT_EQ(var_to_grad.at("x"), f::GradVarInfo(f::GradVarName("x"), 0, 2));

@@ -496,8 +532,7 @@ TEST(Backward, simple_single_op) {
 }

 TEST(Backward, default_attribute) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
  f::BlockDescBind *block = program.Block(0);
  f::OpDescBind *op = block->AppendOp();
  op->SetType("mul");
@@ -523,8 +558,7 @@ TEST(Backward, default_attribute) {
 }

 TEST(Backward, simple_mult_op) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
  f::BlockDescBind *block = program.Block(0);
  f::OpDescBind *op1 = block->AppendOp();
  op1->SetType("rowwise_add");
@@ -588,7 +622,7 @@ TEST(Backward, simple_mult_op) {
  EXPECT_EQ(grad_op3->Output(f::GradVarName("b")),
            std::vector<std::string>({f::GradVarName("b3")}));

-  EXPECT_EQ(var_to_grad.size(), 6UL);
+  EXPECT_EQ(var_to_grad.size(), 7UL);
  EXPECT_EQ(var_to_grad.at("x1"), f::GradVarInfo(f::GradVarName("x1"), 0, 6));
  EXPECT_EQ(var_to_grad.at("b1"), f::GradVarInfo(f::GradVarName("b1"), 0, 6));
  EXPECT_EQ(var_to_grad.at("out1"),
@@ -607,8 +641,7 @@ TEST(Backward, simple_mult_op) {
 }

 TEST(Backward, intermedia_var_no_grad) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
  f::BlockDescBind *block = program.Block(0);
  f::OpDescBind *op1 = block->AppendOp();
  op1->SetType("rowwise_add");
@@ -666,7 +699,7 @@ TEST(Backward, intermedia_var_no_grad) {
            std::vector<std::string>({f::GradVarName("out1")}));
  EXPECT_EQ(grad_op4->Output(f::GradVarName("Y")), std::vector<std::string>());

-  EXPECT_EQ(var_to_grad.size(), 3UL);
+  EXPECT_EQ(var_to_grad.size(), 4UL);
  EXPECT_EQ(var_to_grad.at("x1"), f::GradVarInfo(f::GradVarName("x1"), 0, 6));
  EXPECT_EQ(var_to_grad.at("b1"), f::GradVarInfo(f::GradVarName("b1"), 0, 6));
  EXPECT_EQ(var_to_grad.at("out1"),
@@ -678,8 +711,7 @@ TEST(Backward, intermedia_var_no_grad) {
 }

 TEST(Backward, var_no_grad) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
  f::BlockDescBind *block = program.Block(0);
  f::OpDescBind *op1 = block->AppendOp();
  op1->SetType("mult_in_out");
@@ -744,7 +776,7 @@ TEST(Backward, var_no_grad) {
  EXPECT_EQ(grad_op1->Output(f::GradVarName("H")),
            std::vector<std::string>({f::GradVarName("h1")}));

-  EXPECT_EQ(var_to_grad.size(), 3UL);
+  EXPECT_EQ(var_to_grad.size(), 4UL);
  EXPECT_EQ(var_to_grad.at("y1"), f::GradVarInfo(f::GradVarName("y1"), 0, 3));
  EXPECT_EQ(var_to_grad.at("x1"), f::GradVarInfo(f::GradVarName("x1"), 0, 5));
  EXPECT_EQ(var_to_grad.at("h1"), f::GradVarInfo(f::GradVarName("h1"), 0, 5));
@@ -755,8 +787,7 @@ TEST(Backward, var_no_grad) {
 }

 TEST(Backward, shared_var) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
  f::BlockDescBind *block = program.Block(0);
  f::OpDescBind *op1 = block->AppendOp();
  op1->SetType("rowwise_add");
@@ -830,7 +861,7 @@ TEST(Backward, shared_var) {
  EXPECT_EQ(grad_op1->Output(f::GradVarName("b")),
            std::vector<std::string>({f::GradVarName("b1")}));

-  EXPECT_EQ(var_to_grad.size(), 5UL);
+  EXPECT_EQ(var_to_grad.size(), 6UL);
  EXPECT_EQ(var_to_grad.at("b3"), f::GradVarInfo(f::GradVarName("b3"), 0, 4));
  EXPECT_EQ(var_to_grad.at("y2"), f::GradVarInfo(f::GradVarName("y2"), 0, 5));
  EXPECT_EQ(var_to_grad.at("out1"),
@@ -846,8 +877,7 @@ TEST(Backward, shared_var) {
 }

 TEST(Backward, half_backward) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
  f::BlockDescBind *block = program.Block(0);
  auto *op1 = block->AppendOp();
  op1->SetType("minus");
@@ -863,7 +893,7 @@ TEST(Backward, half_backward) {
  auto ops = block->AllOps();
  ASSERT_EQ(3UL, ops.size());

-  EXPECT_EQ(var_to_grad.size(), 1UL);
+  EXPECT_EQ(var_to_grad.size(), 2UL);
  EXPECT_EQ(var_to_grad.at("a"),
            f::GradVarInfo(f::GradVarName("a"), 0, forward_len + 1));
 }
--- a/paddle/framework/block_desc.cc
+++ b/paddle/framework/block_desc.cc
@@ -19,11 +19,11 @@ namespace paddle {
 namespace framework {

 VarDescBind *BlockDescBind::Var(const std::string &name) {
-  need_update_ = true;
  auto it = vars_.find(name);
  if (it != vars_.end()) {
    return it->second.get();
  }
+  need_update_ = true;
  auto *var = new VarDescBind(name);
  vars_[name].reset(var);
  return var;
@@ -41,6 +41,19 @@ bool BlockDescBind::HasVar(const std::string &name) const {
  return vars_.find(name) != vars_.end();
 }

+VarDescBind *BlockDescBind::FindVarRecursive(const std::string &name) const {
+  auto it = vars_.find(name);
+  if (it == vars_.end()) {
+    return Parent() == kNoneBlockIndex ? nullptr
+                                       : ParentBlock()->FindVarRecursive(name);
+  }
+  return it->second.get();
+}
+
+bool BlockDescBind::HasVarRecursive(const std::string &name) const {
+  return FindVarRecursive(name) != nullptr;
+}
+
 std::vector<VarDescBind *> BlockDescBind::AllVars() const {
  std::vector<VarDescBind *> res;
  for (const auto &p : vars_) {
@@ -55,6 +68,11 @@ OpDescBind *BlockDescBind::AppendOp() {
  return ops_.back().get();
 }

+void BlockDescBind::AppendAllocatedOp(std::unique_ptr<OpDescBind> &&op_desc) {
+  need_update_ = true;
+  ops_.emplace_back(std::move(op_desc));
+}
+
 OpDescBind *BlockDescBind::PrependOp() {
  need_update_ = true;
  ops_.emplace_front(new OpDescBind());
@@ -70,15 +88,19 @@ std::vector<OpDescBind *> BlockDescBind::AllOps() const {
 }

 void BlockDescBind::Flush() {
+  for (auto &op_desc : ops_) {
+    op_desc->Flush();
+  }
+
  if (need_update_) {
    auto &op_field = *this->desc_->mutable_ops();
-    op_field.Clear();
+    this->ClearPBOps();
    op_field.Reserve(static_cast<int>(ops_.size()));
    for (auto &op_desc : ops_) {
      op_field.AddAllocated(op_desc->Proto());
    }
    auto &var_field = *this->desc_->mutable_vars();
-    var_field.Clear();
+    this->ClearPBVars();
    var_field.Reserve(static_cast<int>(vars_.size()));
    for (auto &var_desc : vars_) {
      var_field.AddAllocated(var_desc.second->Proto());
@@ -88,7 +110,7 @@ void BlockDescBind::Flush() {
 }

 BlockDescBind *BlockDescBind::ParentBlock() const {
-  if (this->desc_->parent_idx() == -1) {
+  if (this->desc_->parent_idx() == kNoneBlockIndex) {
    return nullptr;
  }
  return prog_->Block(static_cast<size_t>(this->desc_->parent_idx()));
@@ -98,6 +120,35 @@ BlockDesc *BlockDescBind::Proto() {
  Flush();
  return desc_;
 }
+BlockDescBind::BlockDescBind(const BlockDescBind &other, BlockDesc *desc,
+                             ProgramDescBind *prog)
+    : prog_(prog), desc_(desc) {
+  need_update_ = true;
+  for (auto &op : other.ops_) {
+    ops_.emplace_back(new OpDescBind(*op));
+  }
+
+  for (auto &it : other.vars_) {
+    auto *var = new VarDescBind(*it.second);
+    vars_[it.first].reset(var);
+  }
+}
+
+void BlockDescBind::ClearPBOps() {
+  auto ops = this->desc_->mutable_ops();
+  while (!ops->empty()) {
+    // we do not own the OpDesc, so release the ownership.
+    ops->ReleaseLast();
+  }
+}
+
+void BlockDescBind::ClearPBVars() {
+  auto vars = this->desc_->mutable_vars();
+  while (!vars->empty()) {
+    // we do not own the VarDesc, so release the ownership.
+    vars->ReleaseLast();
+  }
+}

 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/block_desc.h
+++ b/paddle/framework/block_desc.h
@@ -16,9 +16,12 @@ limitations under the License. */

 #include <deque>
 #include <memory>
+#include <set>
 #include <unordered_map>
 #include <vector>
+
 #include "paddle/framework/op_desc.h"
+#include "paddle/framework/proto_desc.h"
 #include "paddle/framework/var_desc.h"
 #include "paddle/platform/macros.h"

@@ -36,6 +39,14 @@ class BlockDescBind {
  BlockDescBind(ProgramDescBind *prog, BlockDesc *desc)
      : prog_(prog), desc_(desc), need_update_(false) {}

+  BlockDescBind(const BlockDescBind &other, BlockDesc *desc,
+                ProgramDescBind *prog);
+
+  ~BlockDescBind() {
+    this->ClearPBVars();
+    this->ClearPBOps();
+  }
+
  int32_t ID() const { return desc_->idx(); }

  int32_t Parent() const { return desc_->parent_idx(); }
@@ -46,23 +57,43 @@ class BlockDescBind {

  bool HasVar(const std::string &var_name) const;

+  VarDescBind *FindVarRecursive(const std::string &name_bytes) const;
+
+  bool HasVarRecursive(const std::string &var_name) const;
+
+  std::set<std::string> LocalVarNames() const {
+    std::set<std::string> var_names;
+    for (auto &var : vars_) {
+      var_names.insert(var.first);
+    }
+    return var_names;
+  }
+
  std::vector<VarDescBind *> AllVars() const;

  BlockDescBind *ParentBlock() const;

  OpDescBind *AppendOp();

+  void AppendAllocatedOp(std::unique_ptr<OpDescBind> &&op_desc);
+
  OpDescBind *PrependOp();

  std::vector<OpDescBind *> AllOps() const;

+  size_t OpSize() const { return ops_.size(); }
+
+  OpDescBind *Op(int idx) { return ops_.at(idx).get(); }
+
  void Flush();

  BlockDesc *Proto();

-  // FIXME(yuyang18): backward will access private data of BlockDesc.
-  // Mark it public temporary. We can fix it later.
- public:
+ private:
+  void ClearPBOps();
+  void ClearPBVars();
+
+ private:
  ProgramDescBind *prog_;  // not_own
  BlockDesc *desc_;        // not_own
  bool need_update_;

--- a/paddle/framework/data_type.h
+++ b/paddle/framework/data_type.h
@@ -26,6 +26,8 @@ inline DataType ToDataType(std::type_index type) {
    return DataType::FP64;
  } else if (typeid(int).hash_code() == type.hash_code()) {
    return DataType::INT32;
+  } else if (typeid(int64_t).hash_code() == type.hash_code()) {
+    return DataType::INT64;
  } else {
    PADDLE_THROW("Not supported");
  }

--- a/paddle/framework/executor.cc
+++ b/paddle/framework/executor.cc
@@ -64,99 +64,27 @@ void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id) {
  auto& block = pdesc.blocks(block_id);
  auto& device = device_contexts_[0];

-  // Instantiate all the vars in the global scope
-  for (auto& var : block.vars()) {
-    scope->Var(var.name());
-  }
-
  Scope& local_scope = scope->NewScope();

-  std::vector<bool> should_run = Prune(pdesc, block_id);
-  PADDLE_ENFORCE_EQ(should_run.size(), static_cast<size_t>(block.ops_size()));
-  for (size_t i = 0; i < should_run.size(); ++i) {
-    if (should_run[i]) {
-      for (auto& var : block.ops(i).outputs()) {
-        for (auto& argu : var.arguments()) {
-          if (local_scope.FindVar(argu) == nullptr) {
-            local_scope.Var(argu);
-          }
-        }
-      }
-      auto op = paddle::framework::OpRegistry::CreateOp(block.ops(i));
-      op->Run(local_scope, *device);
-    }
-  }
-
-  // TODO(tonyyang-svail):
-  //  - Destroy local_scope
-}
-
-std::vector<bool> Prune(const ProgramDesc& pdesc, int block_id) {
-  // TODO(tonyyang-svail):
-  //    - will change to use multiple blocks for RNN op and Cond Op
-
-  auto& block = pdesc.blocks(block_id);
-  auto& ops = block.ops();
-
-  bool expect_feed = true;
-  for (auto& op_desc : ops) {
-    PADDLE_ENFORCE(op_desc.type() != kFeedOpType || expect_feed,
-                   "All FeedOps are at the beginning of the ProgramDesc");
-    expect_feed = (op_desc.type() == kFeedOpType);
-  }
-
-  bool expect_fetch = true;
-  for (auto op_iter = ops.rbegin(); op_iter != ops.rend(); ++op_iter) {
-    auto& op_desc = *op_iter;
-    PADDLE_ENFORCE(op_desc.type() != kFetchOpType || expect_fetch,
-                   "All FetchOps must at the end of the ProgramDesc");
-    expect_fetch = (op_desc.type() == kFetchOpType);
-  }
-
-  std::set<std::string> dependent_vars;
-  std::vector<bool> should_run;
-  for (auto op_iter = ops.rbegin(); op_iter != ops.rend(); ++op_iter) {
-    auto& op_desc = *op_iter;
-
-    bool found_dependent_vars = false;
-    for (auto& var : op_desc.outputs()) {
-      for (auto& argu : var.arguments()) {
-        if (dependent_vars.count(argu) != 0) {
-          found_dependent_vars = true;
-        }
-      }
-    }
-
-    if (op_desc.type() == kFetchOpType || found_dependent_vars) {
-      // erase its output to the dependency graph
-      for (auto& var : op_desc.outputs()) {
-        for (auto& argu : var.arguments()) {
-          dependent_vars.erase(argu);
-        }
-      }
-
-      // insert its input to the dependency graph
-      for (auto& var : op_desc.inputs()) {
-        for (auto& argu : var.arguments()) {
-          dependent_vars.insert(argu);
-        }
-      }
-
-      should_run.push_back(true);
+  for (auto& var : block.vars()) {
+    if (var.persistable()) {
+      auto* ptr = scope->Var(var.name());
+      VLOG(3) << "Create Variable " << var.name()
+              << " global, which pointer is " << ptr;
    } else {
-      should_run.push_back(false);
+      auto* ptr = local_scope.Var(var.name());
+      VLOG(3) << "Create Variable " << var.name()
+              << " locally, which pointer is " << ptr;
    }
  }

-  // TODO(tonyyang-svail):
-  //    - check this after integration of Init
-  // PADDLE_ENFORCE(dependent_vars.empty());
-
-  // since we are traversing the ProgramDesc in reverse order
-  // we reverse the should_run vector
-  std::reverse(should_run.begin(), should_run.end());
+  for (auto& op_desc : block.ops()) {
+    auto op = paddle::framework::OpRegistry::CreateOp(
+        op_desc, const_cast<ProgramDesc*>(&pdesc));
+    op->Run(local_scope, *device);
+  }

-  return should_run;
+  scope->DeleteScope(&local_scope);
 }

 }  // namespace framework

--- a/paddle/framework/executor.h
+++ b/paddle/framework/executor.h
@@ -40,16 +40,5 @@ class Executor {
  std::vector<platform::DeviceContext*> device_contexts_;
 };

-/* @Brief
- * Pruning the graph
- *
- * @param
- *  ProgramDesc
- *
- * @return
- *  vector<bool> Same size as ops. Indicates whether an op should be run.
- */
-std::vector<bool> Prune(const ProgramDesc& pdesc, int block_id);
-
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/feed_fetch_method.h
+++ b/paddle/framework/feed_fetch_method.h
@@ -13,37 +13,45 @@ See the License for the specific language governing permissions and
 limitations under the License. */

 #pragma once
+#include "glog/logging.h"
+#include "paddle/framework/feed_fetch_type.h"
 #include "paddle/framework/scope.h"
 #include "paddle/framework/variable.h"

 namespace paddle {
 namespace framework {

-template <typename T>
-void SetFeedVariable(const LoDTensor& input, const std::string& var_name,
-                     size_t index) {
+void SetFeedVariable(Scope* scope, const LoDTensor& input,
+                     const std::string& var_name, size_t index) {
  // If var_name Variable is not found in GlobalScope, a new variable will
  // be created.
-  Variable* g_feed_value = GetGlobalScope().Var(var_name);
+  VLOG(3) << "SetFeedVariable name=" << var_name << " index=" << index;
+  Variable* g_feed_value = scope->Var(var_name);
  auto& feed_inputs =
      *(g_feed_value->GetMutable<std::vector<paddle::framework::LoDTensor>>());
  if (index >= feed_inputs.size()) {
    feed_inputs.resize(index + 1);
  }
  // shared data with input tensor
-  feed_inputs[index].ShareDataWith<T>(input);
+  feed_inputs[index].ShareDataWith(input);
  // set lod
  feed_inputs[index].set_lod(input.lod());
 }

-LoDTensor& GetFetchVariable(const std::string& var_name, size_t index) {
+LoDTensor& GetFetchVariable(const Scope& scope, const std::string& var_name,
+                            size_t index) {
  // Since we want to fetch LodTensor from a variable, the variable must
  // be created alreadly.
-  Variable* g_fetch_value = GetGlobalScope().FindVar(var_name);
-  auto& fetch_outputs =
-      *(g_fetch_value->GetMutable<std::vector<paddle::framework::LoDTensor>>());
+  Variable* g_fetch_value = scope.FindVar(var_name);
+  PADDLE_ENFORCE(g_fetch_value->IsType<FeedFetchList>(),
+                 "Only %s can be invoked by GetFetchVariable",
+                 typeid(FeedFetchList).name());
+  auto& fetch_outputs = *g_fetch_value->GetMutable<FeedFetchList>();
+  auto& tensor = fetch_outputs[index];
+  VLOG(3) << "Fetch " << var_name << " with index " << index
+          << " shape= " << tensor.dims();
  PADDLE_ENFORCE_LT(index, fetch_outputs.size());
-  return fetch_outputs[index];
+  return tensor;
 }

 }  // namespace framework

--- a/paddle/framework/framework.proto
+++ b/paddle/framework/framework.proto
@@ -55,6 +55,7 @@ message OpDesc {
  repeated Var inputs = 1;
  repeated Var outputs = 2;
  repeated Attr attrs = 4;
+  optional bool is_target = 5 [ default = false ];
 };

 // OpProto describes a C++ framework::OperatorBase derived class.
@@ -67,6 +68,7 @@ message OpProto {

    optional bool duplicable = 3 [ default = false ];
    optional bool intermediate = 4 [ default = false ];
+    optional bool dispensable = 5 [ default = false ];
  }

  // AttrProto describes the C++ type Attribute.
@@ -111,6 +113,8 @@ message VarDesc {
  enum VarType {
    LOD_TENSOR = 1;
    SELECTED_ROWS = 2;
+    FEED_MINIBATCH = 3;
+    FETCH_LIST = 4;
  }
  required string name = 1;
  required VarType type = 2;

--- a/paddle/framework/lod_tensor.cc
+++ b/paddle/framework/lod_tensor.cc
@@ -25,31 +25,50 @@ LoD SliceLevels(const LoD& in, size_t level_begin, size_t level_end) {
  for (size_t i = level_begin; i < level_end; i++) {
    new_lod.emplace_back(in.at(i));
  }
+  // transform the lowest level to absolute offset.
+  LoD abs_offset_lod = ToAbsOffset(in);
+  new_lod.back() = abs_offset_lod[level_end - 1];
  return new_lod;
 }

 LoD SliceInLevel(const LoD& in, size_t level, size_t elem_begin,
                 size_t elem_end) {
-  // slice the lod.
-  LoD new_lod;
-  new_lod.reserve(in.size() - level);
-  auto start = in.at(level)[elem_begin];
-  auto end = in.at(level)[elem_end];
-
-  for (auto it = in.begin() + level; it != in.end(); it++) {
-    auto it_begin = std::find(it->begin(), it->end(), start);
-    auto it_end = std::find(it_begin, it->end(), end);
-    PADDLE_ENFORCE(it_begin != it->end(), "error in parsing lod info");
-    PADDLE_ENFORCE(it_end != it->end(), "error in parsing lod info");
-    new_lod.emplace_back(it_begin, it_end + 1);
-    // reset offset if tensor is copyed and sliced.
-    std::transform(new_lod.back().begin(), new_lod.back().end(),
-                   new_lod.back().begin(),
-                   [start](int v) { return v - start; });
-    PADDLE_ENFORCE_EQ(new_lod.back().front(), 0, "error in slice LoD");
+  PADDLE_ENFORCE_LT(level, in.size());
+  PADDLE_ENFORCE_LT(elem_end, in[level].size());
+
+  LoD res;
+  res.resize(in.size() - level);
+  // copy the first level
+  res[0].assign(in[level].begin() + elem_begin,
+                in[level].begin() + elem_end + 1);
+  for (size_t lvl = 1; lvl < res.size(); lvl++) {
+    const auto& in_level = in[level + lvl];
+    const auto& above_level = res[lvl - 1];
+    auto& out_level = res[lvl];
+    out_level.assign(in_level.begin() + above_level.front(),
+                     in_level.begin() + above_level.back() + 1);
  }
-  PADDLE_ENFORCE_LE(new_lod.size(), in.size());
-  return new_lod;
+  for (size_t lvl = 0; lvl < res.size(); lvl++) {
+    // to make the first offset equals 0, all the elements minus the first
+    // element
+    size_t front = res[lvl].front();
+    for (auto& ele : res[lvl]) {
+      ele -= front;
+    }
+  }
+  return res;
+}
+
+LoD ToAbsOffset(const LoD& in) {
+  // the lowest level stores relative offsets
+  if (in.empty() || in.size() == 1) return in;
+  LoD result = in;
+  for (int level = result.size() - 2; level >= 0; level--) {
+    for (auto& ele : result[level]) {
+      ele = result[level + 1][ele];
+    }
+  }
+  return result;
 }

 bool operator==(const LoD& a, const LoD& b) {
@@ -75,17 +94,7 @@ bool operator==(const LoD& a, const LoD& b) {
 size_t LoDTensor::NumElements(size_t level, size_t idx) const {
  PADDLE_ENFORCE_LT(level, NumLevels());
  PADDLE_ENFORCE_LT(idx, NumElements(level));
-  // the last level of LoD, just return number of records in Tensor
-  if (level == NumLevels() - 1) {
-    return lod_[level][idx + 1] - lod_[level][idx];
-  }
-  // high level of LoD, and there is another lower level, return number of
-  // lower-level elements
-  auto tmp = SliceInLevel(lod_, level, idx, idx + 1);
-  PADDLE_ENFORCE_GE(tmp.size(), 2);
-  // there is a 0 as a placeholder stored in LoD, so the number of elements
-  // equals lod.size() - 1
-  return tmp[1].size() - 1;
+  return lod_[level][idx + 1] - lod_[level][idx];
 }

 void LoDTensor::ShrinkLevels(size_t level_begin, size_t level_end) {

--- a/paddle/framework/lod_tensor.h
+++ b/paddle/framework/lod_tensor.h
@@ -39,23 +39,36 @@ using Vector = thrust::host_vector<
 #endif

 /*
- * 3-level LoD stores
+ * LoD is short for Level of Details.
 *
- * 0 10 20
- * 0 5 10 15 20
- * 0 2 5 7 10 12 15 20
- *
- * - in a level, each element indicates offset in the underlying Tensor
+ * - in a level, each element indicates relative offset of the lower level
 * - the first element should be 0 and that indicates that this sequence start
 * from 0
 * - each sequence's begin and end(no-inclusive) is level[id, id+1]
+ *
+ * For example:
+ *    3-level LoD stores
+ *
+ *    0 2 3
+ *    0 2 4 7
+ *    0 2 5 7 10 12 15 20
 */
 using LoD = std::vector<Vector<size_t>>;

+/*
+ * Slice levels from a LoD.
+ * NOTE the lowest level should always be the absolute offsets of the underlying
+ * tensor instances. So if higher layers are sliced without the lowest level,
+ * the lower level of the sliced LoD will be transformed to the absolute offset.
+ */
 LoD SliceLevels(const LoD& in, size_t level_begin, size_t level_end);

 LoD SliceInLevel(const LoD& in, size_t level, size_t elem_begin,
                 size_t elem_end);
+/*
+ * Transform an LoD from relative offsets to absolute offsets.
+ */
+LoD ToAbsOffset(const LoD& in);

 bool operator==(const LoD& a, const LoD& b);

@@ -74,12 +87,12 @@ class LoDTensor : public Tensor {
  LoD lod() const { return lod_; }

  /*
-   * Get a element from LoD.
+   * Get the start offset and end offset of an  element from LoD.
   */
-  size_t lod_element(size_t level, size_t elem) const {
+  std::pair<size_t, size_t> lod_element(size_t level, size_t elem) const {
    PADDLE_ENFORCE_LT(level, NumLevels());
    PADDLE_ENFORCE_LT(elem, NumElements(level));
-    return (lod_)[level][elem];
+    return std::make_pair((lod_)[level][elem], (lod_)[level][elem + 1]);
  }

  /*

--- a/paddle/framework/lod_tensor_test.cc
+++ b/paddle/framework/lod_tensor_test.cc
@@ -30,8 +30,8 @@ class LoDTensorTester : public ::testing::Test {
    // 0 5 10 15 20
    // 0 2 5 7 10 12 15 20
    LoD lod;
-    lod.push_back(std::vector<size_t>{0, 10, 20});
-    lod.push_back(std::vector<size_t>{0, 5, 10, 15, 20});
+    lod.push_back(std::vector<size_t>{0, 2, 3});
+    lod.push_back(std::vector<size_t>{0, 2, 5, 8});
    lod.push_back(std::vector<size_t>{0, 2, 5, 7, 10, 12, 15, 17, 20});

    ASSERT_EQ(lod.size(), 3UL);
@@ -52,14 +52,14 @@ TEST_F(LoDTensorTester, NumLevels) { ASSERT_EQ(lod_tensor_.NumLevels(), 3UL); }

 TEST_F(LoDTensorTester, NumElements) {
  ASSERT_EQ(lod_tensor_.NumElements(0), 2UL);
-  ASSERT_EQ(lod_tensor_.NumElements(1), 4UL);
+  ASSERT_EQ(lod_tensor_.NumElements(1), 3UL);
  ASSERT_EQ(lod_tensor_.NumElements(2), 8UL);
 }

 TEST_F(LoDTensorTester, NumElements2) {
  ASSERT_EQ(lod_tensor_.NumElements(0, 0), 2UL);
-  ASSERT_EQ(lod_tensor_.NumElements(0, 1), 2UL);
-  ASSERT_EQ(lod_tensor_.NumElements(1, 1), 2UL);
+  ASSERT_EQ(lod_tensor_.NumElements(0, 1), 1UL);
+  ASSERT_EQ(lod_tensor_.NumElements(1, 1), 3UL);
 }

 TEST_F(LoDTensorTester, ShrinkLevels) {
@@ -68,17 +68,16 @@ TEST_F(LoDTensorTester, ShrinkLevels) {
    LoDTensor new_lod_tensor = lod_tensor_;
    new_lod_tensor.ShrinkLevels(level, level + 1);
    ASSERT_EQ(new_lod_tensor.NumLevels(), 1UL);
-    ASSERT_EQ(new_lod_tensor.NumElements(0), lod_tensor_.NumElements(level));
    ASSERT_EQ(new_lod_tensor.data<float>(), lod_tensor_.data<float>());
  }
  // shrink 2 level
  for (size_t level = 0; level < 2UL; ++level) {
    LoDTensor new_lod_tensor = lod_tensor_;
    new_lod_tensor.ShrinkLevels(level, level + 2);
+    // the lowest level's last element should be the tensor's batch_size.
+    ASSERT_EQ(new_lod_tensor.lod().back().back(),
+              lod_tensor_.lod().back().back());
    ASSERT_EQ(new_lod_tensor.NumLevels(), 2UL);
-    ASSERT_EQ(new_lod_tensor.NumElements(0), lod_tensor_.NumElements(level));
-    ASSERT_EQ(new_lod_tensor.NumElements(1),
-              lod_tensor_.NumElements(level + 1));
    ASSERT_EQ(new_lod_tensor.data<float>(), lod_tensor_.data<float>());
  }
 }
@@ -86,19 +85,19 @@ TEST_F(LoDTensorTester, ShrinkLevels) {
 TEST_F(LoDTensorTester, ShrinkInLevel) {
  size_t level = 0;
  LoDTensor new_lod_tensor = lod_tensor_;
-  new_lod_tensor.ShrinkInLevel(level, 0, 2);
+  new_lod_tensor.ShrinkInLevel(level, 0, 1);
  EXPECT_EQ(new_lod_tensor.NumLevels(), 3UL);
-  EXPECT_EQ(new_lod_tensor.NumElements(0), 2UL);
-  EXPECT_EQ(new_lod_tensor.NumElements(1), 4UL);
-  EXPECT_EQ(new_lod_tensor.NumElements(2), 8UL);
+  EXPECT_EQ(new_lod_tensor.NumElements(0), 1UL);
+  EXPECT_EQ(new_lod_tensor.NumElements(1), 2UL);
+  EXPECT_EQ(new_lod_tensor.NumElements(2), 5UL);
  ASSERT_EQ(new_lod_tensor.data<float>(), lod_tensor_.data<float>());

  level = 1;
  new_lod_tensor = lod_tensor_;
-  new_lod_tensor.ShrinkInLevel(level, 0, 2);
+  new_lod_tensor.ShrinkInLevel(level, 1, 2);
  ASSERT_EQ(new_lod_tensor.NumLevels(), 2UL);
-  ASSERT_EQ(new_lod_tensor.NumElements(0), 2UL);
-  ASSERT_EQ(new_lod_tensor.NumElements(1), 4UL);
+  ASSERT_EQ(new_lod_tensor.NumElements(0), 1UL);
+  ASSERT_EQ(new_lod_tensor.NumElements(1), 3UL);
  ASSERT_EQ(new_lod_tensor.data<float>(), lod_tensor_.data<float>());
 }


--- a/paddle/framework/lod_tensor_test.cu
+++ b/paddle/framework/lod_tensor_test.cu
@@ -36,8 +36,8 @@ TEST(LoDTensor, LoDInGPU) {
  lod_tensor.mutable_data<float>(place);

  lod_tensor.set_lod(src_lod);
-  CHECK_EQ(lod_tensor.lod_element(0, 2), 4UL);
-  CHECK_EQ(lod_tensor.lod_element(0, 4), 8UL);
+  CHECK_EQ(lod_tensor.lod_element(0, 2).first, 4UL);
+  CHECK_EQ(lod_tensor.lod_element(0, 4).first, 8UL);

  auto lod = lod_tensor.lod();


--- a/paddle/framework/op_proto_maker.h
+++ b/paddle/framework/op_proto_maker.h
@@ -44,6 +44,11 @@ class OpProtoAndCheckerMaker {
      var_->set_intermediate(true);
      return *this;
    }
+
+    VariableBuilder& AsDispensable() {
+      var_->set_dispensable(true);
+      return *this;
+    }
  };

  VariableBuilder AddInput(const std::string& name, const std::string& comment);

--- a/paddle/framework/op_registry.cc
+++ b/paddle/framework/op_registry.cc
@@ -43,12 +43,13 @@ static VariableNameMap ConvertOpDescVarsToVarNameMap(
  return ret_val;
 }

-std::unique_ptr<OperatorBase> OpRegistry::CreateOp(const OpDesc& op_desc) {
+std::unique_ptr<OperatorBase> OpRegistry::CreateOp(const OpDesc& op_desc,
+                                                   ProgramDesc* program) {
  VariableNameMap inputs = ConvertOpDescVarsToVarNameMap(op_desc.inputs());
  VariableNameMap outputs = ConvertOpDescVarsToVarNameMap(op_desc.outputs());
  AttributeMap attrs;
  for (auto& attr : op_desc.attrs()) {
-    attrs[attr.name()] = GetAttrValue(attr);
+    attrs[attr.name()] = GetAttrValue(attr, program);
  }

  return CreateOp(op_desc.type(), inputs, outputs, attrs);

--- a/paddle/framework/op_registry.h
+++ b/paddle/framework/op_registry.h
@@ -20,6 +20,8 @@ limitations under the License. */
 #include <typeinfo>
 #include <unordered_map>
 #include <unordered_set>
+
+#include "glog/logging.h"  // For VLOG()
 #include "paddle/framework/attribute.h"
 #include "paddle/framework/details/op_registry.h"
 #include "paddle/framework/framework.pb.h"
@@ -45,18 +47,15 @@ class Registrar {

 template <typename... ARGS>
 struct OperatorRegistrar : public Registrar {
-  explicit OperatorRegistrar(const char* op_type) : op_type(op_type) {
+  explicit OperatorRegistrar(const char* op_type) {
    PADDLE_ENFORCE(!OpInfoMap::Instance().Has(op_type),
                   "'%s' is registered more than once.", op_type);
    static_assert(sizeof...(ARGS) != 0,
                  "OperatorRegistrar should be invoked at least by OpClass");
+    OpInfo info;
    details::OperatorRegistrarRecursive<0, false, ARGS...>(op_type, &info);
    OpInfoMap::Instance().Insert(op_type, info);
  }
-
-  const char* op_type;
-
-  OpInfo info;
 };

 class OpRegistry {
@@ -77,7 +76,8 @@ class OpRegistry {
                                                const VariableNameMap& outputs,
                                                AttributeMap attrs);

-  static std::unique_ptr<OperatorBase> CreateOp(const OpDesc& op_desc);
+  static std::unique_ptr<OperatorBase> CreateOp(const OpDesc& op_desc,
+                                                ProgramDesc* program);

  static std::unique_ptr<OperatorBase> CreateOp(const OpDescBind& op_desc);
 };

--- a/paddle/framework/op_registry_test.cc
+++ b/paddle/framework/op_registry_test.cc
@@ -74,7 +74,7 @@ TEST(OpRegistry, CreateOp) {
  attr->set_type(paddle::framework::AttrType::FLOAT);
  attr->set_f(scale);

-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
  paddle::framework::Scope scope;
  paddle::platform::CPUDeviceContext dev_ctx;
  op->Run(scope, dev_ctx);
@@ -95,7 +95,7 @@ TEST(OpRegistry, IllegalAttr) {

  bool caught = false;
  try {
-    paddle::framework::OpRegistry::CreateOp(op_desc);
+    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
  } catch (paddle::platform::EnforceNotMet err) {
    caught = true;
    std::string msg = "larger_than check fail";
@@ -115,7 +115,7 @@ TEST(OpRegistry, DefaultValue) {

  ASSERT_TRUE(op_desc.IsInitialized());

-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
  paddle::framework::Scope scope;
  paddle::platform::CPUDeviceContext dev_ctx;
  op->Run(scope, dev_ctx);
@@ -131,7 +131,7 @@ TEST(OpRegistry, CustomChecker) {
  // attr 'test_attr' is not set
  bool caught = false;
  try {
-    paddle::framework::OpRegistry::CreateOp(op_desc);
+    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
  } catch (paddle::platform::EnforceNotMet err) {
    caught = true;
    std::string msg = "Attribute 'test_attr' is required!";
@@ -149,7 +149,7 @@ TEST(OpRegistry, CustomChecker) {
  attr->set_i(3);
  caught = false;
  try {
-    paddle::framework::OpRegistry::CreateOp(op_desc);
+    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
  } catch (paddle::platform::EnforceNotMet err) {
    caught = true;
    std::string msg = "'test_attr' must be even!";
@@ -166,7 +166,7 @@ TEST(OpRegistry, CustomChecker) {
  attr->set_name("test_attr");
  attr->set_type(paddle::framework::AttrType::INT);
  attr->set_i(4);
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
  paddle::platform::CPUDeviceContext dev_ctx;
  paddle::framework::Scope scope;
  op->Run(scope, dev_ctx);

--- a/paddle/framework/operator.cc
+++ b/paddle/framework/operator.cc
@@ -252,5 +252,20 @@ std::ostream& operator<<(std::ostream& os,
  return os;
 }

+bool OpSupportGPU(const std::string& op_type) {
+  auto& all_kernels = OperatorWithKernel::AllOpKernels();
+  auto it = all_kernels.find(op_type);
+  if (it == all_kernels.end()) {
+    // All control operator must support GPU
+    return true;
+  }
+  for (auto& kern_pair : it->second) {
+    if (platform::is_gpu_place(kern_pair.first.place_)) {
+      return true;
+    }
+  }
+  return false;
+}
+
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/operator.h
+++ b/paddle/framework/operator.h
@@ -20,12 +20,13 @@ limitations under the License. */
 #include <unordered_map>
 #include <vector>

-#include "op_info.h"
+#include "glog/logging.h"  // For VLOG
 #include "paddle/framework/attribute.h"
 #include "paddle/framework/block_desc.h"
 #include "paddle/framework/data_type.h"
 #include "paddle/framework/framework.pb.h"
 #include "paddle/framework/lod_tensor.h"
+#include "paddle/framework/op_info.h"
 #include "paddle/framework/scope.h"
 #include "paddle/framework/shape_inference.h"
 #include "paddle/framework/tensor.h"
@@ -326,37 +327,47 @@ class CompileTimeInferShapeContext : public InferShapeContext {
  bool HasInput(const std::string& name) const override {
    const std::vector<std::string>& input_names = op_.Input(name);
    auto length = input_names.size();
+    if (length == 0) {
+      return false;
+    }
    PADDLE_ENFORCE_EQ(length, 1UL,
                      "Input(%s) should have only one value, "
                      "but it have %d now",
                      name, length);
-    return block_.HasVar(input_names[0]);
+    return block_.HasVarRecursive(input_names[0]);
  }

  bool HasOutput(const std::string& name) const override {
    const std::vector<std::string>& output_names = op_.Output(name);
    auto length = output_names.size();
+    if (length == 0) {
+      return false;
+    }
    PADDLE_ENFORCE_EQ(length, 1UL,
                      "Output(%s) should have only one value, "
                      "but it have %d now",
                      name, length);
-    return block_.HasVar(output_names[0]);
+    return block_.HasVarRecursive(output_names[0]);
  }

  bool HasInputs(const std::string& name) const override {
    const std::vector<std::string>& input_names = op_.Input(name);
-    PADDLE_ENFORCE(!input_names.empty(), "Inputs(%s) length is 0", name);
+    if (input_names.empty()) {
+      return false;
+    }
    for (auto& input : input_names) {
-      if (!block_.HasVar(input)) return false;
+      if (!block_.HasVarRecursive(input)) return false;
    }
    return true;
  }

  bool HasOutputs(const std::string& name) const override {
    const std::vector<std::string>& output_names = op_.Output(name);
-    PADDLE_ENFORCE(!output_names.empty(), "Inputs(%s) length is 0", name);
+    if (output_names.empty()) {
+      return false;
+    }
    for (auto& output : output_names) {
-      if (!block_.HasVar(output)) return false;
+      if (!block_.HasVarRecursive(output)) return false;
    }
    return true;
  }
@@ -403,11 +414,11 @@ class CompileTimeInferShapeContext : public InferShapeContext {

 private:
  DDim GetDim(const std::string& name) const override {
-    return framework::make_ddim(block_.FindVar(name)->Shape());
+    return framework::make_ddim(block_.FindVarRecursive(name)->Shape());
  }

  void SetDim(const std::string& name, const DDim& dim) override {
-    block_.FindVar(name)->SetShape(framework::vectorize(dim));
+    block_.FindVarRecursive(name)->SetShape(framework::vectorize(dim));
  }

  const OpDescBind& op_;
@@ -420,13 +431,27 @@ class RuntimeInferShapeContext : public InferShapeContext {
      : op_(op), scope_(scope) {}

  bool HasInput(const std::string& name) const override {
-    auto ipt = op_.Input(name);
+    auto& ins = Inputs(name);
+    size_t length = ins.size();
+    if (length == 0) {
+      return false;
+    }
+    PADDLE_ENFORCE_EQ(length, 1UL, "Input %s should have more than one inputs",
+                      name);
+    auto ipt = ins[0];
    auto* var = ipt == kEmptyVarName ? nullptr : scope_.FindVar(ipt);
    return var != nullptr;
  }

  bool HasOutput(const std::string& name) const override {
-    auto ipt = op_.Output(name);
+    auto& outs = Outputs(name);
+    size_t length = outs.size();
+    if (length == 0) {
+      return false;
+    }
+    PADDLE_ENFORCE_EQ(length, 1UL, "Output %s should have more than one inputs",
+                      name);
+    auto ipt = outs[0];
    auto* var = ipt == kEmptyVarName ? nullptr : scope_.FindVar(ipt);
    return var != nullptr;
  }
@@ -573,6 +598,7 @@ class OperatorWithKernel : public OperatorBase {

  void Run(const Scope& scope,
           const platform::DeviceContext& dev_ctx) const final {
+    VLOG(3) << "Running operator " << this->Type();
    RuntimeInferShapeContext infer_shape_ctx(*this, scope);
    this->InferShape(&infer_shape_ctx);

@@ -647,5 +673,7 @@ class OperatorWithKernel : public OperatorBase {
 std::ostream& operator<<(std::ostream& os,
                         const OperatorWithKernel::OpKernelKey& kernel_key);

+extern bool OpSupportGPU(const std::string& op_type);
+
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/operator_test.cc
+++ b/paddle/framework/operator_test.cc
@@ -83,7 +83,7 @@ TEST(OperatorBase, all) {
  paddle::platform::CPUDeviceContext device_context;
  paddle::framework::Scope scope;

-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
  scope.Var("OUT1");
  ASSERT_EQ(paddle::framework::op_run_num, 0);
  op->Run(scope, device_context);
@@ -208,7 +208,7 @@ TEST(OpKernel, all) {
  paddle::platform::CPUDeviceContext cpu_device_context;
  paddle::framework::Scope scope;

-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
  ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 0);
  op->Run(scope, cpu_device_context);
  ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 1);
@@ -244,7 +244,7 @@ TEST(OpKernel, multi_inputs) {
  scope.Var("y0")->GetMutable<Tensor>();
  scope.Var("y1")->GetMutable<Tensor>();

-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
  op->Run(scope, cpu_device_context);
 }


--- a/paddle/framework/program_desc.cc
+++ b/paddle/framework/program_desc.cc
@@ -18,27 +18,10 @@ limitations under the License. */
 namespace paddle {
 namespace framework {

-using ProgDescMap =
-    std::unordered_map<ProgramDesc *, std::unique_ptr<ProgramDescBind>>;
-static ProgDescMap *g_bind_map = nullptr;
-
-ProgramDescBind &ProgramDescBind::Instance(ProgramDesc *prog) {
-  if (g_bind_map == nullptr) {
-    g_bind_map = new ProgDescMap();
-  }
-  auto &map = *g_bind_map;
-  auto &ptr = map[prog];
-
-  if (ptr == nullptr) {
-    ptr.reset(new ProgramDescBind(prog));
-  }
-  return *ptr;
-}
-
 BlockDescBind *ProgramDescBind::AppendBlock(const BlockDescBind &parent) {
-  auto *b = prog_->add_blocks();
+  auto *b = prog_.add_blocks();
  b->set_parent_idx(parent.ID());
-  b->set_idx(prog_->blocks_size() - 1);
+  b->set_idx(prog_.blocks_size() - 1);
  blocks_.emplace_back(new BlockDescBind(this, b));
  return blocks_.back().get();
 }
@@ -47,13 +30,22 @@ ProgramDesc *ProgramDescBind::Proto() {
  for (auto &block : blocks_) {
    block->Flush();
  }
-  return prog_;
+  return &prog_;
+}
+
+ProgramDescBind::ProgramDescBind() {
+  auto *block = prog_.mutable_blocks()->Add();
+  block->set_idx(kRootBlockIndex);
+  block->set_parent_idx(kNoneBlockIndex);
+  blocks_.emplace_back(new BlockDescBind(this, block));
 }

-ProgramDescBind::ProgramDescBind(ProgramDesc *prog) {
-  prog_ = prog;
-  for (auto &block : *prog->mutable_blocks()) {
-    blocks_.emplace_back(new BlockDescBind(this, &block));
+ProgramDescBind::ProgramDescBind(const ProgramDescBind &o) {
+  prog_ = o.prog_;
+
+  for (int i = 0; i < prog_.blocks_size(); ++i) {
+    auto *block = prog_.mutable_blocks(i);
+    blocks_.emplace_back(new BlockDescBind(*o.blocks_[i], block, this));
  }
 }
 }  // namespace framework

--- a/paddle/framework/program_desc.h
+++ b/paddle/framework/program_desc.h
@@ -17,6 +17,7 @@ limitations under the License. */
 #include <memory>
 #include <vector>
 #include "paddle/framework/framework.pb.h"
+#include "paddle/framework/proto_desc.h"
 #include "paddle/platform/macros.h"

 namespace paddle {
@@ -26,7 +27,9 @@ class BlockDescBind;

 class ProgramDescBind {
 public:
-  static ProgramDescBind &Instance(ProgramDesc *prog);
+  ProgramDescBind();
+
+  ProgramDescBind(const ProgramDescBind &o);

  BlockDescBind *AppendBlock(const BlockDescBind &parent);

@@ -37,14 +40,9 @@ class ProgramDescBind {
  ProgramDesc *Proto();

 private:
-  explicit ProgramDescBind(ProgramDesc *prog);
-
-  // Not owned
-  ProgramDesc *prog_;
+  ProgramDesc prog_;

  std::vector<std::unique_ptr<BlockDescBind>> blocks_;
-
-  DISABLE_COPY_AND_ASSIGN(ProgramDescBind);
 };
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/program_desc_test.cc
+++ b/paddle/framework/program_desc_test.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#include "paddle/framework/program_desc.h"
+#include "gtest/gtest.h"
+#include "paddle/framework/block_desc.h"
+
+namespace paddle {
+namespace framework {
+TEST(ProgramDesc, copy_ctor) {
+  ProgramDescBind program;
+  auto* global_block = program.Block(0);
+  auto* x = global_block->Var("X");
+  x->SetType(VarDesc_VarType_LOD_TENSOR);
+  x->SetLoDLevel(0);
+  x->SetDataType(FP32);
+  x->SetShape({1000, 784});
+
+  auto* y = global_block->Var("Y");
+  y->SetType(VarDesc_VarType_LOD_TENSOR);
+  y->SetLoDLevel(0);
+  y->SetDataType(FP32);
+  y->SetShape({784, 100});
+
+  auto* op = global_block->AppendOp();
+  op->SetType("mul");
+  op->SetInput("X", {x->Name()});
+  op->SetInput("Y", {y->Name()});
+
+  auto* out = global_block->Var("Out");
+  out->SetType(VarDesc_VarType_LOD_TENSOR);
+  op->SetOutput("Y", {out->Name()});
+
+  ProgramDescBind program_copy(program);
+
+  auto* global_block_copy = program_copy.Block(0);
+  ASSERT_NE(global_block, global_block_copy);
+
+  auto assert_same_var = [&](const std::string& name, VarDescBind* var_before) {
+    ASSERT_TRUE(global_block_copy->HasVar(name));
+    auto* copy = global_block_copy->Var(name);
+    ASSERT_NE(copy, var_before);
+    ASSERT_EQ(copy->Name(), var_before->Name());
+    ASSERT_EQ(copy->GetType(), var_before->GetType());
+    ASSERT_EQ(copy->Shape(), var_before->Shape());
+    ASSERT_EQ(copy->Proto()->SerializeAsString(),
+              var_before->Proto()->SerializeAsString());
+  };
+
+  ASSERT_EQ(global_block->LocalVarNames(), global_block_copy->LocalVarNames());
+  ASSERT_EQ(3, global_block_copy->LocalVarNames().size());
+  assert_same_var("X", x);
+  assert_same_var("Y", y);
+  assert_same_var("Out", out);
+
+  for (size_t i = 0; i < global_block->OpSize(); ++i) {
+    auto op_origin = global_block->Op(i);
+    auto op_copy = global_block->Op(i);
+
+    ASSERT_EQ(op_origin->Type(), op_copy->Type());
+    ASSERT_EQ(op_origin->Inputs(), op_copy->Inputs());
+    ASSERT_EQ(op_origin->Outputs(), op_copy->Outputs());
+
+    ASSERT_EQ(op_copy->Proto()->SerializeAsString(),
+              op_origin->Proto()->SerializeAsString());
+  }
+
+  // Not check block's protostr are same it because the order of vars could be
+  // different and it is correct.
+}
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/framework/proto_desc.h
+++ b/paddle/framework/proto_desc.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+
+namespace paddle {
+namespace framework {
+
+// The Index of first Block in Program. also called root block.
+constexpr int kRootBlockIndex = 0;
+// The Parent Index of root Block, this block does not exist.
+constexpr int kNoneBlockIndex = -1;
+
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/framework/prune.cc
+++ b/paddle/framework/prune.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/framework/prune.h"
+
+#include <algorithm>
+#include <set>
+#include <string>
+#include <vector>
+
+#include <glog/logging.h>
+
+namespace paddle {
+namespace framework {
+
+const std::string kFeedOpType = "feed";
+const std::string kFetchOpType = "fetch";
+
+bool HasDependentVar(const OpDesc& op_desc,
+                     const std::set<std::string>& dependent_vars) {
+  for (auto& var : op_desc.outputs()) {
+    for (auto& argu : var.arguments()) {
+      if (dependent_vars.count(argu) != 0) {
+        return true;
+      }
+    }
+  }
+  return false;
+}
+
+bool IsTarget(const OpDesc& op_desc) {
+  if (op_desc.has_is_target()) {
+    return op_desc.is_target();
+  }
+  return false;
+}
+
+void prune_impl(const ProgramDesc& input, ProgramDesc& output, int block_id) {
+  // TODO(tonyyang-svail):
+  //    - will change to use multiple blocks for RNN op and Cond Op
+
+  auto& block = input.blocks(block_id);
+  auto& ops = block.ops();
+
+  bool expect_feed = true;
+  for (auto& op_desc : ops) {
+    PADDLE_ENFORCE(op_desc.type() != kFeedOpType || expect_feed,
+                   "All FeedOps are at the beginning of the ProgramDesc");
+    expect_feed = (op_desc.type() == kFeedOpType);
+  }
+
+  bool expect_fetch = true;
+  for (auto op_iter = ops.rbegin(); op_iter != ops.rend(); ++op_iter) {
+    auto& op_desc = *op_iter;
+    PADDLE_ENFORCE(op_desc.type() != kFetchOpType || expect_fetch,
+                   "All FetchOps must at the end of the ProgramDesc");
+    expect_fetch = (op_desc.type() == kFetchOpType);
+  }
+
+  std::set<std::string> dependent_vars;
+  std::vector<bool> should_run;
+  for (auto op_iter = ops.rbegin(); op_iter != ops.rend(); ++op_iter) {
+    auto& op_desc = *op_iter;
+
+    if (IsTarget(op_desc) || HasDependentVar(op_desc, dependent_vars)) {
+      // insert its input to the dependency graph
+      for (auto& var : op_desc.inputs()) {
+        for (auto& argu : var.arguments()) {
+          dependent_vars.insert(argu);
+        }
+      }
+
+      should_run.push_back(true);
+    } else {
+      should_run.push_back(false);
+    }
+  }
+
+  // since we are traversing the ProgramDesc in reverse order
+  // we reverse the should_run vector
+  std::reverse(should_run.begin(), should_run.end());
+
+  output = input;
+  auto* op_field = output.mutable_blocks(block_id)->mutable_ops();
+  op_field->Clear();
+  for (size_t i = 0; i < should_run.size(); ++i) {
+    if (should_run[i]) {
+      *op_field->Add() = input.blocks(block_id).ops(i);
+    }
+  }
+}
+
+void Prune(const ProgramDesc& input, ProgramDesc& output) {
+  prune_impl(input, output, 0);
+}
+
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/framework/prune.h
+++ b/paddle/framework/prune.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+
+#include "paddle/framework/framework.pb.h"
+#include "paddle/platform/enforce.h"
+
+namespace paddle {
+namespace framework {
+
+void Prune(const ProgramDesc& input, ProgramDesc& output);
+
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/framework/prune_test.cc
+++ b/paddle/framework/prune_test.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#include "paddle/framework/prune.h"
+
+#include "paddle/framework/attribute.h"
+#include "paddle/framework/operator.h"
+#include "paddle/operators/net_op.h"
+
+#include "paddle/framework/block_desc.h"
+#include "paddle/framework/op_desc.h"
+#include "paddle/framework/program_desc.h"
+
+#include <gtest/gtest.h>
+
+namespace f = paddle::framework;
+namespace ops = paddle::operators;
+
+void AddOp(const std::string &type, const f::VariableNameMap &inputs,
+           const f::VariableNameMap &outputs, f::AttributeMap attrs,
+           paddle::framework::BlockDescBind *block) {
+  // insert output
+  for (auto kv : outputs) {
+    for (auto v : kv.second) {
+      auto var = block->Var(v);
+      var->SetDataType(paddle::framework::DataType::FP32);
+    }
+  }
+
+  // insert op
+  auto op = block->AppendOp();
+  op->SetType(type);
+  for (auto &kv : inputs) {
+    op->SetInput(kv.first, kv.second);
+  }
+  for (auto &kv : outputs) {
+    op->SetOutput(kv.first, kv.second);
+  }
+  op->SetAttrMap(attrs);
+}
+
+TEST(Prune, one_operator) {
+  f::ProgramDescBind program;
+  f::BlockDescBind *block = program.Block(0);
+
+  AddOp("one_one", {{"input", {"a"}}}, {{"output", {"b"}}}, {}, block);
+
+  f::ProgramDesc *pdesc = program.Proto();
+  f::ProgramDesc pruned;
+
+  Prune(*pdesc, pruned);
+  PADDLE_ENFORCE_EQ(pruned.blocks(0).ops_size(), 0);
+
+  pdesc->mutable_blocks(0)->mutable_ops(0)->set_is_target(true);
+  Prune(*pdesc, pruned);
+  PADDLE_ENFORCE_EQ(pruned.blocks(0).ops_size(), 1);
+}
+
+TEST(Prune, forward) {
+  f::ProgramDescBind program;
+  f::BlockDescBind *block = program.Block(0);
+
+  AddOp("one_one", {{"input", {"a"}}}, {{"output", {"b"}}}, {}, block);
+  AddOp("one_one", {{"input", {"b"}}}, {{"output", {"c"}}}, {}, block);
+  AddOp("one_one", {{"input", {"c"}}}, {{"output", {"d"}}}, {}, block);
+  AddOp("one_one", {{"input", {"d"}}}, {{"output", {"e"}}}, {}, block);
+
+  f::ProgramDesc *pdesc = program.Proto();
+
+  for (int i = 0; i < pdesc->blocks(0).ops_size(); ++i) {
+    f::ProgramDesc pruned;
+    pdesc->mutable_blocks(0)->mutable_ops(i)->set_is_target(true);
+    Prune(*pdesc, pruned);
+    PADDLE_ENFORCE_EQ(pruned.blocks(0).ops_size(), i + 1);
+  }
+}
+
+TEST(Prune, multi_input_op) {
+  f::ProgramDescBind program;
+  f::BlockDescBind *block = program.Block(0);
+
+  AddOp("one_one", {{"input", {"a0"}}}, {{"output", {"b0"}}}, {}, block);
+  AddOp("one_one", {{"input", {"a1"}}}, {{"output", {"b1"}}}, {}, block);
+  AddOp("one_one", {{"input", {"a2"}}}, {{"output", {"b2"}}}, {}, block);
+  AddOp("three_one", {{"input", {"b0", "b1", "b2"}}}, {{"output", {"c"}}}, {},
+        block);
+
+  f::ProgramDesc *pdesc = program.Proto();
+  pdesc->mutable_blocks(0)->mutable_ops(3)->set_is_target(true);
+
+  f::ProgramDesc pruned;
+  Prune(*pdesc, pruned);
+  PADDLE_ENFORCE_EQ(pruned.blocks(0).ops_size(), 4);
+}
+
+TEST(Prune, multi_output_op) {
+  f::ProgramDescBind program;
+  f::BlockDescBind *block = program.Block(0);
+
+  AddOp("one_two", {{"input", {"a"}}}, {{"output", {"b", "c"}}}, {}, block);
+  AddOp("one_one", {{"input", {"b"}}}, {{"output", {"b1"}}}, {}, block);
+  AddOp("one_one", {{"input", {"c"}}}, {{"output", {"c1"}}}, {}, block);
+
+  f::ProgramDesc *pdesc = program.Proto();
+  pdesc->mutable_blocks(0)->mutable_ops(2)->set_is_target(true);
+
+  f::ProgramDesc pruned;
+  Prune(*pdesc, pruned);
+  PADDLE_ENFORCE_EQ(pruned.blocks(0).ops_size(), 2);
+}
+
+TEST(Prune, multi_target) {
+  f::ProgramDescBind program;
+  f::BlockDescBind *block = program.Block(0);
+
+  AddOp("one_two", {{"input", {"a"}}}, {{"output", {"b", "c"}}}, {}, block);
+  AddOp("one_one", {{"input", {"b"}}}, {{"output", {"b1"}}}, {}, block);
+  AddOp("one_one", {{"input", {"c"}}}, {{"output", {"c1"}}}, {}, block);
+
+  f::ProgramDesc *pdesc = program.Proto();
+  pdesc->mutable_blocks(0)->mutable_ops(1)->set_is_target(true);
+  pdesc->mutable_blocks(0)->mutable_ops(2)->set_is_target(true);
+
+  f::ProgramDesc pruned;
+  Prune(*pdesc, pruned);
+  PADDLE_ENFORCE_EQ(pruned.blocks(0).ops_size(), 3);
+}
--- a/paddle/framework/scope.cc
+++ b/paddle/framework/scope.cc
@@ -65,12 +65,11 @@ void Scope::DropKids() {
  kids_.clear();
 }

-framework::Scope& GetGlobalScope() {
-  static framework::Scope* g_scope = nullptr;
-  if (g_scope == nullptr) {
-    g_scope = new framework::Scope();
-  }
-  return *g_scope;
+void Scope::DeleteScope(Scope* scope) {
+  auto it = std::find(this->kids_.begin(), this->kids_.end(), scope);
+  PADDLE_ENFORCE(it != this->kids_.end(), "Cannot find %p as kid scope", scope);
+  this->kids_.erase(it);
+  delete scope;
 }

 }  // namespace framework

--- a/paddle/framework/scope.h
+++ b/paddle/framework/scope.h
@@ -59,6 +59,8 @@ class Scope {
  /// Find the scope or an ancestor scope that contains the given variable.
  const Scope* FindScope(const Variable* var) const;

+  void DeleteScope(Scope* scope);
+
  /// Drop all kids scopes belonged to this scope.
  void DropKids();

@@ -72,8 +74,5 @@ class Scope {

  DISABLE_COPY_AND_ASSIGN(Scope);
 };
-
-framework::Scope& GetGlobalScope();
-
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/tensor.h
+++ b/paddle/framework/tensor.h
@@ -60,6 +60,10 @@ class Tensor {
  template <typename T>
  inline T* mutable_data(platform::Place place);

+  inline void* mutable_data(platform::Place place, std::type_index type);
+
+  inline void* mutable_data(platform::Place place);
+
  /**
   * @brief     Return a pointer to mutable memory block.
   *
@@ -81,7 +85,6 @@ class Tensor {
  inline Tensor& Resize(const DDim& dims);

  /*! The internal of two tensors share the same memory block. */
-  template <typename T>
  inline Tensor& ShareDataWith(const Tensor& src);

  /**
@@ -96,26 +99,9 @@ class Tensor {
  // TODO(qijun): https://github.com/PaddlePaddle/Paddle/issues/4647
  // Remove `CopyFrom` and `CopyFromVector` from Tensor interface
  // and make them global functions
-  template <typename T>
  inline void CopyFrom(const Tensor& src, const platform::Place& dst_place,
                       const platform::DeviceContext& ctx);

-  // FIXME(yuyang18): CopyFrom should without template T, use the replace
-  // `CopyFrom` with `CopyFromTensor`
-  inline void CopyFromTensor(const Tensor& src,
-                             const platform::Place& dst_place,
-                             const platform::DeviceContext& ctx) {
-    // NOLINTNEXTLINES_8 cpplint.py will recognize below lines as functions.
-    // That is a bug of cpplint.py. Just ignore lint these lines.
-    if (src.type() == std::type_index(typeid(double))) {
-      CopyFrom<double>(src, dst_place, ctx);
-    } else if (src.type() == std::type_index(typeid(float))) {
-      CopyFrom<float>(src, dst_place, ctx);
-    } else if (src.type() == std::type_index(typeid(int))) {
-      CopyFrom<int>(src, dst_place, ctx);
-    }
-  }
-
  /**
   * @brief   Copy the content of an external vector to a tensor.
   *
@@ -137,7 +123,6 @@ class Tensor {
   * @param[in] end_idx     The index of the end row(exclusive) to slice.
   *                        The index number begins from 0.
   */
-  template <typename T>
  inline Tensor Slice(const int& begin_idx, const int& end_idx) const;

  platform::Place place() const {
@@ -149,7 +134,6 @@ class Tensor {
  std::type_index type() const { return holder_->type(); }

 private:
-  template <typename T>
  inline void check_memory_size() const;

 private:
@@ -158,20 +142,22 @@ class Tensor {
   *          parameter of Variable.
   */
  struct Placeholder {
-    virtual ~Placeholder() {}
+    virtual ~Placeholder() = default;
    virtual void* ptr() const = 0;
    virtual size_t size() const = 0;
    virtual std::type_index type() const = 0;
    virtual platform::Place place() const = 0;
+    virtual void set_type(std::type_index type) = 0;
  };

-  template <typename T, typename Place>
+  template <typename Place>
  struct PlaceholderImpl : public Placeholder {
-    PlaceholderImpl(Place place, size_t size)
-        : ptr_(static_cast<T*>(memory::Alloc(place, size)),
-               memory::PODDeleter<T, Place>(place)),
+    PlaceholderImpl(Place place, size_t size, std::type_index type)
+        : ptr_(static_cast<uint8_t*>(memory::Alloc(place, size)),
+               memory::PODDeleter<uint8_t, Place>(place)),
          place_(place),
-          size_(size) {
+          size_(size),
+          type_(type) {
      PADDLE_ENFORCE_NOT_NULL(ptr_, "Insufficient %s memory to allocation.",
                              (is_cpu_place(place_) ? "CPU" : "GPU"));
    }
@@ -179,16 +165,20 @@ class Tensor {
    virtual size_t size() const { return size_; }
    virtual platform::Place place() const { return place_; }
    virtual void* ptr() const { return static_cast<void*>(ptr_.get()); }
-    virtual std::type_index type() const { return std::type_index(typeid(T)); }
+    virtual std::type_index type() const { return type_; }
+    virtual void set_type(std::type_index type) { type_ = type; }

    /*! the pointer of memory block. */
-    std::unique_ptr<T, memory::PODDeleter<T, Place>> ptr_;
+    std::unique_ptr<uint8_t, memory::PODDeleter<uint8_t, Place>> ptr_;

    /*! the place of memory block. */
    platform::Place place_;

    /*! the size of memory block. */
    size_t size_;
+
+    /* the current type of memory */
+    std::type_index type_;
  };

  /*! holds the memory block if allocated. */

--- a/paddle/framework/tensor_array.cc
+++ b/paddle/framework/tensor_array.cc
@@ -106,8 +106,8 @@ void TensorArray::Write(size_t index, const LoDTensor& value) {

  values_[index].Resize(value.dims());
  values_[index].mutable_data<value_type>(platform::CPUPlace());
-  values_[index].CopyFrom<value_type>(value, platform::CPUPlace(),
-                                      platform::CPUDeviceContext());
+  values_[index].CopyFrom(value, platform::CPUPlace(),
+                          platform::CPUDeviceContext());
 }

 void TensorArray::WriteShared(size_t index, const LoDTensor& value) {
@@ -116,7 +116,7 @@ void TensorArray::WriteShared(size_t index, const LoDTensor& value) {
    values_.resize(index + 1);
  }

-  values_[index].ShareDataWith<value_type>(value);
+  values_[index].ShareDataWith(value);
 }

 LoDTensor TensorArray::Pack(size_t level, const std::vector<DySeqMeta>& meta,
@@ -163,9 +163,9 @@ LoDTensor TensorArray::Stack() const {
  result.mutable_data<value_type>(platform::CPUPlace());

  for (size_t idx = 0; idx < size(); idx++) {
-    result.Slice<value_type>(idx, idx + 1)
-        .CopyFrom<value_type>(Read(idx), platform::CPUPlace(),
-                              platform::CPUDeviceContext());
+    result.Slice(idx, idx + 1)
+        .CopyFrom(Read(idx), platform::CPUPlace(),
+                  platform::CPUDeviceContext());
  }
  return result;
 }
@@ -191,13 +191,12 @@ void TensorArray::Unstack(const LoDTensor& source, bool data_shared) const {
    auto& value = values_[elem];
    if (data_shared) {
      // share memory
-      value.ShareDataWith<value_type>(source.Slice<value_type>(elem, elem + 1));
+      value.ShareDataWith(source.Slice(elem, elem + 1));
    } else {
      // copy
      value.Resize(value_dims);
-      value.CopyFrom<value_type>(source.Slice<value_type>(elem, elem + 1),
-                                 platform::CPUPlace(),
-                                 platform::CPUDeviceContext());
+      value.CopyFrom(source.Slice(elem, elem + 1), platform::CPUPlace(),
+                     platform::CPUDeviceContext());
    }
  }
 }
@@ -242,11 +241,10 @@ LoDTensor DynamicBatchUnpacker::GetBatch(size_t index) {

  for (size_t i = 0; i < indice.size(); i++) {
    auto index = indice[i];
-    auto target = result.Slice<value_type>(i, i + 1);
-    auto slice = source->Slice<value_type>(index, index + 1);
+    auto target = result.Slice(i, i + 1);
+    auto slice = source->Slice(index, index + 1);

-    target.CopyFrom<value_type>(slice, platform::CPUPlace(),
-                                platform::CPUDeviceContext());
+    target.CopyFrom(slice, platform::CPUPlace(), platform::CPUDeviceContext());
  }

  return result;
@@ -277,10 +275,10 @@ LoDTensor PackDynamicBatch(const std::vector<LoDTensor>& source,
      // target is result[index]
      auto index = seq_meta.begin + batch_id;
      if (index >= seq_meta.end) break;
-      auto source_ = source[batch_id].Slice<float>(seq_id, seq_id + 1);
-      auto target = result.Slice<float>(index, index + 1);
-      target.CopyFrom<float>(source_, platform::CPUPlace(),
-                             platform::CPUDeviceContext());
+      auto source_ = source[batch_id].Slice(seq_id, seq_id + 1);
+      auto target = result.Slice(index, index + 1);
+      target.CopyFrom(source_, platform::CPUPlace(),
+                      platform::CPUDeviceContext());
    }
  }


--- a/paddle/framework/tensor_array_test.cc
+++ b/paddle/framework/tensor_array_test.cc
@@ -91,7 +91,7 @@ class TensorArrayPackTester : public ::testing::Test {
      size_t begin = level[i];
      size_t end = level[i + 1];
      for (size_t j = begin; j < end; j++) {
-        auto record = source.Slice<int>(j, j + 1);
+        auto record = source.Slice(j, j + 1);
        for (int dim = 0; dim < 128; dim++) {
          record.mutable_data<int>(platform::CPUPlace())[dim] = j - begin;
        }

--- a/paddle/framework/tensor_impl.h
+++ b/paddle/framework/tensor_impl.h
@@ -19,12 +19,50 @@ limitations under the License. */
 namespace paddle {
 namespace framework {

+template <typename... T>
+struct SizeOfTypeFunctor;
+
 template <typename T>
+struct SizeOfTypeFunctor<T> {
+  size_t operator()(std::type_index type) const {
+    if (typeid(T).hash_code() == type.hash_code()) {
+      return sizeof(T);
+    } else {
+      return 0UL;
+    }
+  }
+};
+
+template <>
+struct SizeOfTypeFunctor<> {
+  size_t operator()(std::type_index type) const { return 0UL; }
+};
+
+template <typename HEAD, typename... TAIL>
+struct SizeOfTypeFunctor<HEAD, TAIL...> {
+  size_t operator()(std::type_index type) const {
+    SizeOfTypeFunctor<HEAD> head;
+    size_t head_size = head(type);
+    if (head_size != 0) {
+      return head_size;
+    }
+    SizeOfTypeFunctor<TAIL...> tail;
+    return tail(type);
+  }
+};
+
+static inline size_t SizeOfType(std::type_index type) {
+  SizeOfTypeFunctor<int, float, double, int16_t, int64_t> functor;
+  size_t size = functor(type);
+  PADDLE_ENFORCE(size != 0UL, "Cannot get size of type %s", type.name());
+  return size;
+}
+
 inline void Tensor::check_memory_size() const {
  PADDLE_ENFORCE_NOT_NULL(
      holder_, "Tensor holds no memory. Call Tensor::mutable_data first.");
  PADDLE_ENFORCE_GE(
-      holder_->size(), numel() * sizeof(T) + offset_,
+      holder_->size(), numel() * SizeOfType(type()) + offset_,
      "Tensor's dims_ is out of bound. Call Tensor::mutable_data "
      "first to re-allocate memory.\n"
      "or maybe the required data-type mismatches the data already stored.");
@@ -32,14 +70,23 @@ inline void Tensor::check_memory_size() const {

 template <typename T>
 inline const T* Tensor::data() const {
-  check_memory_size<T>();
+  check_memory_size();
+  PADDLE_ENFORCE(std::is_same<T, void>::value ||
+                     holder_->type().hash_code() == typeid(T).hash_code(),
+                 "Tensor holds the wrong type, it holds %s",
+                 this->holder_->type().name());
+
  return reinterpret_cast<const T*>(
      reinterpret_cast<uintptr_t>(holder_->ptr()) + offset_);
 }

 template <typename T>
 inline T* Tensor::data() {
-  check_memory_size<T>();
+  check_memory_size();
+  PADDLE_ENFORCE(std::is_same<T, void>::value ||
+                     holder_->type().hash_code() == typeid(T).hash_code(),
+                 "Tensor holds the wrong type, it holds %s",
+                 this->holder_->type().name());
  return reinterpret_cast<T*>(reinterpret_cast<uintptr_t>(holder_->ptr()) +
                              offset_);
 }
@@ -54,51 +101,62 @@ inline T* Tensor::mutable_data(DDim dims, platform::Place place) {
 template <typename T>
 inline T* Tensor::mutable_data(platform::Place place) {
  static_assert(std::is_pod<T>::value, "T must be POD");
+  return reinterpret_cast<T*>(mutable_data(place, typeid(T)));
+}
+
+inline void* Tensor::mutable_data(platform::Place place, std::type_index type) {
+  if (holder_ != nullptr) {
+    holder_->set_type(type);
+  }
  PADDLE_ENFORCE_GT(numel(), 0,
                    "Tensor's numel must be larger than zero to call "
                    "Tensor::mutable_data. Call Tensor::set_dim first.");
+  int64_t size = numel() * SizeOfType(type);
  /* some versions of boost::variant don't have operator!= */
-  int64_t size = numel() * sizeof(T);
  if (holder_ == nullptr || !(holder_->place() == place) ||
      holder_->size() < size + offset_) {
    if (platform::is_cpu_place(place)) {
-      holder_.reset(new PlaceholderImpl<T, platform::CPUPlace>(
-          boost::get<platform::CPUPlace>(place), size));
+      holder_.reset(new PlaceholderImpl<platform::CPUPlace>(
+          boost::get<platform::CPUPlace>(place), size, type));
    } else if (platform::is_gpu_place(place)) {
 #ifndef PADDLE_WITH_CUDA
      PADDLE_THROW("'GPUPlace' is not supported in CPU only device.");
    }
 #else
-      holder_.reset(new PlaceholderImpl<T, platform::GPUPlace>(
-          boost::get<platform::GPUPlace>(place), size));
+      holder_.reset(new PlaceholderImpl<platform::GPUPlace>(
+          boost::get<platform::GPUPlace>(place), size, type));
    }
 #endif
    offset_ = 0;
  }
-  return reinterpret_cast<T*>(reinterpret_cast<uintptr_t>(holder_->ptr()) +
-                              offset_);
+  return reinterpret_cast<void*>(reinterpret_cast<uintptr_t>(holder_->ptr()) +
+                                 offset_);
+}
+
+inline void* Tensor::mutable_data(platform::Place place) {
+  PADDLE_ENFORCE(this->holder_ != nullptr,
+                 "Cannot invoke mutable data if current hold nothing");
+  return mutable_data(place, holder_->type());
 }

-template <typename T>
 inline Tensor& Tensor::ShareDataWith(const Tensor& src) {
-  src.check_memory_size<T>();
+  src.check_memory_size();
  *this = src;
  return *this;
 }

-template <typename T>
 inline void Tensor::CopyFrom(const Tensor& src,
                             const platform::Place& dst_place,
                             const platform::DeviceContext& ctx) {
-  src.check_memory_size<T>();
+  src.check_memory_size();
  Resize(src.dims());

  auto src_place = src.holder_->place();
-  auto src_ptr = static_cast<const void*>(src.data<T>());
+  auto src_ptr = src.data<void>();

-  auto dst_ptr = static_cast<void*>(mutable_data<T>(dst_place));
+  auto dst_ptr = mutable_data(dst_place, src.type());

-  auto size = src.numel() * sizeof(T);
+  auto size = src.numel() * SizeOfType(src.type());

  if (platform::is_cpu_place(src_place) && platform::is_cpu_place(dst_place)) {
    memory::Copy(boost::get<platform::CPUPlace>(dst_place), dst_ptr,
@@ -165,9 +223,8 @@ inline void Tensor::CopyFromVector(const std::vector<T>& src,
 #endif
 }

-template <typename T>
 inline Tensor Tensor::Slice(const int& begin_idx, const int& end_idx) const {
-  check_memory_size<T>();
+  check_memory_size();
  PADDLE_ENFORCE_GE(begin_idx, 0,
                    "The start row index must be greater than 0.");
  PADDLE_ENFORCE_LE(end_idx, dims_[0], "The end row index is out of bound.");
@@ -183,7 +240,7 @@ inline Tensor Tensor::Slice(const int& begin_idx, const int& end_idx) const {
    DDim dst_dims = dims_;
    dst_dims[0] = end_idx - begin_idx;
    dst.Resize(dst_dims);
-    dst.offset_ = offset_ + begin_idx * base * sizeof(T);
+    dst.offset_ = offset_ + begin_idx * base * SizeOfType(type());
    return dst;
  }
 }
@@ -197,10 +254,9 @@ inline const DDim& Tensor::dims() const { return dims_; }

 inline int64_t Tensor::numel() const { return product(dims_); }

-template <typename T>
 inline Tensor ReshapeToMatrix(const Tensor& src, int num_col_dims) {
  Tensor res;
-  res.ShareDataWith<T>(src);
+  res.ShareDataWith(src);
  res.Resize(flatten_to_2d(src.dims(), num_col_dims));
  return res;
 }

--- a/paddle/framework/tensor_test.cc
+++ b/paddle/framework/tensor_test.cc
@@ -108,7 +108,7 @@ TEST(Tensor, ShareDataWith) {
    // Try to share data form uninitialized tensor
    bool caught = false;
    try {
-      dst_tensor.ShareDataWith<float>(src_tensor);
+      dst_tensor.ShareDataWith(src_tensor);
    } catch (paddle::platform::EnforceNotMet err) {
      caught = true;
      std::string msg =
@@ -122,7 +122,7 @@ TEST(Tensor, ShareDataWith) {
    ASSERT_TRUE(caught);

    src_tensor.mutable_data<int>(make_ddim({2, 3, 4}), CPUPlace());
-    dst_tensor.ShareDataWith<int>(src_tensor);
+    dst_tensor.ShareDataWith(src_tensor);
    ASSERT_EQ(src_tensor.data<int>(), dst_tensor.data<int>());
  }

@@ -131,7 +131,7 @@ TEST(Tensor, ShareDataWith) {
    Tensor src_tensor;
    Tensor dst_tensor;
    src_tensor.mutable_data<int>(make_ddim({2, 3, 4}), GPUPlace());
-    dst_tensor.ShareDataWith<int>(src_tensor);
+    dst_tensor.ShareDataWith(src_tensor);
    ASSERT_EQ(src_tensor.data<int>(), dst_tensor.data<int>());
  }
 #endif
@@ -143,7 +143,7 @@ TEST(Tensor, Slice) {
  {
    Tensor src_tensor;
    src_tensor.mutable_data<int>(make_ddim({5, 3, 4}), CPUPlace());
-    Tensor slice_tensor = src_tensor.Slice<int>(1, 3);
+    Tensor slice_tensor = src_tensor.Slice(1, 3);
    DDim slice_dims = slice_tensor.dims();
    ASSERT_EQ(arity(slice_dims), 3);
    EXPECT_EQ(slice_dims[0], 2);
@@ -167,7 +167,7 @@ TEST(Tensor, Slice) {
  {
    Tensor src_tensor;
    src_tensor.mutable_data<double>(make_ddim({6, 9}), GPUPlace());
-    Tensor slice_tensor = src_tensor.Slice<double>(2, 6);
+    Tensor slice_tensor = src_tensor.Slice(2, 6);
    DDim slice_dims = slice_tensor.dims();
    ASSERT_EQ(arity(slice_dims), 2);
    EXPECT_EQ(slice_dims[0], 4);
@@ -202,7 +202,7 @@ TEST(Tensor, CopyFrom) {
    memcpy(src_ptr, arr, 9 * sizeof(int));

    auto cpu_place = new paddle::platform::CPUPlace();
-    dst_tensor.CopyFrom<int>(src_tensor, *cpu_place, cpu_ctx);
+    dst_tensor.CopyFrom(src_tensor, *cpu_place, cpu_ctx);

    const int* dst_ptr = dst_tensor.data<int>();
    ASSERT_NE(src_ptr, dst_ptr);
@@ -210,8 +210,8 @@ TEST(Tensor, CopyFrom) {
      EXPECT_EQ(src_ptr[i], dst_ptr[i]);
    }

-    Tensor slice_tensor = src_tensor.Slice<int>(1, 2);
-    dst_tensor.CopyFrom<int>(slice_tensor, *cpu_place, cpu_ctx);
+    Tensor slice_tensor = src_tensor.Slice(1, 2);
+    dst_tensor.CopyFrom(slice_tensor, *cpu_place, cpu_ctx);
    const int* slice_ptr = slice_tensor.data<int>();
    dst_ptr = dst_tensor.data<int>();
    ASSERT_NE(dst_ptr, slice_ptr);
@@ -233,11 +233,11 @@ TEST(Tensor, CopyFrom) {
    // CPU Tensor to GPU Tensor
    auto gpu_place = new paddle::platform::GPUPlace(0);
    CUDADeviceContext gpu_ctx(*gpu_place);
-    gpu_tensor.CopyFrom<int>(src_tensor, *gpu_place, gpu_ctx);
+    gpu_tensor.CopyFrom(src_tensor, *gpu_place, gpu_ctx);

    // GPU Tensor to CPU Tensor
    auto cpu_place = new paddle::platform::CPUPlace();
-    dst_tensor.CopyFrom<int>(gpu_tensor, *cpu_place, gpu_ctx);
+    dst_tensor.CopyFrom(gpu_tensor, *cpu_place, gpu_ctx);

    // Sync before Compare Tensors
    gpu_ctx.Wait();
@@ -247,13 +247,13 @@ TEST(Tensor, CopyFrom) {
      EXPECT_EQ(src_ptr[i], dst_ptr[i]);
    }

-    Tensor slice_tensor = src_tensor.Slice<int>(1, 2);
+    Tensor slice_tensor = src_tensor.Slice(1, 2);

    // CPU Slice Tensor to GPU Tensor
-    gpu_tensor.CopyFrom<int>(slice_tensor, *gpu_place, gpu_ctx);
+    gpu_tensor.CopyFrom(slice_tensor, *gpu_place, gpu_ctx);

    // GPU Tensor to CPU Tensor
-    dst_tensor.CopyFrom<int>(gpu_tensor, *cpu_place, gpu_ctx);
+    dst_tensor.CopyFrom(gpu_tensor, *cpu_place, gpu_ctx);

    // Sync before Compare Slice Tensors
    gpu_ctx.Wait();
@@ -320,7 +320,7 @@ TEST(Tensor, CopyFromVector) {
    CUDADeviceContext gpu_ctx(*gpu_place);
    gpu_tensor.CopyFromVector<int>(src_vec, gpu_ctx);
    // Copy from GPU to CPU tensor for comparison
-    dst_tensor.CopyFrom<int>(gpu_tensor, *cpu_place, gpu_ctx);
+    dst_tensor.CopyFrom(gpu_tensor, *cpu_place, gpu_ctx);

    // Sync before Compare Tensors
    gpu_ctx.Wait();
@@ -340,7 +340,7 @@ TEST(Tensor, CopyFromVector) {
    cpu_tensor.CopyFromVector<int>(src_vec, cpu_ctx);
    gpu_tensor.Resize(make_ddim({2, 2}));
    gpu_tensor.CopyFromVector<int>(src_vec, gpu_ctx);
-    dst_tensor.CopyFrom<int>(gpu_tensor, *cpu_place, gpu_ctx);
+    dst_tensor.CopyFrom(gpu_tensor, *cpu_place, gpu_ctx);

    // Sync before Compare Tensors
    gpu_ctx.Wait();
@@ -368,7 +368,7 @@ TEST(Tensor, ReshapeToMatrix) {
  for (int i = 0; i < 2 * 3 * 4 * 9; ++i) {
    src_ptr[i] = i;
  }
-  Tensor res = ReshapeToMatrix<int>(src, 2);
+  Tensor res = ReshapeToMatrix(src, 2);
  ASSERT_EQ(res.dims()[0], 2 * 3);
  ASSERT_EQ(res.dims()[1], 4 * 9);
 }
--- a/paddle/framework/var_desc.h
+++ b/paddle/framework/var_desc.h
@@ -79,6 +79,10 @@ class VarDescBind {

  void SetType(VarDesc::VarType type) { desc_.set_type(type); }

+  bool Persistable() const { return desc_.persistable(); }
+
+  void SetPersistable(bool persistable) { desc_.set_persistable(persistable); }
+
 private:
  const TensorDesc &tensor_desc() const;
  TensorDesc *mutable_tensor_desc();

--- a/paddle/framework/var_type_inference_test.cc
+++ b/paddle/framework/var_type_inference_test.cc
@@ -62,7 +62,7 @@ namespace paddle {
 namespace framework {

 TEST(InferVarType, sum_op) {
-  auto &prog = ProgramDescBind::Instance(&GetProgramDesc());
+  ProgramDescBind prog;
  auto *op = prog.Block(0)->AppendOp();
  op->SetType("sum");
  op->SetInput("X", {"test_a", "test_b", "test_c"});
@@ -83,7 +83,7 @@ TEST(InferVarType, sum_op) {
 }

 TEST(InferVarType, sum_op_without_infer_var_type) {
-  auto &prog = ProgramDescBind::Instance(&GetProgramDesc());
+  ProgramDescBind prog;
  auto *op = prog.Block(0)->AppendOp();
  op->SetType("sum_without_infer_var_type");
  op->SetInput("X", {"test2_a", "test2_b", "test2_c"});

--- a/paddle/framework/variable.h
+++ b/paddle/framework/variable.h
@@ -25,7 +25,10 @@ class Variable {
 public:
  template <typename T>
  const T& Get() const {
-    PADDLE_ENFORCE(IsType<T>(), "Variable must be type %s", typeid(T).name());
+    PADDLE_ENFORCE(holder_ != nullptr, "Variable must hold some thing");
+    PADDLE_ENFORCE(IsType<T>(),
+                   "Variable must be type %s, the holding type is %s",
+                   typeid(T).name(), holder_->Type().name());
    return *static_cast<const T*>(holder_->Ptr());
  }


--- a/paddle/gserver/gradientmachines/NeuralNetwork.cpp
+++ b/paddle/gserver/gradientmachines/NeuralNetwork.cpp
@@ -21,6 +21,10 @@ limitations under the License. */
 #include "paddle/utils/Logging.h"
 #include "paddle/utils/Stat.h"

+#ifdef PADDLE_USE_MKLDNN
+#include "paddle/gserver/layers/MKLDNNLayer.h"
+#endif
+
 #ifndef PADDLE_MOBILE_INFERENCE
 #include "MultiNetwork.h"
 #include "RecurrentGradientMachine.h"
@@ -300,6 +304,17 @@ void NeuralNetwork::backward(const UpdateCallback& callback) {
  }
 }

+void NeuralNetwork::finish() {
+#ifdef PADDLE_USE_MKLDNN
+  FOR_EACH_R(layer, layers_) {
+    MKLDNNLayerPtr dnnLayer = std::dynamic_pointer_cast<MKLDNNLayer>(*layer);
+    if (dnnLayer) {
+      dnnLayer->convertWeightsToPaddle();
+    }
+  }
+#endif
+}
+
 Argument NeuralNetwork::getLayerOutput(const std::string& layerName) {
  return getLayer(layerName)->getOutput();
 }

--- a/paddle/gserver/gradientmachines/NeuralNetwork.h
+++ b/paddle/gserver/gradientmachines/NeuralNetwork.h
@@ -134,6 +134,9 @@ public:

  const std::string& getName() const { return subModelName_; }

+  /// some finish work, like convert the weight format of MKLDNNLayers
+  void finish();
+
 protected:
  /**
   * The constructor of NeuralNetwork.

--- a/paddle/gserver/layers/MKLDNNConvLayer.cpp
+++ b/paddle/gserver/layers/MKLDNNConvLayer.cpp
@@ -313,6 +313,7 @@ void MKLDNNConvLayer::resetOutValue(
      cvtOutVal_ = MKLDNNMatrix::createReorder(out, cpuOutVal_);
      CHECK(cvtOutVal_) << "should not be empty";
    } else {
+      cpuOut->setData(output_.value->getData());
      cpuOutVal_ = out;
    }
    // when output is cpu device, change the mkldnn output value and make them
@@ -456,17 +457,18 @@ void MKLDNNConvLayer::resetOutGrad(
    MKLDNNLayer::resetOutGrad(out, outVal_->getPrimitiveDesc());
  } else {
    const MatrixPtr& cpuOut = getOutput(CPU_DEVICE).grad;
+    // always share the same grad data of CPU output
+    // then the activation can get the right grad from output_.grad
+    output_.grad->setData(cpuOut->getData());
    // same PrimitiveDesc with cpuInVal_
    CHECK(cpuOutVal_);
    cpuOutGrad_ = MKLDNNMatrix::create(cpuOut, cpuOutVal_->getPrimitiveDesc());
    // create reorder if primitive desc does not match
    if (cpuOutGrad_->getPrimitiveDesc() != outVal_->getPrimitiveDesc()) {
-      out = MKLDNNMatrix::create(output_.grad, outVal_->getPrimitiveDesc());
+      out = MKLDNNMatrix::create(nullptr, outVal_->getPrimitiveDesc());
      cvtOutGrad_ = MKLDNNMatrix::createReorder(cpuOutGrad_, out);
      CHECK(cvtOutGrad_);
    } else {
-      // share the same data of CPU output
-      output_.grad->setData(cpuOut->getData());
      out = cpuOutGrad_;
    }
  }

--- a/paddle/gserver/layers/MKLDNNLayer.h
+++ b/paddle/gserver/layers/MKLDNNLayer.h
@@ -46,6 +46,9 @@ protected:
  // backward also need reset after reset forward handle
  bool needResetBwd_;

+  // is output only mkldnn
+  bool outputOnlyMKLDNN_;
+
  // mkldnn engine, stream and primivtives
  mkldnn::engine engine_;
  std::shared_ptr<MKLDNNStream> stream_;
@@ -141,6 +144,9 @@ public:
        updateInputData();
      }

+      if (!outputOnlyMKLDNN_) {
+        clearGrads();
+      }
      stream_->submit(pipelineFwd_);
    }

@@ -389,7 +395,8 @@ protected:
      CHECK_EQ(outputOtherDevice_[i].deviceId, CPU_DEVICE)
          << "Only support other device is CPU yet";
    }
-    return outputOtherDevice_.size() == 0;
+    outputOnlyMKLDNN_ = outputOtherDevice_.size() == 0;
+    return outputOnlyMKLDNN_;
  }

  /**
@@ -398,6 +405,16 @@ protected:
  void setDevice(int id) { deviceId_ = id; }

 private:
+  /**
+   * clear all grad
+   */
+  void clearGrads() {
+    output_.grad->zeroMem();
+    for (size_t i = 0; i < outputOtherDevice_.size(); i++) {
+      outputOtherDevice_[i].grad->zeroMem();
+    }
+  }
+
  /**
   * Set deviceId of the params used in this layer.
   */

--- a/paddle/gserver/layers/MKLDNNPoolLayer.cpp
+++ b/paddle/gserver/layers/MKLDNNPoolLayer.cpp
@@ -146,6 +146,7 @@ void MKLDNNPoolLayer::resetOutValue(MKLDNNMatrixPtr& out) {
      cvtOutVal_ = MKLDNNMatrix::createReorder(out, cpuOutVal_);
      CHECK(cvtOutVal_) << "should not be emptry";
    } else {
+      cpuOut->setData(output_.value->getData());
      cpuOutVal_ = out;
    }
    output_.value = std::dynamic_pointer_cast<Matrix>(cpuOutVal_);
@@ -213,15 +214,16 @@ void MKLDNNPoolLayer::resetOutGrad(MKLDNNMatrixPtr& out) {
    MKLDNNLayer::resetOutGrad(out, outVal_->getPrimitiveDesc());
  } else {
    const MatrixPtr& cpuOut = getOutput(CPU_DEVICE).grad;
+    // always share the same grad data of CPU output
+    // then the activation can get the right grad from output_.grad
+    output_.grad->setData(cpuOut->getData());
    cpuOutGrad_ = MKLDNNMatrix::create(
        cpuOut, memory::dims{bs_, oc_, oh_, ow_}, format::nchw, engine_);
    if (cpuOutGrad_->getPrimitiveDesc() != outVal_->getPrimitiveDesc()) {
-      out = MKLDNNMatrix::create(output_.grad, outVal_->getPrimitiveDesc());
+      out = MKLDNNMatrix::create(nullptr, outVal_->getPrimitiveDesc());
      cvtOutGrad_ = MKLDNNMatrix::createReorder(cpuOutGrad_, out);
      CHECK(cvtOutGrad_) << "should not be emptry";
    } else {
-      // share the same data of CPU output
-      output_.grad->setData(cpuOut->getData());
      out = cpuOutGrad_;
    }
  }

--- a/paddle/gserver/tests/CMakeLists.txt
+++ b/paddle/gserver/tests/CMakeLists.txt
@@ -26,7 +26,10 @@ if(WITH_MKLDNN)
        test_MKLDNN.cpp
        MKLDNNTester.cpp
        LayerGradUtil.cpp)
-    add_test(NAME test_MKLDNN COMMAND test_MKLDNN)
+    add_test(NAME test_MKLDNN
+        COMMAND .set_python_path.sh -d ${PADDLE_SOURCE_DIR}/python
+            ${CMAKE_CURRENT_BINARY_DIR}/test_MKLDNN
+            WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}/paddle)
 endif()

 ################ test_CRFLayerGrad ####################

--- a/paddle/gserver/tests/MKLDNNTester.cpp
+++ b/paddle/gserver/tests/MKLDNNTester.cpp
@@ -15,6 +15,7 @@ limitations under the License. */
 #include "MKLDNNTester.h"
 #include "paddle/gserver/layers/MKLDNNBase.h"
 #include "paddle/gserver/layers/MKLDNNLayer.h"
+#include "paddle/trainer/Trainer.h"

 namespace paddle {

@@ -315,6 +316,7 @@ void MKLDNNTester::runOnce() {
    auto& value = para->getBuf(PARAMETER_VALUE);
    real lr = 1e-3;
    value->add(*grad, lr);
+    grad->zeroMem();
  };
  randomTopDiffs();
  dnnLayer_->backward(updateCallback);
@@ -411,4 +413,143 @@ void MKLDNNTester::run(const TestConfig& dnn,
  }
 }

+void MKLDNNTester::initArgument(DataIn& data,
+                                const std::string& configPath,
+                                const size_t iter) {
+  TrainerConfigHelper config(configPath);
+  size_t batchSize = config.getOptConfig().batch_size();
+  data.inArgs.resize(iter);
+  data.outGrads.resize(iter);
+  data.paraValues.clear();
+  for (const auto& layer_name : config.getModelConfig().input_layer_names()) {
+    auto layer_config = std::find_if(config.getModelConfig().layers().begin(),
+                                     config.getModelConfig().layers().end(),
+                                     [=](const LayerConfig& layer_config) {
+                                       return layer_config.name() == layer_name;
+                                     });
+    CHECK(layer_config != config.getModelConfig().layers().end());
+
+    size_t layerSize = layer_config->size();
+    for (size_t i = 0; i < iter; ++i) {
+      Argument arg;
+      arg.value = Matrix::create(batchSize, layerSize, false, false);
+      arg.grad = Matrix::create(batchSize, layerSize, false, false);
+      arg.value->randomizeUniform();
+      arg.value->add(-0.5);
+      arg.value->sigmoid(*arg.value);
+      arg.grad->zeroMem();
+      arg.ids = VectorT<int>::create(batchSize, false);
+      arg.ids->rand(layerSize);
+      generateSequenceStartPositions(batchSize, arg.sequenceStartPositions);
+      data.inArgs[i].push_back(arg);
+    }
+  }
+
+  for (const auto& layer_name : config.getModelConfig().output_layer_names()) {
+    auto layer_config = std::find_if(config.getModelConfig().layers().begin(),
+                                     config.getModelConfig().layers().end(),
+                                     [=](const LayerConfig& layer_config) {
+                                       return layer_config.name() == layer_name;
+                                     });
+    CHECK(layer_config != config.getModelConfig().layers().end());
+
+    size_t layerSize = layer_config->size();
+    for (size_t i = 0; i < iter; ++i) {
+      MatrixPtr grad = Matrix::create(batchSize, layerSize, false, false);
+      grad->randomizeUniform();
+      data.outGrads[i].push_back(grad);
+    }
+  }
+
+  for (const auto& para_config : config.getModelConfig().parameters()) {
+    VectorPtr value = Vector::create(para_config.size(), false);
+    value->randnorm(0, 2);
+    data.paraValues.push_back(value);
+  }
+}
+
+void MKLDNNTester::getOutResult(const std::string& configPath,
+                                DataIn& in,
+                                DataOut& out,
+                                bool use_mkldnn,
+                                size_t iter) {
+  FLAGS_use_gpu = false;
+  FLAGS_use_mkldnn = use_mkldnn;
+  *ThreadLocalRand::getSeed() = 1;
+  srand(1);
+
+  Trainer trainer;
+  auto config = std::make_shared<TrainerConfigHelper>(configPath);
+  trainer.init(config, false);
+  auto gradientMachine = trainer.getGradientMachine();
+  std::vector<ParameterPtr> parameters = gradientMachine->getParameters();
+  for (size_t i = 0; i < in.paraValues.size(); i++) {
+    parameters[i]->getBuf(PARAMETER_VALUE)->copyFrom(*in.paraValues[i]);
+  }
+  UpdateCallback simpleUpdate = [](Parameter* para) {
+    auto& grad = para->getBuf(PARAMETER_GRADIENT);
+    auto& value = para->getBuf(PARAMETER_VALUE);
+    real lr = 1e-2;
+    value->add(*grad, lr);
+    grad->zeroMem();
+  };
+
+  vector<Argument> outArgs;
+  gradientMachine->start();
+  out.outValues.clear();
+  out.paraValues.clear();
+  for (size_t i = 0; i < iter; ++i) {
+    VLOG(MKLDNN_TESTS) << "runing iteration " << i;
+    gradientMachine->forward(in.inArgs[i], &outArgs, PASS_TRAIN);
+    // save forward result
+    for (size_t k = 0; k < outArgs.size(); k++) {
+      MatrixPtr value = Matrix::create(outArgs[k].value->getHeight(),
+                                       outArgs[k].value->getWidth(),
+                                       false,
+                                       false);
+      value->copyFrom(*outArgs[k].value);
+      out.outValues.push_back(value);
+    }
+
+    // random backward input
+    for (size_t k = 0; k < outArgs.size(); k++) {
+      outArgs[k].grad->copyFrom(*in.outGrads[i][k]);
+    }
+    gradientMachine->backward(simpleUpdate);
+  }
+  gradientMachine->finish();
+
+  // save param value
+  for (size_t i = 0; i < in.paraValues.size(); i++) {
+    VectorPtr val = Vector::create(
+        parameters[i]->getBuf(PARAMETER_VALUE)->getSize(), false);
+    val->copyFrom(*parameters[i]->getBuf(PARAMETER_VALUE));
+    out.paraValues.push_back(val);
+  }
+}
+
+void MKLDNNTester::compareResult(DataOut& ref, DataOut& dnn, float eps) {
+  CHECK_EQ(ref.outValues.size(), dnn.outValues.size());
+  CHECK_EQ(ref.paraValues.size(), dnn.paraValues.size());
+  for (size_t i = 0; i < ref.outValues.size(); i++) {
+    EXPECT_LE(fabs(compareMatrix(ref.outValues[i], dnn.outValues[i])), eps);
+  }
+  for (size_t i = 0; i < ref.paraValues.size(); i++) {
+    EXPECT_LE(fabs(compareVector(ref.paraValues[i], dnn.paraValues[i])), eps);
+  }
+}
+
+void MKLDNNTester::runBranchesTest(const std::string& configPath,
+                                   size_t iter,
+                                   float eps) {
+  DataIn in;
+  initArgument(in, configPath, iter);
+
+  DataOut outCpu, outDnn;
+  getOutResult(configPath, in, outCpu, false, iter);
+  getOutResult(configPath, in, outDnn, true, iter);
+
+  compareResult(outCpu, outDnn, eps);
+}
+
 }  //  namespace paddle
--- a/paddle/gserver/tests/MKLDNNTester.h
+++ b/paddle/gserver/tests/MKLDNNTester.h
@@ -33,6 +33,17 @@ class MKLDNNTester {
    NUM = 2,  // Number of total
  };

+  struct DataIn {
+    std::vector<std::vector<Argument>> inArgs;
+    std::vector<std::vector<MatrixPtr>> outGrads;
+    std::vector<VectorPtr> paraValues;
+  };
+
+  struct DataOut {
+    std::vector<MatrixPtr> outValues;
+    std::vector<VectorPtr> paraValues;
+  };
+
 protected:
  std::vector<TestConfig> configs_;
  vector<string> layerNames_;
@@ -74,7 +85,17 @@ public:
           float epsilon = 1e-4,
           bool log = false,
           int level = MKLDNN_ALL);
-  void setLogLevel(int lvl) { lvl_ = lvl; }
+  static void runBranchesTest(const std::string& configPath,
+                              size_t iter = 3,
+                              float eps = 1e-4);
+  static void initArgument(DataIn& data,
+                           const std::string& configPath,
+                           size_t iter = 3);
+  static void getOutResult(const std::string& configPath,
+                           DataIn& in,
+                           DataOut& out,
+                           bool use_mkldnn,
+                           size_t iter = 3);

 private:
  void reset(const TestConfig& dnn, const TestConfig& ref, size_t batchSize);
@@ -101,8 +122,9 @@ private:
  void saveWgt(const vector<ParameterPtr>& from, vector<VectorPtr>& to);
  void restoreWgt(const vector<VectorPtr>& from, vector<ParameterPtr>& to);

-  double compareMatrix(const MatrixPtr& m1, const MatrixPtr& m2);
-  double compareVector(const VectorPtr& v1, const VectorPtr& v2);
+  static double compareMatrix(const MatrixPtr& m1, const MatrixPtr& m2);
+  static double compareVector(const VectorPtr& v1, const VectorPtr& v2);
+  static void compareResult(DataOut& ref, DataOut& dnn, float eps = 1e-4);

  /**
   * Get delta percent
@@ -111,11 +133,11 @@ private:
   * else return sum(abs(a-b)) / sum(abs(b))
   * The return value should be smaller than eps when passing.
   */
-  double getDelta(const real* d1,
-                  const real* d2,
-                  size_t len,
-                  const float failRate = 1e-3,
-                  const float thres = 0.1);
+  static double getDelta(const real* d1,
+                         const real* d2,
+                         size_t len,
+                         const float failRate = 1e-3,
+                         const float thres = 0.1);
 };

 }  //  namespace paddle
--- a/paddle/gserver/tests/mkldnn_branches_conv.conf
+++ b/paddle/gserver/tests/mkldnn_branches_conv.conf
+# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddle.trainer_config_helpers import *
+
+settings(batch_size=16)
+channels = get_config_arg("channels", int, 2)
+
+def two_conv(input, group_name):
+  out1 = img_conv_layer(input=input,
+            name=group_name+'_conv1',
+            filter_size=1,
+            num_filters=channels,
+            padding=0,
+            shared_biases=True,
+            act=ReluActivation())
+
+  out2 = img_conv_layer(input=input,
+            name=group_name+'_conv2',
+            filter_size=3,
+            num_filters=channels,
+            padding=1,
+            shared_biases=True,
+            act=ReluActivation())
+  return out1, out2
+
+data = data_layer(name ="input", size=channels*16*16)
+
+conv = img_conv_layer(input=data,
+            num_channels=channels,
+            filter_size=3,
+            num_filters=channels,
+            padding=1,
+            shared_biases=True,
+            act=ReluActivation())
+
+a1, a2 = two_conv(input=conv, group_name='a')
+
+concat = concat_layer(input=[a1, a2])
+
+b1, b2 = two_conv(input=conv, group_name='b')
+
+addto = addto_layer(input=[b1, b2])
+
+outputs([concat, addto])
--- a/paddle/gserver/tests/mkldnn_branches_fc.conf
+++ b/paddle/gserver/tests/mkldnn_branches_fc.conf
+# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddle.trainer_config_helpers import *
+
+settings(batch_size=16)
+channels = get_config_arg("channels", int, 2)
+
+def two_fc(input, group_name):
+  out1 = fc_layer(input=input,
+            name=group_name+'_fc1',
+            size=channels,
+            bias_attr=False,
+            act=LinearActivation())
+
+  out2 = fc_layer(input=input,
+            name=group_name+'_fc2',
+            size=channels,
+            bias_attr=False,
+            act=LinearActivation())
+  return out1, out2
+
+data = data_layer(name ="input", size=channels*16*16)
+
+conv = img_conv_layer(input=data,
+            num_channels=channels,
+            filter_size=3,
+            num_filters=channels,
+            padding=1,
+            shared_biases=True,
+            act=LinearActivation())
+
+pool = img_pool_layer(input=conv,
+            pool_size=3,
+            stride=2,
+            padding=1,
+            pool_type=AvgPooling())
+
+a1, a2 = two_fc(input=pool, group_name='a')
+
+concat = concat_layer(input=[a1, a2])
+
+b1, b2 = two_fc(input=pool, group_name='b')
+
+addto = addto_layer(input=[b1, b2])
+
+outputs([concat, addto])
--- a/paddle/gserver/tests/mkldnn_branches_pool.conf
+++ b/paddle/gserver/tests/mkldnn_branches_pool.conf
+# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddle.trainer_config_helpers import *
+
+settings(batch_size=16)
+channels = get_config_arg("channels", int, 2)
+
+def two_pool(input, group_name):
+  out1 = img_pool_layer(input=input,
+            name=group_name+'_pool1',
+            pool_size=3,
+            stride=2,
+            padding=0,
+            pool_type=MaxPooling())
+
+  out2 = img_pool_layer(input=input,
+            name=group_name+'_pool2',
+            pool_size=5,
+            stride=2,
+            padding=1,
+            pool_type=MaxPooling())
+  return out1, out2
+
+data = data_layer(name ="input", size=channels*16*16)
+
+conv = img_conv_layer(input=data,
+            num_channels=channels,
+            filter_size=3,
+            num_filters=channels,
+            padding=1,
+            shared_biases=True,
+            act=LinearActivation())
+
+pool = img_pool_layer(input=conv,
+            pool_size=3,
+            stride=1,
+            padding=1,
+            pool_type=AvgPooling())
+
+a1, a2 = two_pool(input=pool, group_name='a')
+
+concat = concat_layer(input=[a1, a2])
+
+b1, b2 = two_pool(input=pool, group_name='b')
+
+addto = addto_layer(input=[b1, b2])
+
+outputs([concat, addto])
--- a/paddle/gserver/tests/test_MKLDNN.cpp
+++ b/paddle/gserver/tests/test_MKLDNN.cpp
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */

 #include <gtest/gtest.h>
+#include <paddle/utils/PythonUtil.h>
 #include <string>
 #include <vector>
 #include "MKLDNNTester.h"
@@ -40,12 +41,13 @@ DECLARE_bool(use_mkldnn);
 struct testFcDesc {
  int bs;
  int ic;
-  int oc;
  int ih, iw;  // oh == ow == 1
+  int oc;
 };

 static void getMKLDNNFcConfig(TestConfig& cfg, const testFcDesc& pm) {
  cfg.layerConfig.set_type("mkldnn_fc");
+  cfg.layerConfig.set_active_type("relu");
  cfg.layerConfig.set_size(pm.oc);
  cfg.inputDefs.push_back(
      {INPUT_DATA,
@@ -86,6 +88,7 @@ struct testConvDesc {

 static void getMKLDNNConvConfig(TestConfig& cfg, const testConvDesc& pm) {
  cfg.layerConfig.set_type("mkldnn_conv");
+  cfg.layerConfig.set_active_type("relu");
  cfg.layerConfig.set_num_filters(pm.oc);
  cfg.layerConfig.set_size(pm.oc * pm.oh * pm.ow);
  cfg.layerConfig.set_shared_biases(true);
@@ -158,6 +161,7 @@ struct testPoolDesc {

 static void getMKLDNNPoolConfig(TestConfig& cfg, const testPoolDesc& pm) {
  cfg.layerConfig.set_type("mkldnn_pool");
+  cfg.layerConfig.set_active_type("relu");
  cfg.layerConfig.set_size(pm.ic * pm.oh * pm.ow);
  cfg.inputDefs.push_back(
      {INPUT_DATA,
@@ -244,13 +248,26 @@ TEST(MKLDNNActivation, Activations) {
  }
 }

-// TODO(TJ): add branch test
+DECLARE_string(config_args);
+TEST(MKLDNNLayer, branches) {
+  std::vector<std::string> cases = {"conv", "pool", "fc"};
+  for (auto name : cases) {
+    std::string config = "./gserver/tests/mkldnn_branches_" + name + ".conf";
+    for (auto channels : {2, 32}) {
+      std::ostringstream oss;
+      oss << "channels=" << channels;
+      FLAGS_config_args = oss.str();
+      MKLDNNTester::runBranchesTest(config);
+    }
+  }
+}

 int main(int argc, char** argv) {
  testing::InitGoogleTest(&argc, argv);
  FLAGS_use_gpu = false;
  FLAGS_use_mkldnn = true;
  initMain(argc, argv);
+  initPython(argc, argv);
  FLAGS_thread_local_rand_use_global_seed = true;
  srand(1);
  return RUN_ALL_TESTS();

--- a/paddle/operators/accuracy_op.cc
+++ b/paddle/operators/accuracy_op.cc
@@ -69,5 +69,8 @@ information, or not. But the output only shares the LoD with input `Inference`.

 namespace ops = paddle::operators;
 REGISTER_OP_WITHOUT_GRADIENT(accuracy, ops::AccuracyOp, ops::AccuracyOpMaker);
-REGISTER_OP_CPU_KERNEL(accuracy,
-                       ops::AccuracyKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_CPU_KERNEL(
+    accuracy, ops::AccuracyKernel<paddle::platform::CPUPlace, float>,
+    ops::AccuracyKernel<paddle::platform::CPUPlace, int>,
+    ops::AccuracyKernel<paddle::platform::CPUPlace, double>,
+    ops::AccuracyKernel<paddle::platform::CPUPlace, int64_t>);
--- a/paddle/operators/accuracy_op.cu
+++ b/paddle/operators/accuracy_op.cu
@@ -21,9 +21,9 @@ namespace paddle {
 namespace operators {
 using platform::PADDLE_CUDA_NUM_THREADS;

-template <int BlockSize>
-__global__ void AccuracyCudaKernel(const int N, const int D, const int* Xdata,
-                                   const int* labeldata, float* accuracy) {
+template <typename T, int BlockSize>
+__global__ void AccuracyCudaKernel(const int N, const int D, const T* Xdata,
+                                   const T* labeldata, float* accuracy) {
  int count = 0;
  __shared__ int total[BlockSize];

@@ -57,8 +57,8 @@ class AccuracyOpCUDAKernel : public framework::OpKernel<T> {
    auto* accuracy = ctx.Output<Tensor>("Accuracy");
    // FIXME(typhoonzero): only support indices currently
    // if add support for output values, how to detect the data type?
-    const int* inference_data = inference->data<int>();
-    const int* label_data = label->data<int>();
+    const T* inference_data = inference->data<T>();
+    const T* label_data = label->data<T>();
    float* accuracy_data = accuracy->mutable_data<float>(ctx.GetPlace());

    size_t num_samples = inference->dims()[0];
@@ -69,7 +69,7 @@ class AccuracyOpCUDAKernel : public framework::OpKernel<T> {
      return;
    }

-    AccuracyCudaKernel<PADDLE_CUDA_NUM_THREADS><<<
+    AccuracyCudaKernel<T, PADDLE_CUDA_NUM_THREADS><<<
        1, PADDLE_CUDA_NUM_THREADS, 0,
        reinterpret_cast<const platform::CUDADeviceContext&>(
            ctx.device_context())
@@ -81,5 +81,7 @@ class AccuracyOpCUDAKernel : public framework::OpKernel<T> {
 }  // namespace operators
 }  // namespace paddle

-REGISTER_OP_GPU_KERNEL(accuracy,
-                       paddle::operators::AccuracyOpCUDAKernel<float>);
+REGISTER_OP_GPU_KERNEL(accuracy, paddle::operators::AccuracyOpCUDAKernel<float>,
+                       paddle::operators::AccuracyOpCUDAKernel<double>,
+                       paddle::operators::AccuracyOpCUDAKernel<int>,
+                       paddle::operators::AccuracyOpCUDAKernel<int64_t>);
--- a/paddle/operators/adam_op.cc
+++ b/paddle/operators/adam_op.cc
@@ -43,10 +43,6 @@ class AdamOp : public framework::OperatorWithKernel {
                   "Output(Moment1Out) of AdamOp should not be null.");
    PADDLE_ENFORCE(ctx->HasOutput("Moment2Out"),
                   "Output(Moment2Out) of AdamOp should not be null.");
-    PADDLE_ENFORCE(ctx->HasOutput("Beta1PowOut"),
-                   "Output(Beta1PowOut) of AdamOp should not be null.");
-    PADDLE_ENFORCE(ctx->HasOutput("Beta2PowOut"),
-                   "Output(Beta2PowOut) of AdamOp should not be null.");

    auto lr_dims = ctx->GetInputDim("LearningRate");
    PADDLE_ENFORCE_EQ(framework::product(lr_dims), 1,
@@ -72,8 +68,6 @@ class AdamOp : public framework::OperatorWithKernel {
    ctx->SetOutputDim("ParamOut", param_dims);
    ctx->SetOutputDim("Moment1Out", param_dims);
    ctx->SetOutputDim("Moment2Out", param_dims);
-    ctx->SetOutputDim("Beta1PowOut", beta1_pow_dims);
-    ctx->SetOutputDim("Beta2PowOut", beta2_pow_dims);
  }
 };

@@ -92,8 +86,6 @@ class AdamOpMaker : public framework::OpProtoAndCheckerMaker {
    AddOutput("ParamOut", "(Tensor) Output parameter");
    AddOutput("Moment1Out", "(Tensor) Output first moment");
    AddOutput("Moment2Out", "(Tensor) Output second moment");
-    AddOutput("Beta1PowOut", "(Tensor) Output beta1 power accumulator");
-    AddOutput("Beta2PowOut", "(Tensor) Output beta2 power accumulator");

    AddAttr<float>("beta1",
                   "(float, default 0.9) "
@@ -121,10 +113,8 @@ Adam updates:

 moment1_out = beta1 * moment1 + (1 − beta1) * grad
 moment2_out = beta2 * moment2 + (1 − beta2) * grad * grad
-beta1_pow_out = beta1_pow * beta1
-beta2_pow_out = beta2_pow * beta2
 learning_rate_t = learning_rate_t *
-                  sqrt(1 - beta2_pow_out) / (1 - beta1_pow_out)
+                  sqrt(1 - beta2_pow) / (1 - beta1_pow)
 param_out = param - learning_rate_t * moment1/ (sqrt(moment2) + epsilon)

 References:

--- a/paddle/operators/adam_op.h
+++ b/paddle/operators/adam_op.h
@@ -26,14 +26,10 @@ class AdamOpKernel : public framework::OpKernel<T> {
    auto param_out_tensor = ctx.Output<framework::Tensor>("ParamOut");
    auto moment1_out_tensor = ctx.Output<framework::Tensor>("Moment1Out");
    auto moment2_out_tensor = ctx.Output<framework::Tensor>("Moment2Out");
-    auto beta1_pow_out_tensor = ctx.Output<framework::Tensor>("Beta1PowOut");
-    auto beta2_pow_out_tensor = ctx.Output<framework::Tensor>("Beta2PowOut");

    param_out_tensor->mutable_data<T>(ctx.GetPlace());
    moment1_out_tensor->mutable_data<T>(ctx.GetPlace());
    moment2_out_tensor->mutable_data<T>(ctx.GetPlace());
-    beta1_pow_out_tensor->mutable_data<T>(ctx.GetPlace());
-    beta2_pow_out_tensor->mutable_data<T>(ctx.GetPlace());

    float beta1 = ctx.Attr<float>("beta1");
    float beta2 = ctx.Attr<float>("beta2");
@@ -56,18 +52,13 @@ class AdamOpKernel : public framework::OpKernel<T> {
    auto param_out = framework::EigenVector<T>::Flatten(*param_out_tensor);
    auto moment1_out = framework::EigenVector<T>::Flatten(*moment1_out_tensor);
    auto moment2_out = framework::EigenVector<T>::Flatten(*moment2_out_tensor);
-    auto beta1_pow_out =
-        framework::EigenVector<T>::Flatten(*beta1_pow_out_tensor);
-    auto beta2_pow_out =
-        framework::EigenVector<T>::Flatten(*beta2_pow_out_tensor);
    auto place = ctx.GetEigenDevice<Place>();

    moment1_out.device(place) = beta1 * moment1 + (1 - beta1) * grad;
    moment2_out.device(place) = beta2 * moment2 + (1 - beta2) * grad.square();
-    beta1_pow_out.device(place) = beta1_pow * beta1;
-    beta2_pow_out.device(place) = beta2_pow * beta2;
+
    // All of these are tensors of 1 element
-    auto lr_t = lr * (1 - beta2_pow_out).sqrt() / (1 - beta1_pow_out);
+    auto lr_t = lr * (1 - beta2_pow).sqrt() / (1 - beta1_pow);
    // Eigen does not support automatic broadcast
    // Get dimensions of moment vector to broadcast lr_t
    Eigen::DSizes<int, 1> m_dsize(moment1_out_tensor->numel());

--- a/paddle/operators/adamax_op.cc
+++ b/paddle/operators/adamax_op.cc
@@ -41,8 +41,6 @@ class AdamaxOp : public framework::OperatorWithKernel {
                   "Output(MomentOut) of AdamaxOp should not be null.");
    PADDLE_ENFORCE(ctx->HasOutput("InfNormOut"),
                   "Output(InfNormOut) of AdamaxOp should not be null.");
-    PADDLE_ENFORCE(ctx->HasOutput("Beta1PowOut"),
-                   "Output(Beta1PowOut) of AdamaxOp should not be null.");

    auto lr_dims = ctx->GetInputDim("LearningRate");
    PADDLE_ENFORCE_EQ(framework::product(lr_dims), 1,
@@ -64,7 +62,6 @@ class AdamaxOp : public framework::OperatorWithKernel {
    ctx->SetOutputDim("ParamOut", param_dims);
    ctx->SetOutputDim("MomentOut", param_dims);
    ctx->SetOutputDim("InfNormOut", param_dims);
-    ctx->SetOutputDim("Beta1PowOut", beta1_pow_dims);
  }
 };

@@ -86,7 +83,6 @@ class AdamaxOpMaker : public framework::OpProtoAndCheckerMaker {
    AddOutput("InfNormOut",
              "(Tensor) "
              "Output exponentially weighted infinity norm");
-    AddOutput("Beta1PowOut", "(Tensor) Output beta1 power accumulator");

    AddAttr<float>("beta1",
                   "(float, default 0.9) "
@@ -113,8 +109,7 @@ Adamax updates:

 moment_out = beta1 * moment + (1 - beta1) * grad
 inf_norm_out = max(beta2 * inf_norm + epsilon, abs(grad))
-beta1_pow_out = beta1_pow * beta1
-learning_rate_t = learning_rate/(1 - beta1_pow_out)
+learning_rate_t = learning_rate/(1 - beta1_pow)
 param_out = param - learning_rate_t * moment_out/inf_norm_out

 The original paper does not have an epsilon attribute.

--- a/paddle/operators/adamax_op.h
+++ b/paddle/operators/adamax_op.h
@@ -26,12 +26,10 @@ class AdamaxOpKernel : public framework::OpKernel<T> {
    auto param_out_tensor = ctx.Output<framework::Tensor>("ParamOut");
    auto moment_out_tensor = ctx.Output<framework::Tensor>("MomentOut");
    auto inf_norm_out_tensor = ctx.Output<framework::Tensor>("InfNormOut");
-    auto beta1_pow_out_tensor = ctx.Output<framework::Tensor>("Beta1PowOut");

    param_out_tensor->mutable_data<T>(ctx.GetPlace());
    moment_out_tensor->mutable_data<T>(ctx.GetPlace());
    inf_norm_out_tensor->mutable_data<T>(ctx.GetPlace());
-    beta1_pow_out_tensor->mutable_data<T>(ctx.GetPlace());

    float beta1 = ctx.Attr<float>("beta1");
    float beta2 = ctx.Attr<float>("beta2");
@@ -53,15 +51,12 @@ class AdamaxOpKernel : public framework::OpKernel<T> {
    auto moment_out = framework::EigenVector<T>::Flatten(*moment_out_tensor);
    auto inf_norm_out =
        framework::EigenVector<T>::Flatten(*inf_norm_out_tensor);
-    auto beta1_pow_out =
-        framework::EigenVector<T>::Flatten(*beta1_pow_out_tensor);
    auto place = ctx.GetEigenDevice<Place>();

    moment_out.device(place) = beta1 * moment + (1 - beta1) * grad;
    inf_norm_out.device(place) =
        grad.abs().cwiseMax((beta2 * inf_norm) + epsilon);
-    beta1_pow_out.device(place) = beta1_pow * beta1;
-    auto lr_t = lr / (1 - beta1_pow_out);
+    auto lr_t = lr / (1 - beta1_pow);
    Eigen::DSizes<int, 1> m_dsize(moment_out_tensor->numel());
    param_out.device(place) =
        param - lr_t.broadcast(m_dsize) * (moment_out / inf_norm_out);

--- a/paddle/operators/batch_norm_op.md
+++ b/paddle/operators/batch_norm_op.md
+# Batch Normalization
+
+## What is batch normalization
+
+Batch normalization is a frequently-used method in deep network training. It adjusts the mean and variance of a layer's output, and make the data distribution easier for next layer's training. 
+
+The principle of batch normalization can be summarized into a simple function:
+
+```
+y = (x - E[x]) / STD[x]) * scale + bias
+```
+
+`x` is a batch of output data of a certain layer. `E[x]` and `STD[x]` is the mean and standard deviation of `x`, respectively。 `scale` and `bias` are two trainable parameters. The training of batch normalization layer equals to the learning of best values of `scale` and `bias`.
+
+In our design, we use a single operator(`batch_norm_op`) to implement the whole batch normalization in C++, and wrap it as a layer in Python.
+
+## Differences with normal operators
+
+`batch_norm_op` is a single operator. However, there are a few differences between `BatchNormOp` and normal operators, which we shall take into consideration in our design.
+
+1. `batch_norm_op` shall behave differently in training and inferencing. For example, during inferencing, there is no batch data and it's impossible to compute `E[x]` and `STD[x]`, so we have to use an `estimated_mean` and an `estimated_variance` instead of them. These require our framework to be able to inform operators current running type (training/inferencing), then operators can switch their behaviors.
+
+2. `batch_norm_op` shall have the ability to maintain `estimated_mean` and `estimated_variance` across mini-batch. In each mini-batch, `estimated_mean` is iterated by the following equations:
+
+```
+if batch_id == 0
+  estimated_mean = E[x]
+else
+  estimated_mean = estimated_mean * momentum + (1.0 - momentum_) * E[x]
+```
+
+The iterating of `estimated_variance` is similar. `momentum` is an attribute, which controls estimated_mean updating speed.
+
+## Implementation
+
+Batch normalization is designed as a single operator is C++, and then wrapped as a layer in Python.
+
+### C++
+
+As most C++ operators do, `batch_norm_op` is defined by inputs, outputs, attributes and compute kernels.
+
+#### Inputs
+
+- `x`: The inputs data, which is generated by the previous layer.
+- `estimated_mean`: The estimated mean of all previous data batches. It is updated in each forward propagation and will be used in inferencing to take the role of `E[x]`.
+- `estimated_var`: The estimated standard deviation of all previous data batches. It is updated in each forward propagation and will be used in inferencing to take the role of `STD[x]`.
+- `scale`: trainable parameter 'scale'
+- `bias`: trainable parameter 'bias'
+
+#### Outputs
+
+- `y`: The output data.
+- `batch_mean`: The mean value of batch data.
+- `batch_var`: The standard deviation value of batch data.
+- `saved_mean`: Updated `estimated_mean` with current batch data. It's supposed to share the memory with input `estimated_mean`.
+- `saved_var`: Updated `estimated_var` with current batch data. It's supposed to share the memory with input `estimated_var`.
+
+#### Attributes
+
+- `is_infer`: *bool*. If true, run `batch_norm_op` in inferencing mode.
+- `use_global_est`: *bool*. If true, use `saved_mean` and `saved_var` instead of `E[x]` and `STD[x]` in trainning.
+- `epsilon`: *float*. The epsilon value to avoid division by zero.
+- `momentum`: *float*. Factor used in `estimated_mean` and `estimated_var` updating. The usage is shown above.
+
+#### Kernels
+
+The following graph showes the training computational process of `batch_norm_op`:
+
+<img src="./images/batch_norm_op_kernel.png" width="800"/>
+
+cudnn provides APIs to finish the whole series of computation, we can use them in our GPU kernel.
+
+### Python
+
+`batch_norm_op` is warpped as a layer in Python:
+
+```python 
+def batch_norm_layer(net, 
+                     input,
+                     output, 
+                     scale, 
+                     bias, 
+                     use_global_est = False, 
+                     epsilon = 1e-6,
+                     momentum = 0.99):
+	mean_cache = scope.new_var(name = 'estimated_mean', trainable = False)
+	var_cache = scop.new_var(name = 'estimated_var', trainable = False)
+	batch_mean = scope.new_var(name = 'batch_mean')
+	batch_var = scope.new_var(name = 'batch_var')
+	batch_norm_op = Operator('batch_norm_op',
+	                         x = input,
+	                         estimated_mean = mean_cache,
+	                         estimated_mean = var_cache,
+	                         scale = scale,
+	                         bias = bias,
+	                         y = output,
+	                         batch_mean = batch_mean,
+	                         batch_var = batch_var,
+	                         saved_mean = mean_cache,
+	                         saved_var = var_cache,
+	                         is_infer = False,
+	                         use_global_est = use_global_est,
+	                         epsilon = epsilon,
+	                         momentum = momentum)
+	net.append_op(batch_norm_op)
+	return output
+```
+
+Because Python API has not been finally decided, the code above can be regarded as pseudo code. There are a few key points we shall note:
+
+1. `estimated_mean` and `estimated_var` are assigned the same variables with `saved_mean` and `saved_var` respectively. So they share same the memories. The output mean and variance values(`saved_mean` and `saved_var`) of a certain batch will be the inputs(`estimated_mean` and `estimated_var`) of the next batch.
+
+2. `is_infer` decided whether `batch_norm_op` will run in training mode or inferencing mode. However, a network may contains both training and inferencing parts. And user may switch `batch_norm_op`'s running mode in Python `for` loop like this:
+
+```python
+for pass_id in range(PASS_NUM):
+    # ...
+    net.train()  # run training model
+    if pass_id % 100 == 0:
+        net.infer(test_image)    # run inferencing model
+    # ...
+``` 
+
+`is_infer` is an attribute. Once an operator is created, its attributes can not be changed. It suggests us that we shall maintain two `batch_norm_op` in the model, one's `is_infer` is `True`(we call it `infer_batch_norm_op`) and the other one's is `False`(we call it `train_batch_norm_op`). They share all parameters and variables, but be placed in two different branches. That is to say, if a network contains a `batch_norm_op`, it will fork into two branches, one go through `train_batch_norm_op` and the other one go through `infer_batch_norm_op`:
+
+<div align=center>
+<img src="./images/batch_norm_fork.png" width="500"/>
+</div>
+
+Just like what is shown in the above graph, the net forks before `batch_norm_op` and will never merge again. All the operators after `batch_norm_op` will duplicate. 
+
+When the net runs in training mode, the end of the left branch will be set as the running target, so the dependency tracking process will ignore right branch automatically. When the net runs in inferencing mode, the process is reversed.
+
+How to set a target is related to Python API design, so I will leave it here waiting for more discussions.
--- a/paddle/operators/clip_op.cc
+++ b/paddle/operators/clip_op.cc
@@ -27,8 +27,8 @@ class ClipOp : public framework::OperatorWithKernel {
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
                   "Output(Out) of ClipOp should not be null.");
    auto x_dims = ctx->GetInputDim("X");
-    auto max = Attr<float>("max");
-    auto min = Attr<float>("min");
+    auto max = ctx->Attrs().Get<float>("max");
+    auto min = ctx->Attrs().Get<float>("min");
    PADDLE_ENFORCE_LT(min, max, "max should be greater than min.");
    ctx->SetOutputDim("Out", x_dims);
    ctx->ShareLoD("X", /*->*/ "Out");

--- a/paddle/operators/conv2d_op.h
+++ b/paddle/operators/conv2d_op.h
@@ -108,17 +108,17 @@ class GemmConv2DKernel : public framework::OpKernel<T> {
    int in_step = input_channels / groups;
    int out_step = output_channels / groups;
    for (int i = 0; i < batch_size; i++) {
-      Tensor in_batch = input->Slice<T>(i, i + 1).Resize(input_shape);
-      Tensor out_batch = output->Slice<T>(i, i + 1).Resize(output_matrix_shape);
+      Tensor in_batch = input->Slice(i, i + 1).Resize(input_shape);
+      Tensor out_batch = output->Slice(i, i + 1).Resize(output_matrix_shape);
      for (int g = 0; g < groups; g++) {
        // im2col
-        Tensor in_slice = in_batch.Slice<T>(g * in_step, (g + 1) * in_step);
+        Tensor in_slice = in_batch.Slice(g * in_step, (g + 1) * in_step);
        im2col(context.device_context(), in_slice, col, strides[0], strides[1],
               paddings[0], paddings[1]);

        // gemm
-        Tensor out_slice = out_batch.Slice<T>(g * out_step, (g + 1) * out_step);
-        Tensor filter_slice = filter.Slice<T>(g * out_step, (g + 1) * out_step);
+        Tensor out_slice = out_batch.Slice(g * out_step, (g + 1) * out_step);
+        Tensor filter_slice = filter.Slice(g * out_step, (g + 1) * out_step);
        math::matmul<Place, T>(context.device_context(), filter_slice, false,
                               col_matrix, false, T(1.0), &out_slice, T(0.0));
      }
@@ -198,22 +198,20 @@ class GemmConvGrad2DKernel : public framework::OpKernel<T> {

      for (int i = 0; i < batch_size; i++) {
        Tensor out_grad_batch =
-            output_grad->Slice<T>(i, i + 1).Resize(output_matrix_shape);
-        Tensor in_grad_batch =
-            input_grad->Slice<T>(i, i + 1).Resize(input_shape);
+            output_grad->Slice(i, i + 1).Resize(output_matrix_shape);
+        Tensor in_grad_batch = input_grad->Slice(i, i + 1).Resize(input_shape);
        for (int g = 0; g < groups; g++) {
          // gemm
          Tensor out_grad_slice =
-              out_grad_batch.Slice<T>(g * out_step, (g + 1) * out_step);
-          Tensor filter_slice =
-              filter.Slice<T>(g * out_step, (g + 1) * out_step);
+              out_grad_batch.Slice(g * out_step, (g + 1) * out_step);
+          Tensor filter_slice = filter.Slice(g * out_step, (g + 1) * out_step);
          math::matmul<Place, T>(context.device_context(), filter_slice, true,
                                 out_grad_slice, false, T(1.0), &col_matrix,
                                 T(0.0));

          // col2im
          Tensor in_grad_slice =
-              in_grad_batch.Slice<T>(g * in_step, (g + 1) * in_step);
+              in_grad_batch.Slice(g * in_step, (g + 1) * in_step);
          col2im(context.device_context(), in_grad_slice, col, strides[0],
                 strides[1], paddings[0], paddings[1]);
        }
@@ -229,19 +227,19 @@ class GemmConvGrad2DKernel : public framework::OpKernel<T> {

      for (int i = 0; i < batch_size; i++) {
        Tensor out_grad_batch =
-            output_grad->Slice<T>(i, i + 1).Resize(output_matrix_shape);
-        Tensor in_batch = input->Slice<T>(i, i + 1).Resize(input_shape);
+            output_grad->Slice(i, i + 1).Resize(output_matrix_shape);
+        Tensor in_batch = input->Slice(i, i + 1).Resize(input_shape);
        for (int g = 0; g < groups; g++) {
          // im2col
          Tensor out_grad_slice =
-              out_grad_batch.Slice<T>(g * out_step, (g + 1) * out_step);
-          Tensor in_slice = in_batch.Slice<T>(g * in_step, (g + 1) * in_step);
+              out_grad_batch.Slice(g * out_step, (g + 1) * out_step);
+          Tensor in_slice = in_batch.Slice(g * in_step, (g + 1) * in_step);
          im2col(context.device_context(), in_slice, col, strides[0],
                 strides[1], paddings[0], paddings[1]);

          // gemm
          Tensor filter_grad_slice =
-              filter_grad_.Slice<T>(g * out_step, (g + 1) * out_step);
+              filter_grad_.Slice(g * out_step, (g + 1) * out_step);
          math::matmul<Place, T>(context.device_context(), out_grad_slice,
                                 false, col_matrix, true, T(1.0),
                                 &filter_grad_slice, T(1.0));

--- a/paddle/operators/dynamic_recurrent_op.cc
+++ b/paddle/operators/dynamic_recurrent_op.cc
--- a/paddle/operators/dynamic_recurrent_op.h
+++ b/paddle/operators/dynamic_recurrent_op.h
--- a/paddle/operators/dynamic_recurrent_op_test.cc
+++ b/paddle/operators/dynamic_recurrent_op_test.cc
--- a/paddle/operators/elementwise_op_function.h
+++ b/paddle/operators/elementwise_op_function.h
@@ -108,7 +108,7 @@ void ElementwiseCompute(const framework::ExecutionContext& ctx) {
  PADDLE_ENFORCE_GE(x_dims.size(), y_dims.size(),
                    "Rank of first input must >= rank of second input.")

-  if (x_dims == y_dims || product(y_dims) == 1) {
+  if (x_dims == y_dims) {
    functor f;
    f.template Run<Place, T>(x, y, z, ctx);
    return;
@@ -174,12 +174,6 @@ void ElementwiseGradCompute(const framework::ExecutionContext& ctx) {
    return;
  }

-  if (product(y_dims) == 1) {
-    functor1 f;
-    f(place, x, y, out, dx, dy, dout);
-    return;
-  }
-
  int axis = ctx.Attr<int>("axis");
  axis = (axis == -1 ? x_dims.size() - y_dims.size() : axis);


--- a/paddle/operators/feed_op.cc
+++ b/paddle/operators/feed_op.cc
@@ -26,8 +26,9 @@ class FeedOp : public framework::OperatorBase {
      : OperatorBase(type, inputs, outputs, attrs) {}
  void Run(const framework::Scope &scope,
           const platform::DeviceContext &dev_ctx) const override {
-    auto feed_var_name = Input("Input");
+    auto feed_var_name = Input("X");
    auto *feed_var = scope.FindVar(feed_var_name);
+
    PADDLE_ENFORCE(feed_var != nullptr,
                   "Cannot find feed_var in scope, feed_var_name is %s",
                   feed_var_name);
@@ -40,18 +41,32 @@ class FeedOp : public framework::OperatorBase {

    auto col = Attr<int>("col");

+    VLOG(3) << "Feed Var " << feed_var_name << "'s " << col << " column to var"
+            << out_name;
+
    auto &feed_list = feed_var->Get<framework::FeedFetchList>();
    auto &feed_item = feed_list.at(static_cast<size_t>(col));
    auto *out_item = out_var->GetMutable<framework::FeedFetchType>();
-    out_item->CopyFromTensor(feed_item, dev_ctx.GetPlace(), dev_ctx);
+    out_item->CopyFrom(feed_item, dev_ctx.GetPlace(), dev_ctx);
    out_item->set_lod(feed_item.lod());
  }
 };

+class FeedOpInfoMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  FeedOpInfoMaker(framework::OpProto *proto,
+                  framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "The input of feed op");
+    AddOutput("Out", "The output of feed op");
+    AddComment("feed op, it should not be configured by users directly");
+    AddAttr<int>("col", "column of feed");
+  }
+};
+
 }  // namespace operators
 }  // namespace paddle

-// We do not need to register OpInfoMaker,
-// since feed operator will not be used by end users directly
 REGISTER_OPERATOR(feed, paddle::operators::FeedOp,
-                  paddle::framework::EmptyGradOpMaker);
+                  paddle::framework::EmptyGradOpMaker,
+                  paddle::operators::FeedOpInfoMaker);
--- a/paddle/operators/fetch_op.cc
+++ b/paddle/operators/fetch_op.cc
@@ -27,7 +27,7 @@ class FetchOp : public framework::OperatorBase {

  void Run(const framework::Scope &scope,
           const platform::DeviceContext &dev_ctx) const override {
-    auto fetch_var_name = Input("Input");
+    auto fetch_var_name = Input("X");
    auto *fetch_var = scope.FindVar(fetch_var_name);
    PADDLE_ENFORCE(fetch_var != nullptr,
                   "Cannot find fetch variable in scope, fetch_var_name is %s",
@@ -51,14 +51,26 @@ class FetchOp : public framework::OperatorBase {

    // FIXME(yuyang18): Should we assume the fetch operator always generate
    // CPU outputs?
-    dst_item.CopyFromTensor(src_item, platform::CPUPlace(), dev_ctx);
+    dst_item.CopyFrom(src_item, platform::CPUPlace(), dev_ctx);
+
+    VLOG(3) << "Fetch variable " << fetch_var_name << " to " << out_name;
  }
 };

+class FetchOpInfoMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  FetchOpInfoMaker(framework::OpProto *proto,
+                   framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "The input of fetch op");
+    AddOutput("Out", "The output of fetch op");
+    AddComment("fetch op, it should not be configured by users directly");
+    AddAttr<int>("col", "column of fetch");
+  }
+};
 }  // namespace operators
 }  // namespace paddle

-// We do not need to register OpInfoMaker,
-// since fetch operator will not be used by end users directly
 REGISTER_OPERATOR(fetch, paddle::operators::FetchOp,
-                  paddle::framework::EmptyGradOpMaker);
+                  paddle::framework::EmptyGradOpMaker,
+                  paddle::operators::FetchOpInfoMaker);
--- a/paddle/operators/gaussian_random_op.cc
+++ b/paddle/operators/gaussian_random_op.cc
@@ -59,7 +59,7 @@ class GaussianRandomOp : public framework::OperatorWithKernel {
 protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
-    return static_cast<framework::DataType>(Attr<int>("data_type"));
+    return static_cast<framework::DataType>(ctx.Attr<int>("data_type"));
  }
 };


--- a/paddle/operators/images/batch_norm_fork.dot
+++ b/paddle/operators/images/batch_norm_fork.dot
--- a/paddle/operators/images/batch_norm_fork.png
+++ b/paddle/operators/images/batch_norm_fork.png
--- a/paddle/operators/images/batch_norm_op_kernel.png
+++ b/paddle/operators/images/batch_norm_op_kernel.png
--- a/paddle/operators/increment_op.cc
+++ b/paddle/operators/increment_op.cc
--- a/paddle/operators/increment_op.cu
+++ b/paddle/operators/increment_op.cu
--- a/paddle/operators/increment_op.h
+++ b/paddle/operators/increment_op.h
--- a/paddle/operators/linear_chain_crf_op.cc
+++ b/paddle/operators/linear_chain_crf_op.cc
--- a/paddle/operators/math/im2col_test.cc
+++ b/paddle/operators/math/im2col_test.cc
--- a/paddle/operators/math/math_function.cc
+++ b/paddle/operators/math/math_function.cc
--- a/paddle/operators/math/math_function.cu
+++ b/paddle/operators/math/math_function.cu
--- a/paddle/operators/math/math_function.h
+++ b/paddle/operators/math/math_function.h
--- a/paddle/operators/math/math_function_test.cu
+++ b/paddle/operators/math/math_function_test.cu
--- a/paddle/operators/math/matmul.h
+++ b/paddle/operators/math/matmul.h
--- a/paddle/operators/math/selected_rows_functor_test.cu
+++ b/paddle/operators/math/selected_rows_functor_test.cu
--- a/paddle/operators/math/vol2col_test.cc
+++ b/paddle/operators/math/vol2col_test.cc
--- a/paddle/operators/matmul_op.cc
+++ b/paddle/operators/matmul_op.cc
--- a/paddle/operators/matmul_op.cu
+++ b/paddle/operators/matmul_op.cu
--- a/paddle/operators/matmul_op.h
+++ b/paddle/operators/matmul_op.h
--- a/paddle/operators/momentum_op.cc
+++ b/paddle/operators/momentum_op.cc
--- a/paddle/operators/momentum_op.cu
+++ b/paddle/operators/momentum_op.cu
--- a/paddle/operators/momentum_op.h
+++ b/paddle/operators/momentum_op.h
--- a/paddle/operators/mul_op.cc
+++ b/paddle/operators/mul_op.cc
--- a/paddle/operators/mul_op.h
+++ b/paddle/operators/mul_op.h
--- a/paddle/operators/multiplex_op.cu
+++ b/paddle/operators/multiplex_op.cu
--- a/paddle/operators/proximal_gd_op.cc
+++ b/paddle/operators/proximal_gd_op.cc
--- a/paddle/operators/proximal_gd_op.cu
+++ b/paddle/operators/proximal_gd_op.cu
--- a/paddle/operators/proximal_gd_op.h
+++ b/paddle/operators/proximal_gd_op.h
--- a/paddle/operators/recurrent_op.cc
+++ b/paddle/operators/recurrent_op.cc
--- a/paddle/operators/reduce_op.cc
+++ b/paddle/operators/reduce_op.cc
--- a/paddle/operators/reshape_op.h
+++ b/paddle/operators/reshape_op.h
--- a/paddle/operators/rnn/recurrent_op_utils.cc
+++ b/paddle/operators/rnn/recurrent_op_utils.cc
--- a/paddle/operators/rnn/recurrent_op_utils.h
+++ b/paddle/operators/rnn/recurrent_op_utils.h
--- a/paddle/operators/scatter_op.cu
+++ b/paddle/operators/scatter_op.cu
--- a/paddle/operators/scatter_op.h
+++ b/paddle/operators/scatter_op.h
--- a/paddle/operators/sequence_concat_op.h
+++ b/paddle/operators/sequence_concat_op.h
--- a/paddle/operators/sequence_pool_op.h
+++ b/paddle/operators/sequence_pool_op.h
--- a/paddle/operators/sequence_softmax_op.h
+++ b/paddle/operators/sequence_softmax_op.h
--- a/paddle/operators/sgd_op.cc
+++ b/paddle/operators/sgd_op.cc
--- a/paddle/operators/sgd_op.cu
+++ b/paddle/operators/sgd_op.cu
--- a/paddle/operators/sgd_op.h
+++ b/paddle/operators/sgd_op.h
--- a/paddle/operators/softmax_with_cross_entropy_op.cu
+++ b/paddle/operators/softmax_with_cross_entropy_op.cu
--- a/paddle/operators/softmax_with_cross_entropy_op.h
+++ b/paddle/operators/softmax_with_cross_entropy_op.h
--- a/paddle/operators/uniform_random_op.cc
+++ b/paddle/operators/uniform_random_op.cc
--- a/paddle/parameter/FirstOrderOptimizer.h
+++ b/paddle/parameter/FirstOrderOptimizer.h
--- a/paddle/platform/dynload/cublas.h
+++ b/paddle/platform/dynload/cublas.h
--- a/paddle/pybind/protobuf.cc
+++ b/paddle/pybind/protobuf.cc
--- a/paddle/pybind/pybind.cc
+++ b/paddle/pybind/pybind.cc
--- a/paddle/scripts/cluster_train_v2/fabric/conf.py
+++ b/paddle/scripts/cluster_train_v2/fabric/conf.py
--- a/paddle/scripts/cluster_train_v2/fabric/docker_cluster/Dockerfile
+++ b/paddle/scripts/cluster_train_v2/fabric/docker_cluster/Dockerfile
--- a/paddle/scripts/cluster_train_v2/fabric/docker_cluster/ssh_servers.yaml
+++ b/paddle/scripts/cluster_train_v2/fabric/docker_cluster/ssh_servers.yaml
--- a/paddle/scripts/cluster_train_v2/fabric/run.sh
+++ b/paddle/scripts/cluster_train_v2/fabric/run.sh
--- a/paddle/scripts/cluster_train_v2/openmpi/docker_cluster/Dockerfile
+++ b/paddle/scripts/cluster_train_v2/openmpi/docker_cluster/Dockerfile
--- a/paddle/scripts/cluster_train_v2/openmpi/docker_cluster/head.yaml
+++ b/paddle/scripts/cluster_train_v2/openmpi/docker_cluster/head.yaml
--- a/paddle/scripts/cluster_train_v2/openmpi/docker_cluster/mpi-nodes.yaml
+++ b/paddle/scripts/cluster_train_v2/openmpi/docker_cluster/mpi-nodes.yaml
--- a/paddle/scripts/cluster_train_v2/openmpi/docker_cluster/ssh/config
+++ b/paddle/scripts/cluster_train_v2/openmpi/docker_cluster/ssh/config
--- a/paddle/scripts/cluster_train_v2/openmpi/docker_cluster/ssh/id_rsa.mpi
+++ b/paddle/scripts/cluster_train_v2/openmpi/docker_cluster/ssh/id_rsa.mpi
--- a/paddle/scripts/cluster_train_v2/openmpi/docker_cluster/ssh/id_rsa.mpi.pub
+++ b/paddle/scripts/cluster_train_v2/openmpi/docker_cluster/ssh/id_rsa.mpi.pub
--- a/paddle/scripts/cluster_train_v2/openmpi/start_mpi_train.sh
+++ b/paddle/scripts/cluster_train_v2/openmpi/start_mpi_train.sh
--- a/paddle/scripts/docker/build.sh
+++ b/paddle/scripts/docker/build.sh
--- a/python/paddle/trainer_config_helpers/networks.py
+++ b/python/paddle/trainer_config_helpers/networks.py
--- a/python/paddle/v2/framework/executor.py
+++ b/python/paddle/v2/framework/executor.py
--- a/python/paddle/v2/framework/framework.py
+++ b/python/paddle/v2/framework/framework.py
--- a/python/paddle/v2/framework/layer_helper.py
+++ b/python/paddle/v2/framework/layer_helper.py
--- a/python/paddle/v2/framework/layers.py
+++ b/python/paddle/v2/framework/layers.py
--- a/python/paddle/v2/framework/nets.py
+++ b/python/paddle/v2/framework/nets.py
--- a/python/paddle/v2/framework/optimizer.py
+++ b/python/paddle/v2/framework/optimizer.py
--- a/python/paddle/v2/framework/tests/.gitignore
+++ b/python/paddle/v2/framework/tests/.gitignore
--- a/python/paddle/v2/framework/tests/test_adam_op.py
+++ b/python/paddle/v2/framework/tests/test_adam_op.py
--- a/python/paddle/v2/framework/tests/test_adamax_op.py
+++ b/python/paddle/v2/framework/tests/test_adamax_op.py
--- a/python/paddle/v2/framework/tests/test_cross_entropy_op.py
+++ b/python/paddle/v2/framework/tests/test_cross_entropy_op.py
--- a/python/paddle/v2/framework/tests/test_dynamic_recurrent_op.py
+++ b/python/paddle/v2/framework/tests/test_dynamic_recurrent_op.py
--- a/python/paddle/v2/framework/tests/test_elementwise_add_op.py
+++ b/python/paddle/v2/framework/tests/test_elementwise_add_op.py
--- a/python/paddle/v2/framework/tests/test_executor_and_mul.py
+++ b/python/paddle/v2/framework/tests/test_executor_and_mul.py
--- a/python/paddle/v2/framework/tests/test_feed_fetch_method.py
+++ b/python/paddle/v2/framework/tests/test_feed_fetch_method.py
--- a/python/paddle/v2/framework/tests/test_fit_a_line.py
+++ b/python/paddle/v2/framework/tests/test_fit_a_line.py
--- a/python/paddle/v2/framework/tests/test_increment_op.py
+++ b/python/paddle/v2/framework/tests/test_increment_op.py
--- a/python/paddle/v2/framework/tests/test_infer_shape.py
+++ b/python/paddle/v2/framework/tests/test_infer_shape.py
--- a/python/paddle/v2/framework/tests/test_layers.py
+++ b/python/paddle/v2/framework/tests/test_layers.py
--- a/python/paddle/v2/framework/tests/test_matmul_op.py
+++ b/python/paddle/v2/framework/tests/test_matmul_op.py
--- a/python/paddle/v2/framework/tests/test_momentum_op.py
+++ b/python/paddle/v2/framework/tests/test_momentum_op.py
--- a/python/paddle/v2/framework/tests/test_op_support_gpu.py
+++ b/python/paddle/v2/framework/tests/test_op_support_gpu.py
--- a/python/paddle/v2/framework/tests/test_optimizer.py
+++ b/python/paddle/v2/framework/tests/test_optimizer.py
--- a/python/paddle/v2/framework/tests/test_program.py
+++ b/python/paddle/v2/framework/tests/test_program.py
--- a/python/paddle/v2/framework/tests/test_protobuf_descs.py
+++ b/python/paddle/v2/framework/tests/test_protobuf_descs.py
--- a/python/paddle/v2/framework/tests/test_proximal_gd_op.py
+++ b/python/paddle/v2/framework/tests/test_proximal_gd_op.py
--- a/python/paddle/v2/framework/tests/test_recognize_digits_conv.py
+++ b/python/paddle/v2/framework/tests/test_recognize_digits_conv.py
--- a/python/paddle/v2/framework/tests/test_recognize_digits_mlp.py
+++ b/python/paddle/v2/framework/tests/test_recognize_digits_mlp.py
--- a/python/paddle/v2/framework/tests/test_recurrent_op.py
+++ b/python/paddle/v2/framework/tests/test_recurrent_op.py
--- a/python/paddle/v2/framework/tests/test_reduce_op.py
+++ b/python/paddle/v2/framework/tests/test_reduce_op.py
--- a/python/paddle/v2/framework/tests/test_rmsprop_op.py
+++ b/python/paddle/v2/framework/tests/test_rmsprop_op.py
--- a/python/paddle/v2/framework/tests/test_selected_rows.py
+++ b/python/paddle/v2/framework/tests/test_selected_rows.py
--- a/python/paddle/v2/framework/tests/test_sgd_op.py
+++ b/python/paddle/v2/framework/tests/test_sgd_op.py
--- a/python/paddle/v2/framework/tests/test_uniform_random_op.py
+++ b/python/paddle/v2/framework/tests/test_uniform_random_op.py