update data types and test cases

496583a1 · zhouxiao-coder · 1491041b · 39320f00 · 496583a1 · 496583a1
161 changed file
--- a/.travis.yml
+++ b/.travis.yml
@@ -30,6 +30,7 @@ addons:
      - automake
      - libtool
      - ccache
+  ssh_known_hosts: 52.76.173.135
 before_install:
  - if [[ "$JOB" == "check_style" ]]; then sudo ln -s /usr/bin/clang-format-3.8 /usr/bin/clang-format; fi
  # Paddle is using protobuf 3.1 currently. Protobuf 3.2 breaks the compatibility. So we specify the python
@@ -42,6 +43,14 @@ script:
  - |
    timeout 2580 paddle/scripts/travis/${JOB}.sh # 43min timeout
    RESULT=$?; if [ $RESULT -eq 0 ] || [ $RESULT -eq 142 ]; then true; else false; fi;
+  - |
+    if [[ "$JOB" != "build_doc" ]]; then exit 0; fi;
+    if [[ "$TRAVIS_PULL_REQUEST" != "false" ]]; then exit 0; fi;
+    if [[ "$TRAVIS_BRANCH" != "develop"  && ! "$TRAVIS_BRANCH" =~ ^v[[:digit:]]+\.[[:digit:]]+(\.[[:digit:]]+)?(-\S*)?$ ]]; then exit 0; fi;
+    export DEPLOY_DOCS_SH=https://raw.githubusercontent.com/PaddlePaddle/PaddlePaddle.org/master/scripts/deploy/deploy_docs.sh
+    export DOCS_DIR=`pwd`
+    cd ..
+    curl $DEPLOY_DOCS_SH | bash -s $CONTENT_DEC_PASSWD $TRAVIS_BRANCH $DOCS_DIR $DOCS_DIR/build/doc   
 notifications:
  email:
    on_success: change

--- a/benchmark/IntelOptimizedPaddle.md
+++ b/benchmark/IntelOptimizedPaddle.md
+# Benchmark
+
+Machine:
+
+- Server
+ 	- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 2 Sockets, 20 Cores per socket
+- Laptop
+ 	- DELL XPS15-9560-R1745: i7-7700HQ 8G 256GSSD
+ 	- i5 MacBook Pro (Retina, 13-inch, Early 2015)
+- Desktop
+ 	- i7-6700k
+
+System: CentOS release 6.3 (Final), Docker 1.12.1.
+
+PaddlePaddle: paddlepaddle/paddle:latest (TODO: will rerun after 0.11.0)
+
+- MKL-DNN tag v0.10
+- MKLML 2018.0.20170720
+- OpenBLAS v0.2.20
+	 
+On each machine, we will test and compare the performance of training on single node using MKL-DNN / MKLML / OpenBLAS respectively.
+
+## Benchmark Model
+
+### Server
+Test on batch size 64, 128, 256 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
+
+Input image size - 3 * 224 * 224, Time: images/second
+
+- VGG-19
+
+| BatchSize    | 64    | 128  | 256     |
+|--------------|-------| -----| --------|
+| OpenBLAS     | 7.82  | 8.62  | 10.34  | 
+| MKLML        | 11.02 | 12.86 | 15.33  |
+| MKL-DNN      | 27.69 | 28.8 | 29.27  |
+
+
+chart on batch size 128
+TBD
+
+ - ResNet
+ - GoogLeNet
+
+### Laptop
+TBD
+### Desktop
+TBD
--- a/doc/design/images/asgd.gif
+++ b/doc/design/images/asgd.gif
--- a/doc/design/images/theta_star.gif
+++ b/doc/design/images/theta_star.gif
--- a/doc/design/parameter_average.md
+++ b/doc/design/parameter_average.md
+# Averaging Parameter in PaddlePaddle
+
+## Why Averaging
+In a large scale machine learning setup where the size of the training data is huge, it could take us a large number of iterations over the training data before we can achieve the optimal values of parameters of our model. Looking at the problem setup, it is desirable if we can obtain the optimal values of parameters by going through the data in as few passes as we can.
+
+Polyak and Juditsky (1992) showed that the test performance of simple average of parameters obtained by Stochastic Gradient Descent (SGD) is as good as that of parameter values that are obtained by training the model over and over again, over the training dataset.
+
+Hence, to accelerate the speed of Stochastic Gradient Descent, Averaged Stochastic Gradient Descent (ASGD) was proposed in Polyak and Juditsky (1992). For ASGD, the running average of parameters obtained by SGD, is used as the estimator for <img src="./images/theta_star.gif"/><br/> . The averaging is done as follows:
+
+<img src="./images/asgd.gif" align="center"/><br/>
+
+We propose averaging for any optimizer similar to how ASGD performs it, as mentioned above.
+
+### How to perform Parameter Averaging in PaddlePaddle
+
+Parameter Averaging in PaddlePaddle works in the following way during training :
+1. It will take in an instance of a normal optimizer as an input, e.g. RMSPropOptimizer
+2. The optimizer itself is responsible for updating the parameters.
+3. The ParameterAverageOptimizer maintains a separate copy of the parameters for itself:
+    1. In concept, the values of this copy are the average of the values of the parameters in the most recent N batches.
+    2. However, saving all the N instances of the parameters in memory is not feasible.
+    3. Therefore, an approximation algorithm is used.
+
+Hence, overall we have have two copies of the parameters: one for the optimizer itself, and one for the ParameterAverageOptimizer. The former should be used in back propagation, while the latter should be used during testing and should be saved.
+
+During the testing/ saving the model phase, we perform the following steps:
+1. Perform the delayed operations.
+2. Save current values of the parameters to a temporary variable.
+3. Replace the values of the parameters with the averaged values.
+4. Perform testing and/or save the parameters.
+5. Restore the values of the parameters once done.
+
+### How to implement Averaging of Parameter in PaddlePaddle
+
+We can add the ParameterAverageOptimizer op to the graph through Python API. Using this approach, we manually add this op to the graph and direct the output of the optimizer op to this op during training.
+
+	**Advantages**:
+    - Allows for greater flexibility to the users of PaddlePaddle. Using this approach, the users can plug different optimizers into ParameterAverageOptimizer by passing in the optimizer to the op.
+    - Makes it easy for the users to customize and extend the framework.
+
+	**Disadvantages**:
+    - Implementation requires re-writing the averaging methodology in Python.  
+
+### Low-Level implementation
+
+In the new design, we propose to create a new operation for averaging parameter updates (ParameterAverageOptimizer). For now, we can add an op that takes in the following as input:
+- the optimizer
+- the window_size to keep the updates
+
+The ParameterAverageOptimizer op can be like any other operator with its own CPU/GPU implementation either using Eigen or separate CPU and GPU kernels. As the initial implementation, we can implement the kernel using Eigen following the abstraction pattern implemented for [Operators](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/rmsprop_op.h). We also want to support the case when the Trainer/Optimizer runs on the GPU while ParameterAverageOptimizer runs on a CPU.
+
+The idea of building an op for averaging is in sync with the refactored PaddlePaddle philosophy of using operators to represent any computation unit. The way the op will be added to the computation graph will be decided by the [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) in Python API.
+
+### Python API implementation for ParameterAverageOptimizer
+
+Based on Polyak and Juditsky (1992), we can generalize the averaging of updates to any optimizer. The input to the op would be the following:
+- Any optimizer (RMSProp , AdaGrad etc.)
+- A window size. The op keeps accumulating updated parameter values over a window of N batches and takes an average. Move the averaged value to a buffer when window is full to avoid loss of precision.
+
+Using the ParameterAverageOptimizer op, any user can add the operation to their computation graphs. However, this will require a lot of lines of code and we should design Python APIs that support averaging. As per the PaddlePaddle [Python API design](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md), the layer functions are responsible for creating operators, operator parameters and variables. Since ParameterAverageOptimizer will be an operator, it makes sense to create it in the layer functions.
+We will have a wrapper written in Python that will support the functionality and implement the actual core computation in C++ core as we have done for other [Optimizers](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/rmsprop_op.cc)
+
+#### Creation of the ParameterAverageOptimizer operator
+There are two ways for creating the ParameterAverageOptimizer op:
+1. We create the op immediately while building the computation graph.
+2. We add the op in a lazy manner, just before the backward pass, similar to the way the optimization ops are added.
+
+The proposal is to add the op immediately while building the computation graph.
+
+#### High-level API
+
+In PaddlePaddle Python API, users will primarily rely on [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) to create neural network layers. Hence, we also need to provide parameter average functionality in layer functions.
--- a/doc/howto/cross_compiling/cross_compiling_for_android.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_android.md
+# Build PaddlePaddle for Android
+
+There are two approaches to build PaddlePaddle for Android: using Docker and on Linux without Docker. 
+
+## Cross-Compiling Using Docker
+
+Docker-based cross-compiling is the recommended approach because Docker runs on all major operating systems, including Linux, Mac OS X, and Windows.
+
+### Build the Docker Image
+
+The following steps pack all the tools that we need to build PaddlePaddle into a Docker image.
+
+```bash
+$ git clone https://github.com/PaddlePaddle/Paddle.git
+$ cd Paddle
+$ docker build -t paddle:dev-android . -f Dockerfile.android
+```
+
+### Build the Inference Library
+
+We can run the Docker image we just created to build the inference library of PaddlePaddle for Android using the command below:
+
+```bash
+$ docker run -it --rm -v $PWD:/paddle -e "ANDROID_ABI=armeabi-v7a" -e "ANDROID_API=21" paddle:dev-android
+```
+
+The Docker image accepts two arguments `ANDROID_ABI` and `ANDROID_API`:
+
+| Argument        | Optional Values         | Default |
+|-----------------|-------------------------|---------|
+|`ANDROID_ABI`    |`armeabi-v7a, arm64-v8a` | `armeabi-v7a` |
+|`ANDROID_API`    |`>= 21` | `21` |
+
+The ARM-64 architecture (`arm64-v8a`) requires at least level 21 of Android API.
+
+The default entry-point of the Docker image, [`paddle/scripts/docker/build_android.sh`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build_android.sh) generates the [Android cross-compiling standalone toolchain](https://developer.android.com/ndk/guides/standalone_toolchain.html) based on the argument: `ANDROID_ABI` or `ANDROID_API`.  For information about other configuration arguments, please continue reading.
+
+The above command generates and outputs the inference library in `$PWD/install_android` and puts third-party libraries in `$PWD/install_android/third_party`.
+
+## Cross-Compiling on Linux
+
+The Linux-base approach to cross-compile is to run steps in `Dockerfile.android` manually on a Linux x64 computer.
+
+### Setup the Environment
+
+To build for Android's, we need [Android NDK](
+https://developer.android.com/ndk/downloads/index.html):
+
+```bash
+wget -q https://dl.google.com/android/repository/android-ndk-r14b-linux-x86_64.zip
+unzip -q android-ndk-r14b-linux-x86_64.zip
+```
+
+Android NDK includes everything we need to build the [*standalone toolchain*](https://developer.android.com/ndk/guides/standalone_toolchain.html), which in then used to build PaddlePaddle for Android.  (We plan to remove the intermediate stage of building the standalone toolchain in the near future.)
+
+- To build the standalone toolchain for `armeabi-v7a` and Android API level 21:
+
+  ```bash
+  your/path/to/android-ndk-r14b-linux-x86_64/build/tools/make-standalone-toolchain.sh \
+          --arch=arm --platform=android-21 --install-dir=your/path/to/arm_standalone_toolchain
+  ```
+  
+  The generated standalone toolchain will be in `your/path/to/arm_standalone_toolchain`.
+
+- To build the standalone toolchain for `arm64-v8a` and Android API level 21:
+
+  ```bash
+  your/path/to/android-ndk-r14b-linux-x86_64/build/tools/make-standalone-toolchain.sh \
+          --arch=arm64 --platform=android-21 --install-dir=your/path/to/arm64_standalone_toolchain
+  ```
+
+  The generated standalone toolchain will be in `your/path/to/arm64_standalone_toolchain`.
+
+**Please be aware that the minimum level of Android API required by PaddlePaddle is 21.**
+
+### Cross-Compiling Arguments
+
+CMake supports [choosing the toolchain](https://cmake.org/cmake/help/v3.0/manual/cmake-toolchains.7.html#cross-compiling).  PaddlePaddle provides [`android.cmake`](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/android.cmake), which configures the Android cross-compiling toolchain for CMake.  `android.cmake` is not required for CMake >= 3.7, which support Android cross-compiling. PaddlePaddle detects the CMake version, for those newer than 3.7, it uses [the official version](https://cmake.org/cmake/help/v3.7/manual/cmake-toolchains.7.html#cross-compiling).
+
+Some other CMake arguments you need to know:
+
+- `CMAKE_SYSTEM_NAME` must be `Android`.  This tells PaddlePaddle's CMake system to cross-compile third-party dependencies.  This also changes some other CMake arguments like `WITH_GPU=OFF`, `WITH_AVX=OFF`, `WITH_PYTHON=OFF`, and `WITH_RDMA=OFF`.
+- `WITH_C_API` must be `ON`, to build the C-based inference library for Android.
+- `WITH_SWIG_PY` must be `OFF` because the Android platform doesn't support SWIG-based API.
+
+Some Android-specific arguments:
+
+- `ANDROID_STANDALONE_TOOLCHAIN`: the absolute path of the Android standalone toolchain, or the path relative to the CMake build directory.  PaddlePaddle's CMake extensions would derive the cross-compiler, sysroot and Android API level from this argument.
+- `ANDROID_TOOLCHAIN`: could be `gcc` or `clang`.  The default value is `clang`.
+  - For CMake >= 3.7, it should anyway be `clang`.  For older versions, it could be `gcc`.
+  - Android's official `clang` requires `glibc` >= 2.15.
+- `ANDROID_ABI`: could be `armeabi-v7a` or `arm64-v8a`.  The default value is `armeabi-v7a`.
+- `ANDROID_NATIVE_API_LEVEL`: could be derived from the value of `ANDROID_STANDALONE_TOOLCHAIN`.
+- `ANROID_ARM_MODE`:
+  - could be `ON` or `OFF`, and defaults to `ON`, when `ANDROID_ABI=armeabi-v7a`;
+  - no need to specify when `ANDROID_ABI=arm64-v8a`.
+- `ANDROID_ARM_NEON`: indicates if to use NEON instructions.
+  - could be `ON` or `OFF`, and defaults to `ON`, when `ANDROID_ABI=armeabi-v7a`;
+  - no need to specify when `ANDROID_ABI=arm64-v8a`.
+
+Other useful arguments:
+
+- `USE_EIGEN_FOR_BLAS`: indicates if using Eigen.  Could be `ON` or `OFF`, defaults to `OFF`.
+- `HOST_C/CXX_COMPILER`: specifies the host compiler, which is used to build the host-specific protoc and target-specific OpenBLAS.  It defaults to the value of the environment variable `CC`, or `cc`.
+
+Some frequent configurations for your reference:
+
+```bash
+cmake -DCMAKE_SYSTEM_NAME=Android \
+      -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm_standalone_toolchain \
+      -DANDROID_ABI=armeabi-v7a \
+      -DANDROID_ARM_NEON=ON \
+      -DANDROID_ARM_MODE=ON \
+      -DUSE_EIGEN_FOR_BLAS=ON \
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
+      -DWITH_C_API=ON \
+      -DWITH_SWIG_PY=OFF \
+      ..
+```
+
+```
+cmake -DCMAKE_SYSTEM_NAME=Android \
+      -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm64_standalone_toolchain \
+      -DANDROID_ABI=arm64-v8a \
+      -DUSE_EIGEN_FOR_BLAS=OFF \
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
+      -DWITH_C_API=ON \
+      -DWITH_SWIG_PY=OFF \
+      ..
+```
+
+
+There are some other arguments you might want to configure.
+
+- `CMAKE_BUILD_TYPE=MinSizeRel` minimizes the size of library.
+- `CMAKE_BUILD_TYPE-Release` optimizes the runtime performance.
+
+Our own tip for performance optimization to use clang and Eigen or OpenBLAS:
+- `CMAKE_BUILD_TYPE=Release`
+- `ANDROID_TOOLCHAIN=clang`
+- `USE_EIGEN_BLAS=ON` for `armeabi-v7a`, or `USE_EIGEN_FOR_BLAS=OFF` for `arm64-v8a`.
+
+### Build and Install
+
+After running `cmake`, we can run `make; make install` to build and install.
+
+Before building, you might want to remove the `third_party` and `build` directories including pre-built libraries for other architectures.
+
+After building，in the directory `CMAKE_INSTALL_PREFIX`, you will find three sub-directories:
+
+- `include`: the header file of the inference library,
+- `lib`: the inference library built for various Android ABIs,
+- `third_party`: dependent third-party libraries built for Android.
--- a/doc/howto/cross_compiling/cross_compiling_for_android_cn.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_android_cn.md
 # 构建Android平台上的PaddlePaddle库

 用户可通过如下两种方式，交叉编译Android平台上适用的PaddlePaddle库：
- 基于Docker容器的编译方式
+- 基于Docker容器的编译方式
 - 基于Linux交叉编译环境的编译方式

 ## 基于Docker容器的编译方式
@@ -26,14 +26,14 @@ Android的Docker开发镜像向用户提供两个可配置的参数：
 |`ANDROID_API`    |`>= 21` | `21` |

 - 编译`armeabi-v7a`，`Android API 21`的PaddlePaddle库
-```bash
-$ docker run -it --rm -v $PWD:/paddle -e "ANDROID_ABI=armeabi-v7a" -e "ANDROID_API=21" username/paddle-android:dev
-```
+  ```bash
+  $ docker run -it --rm -v $PWD:/paddle -e "ANDROID_ABI=armeabi-v7a" -e "ANDROID_API=21" username/paddle-android:dev
+  ```

- 编译`arm64-v8a`，`Android API 21`的PaddlePaddle库
-```bash
-$ docker run -it --rm -v $PWD:/paddle -e "ANDROID_ABI=arm64-v8a" -e "ANDROID_API=21" username/paddle-android:dev
-```
+- 编译`arm64-v8a`，`Android API 21`的PaddlePaddle库
+  ```bash
+  $ docker run -it --rm -v $PWD:/paddle -e "ANDROID_ABI=arm64-v8a" -e "ANDROID_API=21" username/paddle-android:dev
+  ```

 执行上述`docker run`命令时，容器默认执行[paddle/scripts/docker/build_android.sh](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build_android.sh)脚本。该脚本中记录了交叉编译Android版PaddlePaddle库常用的CMake配置，并且会根据`ANDROID_ABI`和`ANDROID_API`自动构建独立工具链、进行编译和安装。由于arm64架构要求Android API不小于21。因此当`ANDROID_ABI=arm64-v8a`，`ANDROID_API<21`时，Docker容器中将默认使用`Android API 21`的编译工具链。用户可以参考下文**配置交叉编译参数**章节，根据个人的需求修改定制Docker容器所执行的脚本。编译安装结束之后，PaddlePaddle的C-API库将被安装到`$PWD/install_android`目录，所依赖的第三方库同时也被安装到`$PWD/install_android/third_party`目录。

@@ -82,16 +82,16 @@ CMake系统对交叉编译提供了支持[cmake-toolchains](https://cmake.org/cm
 Android平台可选配置参数：

 - `ANDROID_STANDALONE_TOOLCHAIN`，独立工具链所在的绝对路径，或者相对于构建目录的相对路径。PaddlePaddle的CMake系统将根据该值自动推导和设置需要使用的交叉编译器、sysroot、以及Android API级别；否则，用户需要在cmake时手动设置这些值。无默认值。
- `ANDROID_TOOLCHAIN`，目标工具链。可设置`gcc/clang`，默认值为`clang`。
-	- CMake 3.7以上，将会始终使用`clang`工具链；CMake 3.7以下，可设置`ANDROID_TOOLCHAIN=gcc`以使用`gcc`工具链。
+- `ANDROID_TOOLCHAIN`，目标工具链。可设置`gcc/clang`，默认值为`clang`。
+	- CMake 3.7以上，将会始终使用`clang`工具链；CMake 3.7以下，可设置`ANDROID_TOOLCHAIN=gcc`以使用`gcc`工具链。
 	- Android官方提供的`clang`编译器要求系统支持`GLIBC 2.15`以上。
 - `ANDROID_ABI`，目标架构ABI。目前支持`armeabi-v7a`和`arm64-v8a`，默认值为`armeabi-v7a`。
 - `ANDROID_NATIVE_API_LEVEL`，工具链的Android API级别。若没有显式设置，PaddlePaddle将根据`ANDROID_STANDALONE_TOOLCHAIN`的值自动推导得到。
- `ANROID_ARM_MODE`，是否使用ARM模式。
-	- `ANDROID_ABI=armeabi-v7a`时，可设置`ON/OFF`，默认值为`ON`；
+- `ANROID_ARM_MODE`，是否使用ARM模式。
+	- `ANDROID_ABI=armeabi-v7a`时，可设置`ON/OFF`，默认值为`ON`；
 	- `ANDROID_ABI=arm64-v8a`时，不需要设置。
- `ANDROID_ARM_NEON`，是否使用NEON指令。
-	- `ANDROID_ABI=armeabi-v7a`时，可设置`ON/OFF`，默认值为`ON`；
+- `ANDROID_ARM_NEON`，是否使用NEON指令。
+	- `ANDROID_ABI=armeabi-v7a`时，可设置`ON/OFF`，默认值为`ON`；
 	- `ANDROID_ABI=arm64-v8a`时，不需要设置。

 其他配置参数：
@@ -119,7 +119,7 @@ cmake -DCMAKE_SYSTEM_NAME=Android \
      -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm64_standalone_toolchain \
      -DANDROID_ABI=arm64-v8a \
      -DUSE_EIGEN_FOR_BLAS=OFF \
-      -DCMAKE_INSTALL_PREFIX=your/path/to/install \  
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
      -DWITH_C_API=ON \
      -DWITH_SWIG_PY=OFF \
      ..
@@ -128,8 +128,8 @@ cmake -DCMAKE_SYSTEM_NAME=Android \
 用户还可根据自己的需求设置其他编译参数。比如希望最小化生成的库的大小，可以设置`CMAKE_BUILD_TYPE`为`MinSizeRel`；若希望最快的执行速度，则可设置`CMAKE_BUILD_TYPE`为`Release`。亦可以通过手动设置`CMAKE_C/CXX_FLAGS_MINSIZEREL/RELEASE`来影响PaddlePaddle的编译过程。

 **性能TIPS**，为了达到最快的计算速度，在CMake参数配置上，有以下建议：
- 设置`CMAKE_BUILD_TYPE`为`Release`
- 使用`clang`编译工具链
+- 设置`CMAKE_BUILD_TYPE`为`Release`
+- 使用`clang`编译工具链
 - `armeabi-v7a`时，设置`USE_EIGEN_BLAS=ON`，使用Eigen进行矩阵计算；`arm64-v8a`时，设置`USE_EIGEN_FOR_BLAS=OFF`，使用OpenBLAS进行矩阵计算

 ### 编译和安装

--- a/doc/howto/cross_compiling/cross_compiling_for_ios_cn.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_ios_cn.md
+# 构建iOS平台上的PaddlePaddle库
+交叉编译iOS平台上适用的PaddlePaddle库，需要在MacOS系统上进行。本文的将介绍在MacOS上，从源码交叉编译iOS平台上适用的PaddlePaddle库。
+
+## 准备交叉编译环境
+Apple官方为iOS开发提供了完整的交叉编译工具和集成开发环境，用户从App Store下载安装Xcode即可。也可自行前往官网下载，[Xcode](https://developer.apple.com/cn/xcode/)。安装完成之后，可在命令行执行`xcodebuild -version`，判断是否安装成功。
+
+```bash
+$ xcodebuild -version
+Xcode 9.0
+Build version 9A235
+```
+
+## 配置交叉编译参数
+
+PaddlePaddle为交叉编译提供了工具链配置文档[cmake/cross_compiling/ios.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/ios.cmake)，以提供一些默认的编译器和编译参数配置。
+
+交叉编译iOS版本的PaddlePaddle库时，有一些必须配置的参数：
+
+- `CMAKE_SYSTEM_NAME`，CMake编译的目标平台，必须设置为`iOS`。在设置`CMAKE_SYSTEM_NAME=iOS`后，PaddlePaddle的CMake系统会自动编译所有的第三方依赖库，并且强制设置一些PaddlePaddle参数的值（`WITH_C_API=ON`、`WITH_GPU=OFF`、`WITH_AVX=OFF`、`WITH_PYTHON=OFF`、`WITH_RDMA=OFF`）。
+- `WITH_C_API`，是否编译C-API预测库，必须设置为ON。在iOS平台上只支持使用C-API来预测。
+- `WITH_SWIG_PY`，必须设置为ON。在iOS平台上不支持通过swig调用来训练或者预测。
+
+iOS平台可选配置参数：
+
+- `IOS_PLATFORM`，可设置为`OS/SIMULATOR`，默认值为`OS`。
+  - `OS`，构建目标为`arm`架构的iPhone或者iPad等物理设备。
+  - `SIMULATOR`，构建目标为`x86`架构的模拟器平台。
+- `IOS_ARCH`，目标架构。针对不同的`IOS_PLATFORM`，可设置的目标架构如下表所示：
+
+   | IOS_PLATFORM | IOS_ARCH             |
+   |--------------|----------------------|
+   |   OS         | armv7, armv7s, arm64 (默认) |
+   | SIMULATOR    | i386, x86_64 (默认)         |   
+
+- `IOS_DEPLOYMENT_TARGET`，最小的iOS部署版本，默认值为`7.0`。
+- `IOS_ENABLE_BITCODE`，是否使能[Bitcode](https://developer.apple.com/library/content/documentation/IDEs/Conceptual/AppDistributionGuide/AppThinning/AppThinning.html#//apple_ref/doc/uid/TP40012582-CH35-SW3)，可设置`ON/OFF`，默认值为`ON`。
+- `IOS_USE_VECLIB_FOR_BLAS`，是否使用[vecLib](https://developer.apple.com/documentation/accelerate/veclib)框架进行BLAS矩阵计算，可设置`ON/OFF`，默认值为`OFF`。
+- `IOS_DEVELOPMENT_ROOT`，`Developer`目录，可显式指定为`/path/to/platform/Developer`。若未显式指定，PaddlePaddle将会根据`IOS_PLATFORM`自动选择`Xcode`对应`platform`的`Developer`目录。
+- `IOS_SDK_ROOT`，所使用`SDK`的根目录，可显式指定为`/path/to/platform/Developer/SDKs/SDK`。若未显式指定，PaddlePaddle将会自动选择`IOS_DEVELOPMENT_ROOT`目录下最新的`SDK`版本。
+
+其他配置参数：
+
+- `USE_EIGEN_FOR_BLAS`，是否使用Eigen库进行矩阵计算，在`IOS_USE_VECLIB_FOR_BLAS=OFF`时有效。可设置`ON/OFF`，默认值为`OFF`。
+- `HOST_C/CXX_COMPILER`，宿主机的C/C++编译器。默认值为环境变量`CC/CXX`的值；若环境变量`CC/CXX`未设置，则使用`cc/c++`编译器。
+
+常用的cmake配置如下：
+
+```bash
+cmake -DCMAKE_SYSTEM_NAME=iOS \
+      -DIOS_PLATFORM=OS \
+      -DIOS_ARCH="arm64" \
+      -DIOS_ENABLE_BITCODE=ON \
+      -DIOS_USE_VECLIB_FOR_BLAS=ON \
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
+      -DWITH_C_API=ON \
+      -DWITH_TESTING=OFF \
+      -DWITH_SWIG_PY=OFF \
+      ..
+```
+
+```bash
+cmake -DCMAKE_SYSTEM_NAME=iOS \
+      -DIOS_PLATFORM=SIMULATOR \
+      -DIOS_ARCH="x86_64" \
+      -DIOS_USE_VECLIB_FOR_BLAS=ON \
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
+      -DWITH_C_API=ON \
+      -DWITH_TESTING=OFF \
+      -DWITH_SWIG_PY=OFF \
+      ..
+```
+
+用户还可根据自己的需求设置其他编译参数。比如希望最小化生成库的大小，可以设置`CMAKE_BUILD_TYPE`为`MinSizeRel`；若希望得到最快的执行速度，则可设置`CMAKE_BUILD_TYPE`为`Release`。亦可以通过手动设置`CMAKE_C/CXX_FLAGS`来影响PaddlePaddle的编译过程。
+
+**性能TIPS**，为了达到最快的计算速度，在CMake参数配置上，有以下建议：
+
+- 设置`CMAKE_BUILD_TYPE`为`Release`
+- 设置`IOS_USE_VECLIB_FOR_BLAS=ON`，调用`vecLib`框架提供的BLAS函数进行矩阵计算。
+
+## 编译和安装
+
+CMake配置完成后，执行以下命令，PaddlePaddle将自动下载和编译所有第三方依赖库、编译和安装PaddlePaddle预测库。
+
+```
+$ make
+$ make install
+```
+
+注意：如果你曾在源码目录下编译过其他平台的PaddlePaddle库，请先使用`rm -rf`命令删除`third_party`目录和`build`目录，以确保所有的第三方依赖库和PaddlePaddle代码都是针对新的CMake配置重新编译的。
+
+执行完安装命令后，`your/path/to/install`目录中会包含以下内容：
+
+- `include`目录，其中包含所有C-API的头文件
+- `lib`目录，其中包含PaddlePaddle的C-API静态库
+- `third_party`目录，其中包含所依赖的所有第三方库
+
+注意，不同架构的PaddlePaddle库建议安装到不同的目录下，然后使用`lipo`工具将多个静态库合并成一个支持多个架构的fat库。
+
+自此，PaddlePaddle库已经安装完成，用户可将合成的fat库用于深度学习相关的iOS App中，调用方法见C-API文档。
--- a/doc/howto/cross_compiling/cross_compiling_for_raspberry_cn.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_raspberry_cn.md
@@ -59,4 +59,4 @@ make install

 注意：如果你曾经在源码目录下编译过其他平台的PaddlePaddle库，请先使用`rm -rf`命令删除`third_party`目录和`build`目录，以确保所有的第三方依赖库和PaddlePaddle代码都是针对新的CMake配置重新编译的。

-执行完安装命令后，，`your/path/to/install`目录中会包含`include`和`lib`目录，其中`include`中包含C-API的头文件，`lib`中包含一个Raspberry Pi版本的库。
+执行完安装命令后，`your/path/to/install`目录中会包含`include`和`lib`目录，其中`include`中包含C-API的头文件，`lib`中包含一个Raspberry Pi版本的库。
--- a/doc/howto/cross_compiling/cross_compiling_for_raspberry_en.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_raspberry_en.md
@@ -44,7 +44,7 @@ cmake -DCMAKE_SYSTEM_NAME=RPi \
      ..
 ```

-To build the inference library, please set the argument WITH_API to ON: `WITH_C_API=ON`.
+To build the inference library, please set the argument WITH\_C\_API to ON: `WITH_C_API=ON`.

 You can add more arguments. For example, to minimize the size of the generated inference library, you may use `CMAKE_BUILD_TYPE=MinSizeRel`. For performance optimization, you may use `CMAKE_BUILD_TYPE=Release`.


--- a/doc/howto/usage/cluster/cluster_train_cn.md
+++ b/doc/howto/usage/cluster/cluster_train_cn.md
@@ -19,7 +19,7 @@
     * [启动集群作业](#启动集群作业-1)
  * [在Kubernetes集群中提交训练作业](#在kubernetes集群中提交训练作业)

-# 概述
+## 概述
 本文将介绍如何使用PaddlePaddle在不同的集群框架下完成分布式训练。分布式训练架构如下图所示：

 <img src="https://user-images.githubusercontent.com/13348433/31772175-5f419eca-b511-11e7-9db7-5231fe3d9ccb.png" width="500">
@@ -32,7 +32,7 @@

 在使用同步SGD训练神经网络时，PaddlePaddle使用同步屏障（barrier），使梯度的提交和参数的更新按照顺序方式执行。在异步SGD中，则并不会等待所有trainer提交梯度才更新参数，这样极大地提高了计算的并行性：参数服务器之间不相互依赖，并行地接收梯度和更新参数，参数服务器也不会等待计算节点全部都提交梯度之后才开始下一步，计算节点之间也不会相互依赖，并行地执行模型的训练。可以看出，虽然异步SGD方式会提高参数更新并行度, 但是并不能保证参数同步更新，在任意时间某一台参数服务器上保存的参数可能比另一台要更新，与同步SGD相比，梯度会有噪声。

-# 环境准备
+## 环境准备

 1. 准备您的计算集群。计算集群通常由一组（几台到几千台规模）的Linux服务器组成。服务器之间可以通过局域网（LAN）联通，每台服务器具有集群中唯一的IP地址（或者可被DNS解析的主机名）。集群中的每台计算机通常被成为一个“节点”。
 1. 我们需要在集群的所有节点上安装 PaddlePaddle。 如果要启用GPU，还需要在节点上安装对应的GPU驱动以及CUDA。PaddlePaddle的安装可以参考[build_and_install](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/getstarted/build_and_install)的多种安装方式。我们推荐使用[Docker](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_cn.rst)安装方式来快速安装PaddlePaddle。
@@ -51,8 +51,8 @@ PaddlePaddle 0.10.0, compiled with

 下面以`doc/howto/usage/cluster/src/word2vec`中的代码作为实例，介绍使用PaddlePaddle v2 API完成分布式训练。

-# 启动参数说明
-## 启动参数服务器
+## 启动参数说明
+### 启动参数服务器
 执行以下的命令启动一个参数服务器并等待和计算节点的数据交互
 ```bash
 $ paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1
@@ -70,7 +70,7 @@ $ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num
 | ports_num_for_sparse  | 必选 | 1 | 用于稀疏类型参数通信的端口个数  |
 | num_gradient_servers  | 必选 | 1 | 当前训练任务pserver总数 |

-## 启动计算节点
+### 启动计算节点
 执行以下命令启动使用python编写的trainer程序（文件名为任意文件名，如train.py）
 ```bash
 $ python train.py
@@ -117,7 +117,7 @@ paddle.init(
 | pservers  | 必选 | 127.0.0.1 | 当前训练任务启动的pserver的IP列表，多个IP使用“,”隔开 |


-## 准备数据集
+### 准备数据集

 参考样例数据准备脚本[prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py)，准备训练数据和验证数据集，我们使用paddle.dataset.imikolov数据集，并根据分布式训练并发数（trainer节点个数），在`prepare.py`开头部分指定`SPLIT_COUNT`将数据切分成多份。

@@ -149,7 +149,7 @@ test.txt-00002

 对于不同的训练任务，训练数据格式和训练程序的`reader()`会大不相同，所以开发者需要根据自己训练任务的实际场景完成训练数据的分割和`reader()`的编写。

-## 准备训练程序
+### 准备训练程序

 我们会对每个训练任务都会在每个节点上创建一个工作空间（workspace），其中包含了用户的训练程序、程序依赖、挂载或下载的训练数据分片。

@@ -184,7 +184,7 @@ test.txt-00002
 - `train_data_dir`：包含训练数据的目录，可以是从分布式存储挂载过来的，也可以是在任务启动前下载到本地的。
 - `test_data_dir`：包含测试数据集的目录。

-# 使用分布式计算平台或工具
+## 使用分布式计算平台或工具

 PaddlePaddle可以使用多种分布式计算平台构建分布式计算任务，包括：
 - [Kubernetes](http://kubernetes.io) Google开源的容器集群的调度框架，支持大规模集群生产环境的完整集群方案。
@@ -195,12 +195,12 @@ PaddlePaddle可以使用多种分布式计算平台构建分布式计算任务

 在使用分布式计算平台进行训练时，任务被调度在集群中时，分布式计算平台通常会通过API或者环境变量提供任务运行需要的参数，比如节点的ID、IP和任务节点个数等。

-## 使用Fabric启动集群作业
+### 使用Fabric启动集群作业

-### 准备一个Linux集群
+#### 准备一个Linux集群
 可以在`paddle/scripts/cluster_train_v2/fabric/docker_cluster`目录下，执行`kubectl -f ssh_servers.yaml`启动一个测试集群，并使用`kubectl get po -o wide`获得这些节点的IP地址。

-### 启动集群作业
+#### 启动集群作业

 `paddle.py` 提供了自动化脚本来启动不同节点中的所有 PaddlePaddle 集群进程。默认情况下，所有命令行选项可以设置为 `paddle.py` 命令选项并且 `paddle.py` 将透明、自动地将这些选项应用到 PaddlePaddle 底层进程。

@@ -216,10 +216,10 @@ sh run.sh

 集群作业将会在几秒后启动。

-### 终止集群作业
+#### 终止集群作业
 `paddle.py`能获取`Ctrl + C` SIGINT 信号来自动终止它启动的所有进程。只需中断 `paddle.py` 任务来终止集群作业。如果程序崩溃你也可以手动终止。

-### 检查集群训练结果
+#### 检查集群训练结果
 详细信息请检查 $workspace/log 里的日志，每一个节点都有相同的日志结构。

 `paddle_trainer.INFO`
@@ -234,13 +234,13 @@ sh run.sh
 `train.log`
 提供训练过程的 stderr 和 stdout。训练失败时可以检查错误日志。

-### 检查模型输出
+#### 检查模型输出
 运行完成后，模型文件将被写入节点 0 的 `output` 目录中。
 工作空间中的 `nodefile` 表示当前集群作业的节点 ID。

-## 在OpenMPI集群中提交训练作业
+### 在OpenMPI集群中提交训练作业

-### 准备OpenMPI集群
+#### 准备OpenMPI集群

 执行下面的命令以启动3个节点的OpenMPI集群和一个"head"节点：

@@ -252,7 +252,7 @@ kubectl create -f mpi-nodes.yaml

 然后可以从head节点ssh无密码登录到OpenMPI的每个节点上。

-### 启动集群作业
+#### 启动集群作业

 您可以按照下面的步骤在OpenMPI集群中提交paddle训练任务：

@@ -280,6 +280,6 @@ scp train.txt-00002 test.txt-00002 [node3IP]:/home/tutorial
 mpirun -hostfile machines -n 3  /home/tutorial/start_mpi_train.sh
 ```

-## 在Kubernetes集群中提交训练作业
+### 在Kubernetes集群中提交训练作业

 此部分的使用方法可以参考[here](../k8s/k8s_distributed_cn.md)。
--- a/doc/howto/usage/cluster/cluster_train_en.md
+++ b/doc/howto/usage/cluster/cluster_train_en.md
@@ -19,7 +19,7 @@
      * [Launching Cluster Job](#launching-cluster-job-1)
   * [Cluster Training Using Kubernetes](#cluster-training-using-kubernetes)

-# Introduction
+## Introduction

 In this article, we'll explain how to run distributed training jobs with PaddlePaddle on different types of clusters. The diagram below shows the main architecture of a distributed trainning job:

@@ -33,7 +33,7 @@ PaddlePaddle can support both synchronize stochastic gradient descent (SGD) and

 When training with synchronize SGD, PaddlePaddle uses an internal "synchronize barrier" which makes gradients update and parameter download in strict order. On the other hand, asynchronous SGD won't wait for all trainers to finish upload at a single step, this will increase the parallelism of distributed training: parameter servers do not depend on each other, they'll do parameter optimization concurrently. Parameter servers will not wait for trainers, so trainers will also do their work concurrently. But asynchronous SGD will introduce more randomness and noises in the gradient.

-# Preparations
+## Preparations
 1. Prepare your computer cluster. It's normally a bunch of Linux servers connected by LAN. Each server will be assigned a unique IP address. The computers in the cluster can be called "nodes".
 2. Install PaddlePaddle on every node. If you are going to take advantage of GPU cards, you'll also need to install proper driver and CUDA libraries. To install PaddlePaddle please read [this build and install](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/getstarted/build_and_install) document. We strongly recommend using [Docker installation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_en.rst).

@@ -52,9 +52,9 @@ PaddlePaddle 0.10.0rc, compiled with

 We'll take `doc/howto/usage/cluster/src/word2vec` as an example to introduce distributed training using PaddlePaddle v2 API.

-# Command-line arguments
+## Command-line arguments

-## Starting parameter server
+### Starting parameter server

 Type the below command to start a parameter server which will wait for trainers to connect:

@@ -74,7 +74,7 @@ $ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num
 | ports_num_for_sparse  | required | 1 | number of ports which serves sparse parameter update  |
 | num_gradient_servers  | required | 1 | total number of gradient servers |

-## Starting trainer
+### Starting trainer
 Type the command below to start the trainer(name the file whatever you want, like "train.py")

 ```bash
@@ -122,7 +122,7 @@ paddle.init(
 | trainer_id  | required | 0 | ID for every trainer, start from 0 |
 | pservers  | required | 127.0.0.1 | list of IPs of parameter servers, separated by "," |

-## Prepare Training Dataset
+### Prepare Training Dataset

 Here's some example code [prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py), it will download public `imikolov` dataset and split it into multiple files according to job parallelism(trainers count). Modify `SPLIT_COUNT` at the begining of `prepare.py` to change the count of output files.

@@ -155,7 +155,7 @@ When job started, every trainer needs to get it's own part of data. In some dist

 Different training jobs may have different data format and `reader()` function, developers may need to write different data prepare scripts and `reader()` functions for their job.

-## Prepare Training program
+### Prepare Training program

 We'll create a *workspace* directory on each node, storing your training program, dependencies, mounted or downloaded dataset directory.

@@ -191,7 +191,7 @@ Your workspace may looks like:
 - `train_data_dir`: containing training data. Mount from storage service or copy trainning data to here.
 - `test_data_dir`: containing testing data.

-# Use cluster platforms or cluster management tools
+## Use cluster platforms or cluster management tools

 PaddlePaddle supports running jobs on several platforms including:
 - [Kubernetes](http://kubernetes.io) open-source system for automating deployment, scaling, and management of containerized applications from Google.
@@ -202,13 +202,13 @@ We'll introduce cluster job management on these platforms. The examples can be f

 These cluster platforms provide API or environment variables for training processes, when the job is dispatched to different nodes. Like node ID, IP or total number of nodes etc.

-## Cluster Training Using Fabric
+### Cluster Training Using Fabric

-### Prepare a Linux cluster
+#### Prepare a Linux cluster

 Run `kubectl -f ssh_servers.yaml` under the directory:  `paddle/scripts/cluster_train_v2/fabric/docker_cluster` will launch a demo cluster. Run `kubectl get po -o wide` to get IP addresses of these nodes.

-### Launching Cluster Job
+#### Launching Cluster Job
 `paddle.py` provides automatical scripts to start all PaddlePaddle cluster processes in different nodes. By default, all command line options can be set as `paddle.py` command options and `paddle.py` will transparently and automatically set these options to PaddlePaddle lower level processes.

 `paddle.py`provides two distinguished command option for easy job launching.
@@ -224,10 +224,10 @@ sh run.sh

 The cluster Job will start in several seconds.

-### Kill Cluster Job
+#### Kill Cluster Job
 `paddle.py` can capture `Ctrl + C` SIGINT signal to automatically kill all processes launched by it. So just stop `paddle.py` to kill cluster job. You should manually kill the job if the program crashed.

-### Check Cluster Training Result
+#### Check Cluster Training Result
 Check log in $workspace/log for details, each node owns same log structure.

 `paddle_trainer.INFO`
@@ -242,13 +242,13 @@ It provides stderr and stdout of parameter server process. Check error log if tr
 `train.log`
 It provides stderr and stdout of trainer process. Check error log if training crashes.

-### Check Model Output
+#### Check Model Output
 After one pass finished, model files will be written in `output` directory in node 0.
 `nodefile` in workspace indicates the node id of current cluster job.

-## Cluster Training Using OpenMPI
+### Cluster Training Using OpenMPI

-### Prepare an OpenMPI cluster
+#### Prepare an OpenMPI cluster

 Run the following command to start a 3-node MPI cluster and one "head" node.

@@ -260,7 +260,7 @@ kubectl create -f mpi-nodes.yaml

 Then you can log in to every OpenMPI node using ssh without input any passwords.

-### Launching Cluster Job
+#### Launching Cluster Job

 Follow the steps to launch a PaddlePaddle training job in OpenMPI cluster:\

@@ -288,6 +288,6 @@ scp train.txt-00002 test.txt-00002 [node3IP]:/home/tutorial
 mpirun -hostfile machines -n 3  /home/tutorial/start_mpi_train.sh
 ```

-## Cluster Training Using Kubernetes
+### Cluster Training Using Kubernetes

 The details can be found [here](../k8s/k8s_cn.md)
--- a/paddle/cuda/include/hl_matrix.h
+++ b/paddle/cuda/include/hl_matrix.h
@@ -300,4 +300,12 @@ extern void hl_matrix_col2Vol(real* dataDst,
                              real alpha,
                              real beta);

+/**
+ * @brief  Matrix col2Vol: Convert col matrix into 3D volume
+ * @param[out]  out     output int vector.
+ * @param[in]   vec     input float vector.
+ * @param[in]   size    size of the vector.
+ */
+extern void hl_vector_cast2int(int* out, real* vec, int size);
+
 #endif /* HL_MATRIX_H_ */
--- a/paddle/cuda/include/stub/hl_matrix_stub.h
+++ b/paddle/cuda/include/stub/hl_matrix_stub.h
@@ -133,4 +133,6 @@ inline void hl_matrix_col2Vol(real* dataDst,
                              real alpha,
                              real beta) {}

+inline void hl_vector_cast2int(int* out, real* vec, int size) {}
+
 #endif  // HL_MATRIX_STUB_H_
--- a/paddle/cuda/src/hl_cuda_matrix.cu
+++ b/paddle/cuda/src/hl_cuda_matrix.cu
@@ -793,3 +793,14 @@ void hl_matrix_col2Vol(real* dataDst,

  CHECK_SYNC("hl_matrix_col2Vol failed");
 }
+
+__global__ void keVectorCast2Int(int* out, real* vec, int size) {
+  for (int i = threadIdx.x; i < (size); i += blockDim.x) {
+    out[i] = int(vec[i]);
+  }
+}
+
+void hl_vector_cast2int(int* out, real* vec, int size) {
+  keVectorCast2Int<<<1, 512, 0, STREAM_DEFAULT>>>(out, vec, size);
+  CHECK_SYNC("hl_vector_cast2int failed");
+}
--- a/paddle/framework/CMakeLists.txt
+++ b/paddle/framework/CMakeLists.txt
@@ -20,7 +20,8 @@ cc_test(scope_test SRCS scope_test.cc DEPS scope)


 cc_library(attribute SRCS attribute.cc DEPS framework_proto)
-cc_test(program_desc_test SRCS program_desc_test.cc DEPS proto_desc)
+cc_test(program_desc_test SRCS program_desc_test.cc DEPS proto_desc
+device_context)
 cc_library(op_proto_maker SRCS op_proto_maker.cc DEPS framework_proto attribute)
 cc_test(op_proto_maker_test SRCS op_proto_maker_test.cc DEPS op_proto_maker)
 cc_library(op_info SRCS op_info.cc DEPS attribute framework_proto)

--- a/paddle/framework/attribute.cc
+++ b/paddle/framework/attribute.cc
@@ -19,7 +19,7 @@ limitations under the License. */
 namespace paddle {
 namespace framework {

-Attribute GetAttrValue(const OpDesc::Attr& attr_desc, ProgramDesc* program) {
+Attribute GetAttrValue(const OpDesc::Attr& attr_desc) {
  switch (attr_desc.type()) {
    case framework::AttrType::BOOLEAN: {
      return attr_desc.b();
@@ -61,13 +61,9 @@ Attribute GetAttrValue(const OpDesc::Attr& attr_desc, ProgramDesc* program) {
      }
      return val;
    }
-    case framework::AttrType::BLOCK: {
-      PADDLE_ENFORCE(program != nullptr,
-                     "Need to specify ProgramDesc when get a block attr");
-      return program->mutable_blocks(attr_desc.block_idx());
-    }
+    default:
+      PADDLE_THROW("Unsupport attr type %d", attr_desc.type());
  }
-  PADDLE_ENFORCE(false, "Unknown OpDesc::AttrDesc::type !");
  return boost::blank();
 }


--- a/paddle/framework/attribute.h
+++ b/paddle/framework/attribute.h
@@ -32,7 +32,7 @@ inline AttrType AttrTypeID() {
  return static_cast<AttrType>(tmp.which() - 1);
 }

-Attribute GetAttrValue(const OpDesc::Attr& attr_desc, ProgramDesc* desc);
+Attribute GetAttrValue(const OpDesc::Attr& attr_desc);

 class AttrReader {
 public:

--- a/paddle/framework/backward.cc
+++ b/paddle/framework/backward.cc
@@ -18,12 +18,12 @@
 #include <deque>
 #include <list>
 #include <memory>
+#include <unordered_set>

 #include "paddle/framework/block_desc.h"
 #include "paddle/framework/op_registry.h"
 #include "paddle/operators/dynamic_recurrent_op.h"
 #include "paddle/operators/net_op.h"
-#include "paddle/operators/recurrent_op.h"

 namespace paddle {
 namespace framework {
@@ -37,7 +37,7 @@ static inline std::unique_ptr<OperatorBase> CreateGradOp(
  op_desc.SetType(op.Type());
  op_desc.SetAttrMap(op.Attrs());
  auto& info = OpInfoMap::Instance().Get(op.Type());
-  auto grad_descs = info.GradOpMaker()(op_desc, no_grad_set, grad_to_var);
+  auto grad_descs = info.GradOpMaker()(op_desc, no_grad_set, grad_to_var, {});
  std::vector<std::unique_ptr<OperatorBase>> grad_ops;
  grad_ops.reserve(grad_descs.size());
  std::transform(grad_descs.begin(), grad_descs.end(),
@@ -219,19 +219,7 @@ static std::unique_ptr<OperatorBase> BackwardRecursive(
                   });

    // process recurrent gradient op as a special operator.
-    if (forwardOp.Type() == "recurrent") {
-      // NOTE clean up cycle call somewhere (RNN's stepnet constains itself),
-      // or this will result in infinite loop.
-      const auto& rnnop =
-          *static_cast<const operators::RecurrentOp*>(&forwardOp);
-      auto rnn_grad_op =
-          static_cast<operators::RecurrentGradientOp*>(grad_op.get());
-      const auto& stepnet_op =
-          *static_cast<const OperatorBase*>(&rnnop.stepnet());
-      // create stepnet's gradient op
-      rnn_grad_op->set_stepnet(
-          BackwardRecursive(stepnet_op, no_grad_names, grad_to_var, uniq_id));
-    } else if (forwardOp.Type() == "dynamic_recurrent") {
+    if (forwardOp.Type() == "dynamic_recurrent") {
      // NOTE clean up cycle call somewhere (RNN's stepnet constains itself),
      // or this will result in infinite loop.
      const auto& rnnop =
@@ -285,6 +273,15 @@ static bool AllGradInSet(const std::vector<std::string>& names,
  return true;
 }

+static std::string FwdName(const std::string& grad_name) {
+  auto pos = grad_name.find("@GRAD");
+  if (pos == std::string::npos) {
+    return "";
+  } else {
+    return grad_name.substr(0, pos);
+  }
+}
+
 static void CreateGradVarInBlock(
    size_t grad_op_start_index,
    const std::unordered_map<std::string, std::string>& param_name_map,
@@ -294,6 +291,7 @@ static void CreateGradVarInBlock(
  for (size_t op_index = grad_op_start_index; op_index < ops.size();
       ++op_index) {
    bool need_infer_shape = false;
+    std::unordered_set<std::string> new_vars;
    ForEachVarName(ops[op_index]->Outputs(),
                   [&](const std::string& grad_var_name) {
                     if (block_desc->HasVar(grad_var_name)) {
@@ -301,8 +299,7 @@ static void CreateGradVarInBlock(
                     }
                     need_infer_shape = true;
                     auto var = block_desc->Var(grad_var_name);
-                     // FIXME(qiao) infer the datatype
-                     var->SetDataType(framework::DataType::FP32);
+                     new_vars.insert(var->Name());
                     auto it = param_name_map.find(grad_var_name);
                     if (it == param_name_map.end()) {
                       return false;
@@ -316,6 +313,21 @@ static void CreateGradVarInBlock(
                   });
    if (need_infer_shape) {
      ops[op_index]->InferVarType(block_desc);
+      for (auto& arg : ops[op_index]->OutputArgumentNames()) {
+        if (new_vars.find(arg) == new_vars.end()) {
+          continue;
+        }
+        auto pname = FwdName(arg);
+        auto* param = block_desc->FindVarRecursive(pname);
+        auto* grad = block_desc->FindVar(arg);
+        if (param == nullptr) {
+          LOG(WARNING) << "Cannot find forward variable of " << arg
+                       << ". Set its gradient to FP32";
+          grad->SetDataType(DataType::FP32);
+        } else {
+          grad->SetDataType(param->GetDataType());
+        }
+      }
      ops[op_index]->InferShape(*block_desc);
    }
  }
@@ -323,7 +335,9 @@ static void CreateGradVarInBlock(

 std::vector<std::unique_ptr<OpDescBind>> MakeOpGrad(
    const OpDescBind* op_desc, std::unordered_set<std::string>* no_grad_vars,
-    std::unordered_map<std::string, std::string>* grad_to_var) {
+    std::unordered_map<std::string, std::string>* grad_to_var,
+    const std::vector<BlockDescBind*>& grad_block =
+        std::vector<BlockDescBind*>()) {
  std::vector<std::unique_ptr<OpDescBind>> grad_op_descs;
  // All input gradients of forwarding operator do not need to calculate.
  const std::vector<std::string>& inputs = op_desc->InputArgumentNames();
@@ -339,9 +353,10 @@ std::vector<std::unique_ptr<OpDescBind>> MakeOpGrad(
    return grad_op_descs;  // empty vector
  }

-  grad_op_descs = OpInfoMap::Instance()
-                      .Get(op_desc->Type())
-                      .GradOpMaker()(*op_desc, *no_grad_vars, grad_to_var);
+  grad_op_descs =
+      OpInfoMap::Instance()
+          .Get(op_desc->Type())
+          .GradOpMaker()(*op_desc, *no_grad_vars, grad_to_var, grad_block);

  std::list<std::unique_ptr<OpDescBind>> pending_fill_zeros_ops;
  for (auto& desc : grad_op_descs) {
@@ -368,28 +383,27 @@ std::vector<std::unique_ptr<OpDescBind>> MakeBlockBackward(
    ProgramDescBind& program_desc, int block_idx,
    std::unordered_set<std::string>* no_grad_vars,
    std::unordered_map<std::string, std::string>* grad_to_var) {
-  BlockDescBind* cur_block = program_desc.Block(block_idx);
+  BlockDescBind* cur_block = program_desc.MutableBlock(block_idx);
  std::vector<OpDescBind*> op_descs = cur_block->AllOps();
  std::unordered_map<std::string, std::vector<size_t>> dup_out_ops;
  size_t grad_desc_idx = 0;
  std::vector<std::unique_ptr<OpDescBind>> backward_descs;

  for (auto it = op_descs.rbegin(); it != op_descs.rend(); ++it) {
-    std::vector<std::unique_ptr<OpDescBind>> op_grads =
-        MakeOpGrad(*it, no_grad_vars, grad_to_var);
+    std::vector<std::unique_ptr<OpDescBind>> op_grads;

    if ((*it)->Type() == "recurrent") {
-      PADDLE_ENFORCE_EQ(
-          op_grads.size(), static_cast<size_t>(1),
-          "rnn_op's gradient process should contain only one op.");
      int step_block_idx = (*it)->GetBlockAttr("step_block");
      auto backward_block_op_descs = MakeBlockBackward(
          program_desc, step_block_idx, no_grad_vars, grad_to_var);
-      BlockDescBind* backward_block = program_desc.AppendBlock(*cur_block);
+      BlockDescBind* backward_block =
+          program_desc.AppendBlock(*program_desc.MutableBlock(step_block_idx));
      for (auto& ptr : backward_block_op_descs) {
        backward_block->AppendAllocatedOp(std::move(ptr));
      }
-      op_grads[0]->SetBlockAttr("step_block", *backward_block);
+      op_grads = MakeOpGrad(*it, no_grad_vars, grad_to_var, {backward_block});
+    } else {
+      op_grads = MakeOpGrad(*it, no_grad_vars, grad_to_var);
    }

    for (const auto& desc : op_grads) {
@@ -443,7 +457,7 @@ ParamGradInfoMap AppendBackward(
  }

  const int root_block_idx = 0;
-  auto root_block = program_desc.Block(root_block_idx);
+  auto root_block = program_desc.MutableBlock(root_block_idx);

  // insert fill one op for target
  // TODO(qiao) add some check to the target.
@@ -492,7 +506,7 @@ ParamGradInfoMap AppendBackward(
  CreateGradVarInBlock(forward_op_num, grad_to_var, root_block, &retv);
  for (size_t block_index = forward_block_num;
       block_index < program_desc.Size(); ++block_index) {
-    CreateGradVarInBlock(0, grad_to_var, program_desc.Block(block_index),
+    CreateGradVarInBlock(0, grad_to_var, program_desc.MutableBlock(block_index),
                         &retv);
  }
  return retv;

--- a/paddle/framework/backward_test.cc
+++ b/paddle/framework/backward_test.cc
@@ -499,7 +499,7 @@ TEST(Backward, linear_net_intermediate_variable_has_no_grad) {

 TEST(Backward, simple_single_op) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);

  f::OpDescBind *op = block->AppendOp();
  op->SetType("rowwise_add");
@@ -535,7 +535,7 @@ TEST(Backward, simple_single_op) {

 TEST(Backward, default_attribute) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  f::OpDescBind *op = block->AppendOp();
  op->SetType("mul");
  op->SetInput("X", {"x"});
@@ -561,7 +561,7 @@ TEST(Backward, default_attribute) {

 TEST(Backward, simple_mult_op) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  f::OpDescBind *op1 = block->AppendOp();
  op1->SetType("rowwise_add");
  op1->SetInput("X", {"x1"});
@@ -644,7 +644,7 @@ TEST(Backward, simple_mult_op) {

 TEST(Backward, intermedia_var_no_grad) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  f::OpDescBind *op1 = block->AppendOp();
  op1->SetType("rowwise_add");
  op1->SetInput("X", {"x1"});
@@ -714,7 +714,7 @@ TEST(Backward, intermedia_var_no_grad) {

 TEST(Backward, var_no_grad) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  f::OpDescBind *op1 = block->AppendOp();
  op1->SetType("mult_in_out");
  op1->SetInput("X", {"x1"});
@@ -790,7 +790,7 @@ TEST(Backward, var_no_grad) {

 TEST(Backward, shared_var) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  f::OpDescBind *op1 = block->AppendOp();
  op1->SetType("rowwise_add");
  op1->SetInput("X", {"x1"});
@@ -880,7 +880,7 @@ TEST(Backward, shared_var) {

 TEST(Backward, half_backward) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  auto *op1 = block->AppendOp();
  op1->SetType("minus");
  op1->SetInput("X", {"a"});

--- a/paddle/framework/block_desc.cc
+++ b/paddle/framework/block_desc.cc
@@ -113,7 +113,7 @@ BlockDescBind *BlockDescBind::ParentBlock() const {
  if (this->desc_->parent_idx() == kNoneBlockIndex) {
    return nullptr;
  }
-  return prog_->Block(static_cast<size_t>(this->desc_->parent_idx()));
+  return prog_->MutableBlock(static_cast<size_t>(this->desc_->parent_idx()));
 }

 BlockDesc *BlockDescBind::Proto() {

--- a/paddle/framework/block_desc.h
+++ b/paddle/framework/block_desc.h
@@ -88,6 +88,8 @@ class BlockDescBind {

  BlockDesc *Proto();

+  ProgramDescBind *Program() { return this->prog_; }
+
 private:
  void ClearPBOps();
  void ClearPBVars();

--- a/paddle/framework/details/op_registry.h
+++ b/paddle/framework/details/op_registry.h
@@ -108,8 +108,9 @@ struct OpInfoFiller<T, kGradOpDescMaker> {
    info->grad_op_maker_ = [](
        const OpDescBind& fwd_op,
        const std::unordered_set<std::string>& no_grad_set,
-        std::unordered_map<std::string, std::string>* grad_to_var) {
-      T maker(fwd_op, no_grad_set, grad_to_var);
+        std::unordered_map<std::string, std::string>* grad_to_var,
+        const std::vector<BlockDescBind*>& grad_block) {
+      T maker(fwd_op, no_grad_set, grad_to_var, grad_block);
      return maker();
    };
  }

--- a/paddle/framework/executor.cc
+++ b/paddle/framework/executor.cc
@@ -31,7 +31,7 @@ namespace framework {
 const std::string kFeedOpType = "feed";
 const std::string kFetchOpType = "fetch";

-Executor::Executor(const std::vector<platform::Place>& places) {
+Executor::Executor(const std::vector<platform::Place>& places) : own_(true) {
  PADDLE_ENFORCE_GT(places.size(), 0);
  device_contexts_.resize(places.size());
  for (size_t i = 0; i < places.size(); i++) {
@@ -52,8 +52,10 @@ Executor::Executor(const std::vector<platform::Place>& places) {
 }

 Executor::~Executor() {
-  for (auto& device_context : device_contexts_) {
-    delete device_context;
+  if (own_) {
+    for (auto& device_context : device_contexts_) {
+      delete device_context;
+    }
  }
 }

@@ -66,45 +68,61 @@ static void CreateTensor(Variable* var, VarDesc::VarType var_type) {
    var->GetMutable<FeedFetchList>();
  } else if (var_type == VarDesc::FETCH_LIST) {
    var->GetMutable<FeedFetchList>();
+  } else if (var_type == VarDesc::STEP_SCOPES) {
+    var->GetMutable<std::vector<framework::Scope>>();
  } else {
    PADDLE_THROW(
-        "Variable type must be "
-        "LoDTensor/SelectedRows/FEED_MINIBATCH/FETCH_LIST.");
+        "Variable type %d is not in "
+        "[LoDTensor, SelectedRows, FEED_MINIBATCH, FETCH_LIST]",
+        var_type);
  }
 }

-void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id) {
+void Executor::Run(const ProgramDescBind& pdesc, Scope* scope, int block_id,
+                   bool create_local_scope) {
  // TODO(tonyyang-svail):
  //    - only runs on the first device (i.e. no interdevice communication)
  //    - will change to use multiple blocks for RNN op and Cond Op
-  PADDLE_ENFORCE_GT(pdesc.blocks_size(), block_id);
-  auto& block = pdesc.blocks(block_id);
+  PADDLE_ENFORCE_LT(static_cast<size_t>(block_id), pdesc.Size());
+  auto& block = pdesc.Block(block_id);
  auto& device = device_contexts_[0];

-  Scope& local_scope = scope->NewScope();
-
-  for (auto& var : block.vars()) {
-    if (var.persistable()) {
-      auto* ptr = scope->Var(var.name());
-      CreateTensor(ptr, var.type());
-      VLOG(3) << "Create Variable " << var.name()
-              << " global, which pointer is " << ptr;
-    } else {
-      auto* ptr = local_scope.Var(var.name());
-      CreateTensor(ptr, var.type());
-      VLOG(3) << "Create Variable " << var.name()
-              << " locally, which pointer is " << ptr;
+  Scope* local_scope = scope;
+  if (create_local_scope) {
+    local_scope = &scope->NewScope();
+    for (auto& var : block.AllVars()) {
+      if (var->Persistable()) {
+        auto* ptr = scope->Var(var->Name());
+        CreateTensor(ptr, var->GetType());
+        VLOG(3) << "Create Variable " << var->Name()
+                << " global, which pointer is " << ptr;
+      } else {
+        auto* ptr = local_scope->Var(var->Name());
+        CreateTensor(ptr, var->GetType());
+        VLOG(3) << "Create Variable " << var->Name()
+                << " locally, which pointer is " << ptr;
+      }
+    }
+  } else {
+    for (auto& var : block.AllVars()) {
+      auto* ptr = local_scope->Var(var->Name());
+      CreateTensor(ptr, var->GetType());
+      VLOG(3) << "Create variable " << var->Name() << ", which pointer is "
+              << ptr;
    }
  }

-  for (auto& op_desc : block.ops()) {
-    auto op = paddle::framework::OpRegistry::CreateOp(
-        op_desc, const_cast<ProgramDesc*>(&pdesc));
-    op->Run(local_scope, *device);
+  for (auto& op_desc : block.AllOps()) {
+    auto op = paddle::framework::OpRegistry::CreateOp(*op_desc);
+    op->Run(*local_scope, *device);
+  }
+  if (create_local_scope) {
+    scope->DeleteScope(local_scope);
  }
-
-  scope->DeleteScope(&local_scope);
 }

+Executor::Executor(const platform::DeviceContext& device)
+    : device_contexts_({&device}), own_(false) {}
+
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/executor.h
+++ b/paddle/framework/executor.h
@@ -14,8 +14,8 @@ limitations under the License. */

 #pragma once

-#include "paddle/framework/framework.pb.h"
 #include "paddle/framework/op_info.h"
+#include "paddle/framework/program_desc.h"
 #include "paddle/framework/scope.h"
 #include "paddle/framework/tensor.h"

@@ -25,6 +25,7 @@ namespace framework {
 class Executor {
 public:
  explicit Executor(const std::vector<platform::Place>& places);
+  explicit Executor(const platform::DeviceContext& devices);
  ~Executor();

  /* @Brief
@@ -34,10 +35,11 @@ class Executor {
   *  ProgramDesc
   *  Scope
   */
-  void Run(const ProgramDesc&, Scope*, int);
+  void Run(const ProgramDescBind&, Scope*, int, bool create_local_scope = true);

 private:
-  std::vector<platform::DeviceContext*> device_contexts_;
+  std::vector<const platform::DeviceContext*> device_contexts_;
+  bool own_;
 };

 }  // namespace framework

--- a/paddle/framework/grad_op_desc_maker.h
+++ b/paddle/framework/grad_op_desc_maker.h
@@ -15,6 +15,7 @@
 #pragma once
 #include <string>
 #include <unordered_set>
+#include <vector>
 #include "paddle/framework/op_desc.h"
 #include "paddle/framework/operator.h"

@@ -26,8 +27,13 @@ class GradOpDescMakerBase {
  explicit GradOpDescMakerBase(
      const OpDescBind& fwd_op,
      const std::unordered_set<std::string>& no_grad_set,
-      std::unordered_map<std::string, std::string>* grad_to_var)
-      : fwd_op_(fwd_op), no_grad_set_(no_grad_set), grad_to_var_(grad_to_var) {}
+      std::unordered_map<std::string, std::string>* grad_to_var,
+      const std::vector<BlockDescBind*>& grad_block =
+          std::vector<BlockDescBind*>())
+      : fwd_op_(fwd_op),
+        no_grad_set_(no_grad_set),
+        grad_to_var_(grad_to_var),
+        grad_block_(grad_block) {}

  virtual ~GradOpDescMakerBase() = default;
  virtual std::vector<std::unique_ptr<OpDescBind>> operator()() const = 0;
@@ -102,6 +108,9 @@ class GradOpDescMakerBase {
  const OpDescBind& fwd_op_;
  const std::unordered_set<std::string>& no_grad_set_;
  std::unordered_map<std::string, std::string>* grad_to_var_;
+
+ protected:
+  std::vector<BlockDescBind*> grad_block_;
 };

 class SingleGradOpDescMaker : public GradOpDescMakerBase {

--- a/paddle/framework/lod_tensor_test.cu
+++ b/paddle/framework/lod_tensor_test.cu
@@ -36,8 +36,8 @@ TEST(LoDTensor, LoDInGPU) {
  lod_tensor.mutable_data<float>(place);

  lod_tensor.set_lod(src_lod);
-  CHECK_EQ(lod_tensor.lod_element(0, 2).first, 4UL);
-  CHECK_EQ(lod_tensor.lod_element(0, 4).first, 8UL);
+  EXPECT_EQ(lod_tensor.lod_element(0, 2).first, 4UL);
+  EXPECT_EQ(lod_tensor.lod_element(0, 4).first, 8UL);

  auto lod = lod_tensor.lod();

@@ -45,6 +45,6 @@ TEST(LoDTensor, LoDInGPU) {
  cudaDeviceSynchronize();

  for (size_t i = 0; i < src_lod[0].size(); ++i) {
-    CHECK_EQ(lod[0].data()[i], src_lod[0].data()[i] * 2);
+    EXPECT_EQ(lod[0].data()[i], src_lod[0].data()[i] * 2);
  }
-}
\ No newline at end of file
+}
--- a/paddle/framework/op_desc.cc
+++ b/paddle/framework/op_desc.cc
@@ -52,6 +52,22 @@ class CompileTimeInferShapeContext : public InferShapeContext {
  const std::vector<std::string> &Outputs(
      const std::string &name) const override;

+  void ShareLoD(const std::string &in, const std::string &out, size_t i = 0,
+                size_t j = 0) const override {
+    PADDLE_ENFORCE_LT(i, Inputs(in).size());
+    PADDLE_ENFORCE_LT(j, Outputs(out).size());
+    auto *in_var = block_.FindVarRecursive(Inputs(in)[i]);
+    auto *out_var = block_.FindVarRecursive(Outputs(out)[j]);
+    if (in_var->GetType() != VarDesc::LOD_TENSOR) {
+      VLOG(3) << "input " << in << "is not LodTensor";
+      return;
+    }
+    PADDLE_ENFORCE_EQ(in_var->GetType(), VarDesc::LOD_TENSOR,
+                      "The %d-th output of Output(%s) must be LoDTensor.", j,
+                      out);
+    in_var->SetLoDLevel(out_var->GetLodLevel());
+  }
+
 private:
  DDim GetDim(const std::string &name) const override;

@@ -98,7 +114,12 @@ OpDescBind::OpDescBind(const OpDesc &desc, ProgramDescBind *prog)
  // restore attrs_
  for (const OpDesc::Attr &attr : desc_.attrs()) {
    std::string attr_name = attr.name();
-    attrs_[attr_name] = GetAttrValue(attr, prog->Proto());
+    if (attr.type() != AttrType::BLOCK) {
+      attrs_[attr_name] = GetAttrValue(attr);
+    } else {
+      auto bid = attr.block_idx();
+      attrs_[attr_name] = prog->MutableBlock(bid);
+    }
  }
 }

@@ -172,8 +193,7 @@ void OpDescBind::SetAttr(const std::string &name, const Attribute &v) {
 }

 void OpDescBind::SetBlockAttr(const std::string &name, BlockDescBind &block) {
-  BlockDesc *desc = block.Proto();
-  this->attrs_[name] = desc;
+  this->attrs_[name] = &block;
  need_update_ = true;
 }

@@ -192,7 +212,7 @@ Attribute OpDescBind::GetAttr(const std::string &name) const {
 int OpDescBind::GetBlockAttr(const std::string &name) const {
  auto it = attrs_.find(name);
  PADDLE_ENFORCE(it != attrs_.end(), "Attribute %s is not found", name);
-  return boost::get<BlockDesc *>(it->second)->idx();
+  return boost::get<BlockDescBind *>(it->second)->ID();
 }

 const std::unordered_map<std::string, Attribute> &OpDescBind::GetAttrMap()
@@ -307,6 +327,19 @@ void OpDescBind::InferShape(const BlockDescBind &block) const {
  PADDLE_ENFORCE(static_cast<bool>(infer_shape),
                 "%s's infer_shape has not been registered", this->Type());
  CompileTimeInferShapeContext ctx(*this, block);
+  if (VLOG_IS_ON(10)) {
+    std::ostringstream sout;
+    auto inames = this->InputArgumentNames();
+    sout << " From [";
+    std::copy(inames.begin(), inames.end(),
+              std::ostream_iterator<std::string>(sout, ", "));
+    sout << "] to [";
+    auto onames = this->OutputArgumentNames();
+    std::copy(onames.begin(), onames.end(),
+              std::ostream_iterator<std::string>(sout, ", "));
+    sout << "]";
+    VLOG(10) << sout.str();
+  }
  infer_shape(&ctx);
 }


--- a/paddle/framework/op_registry.cc
+++ b/paddle/framework/op_registry.cc
@@ -43,13 +43,15 @@ static VariableNameMap ConvertOpDescVarsToVarNameMap(
  return ret_val;
 }

-std::unique_ptr<OperatorBase> OpRegistry::CreateOp(const OpDesc& op_desc,
-                                                   ProgramDesc* program) {
+std::unique_ptr<OperatorBase> OpRegistry::CreateOp(const OpDesc& op_desc) {
+  VLOG(1) << "CreateOp directly from OpDesc is deprecated. It should only be"
+             "used in unit tests. Use CreateOp(const OpDescBind& op_desc) "
+             "instead.";
  VariableNameMap inputs = ConvertOpDescVarsToVarNameMap(op_desc.inputs());
  VariableNameMap outputs = ConvertOpDescVarsToVarNameMap(op_desc.outputs());
  AttributeMap attrs;
  for (auto& attr : op_desc.attrs()) {
-    attrs[attr.name()] = GetAttrValue(attr, program);
+    attrs[attr.name()] = GetAttrValue(attr);
  }

  return CreateOp(op_desc.type(), inputs, outputs, attrs);

--- a/paddle/framework/op_registry.h
+++ b/paddle/framework/op_registry.h
@@ -77,8 +77,7 @@ class OpRegistry {
                                                const VariableNameMap& outputs,
                                                AttributeMap attrs);

-  static std::unique_ptr<OperatorBase> CreateOp(const OpDesc& op_desc,
-                                                ProgramDesc* program);
+  static std::unique_ptr<OperatorBase> CreateOp(const OpDesc& op_desc);

  static std::unique_ptr<OperatorBase> CreateOp(const OpDescBind& op_desc);
 };

--- a/paddle/framework/op_registry_test.cc
+++ b/paddle/framework/op_registry_test.cc
@@ -74,7 +74,7 @@ TEST(OpRegistry, CreateOp) {
  attr->set_type(paddle::framework::AttrType::FLOAT);
  attr->set_f(scale);

-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  paddle::framework::Scope scope;
  paddle::platform::CPUDeviceContext dev_ctx;
  op->Run(scope, dev_ctx);
@@ -95,7 +95,7 @@ TEST(OpRegistry, IllegalAttr) {

  bool caught = false;
  try {
-    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+    paddle::framework::OpRegistry::CreateOp(op_desc);
  } catch (paddle::platform::EnforceNotMet err) {
    caught = true;
    std::string msg = "larger_than check fail";
@@ -115,7 +115,7 @@ TEST(OpRegistry, DefaultValue) {

  ASSERT_TRUE(op_desc.IsInitialized());

-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  paddle::framework::Scope scope;
  paddle::platform::CPUDeviceContext dev_ctx;
  op->Run(scope, dev_ctx);
@@ -131,7 +131,7 @@ TEST(OpRegistry, CustomChecker) {
  // attr 'test_attr' is not set
  bool caught = false;
  try {
-    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+    paddle::framework::OpRegistry::CreateOp(op_desc);
  } catch (paddle::platform::EnforceNotMet err) {
    caught = true;
    std::string msg = "Attribute 'test_attr' is required!";
@@ -149,7 +149,7 @@ TEST(OpRegistry, CustomChecker) {
  attr->set_i(3);
  caught = false;
  try {
-    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+    paddle::framework::OpRegistry::CreateOp(op_desc);
  } catch (paddle::platform::EnforceNotMet err) {
    caught = true;
    std::string msg = "'test_attr' must be even!";
@@ -166,7 +166,7 @@ TEST(OpRegistry, CustomChecker) {
  attr->set_name("test_attr");
  attr->set_type(paddle::framework::AttrType::INT);
  attr->set_i(4);
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  paddle::platform::CPUDeviceContext dev_ctx;
  paddle::framework::Scope scope;
  op->Run(scope, dev_ctx);

--- a/paddle/framework/operator.cc
+++ b/paddle/framework/operator.cc
@@ -37,32 +37,32 @@ ExecutionContext::GetEigenDevice<platform::GPUPlace, Eigen::GpuDevice>() const {
 std::string OperatorBase::Input(const std::string& name) const {
  auto& ins = Inputs(name);
  PADDLE_ENFORCE_LE(ins.size(), 1UL,
-                    "Op %s input %s should contain only one variable", type_,
-                    name);
+                    "Operator %s's input %s should contain only one variable.",
+                    type_, name);
  return ins.empty() ? kEmptyVarName : ins[0];
 }

 const std::vector<std::string>& OperatorBase::Inputs(
    const std::string& name) const {
  auto it = inputs_.find(name);
-  PADDLE_ENFORCE(it != inputs_.end(), "Op %s do not have input %s", type_,
-                 name);
+  PADDLE_ENFORCE(it != inputs_.end(), "Operator %s does not have the input %s.",
+                 type_, name);
  return it->second;
 }

 std::string OperatorBase::Output(const std::string& name) const {
  auto& outs = Outputs(name);
  PADDLE_ENFORCE_LE(outs.size(), 1UL,
-                    "Op %s output %s should contain only one variable", type_,
-                    name);
+                    "Operator %s's output %s should contain only one variable.",
+                    type_, name);
  return outs.empty() ? kEmptyVarName : outs[0];
 }

 const std::vector<std::string>& OperatorBase::Outputs(
    const std::string& name) const {
  auto it = outputs_.find(name);
-  PADDLE_ENFORCE(it != outputs_.end(), "Op %s does not have output called %s",
-                 type_, name);
+  PADDLE_ENFORCE(it != outputs_.end(),
+                 "Operator %s does not have an output called %s.", type_, name);
  return it->second;
 }

@@ -126,7 +126,7 @@ OperatorBase::OperatorBase(const std::string& type,

 std::vector<std::string> OperatorBase::InputVars() const {
  std::vector<std::string> ret_val;
-  for (auto& o : outputs_) {
+  for (auto& o : inputs_) {
    ret_val.reserve(ret_val.size() + o.second.size());
    ret_val.insert(ret_val.end(), o.second.begin(), o.second.end());
  }
@@ -351,6 +351,20 @@ class RuntimeInferShapeContext : public InferShapeContext {
    return op_.Outputs(name);
  }

+  void ShareLoD(const std::string& in, const std::string& out, size_t i = 0,
+                size_t j = 0) const override {
+    PADDLE_ENFORCE_LT(i, Inputs(in).size());
+    PADDLE_ENFORCE_LT(j, Outputs(out).size());
+    Variable* in_var = scope_.FindVar(Inputs(in)[i]);
+    Variable* out_var = scope_.FindVar(Outputs(out)[j]);
+    if (!in_var->IsType<LoDTensor>()) return;
+    PADDLE_ENFORCE(out_var->IsType<LoDTensor>(),
+                   "The %d-th output of Output(%s) must be LoDTensor.", j, out);
+    auto in_tensor = in_var->Get<LoDTensor>();
+    auto* out_tensor = out_var->GetMutable<LoDTensor>();
+    out_tensor->set_lod(in_tensor.lod());
+  }
+
 private:
  DDim GetDim(const std::string& name) const override {
    Variable* var = scope_.FindVar(name);
@@ -380,7 +394,19 @@ class RuntimeInferShapeContext : public InferShapeContext {

 void OperatorWithKernel::Run(const Scope& scope,
                             const platform::DeviceContext& dev_ctx) const {
-  VLOG(3) << "Running operator " << this->Type();
+  if (VLOG_IS_ON(1)) {
+    auto inputs = this->InputVars();
+    auto outputs = this->OutputVars(true);
+    std::ostringstream sout;
+    sout << "Run operator " << this->Type() << " From [";
+    std::ostream_iterator<std::string> out_it(sout, ",");
+    std::copy(inputs.begin(), inputs.end(), out_it);
+    sout << "] to [";
+    std::copy(outputs.begin(), outputs.end(), out_it);
+    sout << "]";
+    VLOG(1) << sout.str();
+  }
+
  RuntimeInferShapeContext infer_shape_ctx(*this, scope);
  this->InferShape(&infer_shape_ctx);


--- a/paddle/framework/operator.h
+++ b/paddle/framework/operator.h
@@ -427,7 +427,8 @@ class OperatorWithKernel : public OperatorBase {
            int tmp = static_cast<int>(ToDataType(t->type()));
            VLOG(3) << "Input " << ipt_name << " with data_type " << tmp;
            PADDLE_ENFORCE(tmp == data_type || data_type == -1,
-                           "DataType of Paddle Op %s must be same.", Type());
+                           "DataType of Paddle Op %s must be the same.",
+                           Type());
            data_type = tmp;
          }
        }

--- a/paddle/framework/operator_test.cc
+++ b/paddle/framework/operator_test.cc
@@ -83,7 +83,7 @@ TEST(OperatorBase, all) {
  paddle::platform::CPUDeviceContext device_context;
  paddle::framework::Scope scope;

-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  scope.Var("OUT1");
  ASSERT_EQ(paddle::framework::op_run_num, 0);
  op->Run(scope, device_context);
@@ -208,7 +208,7 @@ TEST(OpKernel, all) {
  paddle::platform::CPUDeviceContext cpu_device_context;
  paddle::framework::Scope scope;

-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 0);
  op->Run(scope, cpu_device_context);
  ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 1);
@@ -244,7 +244,7 @@ TEST(OpKernel, multi_inputs) {
  scope.Var("y0")->GetMutable<LoDTensor>();
  scope.Var("y1")->GetMutable<LoDTensor>();

-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  op->Run(scope, cpu_device_context);
 }


--- a/paddle/framework/program_desc.h
+++ b/paddle/framework/program_desc.h
@@ -37,7 +37,9 @@ class ProgramDescBind {

  BlockDescBind *AppendBlock(const BlockDescBind &parent);

-  BlockDescBind *Block(size_t idx) { return blocks_[idx].get(); }
+  BlockDescBind *MutableBlock(size_t idx) { return blocks_[idx].get(); }
+
+  const BlockDescBind &Block(size_t idx) const { return *blocks_[idx]; }

  size_t Size() const { return blocks_.size(); }


--- a/paddle/framework/program_desc_test.cc
+++ b/paddle/framework/program_desc_test.cc
@@ -20,7 +20,7 @@ namespace paddle {
 namespace framework {
 TEST(ProgramDesc, copy_ctor) {
  ProgramDescBind program;
-  auto* global_block = program.Block(0);
+  auto* global_block = program.MutableBlock(0);
  auto* x = global_block->Var("X");
  x->SetType(VarDesc_VarType_LOD_TENSOR);
  x->SetLoDLevel(0);
@@ -44,7 +44,7 @@ TEST(ProgramDesc, copy_ctor) {

  ProgramDescBind program_copy(program);

-  auto* global_block_copy = program_copy.Block(0);
+  auto* global_block_copy = program_copy.MutableBlock(0);
  ASSERT_NE(global_block, global_block_copy);

  auto assert_same_var = [&](const std::string& name, VarDescBind* var_before) {
@@ -82,7 +82,7 @@ TEST(ProgramDesc, copy_ctor) {

 TEST(ProgramDescBind, serialize_and_deserialize) {
  ProgramDescBind program_origin;
-  auto* global_block = program_origin.Block(0);
+  auto* global_block = program_origin.MutableBlock(0);
  auto* x = global_block->Var("X");
  x->SetType(VarDesc_VarType_LOD_TENSOR);
  x->SetLoDLevel(0);
@@ -108,7 +108,7 @@ TEST(ProgramDescBind, serialize_and_deserialize) {
  program_origin.Proto()->SerializeToString(&binary_str);

  ProgramDescBind program_restored(binary_str);
-  auto* global_block_restored = program_restored.Block(0);
+  auto* global_block_restored = program_restored.MutableBlock(0);
  ASSERT_NE(global_block, global_block_restored);

  auto assert_same_var = [&](const std::string& name, VarDescBind* var_before) {

--- a/paddle/framework/prune_test.cc
+++ b/paddle/framework/prune_test.cc
@@ -52,7 +52,7 @@ void AddOp(const std::string &type, const f::VariableNameMap &inputs,

 TEST(Prune, one_operator) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);

  AddOp("one_one", {{"input", {"a"}}}, {{"output", {"b"}}}, {}, block);

@@ -69,7 +69,7 @@ TEST(Prune, one_operator) {

 TEST(Prune, forward) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);

  AddOp("one_one", {{"input", {"a"}}}, {{"output", {"b"}}}, {}, block);
  AddOp("one_one", {{"input", {"b"}}}, {{"output", {"c"}}}, {}, block);
@@ -88,7 +88,7 @@ TEST(Prune, forward) {

 TEST(Prune, multi_input_op) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);

  AddOp("one_one", {{"input", {"a0"}}}, {{"output", {"b0"}}}, {}, block);
  AddOp("one_one", {{"input", {"a1"}}}, {{"output", {"b1"}}}, {}, block);
@@ -106,7 +106,7 @@ TEST(Prune, multi_input_op) {

 TEST(Prune, multi_output_op) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);

  AddOp("one_two", {{"input", {"a"}}}, {{"output", {"b", "c"}}}, {}, block);
  AddOp("one_one", {{"input", {"b"}}}, {{"output", {"b1"}}}, {}, block);
@@ -122,7 +122,7 @@ TEST(Prune, multi_output_op) {

 TEST(Prune, multi_target) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);

  AddOp("one_two", {{"input", {"a"}}}, {{"output", {"b", "c"}}}, {}, block);
  AddOp("one_one", {{"input", {"b"}}}, {{"output", {"b1"}}}, {}, block);

--- a/paddle/framework/scope.cc
+++ b/paddle/framework/scope.cc
@@ -47,8 +47,12 @@ Variable* Scope::Var(const std::string& name) {
  return v;
 }

-Variable* Scope::Var() {
-  return Var(string::Sprintf("%p.%d", this, vars_.size()));
+Variable* Scope::Var(std::string* name) {
+  auto var_name = string::Sprintf("%p.%d", this, vars_.size());
+  if (name != nullptr) {
+    *name = var_name;
+  }
+  return Var(var_name);
 }

 Variable* Scope::FindVar(const std::string& name) const {

--- a/paddle/framework/scope.h
+++ b/paddle/framework/scope.h
@@ -49,7 +49,7 @@ class Scope {
  Variable* Var(const std::string& name);

  /// Create a variable with a scope-unique name.
-  Variable* Var();
+  Variable* Var(std::string* name = nullptr);

  /// Find a variable in the scope or any of its ancestors.  Returns
  /// nullptr if cannot find.

--- a/paddle/framework/shape_inference.cc
+++ b/paddle/framework/shape_inference.cc
@@ -28,9 +28,6 @@ void InferShapeContext::SetOutputsDim(
  SetDims(names, dims);
 }

-void InferShapeContext::ShareLoD(const std::string &in, const std::string &out,
-                                 size_t i, size_t j) const {}
-
 std::vector<framework::DDim> InferShapeContext::GetDims(
    const std::vector<std::string> &names) const {
  std::vector<framework::DDim> ret;

--- a/paddle/framework/shape_inference.h
+++ b/paddle/framework/shape_inference.h
@@ -43,9 +43,8 @@ class InferShapeContext {
  virtual const std::vector<std::string> &Outputs(
      const std::string &name) const = 0;

-  // TODO(qiao) implement this function
-  void ShareLoD(const std::string &in, const std::string &out, size_t i = 0,
-                size_t j = 0) const;
+  virtual void ShareLoD(const std::string &in, const std::string &out,
+                        size_t i = 0, size_t j = 0) const = 0;

 protected:
  virtual framework::DDim GetDim(const std::string &name) const = 0;

--- a/paddle/framework/tensor.h
+++ b/paddle/framework/tensor.h
@@ -118,12 +118,14 @@ class Tensor {
                             const platform::DeviceContext& ctx);

  /**
-   * @brief   Return the slice of the tensor.
+   * @brief  Return a sub-tensor of the given tensor.
   *
-   * @param[in] begin_idx   The begin index of the slice.
-   * @param[in] end_idx     The end index of the slice.
+   * @param[in] begin_idx   The index of the start row(inclusive) to slice.
+   *                        The index number begins from 0.
+   * @param[in] end_idx     The index of the end row(exclusive) to slice.
+   *                        The index number begins from 0.
   */
-  inline Tensor Slice(const int& begin_idx, const int& end_idx) const;
+  inline Tensor Slice(int begin_idx, int end_idx) const;

  platform::Place place() const {
    PADDLE_ENFORCE_NOT_NULL(

--- a/paddle/framework/tensor_impl.h
+++ b/paddle/framework/tensor_impl.h
@@ -112,9 +112,10 @@ inline void* Tensor::mutable_data(platform::Place place, std::type_index type) {
  if (holder_ != nullptr) {
    holder_->set_type(type);
  }
-  PADDLE_ENFORCE_GT(numel(), 0,
-                    "Tensor's numel must be larger than zero to call "
-                    "Tensor::mutable_data. Call Tensor::set_dim first.");
+  PADDLE_ENFORCE_GT(
+      numel(), 0,
+      "When calling this method, the Tensor's numel must be larger than zero. "
+      "Please check Tensor::Resize has been called first.");
  int64_t size = numel() * SizeOfType(type);
  /* some versions of boost::variant don't have operator!= */
  if (holder_ == nullptr || !(holder_->place() == place) ||
@@ -227,12 +228,14 @@ inline void Tensor::CopyFromVector(const std::vector<T>& src,
 #endif
 }

-inline Tensor Tensor::Slice(const int& begin_idx, const int& end_idx) const {
+inline Tensor Tensor::Slice(int begin_idx, int end_idx) const {
  check_memory_size();
-  PADDLE_ENFORCE_GE(begin_idx, 0, "Slice begin index is less than zero.");
-  PADDLE_ENFORCE_LE(end_idx, dims_[0], "Slice end index is out of bound.");
-  PADDLE_ENFORCE_LT(begin_idx, end_idx,
-                    "Begin index must be less than end index.");
+  PADDLE_ENFORCE_GE(begin_idx, 0,
+                    "The start row index must be greater than 0.");
+  PADDLE_ENFORCE_LE(end_idx, dims_[0], "The end row index is out of bound.");
+  PADDLE_ENFORCE_LT(
+      begin_idx, end_idx,
+      "The start row index must be lesser than the end row index.");

  if (dims_[0] == 1) {
    return *this;

--- a/paddle/framework/type_defs.h
+++ b/paddle/framework/type_defs.h
@@ -29,6 +29,7 @@ class OpDescBind;
 class BlockDescBind;
 class BlockDesc;
 class InferShapeContext;
+class BlockDescBind;

 using VariableNameMap = std::map<std::string, std::vector<std::string>>;

@@ -36,7 +37,7 @@ using VariableNameMap = std::map<std::string, std::vector<std::string>>;
 using Attribute =
    boost::variant<boost::blank, int, float, std::string, std::vector<int>,
                   std::vector<float>, std::vector<std::string>, bool,
-                   std::vector<bool>, BlockDesc*>;
+                   std::vector<bool>, BlockDescBind*>;

 using AttributeMap = std::unordered_map<std::string, Attribute>;

@@ -46,7 +47,8 @@ using OpCreator = std::function<OperatorBase*(

 using GradOpMakerFN = std::function<std::vector<std::unique_ptr<OpDescBind>>(
    const OpDescBind&, const std::unordered_set<std::string>& /*no_grad_set*/,
-    std::unordered_map<std::string, std::string>* /*grad_to_var*/)>;
+    std::unordered_map<std::string, std::string>* /*grad_to_var*/,
+    const std::vector<BlockDescBind*>& grad_block)>;

 using InferVarTypeFN = std::function<void(const OpDescBind& /*op_desc*/,
                                          BlockDescBind* /*block*/)>;

--- a/paddle/framework/var_type_inference_test.cc
+++ b/paddle/framework/var_type_inference_test.cc
@@ -63,41 +63,43 @@ namespace framework {

 TEST(InferVarType, sum_op) {
  ProgramDescBind prog;
-  auto *op = prog.Block(0)->AppendOp();
+  auto *op = prog.MutableBlock(0)->AppendOp();
  op->SetType("sum");
  op->SetInput("X", {"test_a", "test_b", "test_c"});
  op->SetOutput("Out", {"test_out"});

-  prog.Block(0)->Var("test_a")->SetType(VarDesc::SELECTED_ROWS);
-  prog.Block(0)->Var("test_b")->SetType(VarDesc::SELECTED_ROWS);
-  prog.Block(0)->Var("test_c")->SetType(VarDesc::SELECTED_ROWS);
-  prog.Block(0)->Var("test_out");
+  prog.MutableBlock(0)->Var("test_a")->SetType(VarDesc::SELECTED_ROWS);
+  prog.MutableBlock(0)->Var("test_b")->SetType(VarDesc::SELECTED_ROWS);
+  prog.MutableBlock(0)->Var("test_c")->SetType(VarDesc::SELECTED_ROWS);
+  prog.MutableBlock(0)->Var("test_out");

-  op->InferVarType(prog.Block(0));
+  op->InferVarType(prog.MutableBlock(0));

-  ASSERT_EQ(VarDesc::SELECTED_ROWS, prog.Block(0)->Var("test_out")->GetType());
+  ASSERT_EQ(VarDesc::SELECTED_ROWS,
+            prog.MutableBlock(0)->Var("test_out")->GetType());

-  prog.Block(0)->Var("test_b")->SetType(VarDesc::LOD_TENSOR);
-  op->InferVarType(prog.Block(0));
-  ASSERT_EQ(VarDesc::LOD_TENSOR, prog.Block(0)->Var("test_out")->GetType());
+  prog.MutableBlock(0)->Var("test_b")->SetType(VarDesc::LOD_TENSOR);
+  op->InferVarType(prog.MutableBlock(0));
+  ASSERT_EQ(VarDesc::LOD_TENSOR,
+            prog.MutableBlock(0)->Var("test_out")->GetType());
 }

 TEST(InferVarType, sum_op_without_infer_var_type) {
  ProgramDescBind prog;
-  auto *op = prog.Block(0)->AppendOp();
+  auto *op = prog.MutableBlock(0)->AppendOp();
  op->SetType("sum_without_infer_var_type");
  op->SetInput("X", {"test2_a", "test2_b", "test2_c"});
  op->SetOutput("Out", {"test2_out"});

-  prog.Block(0)->Var("test2_a")->SetType(VarDesc::SELECTED_ROWS);
-  prog.Block(0)->Var("test2_b")->SetType(VarDesc::SELECTED_ROWS);
-  prog.Block(0)->Var("test2_c")->SetType(VarDesc::SELECTED_ROWS);
-  prog.Block(0)->Var("test2_out");
+  prog.MutableBlock(0)->Var("test2_a")->SetType(VarDesc::SELECTED_ROWS);
+  prog.MutableBlock(0)->Var("test2_b")->SetType(VarDesc::SELECTED_ROWS);
+  prog.MutableBlock(0)->Var("test2_c")->SetType(VarDesc::SELECTED_ROWS);
+  prog.MutableBlock(0)->Var("test2_out");

-  op->InferVarType(prog.Block(0));
+  op->InferVarType(prog.MutableBlock(0));

  ASSERT_EQ(VarDesc_VarType_LOD_TENSOR,
-            prog.Block(0)->Var("test2_out")->GetType());
+            prog.MutableBlock(0)->Var("test2_out")->GetType());
 }

 }  // namespace framework

--- a/paddle/gserver/evaluators/Evaluator.cpp
+++ b/paddle/gserver/evaluators/Evaluator.cpp
@@ -395,14 +395,24 @@ real AucEvaluator::evalImp(std::vector<Argument>& arguments) {
  CHECK_LE(arguments.size(), (size_t)3);
  MatrixPtr output = arguments[0].value;
  IVectorPtr label = arguments[1].ids;
+  MatrixPtr labelval = arguments[1].value;
  bool supportWeight = (3 == arguments.size()) ? true : false;
  MatrixPtr weight = supportWeight ? arguments[2].value : nullptr;
-  if (nullptr == output || nullptr == label ||
-      (supportWeight && nullptr == weight)) {
+
+  if (nullptr == output || (supportWeight && nullptr == weight)) {
    return 0;
  }
  size_t insNum = output->getHeight();
  size_t outputDim = output->getWidth();
+  // Copy label from value to a vector.
+  if (nullptr == label && nullptr != labelval) {
+    // label width is 1
+    CHECK_EQ(1U, labelval->getWidth());
+    VectorPtr vec =
+        Vector::create(labelval->getData(), insNum, output->useGpu());
+    label = vec->castToInt();
+  }
+
  CHECK_EQ(insNum, label->getSize());
  if (supportWeight) {
    CHECK_EQ(insNum, weight->getHeight());
@@ -443,6 +453,7 @@ real AucEvaluator::evalImp(std::vector<Argument>& arguments) {
  int* labelD = label->getData();
  real* weightD = supportWeight ? weight->getData() : nullptr;
  size_t pos = realColumnIdx_;
+
  for (size_t i = 0; i < insNum; ++i) {
    real value = outputD[pos];
    uint32_t binIdx = static_cast<uint32_t>(value * kBinNum_);

--- a/paddle/gserver/layers/CRFLayer.cpp
+++ b/paddle/gserver/layers/CRFLayer.cpp
@@ -101,8 +101,10 @@ void CRFLayer::backward(const UpdateCallback& callback) {
                              : real(1.0f);
    instanceWeight *= coeff_;

-    MatrixPtr grad = output.grad->subRowMatrix(starts[i], starts[i + 1]);
-    grad->add(*crfs_[i].getXGrad(), real(1.0f), instanceWeight);
+    if (output.grad) {
+      MatrixPtr grad = output.grad->subRowMatrix(starts[i], starts[i + 1]);
+      grad->add(*crfs_[i].getXGrad(), real(1.0f), instanceWeight);
+    }
    if (needWGrad) {
      weight_->getWGrad()->add(
          *crfs_[i].getWGrad(), real(1.0f), instanceWeight);

--- a/paddle/gserver/layers/LinearChainCRF.cpp
+++ b/paddle/gserver/layers/LinearChainCRF.cpp
@@ -102,7 +102,6 @@ real LinearChainCRF::forward(real* x, int* s, int length) {
 }

 void LinearChainCRF::backward(real* x, int* s, int length, bool needWGrad) {
-  MatrixPtr matX = Matrix::create(x, length, numClasses_);
  Matrix::resizeOrCreate(matGrad_, length, numClasses_);
  Matrix::resizeOrCreate(beta_, length, numClasses_);
  real* b = b_->getData();

--- a/paddle/gserver/layers/SequenceReshapeLayer.cpp
+++ b/paddle/gserver/layers/SequenceReshapeLayer.cpp
@@ -70,11 +70,23 @@ void SequenceReshapeLayer::forward(PassType passType) {
  size_t outDim = getSize();

  size_t numSequences = input.getNumSequences();
-  auto startPositions = input.sequenceStartPositions->getVector(false);
-  const int* starts = startPositions->getData();

-  CHECK_EQ(starts[numSequences], input.getBatchSize());
-  CHECK_EQ(numSequences, startPositions->getSize() - 1);
+  // by default, we assume each instance as a sequence
+  IVectorPtr seqStarts;
+  IVector::resizeOrCreate(seqStarts, input.getBatchSize() + 1, false);
+  int* startsData = seqStarts->getData();
+  for (int i = 0; i < input.getBatchSize() + 1; i++) {
+    startsData[i] = i;
+  }
+  const int* starts = startsData;
+
+  // if there is sequence, then use start positions
+  if (input.sequenceStartPositions) {
+    auto startPositions = input.sequenceStartPositions->getVector(false);
+    starts = startPositions->getData();
+    CHECK_EQ(starts[numSequences], input.getBatchSize());
+    CHECK_EQ(numSequences, startPositions->getSize() - 1);
+  }

  for (size_t seqID = 0; seqID < numSequences; seqID++) {
    size_t inNumIns = starts[seqID + 1] - starts[seqID];

--- a/paddle/gserver/tests/MKLDNNTester.cpp
+++ b/paddle/gserver/tests/MKLDNNTester.cpp
@@ -273,31 +273,37 @@ void MKLDNNTester::printVector(const VectorPtr& v) {
  VLOG(MKLDNN_ALL) << std::endl << ostr.str();
 }

-double MKLDNNTester::getDelta(const real* d1,
-                              const real* d2,
+double MKLDNNTester::getDelta(const real* refer,
+                              const real* value,
                              size_t len,
                              const float failRate,
                              const float thres) {
  double delta = 0, sum = 0;
  int failCnt = 0;
  const double eps = 1e-5;
-  double maxOut = 0;
+  double maxRatio = 0;
  for (size_t i = 0; i < len; ++i) {
-    double ref = fabs(d2[i]);
-    double diff = fabs(d1[i] - d2[i]);
+    double ref = fabs(refer[i]);
+    double val = fabs(value[i]);
+    double diff = fabs(refer[i] - value[i]);
    delta += diff;
    sum += ref;
-    if (ref > eps && fabs(d1[i]) > eps && diff / ref > thres) {
-      maxOut = std::max(maxOut, diff / ref);
+    if (ref < eps && val < eps) {  // both values are very small
+      continue;
+    }
+    double ratio = diff / ref;
+    if (ratio > thres) {
+      maxRatio = std::max(maxRatio, ratio);
      failCnt++;
    }
  }
-  EXPECT_TRUE(std::isnormal(sum));
  EXPECT_FALSE(std::isinf(sum));
+  EXPECT_FALSE(std::isnan(sum));
  EXPECT_FALSE(std::isnan(delta));
  VLOG(MKLDNN_ALL) << "reference avg data: " << sum / len
                   << ", delta: " << delta / sum << ", failCnt:" << failCnt;
-  return (failCnt / (float)len) > failRate ? maxOut : delta / sum;
+  double res = sum > eps ? delta / sum : eps;
+  return (failCnt / (float)len) > failRate ? maxRatio : res;
 }

 double MKLDNNTester::compareMatrix(const MatrixPtr& m1, const MatrixPtr& m2) {
@@ -515,12 +521,16 @@ void MKLDNNTester::getOutResult(const std::string& configPath,
    gradientMachine->forward(in.inArgs[i], &outArgs, PASS_TRAIN);
    // save forward result
    for (size_t k = 0; k < outArgs.size(); k++) {
-      MatrixPtr value = Matrix::create(outArgs[k].value->getHeight(),
-                                       outArgs[k].value->getWidth(),
-                                       false,
-                                       false);
-      value->copyFrom(*outArgs[k].value);
-      out.outValues.push_back(value);
+      const MatrixPtr& src = outArgs[k].value;
+      MatrixPtr dst =
+          Matrix::create(src->getHeight(), src->getWidth(), false, false);
+      if (typeid(*src) == typeid(MKLDNNMatrix)) {
+        MKLDNNMatrixPtr dnnSrc = std::dynamic_pointer_cast<MKLDNNMatrix>(src);
+        dnnSrc->copyTo(*dst);
+      } else {
+        dst->copyFrom(*src);
+      }
+      out.outValues.push_back(dst);
    }

    // random backward input
@@ -543,19 +553,19 @@ void MKLDNNTester::getOutResult(const std::string& configPath,
 void MKLDNNTester::compareResult(DataOut& ref, DataOut& dnn, float eps) {
  CHECK_EQ(ref.outValues.size(), dnn.outValues.size());
  CHECK_EQ(ref.paraValues.size(), dnn.paraValues.size());
-  VLOG(MKLDNN_TESTS) << "compare value size: " << ref.outValues.size();
  for (size_t i = 0; i < ref.outValues.size(); i++) {
+    VLOG(MKLDNN_TESTS) << "compare value index: " << i;
    EXPECT_LE(fabs(compareMatrix(ref.outValues[i], dnn.outValues[i])), eps);
  }
-  VLOG(MKLDNN_TESTS) << "compare param size: " << ref.outValues.size();
  for (size_t i = 0; i < ref.paraValues.size(); i++) {
+    VLOG(MKLDNN_TESTS) << "compare param index: " << i;
    EXPECT_LE(fabs(compareVector(ref.paraValues[i], dnn.paraValues[i])), eps);
  }
 }

-void MKLDNNTester::runBranchesTest(const std::string& configPath,
-                                   size_t iter,
-                                   float eps) {
+void MKLDNNTester::runNetTest(const std::string& configPath,
+                              size_t iter,
+                              float eps) {
  DataIn in;
  initArgument(in, configPath, iter);
  DataOut outCpu, outDnn;

--- a/paddle/gserver/tests/MKLDNNTester.h
+++ b/paddle/gserver/tests/MKLDNNTester.h
@@ -85,17 +85,17 @@ public:
           bool printDetails = false,
           size_t iter = 3,
           float epsilon = 1e-4);
-  static void runBranchesTest(const std::string& configPath,
-                              size_t iter = 3,
-                              float eps = 1e-4);
+  static void runNetTest(const std::string& configPath,
+                         size_t iter = 2,
+                         float eps = 1e-4);
  static void initArgument(DataIn& data,
                           const std::string& configPath,
-                           size_t iter = 3);
+                           size_t iter = 2);
  static void getOutResult(const std::string& configPath,
                           DataIn& in,
                           DataOut& out,
                           bool use_mkldnn,
-                           size_t iter = 3);
+                           size_t iter = 2);

 private:
  void reset(const TestConfig& dnn, const TestConfig& ref, size_t batchSize);
@@ -128,13 +128,13 @@ private:

  /**
   * Get delta percent
-   * if many(>failRate) wrong(abs(dnn-ref)/abs(ref)>thres) points return the
-   * max(diff/ref)
-   * else return sum(abs(a-b)) / sum(abs(b))
+   * if many(>failRate) wrong(abs(val-ref)/abs(ref) > thres) points
+   * return the max(diff/ref)
+   * else return sum(abs(diff)) / sum(abs(ref))
   * The return value should be smaller than eps when passing.
   */
-  static double getDelta(const real* d1,
-                         const real* d2,
+  static double getDelta(const real* refer,
+                         const real* value,
                         size_t len,
                         const float failRate = 1e-3,
                         const float thres = 0.1);

--- a/paddle/trainer/tests/sample_trainer_config_branch_net.conf
+++ b/paddle/trainer/tests/sample_trainer_config_branch_net.conf
@@ -14,36 +14,82 @@

 from paddle.trainer_config_helpers import *

-################################### Data Configuration ###################################
-TrainData(ProtoData(files = "trainer/tests/mnist.list"))
-################################### Algorithm Configuration ###################################
-settings(batch_size = 128,
-         learning_method = MomentumOptimizer(momentum=0.5, sparse=False))
-################################### Network Configuration ###################################
-data = data_layer(name ="input", size=784)
+settings(batch_size=16)
+channels = get_config_arg("channels", int, 2)
+
+def two_conv(input, group_name):
+  out1 = img_conv_layer(input=input,
+              name=group_name+'_conv1_',
+              filter_size=1,
+              num_filters=channels,
+              padding=0,
+              shared_biases=True,
+              act=ReluActivation())
+
+  out2 = img_conv_layer(input=input,
+              name=group_name+'_conv2_',
+              filter_size=3,
+              num_filters=channels,
+              padding=1,
+              shared_biases=True,
+              act=ReluActivation())
+  return out1, out2
+
+def two_conv_bn(input, group_name):
+  out1, out2 = two_conv(input, group_name)
+  out1 = batch_norm_layer(input=out1,
+              name=group_name+'_bn1_',
+              use_global_stats=False,
+              act=ReluActivation())
+
+  out2 = batch_norm_layer(input=out2,
+              name=group_name+'_bn2_',
+              use_global_stats=False,
+              act=ReluActivation())
+  return out1, out2
+
+def two_conv_pool(input, group_name):
+  out1, out2 = two_conv(input, group_name)
+  out1 = img_pool_layer(input=out1,
+              name=group_name+'_pool1_',
+              pool_size=3,
+              stride=2,
+              padding=0,
+              pool_type=MaxPooling())
+
+  out2 = img_pool_layer(input=out2,
+              name=group_name+'_pool2_',
+              pool_size=5,
+              stride=2,
+              padding=1,
+              pool_type=MaxPooling())
+  return out1, out2
+
+def two_fc(input, group_name):
+  out1 = fc_layer(input=input,
+            name=group_name+'_fc1_',
+            size=channels,
+            bias_attr=False,
+            act=LinearActivation())

-tmp = img_conv_layer(input=data,
-            num_channels=1,
-            filter_size=3,
-            num_filters=32,
-            padding=1,
-            shared_biases=True,
-            act=ReluActivation())
+  out2 = fc_layer(input=input,
+            name=group_name+'_fc2_',
+            size=channels,
+            bias_attr=False,
+            act=LinearActivation())
+  return out1, out2

-a1 = img_conv_layer(input=tmp,
-            filter_size=1,
-            num_filters=32,
-            padding=0,
-            shared_biases=True,
-            act=ReluActivation())
+data = data_layer(name ="input", size=channels*16*16)

-a2 = img_conv_layer(input=tmp,
+tmp = img_conv_layer(input=data,
+            num_channels=channels,
            filter_size=3,
-            num_filters=32,
+            num_filters=channels,
            padding=1,
            shared_biases=True,
            act=ReluActivation())

+a1, a2 = two_conv(tmp, 'conv_branch')
 tmp = addto_layer(input=[a1, a2],
            act=ReluActivation(),
            bias_attr=False)
@@ -54,36 +100,11 @@ tmp = img_pool_layer(input=tmp,
            padding=1,
            pool_type=AvgPooling())

-b1 = img_conv_layer(input=tmp,
-            filter_size=3,
-            num_filters=32,
-            padding=1,
-            shared_biases=True,
-            act=ReluActivation())
-
-b1 = img_pool_layer(input=b1,
-            pool_size=3,
-            stride=2,
-            padding=0,
-            pool_type=MaxPooling())
-
-b2 = img_conv_layer(input=tmp,
-            filter_size=3,
-            num_filters=64,
-            padding=1,
-            shared_biases=True,
-            act=ReluActivation())
-
-b2 = img_pool_layer(input=b2,
-            pool_size=5,
-            stride=2,
-            padding=1,
-            pool_type=MaxPooling())
-
+b1, b2 = two_conv_pool(tmp, 'pool_branch')
 tmp = concat_layer(input=[b1, b2])

 tmp = img_pool_layer(input=tmp,
-            num_channels=96,
+            num_channels=channels*2,
            pool_size=3,
            stride=2,
            padding=1,
@@ -91,8 +112,9 @@ tmp = img_pool_layer(input=tmp,

 tmp = img_conv_layer(input=tmp,
            filter_size=3,
-            num_filters=32,
+            num_filters=channels,
            padding=1,
+            stride=2,
            shared_biases=True,
            act=LinearActivation(),
            bias_attr=False)
@@ -101,33 +123,20 @@ tmp = batch_norm_layer(input=tmp,
            use_global_stats=False,
            act=ReluActivation())

-c1 = img_conv_layer(input=tmp,
-            filter_size=1,
-            num_filters=32,
-            padding=0,
-            shared_biases=True,
-            act=ReluActivation())
-
-c2 = img_conv_layer(input=tmp,
-            filter_size=3,
-            num_filters=32,
-            padding=1,
-            shared_biases=True,
-            act=ReluActivation())
-
+c1, c2 = two_conv_bn(tmp, 'bn_branch')
 tmp = addto_layer(input=[c1, c2],
            act=ReluActivation(),
            bias_attr=False)

-tmp = fc_layer(input=tmp, size=64,
-            bias_attr=False,
-            act=TanhActivation())
+tmp = fc_layer(input=tmp, size=channels,
+            bias_attr=True,
+            act=ReluActivation())

-output = fc_layer(input=tmp, size=10,
+d1, d2 = two_fc(tmp, 'fc_branch')
+tmp = addto_layer(input=[d1, d2])
+
+out = fc_layer(input=tmp, size=10,
            bias_attr=True,
            act=SoftmaxActivation())

-lbl = data_layer(name ="label", size=10)
-
-cost = classification_cost(input=output, label=lbl)
-outputs(cost)
+outputs(out)
--- a/paddle/gserver/tests/mkldnn_branches_fc.conf
+++ b/paddle/gserver/tests/mkldnn_branches_fc.conf
-# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from paddle.trainer_config_helpers import *
-
-settings(batch_size=16)
-channels = get_config_arg("channels", int, 2)
-
-def two_fc(input, group_name):
-  out1 = fc_layer(input=input,
-            name=group_name+'_fc1',
-            size=channels,
-            bias_attr=False,
-            act=LinearActivation())
-
-  out2 = fc_layer(input=input,
-            name=group_name+'_fc2',
-            size=channels,
-            bias_attr=False,
-            act=LinearActivation())
-  return out1, out2
-
-data = data_layer(name ="input", size=channels*16*16)
-
-conv = img_conv_layer(input=data,
-            num_channels=channels,
-            filter_size=3,
-            num_filters=channels,
-            padding=1,
-            shared_biases=True,
-            act=LinearActivation())
-
-pool = img_pool_layer(input=conv,
-            pool_size=3,
-            stride=2,
-            padding=1,
-            pool_type=AvgPooling())
-
-a1, a2 = two_fc(input=pool, group_name='a')
-
-concat = concat_layer(input=[a1, a2])
-
-b1, b2 = two_fc(input=pool, group_name='b')
-
-addto = addto_layer(input=[b1, b2])
-
-outputs([concat, addto])
--- a/paddle/gserver/tests/mkldnn_branches_pool.conf
+++ b/paddle/gserver/tests/mkldnn_branches_pool.conf
-# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from paddle.trainer_config_helpers import *
-
-settings(batch_size=16)
-channels = get_config_arg("channels", int, 2)
-
-def two_pool(input, group_name):
-  out1 = img_pool_layer(input=input,
-            name=group_name+'_pool1',
-            pool_size=3,
-            stride=2,
-            padding=0,
-            pool_type=MaxPooling())
-
-  out2 = img_pool_layer(input=input,
-            name=group_name+'_pool2',
-            pool_size=5,
-            stride=2,
-            padding=1,
-            pool_type=MaxPooling())
-  return out1, out2
-
-data = data_layer(name ="input", size=channels*16*16)
-
-conv = img_conv_layer(input=data,
-            num_channels=channels,
-            filter_size=3,
-            num_filters=channels,
-            padding=1,
-            shared_biases=True,
-            act=LinearActivation())
-
-pool = img_pool_layer(input=conv,
-            pool_size=3,
-            stride=1,
-            padding=1,
-            pool_type=AvgPooling())
-
-a1, a2 = two_pool(input=pool, group_name='a')
-
-concat = concat_layer(input=[a1, a2])
-
-b1, b2 = two_pool(input=pool, group_name='b')
-
-addto = addto_layer(input=[b1, b2])
-
-outputs([concat, addto])
--- a/paddle/gserver/tests/mkldnn_branches_conv.conf
+++ b/paddle/gserver/tests/mkldnn_branches_conv.conf
@@ -17,40 +17,48 @@ from paddle.trainer_config_helpers import *
 settings(batch_size=16)
 channels = get_config_arg("channels", int, 2)

-def two_conv(input, group_name):
-  out1 = img_conv_layer(input=input,
-            name=group_name+'_conv1',
-            filter_size=1,
-            num_filters=channels,
-            padding=0,
-            shared_biases=True,
-            act=ReluActivation())
+data = data_layer(name ="input", size=channels*16*16)

-  out2 = img_conv_layer(input=input,
-            name=group_name+'_conv2',
+tmp = img_conv_layer(input=data,
+            num_channels=channels,
            filter_size=3,
            num_filters=channels,
            padding=1,
            shared_biases=True,
            act=ReluActivation())
-  return out1, out2

-data = data_layer(name ="input", size=channels*16*16)
+tmp = img_pool_layer(input=tmp,
+            pool_size=3,
+            stride=1,
+            padding=0,
+            pool_type=AvgPooling())

-conv = img_conv_layer(input=data,
-            num_channels=channels,
+tmp = img_conv_layer(input=tmp,
            filter_size=3,
            num_filters=channels,
            padding=1,
            shared_biases=True,
-            act=ReluActivation())
+            act=LinearActivation(),
+            bias_attr=False)

-a1, a2 = two_conv(input=conv, group_name='a')
+tmp = batch_norm_layer(input=tmp,
+            use_global_stats=False,
+            act=ReluActivation())

-concat = concat_layer(input=[a1, a2])
+tmp = img_pool_layer(input=tmp,
+            pool_size=3,
+            stride=2,
+            padding=1,
+            pool_type=MaxPooling())

-b1, b2 = two_conv(input=conv, group_name='b')
+tmp = fc_layer(input=tmp,
+            size=channels,
+            bias_attr=False,
+            act=ReluActivation())

-addto = addto_layer(input=[b1, b2])
+out = fc_layer(input=tmp,
+            size=10,
+            bias_attr=True,
+            act=SoftmaxActivation())

-outputs([concat, addto])
+outputs(out)
--- a/paddle/gserver/tests/test_MKLDNN.cpp
+++ b/paddle/gserver/tests/test_MKLDNN.cpp
@@ -234,8 +234,7 @@ static void getMKLDNNBatchNormConfig(TestConfig& cfg,
  cfg.inputDefs.push_back({INPUT_DATA, "layer_2_moving_var", 1, size_t(pm.ic)});
  cfg.inputDefs.back().isStatic = true;
  LayerInputConfig* input = cfg.layerConfig.add_inputs();
-  // TODO(TJ): uncomment me when refine and support comparing all zeroes vector
-  // cfg.layerConfig.set_active_type("relu");
+  cfg.layerConfig.set_active_type("relu");
  cfg.layerConfig.add_inputs();
  cfg.layerConfig.add_inputs();
  ImageConfig* img_conf = input->mutable_image_conf();
@@ -309,15 +308,15 @@ TEST(MKLDNNActivation, Activations) {
 }

 DECLARE_string(config_args);
-TEST(MKLDNNLayer, branches) {
-  std::vector<std::string> cases = {"conv", "pool", "fc"};
+TEST(MKLDNNNet, net) {
+  std::vector<std::string> cases = {"simple", "branch"};
  for (auto name : cases) {
-    std::string config = "./gserver/tests/mkldnn_branches_" + name + ".conf";
+    std::string config = "./gserver/tests/mkldnn_" + name + "_net.conf";
    for (auto channels : {2, 32}) {
      std::ostringstream oss;
      oss << "channels=" << channels;
      FLAGS_config_args = oss.str();
-      MKLDNNTester::runBranchesTest(config);
+      MKLDNNTester::runNetTest(config);
    }
  }
 }

--- a/paddle/math/MKLDNNMatrix.h
+++ b/paddle/math/MKLDNNMatrix.h
@@ -102,6 +102,11 @@ public:
    m_->copyFrom(src);
  }

+  void copyTo(Matrix& dst) {
+    // TODO(TJ): reorder data if this format is not nchw or x
+    dst.copyFrom(*m_);
+  }
+
 public:
  /**
   * Reorder this MKLDNNMatrix from other format.

--- a/paddle/math/Vector.cpp
+++ b/paddle/math/Vector.cpp
@@ -18,6 +18,7 @@ limitations under the License. */
 #include <memory>
 #include "Matrix.h"
 #include "hl_gpu.h"
+#include "hl_matrix.h"
 #include "hl_table_apply.h"
 #include "paddle/utils/Flags.h"
 #include "paddle/utils/Logging.h"
@@ -99,6 +100,19 @@ MatrixPtr VectorT<int>::toOneHotSparseMatrix(size_t idRange, bool useGpu) {
  return mat;
 }

+template <>
+std::shared_ptr<VectorT<int>> VectorT<real>::castToInt() {
+  std::shared_ptr<VectorT<int>> ret = IVector::create(this->getSize(), useGpu_);
+  if (useGpu_) {
+    hl_vector_cast2int(ret->getData(), this->getData(), this->getSize());
+  } else {
+    for (size_t i = 0; i < getSize(); ++i) {
+      ret->getData()[i] = int(this->getData()[i]);
+    }
+  }
+  return ret;
+}
+
 template <class T>
 GpuVectorT<T>::GpuVectorT(size_t size)
    : VectorT<T>(size,

--- a/paddle/math/Vector.h
+++ b/paddle/math/Vector.h
@@ -162,6 +162,13 @@ public:
   */
  std::shared_ptr<Matrix> toOneHotSparseMatrix(size_t idRange, bool useGpu);

+  /**
+   * @brief cast vector of "real" elements to "int" elements.
+   *
+   * @note: float -> int must be casted, or you'll get wrong data.
+   */
+  std::shared_ptr<VectorT<int>> castToInt();
+
  /**
   * This function will crash if the size of src and dest is different.
   */

--- a/paddle/memory/detail/buddy_allocator.cc
+++ b/paddle/memory/detail/buddy_allocator.cc
@@ -27,11 +27,11 @@ BuddyAllocator::BuddyAllocator(SystemAllocator* system_allocator,
      system_allocator_(std::move(system_allocator)) {}

 BuddyAllocator::~BuddyAllocator() {
-  VLOG(3) << "BuddyAllocator Disconstructor makes sure that all of these "
-             "have actually been freed";
+  VLOG(10) << "BuddyAllocator Disconstructor makes sure that all of these "
+              "have actually been freed";
  while (!pool_.empty()) {
    auto block = static_cast<MemoryBlock*>(std::get<2>(*pool_.begin()));
-    VLOG(3) << "Free from block (" << block << ", " << max_chunk_size_ << ")";
+    VLOG(10) << "Free from block (" << block << ", " << max_chunk_size_ << ")";

    system_allocator_->Free(block, max_chunk_size_, block->index(cache_));
    cache_.invalidate(block);
@@ -51,11 +51,12 @@ void* BuddyAllocator::Alloc(size_t unaligned_size) {
  // acquire the allocator lock
  std::lock_guard<std::mutex> lock(mutex_);

-  VLOG(3) << "Allocate " << unaligned_size << " bytes from chunk size " << size;
+  VLOG(10) << "Allocate " << unaligned_size << " bytes from chunk size "
+           << size;

  // if the allocation is huge, send directly to the system allocator
  if (size > max_chunk_size_) {
-    VLOG(3) << "Allocate from system allocator.";
+    VLOG(10) << "Allocate from system allocator.";
    return SystemAlloc(size);
  }

@@ -70,9 +71,9 @@ void* BuddyAllocator::Alloc(size_t unaligned_size) {
      return nullptr;
    }
  } else {
-    VLOG(3) << "Allocation from existing memory block " << std::get<2>(*it)
-            << " at address "
-            << reinterpret_cast<MemoryBlock*>(std::get<2>(*it))->data();
+    VLOG(10) << "Allocation from existing memory block " << std::get<2>(*it)
+             << " at address "
+             << reinterpret_cast<MemoryBlock*>(std::get<2>(*it))->data();
  }

  total_used_ += size;
@@ -89,10 +90,10 @@ void BuddyAllocator::Free(void* p) {
  // Acquire the allocator lock
  std::lock_guard<std::mutex> lock(mutex_);

-  VLOG(3) << "Free from address " << block;
+  VLOG(10) << "Free from address " << block;

  if (block->type(cache_) == MemoryBlock::HUGE_CHUNK) {
-    VLOG(3) << "Free directly from system allocator";
+    VLOG(10) << "Free directly from system allocator";
    system_allocator_->Free(block, block->total_size(cache_),
                            block->index(cache_));

@@ -109,8 +110,8 @@ void BuddyAllocator::Free(void* p) {

  // Trying to merge the right buddy
  if (block->has_right_buddy(cache_)) {
-    VLOG(3) << "Merging this block " << block << " with its right buddy "
-            << block->right_buddy(cache_);
+    VLOG(10) << "Merging this block " << block << " with its right buddy "
+             << block->right_buddy(cache_);

    auto right_buddy = block->right_buddy(cache_);

@@ -127,8 +128,8 @@ void BuddyAllocator::Free(void* p) {

  // Trying to merge the left buddy
  if (block->has_left_buddy(cache_)) {
-    VLOG(3) << "Merging this block " << block << " with its left buddy "
-            << block->left_buddy(cache_);
+    VLOG(10) << "Merging this block " << block << " with its left buddy "
+             << block->left_buddy(cache_);

    auto left_buddy = block->left_buddy(cache_);

@@ -144,8 +145,8 @@ void BuddyAllocator::Free(void* p) {
  }

  // Dumping this block into pool
-  VLOG(3) << "Inserting free block (" << block << ", "
-          << block->total_size(cache_) << ")";
+  VLOG(10) << "Inserting free block (" << block << ", "
+           << block->total_size(cache_) << ")";
  pool_.insert(
      IndexSizeAddress(block->index(cache_), block->total_size(cache_), block));

@@ -164,7 +165,7 @@ void* BuddyAllocator::SystemAlloc(size_t size) {
  size_t index = 0;
  void* p = system_allocator_->Alloc(index, size);

-  VLOG(3) << "Allocated " << p << " from system allocator.";
+  VLOG(10) << "Allocated " << p << " from system allocator.";

  if (p == nullptr) return nullptr;

@@ -190,8 +191,8 @@ BuddyAllocator::PoolSet::iterator BuddyAllocator::RefillPool() {

  if (p == nullptr) return pool_.end();

-  VLOG(3) << "Creating and inserting new block " << p
-          << " from system allocator";
+  VLOG(10) << "Creating and inserting new block " << p
+           << " from system allocator";

  static_cast<MemoryBlock*>(p)->init(cache_, MemoryBlock::FREE_CHUNK, index,
                                     max_chunk_size_, nullptr, nullptr);
@@ -235,19 +236,19 @@ void* BuddyAllocator::SplitToAlloc(BuddyAllocator::PoolSet::iterator it,
  auto block = static_cast<MemoryBlock*>(std::get<2>(*it));
  pool_.erase(it);

-  VLOG(3) << "Split block (" << block << ", " << block->total_size(cache_)
-          << ") into";
+  VLOG(10) << "Split block (" << block << ", " << block->total_size(cache_)
+           << ") into";
  block->split(cache_, size);

-  VLOG(3) << "Left block (" << block << ", " << block->total_size(cache_)
-          << ")";
+  VLOG(10) << "Left block (" << block << ", " << block->total_size(cache_)
+           << ")";
  block->set_type(cache_, MemoryBlock::ARENA_CHUNK);

  // the rest of memory if exist
  if (block->has_right_buddy(cache_)) {
    if (block->right_buddy(cache_)->type(cache_) == MemoryBlock::FREE_CHUNK) {
-      VLOG(3) << "Insert right block (" << block->right_buddy(cache_) << ", "
-              << block->right_buddy(cache_)->total_size(cache_) << ")";
+      VLOG(10) << "Insert right block (" << block->right_buddy(cache_) << ", "
+               << block->right_buddy(cache_)->total_size(cache_) << ")";

      pool_.insert(
          IndexSizeAddress(block->right_buddy(cache_)->index(cache_),
@@ -274,7 +275,7 @@ void BuddyAllocator::CleanIdleFallBackAlloc() {
      return;
    }

-    VLOG(3) << "Return block " << block << " to fallback allocator.";
+    VLOG(10) << "Return block " << block << " to fallback allocator.";

    system_allocator_->Free(block, max_chunk_size_, block->index(cache_));
    cache_.invalidate(block);
@@ -310,7 +311,7 @@ void BuddyAllocator::CleanIdleNormalAlloc() {

    MemoryBlock* block = static_cast<MemoryBlock*>(std::get<2>(*pool));

-    VLOG(3) << "Return block " << block << " to base allocator.";
+    VLOG(10) << "Return block " << block << " to base allocator.";

    system_allocator_->Free(block, max_chunk_size_, block->index(cache_));
    cache_.invalidate(block);

--- a/paddle/memory/detail/meta_cache.cc
+++ b/paddle/memory/detail/meta_cache.cc
@@ -30,7 +30,7 @@ Metadata MetadataCache::load(const MemoryBlock* block) {
    return existing_metadata->second;
  } else {
    auto* meta = reinterpret_cast<const Metadata*>(block);
-    VLOG(3) << "Load MetaData type=" << meta->type;
+    VLOG(10) << "Load MetaData type=" << meta->type;
    PADDLE_ASSERT(meta->check_guards());
    return *reinterpret_cast<const Metadata*>(block);
  }

--- a/paddle/memory/detail/system_allocator.cc
+++ b/paddle/memory/detail/system_allocator.cc
@@ -41,7 +41,16 @@ void* CPUAllocator::Alloc(size_t& index, size_t size) {

  index = 0;  // unlock memory

-  void* p = malloc(size);
+  void* p;
+
+#ifdef PADDLE_USE_MKLDNN
+  // refer to https://github.com/01org/mkl-dnn/blob/master/include/mkldnn.hpp
+  // memory alignment
+  PADDLE_ENFORCE_EQ(posix_memalign(&p, 4096ul, size), 0);
+#else
+  PADDLE_ENFORCE_EQ(posix_memalign(&p, 32ul, size), 0);
+#endif
+  PADDLE_ENFORCE(p, "Fail to allocate CPU memory: size = %d .", size);

  if (p != nullptr) {
    if (FLAGS_use_pinned_memory) {

--- a/paddle/memory/memory.cc
+++ b/paddle/memory/memory.cc
@@ -39,15 +39,15 @@ BuddyAllocator* GetCPUBuddyAllocator() {

 template <>
 void* Alloc<platform::CPUPlace>(platform::CPUPlace place, size_t size) {
-  VLOG(3) << "Allocate " << size << " bytes on " << platform::Place(place);
+  VLOG(10) << "Allocate " << size << " bytes on " << platform::Place(place);
  void* p = GetCPUBuddyAllocator()->Alloc(size);
-  VLOG(3) << "  pointer=" << p;
+  VLOG(10) << "  pointer=" << p;
  return p;
 }

 template <>
 void Free<platform::CPUPlace>(platform::CPUPlace place, void* p) {
-  VLOG(3) << "Free pointer=" << p << " on " << platform::Place(place);
+  VLOG(10) << "Free pointer=" << p << " on " << platform::Place(place);
  GetCPUBuddyAllocator()->Free(p);
 }

@@ -69,11 +69,12 @@ BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) {
                                   platform::GpuMinChunkSize(),
                                   platform::GpuMaxChunkSize());
    }
-    VLOG(3) << "\n\nNOTE: each GPU device use "
-            << FLAGS_fraction_of_gpu_memory_to_use * 100 << "% of GPU memory.\n"
-            << "You can set environment variable '"
-            << platform::kEnvFractionGpuMemoryToUse
-            << "' to change the fraction of GPU usage.\n\n";
+    VLOG(10) << "\n\nNOTE: each GPU device use "
+             << FLAGS_fraction_of_gpu_memory_to_use * 100
+             << "% of GPU memory.\n"
+             << "You can set environment variable '"
+             << platform::kEnvFractionGpuMemoryToUse
+             << "' to change the fraction of GPU usage.\n\n";
  }
  platform::SetDeviceId(gpu_id);
  return as[gpu_id];

--- a/paddle/operators/CMakeLists.txt
+++ b/paddle/operators/CMakeLists.txt
@@ -131,9 +131,10 @@ add_subdirectory(math)
 add_subdirectory(nccl)

 set(DEPS_OPS
-    recurrent_op
    cond_op
    cross_entropy_op
+    recurrent_op
+    dynamic_recurrent_op
    softmax_with_cross_entropy_op
    sum_op
    pool_op
@@ -142,9 +143,6 @@ set(DEPS_OPS
    sequence_conv_op
    lstm_op)

-
-op_library(recurrent_op SRCS recurrent_op.cc rnn/recurrent_op_utils.cc
-  DEPS framework_proto tensor net_op)
 op_library(cond_op SRCS cond_op.cc DEPS framework_proto tensor operator net_op)
 op_library(cross_entropy_op DEPS cross_entropy)
 op_library(softmax_with_cross_entropy_op DEPS cross_entropy softmax)
@@ -156,7 +154,9 @@ op_library(nccl_op DEPS nccl_common)
 endif()
 op_library(sequence_conv_op DEPS context_project)
 op_library(lstm_op DEPS sequence2batch lstm_compute)
-
+op_library(dynamic_recurrent_op SRCS dynamic_recurrent_op.cc rnn/recurrent_op_utils.cc
+        DEPS net_op tensor_array)
+op_library(recurrent_op SRCS recurrent_op.cc DEPS executor)
 list(REMOVE_ITEM GENERAL_OPS ${DEPS_OPS})
 foreach(src ${GENERAL_OPS})
    op_library(${src})
@@ -168,8 +168,9 @@ cc_test(gather_test SRCS gather_test.cc DEPS tensor)
 cc_test(net_op_test SRCS net_op_test.cc DEPS net_op)
 cc_test(scatter_test SRCS scatter_test.cc DEPS tensor)
 cc_test(strided_memcpy_test SRCS strided_memcpy_test.cc DEPS tensor paddle_memory)
-cc_test(dynamic_recurrent_op_test SRCS dynamic_recurrent_op_test.cc DEPS dynamic_recurrent_op recurrent_op tensor_array)
-
+cc_test(dynamic_recurrent_op_test SRCS dynamic_recurrent_op_test.cc
+        rnn/recurrent_op_utils.cc
+        DEPS dynamic_recurrent_op)
 if(WITH_GPU)
  nv_test(nccl_op_test SRCS nccl_op_test.cu DEPS nccl_op gpu_info device_context)
 endif()

--- a/paddle/operators/activation_op.cc
+++ b/paddle/operators/activation_op.cc
@@ -43,7 +43,12 @@ class SigmoidOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Sigmoid operator");
    AddOutput("Y", "Output of Sigmoid operator");
-    AddComment("Sigmoid activation operator, sigmoid = 1 / (1 + exp(-x))");
+    AddComment(R"DOC(
+Sigmoid activation operator.
+
+$y = 1 / (1 + e^{-x})$
+
+)DOC");
  }
 };

@@ -54,8 +59,12 @@ class LogSigmoidOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of LogSigmoid operator");
    AddOutput("Y", "Output of LogSigmoid operator");
-    AddComment(
-        "Logsigmoid activation operator, logsigmoid = log (1 / (1 + exp(-x)))");
+    AddComment(R"DOC(
+Logsigmoid activation operator.
+
+$y = \log(1 / (1 + e^{-x}))$
+
+)DOC");
  }
 };

@@ -65,7 +74,12 @@ class ExpOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Exp operator");
    AddOutput("Y", "Output of Exp operator");
-    AddComment("Exp activation operator, exp(x) = e^x");
+    AddComment(R"DOC(
+Exp activation operator.
+
+$y = e^x$
+
+)DOC");
  }
 };

@@ -75,7 +89,12 @@ class ReluOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Relu operator");
    AddOutput("Y", "Output of Relu operator");
-    AddComment("Relu activation operator, relu(x) = max(x, 0)");
+    AddComment(R"DOC(
+Relu activation operator.
+
+$y = \max(x, 0)$
+
+)DOC");
  }
 };

@@ -87,11 +106,14 @@ class LeakyReluOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of LeakyRelu operator");
    AddOutput("Y", "Output of LeakyRelu operator");
-    AddComment(
-        "LeakyRelu activation operator, "
-        "leaky_relu = max(x, alpha * x)");
    AddAttr<AttrType>("alpha", "The small negative slope")
        .SetDefault(static_cast<AttrType>(0.02f));
+    AddComment(R"DOC(
+LeakyRelu activation operator.
+
+$y = \max(x, \alpha * x)$
+
+)DOC");
  }
 };

@@ -103,12 +125,20 @@ class SoftShrinkOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Softshrink operator");
    AddOutput("Y", "Output of Softshrink operator");
-    AddComment(
-        "Softshrink activation operator, "
-        "softshrink = x - lambda, if x > lambda;"
-        " x + lambda, if x < lambda; 0 otherwise");
    AddAttr<AttrType>("lambda", "non-negative offset")
        .SetDefault(static_cast<AttrType>(0.5f));
+    AddComment(R"DOC(
+Softshrink activation operator.
+
+$$
+y = \begin{cases} 
+    x - \lambda, \text{if } x > \lambda \\
+    x + \lambda, \text{if } x < -\lambda \\
+    0,  \text{otherwise}
+    \end{cases}
+$$
+
+)DOC");
  }
 };

@@ -118,9 +148,12 @@ class TanhOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Tanh operator");
    AddOutput("Y", "Output of Tanh operator");
-    AddComment(
-        "Tanh activation operator, tanh = (exp(x) - exp(-x)) / (exp(x) + "
-        "exp(-x))");
+    AddComment(R"DOC(
+Tanh activation operator.
+
+$$y = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
+
+)DOC");
  }
 };

@@ -131,7 +164,12 @@ class TanhShrinkOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of TanhShrink operator");
    AddOutput("Y", "Output of TanhShrink operator");
-    AddComment("TanhShrink activation operator, tanhshrink(x) = x - tanh(x)");
+    AddComment(R"DOC(
+TanhShrink activation operator.
+
+$$y = x - \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
+
+)DOC");
  }
 };

@@ -143,13 +181,20 @@ class HardShrinkOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of HardShrink operator");
    AddOutput("Y", "Output of HardShrink operator");
-    AddComment(
-        "HardShrink activation operator, "
-        "hard_shrink(x) = x if x > lambda"
-        "hard_shrink(x) = x if x < -lambda"
-        "hard_shrink(x) = 0 otherwise");
    AddAttr<AttrType>("threshold", "The value of threshold for HardShrink")
        .SetDefault(static_cast<AttrType>(0.5));
+    AddComment(R"DOC(
+HardShrink activation operator.
+
+$$
+y = \begin{cases} 
+    x, \text{if } x > \lambda \\
+    x, \text{if } x < -\lambda \\
+    0,  \text{otherwise}
+    \end{cases}
+$$
+
+)DOC");
  }
 };

@@ -159,7 +204,12 @@ class SqrtOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Sqrt operator");
    AddOutput("Y", "Output of Sqrt operator");
-    AddComment("Sqrt activation operator, sqrt(x) = x^(1/2)");
+    AddComment(R"DOC(
+Sqrt activation operator.
+
+$y = \sqrt{x}$
+
+)DOC");
  }
 };

@@ -169,7 +219,12 @@ class AbsOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Abs operator");
    AddOutput("Y", "Output of Abs operator");
-    AddComment("Abs activation operator, abs(x) = |x|");
+    AddComment(R"DOC(
+Abs activation operator.
+
+$y = |x|$
+
+)DOC");
  }
 };

@@ -180,7 +235,12 @@ class ReciprocalOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Reciprocal operator");
    AddOutput("Y", "Output of Reciprocal operator");
-    AddComment("Reciprocal activation operator, reciprocal(x) = 1 / x");
+    AddComment(R"DOC(
+Reciprocal activation operator.
+
+$$y = \frac{1}{x}$$
+
+)DOC");
  }
 };

@@ -190,7 +250,14 @@ class LogOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Log operator");
    AddOutput("Y", "Output of Log operator");
-    AddComment("Log activation operator, log(x) = natural logarithm of x");
+    AddComment(R"DOC(
+Log activation operator.
+
+$y = \ln(x)$
+
+Natural logarithm of x.
+
+)DOC");
  }
 };

@@ -200,7 +267,12 @@ class SquareOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Square operator");
    AddOutput("Y", "Output of Square operator");
-    AddComment("Square activation operator, square(x) = x^2");
+    AddComment(R"DOC(
+Square activation operator.
+
+$y = x^2$
+
+)DOC");
  }
 };

@@ -211,7 +283,12 @@ class SoftplusOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Softplus operator");
    AddOutput("Y", "Output of Softplus operator");
-    AddComment("Softplus activation operator, softplus(x) = log(1 + exp(x))");
+    AddComment(R"DOC(
+Softplus activation operator.
+
+$y = \ln(1 + e^{x})$
+
+)DOC");
  }
 };

@@ -222,7 +299,12 @@ class SoftsignOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Softsign operator");
    AddOutput("Y", "Output of Softsign operator");
-    AddComment("Softsign activation operator, softsign(x) = x / (1 + |x|)");
+    AddComment(R"DOC(
+Softsign activation operator.
+
+$$y = \frac{x}{1 + |x|}$$
+
+)DOC");
  }
 };

@@ -233,11 +315,16 @@ class BReluOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of BRelu operator");
    AddOutput("Y", "Output of BRelu operator");
-    AddComment("BRelu activation operator, brelu = max(min(x, t_min), t_max)");
    AddAttr<AttrType>("t_min", "The min marginal value of BRelu")
        .SetDefault(static_cast<AttrType>(0));
    AddAttr<AttrType>("t_max", "The max marginal value of BRelu")
        .SetDefault(static_cast<AttrType>(24));
+    AddComment(R"DOC(
+BRelu activation operator.
+
+$y = \max(\min(x, t_{min}), t_{max})$
+
+)DOC");
  }
 };

@@ -249,11 +336,14 @@ class SoftReluOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of SoftRelu operator");
    AddOutput("Y", "Output of SoftRelu operator");
-    AddComment(
-        "SoftRelu activation operator, soft_relu = log(1 + exp(max(min(x, "
-        "threshold), threshold)))");
    AddAttr<AttrType>("threshold", "The threshold value of SoftRelu")
        .SetDefault(static_cast<AttrType>(40));
+    AddComment(R"DOC(
+SoftRelu activation operator.
+
+$y = \ln(1 + \exp(\max(\min(x, threshold), threshold))$
+
+)DOC");
  }
 };

@@ -262,19 +352,19 @@ class ELUOpMaker : public framework::OpProtoAndCheckerMaker {
 public:
  ELUOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
-    AddInput("X",
-             "(Tensor) The input of ELU operator, it shouldn't be empty. Input "
-             "is flattened and treated as a 1D array.");
-    AddOutput("Y",
-              "(Tensor) The output of ELU operator. It has the same shape as "
-              "the input.");
-    AddAttr<AttrType>(
-        "alpha", "(float, default 1.0) Alpha value in the elu formulation.")
-        .SetDefault(static_cast<AttrType>(1.));
+    AddInput("X", "Input of ELU operator");
+    AddOutput("Y", "Output of ELU operator");
+    AddAttr<AttrType>("alpha", "The alpha value of ELU")
+        .SetDefault(static_cast<AttrType>(1.0f));
    AddComment(R"DOC(
-        ELU activation operator. It applies this element-wise computation on
-        the input: f(x) = max(0, x) + min(0, alpha * (exp(x) - 1)).
-        Check .. _Link: https://arxiv.org/abs/1511.07289 for more details.)DOC");
+ELU activation operator.
+
+Applies the following element-wise computation on the input according to
+https://arxiv.org/abs/1511.07289.
+
+$y = \max(0, x) + \min(0, \alpha * (e^x - 1))$
+
+)DOC");
  }
 };

@@ -285,9 +375,14 @@ class Relu6OpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Relu6 operator");
    AddOutput("Y", "Output of Relu6 operator");
-    AddComment("Relu6 activation operator, relu6 = min(max(0, x), 6)");
    AddAttr<AttrType>("threshold", "The threshold value of Relu6")
        .SetDefault(static_cast<AttrType>(6));
+    AddComment(R"DOC(
+Relu6 activation operator.
+
+$y = \min(\max(0, x), 6)$
+
+)DOC");
  }
 };

@@ -298,9 +393,14 @@ class PowOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of Pow operator");
    AddOutput("Y", "Output of Pow operator");
-    AddComment("Pow activation operator, pow(x, factor) = x^factor");
    AddAttr<AttrType>("factor", "The exponential factor of Pow")
        .SetDefault(static_cast<AttrType>(1));
+    AddComment(R"DOC(
+Pow activation operator.
+
+$y = x^{factor}$
+
+)DOC");
  }
 };

@@ -311,11 +411,16 @@ class STanhOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of STanh operator");
    AddOutput("Y", "Output of STanh operator");
-    AddComment("STanh activation operator, stanh = b * tanh(a * x)");
    AddAttr<AttrType>("scale_a", "The scale parameter of a for the input")
        .SetDefault(static_cast<AttrType>(2 / 3));
    AddAttr<AttrType>("scale_b", "The scale parameter of b for the input")
        .SetDefault(static_cast<AttrType>(1.7159));
+    AddComment(R"DOC(
+STanh activation operator.
+
+$$y = b * \frac{e^{a * x} - e^{-a * x}}{e^{a * x} + e^{-a * x}}$$
+
+)DOC");
  }
 };

@@ -327,12 +432,19 @@ class ThresholdedReluOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of ThresholdedRelu operator");
    AddOutput("Y", "Output of ThresholdedRelu operator");
-    AddComment(
-        "ThresholdedRelu activation operator, "
-        "thresholded_relu = x for x > threshold, "
-        "thresholded_relu = 0 otherwise.");
    AddAttr<AttrType>("threshold", "The threshold location of activation")
        .SetDefault(static_cast<AttrType>(1.0));
+    AddComment(R"DOC(
+ThresholdedRelu activation operator.
+
+$$
+y = \begin{cases} 
+    x, \text{if } x > threshold \\
+    0,  \text{otherwise}
+    \end{cases}
+$$
+
+)DOC");
  }
 };

@@ -344,27 +456,23 @@ class HardSigmoidOpMaker : public framework::OpProtoAndCheckerMaker {
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "Input of HardSigmoid operator");
    AddOutput("Y", "Output of HardSigmoid operator");
+    AddAttr<AttrType>("slope", "Slope for linear approximation of sigmoid")
+        .SetDefault(static_cast<AttrType>(0.2));
+    AddAttr<AttrType>("offset", "Offset for linear approximation of sigmoid")
+        .SetDefault(static_cast<AttrType>(0.5));
    AddComment(R"DOC(
-Hard Sigmoid activation operator.
+HardSigmoid activation operator.

-Segment-wise linear approximation of sigmoid[1].
-This is much faster than sigmoid.
+Segment-wise linear approximation of sigmoid(https://arxiv.org/abs/1603.00391), 
+which is much faster than sigmoid.

-hard_sigmoid = max(0, min(1, slope * x + shift))
+$y = \max(0, \min(1, slope * x + shift))$

 The slope should be positive. The offset can be either positive or negative.
-The default slope and shift are set from [1].
+The default slope and shift are set according to the above reference.
 It is recommended to use the defaults for this activation.

-References:
-  [1] Noisy Activation Functions
-      (https://arxiv.org/abs/1603.00391)
-
-    )DOC");
-    AddAttr<AttrType>("slope", "Slope for linear approximation of sigmoid")
-        .SetDefault(static_cast<AttrType>(0.2));
-    AddAttr<AttrType>("offset", "Offset for linear approximation of sigmoid")
-        .SetDefault(static_cast<AttrType>(0.5));
+)DOC");
  }
 };


--- a/paddle/operators/activation_op.h
+++ b/paddle/operators/activation_op.h
@@ -232,7 +232,7 @@ struct HardShrinkGradFunctor : public BaseActivationFunctor<T> {
  }
 };

-// softshrink(x) = x - lambda, if x > lambda; x + lambda, if x < lambda; 0
+// softshrink(x) = x - lambda, if x > lambda; x + lambda, if x < -lambda; 0
 // otherwise
 template <typename T>
 struct SoftShrinkFunctor : public BaseActivationFunctor<T> {

--- a/paddle/operators/conv2d_transpose_cudnn_op.cc
+++ b/paddle/operators/conv2d_transpose_cudnn_op.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#include "paddle/operators/conv2d_transpose_op.h"
+
+namespace paddle {
+namespace operators {
+
+class CudnnConv2DTransposeOpMaker : public Conv2DTransposeOpMaker {
+ public:
+  CudnnConv2DTransposeOpMaker(framework::OpProto* proto,
+                              framework::OpAttrChecker* op_checker)
+      : Conv2DTransposeOpMaker(proto, op_checker) {
+    AddAttr<std::vector<int>>("dilations", "dilations of convolution operator.")
+        .SetDefault(std::vector<int>{1, 1});
+    AddAttr<int>("workspace_size_MB",
+                 "workspace size for cudnn, in MB, "
+                 "workspace is a section of GPU memory which will be "
+                 "allocated/freed each time the operator runs, larger "
+                 "workspace size can increase performance but also requires "
+                 "better hardward. This size should be carefully setted.")
+        .SetDefault(4096);
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP(conv2d_transpose_cudnn, ops::Conv2DTransposeOp,
+            ops::CudnnConv2DTransposeOpMaker, conv2d_transpose_cudnn_grad,
+            ops::Conv2DTransposeOpGrad);
+
+REGISTER_OP_CPU_KERNEL(
+    conv2d_transpose_cudnn,
+    ops::GemmConv2DTransposeKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_CPU_KERNEL(
+    conv2d_transpose_cudnn_grad,
+    ops::GemmConv2DTransposeGradKernel<paddle::platform::CPUPlace, float>);
--- a/paddle/operators/conv2d_transpose_cudnn_op.cu
+++ b/paddle/operators/conv2d_transpose_cudnn_op.cu
+/* Copyright (c) 2016 PaddlePaddle Authors All Rights Reserve.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#include "paddle/framework/eigen.h"
+#include "paddle/framework/op_registry.h"
+#include "paddle/memory/memory.h"
+#include "paddle/operators/conv2d_transpose_op.h"
+#include "paddle/platform/assert.h"
+#include "paddle/platform/cudnn_helper.h"
+
+namespace paddle {
+namespace operators {
+
+using Tensor = framework::Tensor;
+using ScopedTensorDescriptor = platform::ScopedTensorDescriptor;
+using ScopedFilterDescriptor = platform::ScopedFilterDescriptor;
+using ScopedConvolutionDescriptor = platform::ScopedConvolutionDescriptor;
+using DataLayout = platform::DataLayout;
+using CUDADeviceContext = platform::CUDADeviceContext;
+
+static constexpr size_t kConvCudnnWorkspaceLimitBytes = 1024 * 1024 * 1024;
+
+template <typename T>
+class CudnnConvTransposeOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()),
+                   "It must use GPUPlace.");
+    auto* input = ctx.Input<Tensor>("Input");
+    auto* filter = ctx.Input<Tensor>("Filter");
+    auto* output = ctx.Output<Tensor>("Output");
+
+    std::vector<int> strides = ctx.Attr<std::vector<int>>("strides");
+    std::vector<int> paddings = ctx.Attr<std::vector<int>>("paddings");
+    // cudnn v5 does not support dilations
+    std::vector<int> dilations = ctx.Attr<std::vector<int>>("dilations");
+    int user_workspace_size = ctx.Attr<int>("workspace_size_MB");
+
+    const T* input_data = input->data<T>();
+    const T* filter_data = filter->data<T>();
+    T* output_data = output->mutable_data<T>(ctx.GetPlace());
+    // ------------------- cudnn descriptors ---------------------
+    ScopedTensorDescriptor input_desc;
+    ScopedTensorDescriptor output_desc;
+    ScopedFilterDescriptor filter_desc;
+    ScopedConvolutionDescriptor conv_desc;
+    DataLayout layout = DataLayout::kNCHW;
+
+    // N, M, H, W
+    cudnnTensorDescriptor_t cudnn_input_desc = input_desc.descriptor<T>(
+        layout, framework::vectorize2int(input->dims()));
+    // N, C, O_h, O_w
+    cudnnTensorDescriptor_t cudnn_output_desc = output_desc.descriptor<T>(
+        layout, framework::vectorize2int(output->dims()));
+    // M, C, K_h, K_w
+    cudnnFilterDescriptor_t cudnn_filter_desc = filter_desc.descriptor<T>(
+        layout, framework::vectorize2int(filter->dims()));
+    cudnnConvolutionDescriptor_t cudnn_conv_desc =
+        conv_desc.descriptor<T>(paddings, strides, dilations);
+
+    // ------------------- cudnn conv workspace ---------------------
+    void* cudnn_workspace = nullptr;
+    size_t workspace_size_in_bytes;  // final workspace to allocate.
+    size_t workspace_size_limit = kConvCudnnWorkspaceLimitBytes;
+    if (user_workspace_size > 0) {
+      workspace_size_limit = user_workspace_size * 1024 * 1024;
+    }
+    // ------------------- cudnn conv algorithm ---------------------
+    cudnnConvolutionBwdDataAlgo_t algo;
+    auto handle = ctx.cuda_device_context().cudnn_handle();
+    // Get the algorithm
+    PADDLE_ENFORCE(platform::dynload::cudnnGetConvolutionBackwardDataAlgorithm(
+        handle, cudnn_filter_desc, cudnn_input_desc, cudnn_conv_desc,
+        // dxDesc: Handle to the previously initialized output tensor
+        // descriptor.
+        cudnn_output_desc, CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT,
+        workspace_size_limit, &algo));
+
+    // get workspace size able to allocate
+    PADDLE_ENFORCE(
+        platform::dynload::cudnnGetConvolutionBackwardDataWorkspaceSize(
+            handle, cudnn_filter_desc, cudnn_input_desc, cudnn_conv_desc,
+            cudnn_output_desc, algo, &workspace_size_in_bytes));
+
+    // Allocate on GPU memory
+    platform::GPUPlace gpu = boost::get<platform::GPUPlace>(ctx.GetPlace());
+    cudnn_workspace = paddle::memory::Alloc(gpu, workspace_size_in_bytes);
+
+    // ------------------- cudnn conv transpose forward ---------------------
+    T alpha = 1.0f, beta = 0.0f;
+    PADDLE_ENFORCE(platform::dynload::cudnnConvolutionBackwardData(
+        handle, &alpha, cudnn_filter_desc, filter_data, cudnn_input_desc,
+        input_data, cudnn_conv_desc, algo, cudnn_workspace,
+        workspace_size_in_bytes, &beta, cudnn_output_desc, output_data));
+
+    // Release the cudnn workspace
+    paddle::memory::Free(gpu, cudnn_workspace);
+  }
+};
+
+template <typename T>
+class CudnnConvTransposeGradOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()),
+                   "It must use GPUPlace.");
+    auto input = ctx.Input<Tensor>("Input");
+    auto filter = ctx.Input<Tensor>("Filter");
+    auto output_grad = ctx.Input<Tensor>(framework::GradVarName("Output"));
+    auto input_grad = ctx.Output<Tensor>(framework::GradVarName("Input"));
+    auto filter_grad = ctx.Output<Tensor>(framework::GradVarName("Filter"));
+    const T* input_data = input->data<T>();
+    const T* output_grad_data = output_grad->data<T>();
+    const T* filter_data = filter->data<T>();
+
+    std::vector<int> strides = ctx.Attr<std::vector<int>>("strides");
+    std::vector<int> paddings = ctx.Attr<std::vector<int>>("paddings");
+    // cudnn v5 does not support dilations
+    std::vector<int> dilations = ctx.Attr<std::vector<int>>("dilations");
+    int user_workspace_size = ctx.Attr<int>("workspace_size_MB");
+
+    // ------------------- cudnn descriptors ---------------------
+    ScopedTensorDescriptor input_desc;
+    ScopedTensorDescriptor output_desc;
+    ScopedFilterDescriptor filter_desc;
+    ScopedConvolutionDescriptor conv_desc;
+    DataLayout layout = DataLayout::kNCHW;
+
+    // Input: (N, M, H, W)
+    cudnnTensorDescriptor_t cudnn_input_desc = input_desc.descriptor<T>(
+        layout, framework::vectorize2int(input->dims()));
+    // Output: (N, C, O_H, O_W)
+    cudnnTensorDescriptor_t cudnn_output_desc = output_desc.descriptor<T>(
+        layout, framework::vectorize2int(output_grad->dims()));
+    // Filter (M, C, K_H, K_W)
+    cudnnFilterDescriptor_t cudnn_filter_desc = filter_desc.descriptor<T>(
+        layout, framework::vectorize2int(filter->dims()));
+
+    cudnnConvolutionDescriptor_t cudnn_conv_desc =
+        conv_desc.descriptor<T>(paddings, strides, dilations);
+
+    // ------------------- cudnn backward algorithm ---------------------
+    cudnnConvolutionFwdAlgo_t data_algo;
+    cudnnConvolutionBwdFilterAlgo_t filter_algo;
+    size_t bwd_filter_ws_size, fwd_ws_size;
+    size_t workspace_size_in_bytes = 0;
+    size_t workspace_size_limit = kConvCudnnWorkspaceLimitBytes;
+    if (user_workspace_size > 0) {
+      workspace_size_limit = user_workspace_size * 1024 * 1024;
+    }
+
+    auto handle = ctx.cuda_device_context().cudnn_handle();
+    if (input_grad) {
+      // choose backward algorithm for data
+      PADDLE_ENFORCE(platform::dynload::cudnnGetConvolutionForwardAlgorithm(
+          handle, cudnn_output_desc, cudnn_filter_desc, cudnn_conv_desc,
+          cudnn_input_desc, CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT,
+          workspace_size_limit, &data_algo));
+      PADDLE_ENFORCE(platform::dynload::cudnnGetConvolutionForwardWorkspaceSize(
+          handle, cudnn_output_desc, cudnn_filter_desc, cudnn_conv_desc,
+          cudnn_input_desc, data_algo, &fwd_ws_size));
+      workspace_size_in_bytes = std::max(workspace_size_in_bytes, fwd_ws_size);
+    }
+
+    if (filter_grad) {
+      // choose backward algorithm for filter
+      PADDLE_ENFORCE(
+          platform::dynload::cudnnGetConvolutionBackwardFilterAlgorithm(
+              handle, cudnn_output_desc, cudnn_input_desc, cudnn_conv_desc,
+              cudnn_filter_desc,
+              CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT,
+              workspace_size_limit, &filter_algo));
+
+      // get workspace for backwards filter algorithm
+      PADDLE_ENFORCE(
+          platform::dynload::cudnnGetConvolutionBackwardFilterWorkspaceSize(
+              handle, cudnn_output_desc, cudnn_input_desc, cudnn_conv_desc,
+              cudnn_filter_desc, filter_algo, &bwd_filter_ws_size));
+      workspace_size_in_bytes =
+          std::max(workspace_size_in_bytes, bwd_filter_ws_size);
+    }
+
+    // ------------------- cudnn conv workspace ---------------------
+    // Already on GPU
+    void* cudnn_workspace = nullptr;
+    platform::GPUPlace gpu = boost::get<platform::GPUPlace>(ctx.GetPlace());
+    cudnn_workspace = paddle::memory::Alloc(gpu, workspace_size_in_bytes);
+    // ------------------- cudnn conv backward data ---------------------
+    // FIXME(typhoonzero): template type T may not be the same as cudnn call.
+    T alpha = 1.0f, beta = 0.0f;
+    if (input_grad) {
+      T* input_grad_data = input_grad->mutable_data<T>(ctx.GetPlace());
+      auto t = framework::EigenVector<T>::Flatten(*input_grad);
+      t.device(ctx.GetEigenDevice<platform::GPUPlace>()) =
+          t.constant(static_cast<T>(0));
+
+      PADDLE_ENFORCE(platform::dynload::cudnnConvolutionForward(
+          handle, &alpha, cudnn_output_desc, output_grad_data,
+          cudnn_filter_desc, filter_data, cudnn_conv_desc, data_algo,
+          cudnn_workspace, workspace_size_in_bytes, &beta, cudnn_input_desc,
+          input_grad_data));
+    }
+
+    // ------------------- cudnn conv backward filter ---------------------
+    if (filter_grad) {
+      T* filter_grad_data = filter_grad->mutable_data<T>(ctx.GetPlace());
+      auto t = framework::EigenVector<T>::Flatten(*filter_grad);
+      t.device(ctx.GetEigenDevice<platform::GPUPlace>()) =
+          t.constant(static_cast<T>(0));
+      // Gradient with respect to the filter
+      PADDLE_ENFORCE(platform::dynload::cudnnConvolutionBackwardFilter(
+          handle, &alpha, cudnn_output_desc, output_grad_data, cudnn_input_desc,
+          input_data, cudnn_conv_desc, filter_algo, cudnn_workspace,
+          workspace_size_in_bytes, &beta, cudnn_filter_desc, filter_grad_data));
+    }
+    // Release the cudnn workspace
+    paddle::memory::Free(gpu, cudnn_workspace);
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+
+REGISTER_OP_GPU_KERNEL(conv2d_transpose_cudnn,
+                       ops::CudnnConvTransposeOpKernel<float>);
+REGISTER_OP_GPU_KERNEL(conv2d_transpose_cudnn_grad,
+                       ops::CudnnConvTransposeGradOpKernel<float>);
--- a/paddle/operators/conv2dtranspose_op.cc
+++ b/paddle/operators/conv2dtranspose_op.cc
@@ -12,7 +12,7 @@
   See the License for the specific language governing permissions and
   limitations under the License. */

-#include "paddle/operators/conv2dtranspose_op.h"
+#include "paddle/operators/conv2d_transpose_op.h"

 namespace paddle {
 namespace operators {
@@ -95,13 +95,13 @@ void Conv2DTransposeOpGrad::InferShape(
 }  // namespace paddle

 namespace ops = paddle::operators;
-REGISTER_OP(conv2dtranspose, ops::Conv2DTransposeOp,
-            ops::Conv2DTransposeOpMaker, conv2dtranspose_grad,
+REGISTER_OP(conv2d_transpose, ops::Conv2DTransposeOp,
+            ops::Conv2DTransposeOpMaker, conv2d_transpose_grad,
            ops::Conv2DTransposeOpGrad);

 REGISTER_OP_CPU_KERNEL(
-    conv2dtranspose,
+    conv2d_transpose,
    ops::GemmConv2DTransposeKernel<paddle::platform::CPUPlace, float>);
 REGISTER_OP_CPU_KERNEL(
-    conv2dtranspose_grad,
+    conv2d_transpose_grad,
    ops::GemmConv2DTransposeGradKernel<paddle::platform::CPUPlace, float>);
--- a/paddle/operators/conv2dtranspose_op.cu
+++ b/paddle/operators/conv2dtranspose_op.cu
@@ -12,13 +12,13 @@
   See the License for the specific language governing permissions and
   limitations under the License. */

-#include "paddle/operators/conv2dtranspose_op.h"
+#include "paddle/operators/conv2d_transpose_op.h"

 namespace ops = paddle::operators;

 REGISTER_OP_GPU_KERNEL(
-    conv2dtranspose,
+    conv2d_transpose,
    ops::GemmConv2DTransposeKernel<paddle::platform::GPUPlace, float>);
 REGISTER_OP_GPU_KERNEL(
-    conv2dtranspose_grad,
+    conv2d_transpose_grad,
    ops::GemmConv2DTransposeGradKernel<paddle::platform::GPUPlace, float>);
--- a/paddle/operators/conv2dtranspose_op.h
+++ b/paddle/operators/conv2dtranspose_op.h
@@ -62,7 +62,7 @@ class GemmConv2DTransposeKernel : public framework::OpKernel<T> {
    std::vector<int> strides = context.Attr<std::vector<int>>("strides");

    // TODO(Zhuoyuan): Paddings can be added in future.
-    // groups will alway be disabled in conv2dtranspose.
+    // groups will alway be disabled in conv2d_transpose.

    const int batch_size = input->dims()[0];
    const int m = input->dims()[1];

--- a/paddle/operators/cross_entropy_op.cc
+++ b/paddle/operators/cross_entropy_op.cc
@@ -28,8 +28,9 @@ class CrossEntropyOp : public framework::OperatorWithKernel {

    auto x_dims = ctx->GetInputDim("X");
    auto label_dims = ctx->GetInputDim("Label");
-    PADDLE_ENFORCE_EQ(x_dims.size(), 2, "Input(X)'s rank should be 2.");
-    PADDLE_ENFORCE_EQ(label_dims.size(), 2, "Input(Label)'s rank should be 2.");
+    PADDLE_ENFORCE_EQ(x_dims.size(), 2UL, "Input(X)'s rank should be 2.");
+    PADDLE_ENFORCE_EQ(label_dims.size(), 2UL,
+                      "Input(Label)'s rank should be 2.");
    PADDLE_ENFORCE_EQ(x_dims[0], label_dims[0],
                      "The 1st dimension of Input(X) and Input(Label) should "
                      "be equal.");
@@ -38,8 +39,8 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
                        "If Attr(soft_label) == true, the 2nd dimension of "
                        "Input(X) and Input(Label) should be equal.");
    } else {
-      PADDLE_ENFORCE_EQ(label_dims[1], 1,
-                        "If Attr(soft_label) == false, the 2nd dimension of "
+      PADDLE_ENFORCE_EQ(label_dims[1], 1UL,
+                        "If Attr(softLabel) == false, the 2nd dimension of "
                        "Input(Label) should be 1.");
    }

@@ -48,7 +49,8 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
  }

 protected:
-  // CrossEntropy's data type just determined by "X"
+  // Explicitly set that data type of the output of the cross_entropy operator
+  // is determined by its input "X".
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(ctx.Input<Tensor>("X")->type());

--- a/paddle/operators/dynamic_recurrent_op_test.cc
+++ b/paddle/operators/dynamic_recurrent_op_test.cc
@@ -51,7 +51,7 @@ class RNNAlgorithmTestHelper : public ::testing::Test {
    CreateGlobalVariables();

    auto op_desc = CreateOpDesc();
-    op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+    op = paddle::framework::OpRegistry::CreateOp(op_desc);
    dop = &(dynamic_cast<DynamicRecurrentOp*>(op.get())->rnn);
    InitCacheManually();
    InitStepNet();

--- a/paddle/operators/fill_constant_batch_size_like_op.cc
+++ b/paddle/operators/fill_constant_batch_size_like_op.cc
@@ -36,7 +36,12 @@ class FillConstantBatchSizeLikeOp : public framework::OperatorWithKernel {
                   [](int a) { return static_cast<int64_t>(a); });
    auto dims = framework::make_ddim(shape_int64);

-    dims[0] = ctx->GetInputDim("Input")[0];
+    int dim_idx = ctx->Attrs().Get<int>("dim_idx");
+    PADDLE_ENFORCE_GE(dim_idx, 0);
+    PADDLE_ENFORCE_GT(static_cast<int>(shape.size()), dim_idx);
+    PADDLE_ENFORCE_GT(ctx->GetInputDim("Input").size(), dim_idx);
+
+    dims[dim_idx] = ctx->GetInputDim("Input")[dim_idx];
    ctx->SetOutputDim("Out", dims);
  }

@@ -57,15 +62,18 @@ class FillConstantBatchSizeLikeOpMaker
                 "(int, default 5 (FP32)) "
                 "Output data type")
        .SetDefault(framework::DataType::FP32);
-    AddAttr<std::vector<int>>("shape", "(vector<int>) The shape of the output");
-    AddAttr<float>("value", "(float, default 0) The value to be filled")
-        .SetDefault(0.0f);
    AddInput("Input",
             "(Tensor) Tensor "
-             "whose first dimension is used to specify the batch_size");
+             "whose dim_idx th dimension is used to specify the batch_size");
    AddOutput("Out",
              "(Tensor) Tensor of specified shape will be filled "
              "with the specified value");
+    AddAttr<std::vector<int>>("shape", "(vector<int>) The shape of the output");
+    AddAttr<int>("dim_idx",
+                 "(int, default 0) the index of batch size dimension")
+        .SetDefault(0);
+    AddAttr<float>("value", "(float, default 0) The value to be filled")
+        .SetDefault(0.0f);
    AddComment(R"DOC(Fill up a variable with specified constant value.)DOC");
  }
 };

--- a/paddle/operators/gaussian_random_op.cc
+++ b/paddle/operators/gaussian_random_op.cc
@@ -45,14 +45,14 @@ class GaussianRandomOp : public framework::OperatorWithKernel {
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
                   "Output(Out) of GaussianRandomOp should not be null.");
-    auto dims = ctx->Attrs().Get<std::vector<int>>("dims");
+    auto shape = ctx->Attrs().Get<std::vector<int>>("shape");
    std::vector<int64_t> temp;
-    temp.reserve(dims.size());
-    for (auto dim : dims) {
+    temp.reserve(shape.size());
+    for (auto dim : shape) {
      temp.push_back(static_cast<int64_t>(dim));
    }
-    PADDLE_ENFORCE(dims.size() > 0UL,
-                   "dims can be one int or array. dims must be set.");
+    PADDLE_ENFORCE(shape.size() > 0UL,
+                   "shape can be one int or array. shape must be set.");
    ctx->SetOutputDim("Out", framework::make_ddim(temp));
  }

@@ -74,7 +74,7 @@ GaussianRandom operator.
 Use to initialize tensor with gaussian random generator.
 )DOC");

-    AddAttr<std::vector<int>>("dims", "The dimension of random tensor.");
+    AddAttr<std::vector<int>>("shape", "The dimension of random tensor.");
    AddAttr<float>("mean", "mean of random tensor.").SetDefault(.0f);
    AddAttr<float>("std", "std of random tensor.").SetDefault(1.0f);
    AddAttr<int>("seed",

--- a/paddle/operators/linear_chain_crf_op.cc
+++ b/paddle/operators/linear_chain_crf_op.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/linear_chain_crf_op.h"
+
+namespace paddle {
+namespace operators {
+
+class LinearChainCRFOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  LinearChainCRFOpMaker(framework::OpProto* proto,
+                        framework::OpAttrChecker* op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput(
+        "Emission",
+        "(LoDTensor, default: LoDTensor<float>). "
+        "The unscaled emission weight matrix for the linear chain CRF. "
+        "This input is a LoDTensor with shape [N x D] where N is the size of "
+        "the mini-batch and D is the total tag number.");
+    AddInput(
+        "Transition",
+        "(Tensor, default: Tensor<float>). A Tensor with shape [(D + 2) x D]. "
+        "The learnable parameter for the linear_chain_crf operator. "
+        "See more details in the operator's comments.");
+    AddInput(
+        "Label",
+        "(LoDTensor, default: LoDTensor<int>). The ground truth which is a 2-D "
+        "LoDTensor with shape [N x 1], where N is the total element number in "
+        "a mini-batch.");
+    AddOutput(
+        "Alpha",
+        "Tensor, default: Tensor<float>. The forward vectors for the entire "
+        "batch. A two dimensional tensor with shape [N x D], "
+        "denoted as \f$\alpha\f$. \f$\alpha$\f is a memo table used to "
+        "calculate the normalization factor in CRF. \f$\alpha[k, v]$\f stores "
+        "the unnormalized probabilites of all possible unfinished sequences of "
+        "tags that end at position \f$k$\f with tag \f$v$\f. For each \f$k$\f, "
+        "\f$\alpha[k, v]$\f is a vector of length \f$D$\f with a component for "
+        "each tag value \f$v$\f. This vector is called a forward vecotr and "
+        "will also be used in backward computations.")
+        .AsIntermediate();
+    AddOutput("EmissionExps",
+              "The exponentials of Input(Emission). This is an intermediate "
+              "computational result in forward computation, and will be reused "
+              "in backward computation.")
+        .AsIntermediate();
+    AddOutput("TransitionExps",
+              "The exponentials of Input(Transition). This is an intermediate "
+              "computational result in forward computation, and will be reused "
+              "in backward computation.")
+        .AsIntermediate();
+    AddOutput(
+        "LogLikelihood",
+        "(Tensor, default: Tensor<float>). The logarithm of the conditional "
+        "likelihood of each training sample in a mini-batch. This is a 2-D "
+        "tensor with shape [S x 1], where S is the sequence number in a "
+        "mini-batch. Note: S is equal to the sequence number in a mini-batch. "
+        "The output is no longer a LoDTensor.");
+    AddComment(R"DOC(
+Conditional Random Field defines an undirected probabilistic graph with nodes
+denoting random variables and edges denoting dependencies between these
+variables. CRF learns the conditional probability \f$P(Y|X)\f$, where
+\f$X = (x_1, x_2, ... , x_n)\f$ are structured inputs and
+\f$Y = (y_1, y_2, ... , y_n)\f$ are labels for the inputs.
+
+Linear chain CRF is a special case of CRF that is useful for sequence labeling
+task. Sequence labeling tasks do not assume a lot of conditional
+independences among inputs. The only constraint they impose is that the input
+and output must be linear sequences. Thus, the graph of such a CRF is a simple
+chain or a line, which results in the linear chain CRF.
+
+This operator implements the Forward-Backward algorithm for the linear chain
+CRF. Please see http://www.cs.columbia.edu/~mcollins/fb.pdf and
+http://cseweb.ucsd.edu/~elkan/250Bwinter2012/loglinearCRFs.pdf for reference.
+
+Equation:
+
+- Denote Input(Emission) to this operator as \f$x\f$ here.
+- The first D values of Input(Transition) to this operator are for starting
+weights, denoted as \f$a\f$ here.
+- The next D values of Input(Transition) of this operator are for ending
+weights, denoted as \f$b\f$ here.
+- The remaning values of Input(Transition) are for transition weights,
+denoted as \f$w\f$ here.
+- Denote Input(Label) as \f$s\f$ here.
+
+The probability of a sequence \f$s\f$ of length \f$L\f$ is defined as:
+\f$P(s) = (1/Z) exp(a_{s_1} + b_{s_L}
+                 + \sum_{l=1}^L x_{s_l}
+                 + \sum_{l=2}^L w_{s_{l-1},s_l})\f$
+where \f$Z\f$ is a normalization value so that the sum of \f$P(s)\f$ over
+all possible sequences is \f$1\f$, and \f$x\f$ is the emission feature weight
+to the linear chain CRF.
+
+Finaly, the linear chain CRF operator outputs the logarithm of the conditional
+likelihood of each training sample in a mini-batch.
+
+NOTE:
+1. The feature function for a CRF is made up of the emission features and the
+transition features. The emission feature weights are NOT computed in
+this operator. They MUST be computed first before this operator is called.
+
+2. Because this operator performs global normalization over all possible
+sequences internally, it expects UNSCALED emission feature weights.
+Please do not call this op with the emission feature being output of any
+nonlinear activation.
+
+3. The 2nd dimension of Input(Emission) MUST be equal to the tag number.
+
+)DOC");
+  }
+};
+
+class LinearChainCRFOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+  void InferShape(framework::InferShapeContext* ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("Emission"),
+                   "Input(Emission) should be not null.");
+    PADDLE_ENFORCE(ctx->HasInput("Transition"),
+                   "Input(Transition) should be not null.");
+    PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) should be not null.");
+
+    PADDLE_ENFORCE(ctx->HasOutput("Alpha"),
+                   "Output(Alpha) should be not null.");
+    PADDLE_ENFORCE(ctx->HasOutput("EmissionExps"),
+                   "Output(EmissionExps) should be not null.");
+    PADDLE_ENFORCE(ctx->HasOutput("TransitionExps"),
+                   "Output(TransitionExps) should be not null.");
+    PADDLE_ENFORCE(ctx->HasOutput("LogLikelihood"),
+                   "Output(LogLikelihood) should be not null.");
+
+    auto emission_dims = ctx->GetInputDim("Emission");
+    PADDLE_ENFORCE_EQ(emission_dims.size(), 2UL,
+                      "The Input(Emission) should be a 2-D tensor.");
+    PADDLE_ENFORCE(emission_dims[0], "An empty mini-batch is not allowed.");
+
+    auto transition_dims = ctx->GetInputDim("Transition");
+    PADDLE_ENFORCE_EQ(transition_dims.size(), 2UL,
+                      "The Input(Transition) should be a 2-D tensor.");
+    PADDLE_ENFORCE_EQ(
+        transition_dims[0] - 2, transition_dims[1],
+        "An invalid dimension for the Input(Transition), which should "
+        "be a 2-D tensor with shape [(D + 2) x D].");
+    PADDLE_ENFORCE_EQ(
+        emission_dims[1], transition_dims[1],
+        "The 2nd dimension of the Input(Emission) and the Input(Transition) "
+        "should be equal to the tag number.");
+
+    auto label_dims = ctx->GetInputDim("Label");
+    PADDLE_ENFORCE(label_dims.size() == 2UL && label_dims[1] == 1UL,
+                   "The Input(Label) should be a 2-D tensor with the 2nd "
+                   "dimensions fixed to 1.");
+    PADDLE_ENFORCE_EQ(
+        emission_dims[0], label_dims[0],
+        "The height of Input(Emission) and the height of Input(Label) "
+        "should be the same.");
+
+    ctx->SetOutputDim("Alpha", emission_dims);
+    ctx->SetOutputDim("EmissionExps", emission_dims);
+    ctx->SetOutputDim("TransitionExps", transition_dims);
+    // TODO(caoying) This is tricky. The 1st dimension of Output(LogLikelihood)
+    // is the sequence number in a mini-batch. The dimension set here should be
+    // resized to its correct size in the function Compute. Fix this once we can
+    // get LoD information in the InferShape interface.
+    ctx->SetOutputDim("LogLikelihood", {emission_dims[0], 1});
+  }
+
+ protected:
+  // Explicitly set that the data type of output of the linear_chain_crf
+  // operator is determined by its input "Emission".
+  framework::DataType IndicateDataType(
+      const framework::ExecutionContext& ctx) const override {
+    return framework::ToDataType(ctx.Input<LoDTensor>("Emission")->type());
+  }
+};
+
+class LinearChainCRFGradOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+  void InferShape(framework::InferShapeContext* ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("EmissionExps"),
+                   "Input(EmissionExps) should be not null.");
+    PADDLE_ENFORCE(ctx->HasInput("TransitionExps"),
+                   "Input(TransitionExps) should be not null.");
+    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("LogLikelihood")),
+                   "Input(LogLikelihood@GRAD) shoudl be not null.");
+
+    auto emission_exps_dims = ctx->GetInputDim("EmissionExps");
+    PADDLE_ENFORCE_EQ(emission_exps_dims.size(), 2UL,
+                      "The Input(EmissionExps) should be a 2-D tensor.");
+    PADDLE_ENFORCE(emission_exps_dims[0],
+                   "An empty mini-batch is not allowed.");
+
+    auto transition_exps_dims = ctx->GetInputDim("TransitionExps");
+    PADDLE_ENFORCE_EQ(transition_exps_dims.size(), 2UL,
+                      "The Input(TransitionExps) should be a 2-D tensor.");
+    PADDLE_ENFORCE_EQ(
+        transition_exps_dims[0] - 2, transition_exps_dims[1],
+        "An invalid dimension for the Input(TransitionExps), which should "
+        "be a 2-D tensor with shape [(D + 2) x D].");
+    PADDLE_ENFORCE_EQ(
+        emission_exps_dims[1], transition_exps_dims[1],
+        "The 2nd dimension of the Input(EmissionExps) and the "
+        "Input(TransitionExps) should be equal to the tag number.");
+
+    auto label_dims = ctx->GetInputDim("Label");
+    PADDLE_ENFORCE(label_dims.size() == 2UL && label_dims[1] == 1UL,
+                   "The Input(Label) should be a 2-D tensor with the 2nd "
+                   "dimensions fixed to 1.");
+    PADDLE_ENFORCE_EQ(
+        emission_exps_dims[0], label_dims[0],
+        "The height of Input(EmissionExps) and the height of Input(Label) "
+        "should be the same.");
+
+    if (ctx->HasOutput(framework::GradVarName("Emission"))) {
+      ctx->SetOutputDim(framework::GradVarName("Emission"), emission_exps_dims);
+    }
+    if (ctx->HasOutput(framework::GradVarName("Transition"))) {
+      ctx->SetOutputDim(framework::GradVarName("Transition"),
+                        transition_exps_dims);
+    }
+  }
+
+ protected:
+  // Explicitly set that the data type of output of the linear_chain_crf_grad
+  // operator is determined by its input: gradients of LogLikelihood.
+  framework::DataType IndicateDataType(
+      const framework::ExecutionContext& ctx) const override {
+    return framework::ToDataType(
+        ctx.Input<LoDTensor>(framework::GradVarName("LogLikelihood"))->type());
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP(linear_chain_crf, ops::LinearChainCRFOp, ops::LinearChainCRFOpMaker,
+            linear_chain_crf_grad, ops::LinearChainCRFGradOp);
+REGISTER_OP_CPU_KERNEL(
+    linear_chain_crf,
+    ops::LinearChainCRFOpKernel<paddle::platform::CPUPlace, float>,
+    ops::LinearChainCRFOpKernel<paddle::platform::CPUPlace, double>);
+REGISTER_OP_CPU_KERNEL(
+    linear_chain_crf_grad,
+    ops::LinearChainCRFGradOpKernel<paddle::platform::CPUPlace, float>,
+    ops::LinearChainCRFGradOpKernel<paddle::platform::CPUPlace, double>);
--- a/paddle/operators/math/detail/hl_avx_functions.h
+++ b/paddle/operators/math/detail/hl_avx_functions.h
 /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.

-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at

-    http://www.apache.org/licenses/LICENSE-2.0
+http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
@@ -12,21 +12,15 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */

-#ifndef HL_AVX_FUNCTIONS_H_
-#define HL_AVX_FUNCTIONS_H_
+#include "paddle/operators/linear_chain_crf_op.h"

-#include <immintrin.h>
+namespace ops = paddle::operators;

-namespace hppl {
-__m256 relu(const __m256 a);
-__m256 sigmoid(const __m256 a);
-__m256 tanh(const __m256 a);
-__m256 linear(const __m256 a);
-
-__m256 relu(const __m256 a, const __m256 b);
-__m256 sigmoid(const __m256 a, const __m256 b);
-__m256 tanh(const __m256 a, const __m256 b);
-__m256 linear(const __m256 a, const __m256 b);
-}  // namespace hppl
-
-#endif  // HL_AVX_FUNCTIONS_H_
+REGISTER_OP_GPU_KERNEL(
+    linear_chain_crf,
+    ops::LinearChainCRFOpKernel<paddle::platform::GPUPlace, float>,
+    ops::LinearChainCRFOpKernel<paddle::platform::GPUPlace, double>);
+REGISTER_OP_GPU_KERNEL(
+    linear_chain_crf_grad,
+    ops::LinearChainCRFGradOpKernel<paddle::platform::GPUPlace, float>,
+    ops::LinearChainCRFGradOpKernel<paddle::platform::GPUPlace, double>);
--- a/paddle/operators/linear_chain_crf_op.h
+++ b/paddle/operators/linear_chain_crf_op.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "paddle/framework/eigen.h"
+#include "paddle/framework/op_registry.h"
+#include "paddle/operators/math/math_function.h"
+
+namespace paddle {
+namespace operators {
+
+template <typename T>
+static inline T NormalizeL1(T* x, size_t len) {
+  T sum = 0.;
+  for (size_t i = 0; i < len; ++i) sum += x[i];
+  // (This comment is from the old LinearChainCRFLayer.)
+  // Right now, we just bet that sum won't be zero. If this really happens, we
+  // will figure out what should be done then.
+  PADDLE_ENFORCE(sum,
+                 "The unnormalized probabilities of all possible unfinished "
+                 "sequences must be greater than 0.");
+  T s = 1. / sum;
+  for (size_t i = 0; i < len; ++i) x[i] *= s;
+  return sum;
+}
+
+template <typename T>
+struct ScalarMul {
+  explicit ScalarMul(const T& scalar) : scalar(scalar) {}
+  T operator()(const T& val) const { return val * scalar; }
+
+  T scalar;
+};
+
+using framework::LoDTensor;
+using framework::LoD;
+using framework::Tensor;
+template <typename T, int MajorType = Eigen::RowMajor,
+          typename IndexType = Eigen::DenseIndex>
+using EigenMatrix = framework::EigenMatrix<T, MajorType, IndexType>;
+
+template <typename Place, typename T>
+class LinearChainCRFOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    // TODO(caoying) The checks related to LoD information should be
+    // moved into InferShape once after the InferShape is refactored.
+    PADDLE_ENFORCE_EQ(ctx.Input<LoDTensor>("Emission")->NumLevels(), 1UL,
+                      "The Input(Emission) should be a sequence.");
+    PADDLE_ENFORCE_EQ(ctx.Input<LoDTensor>("Label")->NumLevels(), 1UL,
+                      "The Input(Label) should be a sequence.");
+    auto in_lod = ctx.Input<LoDTensor>("Label")->lod();
+    PADDLE_ENFORCE(in_lod.size(), "Input(Label) must be a sequence.");
+    const size_t level = 0;
+    const size_t seq_num = in_lod[level].size() - 1;
+
+    // These local variables hold the inputs and outputs, garanteeing them on
+    // CPU memory, to provide a consistent reference.
+    // TODO(caoying) Fix this by moving all these local variables into the
+    // class's data members once we can profile the whole training process.
+    LoDTensor* emission_weights = nullptr;
+    LoDTensor emission_weight_tensor;
+    Tensor* transition_weights = nullptr;
+    Tensor transition_weight_tensor;
+    LoDTensor* label = nullptr;
+    LoDTensor label_tensor;
+
+    Tensor* emission_exps = nullptr;
+    Tensor emission_exps_tensor;
+    Tensor* transition_exps = nullptr;
+    Tensor transition_exps_tensor;
+    Tensor* alpha = nullptr;
+    Tensor alpha_tensor;
+    Tensor* ll = nullptr;
+    Tensor ll_tensor;
+
+    if (platform::is_gpu_place(ctx.GetPlace())) {
+      emission_weights = &emission_weight_tensor;
+      transition_weights = &transition_weight_tensor;
+      label = &label_tensor;
+
+      CopyInputsToCpuMemory(
+          ctx.device_context(), *ctx.Input<LoDTensor>("Emission"),
+          *ctx.Input<Tensor>("Transition"), *ctx.Input<LoDTensor>("Label"),
+          emission_weights, transition_weights, label);
+
+      emission_exps = &emission_exps_tensor;
+      emission_exps->Resize(emission_weights->dims());
+
+      transition_exps = &transition_exps_tensor;
+      transition_exps->Resize(transition_weights->dims());
+
+      alpha = &alpha_tensor;
+      alpha->Resize(ctx.Output<Tensor>("Alpha")->dims());
+
+      ll = &ll_tensor;
+    } else {
+      emission_weights =
+          const_cast<LoDTensor*>(ctx.Input<LoDTensor>("Emission"));
+      transition_weights = const_cast<Tensor*>(ctx.Input<Tensor>("Transition"));
+      label = const_cast<LoDTensor*>(ctx.Input<LoDTensor>("Label"));
+
+      emission_exps = ctx.Output<Tensor>("EmissionExps");
+      transition_exps = ctx.Output<Tensor>("TransitionExps");
+      alpha = ctx.Output<Tensor>("Alpha");
+      ll = ctx.Output<Tensor>("LogLikelihood");
+    }
+
+    // Because the computation codes only runs on CPU, here the memory for all
+    // the outputs is FIXED to be allocated on the CPU memory.
+    emission_exps->mutable_data<T>(platform::CPUPlace());
+    transition_exps->mutable_data<T>(platform::CPUPlace());
+    alpha->mutable_data<T>(platform::CPUPlace());
+
+    // Resize the output tensor to its correct dimension.
+    ll->Resize({static_cast<int>(seq_num), 1});
+    ll->mutable_data<T>(platform::CPUPlace());
+
+    // Now, all the inputs and outputs should be on the CPU memory.
+    auto emission_dims = emission_weights->dims();
+    const size_t batch_size = emission_dims[0];
+    const size_t tag_num = emission_dims[1];
+
+    Tensor emission_row_max;
+    emission_row_max.mutable_data<T>(
+        framework::make_ddim({static_cast<int>(batch_size), 1}),
+        platform::CPUPlace());
+
+    auto place = ctx.GetEigenDevice<platform::CPUPlace>();
+    auto x = EigenMatrix<T>::From(*emission_weights);
+    auto x_row_max = EigenMatrix<T>::From(emission_row_max);
+    x_row_max.device(place) =
+        x.maximum(Eigen::DSizes<int, 1>(1))
+            .reshape(Eigen::DSizes<int, 2>(int(batch_size), 1));
+
+    auto x_exps = EigenMatrix<T>::From(*emission_exps);
+    x_exps.device(place) =
+        (x - x_row_max.broadcast(Eigen::DSizes<int, 2>(1, tag_num))).exp();
+
+    auto w = EigenMatrix<T>::From(*transition_weights);
+    auto w_exps = EigenMatrix<T>::From(*transition_exps);
+    w_exps.device(place) = w.exp();
+
+    T* log_likelihood = ll->data<T>();
+    for (size_t i = 0; i < seq_num; ++i) {
+      int start_pos = static_cast<int>(in_lod[level][i]);
+      int end_pos = static_cast<int>(in_lod[level][i + 1]);
+      if (end_pos == start_pos) {
+        // If an empty input sequence is given, pad 0 for its cost.
+        log_likelihood[i] = 0.;
+        continue;
+      }
+
+      const Tensor one_seq = emission_weights->Slice(start_pos, end_pos);
+      Tensor one_seq_row_max = emission_row_max.Slice(start_pos, end_pos);
+      Tensor one_seq_exps = emission_exps->Slice(start_pos, end_pos);
+      const Tensor one_seq_label = label->Slice(start_pos, end_pos);
+      Tensor one_seq_alpha = alpha->Slice(start_pos, end_pos);
+
+      log_likelihood[i] = ForwardOneSequence(
+          one_seq, one_seq_row_max, one_seq_exps, *transition_weights,
+          *transition_exps, one_seq_label, &one_seq_alpha);
+    }
+
+    if (platform::is_gpu_place(ctx.GetPlace())) {
+      CopyOutputsToGpuMemory(
+          ctx.device_context(), *emission_exps, *transition_exps, *alpha, *ll,
+          ctx.Output<Tensor>("EmissionExps"),
+          ctx.Output<Tensor>("TransitionExps"), ctx.Output<Tensor>("Alpha"),
+          ctx.Output<Tensor>("LogLikelihood"));
+    }
+  };
+
+ private:
+  void CopyInputsToCpuMemory(const platform::DeviceContext& ctx,
+                             const LoDTensor& emission_weights_src,
+                             const Tensor& transition_weights_src,
+                             const LoDTensor& label_src,
+                             LoDTensor* emission_weights_dst,
+                             Tensor* transition_weights_dst,
+                             LoDTensor* label_dst) const {
+    // Copy the inputs from GPU memory to CPU memory if this operators runs on
+    // GPU device.
+    auto copyLoDTensor = [](const platform::DeviceContext& ctx,
+                            const LoDTensor& src, LoDTensor* dst) {
+      dst->mutable_data<T>(src.dims(), platform::CPUPlace());
+      dst->CopyFrom(src, platform::CPUPlace(), ctx);
+    };
+
+    copyLoDTensor(ctx, emission_weights_src, emission_weights_dst);
+    copyLoDTensor(ctx, label_src, label_dst);
+
+    transition_weights_dst->mutable_data<T>(transition_weights_src.dims(),
+                                            platform::CPUPlace());
+    transition_weights_dst->CopyFrom(transition_weights_src,
+                                     platform::CPUPlace(), ctx);
+  }
+
+  void CopyOutputsToGpuMemory(const platform::DeviceContext& ctx,
+                              const Tensor& emission_exps_src,
+                              const Tensor& transition_exps_src,
+                              const Tensor& alpha_src, const Tensor& ll_src,
+                              Tensor* emission_exps_dst,
+                              Tensor* transition_exps_dst, Tensor* alpha_dst,
+                              Tensor* ll_dst) const {
+    // Copy the forward results from CPU memory to GPU memory if this
+    // operators runs on GPU device.
+    auto copyTensor = [](const platform::DeviceContext& ctx, const Tensor& src,
+                         Tensor* dst) {
+      dst->mutable_data<T>(platform::GPUPlace());
+      dst->CopyFrom(src, platform::GPUPlace(), ctx);
+    };
+    copyTensor(ctx, emission_exps_src, emission_exps_dst);
+    copyTensor(ctx, transition_exps_src, transition_exps_dst);
+    copyTensor(ctx, alpha_src, alpha_dst);
+    copyTensor(ctx, ll_src, ll_dst);
+  }
+
+  T ForwardOneSequence(const Tensor& emission, const Tensor& emission_row_max,
+                       const Tensor& emission_exps, const Tensor& trans_weights,
+                       const Tensor& trans_weight_exps, const Tensor& label,
+                       Tensor* alpha) const {
+    const T* x = emission.data<T>();
+    const T* x_row_max = emission_row_max.data<T>();
+    const T* x_exps = emission_exps.data<T>();
+    const T* w = trans_weights.data<T>();
+    const T* w_exps = trans_weight_exps.data<T>();
+    T* alpha_value = alpha->data<T>();
+
+    auto x_dims = emission.dims();
+    const size_t seq_length = x_dims[0];
+    const size_t tag_num = x_dims[1];
+    // The 1st row of w are transition weights for start mask.
+    // The 2nd row of w are transition weights for end mask.
+    // Transition weights between other tags begin from the 3rd row of w.
+    const size_t state_trans_base_idx = 2;
+
+    for (size_t i = 0; i < tag_num; ++i) {
+      alpha_value[i] = w_exps[i] * x_exps[i];
+    }
+    T ll = -x_row_max[0] - std::log(NormalizeL1<T>(alpha_value, tag_num));
+
+    for (size_t k = 1; k < seq_length; ++k) {
+      for (size_t i = 0; i < tag_num; ++i) {
+        T sum = 0.;
+        for (size_t j = 0; j < tag_num; ++j) {
+          sum += alpha_value[(k - 1) * tag_num + j] *  // (*)
+                 w_exps[(j + state_trans_base_idx) * tag_num + i];
+        }
+        alpha_value[k * tag_num + i] = x_exps[k * tag_num + i] * sum;
+      }
+      // NormalizeL1 is to avoid underflow or overflow at (*).
+      ll -= x_row_max[k] +
+            std::log(NormalizeL1<T>(alpha_value + k * tag_num, tag_num));
+    }
+    T sum = 0.;
+    for (size_t i = 0; i < tag_num; ++i) {
+      sum += alpha_value[(seq_length - 1) * tag_num + i] * w_exps[tag_num + i];
+    }
+    ll -= std::log(sum);
+    // Now ll is equal to -log(Z).
+
+    const int* lbl = label.data<int>();
+    PADDLE_ENFORCE_LT(
+        *std::max_element(lbl, lbl + seq_length), tag_num,
+        "An invalid tag label that execesses the largest tag number.");
+
+    // Calculate the nominator part, which depends on the label sequence.
+    ll += w[lbl[0]] /*start transition*/ + x[lbl[0]] +
+          w[tag_num + lbl[seq_length - 1]] /*end transition*/;
+    for (size_t k = 1; k < seq_length; ++k) {
+      ll += x[k * tag_num + lbl[k]] +
+            w[(lbl[k - 1] + state_trans_base_idx) * tag_num + lbl[k]];
+    }
+    return -ll;
+  }
+};
+
+template <typename Place, typename T>
+class LinearChainCRFGradOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    const size_t level = 0;  // currently, only support sequence.
+    auto lod = ctx.Input<LoDTensor>("Label")->lod();
+    PADDLE_ENFORCE(lod.size(), "Input(Label) must be a sequence.");
+
+    // These local variables hold the inputs and outputs, garanteeing them on
+    // CPU memory, to provide a consistent reference.
+    // TODO(caoying) Fix this by moving all these local variables into the
+    // class's data members once we can profile the training process, or
+    // implementing a real GPU kernel for CRF.
+    Tensor* label = nullptr;
+    Tensor label_tensor;
+    Tensor* emission_exps = nullptr;
+    Tensor emission_exps_tensor;
+    Tensor* transition_exps = nullptr;
+    Tensor transition_exps_tensor;
+    Tensor* alpha = nullptr;
+    Tensor alpha_tensor;
+    Tensor ll_grad_tensor;
+    T* ll_grad = nullptr;
+
+    Tensor* emission_grad = nullptr;
+    Tensor emission_grad_tensor;
+    Tensor* transition_grad = nullptr;
+    Tensor transition_grad_tensor;
+
+    if (platform::is_gpu_place(ctx.GetPlace())) {
+      label = &label_tensor;
+      emission_exps = &emission_exps_tensor;
+      transition_exps = &transition_exps_tensor;
+      alpha = &alpha_tensor;
+      CopyInputsToCpuMemory(
+          ctx.device_context(), *ctx.Input<LoDTensor>("Label"),
+          *ctx.Input<Tensor>("EmissionExps"),
+          *ctx.Input<Tensor>("TransitionExps"), *ctx.Input<Tensor>("Alpha"),
+          *ctx.Input<Tensor>(framework::GradVarName("LogLikelihood")), label,
+          emission_exps, transition_exps, alpha, &ll_grad_tensor);
+      ll_grad = ll_grad_tensor.data<T>();
+
+      if (ctx.Output<Tensor>(framework::GradVarName("Emission"))) {
+        emission_grad = &emission_grad_tensor;
+        emission_grad->Resize(emission_exps->dims());
+      }
+
+      if (ctx.Output<Tensor>(framework::GradVarName("Transition"))) {
+        transition_grad = &transition_grad_tensor;
+        transition_grad->Resize(transition_exps->dims());
+      }
+    } else {
+      label = const_cast<LoDTensor*>(ctx.Input<LoDTensor>("Label"));
+      emission_exps = const_cast<Tensor*>(ctx.Input<Tensor>("EmissionExps"));
+      transition_exps =
+          const_cast<Tensor*>(ctx.Input<Tensor>("TransitionExps"));
+      alpha = const_cast<Tensor*>(ctx.Input<Tensor>("Alpha"));
+      ll_grad = const_cast<Tensor*>(
+                    ctx.Input<Tensor>(framework::GradVarName("LogLikelihood")))
+                    ->data<T>();
+
+      emission_grad = ctx.Output<Tensor>(framework::GradVarName("Emission"));
+      transition_grad =
+          ctx.Output<Tensor>(framework::GradVarName("Transition"));
+    }
+
+    // TODO(caoying) Fix this constraint. When the Input(Emission) is from the
+    // data reader operator, it can have no gradients.
+    PADDLE_ENFORCE(emission_grad, "Output(Emission@Grad) should not be null.");
+    emission_grad->mutable_data<T>(platform::CPUPlace());
+    if (transition_grad) {
+      transition_grad->mutable_data<T>(platform::CPUPlace());
+      math::SetConstant<platform::CPUPlace, T>()(ctx.device_context(),
+                                                 transition_grad, 0.);
+    }
+    // Now, all the inputs and outputs should be on the CPU memory.
+
+    auto emission_dims = emission_exps->dims();
+    // Beta is the memo table used in dynamic programming to calculate the
+    // backwark vectors. For a backward vector i (the i-th row of beta), it
+    // captures the unnormalized probabilities of partial sequences starting
+    // at position i.
+    Tensor beta;
+    beta.mutable_data<T>(emission_dims, platform::CPUPlace());
+
+    for (size_t i = 0; i < lod[level].size() - 1; ++i) {
+      int start_pos = static_cast<int>(lod[level][i]);
+      int end_pos = static_cast<int>(lod[level][i + 1]);
+      if (end_pos == start_pos) continue;
+
+      const Tensor one_seq_emission_exps =
+          emission_exps->Slice(start_pos, end_pos);
+      const Tensor one_seq_label = label->Slice(start_pos, end_pos);
+      const Tensor one_seq_alpha = alpha->Slice(start_pos, end_pos);
+      Tensor one_seq_beta = beta.Slice(start_pos, end_pos);
+      Tensor one_seq_emission_grad = emission_grad->Slice(start_pos, end_pos);
+
+      BackwardOneSequence(ctx.device_context(), ll_grad[i],
+                          one_seq_emission_exps, *transition_exps,
+                          one_seq_alpha, one_seq_label, &one_seq_beta,
+                          transition_grad, &one_seq_emission_grad);
+    }
+
+    if (platform::is_gpu_place(ctx.GetPlace())) {
+      CopyOutputsToGpuMemory(
+          ctx.device_context(), emission_grad, transition_grad,
+          ctx.Output<Tensor>(framework::GradVarName("Emission")),
+          ctx.Output<Tensor>(framework::GradVarName("Transition")));
+    }
+  };
+
+ private:
+  void CopyInputsToCpuMemory(const platform::DeviceContext& ctx,
+                             const LoDTensor& label_src,
+                             const Tensor& emission_exps_src,
+                             const Tensor& transition_exps_src,
+                             const Tensor& alpha_src, const Tensor& ll_grad_src,
+                             Tensor* label_dst, Tensor* emission_exps_dst,
+                             Tensor* transition_exps_dst, Tensor* alpha_dst,
+                             Tensor* ll_grad_dst) const {
+    // Copy the inputs from GPU memory to CPU memory when this operators runs on
+    // GPU device.
+    label_dst->mutable_data<T>(label_src.dims(), platform::CPUPlace());
+    label_dst->CopyFrom(label_src, platform::CPUPlace(), ctx);
+
+    auto copyTensor = [](const platform::DeviceContext& ctx, const Tensor& src,
+                         Tensor* dst) {
+      dst->mutable_data<T>(src.dims(), platform::CPUPlace());
+      dst->CopyFrom(src, platform::CPUPlace(), ctx);
+    };
+    copyTensor(ctx, emission_exps_src, emission_exps_dst);
+    copyTensor(ctx, transition_exps_src, transition_exps_dst);
+    copyTensor(ctx, alpha_src, alpha_dst);
+    copyTensor(ctx, ll_grad_src, ll_grad_dst);
+  }
+
+  void CopyOutputsToGpuMemory(const platform::DeviceContext& ctx,
+                              const Tensor* emission_grad_src,
+                              const Tensor* transition_grad_src,
+                              Tensor* emission_grad_dst,
+                              Tensor* transition_grad_dst) const {
+    // Copy the backward results from CPU memory to GPU
+    // memory if this operators runs on GPU device.
+    auto copyTensor = [](const platform::DeviceContext& ctx, const Tensor* src,
+                         Tensor* dst) {
+      if (src && dst) {
+        dst->mutable_data<T>(platform::GPUPlace());
+        dst->CopyFrom(*src, platform::GPUPlace(), ctx);
+      }
+    };
+    copyTensor(ctx, emission_grad_src, emission_grad_dst);
+    copyTensor(ctx, transition_grad_src, transition_grad_dst);
+  }
+
+  void BackwardOneSequence(const platform::DeviceContext& ctx, const T ll_grad,
+                           const Tensor& emission_exps,
+                           const Tensor& transition_exps, const Tensor& alpha,
+                           const Tensor& label, Tensor* beta,
+                           Tensor* transition_grad,
+                           Tensor* emission_grad) const {
+    const T* w_exps = transition_exps.data<T>();
+    const T* x_exps = emission_exps.data<T>();
+    const int* label_value = label.data<int>();
+    T* beta_value = beta->data<T>();
+
+    auto x_dims = emission_exps.dims();
+    const size_t seq_length = x_dims[0];
+    const size_t tag_num = x_dims[1];
+    const size_t state_trans_base_idx = 2;
+
+    // Calculate the backward vectors: beta.
+    // First, calculate the initialition state.
+    for (size_t i = 0; i < tag_num; ++i) {
+      beta_value[(seq_length - 1) * tag_num + i] = w_exps[tag_num + i];
+    }
+    NormalizeL1<T>(beta_value + (seq_length - 1) * tag_num, tag_num);
+    for (int k = static_cast<int>(seq_length) - 2; k >= 0; --k) {
+      for (size_t i = 0; i < tag_num; ++i) {
+        T sum = 0.;
+        for (size_t j = 0; j < tag_num; ++j) {
+          sum += w_exps[(i + state_trans_base_idx) * tag_num + j] *  // (**)
+                 x_exps[(k + 1) * tag_num + j] *
+                 beta_value[(k + 1) * tag_num + j];
+        }
+        beta_value[k * tag_num + i] = sum;
+      }
+      // NormalizeL1 is to avoid underflow or overflow at (**).
+      NormalizeL1<T>(beta_value + k * tag_num, tag_num);
+    }
+
+    auto x_grad_mat = EigenMatrix<T>::From(*emission_grad);
+    auto alpha_mat = EigenMatrix<T>::From(alpha);
+    auto beta_mat = EigenMatrix<T>::From(*beta);
+
+    auto* place = ctx.GetEigenDevice<platform::CPUPlace>();
+    auto prob = alpha_mat * beta_mat;
+    auto row_sum = prob.sum(Eigen::DSizes<int, 1>(1))
+                       .reshape(Eigen::DSizes<int, 2>(seq_length, 1))
+                       .broadcast(Eigen::DSizes<int, 2>(1, tag_num));
+    x_grad_mat.device(*place) =
+        (prob / row_sum).unaryExpr(ScalarMul<T>(ll_grad));
+
+    for (size_t k = 0; k < seq_length; ++k) {
+      x_grad_mat(k, label_value[k]) -= static_cast<T>(ll_grad);
+    }
+
+    if (transition_grad) {
+      T* trans_grad = transition_grad->data<T>();
+      for (size_t k = 0; k < tag_num; ++k) {
+        // Do not multiply by the output gradient here, because x_grad_mat has
+        // alrealy done this.
+        trans_grad[k] += x_grad_mat(/*from start state*/ 0, k);
+        trans_grad[tag_num + k] +=
+            x_grad_mat(/*to end state*/ seq_length - 1, k);
+      }
+
+      auto x_exps_mat = EigenMatrix<T>::From(emission_exps);
+
+      // TODO(caoying): Fix this to avoid using this local variable if we can
+      // profile the training process.
+      Tensor tmp;
+      tmp.mutable_data<T>(beta->dims(), platform::CPUPlace());
+      auto tmp_mat = EigenMatrix<T>::From(tmp);
+      auto prob = beta_mat * x_exps_mat;
+      auto row_sum = prob.sum(Eigen::DSizes<int, 1>(1))
+                         .reshape(Eigen::DSizes<int, 2>(seq_length, 1))
+                         .broadcast(Eigen::DSizes<int, 2>(1, tag_num));
+      tmp_mat.device(*place) = prob / row_sum;
+
+      for (size_t k = 1; k < seq_length; ++k) {
+        T sum = 0.;
+        for (size_t i = 0; i < tag_num; ++i) {
+          for (size_t j = 0; j < tag_num; ++j) {
+            sum += w_exps[(i + state_trans_base_idx) * tag_num + j] *  // (**)
+                   alpha_mat(k - 1, i) * tmp_mat(k, j);
+          }
+        }
+        sum = 1. / sum;
+        for (size_t i = 0; i < tag_num; ++i) {
+          for (size_t j = 0; j < tag_num; ++j) {
+            trans_grad[(i + state_trans_base_idx) * tag_num + j] +=
+                sum * w_exps[(i + state_trans_base_idx) * tag_num + j] *
+                alpha_mat(k - 1, i) * tmp_mat(k, j) * ll_grad;
+          }
+        }
+        trans_grad[(label_value[k - 1] + state_trans_base_idx) * tag_num +
+                   label_value[k]] -= static_cast<T>(ll_grad);
+      }
+    }
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
--- a/paddle/operators/lookup_table_op.cc
+++ b/paddle/operators/lookup_table_op.cc
@@ -43,7 +43,7 @@ class LookupTableOp : public framework::OperatorWithKernel {
 protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
-    return framework::ToDataType(ctx.Input<Tensor>("W")->type());
+    return framework::ToDataType(ctx.Input<LoDTensor>("W")->type());
  }
 };

@@ -93,7 +93,7 @@ class LookupTableOpGrad : public framework::OperatorWithKernel {
 protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
-    return framework::ToDataType(ctx.Input<Tensor>("W")->type());
+    return framework::ToDataType(ctx.Input<LoDTensor>("W")->type());
  }
 };


--- a/paddle/operators/lookup_table_op.cu
+++ b/paddle/operators/lookup_table_op.cu
@@ -61,16 +61,16 @@ template <typename T>
 class LookupTableCUDAKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
-    auto table_t = context.Input<Tensor>("W");
-    auto ids_t = context.Input<Tensor>("Ids");
-    auto output_t = context.Output<Tensor>("Out");
+    auto* table_t = context.Input<LoDTensor>("W");
+    auto* ids_t = context.Input<LoDTensor>("Ids");
+    auto* output_t = context.Output<LoDTensor>("Out");

    size_t N = table_t->dims()[0];
    size_t D = table_t->dims()[1];
    size_t K = ids_t->numel();
-    auto ids = ids_t->data<int64_t>();
-    auto table = table_t->data<T>();
-    auto output = output_t->mutable_data<T>(context.GetPlace());
+    auto* ids = ids_t->data<int64_t>();
+    auto* table = table_t->data<T>();
+    auto* output = output_t->mutable_data<T>(context.GetPlace());

    dim3 threads(128, 8);
    dim3 grids(8, 1);
@@ -87,9 +87,9 @@ class LookupTableGradCUDAKernel : public framework::OpKernel<T> {
  void Compute(const framework::ExecutionContext& context) const override {
    bool is_sparse = context.Attr<bool>("is_sparse");
    if (is_sparse) {
-      auto* ids = context.Input<Tensor>("Ids");
-      auto* table = context.Input<Tensor>("W");
-      auto* d_output = context.Input<Tensor>(framework::GradVarName("Out"));
+      auto* ids = context.Input<LoDTensor>("Ids");
+      auto* table = context.Input<LoDTensor>("W");
+      auto* d_output = context.Input<LoDTensor>(framework::GradVarName("Out"));
      auto* d_table = context.Output<SelectedRows>(framework::GradVarName("W"));

      auto* ids_data = ids->data<int64_t>();
@@ -116,12 +116,12 @@ class LookupTableGradCUDAKernel : public framework::OpKernel<T> {
      auto* d_output_data = d_output->data<T>();
      PADDLE_ENFORCE_EQ(d_table_value->dims(), d_output->dims());
      memory::Copy(gpu_place, d_table_data, gpu_place, d_output_data,
-                   d_output->numel(), stream);
+                   d_output->numel() * sizeof(T), stream);

    } else {
-      auto ids_t = context.Input<Tensor>("Ids");
-      auto d_output_t = context.Input<Tensor>(framework::GradVarName("Out"));
-      auto d_table_t = context.Output<Tensor>(framework::GradVarName("W"));
+      auto ids_t = context.Input<LoDTensor>("Ids");
+      auto d_output_t = context.Input<LoDTensor>(framework::GradVarName("Out"));
+      auto d_table_t = context.Output<LoDTensor>(framework::GradVarName("W"));

      int N = d_table_t->dims()[0];
      int D = d_table_t->dims()[1];

--- a/paddle/operators/lookup_table_op.h
+++ b/paddle/operators/lookup_table_op.h
@@ -19,22 +19,22 @@
 namespace paddle {
 namespace operators {

-using Tensor = framework::Tensor;
+using LoDTensor = framework::LoDTensor;
 using SelectedRows = framework::SelectedRows;

 template <typename T>
 class LookupTableKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
-    auto table_t = context.Input<Tensor>("W");      // float tensor
-    auto ids_t = context.Input<Tensor>("Ids");      // int tensor
-    auto output_t = context.Output<Tensor>("Out");  // float tensor
+    auto* table_t = context.Input<LoDTensor>("W");      // float tensor
+    auto* ids_t = context.Input<LoDTensor>("Ids");      // int tensor
+    auto* output_t = context.Output<LoDTensor>("Out");  // float tensor

    int N = table_t->dims()[0];
    int D = table_t->dims()[1];
-    auto ids = ids_t->data<int64_t>();
-    auto table = table_t->data<T>();
-    auto output = output_t->mutable_data<T>(context.GetPlace());
+    auto* ids = ids_t->data<int64_t>();
+    auto* table = table_t->data<T>();
+    auto* output = output_t->mutable_data<T>(context.GetPlace());
    for (int64_t i = 0; i < ids_t->numel(); ++i) {
      PADDLE_ENFORCE_LT(ids[i], N);
      PADDLE_ENFORCE_GE(ids[i], 0);
@@ -49,9 +49,9 @@ class LookupTableGradKernel : public framework::OpKernel<T> {
  void Compute(const framework::ExecutionContext& context) const override {
    bool is_sparse = context.Attr<bool>("is_sparse");
    if (is_sparse) {
-      auto* ids = context.Input<Tensor>("Ids");
-      auto* table = context.Input<Tensor>("W");
-      auto* d_output = context.Input<Tensor>(framework::GradVarName("Out"));
+      auto* ids = context.Input<LoDTensor>("Ids");
+      auto* table = context.Input<LoDTensor>("W");
+      auto* d_output = context.Input<LoDTensor>(framework::GradVarName("Out"));
      auto* d_table = context.Output<SelectedRows>(framework::GradVarName("W"));

      auto* ids_data = ids->data<int64_t>();
@@ -76,10 +76,10 @@ class LookupTableGradKernel : public framework::OpKernel<T> {
      PADDLE_ENFORCE_EQ(d_table_value->dims(), d_output->dims());
      memcpy(d_table_data, d_output_data, sizeof(T) * d_output->numel());
    } else {
-      auto* ids = context.Input<Tensor>("Ids");
-      auto* d_output = context.Input<Tensor>(framework::GradVarName("Out"));
-      auto* d_table = context.Output<Tensor>(framework::GradVarName("W"));
-      auto* table = context.Input<Tensor>("W");
+      auto* ids = context.Input<LoDTensor>("Ids");
+      auto* d_output = context.Input<LoDTensor>(framework::GradVarName("Out"));
+      auto* d_table = context.Output<LoDTensor>(framework::GradVarName("W"));
+      auto* table = context.Input<LoDTensor>("W");

      auto* ids_data = ids->data<int64_t>();
      auto ids_dim = ids->dims();
@@ -90,11 +90,13 @@ class LookupTableGradKernel : public framework::OpKernel<T> {
      auto* d_output_data = d_output->data<T>();
      auto* d_table_data = d_table->mutable_data<T>(context.GetPlace());

+      memset(d_table_data, 0, d_table->numel() * sizeof(T));
+
      for (int64_t i = 0; i < ids->numel(); ++i) {
        PADDLE_ENFORCE_LT(ids_data[i], N);
        PADDLE_ENFORCE_GE(ids_data[i], 0);
        for (int j = 0; j < D; ++j) {
-          d_table_data[ids_data[i] * D + j] = d_output_data[i * D + j];
+          d_table_data[ids_data[i] * D + j] += d_output_data[i * D + j];
        }
      }
    }

--- a/paddle/operators/lstm_op.cc
+++ b/paddle/operators/lstm_op.cc
@@ -21,7 +21,6 @@ class LSTMOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;

- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Input"),
                   "Input(Input) of LSTM should not be null.");
@@ -29,9 +28,13 @@ class LSTMOp : public framework::OperatorWithKernel {
                   "Output(Hidden) of LSTM should not be null.");
    PADDLE_ENFORCE(ctx->HasOutput("Cell"),
                   "Output(Cell) of LSTM should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("BatchGate"),
+                   "Output(BatchGate) of LSTM should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("BatchCellPreAct"),
+                   "Output(BatchGate) of LSTM should not be null.");

-    auto x_dims = ctx->GetInputDim("Input");
-    PADDLE_ENFORCE_EQ(x_dims.size(), 2, "Input(X)'s rank must be 2.");
+    auto in_dims = ctx->GetInputDim("Input");
+    PADDLE_ENFORCE_EQ(in_dims.size(), 2, "Input(X)'s rank must be 2.");

    if (ctx->HasInput("H0")) {
      PADDLE_ENFORCE(ctx->HasInput("C0"),
@@ -44,7 +47,7 @@ class LSTMOp : public framework::OperatorWithKernel {
                     "should be the same.");
    }

-    int frame_size = x_dims[1] / 4;
+    int frame_size = in_dims[1] / 4;
    auto w_dims = ctx->GetInputDim("Weight");
    PADDLE_ENFORCE_EQ(w_dims.size(), 2,
                      "The rank of Input(Weight) should be 2.");
@@ -71,12 +74,21 @@ class LSTMOp : public framework::OperatorWithKernel {
                        "4 * %d if disable peepholes connection",
                        frame_size);
    }
-    ctx->SetOutputDim("Hidden", {x_dims[0], frame_size});
-    ctx->SetOutputDim("Cell", {x_dims[0], frame_size});
-    ctx->SetOutputDim("BatchGate", x_dims);
+    framework::DDim out_dims({in_dims[0], frame_size});
+    ctx->SetOutputDim("Hidden", out_dims);
+    ctx->SetOutputDim("Cell", out_dims);
+    ctx->SetOutputDim("BatchGate", in_dims);
+    ctx->SetOutputDim("BatchCellPreAct", out_dims);
    ctx->ShareLoD("Input", "Hidden");
    ctx->ShareLoD("Input", "Cell");
  }
+
+ protected:
+  framework::DataType IndicateDataType(
+      const framework::ExecutionContext& ctx) const override {
+    return framework::ToDataType(
+        ctx.Input<framework::LoDTensor>("Input")->type());
+  }
 };

 class LSTMOpMaker : public framework::OpProtoAndCheckerMaker {
@@ -86,16 +98,18 @@ class LSTMOpMaker : public framework::OpProtoAndCheckerMaker {
    AddInput("Input",
             "(LoDTensor) the first input is a LodTensor, which support "
             "variable-time length input sequence. The underlying tensor in "
-             "this LoDTensor is a matrix with shape (T X 4D), where, T is the "
+             "this LoDTensor is a matrix with shape (T X 4D), where T is the "
             "total time steps in this mini-batch, D is the hidden size.");
    AddInput("H0",
             "(Tensor, optional) the initial hidden state is an optional "
             "input. This is a tensor with shape (N x D), where N is the "
-             "batch size, D is the hidden size.");
+             "batch size, D is the hidden size.")
+        .AsDispensable();
    AddInput("C0",
             "(Tensor, optional) the initial cell state is an optional "
             "input. This is a tensor with shape (N x D), where N is the "
-             "batch size. `H0` and `C0` can be NULL but only at the same time");
+             "batch size. `H0` and `C0` can be NULL but only at the same time")
+        .AsDispensable();
    AddInput("Weight",
             "(Tensor) the learnable hidden-hidden weights."
             " - The shape is (D x 4D), where D is the hidden size. "
@@ -109,22 +123,27 @@ class LSTMOpMaker : public framework::OpProtoAndCheckerMaker {
             " - Bias = {b_c, b_i, b_f, b_o}."
             "2. `usePeepholes = True` "
             " - The shape is (1 x 7D). "
-             " - Bias = {b_c, b_i, b_f, b_o, W_ic, W_fc, W_oc}.");
+             " - Bias = {b_c, b_i, b_f, b_o, W_ic, W_fc, W_oc}.")
+        .AsDispensable();
+    AddOutput("Hidden",
+              "(LoDTensor) the hidden state of LSTM operator. "
+              "The shape is (T x D), and lod is the same with the `Input`.");
+    AddOutput("Cell",
+              "(LoDTensor) the cell state of LSTM operator. "
+              "The shape is (T x D), and lod is the same with the `Input`.");
    AddOutput("BatchGate",
              "(LoDTensor) This LoDTensor contains input gate, forget gate "
              "and output gate after the nonlinear computation. This "
              "LoDTensor has the same shape with the reorganized input, which "
-              "was also be called batch input. The LoD size is 2. The first "
+              "is also be called batch input. The LoD size is 2. The first "
              "LoD is the batch offsets and the second LoD contains the "
              "indexes, which denote the position of reorganized sequence "
              "in the raw input.")
        .AsIntermediate();
-    AddOutput("Hidden",
-              "(LoDTensor) the hidden state lod tensor of LSTM operator. "
-              "The shape and lod is the same with the `Input`.");
-    AddOutput("Cell",
-              "(LoDTensor) the cell state lod tensor of LSTM operator. "
-              "The shape and lod is the same with the `Input`.");
+    AddOutput("BatchCellPreAct",
+              "(LoDTensor) This LoDTensor is got in the forward and used "
+              "in the backward.")
+        .AsIntermediate();
    AddAttr<bool>("usePeepholes",
                  "(bool, defalut: True) "
                  "whether to enable diagonal/peephole connections.")
@@ -202,15 +221,37 @@ class LSTMGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;

- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
-    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Hidden")),
-                   "Input(Hidden@GRAD) should not be null");
-    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Cell")),
-                   "Input(Cell@GRAD) should not be null");
-    ctx->SetOutputDim(framework::GradVarName("Weight"),
-                      ctx->GetInputDim("Weight"));
-    ctx->SetOutputDim(framework::GradVarName("Bias"), ctx->GetInputDim("Bias"));
+    PADDLE_ENFORCE(ctx->HasInput("Input"),
+                   "Input(Input) of LSTM should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Hidden"),
+                   "Input(Hidden) of LSTM should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Cell"),
+                   "Input(Cell) of LSTM should not be null.");
+
+    PADDLE_ENFORCE(ctx->HasInput("BatchGate"),
+                   "Input(BatchGate) of LSTM should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("BatchCellPreAct"),
+                   "Input(BatchGate) of LSTM should not be null.");
+
+    auto in_g_name = framework::GradVarName("Input");
+    if (ctx->HasOutput(in_g_name))
+      ctx->SetOutputDim(in_g_name, ctx->GetInputDim("Input"));
+
+    auto w_g_name = framework::GradVarName("Weight");
+    if (ctx->HasOutput(w_g_name))
+      ctx->SetOutputDim(w_g_name, ctx->GetInputDim("Weight"));
+
+    auto b_g_name = framework::GradVarName("Bias");
+    if (ctx->HasOutput(b_g_name))
+      ctx->SetOutputDim(b_g_name, ctx->GetInputDim("Bias"));
+  }
+
+ protected:
+  framework::DataType IndicateDataType(
+      const framework::ExecutionContext& ctx) const override {
+    return framework::ToDataType(
+        ctx.Input<framework::LoDTensor>("Input")->type());
  }
 };


--- a/paddle/operators/lstm_op.h
+++ b/paddle/operators/lstm_op.h
@@ -21,8 +21,9 @@ limitations under the License. */
 namespace paddle {
 namespace operators {

-using framework::LoDTensor;
-using framework::Tensor;
+using LoDTensor = framework::LoDTensor;
+using Tensor = framework::Tensor;
+
 template <typename T, int MajorType = Eigen::RowMajor,
          typename IndexType = Eigen::DenseIndex>
 using EigenMatrix = framework::EigenMatrix<T, MajorType, IndexType>;
@@ -31,15 +32,15 @@ template <typename Place, typename T>
 class LSTMKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
-    auto* input = ctx.Input<framework::LoDTensor>("Input");
-    auto* weight = ctx.Input<framework::Tensor>("Weight");
-    auto* bias = ctx.Input<framework::Tensor>("Bias");
+    auto* input = ctx.Input<LoDTensor>("Input");
+    auto* weight = ctx.Input<Tensor>("Weight");
+    auto* bias = ctx.Input<Tensor>("Bias");

-    auto* batch_gate = ctx.Output<framework::LoDTensor>("BatchGate");
+    auto* batch_gate = ctx.Output<LoDTensor>("BatchGate");
    batch_gate->mutable_data<T>(ctx.GetPlace());
-    auto* hidden_out = ctx.Output<framework::LoDTensor>("Hidden");
+    auto* hidden_out = ctx.Output<LoDTensor>("Hidden");
    hidden_out->mutable_data<T>(ctx.GetPlace());
-    auto* cell_out = ctx.Output<framework::LoDTensor>("Cell");
+    auto* cell_out = ctx.Output<LoDTensor>("Cell");
    cell_out->mutable_data<T>(ctx.GetPlace());

    // Now the function ShareLoD in InferShape is not implemented.
@@ -49,7 +50,8 @@ class LSTMKernel : public framework::OpKernel<T> {

    bool is_reverse = ctx.Attr<bool>("isReverse");
    math::LoDTensor2BatchFunctor<Place, T> to_batch;
-    to_batch(ctx.device_context(), *input, *batch_gate, is_reverse);
+    auto& device_ctx = ctx.device_context();
+    to_batch(device_ctx, *input, *batch_gate, true, is_reverse);

    auto in_dims = input->dims();
    int frame_size = static_cast<int>(in_dims[1] / 4);
@@ -69,17 +71,26 @@ class LSTMKernel : public framework::OpKernel<T> {
    }

    math::LstmMetaValue<T> lstm_value;
-    T* bias_data = const_cast<T*>(bias->data<T>());
-    // the code style in LstmMetaValue will be updated later.
-    lstm_value.checkIg = bias_data + 4 * frame_size;
-    lstm_value.checkFg = lstm_value.checkIg + frame_size;
-    lstm_value.checkOg = lstm_value.checkFg + frame_size;
+    if (bias) {
+      T* bias_data = const_cast<T*>(bias->data<T>());
+      // the code style in LstmMetaValue will be updated later.
+
+      lstm_value.checkIg = bias_data + 4 * frame_size;
+      lstm_value.checkFg = lstm_value.checkIg + frame_size;
+      lstm_value.checkOg = lstm_value.checkFg + frame_size;
+    } else {
+      lstm_value.checkIg = nullptr;
+      lstm_value.checkFg = nullptr;
+      lstm_value.checkOg = nullptr;
+    }
    lstm_value.prevStateValue = nullptr;

-    framework::LoDTensor batch_out, batch_cell, batch_cell_pre_act;
-    batch_out.mutable_data<T>(dims, ctx.GetPlace());
+    // Use the local variable as here.
+    LoDTensor batch_hidden, batch_cell;
+    auto* batch_cell_pre_act = ctx.Output<LoDTensor>("BatchCellPreAct");
+    batch_hidden.mutable_data<T>(dims, ctx.GetPlace());
    batch_cell.mutable_data<T>(dims, ctx.GetPlace());
-    batch_cell_pre_act.mutable_data<T>(dims, ctx.GetPlace());
+    batch_cell_pre_act->mutable_data<T>(dims, ctx.GetPlace());

    auto batch_starts = batch_gate->lod()[0];
    size_t num_batch = batch_starts.size() - 1;
@@ -92,18 +103,18 @@ class LSTMKernel : public framework::OpKernel<T> {
      int bend = static_cast<int>(batch_starts[n + 1]);

      Tensor gate_t = batch_gate->Slice(bstart, bend);
-      Tensor out_t = batch_out.Slice(bstart, bend);
+      Tensor out_t = batch_hidden.Slice(bstart, bend);
      Tensor cell_t = batch_cell.Slice(bstart, bend);
-      Tensor cell_pre_act_t = batch_cell_pre_act.Slice(bstart, bend);
+      Tensor cell_pre_act_t = batch_cell_pre_act->Slice(bstart, bend);

      int cur_batch_size = bend - bstart;

      if (n != 0) {
        int pre_h_start = static_cast<int>(batch_starts[n - 1]);
        int pre_h_end = pre_h_start + cur_batch_size;
-        auto pre_hidden_t = batch_out.Slice(pre_h_start, pre_h_end);
-        math::matmul<Place, T>(ctx.device_context(), pre_hidden_t, false,
-                               *weight, false, static_cast<T>(1.0), &gate_t,
+        auto pre_hidden_t = batch_hidden.Slice(pre_h_start, pre_h_end);
+        math::matmul<Place, T>(device_ctx, pre_hidden_t, false, *weight, false,
+                               static_cast<T>(1.0), &gate_t,
                               static_cast<T>(1.0));
      }
      // else if : FIXME support the initial hidden and cell
@@ -112,27 +123,186 @@ class LSTMKernel : public framework::OpKernel<T> {
      lstm_value.outputValue = out_t.data<T>();
      lstm_value.stateValue = cell_t.data<T>();
      lstm_value.stateActiveValue = cell_pre_act_t.data<T>();
-      math::LstmUnitFunctor<Place, T>::compute(ctx.device_context(), lstm_value,
+      math::LstmUnitFunctor<Place, T>::compute(device_ctx, lstm_value,
                                               frame_size, cur_batch_size,
                                               gate_act, cell_act, cand_act);
      lstm_value.prevStateValue = lstm_value.stateValue;
    }

    math::Batch2LoDTensorFunctor<Place, T> to_seq;
-    batch_out.set_lod(batch_gate->lod());
+    batch_hidden.set_lod(batch_gate->lod());
    // restore the output hidden in LoDTensor from the batch hidden
-    to_seq(ctx.device_context(), batch_out, *hidden_out);
+    to_seq(device_ctx, batch_hidden, *hidden_out);

    batch_cell.set_lod(batch_gate->lod());
    // restore the output cell state in LoDTensor from the batch cell
-    to_seq(ctx.device_context(), batch_cell, *cell_out);
+    to_seq(device_ctx, batch_cell, *cell_out);
  }
 };

 template <typename Place, typename T>
 class LSTMGradKernel : public framework::OpKernel<T> {
 public:
-  void Compute(const framework::ExecutionContext& ctx) const override {}
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto* input = ctx.Input<LoDTensor>("Input");
+    auto* weight = ctx.Input<Tensor>("Weight");
+    auto* bias = ctx.Input<Tensor>("Bias");
+
+    auto* hidden_out = ctx.Input<LoDTensor>("Hidden");
+    auto* cell_out = ctx.Input<LoDTensor>("Cell");
+
+    auto* batch_gate = ctx.Input<LoDTensor>("BatchGate");
+    auto* batch_cell_pre_act = ctx.Input<LoDTensor>("BatchCellPreAct");
+
+    auto* hidden_g = ctx.Input<LoDTensor>(framework::GradVarName("Hidden"));
+
+    auto* in_g = ctx.Output<LoDTensor>(framework::GradVarName("Input"));
+    auto* weight_g = ctx.Output<Tensor>(framework::GradVarName("Weight"));
+    auto* bias_g = ctx.Output<Tensor>(framework::GradVarName("Bias"));
+
+    auto& device_ctx = ctx.device_context();
+    math::SetConstant<Place, T> zero;
+    if (weight_g) {
+      weight_g->mutable_data<T>(ctx.GetPlace());
+      zero(device_ctx, weight_g, static_cast<T>(0.0));
+    }
+
+    auto in_dims = input->dims();
+    auto out_dims = hidden_g->dims();
+    int frame_size = static_cast<int>(in_dims[1] / 4);
+    PADDLE_ENFORCE_EQ(frame_size, out_dims[1]);
+
+    math::LstmMetaValue<T> lstm_value;
+    if (bias) {
+      T* bias_data = const_cast<T*>(bias->data<T>());
+      lstm_value.checkIg = bias_data + 4 * frame_size;
+      lstm_value.checkFg = lstm_value.checkIg + frame_size;
+      lstm_value.checkOg = lstm_value.checkFg + frame_size;
+    } else {
+      lstm_value.checkIg = nullptr;
+      lstm_value.checkFg = nullptr;
+      lstm_value.checkOg = nullptr;
+    }
+
+    math::LstmMetaGrad<T> lstm_grad;
+    if (bias && bias_g) {
+      T* bias_g_data = const_cast<T*>(bias_g->mutable_data<T>(ctx.GetPlace()));
+      zero(device_ctx, bias_g, static_cast<T>(0.0));
+      lstm_grad.checkIgGrad = bias_g_data + 4 * frame_size;
+      lstm_grad.checkFgGrad = lstm_grad.checkIgGrad + frame_size;
+      lstm_grad.checkOgGrad = lstm_grad.checkFgGrad + frame_size;
+    } else {
+      lstm_grad.checkIgGrad = nullptr;
+      lstm_grad.checkFgGrad = nullptr;
+      lstm_grad.checkOgGrad = nullptr;
+    }
+
+    math::LoDTensor2BatchFunctor<Place, T> to_batch;
+
+    // use the local variable as here.
+    LoDTensor batch_hidden;
+    batch_hidden.mutable_data<T>(out_dims, ctx.GetPlace());
+    batch_hidden.set_lod(batch_gate->lod());
+    to_batch(device_ctx, *hidden_out, batch_hidden, false);
+
+    LoDTensor batch_hidden_g;
+    batch_hidden_g.mutable_data<T>(out_dims, ctx.GetPlace());
+    batch_hidden_g.set_lod(batch_gate->lod());
+    to_batch(device_ctx, *hidden_g, batch_hidden_g, false);
+
+    LoDTensor batch_cell;
+    batch_cell.mutable_data<T>(out_dims, ctx.GetPlace());
+    batch_cell.set_lod(batch_gate->lod());
+    to_batch(device_ctx, *cell_out, batch_cell, false);
+
+    LoDTensor batch_cell_g;
+    batch_cell_g.mutable_data<T>(out_dims, ctx.GetPlace());
+    batch_cell_g.set_lod(batch_gate->lod());
+    // TODO(qingqing) support the case output cell has gradient.
+    // to_batch(device_ctx, *cell_g, batch_cell_g, false);
+    zero(device_ctx, &batch_cell_g, static_cast<T>(0.0));
+
+    LoDTensor batch_gate_g;
+    batch_gate_g.mutable_data<T>(batch_gate->dims(), ctx.GetPlace());
+    batch_gate_g.set_lod(batch_gate->lod());
+
+    auto gate_act = ctx.Attr<std::string>("gateActivation");
+    auto cell_act = ctx.Attr<std::string>("cellActivation");
+    auto cand_act = ctx.Attr<std::string>("candidateActivation");
+
+    auto batch_starts = batch_gate->lod()[0];
+    size_t num_batch = batch_starts.size() - 1;
+    for (int n = static_cast<int>(num_batch) - 1; n >= 0; n--) {
+      int bstart = static_cast<int>(batch_starts[n]);
+      int bend = static_cast<int>(batch_starts[n + 1]);
+
+      Tensor gate = batch_gate->Slice(bstart, bend);
+      Tensor cell = batch_cell.Slice(bstart, bend);
+      Tensor cell_pre_act = batch_cell_pre_act->Slice(bstart, bend);
+      lstm_value.gateValue = gate.data<T>();
+      lstm_value.stateValue = cell.data<T>();
+      lstm_value.stateActiveValue = cell_pre_act.data<T>();
+
+      Tensor out_g = batch_hidden_g.Slice(bstart, bend);
+      Tensor gate_g = batch_gate_g.Slice(bstart, bend);
+      Tensor cell_g = batch_cell_g.Slice(bstart, bend);
+      lstm_grad.stateGrad = cell_g.data<T>();
+      lstm_grad.gateGrad = gate_g.data<T>();
+      lstm_grad.outputGrad = out_g.data<T>();
+
+      if (n) {
+        int bstart_pre = static_cast<int>(batch_starts[n - 1]);
+        Tensor cell_pre = batch_cell.Slice(bstart_pre, bstart);
+        Tensor cell_pre_g = batch_cell_g.Slice(bstart_pre, bstart);
+        lstm_value.prevStateValue = cell_pre.data<T>();
+        lstm_grad.prevStateGrad = cell_pre_g.data<T>();
+      } else {
+        lstm_value.prevStateValue = nullptr;
+        lstm_grad.prevStateGrad = nullptr;
+      }
+
+      int cur_batch_size = bend - bstart;
+      math::LstmUnitGradFunctor<Place, T>::compute(
+          device_ctx, lstm_value, lstm_grad, frame_size, cur_batch_size,
+          gate_act, cell_act, cand_act);
+
+      if (n != 0) {
+        int pre_h_start = static_cast<int>(batch_starts[n - 1]);
+        int pre_h_end = pre_h_start + cur_batch_size;
+        auto pre_hidden_g = batch_hidden_g.Slice(pre_h_start, pre_h_end);
+        math::matmul<Place, T>(device_ctx, gate_g, false, *weight, true,
+                               static_cast<T>(1.0), &pre_hidden_g,
+                               static_cast<T>(1.0));
+        if (weight_g) {
+          /* backward weight */
+          auto pre_hidden = batch_hidden.Slice(pre_h_start, pre_h_end);
+          math::matmul<Place, T>(device_ctx, pre_hidden, true, gate_g, false,
+                                 static_cast<T>(1.0), weight_g,
+                                 static_cast<T>(1.0));
+        }
+      }
+    }
+
+    math::Batch2LoDTensorFunctor<Place, T> to_seq;
+    if (in_g) {
+      /* backward data */
+      in_g->mutable_data<T>(ctx.GetPlace());
+      to_seq(device_ctx, batch_gate_g, *in_g);
+    }
+    if (bias && bias_g) {
+      /* backward bias */
+      int m = static_cast<int>(batch_gate_g.dims()[0]);
+      int n = static_cast<int>(batch_gate_g.dims()[1]);
+
+      Tensor ones;
+      ones.mutable_data<T>({m}, ctx.GetPlace());
+      math::SetConstant<Place, T> set;
+      set(device_ctx, &ones, static_cast<T>(1.0));
+
+      math::gemv<Place, T>(device_ctx, true, m, n, 1., batch_gate_g.data<T>(),
+                           ones.data<T>(), 0., bias_g->data<T>());
+    }
+  }
 };

 }  // namespace operators

--- a/paddle/operators/lstm_unit_op.cu
+++ b/paddle/operators/lstm_unit_op.cu
@@ -12,6 +12,10 @@
   See the License for the specific language governing permissions and
   limitations under the License. */

+/* Acknowledgement: the following code is strongly inspired by
+https://github.com/caffe2/caffe2/blob/master/caffe2/operators/lstm_unit_op_gpu.cu
+*/
+
 #include "paddle/framework/op_registry.h"
 #include "paddle/operators/cross_entropy_op.h"
 #include "paddle/platform/assert.h"

--- a/paddle/operators/lstm_unit_op.h
+++ b/paddle/operators/lstm_unit_op.h
@@ -12,6 +12,10 @@
   See the License for the specific language governing permissions and
   limitations under the License. */

+/* Acknowledgement: the following code is strongly inspired by
+https://github.com/caffe2/caffe2/blob/master/caffe2/operators/lstm_unit_op.h
+*/
+
 #pragma once
 #include "glog/logging.h"
 #include "paddle/framework/op_registry.h"

--- a/paddle/operators/math/detail/CMakeLists.txt
+++ b/paddle/operators/math/detail/CMakeLists.txt
 if(WITH_AVX)
-    cc_library(activation_functions SRCS hl_cpu_functions.cc hl_avx_functions.cc)
-else()
-    cc_library(activation_functions SRCS hl_cpu_functions.cc)
+    cc_library(activation_functions SRCS avx_functions.cc)
 endif()
--- a/paddle/operators/math/detail/activation_functions.h
+++ b/paddle/operators/math/detail/activation_functions.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include <math.h>
+#include "paddle/platform/hostdevice.h"
+
+#ifdef __AVX__
+#include <immintrin.h>
+#endif
+
+namespace paddle {
+namespace operators {
+namespace math {
+namespace detail {
+
+#define SIGMOID_THRESHOLD_MIN -40.0
+#define SIGMOID_THRESHOLD_MAX 13.0
+#define EXP_MAX_INPUT 40.0
+
+namespace forward {
+
+template <typename T>
+DEVICE T Identity(const T a) {
+  return a;
+}
+
+template <typename T>
+DEVICE T Relu(const T a) {
+  return a > static_cast<T>(0.0) ? a : static_cast<T>(0.0);
+}
+
+template <typename T>
+DEVICE T Sigmoid(const T a) {
+  const T min = SIGMOID_THRESHOLD_MIN;
+  const T max = SIGMOID_THRESHOLD_MAX;
+  T tmp = (a < min) ? min : ((a > max) ? max : a);
+  return static_cast<T>(1.0) / (static_cast<T>(1.0) + exp(-tmp));
+}
+
+template <typename T>
+DEVICE T Tanh(const T a) {
+  T tmp = -2.0 * a;
+  tmp = (tmp > EXP_MAX_INPUT) ? EXP_MAX_INPUT : tmp;
+  return (2.0 / (1.0 + exp(tmp))) - 1.0;
+}
+
+}  // namespace forward
+
+namespace backward {
+
+template <typename T>
+DEVICE T Identity(const T a, const T b) {
+  return a;
+}
+
+template <typename T>
+DEVICE T Relu(const T a, const T b) {
+  return a * (b > 0.0 ? 1.0 : 0.0);
+}
+
+template <typename T>
+DEVICE T Sigmoid(const T a, const T b) {
+  return a * b * (1.0 - b);
+}
+
+template <typename T>
+DEVICE T Tanh(const T a, const T b) {
+  return a * (1.0 - b * b);
+}
+
+}  // namespace backward
+
+template <typename T>
+struct Active {
+  typedef T (*Act)(T);
+  typedef T (*ActGrad)(T, T);
+};
+
+static DEVICE Active<float>::Act kActFloat[] = {
+    &forward::Sigmoid<float>, &forward::Relu<float>, &forward::Tanh<float>,
+    &forward::Identity<float>};
+
+static DEVICE Active<float>::ActGrad kActGradFloat[] = {
+    &backward::Sigmoid<float>, &backward::Relu<float>, &backward::Tanh<float>,
+    &backward::Identity<float>};
+
+static DEVICE Active<double>::Act kActDouble[] = {
+    &forward::Sigmoid<double>, &forward::Relu<double>, &forward::Tanh<double>,
+    &forward::Identity<double>};
+
+static DEVICE Active<double>::ActGrad kActGradDouble[] = {
+    &backward::Sigmoid<double>, &backward::Relu<double>,
+    &backward::Tanh<double>, &backward::Identity<double>};
+
+namespace forward {
+inline DEVICE float activation(float a, int index) {
+  return kActFloat[index](a);
+}
+
+inline DEVICE double activation(double a, int index) {
+  return kActDouble[index](a);
+}
+
+}  // namespace forward
+
+namespace backward {
+inline DEVICE float activation(float a, float b, int index) {
+  return kActGradFloat[index](a, b);
+}
+
+inline DEVICE double activation(double a, double b, int index) {
+  return kActGradDouble[index](a, b);
+}
+}  // namespace backward
+
+#ifdef __AVX__
+namespace forward {
+namespace avx {
+__m256 Relu(const __m256 a);
+__m256 Sigmoid(const __m256 a);
+__m256 Tanh(const __m256 a);
+__m256 Identity(const __m256 a);
+}  // namespace avx
+}  // namespace forward
+
+namespace backward {
+namespace avx {
+__m256 Relu(const __m256 a, const __m256 b);
+__m256 Sigmoid(const __m256 a, const __m256 b);
+__m256 Tanh(const __m256 a, const __m256 b);
+__m256 Identity(const __m256 a, const __m256 b);
+}  // namespace avx
+}  // namespace backward
+
+static Active<__m256>::Act kActAvx[] = {
+    &forward::avx::Sigmoid, &forward::avx::Relu, &forward::avx::Tanh,
+    &forward::avx::Identity};
+
+static Active<__m256>::ActGrad kActGradAvx[] = {
+    &backward::avx::Sigmoid, &backward::avx::Relu, &backward::avx::Tanh,
+    &backward::avx::Identity};
+
+namespace forward {
+inline __m256 activation(__m256 a, int index) { return kActAvx[index](a); }
+}  // namespace forward
+
+namespace backward {
+inline __m256 activation(__m256 a, __m256 b, int index) {
+  return kActGradAvx[index](a, b);
+}
+}  // namespace backward
+
+#endif
+
+}  // namespace detail
+}  // namespace math
+}  // namespace operators
+}  // namespace paddle
--- a/paddle/operators/math/detail/hl_avx_functions.cc
+++ b/paddle/operators/math/detail/hl_avx_functions.cc
@@ -13,58 +13,74 @@ See the License for the specific language governing permissions and
 limitations under the License. */

 #include <immintrin.h>
-#include "hl_functions.h"
+#include "paddle/operators/math/detail/activation_functions.h"
 // TODO(qingqing) refine this dependence
 #include "paddle/cuda/src/avx_mathfun.h"

-namespace hppl {
+namespace paddle {
+namespace operators {
+namespace math {
+namespace detail {

-__m256 exp(__m256 a) { return exp256_ps(a); }
+__m256 Exp(__m256 a) { return exp256_ps(a); }

-__m256 relu(const __m256 a) {
+namespace forward {
+namespace avx {
+__m256 Relu(const __m256 a) {
  __m256 tmp = _mm256_set1_ps(0.0f);
  return _mm256_max_ps(a, tmp);
 }

-__m256 sigmoid(const __m256 a) {
+__m256 Sigmoid(const __m256 a) {
  __m256 max = _mm256_set1_ps(SIGMOID_THRESHOLD_MAX);
  __m256 min = _mm256_set1_ps(SIGMOID_THRESHOLD_MIN);
  __m256 tmp = _mm256_max_ps(a, min);
  tmp = _mm256_min_ps(tmp, max);
  tmp = _mm256_sub_ps(_mm256_set1_ps(0.0f), tmp);
-  tmp = exp(tmp);
+  tmp = Exp(tmp);
  tmp = _mm256_add_ps(_mm256_set1_ps(1.0f), tmp);
  tmp = _mm256_div_ps(_mm256_set1_ps(1.0f), tmp);
  return tmp;
 }

-__m256 tanh(const __m256 a) {
+__m256 Tanh(const __m256 a) {
  __m256 max = _mm256_set1_ps(EXP_MAX_INPUT);
  __m256 tmp = _mm256_mul_ps(_mm256_set1_ps(-2.0f), a);
  tmp = _mm256_min_ps(tmp, max);
-  tmp = exp(tmp);
+  tmp = Exp(tmp);
  return _mm256_sub_ps(_mm256_div_ps(_mm256_set1_ps(2.0f),
                                     _mm256_add_ps(_mm256_set1_ps(1.0f), tmp)),
                       _mm256_set1_ps(1.0f));
 }

-__m256 linear(const __m256 a) { return a; }
+__m256 Identity(const __m256 a) { return a; }

-__m256 relu(const __m256 a, const __m256 b) {
+}  // namespace avx
+}  // namespace forward
+
+namespace backward {
+namespace avx {
+__m256 Relu(const __m256 a, const __m256 b) {
  return _mm256_mul_ps(
      a, _mm256_and_ps(_mm256_cmp_ps(b, _mm256_set1_ps(0.0f), _CMP_GT_OS),
                       _mm256_set1_ps(1.0f)));
 }

-__m256 sigmoid(const __m256 a, const __m256 b) {
+__m256 Sigmoid(const __m256 a, const __m256 b) {
  return _mm256_mul_ps(_mm256_mul_ps(a, b),
                       _mm256_sub_ps(_mm256_set1_ps(1.0f), b));
 }

-__m256 tanh(const __m256 a, const __m256 b) {
+__m256 Tanh(const __m256 a, const __m256 b) {
  return _mm256_mul_ps(
      a, _mm256_sub_ps(_mm256_set1_ps(1.0f), _mm256_mul_ps(b, b)));
 }

-__m256 linear(const __m256 a, const __m256 b) { return a; }
-}  // namespace hppl
+__m256 Identity(const __m256 a, const __m256 b) { return a; }
+}  // namespace avx
+}  // namespace backward
+
+}  // namespace detail
+}  // namespace math
+}  // namespace operators
+}  // namespace paddle
--- a/paddle/operators/math/detail/hl_activation_functions.h
+++ b/paddle/operators/math/detail/hl_activation_functions.h
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-
-#ifndef HL_ACTIVATION_FUNCTIONS_H_
-#define HL_ACTIVATION_FUNCTIONS_H_
-
-#include "hl_functions.h"
-#include "paddle/operators/math/lstm_compute.h"
-
-/**
- * Active functions: sigmoid, relu, tanh and linear.
- */
-#define FLOAT_ACTIVE_FUNCTION                                   \
-  {                                                             \
-    hppl::typef::sigmoid, hppl::typef::relu, hppl::typef::tanh, \
-        hppl::typef::linear                                     \
-  }
-
-#define DOUBLE_ACTIVE_FUNCTION                                  \
-  {                                                             \
-    hppl::typed::sigmoid, hppl::typed::relu, hppl::typed::tanh, \
-        hppl::typed::linear                                     \
-  }
-
-#define AVX_ACTIVE_FUNCTION \
-  { hppl::sigmoid, hppl::relu, hppl::tanh, hppl::linear }
-
-namespace hppl {
-
-using activation_mode_t = paddle::operators::math::activation_mode_t;
-
-/**
- * Hppl supports sigmoid, relu, tanh, linear active functions
- * for neural networks' forward and backward activation.
- */
-template <class T>
-class Active {
- public:
-  typedef T (*forward)(T);
-  typedef T (*backward)(T, T);
-};
-
-template <typename T>
-struct ForwardActType;
-
-template <>
-struct ForwardActType<float> {
-  using type = Active<float>::forward;
-};
-
-template <>
-struct ForwardActType<double> {
-  using type = Active<double>::forward;
-};
-
-template <typename T>
-struct BackwardActType;
-
-template <>
-struct BackwardActType<float> {
-  using type = Active<float>::backward;
-};
-
-template <>
-struct BackwardActType<double> {
-  using type = Active<double>::backward;
-};
-
-#ifdef __NVCC__
-namespace gpu {
-static __device__ Active<float>::forward forward[] = FLOAT_ACTIVE_FUNCTION;
-static __device__ Active<float>::backward backward[] = FLOAT_ACTIVE_FUNCTION;
-
-static __device__ Active<double>::forward forward_d[] = DOUBLE_ACTIVE_FUNCTION;
-static __device__ Active<double>::backward backward_d[] =
-    DOUBLE_ACTIVE_FUNCTION;
-
-template <typename T>
-struct ForwardAct {
-  __device__ typename ForwardActType<T>::type operator()(
-      activation_mode_t type);
-};
-
-template <>
-struct ForwardAct<float> {
-  __device__ ForwardActType<float>::type operator()(activation_mode_t type) {
-    return forward[type];
-  }
-};
-
-template <>
-struct ForwardAct<double> {
-  __device__ ForwardActType<double>::type operator()(activation_mode_t type) {
-    return forward_d[type];
-  }
-};
-
-template <typename T>
-struct BackwardAct {
-  __device__ typename BackwardActType<T>::type operator()(
-      activation_mode_t type);
-};
-
-template <>
-struct BackwardAct<float> {
-  __device__ BackwardActType<float>::type operator()(activation_mode_t type) {
-    return backward[type];
-  }
-};
-
-template <>
-struct BackwardAct<double> {
-  __device__ BackwardActType<double>::type operator()(activation_mode_t type) {
-    return backward_d[type];
-  }
-};
-
-}  // namespace gpu
-#else
-namespace cpu {
-static Active<float>::forward forward[] = FLOAT_ACTIVE_FUNCTION;
-static Active<float>::backward backward[] = FLOAT_ACTIVE_FUNCTION;
-
-static Active<double>::forward forward_d[] = DOUBLE_ACTIVE_FUNCTION;
-static Active<double>::backward backward_d[] = DOUBLE_ACTIVE_FUNCTION;
-
-template <typename T>
-struct ForwardAct {
-  typename ForwardActType<T>::type operator()(activation_mode_t type);
-};
-
-template <>
-struct ForwardAct<float> {
-  ForwardActType<float>::type operator()(activation_mode_t type) {
-    return forward[type];
-  }
-};
-
-template <>
-struct ForwardAct<double> {
-  ForwardActType<double>::type operator()(activation_mode_t type) {
-    return forward_d[type];
-  }
-};
-
-template <typename T>
-struct BackwardAct {
-  typename BackwardActType<T>::type operator()(activation_mode_t type);
-};
-
-template <>
-struct BackwardAct<float> {
-  BackwardActType<float>::type operator()(activation_mode_t type) {
-    return backward[type];
-  }
-};
-
-template <>
-struct BackwardAct<double> {
-  BackwardActType<double>::type operator()(activation_mode_t type) {
-    return backward_d[type];
-  }
-};
-
-}  // namespace cpu
-
-#ifdef __AVX__
-namespace avx {
-static Active<__m256>::forward forward[] = AVX_ACTIVE_FUNCTION;
-static Active<__m256>::backward backward[] = AVX_ACTIVE_FUNCTION;
-}  // namespace avx
-#endif
-#endif
-
-}  // namespace hppl
-
-#endif  // HL_ACTIVATION_FUNCTIONS_H_
--- a/paddle/operators/math/detail/hl_cpu_functions.cc
+++ b/paddle/operators/math/detail/hl_cpu_functions.cc
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-
-#include <math.h>
-#include "hl_functions.h"
-
-namespace hppl {
-namespace typef {
-
-float relu(const float a) {
-  return a > static_cast<float>(0.0) ? a : static_cast<float>(0.0);
-}
-
-float sigmoid(const float a) {
-  const float min = SIGMOID_THRESHOLD_MIN;
-  const float max = SIGMOID_THRESHOLD_MAX;
-  float tmp = (a < min) ? min : ((a > max) ? max : a);
-  return static_cast<float>(1.0) / (static_cast<float>(1.0) + exp(-tmp));
-}
-
-float tanh(const float a) {
-  float tmp = -2.0 * a;
-  tmp = (tmp > EXP_MAX_INPUT) ? EXP_MAX_INPUT : tmp;
-  return (2.0 / (1.0 + exp(tmp))) - 1.0;
-}
-
-float linear(const float a) { return a; }
-
-float relu(const float a, const float b) { return a * (b > 0.0 ? 1.0 : 0.0); }
-
-float sigmoid(const float a, const float b) {
-  return a * b * (static_cast<float>(1) - b);
-}
-
-float tanh(const float a, const float b) {
-  return a * (static_cast<float>(1) - b * b);
-}
-
-float linear(const float a, const float b) { return a; }
-
-}  // namespace typef
-
-namespace typed {
-double relu(const double a) {
-  return a > static_cast<double>(0.0) ? a : static_cast<double>(0.0);
-}
-
-double sigmoid(const double a) {
-  const double min = SIGMOID_THRESHOLD_MIN;
-  const double max = SIGMOID_THRESHOLD_MAX;
-  double tmp = (a < min) ? min : ((a > max) ? max : a);
-  return static_cast<double>(1.0) / (static_cast<double>(1.0) + exp(-tmp));
-}
-
-double tanh(const double a) {
-  double tmp = -2.0 * a;
-  tmp = (tmp > EXP_MAX_INPUT) ? EXP_MAX_INPUT : tmp;
-  return (2.0 / (1.0 + exp(tmp))) - 1.0;
-}
-
-double linear(const double a) { return a; }
-
-double relu(const double a, const double b) {
-  return a * (b > 0.0 ? 1.0 : 0.0);
-}
-
-double sigmoid(const double a, const double b) {
-  return a * b * (static_cast<double>(1) - b);
-}
-
-double tanh(const double a, const double b) {
-  return a * (static_cast<double>(1) - b * b);
-}
-
-double linear(const double a, const double b) { return a; }
-
-}  // namespace typed
-}  // namespace hppl
--- a/paddle/operators/math/detail/hl_functions.h
+++ b/paddle/operators/math/detail/hl_functions.h
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-
-#ifndef HL_FUNCTIONS_H_
-#define HL_FUNCTIONS_H_
-
-/**
- * sigmoid threshold maximum
- */
-#define SIGMOID_THRESHOLD_MIN -40.0
-
-/**
- * sigmoid threshold minimum
- */
-#define SIGMOID_THRESHOLD_MAX 13.0
-
-/**
- * The maximum input value for exp, used to avoid overflow problem.
- * currently only used for tanh function.
- */
-#define EXP_MAX_INPUT 40.0
-
-#ifndef __NVCC__
-namespace hppl {
-namespace typef {
-float relu(const float a);
-float sigmoid(const float a);
-float tanh(const float a);
-float linear(const float a);
-
-float relu(const float a, const float b);
-float sigmoid(const float a, const float b);
-float tanh(const float a, const float b);
-float linear(const float a, const float b);
-
-}  // namespace typef
-
-namespace typed {
-double relu(const double a);
-double sigmoid(const double a);
-double tanh(const double a);
-double linear(const double a);
-
-double relu(const double a, const double b);
-double sigmoid(const double a, const double b);
-double tanh(const double a, const double b);
-double linear(const double a, const double b);
-}  // namespace typed
-
-}  // namespace hppl
-
-#ifdef __AVX__
-#include "hl_avx_functions.h"
-#endif
-
-#else
-#include "hl_gpu_functions.h"
-#endif
-
-#endif  // HL_FUNCTIONS_H_
--- a/paddle/operators/math/detail/hl_gpu_functions.h
+++ b/paddle/operators/math/detail/hl_gpu_functions.h
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-
-#ifndef HL_GPU_FUNCTIONS_CUH_
-#define HL_GPU_FUNCTIONS_CUH_
-
-#include "hl_base.h"
-
-namespace hppl {
-namespace typef {
-
-__device__ static float relu(const float a) { return a > 0.0f ? a : 0.0f; }
-
-__device__ static float sigmoid(const float a) {
-  const float min = SIGMOID_THRESHOLD_MIN;
-  const float max = SIGMOID_THRESHOLD_MAX;
-  float tmp = (a < min) ? min : ((a > max) ? max : a);
-  return __fdividef(1.0f, 1.0f + __expf(-tmp));
-}
-
-__device__ static float tanh(const float a) {
-  float tmp = -2.0 * a;
-  tmp = (tmp > EXP_MAX_INPUT) ? EXP_MAX_INPUT : tmp;
-  return __fdividef(2.0f, (1.0f + __expf(-2.0f * tmp))) - 1.0f;
-}
-
-__device__ static float linear(const float a) { return a; }
-
-__device__ static float relu(const float a, const float b) {
-  return a * (b > 0.0f ? 1.0f : 0.0f);
-}
-
-__device__ static float sigmoid(const float a, const float b) {
-  return a * b * (1.0f - b);
-}
-
-__device__ static float tanh(const float a, const float b) {
-  return a * (1.0f - b * b);
-}
-
-__device__ static float linear(const float a, const float b) { return a; }
-
-}  // namespace typef
-
-namespace typed {
-
-__device__ static double relu(const double a) { return a > 0.0 ? a : 0.0; }
-
-__device__ static double sigmoid(const double a) {
-  const double min = SIGMOID_THRESHOLD_MIN;
-  const double max = SIGMOID_THRESHOLD_MAX;
-  double tmp = (a < min) ? min : ((a > max) ? max : a);
-  return 1.0 / (1.0 + exp(-tmp));
-}
-
-__device__ static double tanh(const double a) {
-  double tmp = -2.0 * a;
-  tmp = (tmp > EXP_MAX_INPUT) ? EXP_MAX_INPUT : tmp;
-  return (2.0 / (1.0 + exp(-2.0 * a))) - 1.0;
-}
-
-__device__ static double linear(const double a) { return a; }
-
-__device__ static double relu(const double a, const double b) {
-  return a * (b > 0.0 ? 1.0 : 0.0);
-}
-
-__device__ static double sigmoid(const double a, const double b) {
-  return a * b * (1 - b);
-}
-
-__device__ static double tanh(const double a, const double b) {
-  return a * (1.0 - b * b);
-}
-
-__device__ static double linear(const double a, const double b) { return a; }
-
-}  // namespace typef
-
-}  // namespace hppl
-
-#endif  // HL_GPU_FUNCTIONS_CUH_
--- a/paddle/operators/math/detail/lstm_cpu_kernel.h
+++ b/paddle/operators/math/detail/lstm_cpu_kernel.h
@@ -14,7 +14,7 @@ limitations under the License. */

 #pragma once
 #include <type_traits>
-#include "paddle/operators/math/detail/hl_activation_functions.h"
+#include "paddle/operators/math/detail/activation_functions.h"
 #include "paddle/operators/math/lstm_compute.h"

 namespace paddle {
@@ -60,10 +60,8 @@ void naive_lstm_forward_one_sequence(Op op, LstmMetaValue<T> value,
      rPrevState = value.prevStateValue[i];
    }

-    hppl::cpu::ForwardAct<T> act;
    op(rValueIn, rValueIg, rValueFg, rValueOg, rPrevState, rState, rStateAtv,
-       rOut, rCheckI, rCheckF, rCheckO, act(active_node), act(active_gate),
-       act(active_state));
+       rOut, rCheckI, rCheckF, rCheckO, active_node, active_gate, active_state);

    valueIn[i] = rValueIn;
    valueIg[i] = rValueIg;
@@ -127,11 +125,10 @@ void naive_lstm_backward_one_sequence(Op op, LstmMetaValue<T> value,
      rPrevState = value.prevStateValue[i];
    }

-    hppl::cpu::BackwardAct<T> act;
    op(rValueIn, rValueIg, rValueFg, rValueOg, rGradIn, rGradIg, rGradFg,
       rGradOg, rPrevState, rPrevStateGrad, rState, rStateGrad, rStateAtv,
       rOutputGrad, rCheckI, rCheckF, rCheckO, rCheckIGrad, rCheckFGrad,
-       rCheckOGrad, act(active_node), act(active_gate), act(active_state));
+       rCheckOGrad, active_node, active_gate, active_state);

    gradIn[i] = rGradIn;
    gradIg[i] = rGradIg;
@@ -185,8 +182,7 @@ void avx_lstm_forward_one_sequence(Op op, LstmMetaValue<T> value, int frameSize,
    }

    op(rValueIn, rValueIg, rValueFg, rValueOg, rPrevState, rState, rStateAtv,
-       rOut, rCheckI, rCheckF, rCheckO, hppl::avx::forward[active_node],
-       hppl::avx::forward[active_gate], hppl::avx::forward[active_state]);
+       rOut, rCheckI, rCheckF, rCheckO, active_node, active_gate, active_state);

    valueIn[i] = rValueIn;
    valueIg[i] = rValueIg;
@@ -255,8 +251,7 @@ void avx_lstm_backward_one_sequence(Op op, LstmMetaValue<T> value,
    op(rValueIn, rValueIg, rValueFg, rValueOg, rGradIn, rGradIg, rGradFg,
       rGradOg, rPrevState, rPrevStateGrad, rState, rStateGrad, rStateAtv,
       rOutputGrad, rCheckI, rCheckF, rCheckO, rCheckIGrad, rCheckFGrad,
-       rCheckOGrad, hppl::avx::backward[active_node],
-       hppl::avx::backward[active_gate], hppl::avx::backward[active_state]);
+       rCheckOGrad, active_node, active_gate, active_state);

    gradIn[i] = rGradIn;
    gradIg[i] = rGradIg;

--- a/paddle/operators/math/detail/lstm_gpu_kernel.h
+++ b/paddle/operators/math/detail/lstm_gpu_kernel.h
@@ -13,13 +13,12 @@ See the License for the specific language governing permissions and
 limitations under the License. */

 #pragma once
-#include <type_traits>
-#include "paddle/operators/math/detail/hl_activation_functions.h"
+#include "paddle/operators/math/detail/activation_functions.h"
 #include "paddle/operators/math/lstm_compute.h"
 #include "paddle/platform/cuda_helper.h"
 #include "paddle/platform/device_context.h"

-#include <glog/logging.h>
+#include <type_traits>

 namespace paddle {
 namespace operators {
@@ -70,10 +69,8 @@ __global__ void KeLstmForward(Op op, LstmMetaValue<T> value, int frameSize,
    rPrevState = value.prevStateValue[frameIdx];
  }

-  hppl::gpu::ForwardAct<T> act;
  op(rValueIn, rValueIg, rValueFg, rValueOg, rPrevState, rState, rStateAtv,
-     rOut, rCheckI, rCheckF, rCheckO, act(active_node), act(active_gate),
-     act(active_state));
+     rOut, rCheckI, rCheckF, rCheckO, active_node, active_gate, active_state);

  value.gateValue[frameIdx] = rValueIn;
  value.gateValue[frameIdx + frameSize] = rValueIg;
@@ -145,11 +142,10 @@ __global__ void KeLstmBackward(Op op, LstmMetaValue<T> value,
    rPrevState = value.prevStateValue[frameIdx];
  }

-  hppl::gpu::BackwardAct<T> act;
  op(rValueIn, rValueIg, rValueFg, rValueOg, rGradIn, rGradIg, rGradFg, rGradOg,
     rPrevState, rPrevStateGrad, rState, rStateGrad, rStateAtv, rOutputGrad,
     rCheckI, rCheckF, rCheckO, rCheckIGrad, rCheckFGrad, rCheckOGrad,
-     act(active_node), act(active_gate), act(active_state));
+     active_node, active_gate, active_state);

  grad.gateGrad[frameIdx] = rGradIn;
  grad.gateGrad[frameIdx + frameSize] = rGradIg;
@@ -230,9 +226,9 @@ void gpu_lstm_backward(const platform::DeviceContext& context, Op op,
    threads = dim3(framePerBlock, 1);
    grid = dim3(frameBlocks, 1);
  } else {
-    /* framePerBlock = 32 batchPerBlock = 32 */
-    threads = dim3(32, 32);
-    grid = dim3((frameSize + 32 - 1) / 32, (batchSize + 32 - 1) / 32);
+    /* framePerBlock = 32 batchPerBlock = 16 */
+    threads = dim3(32, 16);
+    grid = dim3((frameSize + 32 - 1) / 32, (batchSize + 16 - 1) / 16);
  }

  auto stream =
@@ -248,6 +244,11 @@ void gpu_lstm_backward(const platform::DeviceContext& context, Op op,
        op, value, grad, frameSize, batchSize, active_node, active_gate,
        active_state);
  }
+
+  cudaStreamSynchronize(stream);
+  // TODO(qingqing): Add cuda error check for each kernel.
+  cudaError_t err = cudaGetLastError();
+  PADDLE_ENFORCE(err, cudaGetErrorString(err));
 }

 }  // namespace detail

--- a/paddle/operators/math/detail/lstm_kernel.h
+++ b/paddle/operators/math/detail/lstm_kernel.h
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */

-#include "paddle/operators/math/detail/hl_activation_functions.h"
+#include "paddle/operators/math/detail/activation_functions.h"
 #include "paddle/platform/hostdevice.h"

 #include <type_traits>
@@ -30,15 +30,15 @@ class lstm {
  HOSTDEVICE void operator()(T &valueIn, T &valueIg, T &valueFg, T &valueOg,
                             T &prevState, T &state, T &stateAtv, T &output,
                             T &checkI, T &checkF, T &checkO,
-                             typename hppl::ForwardActType<T>::type actInput,
-                             typename hppl::ForwardActType<T>::type actGate,
-                             typename hppl::ForwardActType<T>::type actState) {
-    valueIn = actInput(valueIn);
-    valueIg = actGate(valueIg + prevState * checkI);
-    valueFg = actGate(valueFg + prevState * checkF);
+                             activation_mode_t active_node,
+                             activation_mode_t active_gate,
+                             activation_mode_t active_state) {
+    valueIn = activation(valueIn, active_node);
+    valueIg = activation(valueIg + prevState * checkI, active_gate);
+    valueFg = activation(valueFg + prevState * checkF, active_gate);
    state = valueIn * valueIg + prevState * valueFg;
-    valueOg = actGate(valueOg + state * checkO);
-    stateAtv = actState(state);
+    valueOg = activation(valueOg + state * checkO, active_gate);
+    stateAtv = activation(state, active_state);
    output = valueOg * stateAtv;
  }
 #ifndef __NVCC__
@@ -52,16 +52,19 @@ class lstm {
                             __m256 &valueOg, __m256 &prevState, __m256 &state,
                             __m256 &stateAtv, __m256 &output, __m256 &checkI,
                             __m256 &checkF, __m256 &checkO,
-                             hppl::Active<__m256>::forward actInput,
-                             hppl::Active<__m256>::forward actGate,
-                             hppl::Active<__m256>::forward actState) {
-    valueIn = actInput(valueIn);
-    valueIg = actGate(_mm256_add_ps(valueIg, _mm256_mul_ps(prevState, checkI)));
-    valueFg = actGate(_mm256_add_ps(valueFg, _mm256_mul_ps(prevState, checkF)));
+                             activation_mode_t active_node,
+                             activation_mode_t active_gate,
+                             activation_mode_t active_state) {
+    valueIn = activation(valueIn, active_node);
+    valueIg = activation(
+        _mm256_add_ps(valueIg, _mm256_mul_ps(prevState, checkI)), active_gate);
+    valueFg = activation(
+        _mm256_add_ps(valueFg, _mm256_mul_ps(prevState, checkF)), active_gate);
    state = _mm256_add_ps(_mm256_mul_ps(valueIn, valueIg),
                          _mm256_mul_ps(prevState, valueFg));
-    valueOg = actGate(_mm256_add_ps(valueOg, _mm256_mul_ps(state, checkO)));
-    stateAtv = actState(state);
+    valueOg = activation(_mm256_add_ps(valueOg, _mm256_mul_ps(state, checkO)),
+                         active_gate);
+    stateAtv = activation(state, active_state);
    output = _mm256_mul_ps(valueOg, stateAtv);
  }
 #endif
@@ -81,14 +84,15 @@ class lstm {
                             T &stateGrad, T &stateAtv, T &outputGrad,
                             T &checkI, T &checkF, T &checkO, T &checkIGrad,
                             T &checkFGrad, T &checkOGrad,
-                             typename hppl::BackwardActType<T>::type actInput,
-                             typename hppl::BackwardActType<T>::type actGate,
-                             typename hppl::BackwardActType<T>::type actState) {
-    gradOg = actGate(outputGrad * stateAtv, valueOg);
-    stateGrad += actState(outputGrad * valueOg, stateAtv) + gradOg * checkO;
-    gradIn = actInput(stateGrad * valueIg, valueIn);
-    gradIg = actGate(stateGrad * valueIn, valueIg);
-    gradFg = actGate(stateGrad * prevState, valueFg);
+                             activation_mode_t active_node,
+                             activation_mode_t active_gate,
+                             activation_mode_t active_state) {
+    gradOg = activation(outputGrad * stateAtv, valueOg, active_gate);
+    stateGrad += activation(outputGrad * valueOg, stateAtv, active_state) +
+                 gradOg * checkO;
+    gradIn = activation(stateGrad * valueIg, valueIn, active_node);
+    gradIg = activation(stateGrad * valueIn, valueIg, active_gate);
+    gradFg = activation(stateGrad * prevState, valueFg, active_gate);
    prevStateGrad = gradIg * checkI + gradFg * checkF + stateGrad * valueFg;
    checkIGrad = gradIg * prevState;
    checkFGrad = gradFg * prevState;
@@ -100,24 +104,26 @@ class lstm {
 #else
  // Only float support AVX optimization
  static const bool avx = std::is_same<T, float>::value;
-  HOSTDEVICE void operator()(__m256 &valueIn, __m256 &valueIg, __m256 &valueFg,
-                             __m256 &valueOg, __m256 &gradIn, __m256 &gradIg,
-                             __m256 &gradFg, __m256 &gradOg, __m256 &prevState,
-                             __m256 &prevStateGrad, __m256 &state,
-                             __m256 &stateGrad, __m256 &stateAtv,
-                             __m256 &outputGrad, __m256 &checkI, __m256 &checkF,
-                             __m256 &checkO, __m256 &checkIGrad,
-                             __m256 &checkFGrad, __m256 &checkOGrad,
-                             hppl::Active<__m256>::backward actInput,
-                             hppl::Active<__m256>::backward actGate,
-                             hppl::Active<__m256>::backward actState) {
-    gradOg = actGate(_mm256_mul_ps(outputGrad, stateAtv), valueOg);
+  HOSTDEVICE void operator()(
+      __m256 &valueIn, __m256 &valueIg, __m256 &valueFg, __m256 &valueOg,
+      __m256 &gradIn, __m256 &gradIg, __m256 &gradFg, __m256 &gradOg,
+      __m256 &prevState, __m256 &prevStateGrad, __m256 &state,
+      __m256 &stateGrad, __m256 &stateAtv, __m256 &outputGrad, __m256 &checkI,
+      __m256 &checkF, __m256 &checkO, __m256 &checkIGrad, __m256 &checkFGrad,
+      __m256 &checkOGrad, activation_mode_t active_node,
+      activation_mode_t active_gate, activation_mode_t active_state) {
+    gradOg =
+        activation(_mm256_mul_ps(outputGrad, stateAtv), valueOg, active_gate);
    stateGrad = _mm256_add_ps(
-        actState(_mm256_mul_ps(outputGrad, valueOg), stateAtv), stateGrad);
+        activation(_mm256_mul_ps(outputGrad, valueOg), stateAtv, active_state),
+        stateGrad);
    stateGrad = _mm256_add_ps(_mm256_mul_ps(gradOg, checkO), stateGrad);
-    gradIn = actInput(_mm256_mul_ps(stateGrad, valueIg), valueIn);
-    gradIg = actGate(_mm256_mul_ps(stateGrad, valueIn), valueIg);
-    gradFg = actGate(_mm256_mul_ps(stateGrad, prevState), valueFg);
+    gradIn =
+        activation(_mm256_mul_ps(stateGrad, valueIg), valueIn, active_node);
+    gradIg =
+        activation(_mm256_mul_ps(stateGrad, valueIn), valueIg, active_gate);
+    gradFg =
+        activation(_mm256_mul_ps(stateGrad, prevState), valueFg, active_gate);
    prevStateGrad = _mm256_add_ps(_mm256_mul_ps(gradIg, checkI),
                                  _mm256_mul_ps(gradFg, checkF));
    prevStateGrad =

--- a/paddle/operators/math/math_function.cc
+++ b/paddle/operators/math/math_function.cc
@@ -211,6 +211,26 @@ void batched_gemm<platform::CPUPlace, double>(
 }
 #endif

+template <>
+void gemv<platform::CPUPlace, float>(const platform::DeviceContext& context,
+                                     const bool trans_a, const int M,
+                                     const int N, const float alpha,
+                                     const float* A, const float* B,
+                                     const float beta, float* C) {
+  CBLAS_TRANSPOSE transA = (trans_a == false) ? CblasNoTrans : CblasTrans;
+  cblas_sgemv(CblasRowMajor, transA, M, N, alpha, A, N, B, 1, beta, C, 1);
+}
+
+template <>
+void gemv<platform::CPUPlace, double>(const platform::DeviceContext& context,
+                                      const bool trans_a, const int M,
+                                      const int N, const double alpha,
+                                      const double* A, const double* B,
+                                      const double beta, double* C) {
+  CBLAS_TRANSPOSE transA = (trans_a == false) ? CblasNoTrans : CblasTrans;
+  cblas_dgemv(CblasRowMajor, transA, M, N, alpha, A, N, B, 1, beta, C, 1);
+}
+
 template struct SetConstant<platform::CPUPlace, float>;

 }  // namespace math

--- a/paddle/operators/math/math_function.cu
+++ b/paddle/operators/math/math_function.cu
@@ -203,6 +203,33 @@ void batched_gemm<platform::GPUPlace, double>(
      &beta, C, ldc, strideC, batchCount));
 }

+template <>
+void gemv<platform::GPUPlace, float>(const platform::DeviceContext& context,
+                                     const bool trans_a, const int M,
+                                     const int N, const float alpha,
+                                     const float* A, const float* B,
+                                     const float beta, float* C) {
+  cublasOperation_t cuTransA = (trans_a == false) ? CUBLAS_OP_T : CUBLAS_OP_N;
+
+  PADDLE_ENFORCE(platform::dynload::cublasSgemv(
+      reinterpret_cast<const platform::CUDADeviceContext&>(context)
+          .cublas_handle(),
+      cuTransA, N, M, &alpha, A, N, B, 1, &beta, C, 1));
+}
+
+template <>
+void gemv<platform::GPUPlace, double>(const platform::DeviceContext& context,
+                                      const bool trans_a, const int M,
+                                      const int N, const double alpha,
+                                      const double* A, const double* B,
+                                      const double beta, double* C) {
+  cublasOperation_t cuTransA = (trans_a == false) ? CUBLAS_OP_T : CUBLAS_OP_N;
+  PADDLE_ENFORCE(platform::dynload::cublasDgemv(
+      reinterpret_cast<const platform::CUDADeviceContext&>(context)
+          .cublas_handle(),
+      cuTransA, N, M, &alpha, A, N, B, 1, &beta, C, 1));
+}
+
 template struct SetConstant<platform::GPUPlace, float>;

 }  // namespace math

--- a/paddle/operators/math/math_function.h
+++ b/paddle/operators/math/math_function.h
@@ -93,6 +93,11 @@ void batched_gemm(const platform::DeviceContext& context,
                  const T* A, const T* B, const T beta, T* C,
                  const int batchCount, const int strideA, const int strideB);

+template <typename Place, typename T>
+void gemv(const platform::DeviceContext& context, const bool trans_a,
+          const int M, const int N, const T alpha, const T* A, const T* B,
+          const T beta, T* C);
+
 template <typename Place, typename T>
 struct SetConstant {
  void operator()(const platform::DeviceContext& context,

--- a/paddle/operators/math/math_function_test.cc
+++ b/paddle/operators/math/math_function_test.cc
@@ -89,3 +89,53 @@ TEST(math_function, zero) {
  EXPECT_EQ(t[2], 1);
  EXPECT_EQ(t[3], 1);
 }
+
+template <typename T>
+void GemvTest(int m, int n, bool trans) {
+  paddle::framework::Tensor mat_a;
+  paddle::framework::Tensor vec_b;
+  paddle::framework::Tensor vec_c;
+  auto* cpu_place = new paddle::platform::CPUPlace();
+  int b_num = trans ? m : n;
+  int c_num = trans ? n : m;
+
+  T* data_a = mat_a.mutable_data<T>({m, n}, *cpu_place);
+  T* data_b = vec_b.mutable_data<T>({b_num}, *cpu_place);
+  T* data_c = vec_c.mutable_data<T>({c_num}, *cpu_place);
+  for (int i = 0; i < mat_a.numel(); ++i) {
+    data_a[i] = static_cast<T>(i);
+  }
+  for (int i = 0; i < vec_b.numel(); ++i) {
+    data_b[i] = static_cast<T>(i);
+  }
+
+  paddle::platform::CPUDeviceContext context(*cpu_place);
+  paddle::operators::math::gemv<paddle::platform::CPUPlace, T>(
+      context, trans, static_cast<int>(m), static_cast<int>(n), 1., data_a,
+      data_b, 0., data_c);
+
+  if (!trans) {
+    for (int i = 0; i < m; ++i) {
+      T sum = 0.0;
+      for (int j = 0; j < n; ++j) {
+        sum += data_a[i * n + j] * data_b[j];
+      }
+      ASSERT_FLOAT_EQ(data_c[i], sum);
+    }
+  } else {
+    for (int i = 0; i < n; ++i) {
+      T sum = 0.0;
+      for (int j = 0; j < m; ++j) {
+        sum += data_a[j * n + i] * data_b[j];
+      }
+      ASSERT_FLOAT_EQ(data_c[i], sum);
+    }
+  }
+}
+
+TEST(math_function, gemv) {
+  GemvTest<float>(3, 13, false);
+  GemvTest<double>(4, 5, false);
+  GemvTest<float>(12, 7, true);
+  GemvTest<double>(7, 9, true);
+}
--- a/paddle/operators/math/math_function_test.cu
+++ b/paddle/operators/math/math_function_test.cu
@@ -177,3 +177,65 @@ TEST(math_function, gemm_trans_cublas) {
  EXPECT_EQ(input3_ptr[7], 99);
  delete gpu_place;
 }
+
+template <typename T>
+void GemvTest(int m, int n, bool trans) {
+  paddle::framework::Tensor mat_a;
+  paddle::framework::Tensor vec_b;
+  paddle::framework::Tensor vec_c;
+  auto* cpu_place = new paddle::platform::CPUPlace();
+
+  T* data_a = mat_a.mutable_data<T>({m, n}, *cpu_place);
+  T* data_b = vec_b.mutable_data<T>({trans ? m : n}, *cpu_place);
+  T* data_c = vec_c.mutable_data<T>({trans ? n : m}, *cpu_place);
+
+  auto* gpu_place = new paddle::platform::GPUPlace(0);
+  paddle::framework::Tensor g_mat_a;
+  paddle::framework::Tensor g_vec_b;
+  paddle::framework::Tensor g_vec_c;
+  T* g_data_a = g_mat_a.mutable_data<T>(mat_a.dims(), *gpu_place);
+  T* g_data_b = g_vec_b.mutable_data<T>(vec_b.dims(), *gpu_place);
+  T* g_data_c = g_vec_c.mutable_data<T>(vec_c.dims(), *gpu_place);
+
+  for (int i = 0; i < mat_a.numel(); ++i) {
+    data_a[i] = static_cast<T>(i);
+  }
+  for (int i = 0; i < vec_b.numel(); ++i) {
+    data_b[i] = static_cast<T>(i);
+  }
+
+  paddle::platform::CUDADeviceContext context(*gpu_place);
+  g_mat_a.CopyFrom(mat_a, *gpu_place, context);
+  g_vec_b.CopyFrom(vec_b, *gpu_place, context);
+
+  paddle::operators::math::gemv<paddle::platform::GPUPlace, T>(
+      context, trans, static_cast<int>(m), static_cast<int>(n), 1., g_data_a,
+      g_data_b, 0., g_data_c);
+
+  vec_c.CopyFrom(g_vec_c, paddle::platform::CPUPlace(), context);
+
+  if (!trans) {
+    for (int i = 0; i < m; ++i) {
+      T sum = 0.0;
+      for (int j = 0; j < n; ++j) {
+        sum += data_a[i * n + j] * data_b[j];
+      }
+      ASSERT_FLOAT_EQ(data_c[i], sum);
+    }
+  } else {
+    for (int i = 0; i < n; ++i) {
+      T sum = 0.0;
+      for (int j = 0; j < m; ++j) {
+        sum += data_a[j * n + i] * data_b[j];
+      }
+      ASSERT_FLOAT_EQ(data_c[i], sum);
+    }
+  }
+}
+
+TEST(math_function, gemv) {
+  GemvTest<float>(3, 13, false);
+  GemvTest<double>(3, 13, false);
+  GemvTest<float>(3, 13, true);
+  GemvTest<double>(3, 13, true);
+}
--- a/paddle/operators/math/sequence2batch.h
+++ b/paddle/operators/math/sequence2batch.h
@@ -53,7 +53,18 @@ class LoDTensor2BatchFunctor {
 public:
  void operator()(const platform::DeviceContext& context,
                  const framework::LoDTensor& lod_tensor,
-                  framework::LoDTensor& batch, bool is_reverse) const {
+                  framework::LoDTensor& batch, bool is_cal_batch_lod,
+                  bool is_reverse = false) const {
+    if (!is_cal_batch_lod) {
+      auto lods = batch.lod();
+      PADDLE_ENFORCE_EQ(lods.size(), 2UL);
+      PADDLE_ENFORCE_EQ(lods[1].size(),
+                        static_cast<size_t>(lod_tensor.dims()[0]));
+      CopyMatrixRowsFunctor<Place, T> to_batch;
+      to_batch(context, lod_tensor, lods[1].data(), batch, true);
+      return;
+    }
+
    auto lods = lod_tensor.lod();
    PADDLE_ENFORCE_EQ(lods.size(), 1UL, "Only support one level sequence now.");
    auto lod = lods[0];
@@ -101,10 +112,10 @@ class LoDTensor2BatchFunctor {
    size_t* batch_starts = batch_lods[0].data();
    size_t* seq2batch_idx = batch_lods[1].data();
    batch_starts[0] = 0;
-    for (size_t n = 0; n < num_batch; n++) {
+    for (int n = 0; n < num_batch; n++) {
      auto batch_id = static_cast<int>(batch_starts[n]);
      for (size_t i = 0; i < seq_info.size(); ++i) {
-        size_t seq_len = seq_info[i].length;
+        int seq_len = seq_info[i].length;
        int start = seq_info[i].start;
        if (n < seq_len) {
          seq2batch_idx[batch_id] =
@@ -132,11 +143,8 @@ class Batch2LoDTensorFunctor {
    auto in_lod = batch.lod();
    PADDLE_ENFORCE_EQ(in_lod.size(), 2UL,
                      "The LoD size of input `batch` should be 2.");
-    auto out_lod = lod_tensor.lod()[0];
-    auto num = out_lod[out_lod.size() - 1];
-    PADDLE_ENFORCE_EQ(num, lod_tensor.dims()[0]);
-    PADDLE_ENFORCE_EQ(num, in_lod[1].size());
-    PADDLE_ENFORCE_EQ(num, batch.dims()[0]);
+    PADDLE_ENFORCE_EQ(in_lod[1].size(),
+                      static_cast<size_t>(lod_tensor.dims()[0]));
    CopyMatrixRowsFunctor<Place, T> to_seq;
    size_t* index = in_lod[1].data();
    to_seq(context, batch, index, lod_tensor, false);

--- a/paddle/operators/mul_op.cc
+++ b/paddle/operators/mul_op.cc
@@ -29,9 +29,14 @@ class MulOpShapeInference : public framework::InferShapeBase {

    auto x_dims = ctx->GetInputDim("X");
    auto y_dims = ctx->GetInputDim("Y");
+
    int x_num_col_dims = ctx->Attrs().Get<int>("x_num_col_dims");
    int y_num_col_dims = ctx->Attrs().Get<int>("y_num_col_dims");

+    VLOG(3) << "mul operator x.shape=" << x_dims << " y.shape=" << y_dims
+            << " x_num_col_dims=" << x_num_col_dims
+            << " y_num_col_dims=" << y_num_col_dims;
+
    PADDLE_ENFORCE_GT(
        x_dims.size(), x_num_col_dims,
        "The input tensor X's rank of MulOp should be larger than "

--- a/paddle/operators/nccl_op_test.cu
+++ b/paddle/operators/nccl_op_test.cu
@@ -185,7 +185,7 @@ TEST_F(NCCLTester, ncclAllReduceOp) {
        recv_tensor.numel() * sizeof(float),
        static_cast<p::CUDADeviceContext *>(dev_ctxs[i])->stream());

-    for (size_t j = 0; j < f::product(kDims); ++j) {
+    for (int64_t j = 0; j < f::product(kDims); ++j) {
      ASSERT_NEAR(ct[j], result, 1e-5);
    }
  }
@@ -234,7 +234,7 @@ TEST_F(NCCLTester, ncclReduceOp) {
      recv_tensor.numel() * sizeof(float),
      static_cast<p::CUDADeviceContext *>(dev_ctxs[kRoot])->stream());

-  for (int j = 0; j < f::product(kDims); ++j) {
+  for (int64_t j = 0; j < f::product(kDims); ++j) {
    ASSERT_NEAR(ct[j], result, 1e-5);
  }
 }
@@ -282,7 +282,7 @@ TEST_F(NCCLTester, ncclBcastOp) {
      recv_tensor.numel() * sizeof(float),
      static_cast<p::CUDADeviceContext *>(dev_ctxs[idx])->stream());

-  for (size_t j = 0; j < f::product(kDims); ++j) {
+  for (int64_t j = 0; j < f::product(kDims); ++j) {
    ASSERT_NEAR(ct[j], result, 1e-5);
  }
 }

--- a/paddle/operators/positive_negative_pair_op.cc
+++ b/paddle/operators/positive_negative_pair_op.cc
@@ -105,7 +105,7 @@ class PositiveNegativePairOpMaker : public framework::OpProtoAndCheckerMaker {
             "(Tensor, float) Label of an item (with repsect to "
             "QueryId). It's a 2-D tensor with shape [batch_size, 1].");
    AddInput("QueryID",
-             "(Tensor, int) Query ID that indicates the context. Its shape "
+             "(Tensor, int64) Query ID that indicates the context. Its shape "
             "should be the same as Label.");
    AddInput(
        "AccumulatePositivePair",

--- a/paddle/operators/positive_negative_pair_op.h
+++ b/paddle/operators/positive_negative_pair_op.h
@@ -47,10 +47,9 @@ class PositiveNegativePairKernel : public framework::OpKernel<T> {

    auto score = score_t->data<T>();
    auto label = label_t->data<T>();
-    auto query = query_t->data<int32_t>();
+    auto query = query_t->data<int64_t>();
    const T* weight = nullptr;
-    auto has_weight = weight_t != nullptr;
-    if (has_weight) {
+    if (weight_t != nullptr) {
      weight = weight_t->data<T>();
    }
    T* positive = positive_t->mutable_data<T>(context.GetPlace());
@@ -66,15 +65,15 @@ class PositiveNegativePairKernel : public framework::OpKernel<T> {
    }

    // construct document instances for each query: Query => List[<score#0,
-    // label#0>, ...]
-    std::unordered_map<int32_t, std::vector<PredictionResult>> predictions;
+    // label#0, weight#0>, ...]
+    std::unordered_map<int64_t, std::vector<PredictionResult>> predictions;
    for (auto i = 0; i < batch_size; ++i) {
      if (predictions.find(query[i]) == predictions.end()) {
        predictions.emplace(
            std::make_pair(query[i], std::vector<PredictionResult>()));
      }
-      predictions[query[i]].push_back(PredictionResult(
-          score[i * width + column], label[i], has_weight ? weight[i] : 1.0));
+      predictions[query[i]].emplace_back(score[i * width + column], label[i],
+                                         weight_t != nullptr ? weight[i] : 1.0);
    }

    // for each query, accumulate pair counts

--- a/paddle/operators/precision_recall_op.cc
+++ b/paddle/operators/precision_recall_op.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/precision_recall_op.h"
+
+namespace paddle {
+namespace operators {
+
+class PrecisionRecallOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+  void InferShape(framework::InferShapeContext *ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("MaxProbs"),
+                   "Input(MaxProbs) should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Indices"),
+                   "Input(Indices) should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Labels"),
+                   "Input(Labels) should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("BatchMetrics"),
+                   "Output(BatchMetrics) should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("AccumMetrics"),
+                   "Output(AccumMetrics) should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("AccumStatesInfo"),
+                   "Output(AccumStatesInfo) should not be null.");
+
+    int64_t cls_num =
+        static_cast<int64_t>(ctx->Attrs().Get<int>("class_number"));
+    auto max_probs_dims = ctx->GetInputDim("MaxProbs");
+    auto labels_dims = ctx->GetInputDim("Labels");
+
+    PADDLE_ENFORCE_EQ(max_probs_dims[1], 1,
+                      "Each instance contains one max probability, so the "
+                      "shape of Input(MaxProbs) should be [batch_size, 1].");
+    PADDLE_ENFORCE_EQ(ctx->GetInputDim("Indices"), max_probs_dims,
+                      "The shape of Input(Indices) should be [batch_size, 1].");
+    PADDLE_ENFORCE_EQ(max_probs_dims[0], labels_dims[0],
+                      "The 1st dimension of Input(MaxProbs) and "
+                      "Input(Labels) both are batch_size and the shape should "
+                      "be the same.");
+    PADDLE_ENFORCE_EQ(labels_dims[1], 1,
+                      "The 2nd dimension of Input(Labels) contains instance "
+                      "label and the shape should be equal to 1.");
+    if (ctx->HasInput("Weights")) {
+      auto weights_dims = ctx->GetInputDim("Weights");
+      PADDLE_ENFORCE_EQ(weights_dims,
+                        framework::make_ddim({max_probs_dims[0], 1}),
+                        "The shape of Input(Weights) should be "
+                        "[batch_size, 1].");
+    }
+    if (ctx->HasInput("StatesInfo")) {
+      auto states_dims = ctx->GetInputDim("StatesInfo");
+      PADDLE_ENFORCE_EQ(states_dims, framework::make_ddim({cls_num, 4}),
+                        "The shape of Input(StatesInfo) should be "
+                        "[class_number, 4].");
+    }
+
+    // Layouts of BatchMetrics and AccumMetrics both are:
+    // [
+    //  macro average precision, macro average recall, macro average F1 score,
+    //  micro average precision, micro average recall, micro average F1 score
+    // ]
+    ctx->SetOutputDim("BatchMetrics", {6});
+    ctx->SetOutputDim("AccumMetrics", {6});
+    // Shape of AccumStatesInfo is [class_number, 4]
+    // The layout of each row is:
+    // [ TP, FP, TN, FN ]
+    ctx->SetOutputDim("AccumStatesInfo", {cls_num, 4});
+  }
+
+ protected:
+  framework::DataType IndicateDataType(
+      const framework::ExecutionContext &ctx) const override {
+    return framework::ToDataType(ctx.Input<Tensor>("MaxProbs")->type());
+  }
+};
+
+class PrecisionRecallOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  PrecisionRecallOpMaker(framework::OpProto *proto,
+                         framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("MaxProbs",
+             "(Tensor, default Tensor<float>), a 2-D tensor with shape N x 1, "
+             "where N is the batch size. Each row contains the max probability "
+             "of an instance which computed by the previous top_k (k=1) "
+             "operator.");
+    AddInput("Indices",
+             "(Tensor, default Tensor<int>), a 2-D tensor with shape N x 1, "
+             "where N is the batch size. Each row contains the corresponding "
+             "index which computed by the previous top_k (k=1) operator.");
+    AddInput("Labels",
+             "(Tensor, default Tensor<int>), a 2-D tensor with shape N x 1, "
+             "where N is the batch size. Each element is a label and the "
+             "value should be in [0, class_number - 1].");
+    AddInput("Weights",
+             "(Tensor, default Tensor<float>), a 2-D tensor with shape N x 1, "
+             "where N is the batch size. This input is optional. If provided, "
+             "weight of instance would be considered when computing metrics.")
+        .AsDispensable();
+    AddInput("StatesInfo",
+             "(Tensor, default Tensor<int>), a 2-D tensor with shape D x 4, "
+             "where D is the number of classes. This input is optional. If "
+             "provided, current state will be accumulated to this state and "
+             "the accumulation state will be as the output state.")
+        .AsDispensable();
+    AddOutput("BatchMetrics",
+              "(Tensor, default Tensor<float>), a 1-D tensor with shape {6}."
+              "This output tensor contains metrics for current batch data."
+              "The layout is [macro average precision, macro average recall, "
+              "macro f1 score, micro average precision, micro average recall, "
+              "micro f1 score]");
+    AddOutput("AccumMetrics",
+              "(Tensor, default Tensor<float>), a 1-D tensor with shape {6}."
+              "This output tensor contains metrics for accumulated data."
+              "The layout is [macro average precision, macro average recall, "
+              "macro f1 score, micro average precision, micro average recall, "
+              "micro f1 score]");
+    AddOutput("AccumStatesInfo",
+              "(Tensor, default Tensor<float>), a 2-D tensor with shape D x 4, "
+              "where D is equal to class number. This output tensor contains "
+              "accumulated state variables used to compute metrics. The layout "
+              "for each class is [true positives, false positives, "
+              "true negatives, false negatives].");
+    AddAttr<int>("class_number", "Number of classes to be evaluated.");
+    AddComment(R"DOC(
+When given 'Input(Indices)' and 'Input(Labels)', this operator can be used
+to compute various metrics including:
+  - macro average precision
+  - macro average recall
+  - macro f1 score
+  - micro average precision
+  - micro average recall
+  - micro f1 score
+
+To compute the above metrics, we need to do statistics for true positives,
+false positives and false negatives. Here count of true negatives is not
+necessary, but counting it may provide potential usage and the cost is
+trivial, so the operator also provides count of true negatives.
+
+We define state as a 2-D tensor with shape [class_number, 4]. Each row of a
+state contains statistic variables for corresponding class. Layout of each row
+is: TP(true positives), FP(false positives), TN(true negatives),
+FN(false negatives). If 'Input(Weights)' provided, TP, FP, TN, FN will be
+calculated by given weight instead of instance count.
+
+This operator also supports metrics computing for cross-batch situation. To
+achieve this, 'Input(StatesInfo)' should be provided. State of current batch
+data will be accumulated to 'Input(StatesInfo)' and 'Output(AccumStatesInfo)'
+is the accumulation state.
+
+'Output(BatchMetrics)' is metrics of current batch data while
+'Output(AccumStatesInfo)' is metrics of accumulation data.
+
+)DOC");
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP_WITHOUT_GRADIENT(precision_recall, ops::PrecisionRecallOp,
+                             ops::PrecisionRecallOpMaker);
+REGISTER_OP_CPU_KERNEL(
+    precision_recall,
+    ops::PrecisionRecallKernel<paddle::platform::CPUPlace, float>,
+    ops::PrecisionRecallKernel<paddle::platform::CPUPlace, double>);
--- a/paddle/operators/precision_recall_op.h
+++ b/paddle/operators/precision_recall_op.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "paddle/framework/eigen.h"
+#include "paddle/framework/op_registry.h"
+
+namespace paddle {
+namespace operators {
+
+using Tensor = framework::Tensor;
+template <typename T, int MajorType = Eigen::RowMajor,
+          typename IndexType = Eigen::DenseIndex>
+using EigenMatrix = framework::EigenMatrix<T, MajorType, IndexType>;
+
+enum StateVariable { TP = 0, FP, TN, FN };
+
+template <typename Place, typename T>
+class PrecisionRecallKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto* in0 = ctx.Input<Tensor>("Indices");
+    auto* in1 = ctx.Input<Tensor>("Labels");
+    auto* in2 = ctx.Input<Tensor>("Weights");
+    auto* in3 = ctx.Input<Tensor>("StatesInfo");
+    auto* out0 = ctx.Output<Tensor>("BatchMetrics");
+    auto* out1 = ctx.Output<Tensor>("AccumMetrics");
+    auto* out2 = ctx.Output<Tensor>("AccumStatesInfo");
+
+    const int* ids_data = in0->data<int>();
+    const int* labels_data = in1->data<int>();
+    size_t cls_num = static_cast<size_t>(ctx.Attr<int>("class_number"));
+    const T* weights_data = in2 ? in2->data<T>() : nullptr;
+    const T* states_data = in3 ? in3->data<T>() : nullptr;
+    double* batch_metrics_data = out0->mutable_data<double>(ctx.GetPlace());
+    double* accum_metrics_data = out1->mutable_data<double>(ctx.GetPlace());
+    out2->mutable_data<T>(ctx.GetPlace());
+    auto accum_states = EigenMatrix<T>::From(*out2);
+    accum_states.setZero();
+    T* accum_states_data = out2->data<T>();
+
+    size_t sample_num = in0->dims()[0];
+    size_t state_var_num = 4;  // TP FP TN FN
+
+    // get states info for current batch
+    for (size_t i = 0; i < sample_num; ++i) {
+      size_t idx = ids_data[i];
+      size_t label = labels_data[i];
+
+      PADDLE_ENFORCE(idx >= 0 && idx < cls_num,
+                     "Class index of each instance should be in "
+                     "[0, class_number).");
+      PADDLE_ENFORCE(label >= 0 && label < cls_num,
+                     "Label of each instance should be in [0, class_number).");
+
+      T w = weights_data ? weights_data[i] : 1.0;
+      if (idx == label) {
+        accum_states_data[idx * state_var_num + TP] += w;
+        for (size_t j = 0; j < cls_num; ++j) {
+          accum_states_data[j * state_var_num + TN] += w;
+        }
+        accum_states_data[idx * state_var_num + TN] -= w;
+      } else {
+        accum_states_data[label * state_var_num + FN] += w;
+        accum_states_data[idx * state_var_num + FP] += w;
+        for (size_t j = 0; j < cls_num; ++j) {
+          accum_states_data[j * state_var_num + TN] += w;
+        }
+        accum_states_data[idx * state_var_num + TN] -= w;
+        accum_states_data[label * state_var_num + TN] -= w;
+      }
+    }
+
+    ComputeMetrics(accum_states_data, batch_metrics_data, state_var_num,
+                   cls_num);
+
+    if (states_data) {
+      for (size_t i = 0; i < cls_num; ++i) {
+        for (size_t j = 0; j < state_var_num; ++j) {
+          size_t idx = i * state_var_num + j;
+          accum_states_data[idx] += states_data[idx];
+        }
+      }
+    }
+
+    ComputeMetrics(accum_states_data, accum_metrics_data, state_var_num,
+                   cls_num);
+  }
+
+  // expose to be reused
+  static inline T CalcPrecision(T tp_count, T fp_count) {
+    if (tp_count > 0.0 || fp_count > 0.0) {
+      return tp_count / (tp_count + fp_count);
+    }
+    return 1.0;
+  }
+
+  static inline T CalcRecall(T tp_count, T fn_count) {
+    if (tp_count > 0.0 || fn_count > 0.0) {
+      return tp_count / (tp_count + fn_count);
+    }
+    return 1.0;
+  }
+
+  static inline T CalcF1Score(T precision, T recall) {
+    if (precision > 0.0 || recall > 0.0) {
+      return 2 * precision * recall / (precision + recall);
+    }
+    return 0.0;
+  }
+
+ protected:
+  void ComputeMetrics(const T* states_data, double* metrics_data,
+                      size_t state_var_num, size_t cls_num) const {
+    T total_tp_count = 0;
+    T total_fp_count = 0;
+    T total_fn_count = 0;
+    T macro_avg_precision = 0.0;
+    T macro_avg_recall = 0.0;
+
+    for (size_t i = 0; i < cls_num; ++i) {
+      T tp_count = states_data[i * state_var_num + TP];
+      T fp_count = states_data[i * state_var_num + FP];
+      T fn_count = states_data[i * state_var_num + FN];
+      total_tp_count += tp_count;
+      total_fp_count += fp_count;
+      total_fn_count += fn_count;
+      macro_avg_precision += CalcPrecision(tp_count, fp_count);
+      macro_avg_recall += CalcRecall(tp_count, fn_count);
+    }
+    macro_avg_precision /= cls_num;
+    macro_avg_recall /= cls_num;
+    T macro_f1_score = CalcF1Score(macro_avg_precision, macro_avg_recall);
+
+    T micro_avg_precision = CalcPrecision(total_tp_count, total_fp_count);
+    T micro_avg_recall = CalcRecall(total_tp_count, total_fn_count);
+    T micro_f1_score = CalcF1Score(micro_avg_precision, micro_avg_recall);
+
+    // fill metrics data
+    metrics_data[0] = macro_avg_precision;
+    metrics_data[1] = macro_avg_recall;
+    metrics_data[2] = macro_f1_score;
+    metrics_data[3] = micro_avg_precision;
+    metrics_data[4] = micro_avg_recall;
+    metrics_data[5] = micro_f1_score;
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
--- a/paddle/operators/recurrent_op.cc
+++ b/paddle/operators/recurrent_op.cc
@@ -12,181 +12,618 @@
   See the License for the specific language governing permissions and
   limitations under the License. */

-#include "paddle/operators/recurrent_op.h"
-
-#include <cstring>
-#include <sstream>
-
+#include <vector>
+#include "paddle/framework/executor.h"
 #include "paddle/framework/op_registry.h"
-#include "paddle/operators/net_op.h"

 namespace paddle {
 namespace operators {
+constexpr char kInputs[] = "inputs";
+constexpr char kInitialStates[] = "initial_states";
+constexpr char kParameters[] = "parameters";
+constexpr char kOutputs[] = "outputs";
+constexpr char kStepScopes[] = "step_scopes";
+constexpr char kExStates[] = "ex_states";
+constexpr char kStates[] = "states";
+constexpr char kStepBlock[] = "step_block";
+constexpr char kReverse[] = "reverse";
+constexpr char kIsTrain[] = "is_train";
+#define GRAD_SUFFIX "@GRAD"
+constexpr char kInputGrads[] = "inputs" GRAD_SUFFIX;
+constexpr char kOutputGrads[] = "outputs" GRAD_SUFFIX;
+constexpr char kParamGrads[] = "parameters" GRAD_SUFFIX;
+constexpr char kInitStateGrads[] = "initial_states" GRAD_SUFFIX;

-using Scope = framework::Scope;
-using Variable = framework::Variable;
-using Tensor = framework::Tensor;
-using LoDTensor = framework::LoDTensor;
-
-void RecurrentAlgorithm::Run(const Scope& scope,
-                             const platform::DeviceContext& dev_ctx) const {
-  auto* input0 = scope.FindVar(arg_->inlinks[0]);
-  PADDLE_ENFORCE_NOT_NULL(input0);
-  size_t seq_len = input0->GetMutable<LoDTensor>()->dims()[0];
-  PADDLE_ENFORCE_GT(seq_len, 0);
-
-  CreateScopes(scope, seq_len);
-  auto& step_scopes = GetStepScopes(scope);
-  rnn::SegmentInputs(step_scopes, arg_->inlinks, seq_len);
-  InitMemories(step_scopes[0]);
-
-  for (size_t step_id = 0; step_id < seq_len; step_id++) {
-    if (step_id > 0) {
-      rnn::LinkMemories(step_scopes, arg_->states, step_id, -1);
+using StepScopeVar = std::vector<framework::Scope *>;
+
+// StepScopes manages scopes inside RNN.
+//    StepScopes::CurScope() get the current scope
+//    StepScopes::ExScope() get the ex-scope, or scope in previous time step.
+//    StepScopes::Next() move to next time step.
+//
+// if is_train = False, then
+//   there are two scopes for the RNN and just support forward.
+// else
+//   the len(scopes) == seq_len
+//
+// if is_backward = True, then
+//   reversely access scopes
+// else
+//   access scopes from begin to end.
+class StepScopes {
+ public:
+  StepScopes(const framework::Scope &parent, StepScopeVar *scopes,
+             bool is_train, size_t seq_len, bool is_backward = false)
+      : counter_(is_backward ? seq_len - 1 : 0UL),
+        scopes_(scopes),
+        is_train_(is_train),
+        is_backward_(is_backward) {
+    size_t num_step_scopes = is_train ? seq_len : 2;
+    PADDLE_ENFORCE(is_train || !is_backward,
+                   "Cannot backward when is not training");
+    if (!is_backward_) {
+      PADDLE_ENFORCE(scopes->empty());
+      scopes->reserve(static_cast<size_t>(num_step_scopes));
+      for (size_t i = 0; i < num_step_scopes; ++i) {
+        scopes->emplace_back(&parent.NewScope());
+      }
    }
-    (*stepnet_)->Run(*step_scopes[step_id], dev_ctx);
-  }
-  rnn::ConcatOutputs(step_scopes, arg_->outlinks, seq_len, dev_ctx);
-}
-
-void RecurrentAlgorithm::CreateScopes(const Scope& scope,
-                                      size_t seq_len) const {
-  // TODO(superjom) Only two scopes are needed for inference, this case will be
-  // supported later.
-  auto* step_scopes_var = scope.FindVar(arg_->step_scopes);
-  PADDLE_ENFORCE(step_scopes_var != nullptr, "");
-  auto* step_scopes = step_scopes_var->GetMutable<std::vector<Scope*>>();
-
-  // Now all variables in scope must be created outside of op.
-  PADDLE_ENFORCE_NOT_NULL(stepnet_);
-  PADDLE_ENFORCE(!(*stepnet_)->Outputs().empty(),
-                 "step_unit_ op has no outputs");
-
-  if (seq_len > step_scopes->size()) {
-    for (size_t i = step_scopes->size(); i < seq_len; ++i) {
-      auto& step_scope = scope.NewScope();
-
-      // create step net's temp inputs
-      for (auto& input : (*stepnet_)->Inputs()) {
-        // the weight are located in parent scope
-        for (auto& var_name : input.second) {
-          if (!step_scope.FindVar(var_name)) {
-            step_scope.Var(var_name)->GetMutable<LoDTensor>();
-          }
+  }
+
+  framework::Scope &CurScope() { return GetScope(counter_); }
+
+  framework::Scope &ExScope() {
+    auto &scope = GetScope(is_backward_ ? counter_ + 1 : counter_ - 1);
+    return scope;
+  }
+
+  void Next() {
+    if (is_backward_) {
+      --counter_;
+    } else {
+      ++counter_;
+    }
+  }
+
+ private:
+  framework::Scope &GetScope(size_t scope_id) const {
+    if (!is_train_) {
+      scope_id %= 2;
+    }
+    PADDLE_ENFORCE_LT(scope_id, scopes_->size());
+    return *(*scopes_)[scope_id];
+  }
+
+  size_t counter_;
+  StepScopeVar *scopes_;
+  bool is_train_;
+  bool is_backward_;
+};
+
+// Base class for RecurrentOp/RecurrentGradOp
+//    Some common protected functions for RecurrentOp/RecurrentGradOp
+class RecurrentBase : public framework::OperatorBase {
+ public:
+  RecurrentBase(const std::string &type,
+                const framework::VariableNameMap &inputs,
+                const framework::VariableNameMap &outputs,
+                const framework::AttributeMap &attrs)
+      : OperatorBase(type, inputs, outputs, attrs) {}
+
+ protected:
+  // Get SequenceLength from Scope
+  //   The sequence length is got from input tensor. The input tensor's
+  //   dimension should be [SEQ_LEN, ..., ...]. The first of the tensor's shape
+  //   is SEQ_LEN. The second of the tensor's shape could be the batch size or
+  //   nested sequence length.
+  int64_t GetSequenceLength(const framework::Scope &scope) const {
+    // Dim format SEQ_LEN, BATCH_SIZE, ...
+    int64_t seq_len = -1;
+    auto &all_inputs = Inputs(kInputs);
+    PADDLE_ENFORCE(!all_inputs.empty());
+    for (auto &iname : all_inputs) {
+      auto *var = scope.FindVar(iname);
+      PADDLE_ENFORCE(var != nullptr);
+      PADDLE_ENFORCE(var->IsType<framework::LoDTensor>());
+      auto &dim = var->Get<framework::LoDTensor>().dims();
+      if (seq_len == -1) {
+        seq_len = dim[0];
+      } else {
+        PADDLE_ENFORCE_EQ(seq_len, dim[0]);
+      }
+    }
+    return seq_len;
+  }
+
+  // for src_tensor, dst_tensor in zip(map(src_scope.FindVar, src_vars),
+  //                                   map(dst_scope.Var, dst_vars)):
+  //   dst_tensor.ShareDataWith(src_tensor)
+  static void LinkTensor(const framework::Scope &src_scope,
+                         const std::vector<std::string> &src_vars,
+                         framework::Scope *dst_scope,
+                         const std::vector<std::string> &dst_vars) {
+    LinkTensorWithCallback(
+        src_scope, src_vars, dst_scope, dst_vars,
+        [&](const framework::Tensor &src, framework::Tensor *dst) {
+          dst->ShareDataWith(src);
+        });
+  }
+
+  // for src_tensor, dst_tensor in zip(map(src_scope.FindVar, src_vars),
+  //                                   map(dst_scope.Var, dst_vars)):
+  //   callback(src_tensor, &dst_tensor)
+  template <typename Callback>
+  static void LinkTensorWithCallback(const framework::Scope &src_scope,
+                                     const std::vector<std::string> &src_vars,
+                                     framework::Scope *dst_scope,
+                                     const std::vector<std::string> &dst_vars,
+                                     Callback callback) {
+    PADDLE_ENFORCE_EQ(src_vars.size(), dst_vars.size());
+    for (size_t i = 0; i < dst_vars.size(); ++i) {
+      VLOG(10) << "Link " << src_vars[i] << " to " << dst_vars[i];
+      AccessTensor(src_scope, src_vars[i], dst_scope, dst_vars[i], callback);
+    }
+  }
+
+  // for src_tensor, dst_tensor in zip(map(src_scope.FindVar, src_vars),
+  //                                   map(dst_scope.FindVar, dst_vars)):
+  //   callback(src_tensor, &dst_tensor)
+  template <typename Callback>
+  static void LinkTensorWithCallback(const framework::Scope &src_scope,
+                                     const std::vector<std::string> &src_vars,
+                                     const framework::Scope &dst_scope,
+                                     const std::vector<std::string> &dst_vars,
+                                     Callback callback) {
+    PADDLE_ENFORCE_EQ(src_vars.size(), dst_vars.size());
+    for (size_t i = 0; i < dst_vars.size(); ++i) {
+      VLOG(10) << "Link " << src_vars[i] << " to " << dst_vars[i];
+      AccessTensor(src_scope, src_vars[i], dst_scope, dst_vars[i], callback);
+    }
+  }
+
+  // (seq_len, shape) -> return [seq_len] + list(shape)
+  static framework::DDim PrependDims(size_t seq_len,
+                                     const framework::DDim &src) {
+    auto dims = framework::vectorize(src);
+    dims.insert(dims.begin(), static_cast<int64_t>(seq_len));
+    return framework::make_ddim(dims);
+  }
+
+ private:
+  template <typename Callback>
+  static void AccessTensor(const framework::Scope &src_scope,
+                           const std::string &src_var_name,
+                           framework::Scope *dst_scope,
+                           const std::string &dst_var_name, Callback callback) {
+    auto *src_var = src_scope.FindVar(src_var_name);
+    PADDLE_ENFORCE(src_var != nullptr);
+    auto &src_tensor = src_var->Get<framework::LoDTensor>();
+
+    auto *dst_var = dst_scope->Var(dst_var_name);
+    auto *dst_tensor = dst_var->GetMutable<framework::LoDTensor>();
+    callback(src_tensor, dst_tensor);
+  }
+
+  template <typename Callback>
+  static void AccessTensor(const framework::Scope &src_scope,
+                           const std::string &src_var_name,
+                           const framework::Scope &dst_scope,
+                           const std::string &dst_var_name, Callback callback) {
+    auto *src_var = src_scope.FindVar(src_var_name);
+    PADDLE_ENFORCE(src_var != nullptr);
+    auto &src_tensor = src_var->Get<framework::LoDTensor>();
+    auto *dst_var = dst_scope.FindVar(dst_var_name);
+    PADDLE_ENFORCE(dst_var != nullptr);
+    auto *dst_tensor = dst_var->GetMutable<framework::LoDTensor>();
+    callback(src_tensor, dst_tensor);
+  }
+};
+
+class RecurrentOp : public RecurrentBase {
+ public:
+  RecurrentOp(const std::string &type, const framework::VariableNameMap &inputs,
+              const framework::VariableNameMap &outputs,
+              const framework::AttributeMap &attrs)
+      : RecurrentBase(type, inputs, outputs, attrs) {}
+
+  void Run(const framework::Scope &scope,
+           const platform::DeviceContext &dev_ctx) const override {
+    auto seq_len = static_cast<size_t>(this->GetSequenceLength(scope));
+    VLOG(3) << "Static RNN input sequence length = " << seq_len;
+    StepScopes scopes = CreateStepScopes(scope, seq_len);
+    auto reverse = Attr<bool>(kReverse);
+
+    framework::Executor executor(dev_ctx);
+    auto *block = Attr<framework::BlockDescBind *>(kStepBlock);
+    auto *program = block->Program();
+
+    for (size_t i = 0; i < seq_len; ++i) {
+      size_t seq_offset = reverse ? seq_len - i - 1 : i;
+      VLOG(3) << "Recurrent operate at the time step " << seq_offset;
+
+      auto &cur_scope = scopes.CurScope();
+
+      // Link outside::input --> inside::input
+      //   inside::input = outside::input[seq_offset: seq_offset+1]
+      LinkTensorWithCallback(
+          scope, Inputs(kInputs), &cur_scope, Inputs(kInputs),
+          [&seq_offset](const framework::Tensor &outside,
+                        framework::Tensor *inside) {
+            inside->ShareDataWith(outside.Slice(seq_offset, seq_offset + 1));
+            auto dims = framework::vectorize(inside->dims());
+            dims.erase(dims.begin());
+            inside->Resize(framework::make_ddim(dims));
+          });
+
+      if (i == 0) {
+        // Link initial states  --> ex_states
+        LinkTensor(scope, Inputs(kInitialStates), &cur_scope,
+                   Attr<std::vector<std::string>>(kExStates));
+      } else {
+        auto &ex_scope = scopes.ExScope();
+        // Link ex_scope::state --> cur_scope::ex_state
+        LinkTensor(ex_scope, Attr<std::vector<std::string>>(kStates),
+                   &cur_scope, Attr<std::vector<std::string>>(kExStates));
+      }
+
+      // Every inputs are linked now, execute!
+      executor.Run(*program, &cur_scope, block->ID(),
+                   false /*create_local_scope*/);
+
+      // Copy inside::output -> outside::output
+      //    outside::output[seq_offset: seq_offset + 1] = inside::output
+      this->LinkTensorWithCallback(
+          cur_scope, Outputs(kOutputs), scope, Outputs(kOutputs),
+          [&](const framework::LoDTensor &src_tensor,
+              framework::LoDTensor *dst_tensor) {
+            if (i == 0) {  // create output tensor at begin
+              dst_tensor->Resize(PrependDims(seq_len, src_tensor.dims()));
+              dst_tensor->mutable_data(dev_ctx.GetPlace(), src_tensor.type());
+            }
+
+            auto dst_out = dst_tensor->Slice(seq_offset, seq_offset + 1);
+            // Explicit copy output since the local RNN scope can be destroyed
+            // early.
+            dst_out.CopyFrom(src_tensor, dev_ctx.GetPlace(), dev_ctx);
+          });
+
+      scopes.Next();
+    }
+  }
+
+ private:
+  StepScopes CreateStepScopes(const framework::Scope &scope,
+                              size_t seq_len) const {
+    auto *var = scope.FindVar(Output(kStepScopes));
+    PADDLE_ENFORCE(var != nullptr);
+    return StepScopes(scope, var->GetMutable<StepScopeVar>(),
+                      Attr<bool>(kIsTrain), seq_len);
+  }
+};
+
+class RecurrentGradOp : public RecurrentBase {
+ public:
+  RecurrentGradOp(const std::string &type,
+                  const framework::VariableNameMap &inputs,
+                  const framework::VariableNameMap &outputs,
+                  const framework::AttributeMap &attrs)
+      : RecurrentBase(type, inputs, outputs, attrs) {}
+
+  void Run(const framework::Scope &scope,
+           const platform::DeviceContext &dev_ctx) const override {
+    auto seq_len = static_cast<size_t>(GetSequenceLength(scope));
+    StepScopes scopes = CreateStepScopes(scope, seq_len);
+    auto reverse = Attr<bool>(kReverse);
+
+    framework::Executor executor(dev_ctx);
+    auto *block = Attr<framework::BlockDescBind *>(kStepBlock);
+    auto *program = block->Program();
+
+    for (size_t step_id = 0; step_id < seq_len; ++step_id) {
+      size_t seq_offset = reverse ? step_id : seq_len - step_id - 1;
+      VLOG(3) << "Recurrent backward operate at the time step " << seq_offset;
+      auto &cur_scope = scopes.CurScope();
+      // Link outside::output_grads --> inside::output_grads
+      //   inside::output_grad = outside::output_grad[seq_offset:seq_offset+1]
+      LinkTensorWithCallback(
+          scope, Inputs(kOutputGrads), &cur_scope, Inputs(kOutputGrads),
+          [&](const framework::Tensor &outside, framework::Tensor *inside) {
+            inside->ShareDataWith(outside.Slice(seq_offset, seq_offset + 1));
+            auto dims = framework::vectorize(inside->dims());
+            dims.erase(dims.begin());
+            inside->Resize(framework::make_ddim(dims));
+          });
+      auto og_set = List2Set(Inputs(kOutputGrads));
+
+      if (VLOG_IS_ON(10)) {
+        std::ostringstream sout;
+        std::copy(og_set.begin(), og_set.end(),
+                  std::ostream_iterator<std::string>(sout, ","));
+        VLOG(10) << " RNN output gradients = [" << sout.str() << "]";
+      }
+
+      // Link states
+      //   if cur_scope::cur_state_grad in out_grads:
+      //     cur_scope::cur_state_grad += ex_scope::ex_state_grad
+      //   else:
+      //     ex_scope::ex_state_grad --> cur_scope::cur_state_grad
+      if (step_id != 0) {  // not at beginning
+        auto &ex_scope = scopes.ExScope();
+        auto ex_state_grads =
+            GradVarLists(Attr<std::vector<std::string>>(kExStates));
+        auto cur_state_grads =
+            GradVarLists(Attr<std::vector<std::string>>(kStates));
+
+        PADDLE_ENFORCE_EQ(ex_state_grads.size(), cur_state_grads.size());
+        for (size_t i = 0; i < ex_state_grads.size(); ++i) {
+          auto &cur_grad = cur_state_grads[i];
+          auto &ex_grad = ex_state_grads[i];
+          auto &ex_tensor =
+              ex_scope.FindVar(ex_grad)->Get<framework::LoDTensor>();
+
+          VLOG(10) << " RNN link " << cur_grad << " from " << ex_grad;
+          auto *cur_grad_var = cur_scope.Var(cur_grad);
+          auto cur_grad_tensor =
+              cur_grad_var->GetMutable<framework::LoDTensor>();
+          cur_grad_tensor->CopyFrom(ex_tensor, dev_ctx.GetPlace(), dev_ctx);
        }
      }
-      // create stepnet's outputs
-      for (const auto& output : (*stepnet_)->Outputs()) {
-        for (auto& var_name : output.second) {
-          step_scope.Var(var_name);
+
+      VLOG(5) << "Recurrent memory linking finished ";
+      // Run step block with cur_scope
+      executor.Run(*program, &cur_scope, block->ID(),
+                   false /*create_local_scope*/);
+
+      VLOG(5) << "executor.Run finished ";
+
+      auto local_var_names = LocalVarNames(cur_scope);
+
+      // Accumulate params
+      //   if (step == 0):
+      //      outside::param_grad = 0.0
+      //   outside::param_grad += inside::param_grad
+      {
+        auto &pg_names = Outputs(kParamGrads);
+        auto &p_names = Inputs(kParameters);
+        PADDLE_ENFORCE_EQ(pg_names.size(), p_names.size());
+
+        for (size_t prog_id = 0; prog_id < pg_names.size(); ++prog_id) {
+          auto inside_grad_name = framework::GradVarName(p_names[prog_id]);
+
+          // If does not compute gradient of that variable inside rnn, just
+          // continue
+          if (local_var_names.find(inside_grad_name) == local_var_names.end()) {
+            continue;
+          }
+
+          // zero gradient variable in step 0
+          if (step_id == 0) {
+            auto &inside_tensor = cur_scope.FindVar(inside_grad_name)
+                                      ->Get<framework::LoDTensor>();
+            framework::AttributeMap attrs;
+            attrs["data_type"] = framework::ToDataType(inside_tensor.type());
+            attrs["shape"] = framework::vectorize2int(inside_tensor.dims());
+            attrs["value"] = 0.0f;
+
+            auto zero_op = framework::OpRegistry::CreateOp(
+                "fill_constant", {}, {{"Out", {pg_names[prog_id]}}}, attrs);
+            zero_op->Run(scope, dev_ctx);
+          }
+
+          // sum gradient
+          auto *outside_var = scope.FindVar(pg_names[prog_id]);
+          PADDLE_ENFORCE(outside_var != nullptr);
+          auto &outside_tensor =
+              *outside_var->GetMutable<framework::LoDTensor>();
+
+          std::string result_var_name;
+          auto *local_result_var = cur_scope.Var(&result_var_name);
+          auto &local_result_tensor =
+              *local_result_var->GetMutable<framework::LoDTensor>();
+
+          local_result_tensor.ShareDataWith(outside_tensor);
+
+          auto sum_op = framework::OpRegistry::CreateOp(
+              "sum", {{"X", {result_var_name, inside_grad_name}}},
+              {{"Out", {result_var_name}}}, {});
+          sum_op->Run(cur_scope, dev_ctx);
        }
      }
-      step_scopes->emplace_back(&step_scope);
+      VLOG(5) << "Accumulate Parameter finished ";
+
+      // Copy input gradient from inside to outside
+      //   outside::input_grad[seq_offset: seq_offset + 1] = inside::input_grad
+      LinkTensorWithCallback(
+          cur_scope, GradVarLists(Inputs(kInputs)), scope, Outputs(kInputGrads),
+          [&](const framework::LoDTensor &inside,
+              framework::LoDTensor *outside) {
+            if (inside.memory_size() == 0) {  // IG is not created.
+              return;
+            }
+            if (step_id == 0) {  // alloc memory
+              outside->Resize(PrependDims(seq_len, inside.dims()));
+              outside->mutable_data(dev_ctx.GetPlace(), inside.type());
+            }
+
+            auto dst = outside->Slice(seq_offset, seq_offset + 1);
+            dst.CopyFrom(inside, dev_ctx.GetPlace(), dev_ctx);
+          });
+      VLOG(5) << "Link outside gradient finished ";
+
+      if (step_id + 1 == seq_len) {  // at_end
+        // copy initialize states gradient from inside to outside
+        LinkTensorWithCallback(
+            cur_scope, GradVarLists(Attr<std::vector<std::string>>(kExStates)),
+            scope, Outputs(kInitStateGrads),
+            [&](const framework::LoDTensor &inside,
+                framework::LoDTensor *outside) {
+              outside->Resize(inside.dims());
+              outside->mutable_data(dev_ctx.GetPlace(), inside.type());
+              outside->CopyFrom(inside, dev_ctx.GetPlace(), dev_ctx);
+            });
+        VLOG(5) << "Link initialize state gradient finished ";
+      }
+      scopes.Next();
    }
  }
-}
-
-void RecurrentAlgorithm::InitMemories(Scope* step_scope) const {
-  for (auto& attr : arg_->states) {
-    auto* pre_mem = step_scope->Var(attr.pre_var)->GetMutable<LoDTensor>();
-    PADDLE_ENFORCE(step_scope->FindVar(attr.boot_var) != nullptr,
-                   "memory [%s]'s boot variable [%s] not exists", attr.var,
-                   attr.boot_var);
-    auto* boot_mem =
-        step_scope->FindVar(attr.boot_var)->GetMutable<LoDTensor>();
-    pre_mem->Resize(boot_mem->dims());
-    PADDLE_ENFORCE_EQ(pre_mem->dims().size(), 2);
-    pre_mem->ShareDataWith(*boot_mem);
-  }
-}
-
-const rnn::ArgumentName RecurrentOp::kArgName{
-    "step_net", "step_scopes", "inputs",        "outputs",
-    "states",   "ex_states",   "initial_states"};
-
-const rnn::ArgumentName RecurrentGradientOp::kArgName{
-    "step_net", "step_scopes@GRAD", "outputs@GRAD",       "inputs@GRAD",
-    "states",   "ex_states",        "initial_states@GRAD"};
-
-RecurrentOp::RecurrentOp(const std::string& type,
-                         const framework::VariableNameMap& inputs,
-                         const framework::VariableNameMap& outputs,
-                         const framework::AttributeMap& attrs)
-    : OperatorBase(type, inputs, outputs, attrs) {
-  rnn::InitArgument(kArgName, &arg_, *this);
-  alg_.Init(&arg_, &stepnet_);
-}
-
-class RecurrentAlgorithmProtoAndCheckerMaker
-    : public framework::OpProtoAndCheckerMaker {
+
+ private:
+  StepScopes CreateStepScopes(const framework::Scope &scope,
+                              size_t seq_len) const {
+    auto *var = scope.FindVar(Input(kStepScopes));
+    PADDLE_ENFORCE(var != nullptr);
+    return StepScopes(scope, var->GetMutable<StepScopeVar>(),
+                      Attr<bool>(kIsTrain), seq_len, true /*is_backward*/);
+  }
+
+  std::unordered_set<std::string> List2Set(
+      const std::vector<std::string> &list) const {
+    std::unordered_set<std::string> local_var_name_set;
+    local_var_name_set.reserve(list.size());
+    for (auto &each : list) {
+      local_var_name_set.insert(each);
+    }
+    return local_var_name_set;
+  }
+
+  std::unordered_set<std::string> LocalVarNames(
+      const framework::Scope &scope) const {
+    return this->List2Set(scope.GetAllNames(false));
+  }
+  static std::vector<std::string> GradVarLists(
+      const std::vector<std::string> &var_names) {
+    std::vector<std::string> retv;
+    retv.reserve(var_names.size());
+    std::transform(var_names.begin(), var_names.end(), std::back_inserter(retv),
+                   framework::GradVarName);
+    return retv;
+  }
+};
+
+class RecurrentOpProtoMaker : public framework::OpProtoAndCheckerMaker {
 public:
-  RecurrentAlgorithmProtoAndCheckerMaker(framework::OpProto* proto,
-                                         framework::OpAttrChecker* op_checker)
+  RecurrentOpProtoMaker(framework::OpProto *proto,
+                        framework::OpAttrChecker *op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
-    const auto& name = RecurrentOp::kArgName;
-    // inputs and outputs stored in proto
-    AddInput(name.inlinks,
-             "the inputs that need to be segmented for each step.")
+    AddInput(kInputs, "rnn inputs").AsDuplicable();
+    AddInput(kInitialStates, "rnn initial states").AsDuplicable();
+    AddInput(kParameters,
+             "Parameters are used by step block as its input. However, the "
+             "inputs is not a sequence tensor. Every time step, each operator "
+             "in step block just use the parameter directly")
        .AsDuplicable();
-    AddInput(name.initial_states, "variables to initialize states.")
+    AddOutput(kOutputs,
+              "The output sequence of RNN. The sequence length must be same")
        .AsDuplicable();
+    AddOutput(kStepScopes,
+              "StepScopes contains all local variables in each time step.");
+    AddAttr<std::vector<std::string>>(kExStates,
+                                      string::Sprintf(
+                                          R"DOC(The ex-state variable names.
+The ex-state means the state value in the ex-timestep or the previous time step
+[%s, %s, %s] must be the same order)DOC",
+                                          kExStates, kStates, kInitStateGrads));
+    AddAttr<std::vector<std::string>>(
+        kStates,
+        string::Sprintf(
+            "The state variable names. [%s, %s, %s] must be the same order",
+            kExStates, kStates, kInitStateGrads));
+    AddAttr<framework::BlockDescBind *>(kStepBlock,
+                                        "The step block inside RNN");
+    AddAttr<bool>(kReverse, R"DOC(Calculate RNN reversely or not.
+By default reverse=False

-    AddOutput(name.outlinks, "the outputs that need to concated for all steps.")
-        .AsDuplicable();
-    AddOutput(name.step_scopes, "step scopes");
+Assume the input data is [A, B, C, D]
+
+if reverse is False:
+  the computation of RNN is like
+      A          B          C         D
+      |          |          |         |
+      v          v          v         v
+     rnn -----> rnn -----> rnn ----> rnn
+      |          |          |         |
+      v          v          v         v
+      o          o          o         o
+
+if reverse is True
+  the computation of RNN is like
+      A          B          C         D
+      |          |          |         |
+      v          v          v         v
+     rnn <----- rnn <----- rnn <---- rnn
+      |          |          |         |
+      v          v          v         v
+      o          o          o         o
+)DOC").SetDefault(false);
+    AddAttr<bool>(kIsTrain, "").SetDefault(true);
+    AddComment(R"DOC(Static Length Recurrent Operator
+
+The static length recurrent operator can only operate on fix sized sequence
+data, i.e. in each mini-batch, the sequence length of all inputs are same.
+)DOC");
+  }
+};
+
+class RecurrentGradOpDescMaker : public framework::SingleGradOpDescMaker {
+ public:
+  using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;

-    // Attributes stored in AttributeMap
-    AddAttr<std::vector<std::string>>(name.ex_states, "names of pre-states");
-    AddAttr<std::vector<std::string>>(name.states, "names of states");
+ protected:
+  virtual std::unique_ptr<framework::OpDescBind> Apply() const {
+    auto *grad = new framework::OpDescBind();
+    grad->SetType("recurrent_grad");
+    for (auto &input_param : this->InputNames()) {
+      grad->SetInput(input_param, this->Input(input_param));
+      grad->SetOutput(framework::GradVarName(input_param),
+                      this->InputGrad(input_param));
+    }
+
+    for (auto &output_param : this->OutputNames()) {
+      if (output_param == kStepScopes) {
+        grad->SetInput(output_param, this->Output(output_param));
+        grad->SetInput(framework::GradVarName(output_param),
+                       this->Output(output_param));
+      } else {
+        grad->SetInput(output_param, this->Output(output_param));
+        grad->SetInput(framework::GradVarName(output_param),
+                       this->OutputGrad(output_param));
+      }
+    }
+    grad->SetAttrMap(this->Attrs());
+    grad->SetBlockAttr(kStepBlock, *grad_block_[0]);

-    AddComment("This is a recurrent group operator.");
+    return std::unique_ptr<framework::OpDescBind>(grad);
  }
 };

-void RecurrentGradientAlgorithm::Run(
-    const Scope& scope, const platform::DeviceContext& dev_ctx) const {
-  auto* input0 = scope.FindVar(arg_->inlinks[0]);
-  PADDLE_ENFORCE_NOT_NULL(input0);
-  size_t seq_len = input0->GetMutable<LoDTensor>()->dims()[0];
-  auto& step_scopes = GetStepScopes(scope);
-  rnn::SegmentInputs(step_scopes, arg_->inlinks, seq_len);
-  for (int step_id = seq_len - 1; step_id >= 0; --step_id) {
-    if (static_cast<size_t>(step_id) != seq_len - 1) {
-      rnn::LinkMemories(step_scopes, arg_->states, step_id, 1);
+class RecurrentGradOpShapeInference : public framework::InferShapeBase {
+ public:
+  void operator()(framework::InferShapeContext *ctx) const override {
+    std::vector<std::string> input{kInputs, kInitialStates};
+    std::vector<std::string> output{kOutputs};
+    for (auto &s : input) {
+      PADDLE_ENFORCE(ctx->HasInputs(s));
+      PADDLE_ENFORCE(ctx->HasOutputs(framework::GradVarName(s)));
+    }
+    for (auto &s : output) {
+      PADDLE_ENFORCE(ctx->HasInputs(s));
+    }
+    for (auto &s : input) {
+      ctx->SetOutputsDim(framework::GradVarName(s), ctx->GetInputsDim(s));
    }
-    (*stepnet_)->Run(*step_scopes[step_id], dev_ctx);
-  }
-  rnn::ConcatOutputs(step_scopes, arg_->outlinks, seq_len, dev_ctx);
-  LinkBootMemoryGradients(step_scopes[0]);
-}
-
-void RecurrentGradientAlgorithm::LinkBootMemoryGradients(
-    Scope* step_scope) const {
-  for (auto& attr : arg_->states) {
-    PADDLE_ENFORCE(step_scope->FindVar(attr.var) != nullptr,
-                   "memory variable [%s] does not exists", attr.var);
-    PADDLE_ENFORCE(step_scope->FindVar(attr.boot_var) != nullptr,
-                   "boot variable [%s] does not exists", attr.boot_var);
-    auto* mem_grad = step_scope->Var(attr.var)->GetMutable<LoDTensor>();
-    auto* boot_mem_grad =
-        step_scope->Var(attr.boot_var)->GetMutable<LoDTensor>();
-    boot_mem_grad->Resize(mem_grad->dims());
-    boot_mem_grad->ShareDataWith(*mem_grad);
-  }
-}
-
-RecurrentGradientOp::RecurrentGradientOp(
-    const std::string& type, const framework::VariableNameMap& inputs,
-    const framework::VariableNameMap& outputs,
-    const framework::AttributeMap& attrs)
-    : OperatorBase(type, inputs, outputs, attrs) {
-  rnn::InitArgument(kArgName, &arg_, *this, true /*is grad*/);
-  alg_.Init(&arg_, &stepnet_);
-}
+    if (ctx->HasInputs(kParameters)) {
+      PADDLE_ENFORCE(ctx->HasOutputs(framework::GradVarName(kParameters)));
+      ctx->SetOutputsDim(framework::GradVarName(kParameters),
+                         ctx->GetInputsDim(kParameters));
+    }
+  }
+};

 }  // namespace operators
 }  // namespace paddle

-REGISTER_OP(recurrent, paddle::operators::RecurrentOp,
-            paddle::operators::RecurrentAlgorithmProtoAndCheckerMaker,
-            recurrent_grad, paddle::operators::RecurrentGradientOp);
+REGISTER_OPERATOR(recurrent, paddle::operators::RecurrentOp,
+                  paddle::operators::RecurrentOpProtoMaker,
+                  paddle::operators::RecurrentGradOpDescMaker);
+REGISTER_OPERATOR(recurrent_grad, paddle::operators::RecurrentGradOp,
+                  paddle::operators::RecurrentGradOpShapeInference);
--- a/paddle/operators/recurrent_op.h
+++ b/paddle/operators/recurrent_op.h
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-
-   Licensed under the Apache License, Version 2.0 (the "License");
-   you may not use this file except in compliance with the License.
-   You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License. */
-
-#pragma once
-
-#include "paddle/framework/operator.h"
-#include "paddle/operators/net_op.h"
-#include "paddle/operators/rnn/recurrent_op_utils.h"
-
-namespace paddle {
-namespace operators {
-
-// The sequence format in RecurrentOp is Tensor<seq_len, batch_size, dim> now.
-// TODO(Superjom)
-// 1. No-padding computing for sequences with indifinite length in one batch.
-// 2. Hierarchical RNN for sequence with sub-sequence.
-// 3. Internal Memory.
-// 4. More Complex RNN architecture, such as Gated Feedback RNN.
-//    Refer to: https://arxiv.org/pdf/1502.02367.pdf
-
-class RecurrentAlgorithm {
- public:
-  void Run(const framework::Scope& scope,
-           const platform::DeviceContext& dev_ctx) const;
-
-  void Init(rnn::Argument* arg,
-            std::unique_ptr<framework::OperatorBase>* stepnet) {
-    PADDLE_ENFORCE_NOT_NULL(stepnet, "stepnet should be set before.");
-    arg_ = arg;
-    stepnet_ = stepnet;
-  }
-
- protected:
-  /*
-   * The step scopes will be stored in the father scope as a variable.
-   *
-   * NOTE the scopes are reused in both the forward and backward, so just
-   * create once and expand its size if more steps need.
-   */
-  void CreateScopes(const framework::Scope& scope, size_t seq_len) const;
-
-  const std::vector<framework::Scope*>& GetStepScopes(
-      const framework::Scope& scope) const {
-    return *scope.FindVar(arg_->step_scopes)
-                ->GetMutable<std::vector<framework::Scope*>>();
-  }
-
-  void InitMemories(framework::Scope* step_scopes) const;
-
- private:
-  std::unique_ptr<framework::OperatorBase>* stepnet_;
-  rnn::Argument* arg_;
-};
-
-class RecurrentGradientAlgorithm {
-  /**
-   * RNN's backward alogorithm.
-   *
-   * To accelerate the development of RecurrentGradientOp, we decouple RNN's
-   * algorithm and `OperatorBase`'s implementation, the former contains the core
-   * implementation of a RNN, and will keep stable even if the framework changes
-   * a
-   * lot, and the latter is a wrapper acts like an dapter for it to make RNN an
-   * operator.
-   */
- public:
-  void Init(rnn::Argument* arg,
-            std::unique_ptr<framework::OperatorBase>* stepnet) {
-    PADDLE_ENFORCE_NOT_NULL(stepnet, "stepnet should be set before.");
-    arg_ = std::move(arg);
-    stepnet_ = stepnet;
-  }
-
-  void Run(const framework::Scope& scope,
-           const platform::DeviceContext& dev_ctx) const;
-
-  void LinkBootMemoryGradients(framework::Scope* step_scopes) const;
-
- protected:
-  inline const std::vector<framework::Scope*>& GetStepScopes(
-      const framework::Scope& scope) const {
-    return *scope.FindVar(arg_->step_scopes)
-                ->GetMutable<std::vector<framework::Scope*>>();
-  }
-
- private:
-  rnn::Argument* arg_;
-  std::unique_ptr<framework::OperatorBase>* stepnet_;
-};
-
-class RecurrentOp : public framework::OperatorBase {
- public:
-  RecurrentOp(const std::string& type, const framework::VariableNameMap& inputs,
-              const framework::VariableNameMap& outputs,
-              const framework::AttributeMap& attrs);
-
-  RecurrentOp(const RecurrentOp& o)
-      : framework::OperatorBase(
-            static_cast<const framework::OperatorBase&>(o)) {
-    // TODO(yuyang18): Implement copy ctor well.
-    PADDLE_THROW("Not implemented");
-  }
-
-  void Run(const framework::Scope& scope,
-           const platform::DeviceContext& dev_ctx) const override {
-    alg_.Run(scope, dev_ctx);
-  }
-
-  void set_stepnet(std::unique_ptr<OperatorBase> net) {
-    stepnet_ = std::move(net);
-  }
-
-  const OperatorBase& stepnet() const { return *stepnet_; }
-
-  static const rnn::ArgumentName kArgName;
-
- private:
-  RecurrentAlgorithm alg_;
-  rnn::Argument arg_;
-  std::unique_ptr<OperatorBase> stepnet_;
-};
-
-class RecurrentGradientOp : public framework::OperatorBase {
- public:
-  RecurrentGradientOp(const std::string& type,
-                      const framework::VariableNameMap& inputs,
-                      const framework::VariableNameMap& outputs,
-                      const framework::AttributeMap& attrs);
-
-  RecurrentGradientOp(const RecurrentGradientOp& o)
-      : framework::OperatorBase(
-            static_cast<const framework::OperatorBase&>(o)) {
-    // TODO(yuyang18): Implement Copy ctor.
-    PADDLE_THROW("Not Implemented");
-  }
-
-  void Run(const framework::Scope& scope,
-           const platform::DeviceContext& dev_ctx) const override {
-    alg_.Run(scope, dev_ctx);
-  }
-
-  static const rnn::ArgumentName kArgName;
-
-  /*
-   * set a stepnet that is created according to a RecurrentOp's stepnet.
-   */
-  void set_stepnet(std::unique_ptr<OperatorBase> net) {
-    stepnet_ = std::move(net);
-  }
-  const OperatorBase& stepnet() const { return *stepnet_; }
-
- private:
-  RecurrentGradientAlgorithm alg_;
-  std::unique_ptr<OperatorBase> stepnet_;
-  rnn::Argument arg_;
-};
-
-}  // namespace operators
-}  // namespace paddle
--- a/paddle/operators/reshape_op.cc
+++ b/paddle/operators/reshape_op.cc
@@ -36,7 +36,7 @@ class ReshapeOp : public framework::OperatorWithKernel {
    PADDLE_ENFORCE(shape.size() > 0, "Attr(shape) shouldn't be empty.");
    auto x_dims = ctx->GetInputDim("X");
    // TODO(qiao) change batch_size
-    for (int i = 1; i < shape.size(); ++i) {
+    for (size_t i = 1; i < shape.size(); ++i) {
      PADDLE_ENFORCE(shape[i] > 0,
                     "Each dimension of shape "
                     "must be positiv except the first.");

--- a/paddle/operators/rnn_memory_helper_op.cc
+++ b/paddle/operators/rnn_memory_helper_op.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#include "paddle/framework/op_registry.h"
+#include "paddle/framework/operator.h"
+
+namespace paddle {
+namespace operators {
+class RNNMemoryHelperOp : public framework::OperatorBase {
+ public:
+  RNNMemoryHelperOp(const std::string &type,
+                    const framework::VariableNameMap &inputs,
+                    const framework::VariableNameMap &outputs,
+                    const framework::AttributeMap &attrs)
+      : OperatorBase(type, inputs, outputs, attrs) {}
+  void Run(const framework::Scope &scope,
+           const platform::DeviceContext &dev_ctx) const override {
+    auto mem_var_name = Input("X");
+    auto *mem_var = scope.FindVar(mem_var_name);
+    PADDLE_ENFORCE(mem_var != nullptr,
+                   "Cannot find mem_var in scope, mem_var_name is %s",
+                   mem_var_name);
+
+    auto out_name = this->Output("Out");
+    auto *out_var = scope.FindVar(out_name);
+    PADDLE_ENFORCE(out_var != nullptr,
+                   "Cannot find out_var in scope, out_var_name is %s",
+                   out_name);
+
+    auto *out_tensor = out_var->GetMutable<framework::LoDTensor>();
+    auto &mem_tensor = mem_var->Get<framework::LoDTensor>();
+    out_tensor->ShareDataWith(mem_tensor);
+    out_tensor->set_lod(mem_tensor.lod());
+  }
+};
+
+class RNNMemoryHelperOpShapeInference : public framework::InferShapeBase {
+ public:
+  void operator()(framework::InferShapeContext *ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("X"), "");
+    PADDLE_ENFORCE(ctx->HasOutput("Out"), "");
+    ctx->SetOutputDim("Out", ctx->GetInputDim("X"));
+    ctx->ShareLoD("X", /*->*/ "Out");
+  }
+};
+
+class RNNMemoryHelperOpInfoMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  RNNMemoryHelperOpInfoMaker(framework::OpProto *proto,
+                             framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "");
+    AddOutput("Out", "");
+    AddAttr<int>("data_type",
+                 "(int, default 5 (FP32)) "
+                 "Output data type")
+        .SetDefault(framework::DataType::FP32);
+    AddComment("");
+  }
+};
+
+class RNNMemoryHelperGradOp : public framework::OperatorBase {
+ public:
+  RNNMemoryHelperGradOp(const std::string &type,
+                        const framework::VariableNameMap &inputs,
+                        const framework::VariableNameMap &outputs,
+                        const framework::AttributeMap &attrs)
+      : OperatorBase(type, inputs, outputs, attrs) {}
+  void Run(const framework::Scope &scope,
+           const platform::DeviceContext &dev_ctx) const override {
+    auto out_grad_var_name = Input(framework::GradVarName("Out"));
+    auto *out_grad_var = scope.FindVar(out_grad_var_name);
+
+    auto in_grad_var_name = Output(framework::GradVarName("X"));
+    auto *in_grad_var = scope.FindVar(in_grad_var_name);
+    PADDLE_ENFORCE(in_grad_var != nullptr,
+                   "Cannot find in_grad_var in scope, name is %s",
+                   in_grad_var_name);
+
+    if (out_grad_var == nullptr) {
+      VLOG(5) << "Using fill constant 0 as starting gradient";
+      auto in_var_name = Input("X");
+      auto *in_var = scope.FindVar(in_var_name);
+      auto &in_var_tensor = in_var->Get<framework::LoDTensor>();
+
+      framework::AttributeMap attrs;
+      attrs["data_type"] = framework::ToDataType(in_var_tensor.type());
+      attrs["shape"] = framework::vectorize2int(in_var_tensor.dims());
+      attrs["value"] = 0.0f;
+
+      auto zero_op = framework::OpRegistry::CreateOp(
+          "fill_constant", {}, {{"Out", {in_grad_var_name}}}, attrs);
+      zero_op->Run(scope, dev_ctx);
+    } else {
+      auto &out_grad_tensor = out_grad_var->Get<framework::LoDTensor>();
+      auto *in_grad_tensor = in_grad_var->GetMutable<framework::LoDTensor>();
+      in_grad_tensor->ShareDataWith(out_grad_tensor);
+      in_grad_tensor->set_lod(out_grad_tensor.lod());
+    }
+  }
+};
+
+class RNNMemoryHelperGradOpInfoMaker
+    : public framework::OpProtoAndCheckerMaker {
+ public:
+  RNNMemoryHelperGradOpInfoMaker(framework::OpProto *proto,
+                                 framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput(framework::GradVarName("Out"), "");
+    AddInput("X", "");
+    AddInput("Out", "");
+    AddOutput(framework::GradVarName("X"), "");
+    AddAttr<int>("data_type",
+                 "(int, default 5 (FP32)) "
+                 "Output data type")
+        .SetDefault(framework::DataType::FP32);
+    AddComment("");
+  }
+};
+
+class RNNMemoryHelperGradOpShapeInference : public framework::InferShapeBase {
+ public:
+  void operator()(framework::InferShapeContext *ctx) const override {
+    auto x_grad_name = framework::GradVarName("X");
+    PADDLE_ENFORCE(ctx->HasOutput(x_grad_name), "");
+    PADDLE_ENFORCE(ctx->HasInput("X"), "");
+    ctx->SetOutputDim(x_grad_name, ctx->GetInputDim("X"));
+    ctx->ShareLoD("X", /*->*/ x_grad_name);
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+REGISTER_OPERATOR(rnn_memory_helper, paddle::operators::RNNMemoryHelperOp,
+                  paddle::operators::RNNMemoryHelperOpInfoMaker,
+                  paddle::operators::RNNMemoryHelperOpShapeInference,
+                  paddle::framework::DefaultGradOpDescMaker<true>);
+REGISTER_OPERATOR(rnn_memory_helper_grad,
+                  paddle::operators::RNNMemoryHelperGradOp,
+                  paddle::operators::RNNMemoryHelperGradOpInfoMaker,
+                  paddle::operators::RNNMemoryHelperGradOpShapeInference);
--- a/paddle/operators/save_load_op_test.cc
+++ b/paddle/operators/save_load_op_test.cc
@@ -34,7 +34,7 @@ TEST(SaveLoadOp, CPU) {

  tensor->set_lod(expect_lod);
  int* expect = tensor->mutable_data<int>(place);
-  for (size_t i = 0; i < paddle::framework::product(tensor->dims()); ++i) {
+  for (int64_t i = 0; i < tensor->numel(); ++i) {
    expect[i] = static_cast<int>(i);
  }
  paddle::framework::AttributeMap attrs;
@@ -50,7 +50,7 @@ TEST(SaveLoadOp, CPU) {
      "load", {}, {{"Out", {"out_var"}}}, attrs);
  load_op->Run(scope, ctx);
  int* actual = target->data<int>();
-  for (size_t i = 0; i < paddle::framework::product(tensor->dims()); ++i) {
+  for (int64_t i = 0; i < tensor->numel(); ++i) {
    EXPECT_EQ(expect[i], actual[i]);
  }
  auto& actual_lod = target->lod();
@@ -60,4 +60,4 @@ TEST(SaveLoadOp, CPU) {
      EXPECT_EQ(expect_lod[i][j], actual_lod[i][j]);
    }
  }
-}
\ No newline at end of file
+}
--- a/paddle/operators/seq_expand_op.h
+++ b/paddle/operators/seq_expand_op.h
@@ -32,7 +32,8 @@ class SeqExpandKernel : public framework::OpKernel<T> {
    const T* x_data = x->data<T>();
    auto x_dims = x->dims();
    auto* y = context.Input<LoDTensor>("Y");
-    PADDLE_ENFORCE_EQ(x_dims[0], y->lod().back().size() - 1,
+    PADDLE_ENFORCE_EQ(static_cast<size_t>(x_dims[0]),
+                      y->lod().back().size() - 1,
                      "The size of last lod level in Input(Y)"
                      "must be equal to dims[0] of Input(X).");
    out->set_lod(y->lod());

--- a/paddle/operators/sequence_conv_op.cc
+++ b/paddle/operators/sequence_conv_op.cc
@@ -89,7 +89,7 @@ class SequenceConvGradOp : public framework::OperatorWithKernel {
    }
    if (ctx->HasOutput(framework::GradVarName("X"))) {
      ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X"));
-      ctx->ShareLoD(framework::GradVarName("X"), "X");
+      ctx->ShareLoD("X", framework::GradVarName("X"));
    }
    if (ctx->HasOutput(framework::GradVarName("Filter"))) {
      ctx->SetOutputDim(framework::GradVarName("Filter"),

--- a/paddle/operators/sequence_pool_op.cc
+++ b/paddle/operators/sequence_pool_op.cc
@@ -39,15 +39,15 @@ class SequencePoolOpMaker : public framework::OpProtoAndCheckerMaker {
    AddOutput("Out",
              "(Tensor), output of SequencePoolOp, which does not contain LoD "
              "infomation.");
-    AddAttr<int>(
-        "strategy",
-        "(int, default AVERAGE) the pooling strategy of SequencePoolOp.")
-        .SetDefault(AVERAGE)
-        .InEnum({AVERAGE, SUM, SQRT, MAX, LAST, FIRST});
+    AddAttr<std::string>(
+        "pooltype",
+        "(int, default AVERAGE) the pooling pooltype of SequencePoolOp.")
+        .SetDefault("AVERAGE")
+        .InEnum({"AVERAGE", "SUM", "SQRT", "LAST", "FIRST", "MAX"});
    AddComment(R"DOC(
    SequencePoolOp pools features of all time-steps of each instance.

-    It supports six pooling strategy:
+    It supports six pooling pooltype:
    - AVERAGE: Out[i] = average_{for each instance in i-th sequence}{X[i]}
    - SUM:     Out[i] = sum_{for each instance in i-th sequence}{X[i]}
    - SQRT:    Out[i] = sum_{for each instance in i-th sequence}{X[i]} 
@@ -63,7 +63,7 @@ class SequencePoolOpMaker : public framework::OpProtoAndCheckerMaker {
    and the value of X = [[1, 3], [2, 4, 6], [5, 1]].

    Thus, Out is a [3,1,1] Tensor without LoD infomation.
-    And for different strategy, the value of Out is as follows:
+    And for different pooltype, the value of Out is as follows:

    - AVERAGE: [2, 4, 3], where 2=(1+3)/2, 4=(2+4+6)/3, 3=(5+1)/2
    - SUM: [4, 12, 6], where 4=1+3, 12=2+4+6, 6=5+1

--- a/paddle/operators/sequence_pool_op.h
+++ b/paddle/operators/sequence_pool_op.h
@@ -29,22 +29,13 @@ template <typename T, int MajorType = Eigen::RowMajor,
          typename IndexType = Eigen::DenseIndex>
 using EigenMatrix = framework::EigenMatrix<T, MajorType, IndexType>;

-enum SeqPoolType {
-  AVERAGE = 0,
-  SUM = 1,
-  SQRT = 2,  // square_root_n
-  MAX = 3,
-  LAST = 4,
-  FIRST = 5
-};
-
 template <typename Place, typename T>
 class SequencePoolKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
    auto* in = context.Input<LoDTensor>("X");
    auto* out = context.Output<LoDTensor>("Out");
-    int strategy = context.Attr<int>("strategy");
+    std::string pooltype = context.Attr<std::string>("pooltype");

    auto dims = in->dims();
    auto lod = in->lod();
@@ -71,28 +62,21 @@ class SequencePoolKernel : public framework::OpKernel<T> {
      auto in_e = EigenMatrix<T>::From(in_t, framework::make_ddim({h, w}));
      auto out_e = EigenVector<T>::Flatten(out_t);

-      switch (strategy) {
-        case AVERAGE:
-          out_e.device(place) = in_e.mean(Eigen::array<int, 1>({{0}}));
-          break;
-        case SUM:
-          out_e.device(place) = in_e.sum(Eigen::array<int, 1>({{0}}));
-          break;
-        case SQRT:
-          out_e.device(place) = in_e.sum(Eigen::array<int, 1>({{0}})) /
-                                std::sqrt(static_cast<T>(h));
-          break;
-        case MAX:
-          out_e.device(place) = in_e.maximum(Eigen::array<int, 1>({{0}}));
-          break;
-        case LAST:
-          out_e.device(place) = in_e.chip(h - 1, 0);
-          break;
-        case FIRST:
-          out_e.device(place) = in_e.chip(0, 0);
-          break;
-        default:
-          PADDLE_THROW("unsupported pooling strategy");
+      if (pooltype == "AVERAGE") {
+        out_e.device(place) = in_e.mean(Eigen::array<int, 1>({{0}}));
+      } else if (pooltype == "SUM") {
+        out_e.device(place) = in_e.sum(Eigen::array<int, 1>({{0}}));
+      } else if (pooltype == "SQRT") {
+        out_e.device(place) = in_e.sum(Eigen::array<int, 1>({{0}})) /
+                              std::sqrt(static_cast<T>(h));
+      } else if (pooltype == "MAX") {
+        out_e.device(place) = in_e.maximum(Eigen::array<int, 1>({{0}}));
+      } else if (pooltype == "LAST") {
+        out_e.device(place) = in_e.chip(h - 1, 0);
+      } else if (pooltype == "FIRST") {
+        out_e.device(place) = in_e.chip(0, 0);
+      } else {
+        PADDLE_THROW("unsupported pooling pooltype");
      }
    }
  }
@@ -105,15 +89,15 @@ class SequencePoolGradKernel : public framework::OpKernel<T> {
    auto* in = context.Input<LoDTensor>("X");
    auto* in_g = context.Output<LoDTensor>(framework::GradVarName("X"));
    auto* out_g = context.Input<LoDTensor>(framework::GradVarName("Out"));
-    int strategy = context.Attr<int>("strategy");
+    std::string pooltype = context.Attr<std::string>("pooltype");

    auto dims = in->dims();
    auto lod = in->lod()[0];
    int64_t w = in->numel() / dims[0];

    in_g->mutable_data<T>(context.GetPlace());
-    if (strategy == LAST || strategy == FIRST) {
-      // set X@Grad be zero at first when strategy is LAST/FIRST
+    if (pooltype == "LAST" || pooltype == "FIRST") {
+      // set X@Grad be zero at first when pooltype is LAST/FIRST
      math::SetConstant<Place, T> functor;
      functor(context.device_context(), in_g, 0);
    }
@@ -127,41 +111,33 @@ class SequencePoolGradKernel : public framework::OpKernel<T> {
      auto out_g_e = EigenMatrix<T>::From(out_g_t, {1, w});
      Eigen::DSizes<int, 2> bcast(h, 1);

-      switch (strategy) {
-        case AVERAGE:
-          in_g_e.device(place) = (out_g_e / static_cast<T>(h)).broadcast(bcast);
-          break;
-        case SUM:
-          in_g_e.device(place) = (out_g_e).broadcast(bcast);
-          break;
-        case SQRT:
-          in_g_e.device(place) =
-              (out_g_e / std::sqrt(static_cast<T>(h))).broadcast(bcast);
-          break;
-        case MAX: {
-          auto in_t =
-              in->Slice(static_cast<int>(lod[i]), static_cast<int>(lod[i + 1]));
-          Eigen::Map<const Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>>
-              in_t_map(in_t.data<T>(), h, w);
-          int row_id;
-          Eigen::array<int, 2> extents{{1, 1}};
-          for (int col_id = 0; col_id < w; col_id++) {
-            in_t_map.col(col_id).maxCoeff(&row_id);
-            Eigen::array<int, 2> in_offsets{{row_id, col_id}};
-            Eigen::array<int, 2> out_offsets{{0, col_id}};
-            in_g_e.slice(in_offsets, extents).device(place) =
-                out_g_e.slice(out_offsets, extents);
-          }
-          break;
+      if (pooltype == "AVERAGE") {
+        in_g_e.device(place) = (out_g_e / static_cast<T>(h)).broadcast(bcast);
+      } else if (pooltype == "SUM") {
+        in_g_e.device(place) = (out_g_e).broadcast(bcast);
+      } else if (pooltype == "SQRT") {
+        in_g_e.device(place) =
+            (out_g_e / std::sqrt(static_cast<T>(h))).broadcast(bcast);
+      } else if (pooltype == "MAX") {
+        auto in_t =
+            in->Slice(static_cast<int>(lod[i]), static_cast<int>(lod[i + 1]));
+        Eigen::Map<const Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>>
+            in_t_map(in_t.data<T>(), h, w);
+        int row_id;
+        Eigen::array<int, 2> extents{{1, 1}};
+        for (int col_id = 0; col_id < w; col_id++) {
+          in_t_map.col(col_id).maxCoeff(&row_id);
+          Eigen::array<int, 2> in_offsets{{row_id, col_id}};
+          Eigen::array<int, 2> out_offsets{{0, col_id}};
+          in_g_e.slice(in_offsets, extents).device(place) =
+              out_g_e.slice(out_offsets, extents);
        }
-        case LAST:
-          in_g_e.chip(h - 1, 0).device(place) = out_g_e;
-          break;
-        case FIRST:
-          in_g_e.chip(0, 0).device(place) = out_g_e;
-          break;
-        default:
-          PADDLE_THROW("unsupported pooling strategy");
+      } else if (pooltype == "LAST") {
+        in_g_e.chip(h - 1, 0).device(place) = out_g_e;
+      } else if (pooltype == "FIRST") {
+        in_g_e.chip(0, 0).device(place) = out_g_e;
+      } else {
+        PADDLE_THROW("unsupported pooling pooltype");
      }
    }
  }

--- a/paddle/operators/softmax_with_cross_entropy_op.cc
+++ b/paddle/operators/softmax_with_cross_entropy_op.cc
@@ -32,9 +32,9 @@ class SoftmaxWithCrossEntropyOpMaker
    AddInput("Label",
             "(Tensor, default: Tensor<int>), The ground truth which is a 2-D "
             "tensor. "
-             "If softLable is set to 0, Label is a Tensor<int> with shape [N x "
-             "1]. "
-             "If softLable is set to 1, Label is a Tensor<float/double> "
+             "If softLabel is set to false, Label is a Tensor<int> with shape "
+             "[N x 1]."
+             "If softLabel is set to true, Label is a Tensor<float/double> "
             "with shape [N x K].");
    AddOutput(
        "Softmax",
@@ -60,19 +60,23 @@ Because this operators performs a softmax on logits internally, it expects
 unscaled logits. Please do not call this op with the output of softmax operator,
 which will produce incorrect results.

-This operators expects mutually exclusive hard labels, each sample in a batch
-is in exactly one class with probabilities 1. Each sample in the batch with one
-and only one label.
+When the attribute softLabel is set false, this operators expects mutually
+exclusive hard labels, each sample in a batch is in exactly one class with
+probabilities 1. Each sample in the batch with one and only one label.

 Equation:

 1) hard label (one-hot label)

-Loss_j = -\text{Logit}_{Label_j} + \log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right), j = 1, ..., K
+Loss_j = \f$ -\text{Logit}_{Label_j} +
+\log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right),
+j = 1, ..., K $\f

 2) soft label (a distribution over all classes)

-Loss_j = -\sum_{i=0}^{K}\text{Label}_i\left(\text{Logit}_i-\log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right)\right), j = 1,...,K
+Loss_j = \f$ -\sum_{i=0}^{K}\text{Label}_i\left(\text{Logit}_i -
+\log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right)\right),
+j = 1,...,K $\f

 )DOC");
  }

--- a/paddle/operators/sum_op.h
+++ b/paddle/operators/sum_op.h
@@ -29,22 +29,27 @@ template <typename Place, typename T>
 class SumKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
-    auto& in_vars = context.MultiInputVar("X");
+    auto in_vars = context.MultiInputVar("X");
    int N = in_vars.size();
    auto out_var = context.OutputVar("Out");

+    bool in_place = out_var == in_vars[0];
+
    if (out_var->IsType<framework::LoDTensor>()) {
      auto* out = context.Output<Tensor>("Out");
      out->mutable_data<T>(context.GetPlace());

      auto result = EigenVector<T>::Flatten(*out);

-      math::SetConstant<Place, T> constant_functor;
-      constant_functor(context.device_context(), out, 0.0);
+      if (!in_place) {
+        math::SetConstant<Place, T> constant_functor;
+        constant_functor(context.device_context(), out, 0.0);
+      }

      math::SelectedRowsAddToTensor<Place, T> functor;
      auto place = context.GetEigenDevice<Place>();
-      for (int i = 0; i < N; i++) {
+      // If in_place, just skip the first tensor
+      for (int i = in_place ? 1 : 0; i < N; i++) {
        if (in_vars[i]->IsType<framework::LoDTensor>()) {
          auto& in_t = in_vars[i]->Get<framework::LoDTensor>();
          auto in = EigenVector<T>::Flatten(in_t);
@@ -57,6 +62,7 @@ class SumKernel : public framework::OpKernel<T> {
        }
      }
    } else if (out_var->IsType<framework::SelectedRows>()) {
+      PADDLE_ENFORCE(!in_place, "SelectedRows not support inplace sum now");
      auto* out = context.Output<SelectedRows>("Out");
      auto* out_value = out->mutable_value();


--- a/paddle/optimizer/parameter_optimizer_test.cpp
+++ b/paddle/optimizer/parameter_optimizer_test.cpp
@@ -85,7 +85,7 @@ public:
    for (size_t i = 0; i < opts_.size(); ++i) {
      int s = 0;
      float* newp = (float*)opts_[i]->get_weight(&s);
-      EXPECT_EQ(s, kSize);
+      EXPECT_EQ(static_cast<size_t>(s), kSize);
      for (size_t j = 0; j < kSize; ++j) {
        EXPECT_EQ(newp[j], (*p)[j]);
      }

--- a/paddle/pybind/protobuf.cc
+++ b/paddle/pybind/protobuf.cc
@@ -129,7 +129,8 @@ void BindProgramDesc(py::module &m) {
             }
             return retv;
           })
-      .def("block", &ProgramDescBind::Block, py::return_value_policy::reference)
+      .def("block", &ProgramDescBind::MutableBlock,
+           py::return_value_policy::reference)
      .def("num_blocks", &ProgramDescBind::Size)
      .def("serialize_to_string",
           [](ProgramDescBind &program_desc) -> py::bytes {

--- a/paddle/pybind/pybind.cc
+++ b/paddle/pybind/pybind.cc
@@ -28,7 +28,6 @@ limitations under the License. */
 #include "paddle/operators/cond_op.h"
 #include "paddle/operators/dynamic_recurrent_op.h"
 #include "paddle/operators/net_op.h"
-#include "paddle/operators/recurrent_op.h"
 #include "paddle/platform/enforce.h"
 #include "paddle/platform/place.h"
 #include "paddle/pybind/exception.h"
@@ -275,7 +274,7 @@ All parameter, weight, gradient are variables in Paddle.
                    const std::vector<std::array<size_t, 2>> &targets) {
    ProgramDescBind prog_with_targets(origin);
    for (const auto &t : targets) {
-      prog_with_targets.Block(t[0])->Op(t[1])->MarkAsTarget();
+      prog_with_targets.MutableBlock(t[0])->Op(t[1])->MarkAsTarget();
    }
    ProgramDesc pruned_desc;
    Prune(*prog_with_targets.Proto(), &pruned_desc);
@@ -335,7 +334,7 @@ All parameter, weight, gradient are variables in Paddle.
                    PADDLE_ENFORCE(desc.IsInitialized(),
                                   "User OpDesc is not initialized, reason %s",
                                   desc.InitializationErrorString());
-                    return OpRegistry::CreateOp(desc, nullptr);
+                    return OpRegistry::CreateOp(desc);
                  })
      .def("backward",
           [](const OperatorBase &forwardOp,
@@ -428,25 +427,6 @@ All parameter, weight, gradient are variables in Paddle.
        return self.UnstackShared(source);
      });

-  // recurrent_op
-  py::class_<operators::RecurrentOp, OperatorBase>(m, "RecurrentOp")
-      .def_static(
-          "create",
-          [](py::bytes protobin) -> operators::RecurrentOp * {
-            OpDesc desc;
-            PADDLE_ENFORCE(desc.ParsePartialFromString(protobin),
-                           "Cannot parse user input to OpDesc");
-            PADDLE_ENFORCE(desc.IsInitialized(),
-                           "User OpDesc is not initialized, reason %s",
-                           desc.InitializationErrorString());
-            auto rnn_op = OpRegistry::CreateOp(desc, nullptr);
-            return static_cast<operators::RecurrentOp *>(rnn_op.release());
-          })
-      .def("set_stepnet", [](operators::RecurrentOp &self,
-                             const operators::NetOp &net) -> void {
-        self.set_stepnet(net.Clone());
-      });
-
  py::class_<operators::DynamicRecurrentOp, OperatorBase>(m,
                                                          "DynamicRecurrentOp")
      .def_static("create",
@@ -457,7 +437,7 @@ All parameter, weight, gradient are variables in Paddle.
                    PADDLE_ENFORCE(desc.IsInitialized(),
                                   "User OpDesc is not initialized, reason %s",
                                   desc.InitializationErrorString());
-                    auto rnn_op = OpRegistry::CreateOp(desc, nullptr);
+                    auto rnn_op = OpRegistry::CreateOp(desc);
                    return static_cast<operators::DynamicRecurrentOp *>(
                        rnn_op.release());
                  })
@@ -484,7 +464,7 @@ All parameter, weight, gradient are variables in Paddle.
                    PADDLE_ENFORCE(desc.IsInitialized(),
                                   "User OpDesc is not initialized, reason %s",
                                   desc.InitializationErrorString());
-                    auto cond_op = OpRegistry::CreateOp(desc, nullptr);
+                    auto cond_op = OpRegistry::CreateOp(desc);
                    return static_cast<operators::CondOp *>(cond_op.release());
                  })
      .def("set_truenet",
@@ -498,10 +478,7 @@ All parameter, weight, gradient are variables in Paddle.

  py::class_<framework::Executor>(m, "Executor")
      .def(py::init<std::vector<platform::Place> &>())
-      .def("run", [](Executor &self, ProgramDescBind *program_bind,
-                     Scope *scope, int block_id) {
-        self.Run(*program_bind->Proto(), scope, block_id);
-      });
+      .def("run", &Executor::Run);

  m.def("unique_integer", UniqueIntegerGenerator);
  m.def("init_gflags", InitGflags);

--- a/paddle/scripts/docker/build_android.sh
+++ b/paddle/scripts/docker/build_android.sh
@@ -4,6 +4,10 @@ set -xe

 if [ $ANDROID_ABI == "arm64-v8a" ]; then
  ANDROID_ARCH=arm64
+  if [ $ANDROID_API -lt 21 ]; then
+    echo "Warning: arm64-v8a requires ANDROID_API >= 21."
+    ANDROID_API=21
+  fi
 else # armeabi, armeabi-v7a
  ANDROID_ARCH=arm
 fi

--- a/paddle/scripts/travis/build_doc.sh
+++ b/paddle/scripts/travis/build_doc.sh
@@ -53,8 +53,8 @@ function deploy_docs() {
  set +e
  rm -rf ${DIR}/doc ${DIR}/doc_cn
  set -e
-  mv ../doc/cn/html ${DIR}/doc_cn
-  mv ../doc/en/html ${DIR}/doc
+  cp -r ../doc/cn/html ${DIR}/doc_cn
+  cp -r ../doc/en/html ${DIR}/doc
  git add .
 }


--- a/paddle/trainer/MergeModel.cpp
+++ b/paddle/trainer/MergeModel.cpp
@@ -27,6 +27,13 @@ using namespace paddle;  // NOLINT
 using namespace std;     // NOLINT

 int main(int argc, char** argv) {
+  if (FLAGS_model_dir.empty() || FLAGS_config_file.empty() ||
+      FLAGS_model_file.empty()) {
+    LOG(INFO) << "Usage: ./paddle_merge_model --model_dir=pass-00000 "
+                 "--config_file=config.py --model_file=out.paddle";
+    return 0;
+  }
+
  initMain(argc, argv);
  initPython(argc, argv);


--- a/paddle/trainer/tests/CMakeLists.txt
+++ b/paddle/trainer/tests/CMakeLists.txt
@@ -37,22 +37,6 @@ add_test(NAME test_CompareTwoNets
            --config_file_a=trainer/tests/sample_trainer_config_qb_rnn.conf --config_file_b=trainer/tests/sample_trainer_config_rnn.conf
    WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}/paddle/)

-################ test_CompareMKLDNNandCPU ######################
-if(WITH_MKLDNN)
-  macro(gen_command VAR_NAME CONFIG_FILE)
-    set(${VAR_NAME} "${PADDLE_SOURCE_DIR}/paddle/.set_python_path.sh" "-d" "${PADDLE_SOURCE_DIR}/python/"
-                    "${CMAKE_CURRENT_BINARY_DIR}/test_CompareMKLDNNandCPU --use_gpu=False"
-                    "--config_file_a=trainer/tests/${CONFIG_FILE} --use_mkldnn_a=True"
-                    "--config_file_b=trainer/tests/${CONFIG_FILE} --use_mkldnn_b=False"
-                    "WORKING_DIRECTORY" "${PADDLE_SOURCE_DIR}/paddle/")
-  endmacro()
-  add_unittest_without_exec(test_CompareMKLDNNandCPU test_CompareTwoNets.cpp)
-  gen_command(compare_simple_net "sample_trainer_config_simple_net.conf")
-  gen_command(compare_branch_net "sample_trainer_config_branch_net.conf")
-  add_test(NAME test_CompareMKLDNNandCPU_simple_net COMMAND ${compare_simple_net})
-  add_test(NAME test_CompareMKLDNNandCPU_branch_net COMMAND ${compare_branch_net})
-endif()
-
 ############### test_CompareTwoOpts ###################
 add_unittest_without_exec(test_CompareTwoOpts
    test_CompareTwoOpts.cpp)

--- a/paddle/trainer/tests/sample_trainer_config_simple_net.conf
+++ b/paddle/trainer/tests/sample_trainer_config_simple_net.conf
-# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from paddle.trainer_config_helpers import *
-
-################################### Data Configuration ###################################
-TrainData(ProtoData(files = "trainer/tests/mnist.list"))
-################################### Algorithm Configuration ###################################
-settings(batch_size = 128,
-         learning_method = MomentumOptimizer(momentum=0.5, sparse=False))
-################################### Network Configuration ###################################
-data = data_layer(name ="input", size=784)
-
-tmp = img_conv_layer(input=data,
-            num_channels=1,
-            filter_size=3,
-            num_filters=32,
-            padding=1,
-            shared_biases=True,
-            act=ReluActivation())
-
-tmp = img_pool_layer(input=tmp,
-            pool_size=3,
-            stride=2,
-            padding=1,
-            pool_type=AvgPooling())
-            
-tmp = img_conv_layer(input=tmp,
-            filter_size=3,
-            num_filters=32,
-            padding=1,
-            shared_biases=True,
-            act=LinearActivation(),
-            bias_attr=False)
-
-tmp = batch_norm_layer(input=tmp,
-            use_global_stats=False,
-            act=ReluActivation())
-
-tmp = img_pool_layer(input=tmp,
-            pool_size=3,
-            stride=2,
-            padding=1,
-            pool_type=MaxPooling())
-            
-tmp = fc_layer(input=tmp, size=64,
-               bias_attr=True,
-               act=ReluActivation())
-
-output = fc_layer(input=tmp, size=10,
-                  bias_attr=True,
-                  act=SoftmaxActivation())
-
-lbl = data_layer(name ="label", size=10)
-
-cost = classification_cost(input=output, label=lbl)
-outputs(cost)
--- a/paddle/trainer/tests/test_CompareTwoNets.cpp
+++ b/paddle/trainer/tests/test_CompareTwoNets.cpp
@@ -26,15 +26,12 @@ DECLARE_int32(gpu_id);

 DECLARE_bool(local);
 DECLARE_bool(use_gpu);
-DECLARE_bool(use_mkldnn);

 DECLARE_string(config);
 DECLARE_string(nics);

 DEFINE_string(config_file_a, "", "config of one network to compare");
 DEFINE_string(config_file_b, "", "config of another network to compare");
-DEFINE_bool(use_mkldnn_a, false, "whether to use mkldnn to run config_file_a");
-DEFINE_bool(use_mkldnn_b, false, "whether to use mkldnn to run config_file_b");
 DEFINE_bool(need_high_accuracy,
            false,
            "whether need to run in double accuracy");
@@ -131,12 +128,6 @@ void compareGradient(ComData& comDataA, ComData& comDataB) {
                matA.getWidth());
  }

-  if (FLAGS_use_mkldnn_a || FLAGS_use_mkldnn_b) {
-    // some format of mkldnn parameter is different with cpu
-    // test_MKLDNN will check the parameters
-    return;
-  }
-
  vector<ParameterPtr>& parametersA = comDataA.parameters;
  vector<ParameterPtr>& parametersB = comDataB.parameters;

@@ -176,12 +167,10 @@ void compareGradient(ComData& comDataA, ComData& comDataB) {

 TEST(Trainer, create) {
  ComData dataA;
-  FLAGS_use_mkldnn = FLAGS_use_mkldnn_a;
  calcGradient(dataA, FLAGS_config_file_a);
  LOG(INFO) << "\n\nforwardBackward of Network A is finished\n\n";

  ComData dataB;
-  FLAGS_use_mkldnn = FLAGS_use_mkldnn_b;
  calcGradient(dataB, FLAGS_config_file_b);
  LOG(INFO) << "\n\nforwardBackward of the Network B is finished\n\n";


--- a/python/paddle/v2/dataset/imdb.py
+++ b/python/paddle/v2/dataset/imdb.py
@@ -116,7 +116,7 @@ def reader_creator(pos_pattern, neg_pattern, word_idx, buffer_size):
            yield [word_idx.get(w, UNK) for w in doc], i % 2
            doc = qs[i % 2].get()

-    return reader()
+    return reader


 def train(word_idx):

--- a/python/paddle/v2/framework/evaluator.py
+++ b/python/paddle/v2/framework/evaluator.py
+import paddle.v2.framework.op as op
+import numpy as np
+import paddle.v2.framework.core as core
+
+
+def avg_accumulate(accumulated_var, per_eval, num_batches, place):
+    t = np.array(accumulated_var.get_tensor())
+    t[0] += per_eval[0]
+    accumulated_var.get_tensor().set([t[0] / float(num_batches)], place)
+
+
+class Evaluator(object):
+    def __init__(self,
+                 scope,
+                 operator='accuracy',
+                 input='Inference',
+                 label='Label',
+                 output='Output',
+                 place=core.CPUPlace()):
+        """
+        create an evaluator for evaluating the inference.
+        NOTE: default run on CPUPlace(), running on GPUPlace doesn't improve performance much.
+
+        :param scope: the scope instance contains the input.
+        :type scope: paddle.v2.framework.core.scope
+        :param operator: operator name for caculating the evaluation for each mini-batch.
+        :type operator: string
+        :param input: output variable name of forward network.
+        :type input: string
+        :param label: variable name of label
+        :type label: string
+        """
+        self.scope = scope
+        self.place = place
+        self.output_name = output
+        self.num_batches = 0
+        # create variable to store accumulated evaluator output
+        eval_name = ''.join([operator, "@Eval"])
+        if scope.find_var(eval_name):
+            raise Exception("evaluator already exist in scope: %s" % eval_name)
+        self.accumulated_var = scope.var(eval_name)
+        t = self.accumulated_var.get_tensor()
+        t.set_dims((1, ))
+        t.set([0.0], place)
+        # self.accumulated_var = block.create_var(block, name=eval_name, shape=(1,))
+        # self.accumulated_var.get_tensor().set([0.0])
+        # create operator of evaluation
+        var_map = dict()  # var name -> variable
+        var_map[input] = [input]
+        var_map[label] = [label]
+        var_map[output] = [output]
+        self.op = op.Operator(operator, **var_map)
+
+    def evaluate(self, ctx, accumulator=avg_accumulate):
+        self.op.run(self.scope, ctx)
+        per_eval = np.array(self.scope.find_var(self.output_name).get_tensor())
+        self.num_batches += 1
+        accumulator(self.accumulated_var, per_eval, self.num_batches,
+                    self.place)
--- a/python/paddle/v2/framework/executor.py
+++ b/python/paddle/v2/framework/executor.py
@@ -62,7 +62,7 @@ class Executor(object):
                outputs={'Out': [fetch_var]},
                attrs={'col': i})

-        self.executor.run(program.desc, scope, 0)
+        self.executor.run(program.desc, scope, 0, True)
        return [
            core.get_fetch_variable(scope, fetch_var_name, i)
            for i in xrange(len(fetch_list))

--- a/python/paddle/v2/framework/framework.py
+++ b/python/paddle/v2/framework/framework.py
@@ -7,6 +7,11 @@ import copy
 __all__ = ['Block', 'Variable', 'Program', 'Operator']


+def unique_name(prefix):
+    uid = core.unique_integer(prefix)  # unique during whole process.
+    return "_".join([prefix, str(uid)])
+
+
 class Variable(object):
    def __init__(self,
                 block,
@@ -264,7 +269,10 @@ class Operator(object):
                    self.desc.set_attr(attr_name, attrs[attr_name])

        self.desc.check_attrs()
-        no_kernel_op_set = {'feed', 'fetch', 'save', 'load'}
+        no_kernel_op_set = {
+            'feed', 'fetch', 'save', 'load', 'recurrent',
+            'rnn_memory_helper_grad'
+        }
        if type not in no_kernel_op_set:
            self.desc.infer_var_type(self.block.desc)
            self.desc.infer_shape(self.block.desc)
@@ -354,8 +362,8 @@ class Block(object):

    def create_var(self, *args, **kwargs):
        var = Variable(self, *args, **kwargs)
-        if 'init_attr' in kwargs:
-            self._prepend_initialize_ops_(var, kwargs['init_attr'])
+        if 'initializer' in kwargs:
+            kwargs['initializer'](var, self)
        return var

    def has_var(self, name):
@@ -364,8 +372,8 @@ class Block(object):
    def create_parameter(self, *args, **kwargs):
        global_block = self.program.global_block()
        param = Parameter(global_block, *args, **kwargs)
-        if 'init_attr' in kwargs:
-            self._prepend_initialize_ops_(param, kwargs['init_attr'])
+        if 'initializer' in kwargs:
+            kwargs['initializer'](param, self)
        return param

    def append_op(self, *args, **kwargs):
@@ -424,17 +432,6 @@ class Block(object):
        for index in range(len(self.ops)):
            assert self.ops[index].desc == ops_in_cpp[index]

-    def _prepend_initialize_ops_(self, param, init_attr):
-        op_type = init_attr['type']
-        init_attr['shape'] = param.shape
-        init_attr['data_type'] = int(param.data_type)
-        op = self.prepend_op(
-            type=op_type,
-            inputs=None,
-            outputs={'Out': [param]},
-            attrs=init_attr)
-        param.op = op
-

 class Program(object):
    def __init__(self):

--- a/python/paddle/v2/framework/initializer.py
+++ b/python/paddle/v2/framework/initializer.py
+import paddle.v2.framework.framework as framework
+import numpy as np
+
+__all__ = [
+    'ConstantInitializer', 'UniformInitializer', 'NormalInitializer',
+    'XavierInitializer'
+]
+
+
+class Initializer(object):
+    """Base class for variable initializers
+
+    Defines the common interface of variable initializers.
+    They add operations to the init program that are used
+    to initialize variables. Users should not use this class
+    directly, but need to use one of its implementations.
+    """
+
+    def __init_(self):
+        pass
+
+    def __call__(self, param, block):
+        """Add corresponding initialization operations to the network
+        """
+        raise NotImplementedError()
+
+    def _compute_fans(self, var):
+        """Compute the fan_in and the fan_out for layers
+
+        This method computes the fan_in and the fan_out
+        for neural network layers, if not specified. It is
+        not possible to perfectly estimate fan_in and fan_out.
+        This method will estimate it correctly for matrix multiply and
+        convolutions.
+
+        Args:
+            var: variable for which fan_in and fan_out have to be computed
+
+        Returns:
+            tuple of two integers (fan_in, fan_out)
+        """
+        shape = var.shape
+        if not shape or len(shape) == 0:
+            fan_in = fan_out = 1
+        elif len(shape) == 1:
+            fan_in = fan_out = shape[0]
+        elif len(shape) == 2:
+            # This is the case for simple matrix multiply
+            fan_in = shape[0]
+            fan_out = shape[1]
+        else:
+            # Assume this to be a convolutional kernel
+            # In PaddlePaddle, the shape of the kernel is like:
+            # [num_filters, num_filter_channels, ...] where the remaining
+            # dimensions are the filter_size
+            receptive_field_size = np.prod(shape[2:])
+            fan_in = shape[1] * receptive_field_size
+            fan_out = shape[0] * receptive_field_size
+
+        return (fan_in, fan_out)
+
+
+class ConstantInitializer(Initializer):
+    """Implements the constant initializer
+    """
+
+    def __init__(self, value=0.0):
+        """Constructor for ConstantInitializer
+
+        Args:
+            value: constant value to initialize the variable
+        """
+        assert value is not None
+        super(ConstantInitializer, self).__init__()
+        self._value = value
+
+    def __call__(self, var, block):
+        """Add constant initialization ops for a variable
+
+        Args:
+            var: Variable that needs to be initialized
+            block: The block in which initialization ops
+                   should be added
+
+        Returns:
+            the initialization op
+        """
+        assert isinstance(var, framework.Variable)
+        assert isinstance(block, framework.Block)
+        # Initialization Ops should be prepended and not appended
+        op = block.prepend_op(
+            type="fill_constant",
+            outputs={"Out": var},
+            attrs={
+                "shape": var.shape,
+                "data_type": int(var.data_type),
+                "value": self._value
+            })
+        var.op = op
+        return op
+
+
+class UniformInitializer(Initializer):
+    """Implements the random uniform distribution initializer
+    """
+
+    def __init__(self, low=-1.0, high=1.0, seed=0):
+        """Constructor for UniformInitializer
+
+        Args:
+            low: lower boundary of the uniform distribution
+            high: upper boundary of the uniform distribution
+            seed: random seed
+        """
+        assert low is not None
+        assert high is not None
+        assert high >= low
+        assert seed is not None
+        super(UniformInitializer, self).__init__()
+        self._low = low
+        self._high = high
+        self._seed = seed
+
+    def __call__(self, var, block):
+        """Add uniform distribution initialization ops for a variable
+
+        Args:
+            var: Variable that needs to be initialized
+            block: The block in which initialization ops
+                   should be added
+
+        Returns:
+            the initialization op
+        """
+        assert isinstance(var, framework.Variable)
+        assert isinstance(block, framework.Block)
+        # Initialization Ops should be prepended and not appended
+        op = block.prepend_op(
+            type="uniform_random",
+            outputs={"Out": var},
+            attrs={
+                "shape": var.shape,
+                "data_type": int(var.data_type),
+                "min": self._low,
+                "max": self._high,
+                "seed": self._seed
+            })
+        var.op = op
+        return op
+
+
+class NormalInitializer(Initializer):
+    """Implements the  random Normal(Gaussian) distribution initializer
+    """
+
+    def __init__(self, loc=0.0, scale=1.0, seed=0):
+        """Constructor for NormalInitializer
+
+        Args:
+            loc: mean of the normal distribution
+            scale: standard deviation of the normal distribution
+            seed: random seed
+        """
+        assert loc is not None
+        assert scale is not None
+        assert seed is not None
+        super(NormalInitializer, self).__init__()
+        self._mean = loc
+        self._std_dev = scale
+        self._seed = seed
+
+    def __call__(self, var, block):
+        """Add normal distribution initialization ops for a variable
+
+        Args:
+            var: Variable that needs to be initialized
+            block: The block in which initialization ops
+                   should be added
+
+        Returns:
+            the initialization op
+        """
+        assert isinstance(var, framework.Variable)
+        assert isinstance(block, framework.Block)
+        # Initialization Ops should be prepended and not appended
+        op = block.prepend_op(
+            type="gaussian_random",
+            outputs={"Out": var},
+            attrs={
+                "shape": var.shape,
+                "data_type": int(var.data_type),
+                "mean": self._mean,
+                "std": self._std_dev,
+                "seed": self._seed
+            })
+        var.op = op
+        return op
+
+
+class XavierInitializer(Initializer):
+    """Implements the Xavier initializer
+
+    This class implements the Xavier weight initializer from the paper
+    Understanding the difficulty of training deep feedforward neural
+    networks[1] by Xavier Glorot and Yoshua Bengio.
+
+    This initializer is designed to keep the scale of the gradients
+    approximately same in all the layers. In case of Uniform distribution,
+    the range is [-x, x], where x = sqrt(6 / (fan_in + fan_out)).
+    In case of Normal distribution, the mean is 0 and the standard deviation
+    is sqrt(2/ (fan_in + fan_out)).
+
+    References:
+        [1] Understanding the difficulty of training deep feedforward neural
+            networks. International conference on artificial intelligence and
+            statistics.
+            (http://proceedings.mlr.press/v9/glorot10a.html)
+    """
+
+    def __init__(self, uniform=True, fan_in=None, fan_out=None, seed=0):
+        """Constructor for XavierInitializer
+
+        Args:
+            uniform: whether to use uniform or normal distribution
+            fan_in: fan_in for Xavier initialization. If None, it is
+                    inferred from the variable.
+            fan_out: fan_out for Xavier initialization. If None, it is
+                     inferred from the variable.
+            seed: random seed
+
+        Note: It is recommended to set fan_in and fan_out to None for
+              most cases.
+        """
+        assert uniform is not None
+        assert seed is not None
+        super(XavierInitializer, self).__init__()
+        self._uniform = uniform
+        self._fan_in = fan_in
+        self._fan_out = fan_out
+        self._seed = seed
+
+    def __call__(self, var, block):
+        """Add xavier initialization ops for a variable
+
+        Args:
+            var: Variable that needs to be initialized
+            block: The block in which initialization ops
+                   should be added
+
+        Returns:
+            the initialization op
+        """
+        assert isinstance(var, framework.Variable)
+        assert isinstance(block, framework.Block)
+        f_in, f_out = self._compute_fans(var)
+
+        # If fan_in and fan_out are passed, use them
+        fan_in = f_in if self._fan_in is None else self._fan_in
+        fan_out = f_out if self._fan_out is None else self._fan_out
+
+        if self._uniform:
+            limit = np.sqrt(6.0 / float(fan_in + fan_out))
+            op = block.prepend_op(
+                type="uniform_random",
+                outputs={"Out": var},
+                attrs={
+                    "shape": var.shape,
+                    "data_type": int(var.data_type),
+                    "min": -limit,
+                    "max": limit,
+                    "seed": self._seed
+                })
+
+        else:
+            std = np.sqrt(2.0 / float(fan_in + fan_out))
+            op = block.prepend_op(
+                type="gaussian_random",
+                outputs={"Out": var},
+                attrs={
+                    "shape": var.shape,
+                    "data_type": int(var.data_type),
+                    "mean": 0.0,
+                    "std": std,
+                    "seed": self._seed
+                })
+        var.op = op
+        return op
--- a/python/paddle/v2/framework/layer_helper.py
+++ b/python/paddle/v2/framework/layer_helper.py
 import copy
 import itertools

-import paddle.v2.framework.core as core
-
 from paddle.v2.framework.framework import Variable, g_program, \
-    g_init_program
-
-
-def unique_name(prefix):
-    uid = core.unique_integer(prefix)  # unique during whole process.
-    return "_".join([prefix, str(uid)])
+    g_init_program, unique_name, Program
+from paddle.v2.framework.initializer import ConstantInitializer, \
+    UniformInitializer


 class LayerHelper(object):
@@ -66,14 +61,7 @@ class LayerHelper(object):

    @property
    def param_attr(self):
-        default = {
-            'name': None,
-            'init_attr': {
-                'type': 'uniform_random',
-                'min': -1.0,
-                'max': 1.0
-            }
-        }
+        default = {'name': None, 'initializer': UniformInitializer()}
        actual = self.kwargs.get('param_attr', None)
        if actual is None:
            actual = default
@@ -83,13 +71,7 @@ class LayerHelper(object):
        return actual

    def bias_attr(self):
-        default = {
-            'name': None,
-            'init_attr': {
-                'type': 'fill_constant',
-                'value': 0.0
-            }
-        }
+        default = {'name': None, 'initializer': ConstantInitializer()}
        bias_attr = self.kwargs.get('bias_attr', None)
        if bias_attr is True:
            bias_attr = default
@@ -149,12 +131,38 @@ class LayerHelper(object):
    def create_variable(self, *args, **kwargs):
        return self.program.current_block().create_var(*args, **kwargs)

-    def create_global_variable(self, *args, **kwargs):
+    def create_global_variable(self, persistable=False, *args, **kwargs):
        return self.program.global_block().create_var(
-            *args, persistable=False, **kwargs)
-
-    def append_bias_op(self, input_var):
-        size = list(input_var.shape[1:])
+            *args, persistable=persistable, **kwargs)
+
+    def set_variable_initializer(self, var, initializer):
+        assert isinstance(var, Variable)
+        self.init_program.global_block().create_var(
+            name=var.name,
+            type=var.type,
+            dtype=var.data_type,
+            shape=var.shape,
+            persistable=True,
+            initializer=initializer)
+
+    def append_bias_op(self, input_var, num_flatten_dims=None):
+        """
+        Append bias operator and return its output. If the user does not set 
+        bias_attr, append_bias_op will return input_var
+         
+        :param input_var: the input variable. The len(input_var.shape) is larger
+        or equal than 2.
+        :param num_flatten_dims: The input tensor will be flatten as a matrix 
+        when adding bias.
+        `matrix.shape = product(input_var.shape[0:num_flatten_dims]), product(
+                input_var.shape[num_flatten_dims:])`
+        """
+        if num_flatten_dims is None:
+            num_flatten_dims = self.kwargs.get('num_flatten_dims', None)
+            if num_flatten_dims is None:
+                num_flatten_dims = 1
+
+        size = list(input_var.shape[num_flatten_dims:])
        bias_attr = self.bias_attr()
        if not bias_attr:
            return input_var

--- a/python/paddle/v2/framework/layers.py
+++ b/python/paddle/v2/framework/layers.py
 from paddle.v2.framework.layer_helper import LayerHelper, unique_name
 import paddle.v2.framework.core as core
-from paddle.v2.framework.framework import OpProtoHolder, Variable, Program
+from paddle.v2.framework.framework import OpProtoHolder, Variable, Program, \
+    Operator
+from paddle.v2.framework.initializer import ConstantInitializer
 import re

 __all__ = [
    'fc', 'data', 'cross_entropy', 'conv2d', 'pool2d', 'embedding', 'concat',
-    'StaticRNN', 'cast', 'sequence_conv', 'sequence_pool', 'accuracy'
+    'StaticRNN', 'cast', 'sequence_conv', 'sequence_pool', 'sums', 'cos_sim',
+    'batch_norm', 'accuracy'
 ]


@@ -30,7 +33,6 @@ def fc(input,
        param_shape = [
            reduce(lambda a, b: a * b, input_shape[num_flatten_dims:], 1)
        ] + [size]
-
        w = helper.create_parameter(
            attr=param_attr, shape=param_shape, dtype=dtype)
        tmp = helper.create_tmp_variable(dtype)
@@ -86,8 +88,17 @@ def data(name,
         program=None,
         init_program=None):
    helper = LayerHelper('data', **locals())
+    shape = list(shape)
+    for i in xrange(len(shape)):
+        if shape[i] is None:
+            shape[i] = -1
+            append_batch_size = False
+        elif shape[i] < 0:
+            append_batch_size = False
+
    if append_batch_size:
        shape = [-1] + shape  # append batch size as -1
+
    return helper.create_global_variable(
        name=name, shape=shape, dtype=data_type, type=type)

@@ -163,18 +174,9 @@ _create_op_func_('mul')
 _create_op_func_('elementwise_add')
 _create_op_func_('dropout')
 _create_op_func_('reshape')
-
-
-def cast(x, data_type, program=None):
-    helper = LayerHelper('cast', **locals())
-    out = helper.create_tmp_variable(dtype=data_type)
-    helper.append_op(
-        type='cast',
-        inputs={'X': [x]},
-        outputs={'Out': [out]},
-        attrs={'in_data_type': x.data_type,
-               'out_data_type': out.data_type})
-    return out
+_create_op_func_('elementwise_add')
+_create_op_func_('sigmoid')
+_create_op_func_('scale')


 def cast(x, data_type, program=None):
@@ -191,9 +193,7 @@ def cast(x, data_type, program=None):

 def concat(input, axis, program=None, init_program=None):
    helper = LayerHelper('concat', **locals())
-    if not isinstance(input, list) and not isinstance(input, tuple):
-        input = [input]
-    out = helper.create_tmp_variable(dtype=input[0].data_type)
+    out = helper.create_tmp_variable(dtype=helper.input_dtype())
    helper.append_op(
        type='concat',
        inputs={'X': input},
@@ -202,6 +202,28 @@ def concat(input, axis, program=None, init_program=None):
    return out


+def sums(input, program=None, init_program=None):
+    helper = LayerHelper('sum', **locals())
+    out = helper.create_tmp_variable(dtype=helper.input_dtype())
+    helper.append_op(type='sum', inputs={'X': input}, outputs={'Out': out})
+    return out
+
+
+def cos_sim(X, Y, **kwargs):
+    helper = LayerHelper('cos_sim', **kwargs)
+    out = helper.create_tmp_variable(dtype=X.data_type)
+    xnorm = helper.create_tmp_variable(dtype=X.data_type)
+    ynorm = helper.create_tmp_variable(dtype=X.data_type)
+    helper.append_op(
+        type='cos_sim',
+        inputs={'X': [X],
+                'Y': [Y]},
+        outputs={'Out': [out],
+                 'XNorm': [xnorm],
+                 'YNorm': [ynorm]})
+    return out
+
+
 def cross_entropy(input, label, **kwargs):
    helper = LayerHelper('cross_entropy', **kwargs)
    out = helper.create_tmp_variable(dtype=input.data_type)
@@ -254,10 +276,9 @@ def accuracy(input, label, k=1, **kwargs):

 def sequence_conv(input,
                  num_filters,
-                  name=None,
                  filter_size=3,
+                  filter_stride=1,
                  act=None,
-                  stride=1,
                  padding=None,
                  bias_attr=None,
                  param_attr=None,
@@ -270,7 +291,7 @@ def sequence_conv(input,
    helper = LayerHelper('sequence_conv', **locals())
    dtype = helper.input_dtype()

-    filter_shape = [num_filters, filter_size]
+    filter_shape = [filter_size * input.shape[1], num_filters]
    filter = helper.create_parameter(
        attr=helper.param_attr, shape=filter_shape, dtype=dtype)
    pre_bias = helper.create_tmp_variable(dtype)
@@ -279,15 +300,14 @@ def sequence_conv(input,
        type='sequence_conv',
        inputs={
            'X': [input],
-            'Filter': filter,
+            'Filter': [filter],
        },
        outputs={"Out": pre_bias},
        attrs={
-            'context_stride': stride,
-            'context_start': 0,
-            'context_length': filter_size
+            'contextStride': filter_stride,
+            'contextStart': -int(filter_size / 2),
+            'contextLength': filter_size
        })
-
    pre_act = helper.append_bias_op(pre_bias)
    return helper.append_activation(pre_act)

@@ -339,36 +359,21 @@ def conv2d(input,
               'paddings': padding,
               'groups': groups})

-    pre_act = helper.append_bias_op(pre_bias)
+    pre_act = helper.append_bias_op(pre_bias, 1)

    return helper.append_activation(pre_act)


-def sequence_pool(input,
-                  pool_size,
-                  pool_type,
-                  pool_stride=1,
-                  pool_padding=0,
-                  global_pooling=False,
-                  program=None,
-                  init_program=None):
-    # FIXME(dzh) : want to unify the argument of python layer
-    # function. So we ignore some unecessary attributes
-
-    ENUM_POOL_TYPE = set(["max", "avg", "sqrt", "last", "first"])
-    if pool_type not in ENUM_POOL_TYPE:
-        raise ValueError("Unknown pool_type: '%s'. It can only be %s.",
-                         str(pool_type), " ".join(ENUM_POOL_TYPE))
-
-    helper = LayerHelper('sequence_pool', **locals())
+def sequence_pool(input, pool_type, **kwargs):
+    helper = LayerHelper('sequence_pool', input=input, **kwargs)
    dtype = helper.input_dtype()
    pool_out = helper.create_tmp_variable(dtype)

    helper.append_op(
        type="sequence_pool",
        inputs={"X": [input]},
-        outputs={"Out": pool_out},
-        attrs={"strategy": pool_type})
+        outputs={"Out": [pool_out]},
+        attrs={"pooltype": pool_type.upper()})

    return pool_out

@@ -433,26 +438,12 @@ def batch_norm(input,
        else:
            raise ValueError("unsupported data layout:" + data_layout)

-    def get_init_attr(value):
-        if not isinstance(value, float):
-            raise ValueError("attr value should be a float")
-        return {'type': 'fill_constant', 'value': value}
-
-    def prepend_init_op(var, init_attr):
-        assert isinstance(var, Variable)
-        op_type = init_attr['type']
-        init_attr['shape'] = var.shape
-        init_attr['data_type'] = int(var.data_type)
-        op = var.block.prepend_op(
-            type=op_type, inputs=None, outputs={'Out': [var]}, attrs=init_attr)
-        return op
-
-    def create_persistable_var(dtype, shape, init_attr=None):
+    def create_persistable_var(dtype, shape, initializer=None):
        name = unique_name(".".join([helper.name, "xxxx"]))
        var = init_program.global_block().create_var(
            dtype=dtype, shape=shape, name=name, persistable=True)
-        if 'init_attr' is not None:
-            prepend_init_op(var, init_attr)
+        if initializer is not None:
+            initializer(var, var.block)
        return program.global_block().create_var(
            name=name, dtype=dtype, shape=shape, persistable=True)

@@ -465,8 +456,9 @@ def batch_norm(input,
        attr=helper.param_attr, shape=param_shape, dtype=dtype)

    # create input
-    mean = create_persistable_var(dtype, param_shape, get_init_attr(0.0))
-    variance = create_persistable_var(dtype, param_shape, get_init_attr(1.0))
+    mean = create_persistable_var(dtype, param_shape, ConstantInitializer(0.0))
+    variance = create_persistable_var(dtype, param_shape,
+                                      ConstantInitializer(1.0))

    # create output
    # mean and mean_out share the same memory
@@ -534,6 +526,8 @@ class StaticRNNGuard(BlockGuard):
        return super(StaticRNNGuard, self).__enter__()

    def __exit__(self, exc_type, exc_val, exc_tb):
+        if exc_type is not None:
+            return False
        self.rnn.status = StaticRNN.AFTER_RNN_BLOCK
        self.rnn.complete_rnn_op()
        return super(StaticRNNGuard, self).__exit__(exc_type, exc_val, exc_tb)
@@ -593,7 +587,7 @@ class StaticRNN(object):
                outputs={'Out': [boot_var]},
                attrs={
                    'value': init_value,
-                    'shape': boot_var.shape,
+                    'shape': [40] + list(boot_var.shape[1:]),
                    'data_type': boot_var.data_type
                })

@@ -612,14 +606,14 @@ class StaticRNN(object):
        if not isinstance(x, Variable):
            raise TypeError("step input takes a Variable")
        if self.seq_len is None:
-            self.seq_len = x.shape[1]
-        elif self.seq_len != x.shape[1]:
+            self.seq_len = x.shape[0]
+        elif self.seq_len != x.shape[0]:
            raise ValueError("Static RNN only take fix seq_len input")

        ipt = self.helper.create_variable(
            name=x.name,
            dtype=x.data_type,
-            shape=[-1] + list(x.shape[2:]),
+            shape=list(x.shape[1:]),
            type=x.type)
        self.inputs.append(ipt)
        return ipt
@@ -629,10 +623,17 @@ class StaticRNN(object):
        if not isinstance(o, Variable):
            raise TypeError("step output takes a Variable")

+        tmp_o = self.helper.create_tmp_variable(dtype=o.data_type)
+        self.helper.append_op(
+            type='rnn_memory_helper',
+            inputs={'X': [o]},
+            outputs={'Out': tmp_o},
+            attrs={'data_type': o.data_type})
+
        out_var = self.parent_block().create_var(
-            name=o.name,
-            shape=[-1, self.seq_len] + list(o.shape[1:]),
-            dtype=o.data_type)
+            name=tmp_o.name,
+            shape=[self.seq_len] + list(tmp_o.shape),
+            dtype=tmp_o.data_type)

        self.outputs.append(out_var)

@@ -663,6 +664,68 @@ class StaticRNN(object):
            return self.outputs

    def complete_rnn_op(self):
-        # TODO(yuyang18): Create RNN Op here.
-        # Implement this method after RNN op complete.
-        pass
+        program = self.helper.program
+        rnn_block = program.current_block()
+        parent_block = self.parent_block()
+
+        local_inputs = set()
+
+        for op in rnn_block.ops:
+            assert isinstance(op, Operator)
+            for oname in op.output_names:
+                for out_var_name in op.output(oname):
+                    local_inputs.add(out_var_name)
+
+        for var in self.inputs:
+            local_inputs.add(var.name)
+        for m in self.memories:
+            local_inputs.add(m)
+
+        params = list()
+        for op in rnn_block.ops:
+            assert isinstance(op, Operator)
+            for iname in op.input_names:
+                for in_var_name in op.input(iname):
+                    if in_var_name not in local_inputs:
+                        params.append(in_var_name)
+
+        parameters = [parent_block.var(name) for name in params]
+
+        step_scope = parent_block.create_var(
+            type=core.VarDesc.VarType.STEP_SCOPES)
+
+        inlinks = [parent_block.var(i.name) for i in self.inputs]
+        outlinks = self.outputs
+
+        boot_memories = []
+        pre_memories = []
+        memories = []
+        for _, mem in self.memories.iteritems():
+            boot_memories.append(mem.init)
+            pre_memories.append(mem.pre_mem.name)
+            mem_var = rnn_block.var(mem.mem.name)
+            assert isinstance(mem_var, Variable)
+            new_mem = self.helper.create_tmp_variable(dtype=mem_var.data_type)
+
+            rnn_block.append_op(
+                type='rnn_memory_helper',
+                inputs={'X': [mem_var]},
+                outputs={'Out': [new_mem]},
+                attrs={'data_type': mem_var.data_type})
+
+            memories.append(new_mem.name)
+
+        parent_block.append_op(
+            type='recurrent',
+            inputs={
+                'inputs': inlinks,
+                'initial_states': boot_memories,
+                'parameters': parameters
+            },
+            outputs={'outputs': outlinks,
+                     'step_scopes': [step_scope]},
+            attrs={
+                'ex_states': pre_memories,
+                'states': memories,
+                'step_block': rnn_block
+            })
--- a/python/paddle/v2/framework/net_drawer.py
+++ b/python/paddle/v2/framework/net_drawer.py
+import argparse
+import json
+import logging
+from collections import defaultdict
+
+import paddle.v2.framework.core as core
+import paddle.v2.framework.proto.framework_pb2 as framework_pb2
+
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.INFO)
+
+try:
+    from graphviz import Digraph
+except ImportError:
+    logger.info(
+        'Cannot import graphviz, which is required for drawing a network. This '
+        'can usually be installed in python with "pip install graphviz". Also, '
+        'pydot requires graphviz to convert dot files to pdf: in ubuntu, this '
+        'can usually be installed with "sudo apt-get install graphviz".')
+    print('net_drawer will not run correctly. Please install the correct '
+          'dependencies.')
+    exit(0)
+
+OP_STYLE = {
+    'shape': 'oval',
+    'color': '#0F9D58',
+    'style': 'filled',
+    'fontcolor': '#FFFFFF'
+}
+
+VAR_STYLE = {}
+
+GRAPH_STYLE = {"rankdir": "TB", }
+
+GRAPH_ID = 0
+
+
+def unique_id():
+    def generator():
+        GRAPH_ID += 1
+        return GRAPH_ID
+
+    return generator
+
+
+def draw_node(op):
+    node = OP_STYLE
+    node["name"] = op.type
+    node["label"] = op.type
+    return node
+
+
+def draw_edge(var_parent, op, var, arg):
+    edge = VAR_STYLE
+    edge["label"] = "%s(%s)" % (var.parameter, arg)
+    edge["head_name"] = op.type
+    edge["tail_name"] = var_parent[arg]
+    return edge
+
+
+def parse_graph(program, graph, var_dict, **kwargs):
+
+    # fill the known variables
+    for block in program.blocks:
+        for var in block.vars:
+            if not var_dict.has_key(var):
+                var_dict[var] = "Feed"
+
+    proto = framework_pb2.ProgramDesc.FromString(
+        program.desc.serialize_to_string())
+    for block in proto.blocks:
+        for op in block.ops:
+            graph.node(**draw_node(op))
+            for o in op.outputs:
+                for arg in o.arguments:
+                    var_dict[arg] = op.type
+            for e in op.inputs:
+                for arg in e.arguments:
+                    if var_dict.has_key(arg):
+                        graph.edge(**draw_edge(var_dict, op, e, arg))
+
+
+def draw_graph(init_program, program, **kwargs):
+    if kwargs.has_key("graph_attr"):
+        GRAPH_STYLE.update(kwargs[graph_attr])
+    if kwargs.has_key("node_attr"):
+        OP_STYLE.update(kwargs[node_attr])
+    if kwargs.has_key("edge_attr"):
+        VAR_STYLE.update(kwargs[edge_attr])
+
+    graph_id = unique_id()
+    filename = kwargs.get("filename")
+    if filename == None:
+        filename = str(graph_id) + ".gv"
+    g = Digraph(
+        name=str(graph_id),
+        filename=filename,
+        graph_attr=GRAPH_STYLE,
+        node_attr=OP_STYLE,
+        edge_attr=VAR_STYLE,
+        **kwargs)
+
+    var_dict = {}
+    parse_graph(init_program, g, var_dict)
+    parse_graph(program, g, var_dict)
+
+    if filename != None:
+        g.save()
+    return g
--- a/python/paddle/v2/framework/nets.py
+++ b/python/paddle/v2/framework/nets.py
@@ -47,7 +47,7 @@ def img_conv_group(input,
    """
    tmp = input
    assert isinstance(conv_num_filter, list) or \
-           isinstance(conv_num_filter, tuple)
+        isinstance(conv_num_filter, tuple)

    def __extend_list__(obj):
        if not hasattr(obj, '__len__'):
@@ -101,9 +101,8 @@ def img_conv_group(input,
 def sequence_conv_pool(input,
                       num_filters,
                       filter_size,
-                       pool_size,
-                       pool_stride,
-                       act,
+                       act="sigmoid",
+                       pool_type="max",
                       program=None,
                       init_program=None):
    conv_out = layers.sequence_conv(
@@ -116,9 +115,7 @@ def sequence_conv_pool(input,

    pool_out = layers.sequence_pool(
        input=conv_out,
-        pool_size=pool_size,
-        pool_type='max',
-        pool_stride=pool_stride,
+        pool_type=pool_type,
        program=program,
        init_program=init_program)
    return pool_out
--- a/python/paddle/v2/framework/optimizer.py
+++ b/python/paddle/v2/framework/optimizer.py
 from collections import defaultdict

 import paddle.v2.framework.framework as framework
+from paddle.v2.framework.framework import unique_name, Program
 from paddle.v2.framework.backward import append_backward_ops
+from paddle.v2.framework.initializer import ConstantInitializer
 from paddle.v2.framework.regularizer import append_regularization_ops
+from paddle.v2.framework.layer_helper import LayerHelper

 __all__ = [
    'SGDOptimizer', 'MomentumOptimizer', 'AdagradOptimizer', 'AdamOptimizer',
@@ -25,6 +28,7 @@ class Optimizer(object):
        # to train. These variables are called accumulators.
        # {accum_name : { paramter_name : accumulator_for_parameter, ...}, ...}
        self._accumulators = defaultdict(lambda: dict())
+        self.helper = None

    def _append_optimize_op(self, block, param_and_grad):
        """ append optimize operator to block and return all the added optimize_op
@@ -63,7 +67,7 @@ class Optimizer(object):
        """
        pass

-    def _add_accumulator(self, block, name, param, dtype=None, fill_value=0.0):
+    def _add_accumulator(self, name, param, dtype=None, fill_value=0.0):
        """Utility function to add an accumulator for a parameter

        Args:
@@ -77,22 +81,17 @@ class Optimizer(object):
                param.name in self._accumulators[name]):
            raise Exception("Accumulator {} already exists for parmeter {}".
                            format(name, param.name))
-        global_block = block.program.global_block()
-        param_shape = list(param.shape)
-        param_acc = global_block.create_var(
-            dtype=dtype, shape=param_shape, lod_level=0)
-
-        # Initialize the accumulator with fill_value
-        # FIXME: Fix when Initialization design has been implemented
-        # https://github.com/PaddlePaddle/Paddle/pull/4852
-        global_block.append_op(
-            type="fill_constant",
-            outputs={"Out": param_acc},
-            attrs={"shape": param_shape,
-                   "value": fill_value})
-
-        # Add to accumulators dict
-        self._accumulators[name][param.name] = param_acc
+
+        assert isinstance(self.helper, LayerHelper)
+        var = self.helper.create_global_variable(
+            name=unique_name(name),
+            persistable=True,
+            dtype=dtype or param.data_type,
+            type=param.type,
+            shape=param.shape)
+        self.helper.set_variable_initializer(
+            var, initializer=ConstantInitializer(value=float(fill_value)))
+        self._accumulators[name][param.name] = var

    def _get_accumulator(self, name, param):
        """Utility function to fetch an accumulator for a parameter
@@ -130,7 +129,10 @@ class Optimizer(object):

        return increment_op

-    def create_optimization_pass(self, parameters_and_grads, loss):
+    def create_optimization_pass(self,
+                                 parameters_and_grads,
+                                 loss,
+                                 init_program=None):
        """Add optimization operators to update gradients to variables.

        Args:
@@ -142,6 +144,7 @@ class Optimizer(object):
          optimization. This will include parameter update ops, global step
          update ops and any other custom ops required by subclasses to manage
          their internal state.
+          :param init_program: 
        """
        # This is a default implementation of create_optimization_pass that
        # can be shared by most optimizers. This implementation assumes that
@@ -151,6 +154,9 @@ class Optimizer(object):
        # for parameters and extend _finish_update method to add custom ops.

        # Create any accumulators
+        program = loss.block.program
+        self.helper = LayerHelper(
+            self.__class__.__name__, program=program, init_program=init_program)
        self._create_accumulators(loss.block,
                                  [p[0] for p in parameters_and_grads])
        # Create any necessary tensors
@@ -177,7 +183,11 @@ class Optimizer(object):
            return_ops.append(self._increment_global_step(loss.block))
        return return_ops

-    def minimize(self, loss, parameter_list=None, no_grad_set=None):
+    def minimize(self,
+                 loss,
+                 init_program=None,
+                 parameter_list=None,
+                 no_grad_set=None):
        """Add operations to minimize `loss` by updating `parameter_list`.

        This method combines interface `append_backward_ops()` and
@@ -187,7 +197,8 @@ class Optimizer(object):
                                           set())
        # Add regularization if any 
        params_grads = append_regularization_ops(params_grads)
-        optimize_ops = self.create_optimization_pass(params_grads, loss)
+        optimize_ops = self.create_optimization_pass(params_grads, loss,
+                                                     init_program)
        return optimize_ops


@@ -202,24 +213,19 @@ class SGDOptimizer(Optimizer):
        self._learning_rate = learning_rate

    def _initialize_tensors(self, block):
-        assert isinstance(block, framework.Block)
        lr_shape = [1]
        # create a variable for learning_rate
-        self._lr = block.create_var(
-            dtype="float32", shape=lr_shape, lod_level=0)
-
-        # create an op to init the learning_rate
-        # FIXME: Fix when Initialization design has been implemented
-        # https://github.com/PaddlePaddle/Paddle/pull/4852
-        block.append_op(
-            type="fill_constant",
-            outputs={"Out": self._lr},
-            attrs={"shape": lr_shape,
-                   "value": self._learning_rate})
+        self._lr = self.helper.create_global_variable(
+            name=unique_name("learning_rate"),
+            dtype='float32',
+            shape=lr_shape,
+            lod_level=1,
+            persistable=True)
+        self.helper.set_variable_initializer(
+            var=self._lr, initializer=ConstantInitializer(self._learning_rate))

    def _append_optimize_op(self, block, param_and_grad):
        assert isinstance(block, framework.Block)
-
        # create the optimize op
        sgd_op = block.append_op(
            type=self.type,
@@ -255,23 +261,20 @@ class MomentumOptimizer(Optimizer):
        assert isinstance(block, framework.Block)
        lr_shape = [1]
        # create a variable for learning_rate
-        self._lr = block.create_var(
-            dtype="float32", shape=lr_shape, lod_level=0)
-
-        # create an op to init the learning_rate
-        # FIXME: Fix when Initialization design has been implemented
-        # https://github.com/PaddlePaddle/Paddle/pull/4852
-        block.append_op(
-            type="fill_constant",
-            outputs={"Out": self._lr},
-            attrs={"shape": lr_shape,
-                   "value": self._learning_rate})
+        self._lr = self.helper.create_global_variable(
+            name=unique_name("learning_rate"),
+            dtype='float32',
+            shape=lr_shape,
+            lod_level=1,
+            persistable=True)
+        self.helper.set_variable_initializer(
+            var=self._lr, initializer=ConstantInitializer(self._learning_rate))

    def _create_accumulators(self, block, parameters):
        assert isinstance(block, framework.Block)

        for p in parameters:
-            self._add_accumulator(block, self._velocity_acc_str, p, 'float32')
+            self._add_accumulator(self._velocity_acc_str, p)

    def _append_optimize_op(self, block, param_and_grad):
        assert isinstance(block, framework.Block)
@@ -311,26 +314,22 @@ class AdagradOptimizer(Optimizer):
        self._epsilon = epsilon

    def _initialize_tensors(self, block):
-        assert isinstance(block, framework.Block)
        lr_shape = [1]
        # create a variable for learning_rate
-        self._lr = block.create_var(
-            dtype="float32", shape=lr_shape, lod_level=0)
-
-        # create an op to init the learning_rate
-        # FIXME: Fix when Initialization design has been implemented
-        # https://github.com/PaddlePaddle/Paddle/pull/4852
-        block.append_op(
-            type="fill_constant",
-            outputs={"Out": self._lr},
-            attrs={"shape": lr_shape,
-                   "value": self._learning_rate})
+        self._lr = self.helper.create_global_variable(
+            name=unique_name("learning_rate"),
+            dtype='float32',
+            shape=lr_shape,
+            lod_level=1,
+            persistable=True)
+        self.helper.set_variable_initializer(
+            var=self._lr, initializer=ConstantInitializer(self._learning_rate))

    def _create_accumulators(self, block, parameters):
        assert isinstance(block, framework.Block)

        for p in parameters:
-            self._add_accumulator(block, self._moment_acc_str, p, 'float32')
+            self._add_accumulator(self._moment_acc_str, p)

    def _append_optimize_op(self, block, param_and_grad):
        assert isinstance(block, framework.Block)
@@ -378,51 +377,46 @@ class AdamOptimizer(Optimizer):
        self._epsilon = epsilon

    def _initialize_tensors(self, block):
-        assert isinstance(block, framework.Block)
        lr_shape = [1]
        # create a variable for learning_rate
-        self._lr = block.create_var(
-            dtype="float32", shape=lr_shape, lod_level=0)
-
-        # create an op to init the learning_rate
-        # FIXME: Fix when Initialization design has been implemented
-        # https://github.com/PaddlePaddle/Paddle/pull/4852
-        block.append_op(
-            type="fill_constant",
-            outputs={"Out": self._lr},
-            attrs={"shape": lr_shape,
-                   "value": self._learning_rate})
+        self._lr = self.helper.create_global_variable(
+            name=unique_name("learning_rate"),
+            dtype='float32',
+            shape=lr_shape,
+            lod_level=1,
+            persistable=True)
+        self.helper.set_variable_initializer(
+            var=self._lr, initializer=ConstantInitializer(self._learning_rate))

    def _create_accumulators(self, block, parameters):
        assert isinstance(block, framework.Block)

-        global_block = block.program.global_block()
+        main_block = block.program.global_block()
        # Create beta1 and beta2 power tensors
        beta_shape = [1]
-        # Create variables for beta1 and beta2 powers
-        self._beta1_pow_acc = global_block.create_var(
-            dtype="float32", shape=beta_shape, lod_level=0)
-        self._beta2_pow_acc = global_block.create_var(
-            dtype="float32", shape=beta_shape, lod_level=0)
-
-        # Initialize beta1 and beta2 power accumulators
-        # FIXME: Fix when Initialization design has been implemented
-        # https://github.com/PaddlePaddle/Paddle/pull/4852
-        global_block.append_op(
-            type="fill_constant",
-            outputs={"Out": self._beta1_pow_acc},
-            attrs={"shape": beta_shape,
-                   "value": self._beta1})
-        global_block.append_op(
-            type="fill_constant",
-            outputs={"Out": self._beta2_pow_acc},
-            attrs={"shape": beta_shape,
-                   "value": self._beta2})
+        self._beta1_pow_acc = self.helper.create_global_variable(
+            name=unique_name('beta1_pow_acc'),
+            dtype='float32',
+            shape=beta_shape,
+            lod_level=0,
+            persistable=True)
+        self.helper.set_variable_initializer(
+            self._beta1_pow_acc, initializer=ConstantInitializer(self._beta1))
+
+        self._beta2_pow_acc = self.helper.create_global_variable(
+            name=unique_name('beta2_pow_acc'),
+            dtype='float32',
+            shape=beta_shape,
+            lod_level=0,
+            persistable=True)
+
+        self.helper.set_variable_initializer(
+            self._beta2_pow_acc, initializer=ConstantInitializer(self._beta2))

        # Create accumulator tensors for first and second moments
        for p in parameters:
-            self._add_accumulator(block, self._moment1_acc_str, p, 'float32')
-            self._add_accumulator(block, self._moment2_acc_str, p, 'float32')
+            self._add_accumulator(self._moment1_acc_str, p)
+            self._add_accumulator(self._moment2_acc_str, p)

    def _append_optimize_op(self, block, param_and_grad):
        assert isinstance(block, framework.Block)
@@ -460,14 +454,14 @@ class AdamOptimizer(Optimizer):
        """Update Beta1 and Beta2 Power accumulators
        """
        assert isinstance(block, framework.Block)
-        global_block = block.program.global_block()
-        scale_beta1 = global_block.append_op(
+        main_block = block.program.global_block()
+        scale_beta1 = main_block.append_op(
            type="scale",
            inputs={"X": self._beta1_pow_acc},
            outputs={"Out": self._beta1_pow_acc},
            attrs={"scale": self._beta1})

-        scale_beta2 = global_block.append_op(
+        scale_beta2 = main_block.append_op(
            type="scale",
            inputs={"X": self._beta2_pow_acc},
            outputs={"Out": self._beta2_pow_acc},
@@ -500,43 +494,33 @@ class AdamaxOptimizer(Optimizer):
        self._epsilon = epsilon

    def _initialize_tensors(self, block):
-        assert isinstance(block, framework.Block)
        lr_shape = [1]
        # create a variable for learning_rate
-        self._lr = block.create_var(
-            dtype="float32", shape=lr_shape, lod_level=0)
-
-        # create an op to init the learning_rate
-        # FIXME: Fix when Initialization design has been implemented
-        # https://github.com/PaddlePaddle/Paddle/pull/4852
-        block.append_op(
-            type="fill_constant",
-            outputs={"Out": self._lr},
-            attrs={"shape": lr_shape,
-                   "value": self._learning_rate})
+        self._lr = self.helper.create_global_variable(
+            name=unique_name("learning_rate"),
+            dtype='float32',
+            shape=lr_shape,
+            lod_level=1,
+            persistable=True)
+        self.helper.set_variable_initializer(
+            var=self._lr, initializer=ConstantInitializer(self._learning_rate))

    def _create_accumulators(self, block, parameters):
-        assert isinstance(block, framework.Block)
-
-        global_block = block.program.global_block()
        # Create beta1 power accumulator tensor
        beta_shape = [1]
-        self._beta1_pow_acc = global_block.create_var(
-            dtype="float32", shape=beta_shape, lod_level=0)
-
-        # Initialize beta1 power accumulator
-        # FIXME: Fix when Initialization design has been implemented
-        # https://github.com/PaddlePaddle/Paddle/pull/4852
-        global_block.append_op(
-            type="fill_constant",
-            outputs={"Out": self._beta1_pow_acc},
-            attrs={"shape": beta_shape,
-                   "value": self._beta1})
+        self._beta1_pow_acc = self.helper.create_global_variable(
+            name=unique_name('beta1_pow_acc'),
+            dtype='float32',
+            shape=beta_shape,
+            lod_level=0,
+            persistable=True)
+        self.helper.set_variable_initializer(
+            self._beta1_pow_acc, initializer=ConstantInitializer(self._beta1))

        # Create accumulator tensors for first moment and infinity norm
        for p in parameters:
-            self._add_accumulator(block, self._moment_acc_str, p, 'float32')
-            self._add_accumulator(block, self._inf_norm_acc_str, p, 'float32')
+            self._add_accumulator(self._moment_acc_str, p)
+            self._add_accumulator(self._inf_norm_acc_str, p)

    def _append_optimize_op(self, block, param_and_grad):
        assert isinstance(block, framework.Block)
@@ -572,8 +556,8 @@ class AdamaxOptimizer(Optimizer):
        """Update Beta1 Power accumulator
        """
        assert isinstance(block, framework.Block)
-        global_block = block.program.global_block()
-        scale_beta1 = global_block.append_op(
+        main_block = block.program.global_block()
+        scale_beta1 = main_block.append_op(
            type="scale",
            inputs={"X": self._beta1_pow_acc},
            outputs={"Out": self._beta1_pow_acc},

--- a/python/paddle/v2/framework/tests/test_conv2dtranspose_op.py
+++ b/python/paddle/v2/framework/tests/test_conv2dtranspose_op.py
@@ -45,23 +45,36 @@ class TestConv2dTransposeOp(OpTest):
        filter_ = np.random.random(self.filter_size).astype("float32")
        output = conv2dtranspose_forward_naive(
            input_, filter_, conv2dtranspose_param).astype('float32')
-        # print 'deconv output py', output, output.shape

        self.inputs = {'Input': input_, 'Filter': filter_}
        self.attrs = {
            'strides': self.stride,
            'paddings': self.pad,
-            # 'dilations': self.dilations
+            'dilations': self.dilations
        }
        self.outputs = {'Output': output}

    def test_check_output(self):
-        print 'check output here'
+        print 'check output here for', self.op_type
        self.check_output()

-    def test_check_grad(self):
+    def init_test_case(self):
+        self.pad = [0, 0]
+        self.stride = [1, 1]
+        self.dilations = [1, 1]
+        self.input_size = [2, 3, 5, 5]  # NCHW
+        f_c = self.input_size[1]
+        self.filter_size = [f_c, 6, 3, 3]
+
+    def init_op_type(self):
+        self.op_type = "conv2d_transpose"
+
+    def test_check_grad_no_input(self):
        self.check_grad(
-            set(['Input', 'Filter']), 'Output', max_relative_error=0.05)
+            ['Filter'],
+            'Output',
+            max_relative_error=0.05,
+            no_grad_set=set(['Input']))

    def test_check_grad_no_filter(self):
        self.check_grad(
@@ -70,33 +83,15 @@ class TestConv2dTransposeOp(OpTest):
            max_relative_error=0.05,
            no_grad_set=set(['Filter']))

-    def test_check_grad_no_input(self):
+    def test_check_grad(self):
        self.check_grad(
-            ['Filter'],
-            'Output',
-            max_relative_error=0.05,
-            no_grad_set=set(['Input']))
+            set(['Input', 'Filter']), 'Output', max_relative_error=0.05)

-    def init_test_case(self):
-        self.pad = [0, 0]
-        self.stride = [1, 1]
-        self.dilations = [1, 1]
-        self.input_size = [2, 3, 5, 5]  # NCHW
-        f_c = self.input_size[1]
-        self.filter_size = [f_c, 6, 3, 3]

+class TestCudnn(TestConv2dTransposeOp):
    def init_op_type(self):
-        self.op_type = "conv2dtranspose"
-
+        self.op_type = "conv2d_transpose_cudnn"

-"""
-class TestCudnn(TestConv2dOp):
-    def init_group(self):
-        self.groups = 1
-
-    def init_op_type(self):
-        self.op_type = "conv_cudnn"
-"""

 if __name__ == '__main__':
    unittest.main()
--- a/python/paddle/v2/framework/tests/test_evaluator.py
+++ b/python/paddle/v2/framework/tests/test_evaluator.py
+from paddle.v2.framework.evaluator import Evaluator
+from paddle.v2.framework.op import Operator
+import paddle.v2.framework.core as core
+import unittest
+import op_test
+import numpy as np
+
+
+class TestEvaluator(unittest.TestCase):
+    def setup(self, scope, inputs, outputs):
+        def __create_var__(var_name, arr):
+            np_arr = np.array(arr)
+            scope.var(var_name)
+            # tensor = var.get_tensor()
+            # tensor.set_dims(np_arr.shape)
+
+        for var_name, arr in inputs.iteritems():
+            __create_var__(var_name, arr)
+
+        for var_name, arr in outputs.iteritems():
+            __create_var__(var_name, arr)
+
+    def test_evaluator(self):
+
+        inputs = {
+            'Inference': np.array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 1]]).T,
+            'Label': np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
+        }
+        outputs = {'Accuracy': np.array([0.9])}
+        out_name = 'Accuracy'
+
+        places = [core.CPUPlace()]
+        if core.is_compile_gpu():
+            places.append(core.GPUPlace(0))
+
+        for place in places:
+            scope = core.Scope()
+            self.setup(scope, inputs, outputs)
+
+            evaluator = Evaluator(
+                scope,
+                operator='accuracy',
+                input='Inference',
+                label='Label',
+                output=out_name,
+                place=place)
+            op_test.set_input(scope, evaluator.op, inputs, place)
+            ctx = core.DeviceContext.create(place)
+
+            for i in range(10):  # simulate 10 mini-batches
+                evaluator.evaluate(ctx)
+
+            actual = np.array(scope.find_var(out_name).get_tensor())
+            print actual
+
+            self.assertTrue(
+                np.allclose(
+                    actual, outputs[out_name], atol=1e-5),
+                "output name: " + out_name + " has diff.")
+
+
+if __name__ == '__main__':
+    exit(0)
+    unittest.main()
--- a/python/paddle/v2/framework/tests/test_fill_constant_batch_size_like_op.py
+++ b/python/paddle/v2/framework/tests/test_fill_constant_batch_size_like_op.py
@@ -3,13 +3,27 @@ import numpy as np
 from op_test import OpTest


-class TestFillConstantBatchSizeLikeOp(OpTest):
+class TestFillConstantBatchSizeLikeWhenFirstDimIsBatchSize(OpTest):
    def setUp(self):
        self.op_type = "fill_constant_batch_size_like"
        self.inputs = {'Input': np.random.random((219, 232)).astype("float32")}
-        self.attrs = {'value': 3.5, 'shape': [-1, 132, 777]}
+        self.attrs = {'value': 3.5, 'shape': [-1, 132, 7]}

-        out = np.random.random((219, 132, 777)).astype("float32")
+        out = np.random.random((219, 132, 7)).astype("float32")
+        out.fill(3.5)
+        self.outputs = {'Out': out}
+
+    def test_check_output(self):
+        self.check_output()
+
+
+class TestFillConstantBatchSizeLikeWhenSecondDimIsBatchSize(OpTest):
+    def setUp(self):
+        self.op_type = "fill_constant_batch_size_like"
+        self.inputs = {'Input': np.random.random((219, 232)).astype("float32")}
+        self.attrs = {'value': 3.5, 'shape': [132, -1, 7], 'dim_idx': 1}
+
+        out = np.random.random((132, 232, 7)).astype("float32")
        out.fill(3.5)
        self.outputs = {'Out': out}


--- a/python/paddle/v2/framework/tests/test_fit_a_line.py
+++ b/python/paddle/v2/framework/tests/test_fit_a_line.py
@@ -36,7 +36,7 @@ cost = layers.square_error_cost(
 avg_cost = layers.mean(x=cost, program=program, init_program=init_program)

 sgd_optimizer = optimizer.SGDOptimizer(learning_rate=0.001)
-opts = sgd_optimizer.minimize(avg_cost)
+opts = sgd_optimizer.minimize(avg_cost, init_program)

 BATCH_SIZE = 20


--- a/python/paddle/v2/framework/tests/test_gaussian_random_op.py
+++ b/python/paddle/v2/framework/tests/test_gaussian_random_op.py
@@ -19,7 +19,7 @@ class TestGaussianRandomOp(unittest.TestCase):
        op = Operator(
            "gaussian_random",
            Out='Out',
-            dims=[1000, 784],
+            shape=[1000, 784],
            mean=.0,
            std=1.,
            seed=10)

--- a/python/paddle/v2/framework/tests/test_image_classification_train.py
+++ b/python/paddle/v2/framework/tests/test_image_classification_train.py
@@ -208,7 +208,7 @@ cost = layers.cross_entropy(
 avg_cost = layers.mean(x=cost, program=program, init_program=init_program)

 sgd_optimizer = optimizer.SGDOptimizer(learning_rate=0.001)
-opts = sgd_optimizer.minimize(avg_cost)
+opts = sgd_optimizer.minimize(avg_cost, init_program)

 BATCH_SIZE = 128
 PASS_NUM = 1

--- a/python/paddle/v2/framework/tests/test_inference_model_io.py
+++ b/python/paddle/v2/framework/tests/test_inference_model_io.py
@@ -44,7 +44,7 @@ class TestBook(unittest.TestCase):
            x=cost, program=program, init_program=init_program)

        sgd_optimizer = optimizer.SGDOptimizer(learning_rate=0.001)
-        opts = sgd_optimizer.minimize(avg_cost)
+        opts = sgd_optimizer.minimize(avg_cost, init_program)

        place = core.CPUPlace()
        exe = executor.Executor(place)

--- a/python/paddle/v2/framework/tests/test_initializer.py
+++ b/python/paddle/v2/framework/tests/test_initializer.py
+import numpy as np
+import unittest
+
+import paddle.v2.framework.framework as framework
+import paddle.v2.framework.initializer as initializer
+
+DELTA = 0.00001
+
+
+class TestConstantInitializer(unittest.TestCase):
+    def test_constant_initializer_default_value(self):
+        """Test the constant initializer with default value
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.ConstantInitializer())
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'fill_constant')
+        self.assertAlmostEqual(init_op.attr('value'), 0.0, delta=DELTA)
+
+    def test_constant_initializer(self):
+        """Test constant initializer with supplied value
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.ConstantInitializer(2.3))
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'fill_constant')
+        self.assertAlmostEqual(init_op.attr('value'), 2.3, delta=DELTA)
+
+
+class TestUniformInitializer(unittest.TestCase):
+    def test_uniform_initializer_default_value(self):
+        """Test the uniform initializer with default value
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.UniformInitializer())
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'uniform_random')
+        self.assertAlmostEqual(init_op.attr('min'), -1.0, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('max'), 1.0, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 0)
+
+    def test_uniform_initializer(self):
+        """Test uniform initializer with supplied attributes
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.UniformInitializer(-4.2, 3.1, 123))
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'uniform_random')
+        self.assertAlmostEqual(init_op.attr('min'), -4.2, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('max'), 3.1, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 123)
+
+
+class TestNormalInitializer(unittest.TestCase):
+    def test_normal_initializer_default_value(self):
+        """Test the normal initializer with default value
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.NormalInitializer())
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'gaussian_random')
+        self.assertAlmostEqual(init_op.attr('mean'), 0.0, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('std'), 1.0, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 0)
+
+    def test_normal_initializer(self):
+        """Test normal initializer with supplied attributes
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.NormalInitializer(2.3, 1.9, 123))
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'gaussian_random')
+        self.assertAlmostEqual(init_op.attr('mean'), 2.3, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('std'), 1.9, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 123)
+
+
+class TestXavierInitializer(unittest.TestCase):
+    def test_uniform_xavier_initializer(self):
+        """Test Xavier initializer with uniform distribution on
+           for matrix multiply.
+        """
+        program = framework.Program()
+        block = program.global_block()
+        param = block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.XavierInitializer())
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'uniform_random')
+        limit = np.sqrt(6.0 / (param.shape[0] + param.shape[1]))
+        self.assertAlmostEqual(init_op.attr('min'), -limit, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('max'), limit, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 0)
+
+    def test_uniform_xavier_initializer_conv(self):
+        """Test Xavier initializer with uniform distribution on
+           for convolutions.
+        """
+        program = framework.Program()
+        block = program.global_block()
+        param = block.create_parameter(
+            dtype="float32",
+            shape=[5, 10, 15, 20],
+            lod_level=0,
+            name="param",
+            initializer=initializer.XavierInitializer())
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'uniform_random')
+        receptive_field_size = float(15 * 20)
+        limit = np.sqrt(6.0 / (
+            (param.shape[0] + param.shape[1]) * receptive_field_size))
+        self.assertAlmostEqual(init_op.attr('min'), -limit, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('max'), limit, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 0)
+
+    def test_normal_xavier_initializer(self):
+        """Test Xavier initializer with normal distribution on
+           for matrix multiply.
+        """
+        program = framework.Program()
+        block = program.global_block()
+        param = block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.XavierInitializer(uniform=False))
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'gaussian_random')
+        std = np.sqrt(2.0 / (param.shape[0] + param.shape[1]))
+        self.assertAlmostEqual(init_op.attr('mean'), 0.0, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('std'), std, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 0)
+
+    def test_normal_xavier_initializer_conv(self):
+        """Test Xavier initializer with normal distribution on
+           for convolutions.
+        """
+        program = framework.Program()
+        block = program.global_block()
+        param = block.create_parameter(
+            dtype="float32",
+            shape=[5, 10, 15, 20],
+            lod_level=0,
+            name="param",
+            initializer=initializer.XavierInitializer(uniform=False))
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'gaussian_random')
+        receptive_field_size = float(15 * 20)
+        std = np.sqrt(2.0 / (
+            (param.shape[0] + param.shape[1]) * receptive_field_size))
+        self.assertAlmostEqual(init_op.attr('mean'), 0.0, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('std'), std, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 0)
+
+    def test_xavier_initializer_supplied_arguments(self):
+        """Test the Xavier initializer with supplied arguments
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.XavierInitializer(
+                fan_in=12, fan_out=23, seed=134))
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'uniform_random')
+        limit = np.sqrt(6.0 / (12 + 23))
+        self.assertAlmostEqual(init_op.attr('min'), -limit, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('max'), limit, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 134)
+
+
+if __name__ == '__main__':
+    unittest.main()
--- a/python/paddle/v2/framework/tests/test_linear_chain_crf_op.py
+++ b/python/paddle/v2/framework/tests/test_linear_chain_crf_op.py
+import unittest
+import random
+import numpy as np
+
+from op_test import OpTest
+
+
+class LinearChainCrfForward(object):
+    def __init__(self, seq_start_positions, emission_weights, emission_row_max,
+                 emission_exps, transition_weights, transition_exps, labels):
+        self.tag_num = emission_weights.shape[1]
+        self.seq_num = len(seq_start_positions) - 1
+
+        self.seq_start_positions = seq_start_positions
+        self.labels = labels
+        self.x = emission_weights
+
+        self.x_row_max = emission_row_max
+        self.x_exps = emission_exps
+
+        # unnormalized logits of the transition weights for the start mark.
+        self.a = transition_weights[0, :]
+        self.a_exps = transition_exps[0, :]
+        # unnormalized logits of the transition weights for the end mark.
+        self.b = transition_weights[1, :]
+        self.b_exps = transition_exps[1, :]
+        # unnormalized logits of the transition weights for all the other tags.
+        self.w = transition_weights[2:, :]
+        self.w_exps = transition_exps[2:, :]
+
+        # The output of linear chain crf operator.
+        # alpha is a memo table in dynamic programming to caculate
+        # nomalization factor.
+        self.alpha = np.zeros(
+            (seq_start_positions[-1], self.tag_num), dtype="float64")
+        self.log_likelihood = np.zeros((self.seq_num, 1))
+
+    def _l1_norm(self, x):
+        s = np.sum(x)
+        x /= s
+        return s
+
+    def _forward_a_sequence(self, x, x_row_max, x_exps, label, alpha):
+        seq_len = x_row_max.shape[0]
+        log_likelihood = 0.
+
+        for i in range(self.tag_num):
+            alpha[0, i] = self.a_exps[i] * x_exps[0, i]
+        log_likelihood = -x_row_max[0] - np.log(self._l1_norm(alpha[0, :]))
+
+        # calculate the unnormalized logits of the normalization factor.
+        for k in range(1, seq_len):
+            for i in range(self.tag_num):
+                s = 0.
+                for j in range(self.tag_num):
+                    s += alpha[k - 1, j] * self.w_exps[j, i]
+                alpha[k, i] = x_exps[k, i] * s
+            log_likelihood -= x_row_max[k] + np.log(self._l1_norm(alpha[k, :]))
+        s = 0.
+        for i in range(self.tag_num):
+            s += alpha[-1, i] * self.b_exps[i]
+        log_likelihood -= np.log(s)
+
+        # calculate the nominator part.
+        log_likelihood += (
+            self.a[label[0]] + x[0, label[0]] + self.b[label[-1]])
+
+        for k in range(1, seq_len):
+            log_likelihood += (x[k, label[k]] + self.w[label[k - 1], label[k]])
+        return -log_likelihood
+
+    def crf_forward_compute(self):
+        for i in range(self.seq_num):
+            start = self.seq_start_positions[i]
+            end = self.seq_start_positions[i + 1]
+
+            self.log_likelihood[i] = self._forward_a_sequence(
+                self.x[start:end, :], self.x_row_max[start:end, :],
+                self.x_exps[start:end, :], self.labels[start:end, :],
+                self.alpha[start:end, :])
+        return self.alpha, self.log_likelihood
+
+
+class TestLinearChainCrfOp(OpTest):
+    def set_test_data(self):
+        # TODO(caoying) Fix the unittest by: add the boundary cases when
+        # sequence lengths are 1, 2, and 3.
+
+        SEQ_NUM = 3
+        TAG_NUM = 17
+        MAX_SEQ_LEN = 5
+
+        # the linear_chain_crf operator only supports sequence (LoD level = 1)
+        lod = [[0]]
+        for i in range(SEQ_NUM):
+            lod[-1].append(lod[-1][-1] + random.randint(1, MAX_SEQ_LEN))
+        emission = np.random.uniform(-1, 1,
+                                     [lod[-1][-1], TAG_NUM]).astype("float64")
+        emission_row_max = np.amax(emission, axis=1, keepdims=True)
+        emission_exps = np.exp(emission - emission_row_max)
+
+        transition = np.random.uniform(-0.5, 0.5,
+                                       [TAG_NUM + 2, TAG_NUM]).astype("float64")
+        transition_exps = np.exp(transition)
+
+        labels = np.random.randint(
+            low=0, high=TAG_NUM, size=(lod[-1][-1], 1), dtype="int32")
+
+        self.inputs = {
+            "Emission": (emission, lod),
+            "Transition": transition,
+            "Label": (labels, lod)
+        }
+        crf = LinearChainCrfForward(lod[0], emission, emission_row_max,
+                                    emission_exps, transition, transition_exps,
+                                    labels)
+        alpha, log_likelihood = crf.crf_forward_compute()
+
+        self.outputs = {
+            "Alpha": alpha,
+            "EmissionExps": emission_exps,
+            "TransitionExps": transition_exps,
+            "LogLikelihood": log_likelihood
+        }
+
+    def setUp(self):
+        self.op_type = "linear_chain_crf"
+        self.set_test_data()
+
+    def test_check_output(self):
+        self.check_output()
+
+    def test_check_grad(self):
+        self.check_grad(["Emission", "Transition"], "LogLikelihood")
+
+    def test_check_grad_ignore_transition(self):
+        self.check_grad(
+            ["Emission"], "LogLikelihood", no_grad_set=set("Transition"))
+
+
+if __name__ == "__main__":
+    unittest.main()
--- a/python/paddle/v2/framework/tests/test_lstm_op.py
+++ b/python/paddle/v2/framework/tests/test_lstm_op.py
@@ -52,7 +52,7 @@ def lstm(
        g = np.dot(h_pre, w_h)  # 1 x 4D
        g = g + x
        g = np.reshape(g, (1, g.size))
-        c_tmp, g_i, g_f, g_o = np.split(g, 4, axis=1)
+        c, g_i, g_f, g_o = np.split(g, 4, axis=1)
        if w_c is None:
            g_i = act_gate(g_i)  # 1 x D
            g_f = act_gate(g_f)  # 1 x D
@@ -60,7 +60,7 @@ def lstm(
            w_ic, w_fc, w_oc = np.split(w_c, 3, axis=1)
            g_i = act_gate(g_i + w_ic * c_pre)  # 1 x D
            g_f = act_gate(g_f + w_fc * c_pre)  # 1 x D
-        c = g_f * c_pre + g_i * act_cand(c_tmp)  # 1 x D
+        c = g_f * c_pre + g_i * act_cand(c)  # 1 x D

        if w_c is None:
            g_o = act_gate(g_o)  # 1 x D
@@ -68,8 +68,7 @@ def lstm(
            _, _, w_oc = np.split(w_c, 3, axis=1)
            g_o = act_gate(g_o + w_oc * c)  # 1 x D
        h = g_o * act_cell(c)
-        bg = np.concatenate((act_cand(c_tmp), g_i, g_f, g_o), axis=1)
-        return h, c, bg
+        return h, c

    def _reverse(x, lod):
        y = np.zeros_like(x)
@@ -82,7 +81,6 @@ def lstm(
    batch_size = len(offset) - 1
    hidden = []
    cell = []
-    gate = []
    input = _reverse(input, offset) if is_reverse else input
    if w_b is not None:
        input = input + np.tile(w_b, (offset[-1], 1))
@@ -94,96 +92,109 @@ def lstm(
        c_pre = c0[i]  # 1 x D
        for j in range(seq_len):
            # compute one step
-            h_pre, c_pre, g_pre = _step(x[j], w_h, w_c, h_pre, c_pre, act_gate,
-                                        act_cell, act_cand)
+            h_pre, c_pre = _step(x[j], w_h, w_c, h_pre, c_pre, act_gate,
+                                 act_cell, act_cand)
            hidden.append(h_pre.flatten())
            cell.append(c_pre.flatten())
-            gate.append(g_pre.flatten())

-    hidden = np.array(hidden).astype("float64")
-    cell = np.array(cell).astype("float64")
-    gate = np.array(gate).astype("float64")
+    hidden = np.array(hidden).astype('float64')
+    cell = np.array(cell).astype('float64')

    hidden = _reverse(hidden, offset) if is_reverse else hidden
    cell = _reverse(cell, offset) if is_reverse else cell

-    assert gate.shape == input.shape
    assert hidden.shape == (input.shape[0], input.shape[1] / 4)
    assert cell.shape == (input.shape[0], input.shape[1] / 4)
-    return hidden, cell, gate
+    return hidden, cell


 class TestLstmOp(OpTest):
-    def set_data(self):
-        self.lod = [[0, 2, 6, 9]]
-        self.D = 64
-        self.sort_idx = [2, 6, 0, 3, 7, 1, 4, 8, 5]
+    def set_argument(self):
+        self.lod = [[0, 2, 5, 7]]
+        self.D = 16

-        self.act_gate = "sigmoid"
-        self.act_cell = "tanh"
-        self.act_cand = "tanh"
+        self.act_gate = 'sigmoid'
+        self.act_cell = 'tanh'
+        self.act_cand = 'tanh'

+        self.has_initial_state = True
        self.is_reverse = False

    def setUp(self):
-        self.set_data()
-        self.op_type = "lstm"
+        self.set_argument()
+        self.op_type = 'lstm'

        T = self.lod[0][-1]
        N = len(self.lod[0]) - 1

-        x = np.random.normal(size=(T, 4 * self.D)).astype("float64")
-        h0 = np.zeros((N, self.D)).astype("float64")
-        c0 = np.zeros((N, self.D)).astype("float64")
-        w = np.random.normal(size=(self.D, 4 * self.D)).astype("float64")
-        b = np.random.normal(size=(1, 7 * self.D)).astype("float64")
+        x = np.random.normal(size=(T, 4 * self.D)).astype('float64')
+        h0 = np.zeros((N, self.D)).astype('float64')
+        c0 = np.zeros((N, self.D)).astype('float64')
+        w = np.random.normal(size=(self.D, 4 * self.D)).astype('float64')
+        b = np.random.normal(size=(1, 7 * self.D)).astype('float64')

        w_b = b[:, 0:4 * self.D]
        w_c = b[:, 4 * self.D:]
-        h, c, g = lstm(x, self.lod, h0, c0, w, w_b, w_c, self.is_reverse,
-                       ACTVATION[self.act_gate], ACTVATION[self.act_cell],
-                       ACTVATION[self.act_cand])
-
-        g_sort = np.zeros_like(x)
-        for i, j in enumerate(self.sort_idx):
-            g_sort[i, :] = g[j, :]
-
-        self.inputs = {
-            'Input': (x, self.lod),
-            'H0': h0,
-            'C0': c0,
-            'Weight': w,
-            'Bias': b
-        }
+        h, c = lstm(x, self.lod, h0, c0, w, w_b, w_c, self.is_reverse,
+                    ACTVATION[self.act_gate], ACTVATION[self.act_cell],
+                    ACTVATION[self.act_cand])
+
+        self.inputs = {'Input': (x, self.lod), 'Weight': w, 'Bias': b}
+        if self.has_initial_state:
+            self.inputs['H0'] = h0
+            self.inputs['C0'] = c0
+
        self.outputs = {
            'Hidden': (h, self.lod),
            'Cell': (c, self.lod),
-            'BatchGate': g_sort
        }
        self.attrs = {
            'usePeepholes': True,
            'isReverse': self.is_reverse,
-            'gateActivation': 'sigmoid',
-            'cellActivation': 'tanh',
-            'candidateActivation': 'tanh'
+            'gateActivation': self.act_gate,
+            'cellActivation': self.act_cell,
+            'candidateActivation': self.act_cand
        }

    def test_check_output(self):
-        self.check_output()
+        self.check_output(atol=1e-8)
+
+    #TODO(qingqing) add more unit testing case
+    def test_check_grad(self):
+        # TODO(qingqing) remove folowing lines after the check_grad is refined.
+        N = len(self.lod[0]) - 1
+        self.outputs['BatchGate'] = np.zeros((N, 4 * self.D)).astype('float64')
+        self.outputs['BatchCellPreAct'] = np.zeros(
+            (N, self.D)).astype('float64')
+        self.check_grad(
+            ['Input', 'Weight', 'Bias'], ['Hidden'], max_relative_error=5e-4)
+
+
+class TestLstmOpHasNoInitial(TestLstmOp):
+    def set_argument(self):
+        self.lod = [[0, 2, 5, 7]]
+        self.D = 16
+
+        self.act_gate = 'sigmoid'
+        self.act_cell = 'tanh'
+        self.act_cand = 'tanh'
+
+        self.has_initial_state = False
+        self.is_reverse = True


 class TestLstmOpRerverse(TestLstmOp):
-    def set_data(self):
-        self.lod = [[0, 2, 6, 9]]
-        self.D = 64
-        self.sort_idx = [2, 6, 0, 3, 7, 1, 4, 8, 5]
+    def set_argument(self):
+        self.lod = [[0, 2, 5, 7]]
+        self.D = 16

-        self.act_gate = "sigmoid"
-        self.act_cell = "tanh"
-        self.act_cand = "tanh"
+        self.act_gate = 'sigmoid'
+        self.act_cell = 'tanh'
+        self.act_cand = 'tanh'

+        self.has_initial_state = True
        self.is_reverse = True


-if __name__ == "__main__":
+if __name__ == '__main__':
    unittest.main()
--- a/python/paddle/v2/framework/tests/test_optimizer.py
+++ b/python/paddle/v2/framework/tests/test_optimizer.py
@@ -7,6 +7,7 @@ from paddle.v2.framework.backward import append_backward_ops

 class TestOptimizer(unittest.TestCase):
    def test_sgd_optimizer(self):
+        init_program = framework.Program()
        program = framework.Program()
        block = program.global_block()
        mul_x = block.create_parameter(
@@ -22,12 +23,13 @@ class TestOptimizer(unittest.TestCase):
            outputs={"Out": mul_out},
            attrs={"x_num_col_dims": 1})
        sgd_optimizer = optimizer.SGDOptimizer(learning_rate=0.01)
-        opts = sgd_optimizer.minimize(mul_out)
+        opts = sgd_optimizer.minimize(mul_out, init_program)
        self.assertEqual(len(opts), 1)
        sgd_op = opts[0]
        self.assertEqual(sgd_op.type, "sgd")

    def test_sgd_optimizer_with_global_step(self):
+        init_program = framework.Program()
        program = framework.Program()
        block = program.global_block()
        mul_x = block.create_parameter(
@@ -44,15 +46,22 @@ class TestOptimizer(unittest.TestCase):
            attrs={"x_num_col_dims": 1})
        global_step = block.create_var(
            dtype="float32", shape=[1], lod_level=0, name="step")
+        learning_rate = 0.01
        sgd_optimizer = optimizer.SGDOptimizer(
-            learning_rate=0.01, global_step=global_step)
-        opts = sgd_optimizer.minimize(mul_out)
+            learning_rate=learning_rate, global_step=global_step)
+        opts = sgd_optimizer.minimize(mul_out, init_program)
        self.assertEqual(len(opts), 2)
        sgd_op = opts[0]
        self.assertEqual(sgd_op.type, "sgd")
        increment_op = opts[1]
        self.assertEqual(increment_op.type, "increment")

+        # Check init_program
+        init_ops = init_program.global_block().ops
+        self.assertEqual(len(init_ops), 1)
+        self.assertEqual(init_ops[0].type, "fill_constant")
+        self.assertAlmostEqual(init_ops[0].attr('value'), learning_rate)
+

 class TestMomentumOptimizer(unittest.TestCase):
    class MockMomentum(optimizer.MomentumOptimizer):
@@ -63,6 +72,7 @@ class TestMomentumOptimizer(unittest.TestCase):
            return self._velocity_acc_str

    def test_vanilla_momentum_optimizer(self):
+        init_program = framework.Program()
        program = framework.Program()
        block = program.global_block()
        mul_x = block.create_parameter(
@@ -77,12 +87,14 @@ class TestMomentumOptimizer(unittest.TestCase):
                    "Y": mul_y},
            outputs={"Out": mul_out},
            attrs={"x_num_col_dims": 1})
-        momentum_optimizer = self.MockMomentum(learning_rate=0.01, momentum=0.2)
+        learning_rate = 0.01
+        momentum_optimizer = self.MockMomentum(
+            learning_rate=learning_rate, momentum=0.2)
        params_grads = append_backward_ops(mul_out)
        self.assertEqual(len(params_grads), 1)
        self.assertEqual(len(momentum_optimizer.get_accumulators()), 0)
-        opts = momentum_optimizer.create_optimization_pass(params_grads,
-                                                           mul_out)
+        opts = momentum_optimizer.create_optimization_pass(
+            params_grads, mul_out, init_program)
        self.assertEqual(len(opts), 1)
        sgd_op = opts[0]
        self.assertEqual(sgd_op.type, "momentum")
@@ -96,7 +108,16 @@ class TestMomentumOptimizer(unittest.TestCase):
        self.assertEqual(len(velocity_acc), 1)
        self.assertTrue(mul_x.name in velocity_acc)

+        # Check init_program
+        init_ops = init_program.global_block().ops
+        self.assertEqual(len(init_ops), 2)
+        self.assertEqual(init_ops[0].type, "fill_constant")
+        self.assertAlmostEqual(init_ops[0].attr('value'), learning_rate)
+        self.assertEqual(init_ops[1].type, "fill_constant")
+        self.assertAlmostEqual(init_ops[1].attr('value'), 0.0)
+
    def test_nesterov_momentum_optimizer(self):
+        init_program = framework.Program()
        program = framework.Program()
        block = program.global_block()
        mul_x = block.create_parameter(
@@ -111,13 +132,14 @@ class TestMomentumOptimizer(unittest.TestCase):
                    "Y": mul_y},
            outputs={"Out": mul_out},
            attrs={"x_num_col_dims": 1})
+        learning_rate = 0.01
        momentum_optimizer = self.MockMomentum(
-            learning_rate=0.01, momentum=0.2, use_nesterov=True)
+            learning_rate=learning_rate, momentum=0.2, use_nesterov=True)
        params_grads = append_backward_ops(mul_out)
        self.assertEqual(len(params_grads), 1)
        self.assertEqual(len(momentum_optimizer.get_accumulators()), 0)
-        opts = momentum_optimizer.create_optimization_pass(params_grads,
-                                                           mul_out)
+        opts = momentum_optimizer.create_optimization_pass(
+            params_grads, mul_out, init_program)
        self.assertEqual(len(opts), 1)
        sgd_op = opts[0]
        self.assertEqual(sgd_op.type, "momentum")
@@ -131,6 +153,14 @@ class TestMomentumOptimizer(unittest.TestCase):
        self.assertEqual(len(velocity_acc), 1)
        self.assertTrue(mul_x.name in velocity_acc)

+        # Check init_program
+        init_ops = init_program.global_block().ops
+        self.assertEqual(len(init_ops), 2)
+        self.assertEqual(init_ops[0].type, "fill_constant")
+        self.assertAlmostEqual(init_ops[0].attr('value'), learning_rate)
+        self.assertEqual(init_ops[1].type, "fill_constant")
+        self.assertAlmostEqual(init_ops[1].attr('value'), 0.0)
+

 class TestAdagradOptimizer(unittest.TestCase):
    class MockAdagrad(optimizer.AdagradOptimizer):
@@ -141,6 +171,7 @@ class TestAdagradOptimizer(unittest.TestCase):
            return self._moment_acc_str

    def test_adagrad_optimizer(self):
+        init_program = framework.Program()
        program = framework.Program()
        block = program.global_block()
        mul_x = block.create_parameter(
@@ -155,11 +186,14 @@ class TestAdagradOptimizer(unittest.TestCase):
                    "Y": mul_y},
            outputs={"Out": mul_out},
            attrs={"x_num_col_dims": 1})
-        adagrad_optimizer = self.MockAdagrad(learning_rate=0.01, epsilon=1.0e-6)
+        learning_rate = 0.01
+        adagrad_optimizer = self.MockAdagrad(
+            learning_rate=learning_rate, epsilon=1.0e-6)
        params_grads = append_backward_ops(mul_out)
        self.assertEqual(len(params_grads), 1)
        self.assertEqual(len(adagrad_optimizer.get_accumulators()), 0)
-        opts = adagrad_optimizer.create_optimization_pass(params_grads, mul_out)
+        opts = adagrad_optimizer.create_optimization_pass(params_grads, mul_out,
+                                                          init_program)
        self.assertEqual(len(opts), 1)
        adagrad_op = opts[0]
        self.assertEqual(adagrad_op.type, "adagrad")
@@ -172,6 +206,14 @@ class TestAdagradOptimizer(unittest.TestCase):
        self.assertEqual(len(moment_acc), 1)
        self.assertTrue(mul_x.name in moment_acc)

+        # Check init_program
+        init_ops = init_program.global_block().ops
+        self.assertEqual(len(init_ops), 2)
+        self.assertEqual(init_ops[0].type, "fill_constant")
+        self.assertAlmostEqual(init_ops[0].attr('value'), learning_rate)
+        self.assertEqual(init_ops[1].type, "fill_constant")
+        self.assertAlmostEqual(init_ops[1].attr('value'), 0.0)
+

 class TestAdamOptimizer(unittest.TestCase):
    class MockAdam(optimizer.AdamOptimizer):
@@ -185,6 +227,7 @@ class TestAdamOptimizer(unittest.TestCase):
            return self._moment2_acc_str

    def test_adam_optimizer(self):
+        init_program = framework.Program()
        program = framework.Program()
        block = program.global_block()
        mul_x = block.create_parameter(
@@ -199,12 +242,14 @@ class TestAdamOptimizer(unittest.TestCase):
                    "Y": mul_y},
            outputs={"Out": mul_out},
            attrs={"x_num_col_dims": 1})
+        learning_rate = 0.01
        adam_optimizer = self.MockAdam(
-            learning_rate=0.01, beta1=0.9, beta2=0.999)
+            learning_rate=learning_rate, beta1=0.9, beta2=0.999)
        params_grads = append_backward_ops(mul_out)
        self.assertEqual(len(params_grads), 1)
        self.assertEqual(len(adam_optimizer.get_accumulators()), 0)
-        opts = adam_optimizer.create_optimization_pass(params_grads, mul_out)
+        opts = adam_optimizer.create_optimization_pass(params_grads, mul_out,
+                                                       init_program)
        self.assertEqual(len(opts), 3)
        adam_op = opts[0]
        self.assertEqual(adam_op.type, "adam")
@@ -221,6 +266,12 @@ class TestAdamOptimizer(unittest.TestCase):
        self.assertTrue(mul_x.name in moment1_acc)
        self.assertTrue(mul_x.name in moment2_acc)

+        # Check init_program
+        init_ops = init_program.global_block().ops
+        self.assertEqual(len(init_ops), 5)
+        self.assertEqual(init_ops[0].type, "fill_constant")
+        self.assertAlmostEqual(init_ops[0].attr('value'), learning_rate)
+

 class TestAdamaxOptimizer(unittest.TestCase):
    class MockAdamax(optimizer.AdamaxOptimizer):
@@ -234,6 +285,7 @@ class TestAdamaxOptimizer(unittest.TestCase):
            return self._inf_norm_acc_str

    def test_adamax_optimizer(self):
+        init_program = framework.Program()
        program = framework.Program()
        block = program.global_block()
        mul_x = block.create_parameter(
@@ -248,12 +300,14 @@ class TestAdamaxOptimizer(unittest.TestCase):
                    "Y": mul_y},
            outputs={"Out": mul_out},
            attrs={"x_num_col_dims": 1})
+        learning_rate = 0.01
        adamax_optimizer = self.MockAdamax(
-            learning_rate=0.01, beta1=0.9, beta2=0.999)
+            learning_rate=learning_rate, beta1=0.9, beta2=0.999)
        params_grads = append_backward_ops(mul_out)
        self.assertEqual(len(params_grads), 1)
        self.assertEqual(len(adamax_optimizer.get_accumulators()), 0)
-        opts = adamax_optimizer.create_optimization_pass(params_grads, mul_out)
+        opts = adamax_optimizer.create_optimization_pass(params_grads, mul_out,
+                                                         init_program)
        self.assertEqual(len(opts), 2)
        adam_op = opts[0]
        self.assertEqual(adam_op.type, "adamax")
@@ -270,6 +324,12 @@ class TestAdamaxOptimizer(unittest.TestCase):
        self.assertTrue(mul_x.name in moment_acc)
        self.assertTrue(mul_x.name in inf_norm_acc)

+        # Check init_program
+        init_ops = init_program.global_block().ops
+        self.assertEqual(len(init_ops), 4)
+        self.assertEqual(init_ops[0].type, "fill_constant")
+        self.assertAlmostEqual(init_ops[0].attr('value'), learning_rate)
+

 if __name__ == '__main__':
    unittest.main()
--- a/python/paddle/v2/framework/tests/test_positive_negative_pair_op.py
+++ b/python/paddle/v2/framework/tests/test_positive_negative_pair_op.py
@@ -8,7 +8,6 @@ def py_pnpair_op(score, label, query, column=-1, weight=None):
    # group by query id
    predictions = {}
    batch_size = label.shape[0]
-    print "batch_size=", batch_size
    if weight is None:
        weight = np.ones(shape=(batch_size, 1)).astype('float32')
    for s, l, q, w in zip(score, label, query, weight):
@@ -45,7 +44,7 @@ class TestPositiveNegativePairOp(OpTest):
        label = np.random.normal(size=(batch_size, 1)).astype('float32')
        query = np.array(
            [np.random.randint(max_query_id) for i in range(batch_size)])
-        query = np.reshape(query, newshape=(batch_size, 1)).astype('int32')
+        query = np.reshape(query, newshape=(batch_size, 1)).astype('int64')

        pos, neg, neu = py_pnpair_op(score, label, query)
        self.inputs = {'Score': score, 'Label': label, 'QueryID': query}
@@ -60,64 +59,26 @@ class TestPositiveNegativePairOp(OpTest):
        self.check_output()


-class TestPositiveNegativePairOpAccumulate(OpTest):
-    def setUp(self):
-        self.op_type = 'positive_negative_pair'
-        batch_size = 20
-        max_query_id = 5
-        max_random_num = 2 << 15
-        score = np.random.normal(size=(batch_size, 2)).astype('float32')
-        label = np.random.normal(size=(batch_size, 1)).astype('float32')
-        query = np.array(
-            [np.random.randint(max_query_id) for i in range(batch_size)])
-        query = np.reshape(query, newshape=(batch_size, 1)).astype('int32')
-        acc_pos = np.reshape(
-            np.random.randint(max_random_num), newshape=(1)).astype('float32')
-        acc_neg = np.reshape(
-            np.random.randint(max_random_num), newshape=(1)).astype('float32')
-        acc_neu = np.reshape(
-            np.random.randint(max_random_num), newshape=(1)).astype('float32')
-        column = 0
-
-        pos, neg, neu = py_pnpair_op(score, label, query, column=column)
-        self.inputs = {
-            'Score': score,
-            'Label': label,
-            'QueryID': query,
-            'AccumulatePositivePair': acc_pos,
-            'AccumulateNegativePair': acc_neg,
-            'AccumulateNeutralPair': acc_neu,
-        }
-        self.attrs = {'column': column}
-        self.outputs = {
-            'PositivePair': pos + acc_pos,
-            'NegativePair': neg + acc_neg,
-            'NeutralPair': neu + acc_neu
-        }
-
-    def test_check_output(self):
-        self.check_output()
-
-
 class TestPositiveNegativePairOpAccumulateWeight(OpTest):
    def setUp(self):
        self.op_type = 'positive_negative_pair'
        batch_size = 20
        max_query_id = 5
        max_random_num = 2 << 15
+        score_dim = 2
        score = np.random.normal(size=(batch_size, 2)).astype('float32')
        label = np.random.normal(size=(batch_size, 1)).astype('float32')
        weight = np.random.normal(size=(batch_size, 1)).astype('float32')
        query = np.array(
            [np.random.randint(max_query_id) for i in range(batch_size)])
-        query = np.reshape(query, newshape=(batch_size, 1)).astype('int32')
+        query = np.reshape(query, newshape=(batch_size, 1)).astype('int64')
        acc_pos = np.reshape(
            np.random.randint(max_random_num), newshape=(1)).astype('float32')
        acc_neg = np.reshape(
            np.random.randint(max_random_num), newshape=(1)).astype('float32')
        acc_neu = np.reshape(
            np.random.randint(max_random_num), newshape=(1)).astype('float32')
-        column = 0
+        column = np.random.randint(score_dim)

        pos, neg, neu = py_pnpair_op(
            score, label, query, column=column, weight=weight)

--- a/python/paddle/v2/framework/tests/test_precision_recall_op.py
+++ b/python/paddle/v2/framework/tests/test_precision_recall_op.py
+import unittest
+import numpy as np
+from op_test import OpTest
+
+
+def calc_precision(tp_count, fp_count):
+    if tp_count > 0.0 or fp_count > 0.0:
+        return tp_count / (tp_count + fp_count)
+    return 1.0
+
+
+def calc_recall(tp_count, fn_count):
+    if tp_count > 0.0 or fn_count > 0.0:
+        return tp_count / (tp_count + fn_count)
+    return 1.0
+
+
+def calc_f1_score(precision, recall):
+    if precision > 0.0 or recall > 0.0:
+        return 2 * precision * recall / (precision + recall)
+    return 0.0
+
+
+def get_states(idxs, labels, cls_num, weights=None):
+    ins_num = idxs.shape[0]
+    # TP FP TN FN
+    states = np.zeros((cls_num, 4)).astype('float32')
+    for i in xrange(ins_num):
+        w = weights[i] if weights is not None else 1.0
+        idx = idxs[i][0]
+        label = labels[i][0]
+        if idx == label:
+            states[idx][0] += w
+            for j in xrange(cls_num):
+                states[j][2] += w
+            states[idx][2] -= w
+        else:
+            states[label][3] += w
+            states[idx][1] += w
+            for j in xrange(cls_num):
+                states[j][2] += w
+            states[label][2] -= w
+            states[idx][2] -= w
+    return states
+
+
+def compute_metrics(states, cls_num):
+    total_tp_count = 0.0
+    total_fp_count = 0.0
+    total_fn_count = 0.0
+    macro_avg_precision = 0.0
+    macro_avg_recall = 0.0
+    for i in xrange(cls_num):
+        total_tp_count += states[i][0]
+        total_fp_count += states[i][1]
+        total_fn_count += states[i][3]
+        macro_avg_precision += calc_precision(states[i][0], states[i][1])
+        macro_avg_recall += calc_recall(states[i][0], states[i][3])
+    metrics = []
+    macro_avg_precision /= cls_num
+    macro_avg_recall /= cls_num
+    metrics.append(macro_avg_precision)
+    metrics.append(macro_avg_recall)
+    metrics.append(calc_f1_score(macro_avg_precision, macro_avg_recall))
+    micro_avg_precision = calc_precision(total_tp_count, total_fp_count)
+    metrics.append(micro_avg_precision)
+    micro_avg_recall = calc_recall(total_tp_count, total_fn_count)
+    metrics.append(micro_avg_recall)
+    metrics.append(calc_f1_score(micro_avg_precision, micro_avg_recall))
+    return np.array(metrics).astype('float32')
+
+
+class TestPrecisionRecallOp_0(OpTest):
+    def setUp(self):
+        self.op_type = "precision_recall"
+        ins_num = 64
+        cls_num = 10
+        max_probs = np.random.uniform(0, 1.0, (ins_num, 1)).astype('float32')
+        idxs = np.random.choice(xrange(cls_num), ins_num).reshape(
+            (ins_num, 1)).astype('int32')
+        labels = np.random.choice(xrange(cls_num), ins_num).reshape(
+            (ins_num, 1)).astype('int32')
+        states = get_states(idxs, labels, cls_num)
+        metrics = compute_metrics(states, cls_num)
+
+        self.attrs = {'class_number': cls_num}
+
+        self.inputs = {'MaxProbs': max_probs, 'Indices': idxs, 'Labels': labels}
+
+        self.outputs = {
+            'BatchMetrics': metrics,
+            'AccumMetrics': metrics,
+            'AccumStatesInfo': states
+        }
+
+    def test_check_output(self):
+        self.check_output()
+
+
+class TestPrecisionRecallOp_1(OpTest):
+    def setUp(self):
+        self.op_type = "precision_recall"
+        ins_num = 64
+        cls_num = 10
+        max_probs = np.random.uniform(0, 1.0, (ins_num, 1)).astype('float32')
+        idxs = np.random.choice(xrange(cls_num), ins_num).reshape(
+            (ins_num, 1)).astype('int32')
+        weights = np.random.uniform(0, 1.0, (ins_num, 1)).astype('float32')
+        labels = np.random.choice(xrange(cls_num), ins_num).reshape(
+            (ins_num, 1)).astype('int32')
+
+        states = get_states(idxs, labels, cls_num, weights)
+        metrics = compute_metrics(states, cls_num)
+
+        self.attrs = {'class_number': cls_num}
+
+        self.inputs = {
+            'MaxProbs': max_probs,
+            'Indices': idxs,
+            'Labels': labels,
+            'Weights': weights
+        }
+
+        self.outputs = {
+            'BatchMetrics': metrics,
+            'AccumMetrics': metrics,
+            'AccumStatesInfo': states
+        }
+
+    def test_check_output(self):
+        self.check_output()
+
+
+class TestPrecisionRecallOp_2(OpTest):
+    def setUp(self):
+        self.op_type = "precision_recall"
+        ins_num = 64
+        cls_num = 10
+        max_probs = np.random.uniform(0, 1.0, (ins_num, 1)).astype('float32')
+        idxs = np.random.choice(xrange(cls_num), ins_num).reshape(
+            (ins_num, 1)).astype('int32')
+        weights = np.random.uniform(0, 1.0, (ins_num, 1)).astype('float32')
+        labels = np.random.choice(xrange(cls_num), ins_num).reshape(
+            (ins_num, 1)).astype('int32')
+        states = np.random.randint(0, 30, (cls_num, 4)).astype('float32')
+
+        accum_states = get_states(idxs, labels, cls_num, weights)
+        batch_metrics = compute_metrics(accum_states, cls_num)
+        accum_states += states
+        accum_metrics = compute_metrics(accum_states, cls_num)
+
+        self.attrs = {'class_number': cls_num}
+
+        self.inputs = {
+            'MaxProbs': max_probs,
+            'Indices': idxs,
+            'Labels': labels,
+            'Weights': weights,
+            'StatesInfo': states
+        }
+
+        self.outputs = {
+            'BatchMetrics': batch_metrics,
+            'AccumMetrics': accum_metrics,
+            'AccumStatesInfo': accum_states
+        }
+
+    def test_check_output(self):
+        self.check_output()
+
+
+if __name__ == '__main__':
+    unittest.main()
--- a/python/paddle/v2/framework/tests/test_recognize_digits_conv.py
+++ b/python/paddle/v2/framework/tests/test_recognize_digits_conv.py
@@ -54,8 +54,10 @@ avg_cost = layers.mean(x=cost, program=program)
 accuracy = layers.accuracy(
    input=predict, label=label, program=program, init_program=init_program)

-sgd_optimizer = optimizer.SGDOptimizer(learning_rate=0.001)
-opts = sgd_optimizer.minimize(avg_cost)
+# optimizer = optimizer.MomentumOptimizer(learning_rate=0.1 / 128.0,
+# momentum=0.9)
+optimizer = optimizer.AdamOptimizer(learning_rate=0.01, beta1=0.9, beta2=0.999)
+opts = optimizer.minimize(avg_cost, init_program)

 BATCH_SIZE = 50
 PASS_NUM = 3

--- a/python/paddle/v2/framework/tests/test_recognize_digits_mlp.py
+++ b/python/paddle/v2/framework/tests/test_recognize_digits_mlp.py
@@ -3,9 +3,10 @@ import paddle.v2.framework.layers as layers
 import paddle.v2.framework.core as core
 import paddle.v2.framework.optimizer as optimizer

-from paddle.v2.framework.framework import Program, g_program
+from paddle.v2.framework.framework import Program
 from paddle.v2.framework.executor import Executor
 from paddle.v2.framework.regularizer import L2DecayRegularizer
+from paddle.v2.framework.initializer import UniformInitializer

 import numpy as np

@@ -21,11 +22,8 @@ image = layers.data(

 param_attr = {
    'name': None,
-    'init_attr': {
-        'type': 'uniform_random',
-        'min': -1.0,
-        'max': 1.0
-    },
+    'initializer': UniformInitializer(
+        low=-1.0, high=1.0),
    'regularization': L2DecayRegularizer(0.0005 * BATCH_SIZE)
 }

@@ -60,8 +58,8 @@ cost = layers.cross_entropy(
    input=predict, label=label, program=program, init_program=init_program)
 avg_cost = layers.mean(x=cost, program=program, init_program=init_program)

-sgd_optimizer = optimizer.SGDOptimizer(learning_rate=0.001)
-opts = sgd_optimizer.minimize(avg_cost)
+optimizer = optimizer.MomentumOptimizer(learning_rate=0.001, momentum=0.9)
+opts = optimizer.minimize(avg_cost, init_program)

 train_reader = paddle.batch(
    paddle.reader.shuffle(
@@ -91,6 +89,7 @@ for pass_id in range(PASS_NUM):
                             'y': tensor_y},
                       fetch_list=[avg_cost])
        out = np.array(outs[0])
+
        if out[0] < 5.0:
            exit(0)  # if avg cost less than 5.0, we think our code is good.
 exit(1)
--- a/python/paddle/v2/framework/tests/test_recommender_system.py
+++ b/python/paddle/v2/framework/tests/test_recommender_system.py
+import paddle.v2 as paddle
+import paddle.v2.framework.layers as layers
+import paddle.v2.framework.nets as nets
+import paddle.v2.framework.core as core
+import paddle.v2.framework.optimizer as optimizer
+
+from paddle.v2.framework.framework import Program, g_program
+from paddle.v2.framework.executor import Executor
+
+import numpy as np
+
+init_program = Program()
+program = Program()
+is_sparse = True
+use_gpu = False
+BATCH_SIZE = 256
+
+
+def get_usr_combined_features():
+    # FIXME(dzh) : old API integer_value(10) may has range check.
+    # currently we don't have user configurated check.
+
+    USR_DICT_SIZE = paddle.dataset.movielens.max_user_id() + 1
+
+    uid = layers.data(
+        name='user_id',
+        shape=[1],
+        data_type='int64',
+        program=program,
+        init_program=init_program)
+
+    usr_emb = layers.embedding(
+        input=uid,
+        data_type='float32',
+        size=[USR_DICT_SIZE, 32],
+        param_attr={'name': 'user_table'},
+        is_sparse=is_sparse,
+        program=program,
+        init_program=init_program)
+
+    usr_fc = layers.fc(input=usr_emb,
+                       size=32,
+                       program=program,
+                       init_program=init_program)
+
+    USR_GENDER_DICT_SIZE = 2
+
+    usr_gender_id = layers.data(
+        name='gender_id',
+        shape=[1],
+        data_type='int64',
+        program=program,
+        init_program=init_program)
+
+    usr_gender_emb = layers.embedding(
+        input=usr_gender_id,
+        size=[USR_GENDER_DICT_SIZE, 16],
+        param_attr={'name': 'gender_table'},
+        is_sparse=is_sparse,
+        program=program,
+        init_program=init_program)
+
+    usr_gender_fc = layers.fc(input=usr_gender_emb,
+                              size=16,
+                              program=program,
+                              init_program=init_program)
+
+    USR_AGE_DICT_SIZE = len(paddle.dataset.movielens.age_table)
+    usr_age_id = layers.data(
+        name='age_id',
+        shape=[1],
+        data_type="int64",
+        program=program,
+        init_program=init_program)
+
+    usr_age_emb = layers.embedding(
+        input=usr_age_id,
+        size=[USR_AGE_DICT_SIZE, 16],
+        is_sparse=is_sparse,
+        param_attr={'name': 'age_table'},
+        program=program,
+        init_program=init_program)
+
+    usr_age_fc = layers.fc(input=usr_age_emb,
+                           size=16,
+                           program=program,
+                           init_program=init_program)
+
+    USR_JOB_DICT_SIZE = paddle.dataset.movielens.max_job_id() + 1
+    usr_job_id = layers.data(
+        name='job_id',
+        shape=[1],
+        data_type="int64",
+        program=program,
+        init_program=init_program)
+
+    usr_job_emb = layers.embedding(
+        input=usr_job_id,
+        size=[USR_JOB_DICT_SIZE, 16],
+        param_attr={'name': 'job_table'},
+        is_sparse=is_sparse,
+        program=program,
+        init_program=init_program)
+
+    usr_job_fc = layers.fc(input=usr_job_emb,
+                           size=16,
+                           program=program,
+                           init_program=init_program)
+
+    concat_embed = layers.concat(
+        input=[usr_fc, usr_gender_fc, usr_age_fc, usr_job_fc],
+        axis=1,
+        program=program,
+        init_program=init_program)
+
+    usr_combined_features = layers.fc(input=concat_embed,
+                                      size=200,
+                                      act="tanh",
+                                      program=program,
+                                      init_program=init_program)
+
+    return usr_combined_features
+
+
+def get_mov_combined_features():
+
+    MOV_DICT_SIZE = paddle.dataset.movielens.max_movie_id() + 1
+
+    mov_id = layers.data(
+        name='movie_id',
+        shape=[1],
+        data_type='int64',
+        program=program,
+        init_program=init_program)
+
+    mov_emb = layers.embedding(
+        input=mov_id,
+        data_type='float32',
+        size=[MOV_DICT_SIZE, 32],
+        param_attr={'name': 'movie_table'},
+        is_sparse=is_sparse,
+        program=program,
+        init_program=init_program)
+
+    mov_fc = layers.fc(input=mov_emb,
+                       size=32,
+                       program=program,
+                       init_program=init_program)
+
+    CATEGORY_DICT_SIZE = len(paddle.dataset.movielens.movie_categories())
+
+    category_id = layers.data(
+        name='category_id',
+        shape=[1],
+        data_type='int64',
+        program=program,
+        init_program=init_program)
+
+    mov_categories_emb = layers.embedding(
+        input=category_id,
+        size=[CATEGORY_DICT_SIZE, 32],
+        is_sparse=is_sparse,
+        program=program,
+        init_program=init_program)
+
+    mov_categories_hidden = layers.sequence_pool(
+        input=mov_categories_emb,
+        pool_type="sum",
+        program=program,
+        init_program=init_program)
+
+    MOV_TITLE_DICT_SIZE = len(paddle.dataset.movielens.get_movie_title_dict())
+
+    mov_title_id = layers.data(
+        name='movie_title',
+        shape=[1],
+        data_type='int64',
+        program=program,
+        init_program=init_program)
+
+    mov_title_emb = layers.embedding(
+        input=mov_title_id,
+        size=[MOV_TITLE_DICT_SIZE, 32],
+        is_sparse=is_sparse,
+        program=program,
+        init_program=init_program)
+
+    mov_title_conv = nets.sequence_conv_pool(
+        input=mov_title_emb,
+        num_filters=32,
+        filter_size=3,
+        act="tanh",
+        pool_type="sum",
+        program=program,
+        init_program=init_program)
+
+    concat_embed = layers.concat(
+        input=[mov_fc, mov_categories_hidden, mov_title_conv],
+        axis=1,
+        program=program,
+        init_program=init_program)
+
+    # FIXME(dzh) : need tanh operator
+    mov_combined_features = layers.fc(input=concat_embed,
+                                      size=200,
+                                      act="tanh",
+                                      program=program,
+                                      init_program=init_program)
+
+    return mov_combined_features
+
+
+def model():
+    usr_combined_features = get_usr_combined_features()
+    mov_combined_features = get_mov_combined_features()
+
+    # need cos sim
+    inference = layers.cos_sim(
+        X=usr_combined_features,
+        Y=mov_combined_features,
+        program=program,
+        init_program=init_program)
+
+    label = layers.data(
+        name='score',
+        shape=[1],
+        data_type='float32',
+        program=program,
+        init_program=init_program)
+
+    square_cost = layers.square_error_cost(
+        input=inference,
+        label=label,
+        program=program,
+        init_program=init_program)
+
+    avg_cost = layers.mean(
+        x=square_cost, program=program, init_program=init_program)
+
+    return avg_cost
+
+
+def main():
+    cost = model()
+    sgd_optimizer = optimizer.SGDOptimizer(learning_rate=0.2)
+    opts = sgd_optimizer.minimize(cost, init_program=init_program)
+    block = program.block(0)
+
+    if use_gpu:
+        place = core.GPUPlace(0)
+    else:
+        place = core.CPUPlace()
+
+    exe = Executor(place)
+    exe.run(init_program, feed={}, fetch_list=[])
+
+    train_reader = paddle.batch(
+        paddle.reader.shuffle(
+            paddle.dataset.movielens.train(), buf_size=8192),
+        batch_size=BATCH_SIZE)
+
+    feeding = {
+        'user_id': 0,
+        'gender_id': 1,
+        'age_id': 2,
+        'job_id': 3,
+        'movie_id': 4,
+        'category_id': 5,
+        'movie_title': 6,
+        'score': 7
+    }
+
+    def func_feed(feeding, data):
+        feed_tensors = {}
+        for (key, idx) in feeding.iteritems():
+            tensor = core.LoDTensor()
+            if key != "category_id" and key != "movie_title":
+                if key == "score":
+                    numpy_data = np.array(map(lambda x: x[idx], data)).astype(
+                        "float32")
+                else:
+                    numpy_data = np.array(map(lambda x: x[idx], data)).astype(
+                        "int64")
+            else:
+                numpy_data = map(lambda x: np.array(x[idx]).astype("int64"),
+                                 data)
+                lod_info = [len(item) for item in numpy_data]
+                offset = 0
+                lod = [offset]
+                for item in lod_info:
+                    offset += item
+                    lod.append(offset)
+                numpy_data = np.concatenate(numpy_data, axis=0)
+                tensor.set_lod([lod])
+
+            numpy_data = numpy_data.reshape([numpy_data.shape[0], 1])
+            tensor.set(numpy_data, place)
+            feed_tensors[key] = tensor
+        return feed_tensors
+
+    PASS_NUM = 100
+    for pass_id in range(PASS_NUM):
+        for data in train_reader():
+            outs = exe.run(program,
+                           feed=func_feed(feeding, data),
+                           fetch_list=[cost])
+            out = np.array(outs[0])
+            if out[0] < 6.0:
+                # if avg cost less than 6.0, we think our code is good.
+                exit(0)
+
+
+main()
--- a/python/paddle/v2/framework/tests/test_recurrent_op.py
+++ b/python/paddle/v2/framework/tests/test_recurrent_op.py
-import logging
-import paddle.v2.framework.core as core
 import unittest
-import numpy as np
-from paddle.v2.framework.op import Operator, RecurrentOp
-from op_test import get_numeric_gradient
-

-def py_sigmoid(x):
-    return 1. / (1. + np.exp(-x))
+import logging

+from op_test import get_numeric_gradient
+from paddle.v2.framework.layers import *
+from paddle.v2.framework.framework import Program
+from paddle.v2.framework.executor import Executor
+from paddle.v2.framework.backward import append_backward_ops
+import numpy as np
+import paddle.v2.framework.core as core

-class PySimpleRNN(object):
-    '''
-    A simple implementation of RNN based on numpy, to futhur test RecurrentOp's alogorithm
-    '''

-    def __init__(self, input_dim=30, batch_size=50, weight_dim=15, sent_len=11):
-        self.x = np.random.normal(size=(sent_len, batch_size,
-                                        input_dim)).astype("float32")
-        self.W = np.random.normal(size=(input_dim, input_dim)).astype("float32")
-        self.U = np.random.normal(size=(input_dim, input_dim)).astype("float32")
-        self.h_boot = np.random.normal(size=(batch_size,
-                                             input_dim)).astype("float32")
+class PyRNNBase(object):
+    def __init__(self, input_shape, output_shape):
+        self.x = np.ones(shape=input_shape).astype("float32")
+        self.y = np.zeros(shape=output_shape).astype("float32")

-        # memories
-        self.mems = [
-            np.zeros(shape=(batch_size, input_dim)).astype("float32")
-            for i in range(sent_len)
-        ]
+    def step(self):
+        pass

    def forward(self):
-        xs = self.segment_inputs()
        for step_id in range(self.x.shape[0]):
-            self.step(step_id, xs[step_id])
-        return self.concat_outputs()
+            self.step(step_id, self.x[step_id])
+        return np.array([np.mean(self.y)])

    def segment_inputs(self):
        return [self.x[i] for i in range(self.x.shape[0])]

-    def concat_outputs(self):
-        return np.array(self.mems).astype("float32")
+
+class PySimpleRNN1(PyRNNBase):
+    def __init__(self, input_shape, output_shape):
+        super(PySimpleRNN1, self).__init__(input_shape, output_shape)
+
+        seq_len, batch_size, input_dim = input_shape
+        self.h_boot = np.random.normal(size=(batch_size,
+                                             input_dim)).astype("float32")
+
+        self.scale = 1.0 / 2.0
+        men_dim = (seq_len, batch_size, input_dim)
+        self.mems = np.zeros(shape=men_dim).astype("float32")
+
+    def step(self, step_id, x):
+        if step_id == 0:
+            pre_mem = self.h_boot
+        else:
+            pre_mem = self.mems[step_id - 1]
+        self.mems[step_id] = (pre_mem + x) * self.scale
+        self.y[step_id] = self.mems[step_id]
+
+
+class PySimpleRNN2(PyRNNBase):
+    def __init__(self, input_shape, output_shape):
+        super(PySimpleRNN2, self).__init__(input_shape, output_shape)
+
+        seq_len, batch_size, input_dim = input_shape
+        self.W = np.random.normal(size=(input_dim, input_dim)).astype("float32")
+        self.U = np.random.normal(size=(input_dim, input_dim)).astype("float32")
+        self.h_boot = np.ones(shape=(batch_size, input_dim)).astype("float32")
+
+        men_dim = (seq_len, batch_size, input_dim)
+        self.mems = np.zeros(shape=men_dim).astype("float32")

    def step(self, step_id, x):
-        '''
-        run a step
-        '''
-        mem = self.mems[step_id]
        if step_id > 0:
            pre_mem = self.mems[step_id - 1]
        else:
@@ -53,108 +69,124 @@ class PySimpleRNN(object):
        xW = np.matmul(x, self.W).astype("float32")
        hU = np.matmul(pre_mem, self.U).astype("float32")

-        sum = xW + hU
-        self.mems[step_id] = py_sigmoid(sum)
-
+        def py_sigmoid(x):
+            return 1. / (1. + np.exp(-x))

-class PySimpleRNNTest(unittest.TestCase):
-    def setUp(self):
-        self.rnn = PySimpleRNN()
-
-    def test_forward(self):
-        output = self.rnn.forward()
+        self.mems[step_id] = py_sigmoid(xW + hU)
+        self.y[step_id] = self.mems[step_id]


-def create_tensor(scope, name, shape, np_data):
-    tensor = scope.var(name).get_tensor()
-    tensor.set_dims(shape)
-    tensor.set(np_data, core.CPUPlace())
+def create_tensor(np_data, place):
+    tensor = core.LoDTensor()
+    tensor.set(np_data, place)
    return tensor


-class RecurrentOpTest(unittest.TestCase):
+class RecurrentOpTest1(unittest.TestCase):
    '''
    Test RNNOp
-
    equation:
-        h_t = \sigma (W x_t + U h_{t-1})
-    weights:
-        - W
-        - U
+        h_t = ( x_t + h_{t-1} ) / scale
    vars:
        - x
    memories:
        - h
    outputs:
-       - h
+        - h
    '''

-    input_dim = 30
-    batch_size = 50
-    weight_dim = 15
-    sent_len = 11
+    input_dim = 2
+    batch_size = 1
+    sent_len = 1
+
+    def init_program(self):
+        self.program = Program()
+        self.init_program = Program()
+        self.p_info = {
+            "program": self.program,
+            "init_program": self.init_program
+        }
+        self.place = core.CPUPlace()

    def setUp(self):
-        self.py_rnn = PySimpleRNN(self.input_dim, self.batch_size,
-                                  self.weight_dim, self.sent_len)
+        self.init_program()
+        self.data_field = {"x", "h_boot"}

-    def forward(self):
-        self.scope = core.Scope()
-        self.create_global_variables()
-        self.create_rnn_op()
-        self.create_step_net()
-        ctx = core.DeviceContext.create(core.CPUPlace())
-        self.rnnop.run(self.scope, ctx)
-        return np.array(self.scope.find_var("h@mem").get_tensor()).astype(
-            "float32")
-
-    def create_global_variables(self):
-        # create inlink
-        x_np_data = self.py_rnn.x
-        create_tensor(self.scope, "x",
-                      [self.sent_len, self.batch_size, self.input_dim],
-                      x_np_data)
-        W_np_data = self.py_rnn.W
-        create_tensor(self.scope, "W", [self.input_dim, self.input_dim],
-                      W_np_data)
-
-        U_np_data = self.py_rnn.U
-        create_tensor(self.scope, "U", [self.input_dim, self.input_dim],
-                      U_np_data)
-
-        h_boot_np_data = self.py_rnn.h_boot
-        create_tensor(self.scope, "h_boot", [self.batch_size, self.input_dim],
-                      h_boot_np_data)
-        self.scope.var("step_scopes")
-        self.scope.var("h@mem")
+        self.input_shape = (self.sent_len, self.batch_size, self.input_dim)
+        self.output_shape = (self.sent_len, self.batch_size, self.input_dim)
+        self.py_rnn = PySimpleRNN1(self.input_shape, self.output_shape)
+
+        self.output = mean(x=self.create_rnn_op(), **self.p_info)

    def create_rnn_op(self):
-        # create RNNOp
-        self.rnnop = RecurrentOp(
-            # inputs
-            inputs=["x"],
-            initial_states=["h_boot"],
-            step_net="stepnet",
-            # outputs
-            outputs=["h@mem"],
-            step_scopes="step_scopes",
-            # attributes
-            ex_states=["h@pre"],
-            states=["h@mem"])
-
-    def create_step_net(self):
-        stepnet = core.Net.create()
-        x_fc_op = Operator("mul", X="x", Y="W", Out="Wx")
-        h_fc_op = Operator("mul", X="h@pre", Y="U", Out="Uh")
-        sum_op = Operator("sum", X=["Wx", "Uh"], Out="sum")
-        sig_op = Operator("sigmoid", X="sum", Y="h@mem")
-
-        for op in [x_fc_op, h_fc_op, sum_op, sig_op]:
-            stepnet.append_op(op)
-        stepnet.complete_add_op(True)
-        self.rnnop.set_stepnet(stepnet)
-
-    def test_forward(self):
+        x = data(
+            shape=[self.sent_len, self.batch_size, self.input_dim],
+            data_type='float32',
+            name='x',
+            append_batch_size=False,
+            **self.p_info)
+        h_boot = data(
+            shape=[self.input_dim],
+            data_type='float32',
+            name='h_boot',
+            **self.p_info)
+
+        rnn = StaticRNN(program=self.program)
+        with rnn.step():
+            h_pre = rnn.memory(init=h_boot)
+            x_t = rnn.step_input(x)
+
+            h = scale(
+                x=elementwise_add(
+                    x=h_pre, y=x_t, **self.p_info),
+                scale=self.py_rnn.scale,
+                **self.p_info)
+
+            rnn.update_memory(h_pre, h)
+            rnn.output(h)
+
+        return rnn()
+
+    def forward(self):
+        self.feed_map = {
+            x: create_tensor(getattr(self.py_rnn, x), self.place)
+            for x in self.data_field
+        }
+        exe = Executor(self.place)
+        out = exe.run(self.program,
+                      feed=self.feed_map,
+                      fetch_list=[self.output])
+
+        return np.array(out[0])
+
+    def backward(self):
+        self.feed_map = {
+            x: create_tensor(getattr(self.py_rnn, x), self.place)
+            for x in self.data_field
+        }
+        fetch_list = [
+            self.program.global_block().var(x + "@GRAD")
+            for x in self.data_field
+        ]
+
+        exe = Executor(self.place)
+        return exe.run(self.program, feed=self.feed_map, fetch_list=fetch_list)
+
+    def test_backward(self):
+        self.check_forward()
+
+        append_backward_ops(self.output)
+
+        ana_grad = [np.array(x) for x in self.backward()]
+
+        num_grad = self.get_numerical_gradient()
+        for idx, name in enumerate(self.data_field):
+            self.assertEqual(num_grad[idx].shape, ana_grad[idx].shape)
+            self.assertTrue(
+                np.isclose(
+                    num_grad[idx], ana_grad[idx], rtol=0.1).all())
+
+    def check_forward(self):
        print 'test recurrent op forward'
        pd_output = self.forward()
        py_output = self.py_rnn.forward()
@@ -164,44 +196,190 @@ class RecurrentOpTest(unittest.TestCase):
        self.assertEqual(pd_output.shape, py_output.shape)
        self.assertTrue(np.isclose(pd_output, py_output, rtol=0.1).all())

+    def get_numerical_gradient(self, delta=0.005):
+        dloss_dout = 1.0
+        feed_list = [getattr(self.py_rnn, x) for x in self.data_field]
+        grad_list = [np.zeros_like(x) for x in feed_list]
+        for feed, grad in zip(feed_list, grad_list):
+            for f, g in np.nditer([feed, grad], op_flags=['readwrite']):
+                o = float(f)
+                f[...] = o + delta
+                y_pos = self.forward()

-class RecurrentGradientOpTest(unittest.TestCase):
-    def create_forward_op(self):
-        self.forward_op = RecurrentOp(
-            # inputs
-            inputs=["x"],
-            initial_states=["h_boot"],
-            step_net="stepnet",
-            # outputs
-            outputs=["h"],
-            step_scopes="step_scopes",
-            # attributes
-            ex_states=["h@pre"],
-            states=["h@alias"])
-
-        # create a stepnet for RNN
-        stepnet = core.Net.create()
-        x_fc_op = Operator("mul", X="x@alias", Y="W", Out="Wx")
-        h_fc_op = Operator("mul", X="h@pre", Y="U", Out="Uh")
-        sum_op = Operator("sum", X=["Wx", "Uh"], Out="sum")
-        sig_op = Operator("sigmoid", X="sum", Y="h@alias")
-
-        for op in [x_fc_op, h_fc_op, sum_op, sig_op]:
-            stepnet.append_op(op)
-        stepnet.complete_add_op(True)
-        self.forward_op.set_stepnet(stepnet)
-
-    def create_gradient_op(self):
-        a = set()
-        backward_op = core.RecurrentOp.backward(self.forward_op, a)
-
-    def test_grad(self):
-        self.create_forward_op()
-        self.create_gradient_op()
+                f[...] = o - delta
+                y_neg = self.forward()
+
+                f[...] = o
+                dout_dfeed = (y_pos - y_neg) / (delta * 2)
+                g[...] = dout_dfeed[0]
+
+        return grad_list
+
+
+class RecurrentOpTest2(RecurrentOpTest1):
+    '''
+    Test RNNOp
+    equation:
+        h_t = \sigma (W x_t + U h_{t-1})
+    weights:
+        - W
+        - U
+    vars:
+        - x
+    memories:
+        - h
+    outputs:
+       - h
+    '''
+
+    input_dim = 2
+    batch_size = 10
+    sent_len = 2
+
+    def setUp(self):
+        self.init_program()
+
+        self.data_field = {"x", "h_boot", "W", "U"}
+
+        self.input_shape = (self.sent_len, self.batch_size, self.input_dim)
+        self.output_shape = (self.sent_len, self.batch_size, self.input_dim)
+        self.py_rnn = PySimpleRNN2(self.input_shape, self.output_shape)
+
+        self.output = mean(x=self.create_rnn_op(), **self.p_info)
+
+    def create_rnn_op(self):
+        x = data(
+            shape=[self.sent_len, self.batch_size, self.input_dim],
+            data_type='float32',
+            name='x',
+            append_batch_size=False,
+            **self.p_info)
+        h_boot = data(
+            shape=[self.input_dim],
+            data_type='float32',
+            name='h_boot',
+            **self.p_info)
+
+        rnn = StaticRNN(program=self.program)
+        with rnn.step():
+            h_pre = rnn.memory(init=h_boot)
+            x_t = rnn.step_input(x)
+
+            temp_l = fc(input=x_t,
+                        size=self.input_dim,
+                        param_attr={'name': 'W'},
+                        bias_attr=False,
+                        **self.p_info)
+            temp_r = fc(input=h_pre,
+                        size=self.input_dim,
+                        param_attr={'name': 'U'},
+                        bias_attr=False,
+                        **self.p_info)
+
+            h = sigmoid(
+                x=elementwise_add(
+                    x=temp_l, y=temp_r, **self.p_info),
+                **self.p_info)
+
+            rnn.update_memory(h_pre, h)
+            rnn.output(h)
+
+        return rnn()
+
+
+class RecurrentOpTest3(RecurrentOpTest1):
+    '''
+    Test RNNOp with two memories
+    equation:
+        h_1 = h_pre_1
+        h_2 = h_pre_2
+        y = h_1 + h_2
+    vars:
+        - x
+    memories:
+        - h_1, h_2
+    outputs:
+       - y
+    '''
+
+    class PySimpleRNN3(PyRNNBase):
+        def __init__(self, input_shape, output_shape):
+            super(RecurrentOpTest3.PySimpleRNN3, self).__init__(input_shape,
+                                                                output_shape)
+
+            seq_len, batch_size, input_dim = input_shape
+            self.h_boot1 = np.random.normal(size=(batch_size,
+                                                  input_dim)).astype("float32")
+            self.h_boot2 = np.random.normal(size=(batch_size,
+                                                  input_dim)).astype("float32")
+
+            men_dim = (seq_len, batch_size, input_dim)
+            self.mems1 = np.zeros(shape=men_dim).astype("float32")
+            self.mems2 = np.zeros(shape=men_dim).astype("float32")
+
+        def step(self, step_id, x):
+            if step_id == 0:
+                pre_mem1 = self.h_boot1
+                pre_mem2 = self.h_boot2
+            else:
+                pre_mem1 = self.mems1[step_id - 1]
+                pre_mem2 = self.mems2[step_id - 1]
+            self.mems1[step_id] = pre_mem1
+            self.mems2[step_id] = pre_mem2
+            self.y[step_id] = self.mems1[step_id] + self.mems2[step_id] + x
+
+    input_dim = 1
+    batch_size = 1
+    sent_len = 2
+
+    def setUp(self):
+        self.init_program()
+
+        self.data_field = {"x", "h_boot1", "h_boot2"}
+
+        self.input_shape = (self.sent_len, self.batch_size, self.input_dim)
+        self.output_shape = (self.sent_len, self.batch_size, self.input_dim)
+        self.py_rnn = RecurrentOpTest3.PySimpleRNN3(self.input_shape,
+                                                    self.output_shape)
+
+        self.output = mean(x=self.create_rnn_op(), **self.p_info)
+
+    def create_rnn_op(self):
+        x = data(
+            shape=[self.sent_len, self.batch_size, self.input_dim],
+            data_type='float32',
+            name='x',
+            append_batch_size=False,
+            **self.p_info)
+        h_boot1 = data(
+            shape=[self.batch_size, self.input_dim],
+            data_type='float32',
+            name='h_boot1',
+            append_batch_size=False,
+            **self.p_info)
+        h_boot2 = data(
+            shape=[self.batch_size, self.input_dim],
+            data_type='float32',
+            name='h_boot2',
+            append_batch_size=False,
+            **self.p_info)
+
+        rnn = StaticRNN(program=self.program)
+        with rnn.step():
+            h_pre1 = rnn.memory(init=h_boot1)
+            h_pre2 = rnn.memory(init=h_boot2)
+            x_t = rnn.step_input(x)
+
+            mem1 = scale(x=h_pre1, scale=1.0, **self.p_info)
+            mem2 = scale(x=h_pre2, scale=1.0, **self.p_info)
+            out = sums(input=[mem1, x_t, mem2], **self.p_info)
+
+            rnn.update_memory(h_pre1, mem1)
+            rnn.update_memory(h_pre2, mem2)
+            rnn.output(out)
+
+        return rnn()


 if __name__ == '__main__':
-    exit(
-        0
-    )  # FIXME(qijun): https://github.com/PaddlePaddle/Paddle/issues/5101#issuecomment-339814957
    unittest.main()
--- a/python/paddle/v2/framework/tests/test_rnn_helpers.py
+++ b/python/paddle/v2/framework/tests/test_rnn_helpers.py
-import unittest
-from paddle.v2.framework.layers import *
-from paddle.v2.framework.framework import g_program
-
-
-class TestRNN(unittest.TestCase):
-    def test_rnn(self):
-        img = data(
-            shape=[
-                80,  # sequence length
-                22,  # image height
-                22
-            ],  # image width
-            data_type='float32',
-            name='image')
-        hidden = fc(input=img, size=100, act='sigmoid', num_flatten_dims=2)
-        self.assertEqual((-1, 80, 100), hidden.shape)
-        hidden = fc(input=hidden, size=100, act='sigmoid', num_flatten_dims=2)
-        self.assertEqual((-1, 80, 100), hidden.shape)
-
-        rnn = StaticRNN()
-        with rnn.step():
-            hidden = rnn.step_input(hidden)
-            self.assertEqual((-1, 100), hidden.shape)
-            memory = rnn.memory(shape=(-1, 32), dtype='float32', init_value=0.0)
-
-            rnn_out = fc(input=[hidden, memory], size=32, act='sigmoid')
-            self.assertEqual((-1, 32), rnn_out.shape)
-            rnn.update_memory(memory, rnn_out)
-            rnn.output(rnn_out)
-
-        out = rnn()
-        self.assertEqual((-1, 80, 32), out.shape)
-        print g_program
-
-
-if __name__ == '__main__':
-    unittest.main()
--- a/python/paddle/v2/framework/tests/test_rnn_memory_helper_op.py
+++ b/python/paddle/v2/framework/tests/test_rnn_memory_helper_op.py
+import unittest
+
+from paddle.v2.framework.framework import Program
+from paddle.v2.framework.executor import Executor
+from paddle.v2.framework.backward import append_backward_ops
+import numpy as np
+import paddle.v2.framework.core as core
+
+
+def create_tensor(np_data, place):
+    tensor = core.LoDTensor()
+    tensor.set(np_data, place)
+    return tensor
+
+
+class RNNMemoryHelperOpTest(unittest.TestCase):
+    def setUp(self):
+        self.program = Program()
+        self.place = core.CPUPlace()
+
+        self.X = self.program.global_block().create_var(
+            name='X', shape=[2, 3], dtype='float32')
+        self.Out = self.program.global_block().create_var(
+            name='Out', shape=[2, 3], dtype='float32')
+        self.program.global_block().append_op(
+            type='rnn_memory_helper',
+            inputs={"X": self.X},
+            outputs={"Out": self.Out},
+            attrs={})
+
+    def test_forward(self):
+        x_np = np.random.normal(size=(2, 3)).astype("float32")
+        self.feed_map = {'X': create_tensor(x_np, self.place)}
+        self.fetch_list = [self.Out]
+        exe = Executor(self.place)
+        out = exe.run(self.program,
+                      feed=self.feed_map,
+                      fetch_list=self.fetch_list)
+        np.isclose(np.array(out[0]), x_np, rtol=1e-5)
+
+
+class RNNMemoryHelperGradOpTest(unittest.TestCase):
+    def setUp(self):
+        self.program = Program()
+        self.place = core.CPUPlace()
+
+        self.input_names = ['X', 'Out', 'Out@GRAD']
+        self.input_vars = {
+            name: self.program.global_block().create_var(
+                name=name, shape=[2, 3], dtype='float32')
+            for name in self.input_names
+        }
+
+        self.output_names = ['X@GRAD']
+        self.output_vars = {
+            name: self.program.global_block().create_var(
+                name=name, shape=[2, 3], dtype='float32')
+            for name in self.output_names
+        }
+
+        self.program.global_block().append_op(
+            type='rnn_memory_helper_grad',
+            inputs=self.input_vars,
+            outputs=self.output_vars,
+            attrs={})
+
+    def test_backward(self):
+        self.feed_map = {
+            name: create_tensor(
+                np.random.normal(size=(2, 3)).astype("float32"), self.place)
+            for name in self.input_names
+        }
+        self.fetch_list = [self.output_vars['X@GRAD']]
+
+        exe = Executor(self.place)
+        out = exe.run(self.program,
+                      feed=self.feed_map,
+                      fetch_list=self.fetch_list)
+        np.isclose(np.array(out[0]), self.feed_map['Out@GRAD'], rtol=1e-5)
+
+
+class RNNMemoryHelperGradOpWithoutInputTest(unittest.TestCase):
+    def setUp(self):
+        self.program = Program()
+        self.fake_program = Program()
+        self.place = core.CPUPlace()
+
+        self.input_names = ['X', 'Out']
+        self.input_vars = {
+            name: self.program.global_block().create_var(
+                name=name, shape=[2, 3], dtype='float32')
+            for name in self.input_names
+        }
+        self.input_vars["Out@GRAD"] = \
+            self.fake_program.global_block().create_var(
+                name="Out@GRAD", shape=[2, 3], dtype='float32')
+
+        self.output_names = ['X@GRAD']
+        self.output_vars = {
+            name: self.program.global_block().create_var(
+                name=name, shape=[2, 3], dtype='float32')
+            for name in self.output_names
+        }
+
+        self.program.global_block().append_op(
+            type='rnn_memory_helper_grad',
+            inputs=self.input_vars,
+            outputs=self.output_vars,
+            attrs={})
+
+    def test_backward(self):
+        self.feed_map = {
+            name: create_tensor(
+                np.random.normal(size=(2, 3)).astype("float32"), self.place)
+            for name in ['X', 'Out']
+        }
+        self.fetch_list = [self.output_vars['X@GRAD']]
+
+        exe = Executor(self.place)
+        out = exe.run(self.program,
+                      feed=self.feed_map,
+                      fetch_list=self.fetch_list)
+        np.isclose(
+            np.array(out[0]),
+            np.zeros(shape=(2, 3)).astype("float32"),
+            rtol=1e-5)
+
+
+if __name__ == '__main__':
+    unittest.main()
--- a/python/paddle/v2/framework/tests/test_seq_pool.py
+++ b/python/paddle/v2/framework/tests/test_seq_pool.py
@@ -3,15 +3,6 @@ import numpy as np
 from op_test import OpTest


-class SeqPoolType(OpTest):
-    AVERAGE = 0
-    SUM = 1
-    SQRT = 2
-    MAX = 3
-    LAST = 4
-    FIRST = 5
-
-
 class TestSeqAvgPool(OpTest):
    def set_data(self):
        self.op_type = 'sequence_pool'
@@ -25,7 +16,7 @@ class TestSeqAvgPool(OpTest):
        return x, lod, out

    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.AVERAGE}
+        self.attrs = {'pooltype': "AVERAGE"}
        for i in range(4):
            sub_x = x[lod[0][i]:lod[0][i + 1], :]
            out[i] = sub_x.mean(axis=0)
@@ -54,7 +45,7 @@ class TestSeqAvgPool2D(TestSeqAvgPool):
        return x, lod, out

    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.AVERAGE}
+        self.attrs = {'pooltype': "AVERAGE"}
        for i in range(4):
            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
            out[i] = np.reshape(sub_x.mean(axis=0), (3, 17))
@@ -62,7 +53,7 @@ class TestSeqAvgPool2D(TestSeqAvgPool):

 class TestSeqSumPool(TestSeqAvgPool):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.SUM}
+        self.attrs = {'pooltype': "SUM"}
        for i in range(4):
            sub_x = x[lod[0][i]:lod[0][i + 1], :]
            out[i] = sub_x.sum(axis=0)
@@ -70,7 +61,7 @@ class TestSeqSumPool(TestSeqAvgPool):

 class TestSeqSumPool2D(TestSeqAvgPool2D):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.SUM}
+        self.attrs = {'pooltype': "SUM"}
        for i in range(4):
            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
            out[i] = np.reshape(sub_x.sum(axis=0), (3, 17))
@@ -78,7 +69,7 @@ class TestSeqSumPool2D(TestSeqAvgPool2D):

 class TestSeqSqrtPool(TestSeqAvgPool):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.SQRT}
+        self.attrs = {'pooltype': "SQRT"}
        for i in range(4):
            sub_x = x[lod[0][i]:lod[0][i + 1], :]
            len = lod[0][i + 1] - lod[0][i]
@@ -87,7 +78,7 @@ class TestSeqSqrtPool(TestSeqAvgPool):

 class TestSeqSqrtPool2D(TestSeqAvgPool2D):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.SQRT}
+        self.attrs = {'pooltype': "SQRT"}
        for i in range(4):
            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
            len = lod[0][i + 1] - lod[0][i]
@@ -99,7 +90,7 @@ class TestSeqSqrtPool2D(TestSeqAvgPool2D):

 class TestSeqMaxPool(TestSeqAvgPool):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.MAX}
+        self.attrs = {'pooltype': "MAX"}
        for i in range(4):
            sub_x = x[lod[0][i]:lod[0][i + 1], :]
            out[i] = np.amax(sub_x, axis=0)
@@ -111,7 +102,7 @@ class TestSeqMaxPool(TestSeqAvgPool):

 class TestSeqMaxPool2D(TestSeqAvgPool2D):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.MAX}
+        self.attrs = {'pooltype': "MAX"}
        for i in range(4):
            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
            out[i] = np.reshape(np.amax(sub_x, axis=0), (3, 17))
@@ -123,7 +114,7 @@ class TestSeqMaxPool2D(TestSeqAvgPool2D):

 class TestSeqLastPool(TestSeqAvgPool):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.LAST}
+        self.attrs = {'pooltype': "LAST"}
        for i in range(4):
            sub_x = x[lod[0][i]:lod[0][i + 1], :]
            out[i] = sub_x[-1, :]
@@ -131,7 +122,7 @@ class TestSeqLastPool(TestSeqAvgPool):

 class TestSeqLastPool2D(TestSeqAvgPool2D):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.LAST}
+        self.attrs = {'pooltype': "LAST"}
        for i in range(4):
            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
            out[i] = np.reshape(sub_x[-1, :], (3, 17))
@@ -139,7 +130,7 @@ class TestSeqLastPool2D(TestSeqAvgPool2D):

 class TestSeqFirstPool(TestSeqAvgPool):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.FIRST}
+        self.attrs = {'pooltype': "FIRST"}
        for i in range(4):
            sub_x = x[lod[0][i]:lod[0][i + 1], :]
            out[i] = sub_x[0, :]
@@ -147,7 +138,7 @@ class TestSeqFirstPool(TestSeqAvgPool):

 class TestSeqFirstPool2D(TestSeqAvgPool2D):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.FIRST}
+        self.attrs = {'pooltype': "FIRST"}
        for i in range(4):
            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
            out[i] = np.reshape(sub_x[0, :], (3, 17))

--- a/python/paddle/v2/framework/tests/test_understand_sentiment_conv.py
+++ b/python/paddle/v2/framework/tests/test_understand_sentiment_conv.py
+import paddle.v2 as paddle
+import paddle.v2.framework.layers as layers
+import paddle.v2.framework.nets as nets
+import paddle.v2.framework.core as core
+import paddle.v2.framework.optimizer as optimizer
+
+from paddle.v2.framework.framework import Program, g_program, g_init_program
+from paddle.v2.framework.executor import Executor
+
+import numpy as np
+
+
+def convolution_net(input_dim, class_dim=2, emb_dim=32, hid_dim=32):
+    data = layers.data(name="words", shape=[1], data_type="int64")
+    label = layers.data(name="label", shape=[1], data_type="int64")
+
+    emb = layers.embedding(input=data, size=[input_dim, emb_dim])
+    conv_3 = nets.sequence_conv_pool(
+        input=emb,
+        num_filters=hid_dim,
+        filter_size=3,
+        act="tanh",
+        pool_type="sqrt")
+    conv_4 = nets.sequence_conv_pool(
+        input=emb,
+        num_filters=hid_dim,
+        filter_size=4,
+        act="tanh",
+        pool_type="sqrt")
+    prediction = layers.fc(input=[conv_3, conv_4],
+                           size=class_dim,
+                           act="softmax")
+    cost = layers.cross_entropy(input=prediction, label=label)
+    avg_cost = layers.mean(x=cost)
+    adam_optimizer = optimizer.AdamOptimizer(learning_rate=0.002)
+    opts = adam_optimizer.minimize(avg_cost)
+    acc = layers.accuracy(input=prediction, label=label)
+    return avg_cost, acc
+
+
+def to_lodtensor(data, place):
+    seq_lens = [len(seq) for seq in data]
+    cur_len = 0
+    lod = [cur_len]
+    for l in seq_lens:
+        cur_len += l
+        lod.append(cur_len)
+    flattened_data = np.concatenate(data, axis=0).astype("int64")
+    flattened_data = flattened_data.reshape([len(flattened_data), 1])
+    res = core.LoDTensor()
+    res.set(flattened_data, place)
+    res.set_lod([lod])
+    return res
+
+
+def main():
+    BATCH_SIZE = 100
+    PASS_NUM = 5
+
+    word_dict = paddle.dataset.imdb.word_dict()
+    dict_dim = len(word_dict)
+    class_dim = 2
+
+    cost, acc = convolution_net(input_dim=dict_dim, class_dim=class_dim)
+
+    train_data = paddle.batch(
+        paddle.reader.shuffle(
+            paddle.dataset.imdb.train(word_dict), buf_size=1000),
+        batch_size=BATCH_SIZE)
+    place = core.CPUPlace()
+    exe = Executor(place)
+
+    exe.run(g_init_program)
+
+    for pass_id in xrange(PASS_NUM):
+        for data in train_data():
+            tensor_words = to_lodtensor(map(lambda x: x[0], data), place)
+
+            label = np.array(map(lambda x: x[1], data)).astype("int64")
+            label = label.reshape([BATCH_SIZE, 1])
+
+            tensor_label = core.LoDTensor()
+            tensor_label.set(label, place)
+
+            outs = exe.run(g_program,
+                           feed={"words": tensor_words,
+                                 "label": tensor_label},
+                           fetch_list=[cost, acc])
+            cost_val = np.array(outs[0])
+            acc_val = np.array(outs[1])
+
+            print("cost=" + str(cost_val) + " acc=" + str(acc_val))
+            if cost_val < 1.0 and acc_val > 0.7:
+                exit(0)
+    exit(1)
+
+
+if __name__ == '__main__':
+    main()
--- a/python/paddle/v2/framework/tests/test_word2vec.py
+++ b/python/paddle/v2/framework/tests/test_word2vec.py
@@ -109,7 +109,7 @@ cost = layers.cross_entropy(
 avg_cost = layers.mean(x=cost, program=program, init_program=init_program)

 sgd_optimizer = optimizer.SGDOptimizer(learning_rate=0.001)
-opts = sgd_optimizer.minimize(avg_cost)
+opts = sgd_optimizer.minimize(avg_cost, init_program)

 train_reader = paddle.batch(
    paddle.dataset.imikolov.train(word_dict, N), batch_size)

--- a/python/paddle/v2/plot/plot.py
+++ b/python/paddle/v2/plot/plot.py
@@ -56,7 +56,7 @@ class Ploter(object):
        assert isinstance(data, PlotData)
        data.append(step, value)

-    def plot(self):
+    def plot(self, path=None):
        if self.__plot_is_disabled__():
            return

@@ -68,8 +68,11 @@ class Ploter(object):
                titles.append(title)
                self.plt.plot(data.step, data.value)
        self.plt.legend(titles, loc='upper left')
-        self.display.clear_output(wait=True)
-        self.display.display(self.plt.gcf())
+        if path is None:
+            self.display.clear_output(wait=True)
+            self.display.display(self.plt.gcf())
+        else:
+            self.plt.savefig(path)
        self.plt.gcf().clear()

    def reset(self):

--- a/python/requirements.txt
+++ b/python/requirements.txt
@@ -7,3 +7,4 @@ rarfile
 scipy>=0.19.0
 Pillow
 nltk>=3.2.2
+graphviz