diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 0d4bb973ae87bb45ef4386a63c26ed62602f2cee..a60453ff4e3bba6e6cb3b3de915dd69afd3a1ec3 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1 +1,157 @@
-./doc/howto/dev/contribute_to_paddle_en.md
+# Contribute Code
+
+We sincerely appreciate your contribution. This document explains our workflow and work style.
+
+## Workflow
+
+PaddlePaddle uses this [Git branching model](http://nvie.com/posts/a-successful-git-branching-model/). The following steps guide usual contributions.
+
+1. Fork
+
+   Our development community has been growing fast; it doesn't make sense for everyone to write into the official repo. So, please file Pull Requests from your fork. To make a fork, just head over to the GitHub page and click the ["Fork" button](https://help.github.com/articles/fork-a-repo/).
+
+1. Clone
+
+   To make a copy of your fork on your local computer, please run
+
+   ```bash
+   git clone https://github.com/your-github-account/paddle
+   cd paddle
+   ```
+
+1. Create the local feature branch
+
+   For daily work like adding a new feature or fixing a bug, please open your feature branch before coding:
+
+   ```bash
+   git checkout -b my-cool-stuff
+   ```
+
+1. Commit
+
+   Before issuing your first `git commit` command, please install [`pre-commit`](http://pre-commit.com/) by running the following commands:
+
+   ```bash
+   pip install pre-commit
+   pre-commit install
+   ```
+
+   Our pre-commit configuration requires clang-format 3.8 for auto-formatting C/C++ code and yapf for Python.
+
+   Once installed, `pre-commit` checks the style of code and documentation in every commit. You will see something like the following when you run `git commit`:
+
+   ```
+   ➜  git commit
+   CRLF end-lines remover...............................(no files to check)Skipped
+   yapf.................................................(no files to check)Skipped
+   Check for added large files..............................................Passed
+   Check for merge conflicts................................................Passed
+   Check for broken symlinks................................................Passed
+   Detect Private Key...................................(no files to check)Skipped
+   Fix End of Files.....................................(no files to check)Skipped
+   clang-formater.......................................(no files to check)Skipped
+   [my-cool-stuff c703c041] add test file
+    1 file changed, 0 insertions(+), 0 deletions(-)
+    create mode 100644 233
+   ```
+
+1. Build and test
+
+   Users can build PaddlePaddle natively on Linux and Mac OS X. But to unify the building environment and to make debugging easy, the recommended way is [using Docker](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/build_en.md).
+
+1. Keep pulling
+
+   An experienced Git user pulls from the official repo often -- daily or even hourly -- so they notice conflicts with others' work early, when smaller conflicts are easier to resolve.
+
+   ```bash
+   git remote add upstream https://github.com/PaddlePaddle/Paddle
+   git pull upstream develop
+   ```
+
+1. Push and file a pull request
+
+   You can "push" your local work into your forked repo:
+
+   ```bash
+   git push origin my-cool-stuff
+   ```
+
+   The push allows you to create a pull request, requesting owners of this [official repo](https://github.com/PaddlePaddle/Paddle) to pull your change into the official one.
+
+   To create a pull request, please follow [these steps](https://help.github.com/articles/creating-a-pull-request/).
+
+   If your change is for fixing an issue, please write ["Fixes #<issue-number>"](https://help.github.com/articles/closing-issues-using-keywords/) in the description section of your pull request. GitHub will then close the issue automatically when the owners merge your pull request.
+
+   Please remember to specify some reviewers for your pull request. If you don't know who the right ones are, please follow GitHub's recommendation.
+
+
+1. Delete local and remote branches
+
+   To keep your local workspace and your fork clean, you might want to remove merged branches:
+
+   ```bash
+   git push origin :my-cool-stuff
+   git checkout develop
+   git pull upstream develop
+   git branch -d my-cool-stuff
+   ```
+
+### Code Review
+
+- Please feel free to ping your reviewers by sending them the URL of your pull request via IM or email. Please do this after your pull request passes the CI.
+
+- Please answer every comment from your reviewers. If you are going to follow a comment, please write "Done"; otherwise, please give a reason.
+
+- If you don't want your reviewers to get overwhelmed by email notifications, you might reply to their comments [in a batch](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/).
+
+- Avoid unnecessary commits. Some developers commit very often. It is recommended to squash a sequence of small changes into one commit by running `git commit --amend` instead of `git commit`.
+
+
+## Coding Standard
+
+### Code Style
+
+Our C/C++ code follows the [Google style guide](http://google.github.io/styleguide/cppguide.html).
+
+Our Python code follows the [PEP8 style guide](https://www.python.org/dev/peps/pep-0008/).
+
+Our build process helps to check the code style. In [`build.sh`](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/paddle/scripts/docker/build.sh#L42), the entry point of our [builder Docker image](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/Dockerfile#L88), the CMake argument `WITH_STYLE_CHECK` is set to `ON` by default, so the style check runs as part of every build.
+
+Please install pre-commit, which automatically reformats the changes to C/C++ and Python code whenever we run `git commit`. To check the whole codebase, we can run the command `pre-commit run -a`, as in the [`check_style.sh` file](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/paddle/scripts/travis/check_style.sh#L30), which is invoked by [our Travis CI configuration](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/.travis.yml#L43).
+
+### Unit Tests
+
+Please remember to add related unit tests.
+
+- For C/C++ code, please follow the [`google-test` Primer](https://github.com/google/googletest/blob/master/googletest/docs/Primer.md).
+
+- For Python code, please use [Python's standard `unittest` package](http://pythontesting.net/framework/unittest/unittest-introduction/).
+
+
+### Writing Logs
+
+We use [glog](https://github.com/google/glog) for logging in our C/C++ code.
+
+For general information, please use `LOG`. For debug information, please use [`VLOG`](http://htmlpreview.github.io/?https://github.com/google/glog/blob/master/doc/glog.html#verbose). The reasoning is explained [here](https://groups.google.com/a/chromium.org/d/msg/chromium-dev/3NDNd1KzXeY/AZKMMx37fdQJ).
+
+`VLOG` requires a *verbose level* parameter. For example:
+
+```c++
+VLOG(3) << "Operator FC is taking " << num_inputs << " inputs.";
+```
+
+When we run a PaddlePaddle application or test, we can specify a verbose threshold. For example:
+
+```bash
+GLOG_vmodule=buddy_allocator=2 \
+GLOG_v=10 \
+python \
+../python/paddle/v2/framework/tests/test_recurrent_op.py
+```
+
+This will enable VLOG messages generated by `buddy_allocator.{h,cc}` up to verbose level 2, and VLOG messages generated everywhere else up to level 10; the example VLOG message above, which is at level 3, would therefore be displayed. Because messages at lower verbose levels are displayed with higher probability, please log the general, overall information at lower levels. When coding C++, please follow the verbose level convention as follows:
+
+- verbose level 1: [framework](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/framework)
+- verbose level 3: [operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators)
+- verbose level 5: [memory](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/memory), [platform](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/platform)
+- verbose level 7: [math](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/math)
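+
+As a quick sketch of how this convention plays out (the script name below is only an illustration, not a file in the repo), a global threshold of 3 surfaces framework- and operator-level messages while muting the chattier memory, platform, and math ones:
+
+```bash
+# Show VLOG messages at levels 1 (framework) and 3 (operators),
+# and hide those at levels 5 (memory, platform) and 7 (math).
+GLOG_v=3 python your_training_script.py
+```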
diff --git a/doc/howto/cross_compiling/cross_compiling_for_raspberry_cn.md b/doc/howto/cross_compiling/cross_compiling_for_raspberry_cn.md
index 085b5dda1615a9af918b59870db460fcc5acdcca..026c0c6f3b2a2ca322d063f38e1736a010e1197e 100644
--- a/doc/howto/cross_compiling/cross_compiling_for_raspberry_cn.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_raspberry_cn.md
@@ -1,39 +1,36 @@
 # 构建Raspberry Pi平台上的PaddlePaddle库
 
-对于Rasspberry Pi系统,用户可通过ssh等方式登录到Raspberry Pi系统上,按照[源码编译PaddlePaddle](http://www.paddlepaddle.org/doc_cn/getstarted/build_and_install/cmake/build_from_source_cn.html)相关文档所述,直接编译Raspberry Pi平台上适用的PaddlePaddle库。
+通常有两个方法来构建基于 Raspberry Pi 的版本:
 
-用户也可以在自己熟悉的开发平台上,通过交叉编译的方式来编译。这篇文档将以Linux x86-64平台为例,介绍交叉编译Raspberry Pi平台上适用的PaddlePaddle的方法和步骤。
+1. 通过ssh等方式登录到Raspberry Pi系统上来构建。所需的开发工具和第三方库可以参考 [`/Dockerfile`](https://github.com/PaddlePaddle/Paddle/blob/develop/Dockerfile)。
 
-## 准备交叉编译环境
+1. 另一个方法是交叉编译。这篇文档介绍在 Linux/x64 上交叉编译Raspberry Pi平台上适用的PaddlePaddle的方法和步骤。
 
-从源码交叉编译PaddlePaddle,用户需要提前准备好交叉编译环境。用户可自行前往[github](https://github.com/raspberrypi/tools)下载Raspberry Pi平台使用的C/C++交叉编译工具链,也可通过以下命令获取:
+## 安装交叉编译器
+
+克隆下面的 GitHub repo:
 
 ```bash
 git clone https://github.com/raspberrypi/tools.git
 ```
 
-该github仓库中包含若干个预编译好的、针对不同平台的编译工具。宿主机是Linux x86-64环境,则需选用`arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64`下的作为编译工具,所使用的编译器为arm-linux-gnueabihf-gcc 4.8.3。
-
-注意,该编译工具链需要系统glibc支持2.14以上。
+即可在 `./tools/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64` 目录里找到交叉编译器 arm-linux-gnueabihf-gcc 4.8.3。运行该编译工具链需要一台 Linux x64 机器以及 2.14 版本以上的 glibc。
 
 ## 配置交叉编译参数
 
-CMake系统对交叉编译提供了支持[cmake-toolchains](https://cmake.org/cmake/help/v3.0/manual/cmake-toolchains.7.html#cross-compiling)。为了简化cmake配置,PaddlePaddle为交叉编译提供了工具链配置文档[cmake/cross_compiling/raspberry_pi.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/raspberry_pi.cmake),以提供一些默认的编译器和编译参数相关配置。
+CMake[支持交叉编译](https://cmake.org/cmake/help/v3.0/manual/cmake-toolchains.7.html#cross-compiling)。PaddlePaddle for Raspberry Pi的配置信息在[cmake/cross_compiling/raspberry_pi.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/raspberry_pi.cmake)。
 
 交叉编译Raspberry Pi版本PaddlePaddle库时,有一些必须配置的参数:
 
-- `CMAKE_SYSTEM_NAME`,CMake编译的目标平台,必须配置为`RPi`。在设置`CMAKE_SYSTEM_NAME=RPi`后,PaddlePaddle的CMake系统才认为在是在交叉编译Raspberry Pi系统的版本,并自动编译宿主机版protoc可执行文件、目标机版protobuf库、以及目标机版OpenBLAS库。
-
-Raspberry Pi平台可选配置参数:
+- `CMAKE_SYSTEM_NAME`:CMake编译的目标平台,必须配置为`RPi`。在设置`CMAKE_SYSTEM_NAME=RPi`后,PaddlePaddle的CMake系统才认为是在交叉编译Raspberry Pi系统的版本,并自动编译宿主机版protoc可执行文件、目标机版protobuf库、以及目标机版OpenBLAS库。
 
-- `RPI_TOOLCHAIN`,编译工具链所在的绝对路径,或者相对于构建目录的相对路径。PaddlePaddle的CMake系统将根据该值自动设置需要使用的交叉编译器;否则,用户需要在cmake时手动设置这些值。无默认值。
-- `RPI_ARM_NEON`,是否使用NEON指令。目前必须设置成`ON`,默认值为`ON`。
+- `RPI_TOOLCHAIN`:编译工具链所在的绝对路径,或者相对于构建目录的相对路径。PaddlePaddle的CMake系统将根据该值自动设置需要使用的交叉编译器;否则,用户需要在cmake时手动设置这些值。无默认值。
 
-其他配置参数:
+- `RPI_ARM_NEON`:是否使用NEON指令。目前必须设置成`ON`,默认值为`ON`。
 
 - `HOST_C/CXX_COMPILER`,宿主机的C/C++编译器。在编译宿主机版protoc可执行文件和目标机版OpenBLAS库时需要用到。默认设置成环境变量`CC`的值;若环境变量`CC`没有设置,则设置成`cc`编译器。
 
-cmake参数如下;
+一个常用的CMake配置如下:
 
 ```
 cmake -DCMAKE_SYSTEM_NAME=RPi \
@@ -47,7 +44,9 @@ cmake -DCMAKE_SYSTEM_NAME=RPi \
       ..
 ```
 
-用户还可根据自己的需求设置其他编译参数。比如希望最小化生成的库的大小,可以设置`CMAKE_BUILD_TYPE`为`MinSizeRel`;若希望最快的执行速度,则可设置`CMAKE_BUILD_TYPE`为`Release`。亦可以通过手动设置`CMAKE_C/CXX_FLAGS_MINSIZEREL/RELEASE`来影响PaddlePaddle的编译过程。
+其中`WITH_C_API=ON`表示需要构建推理库。
+
+用户还可根据自己的需求设置其他编译参数。比如希望最小化生成的库的大小,可以设置`CMAKE_BUILD_TYPE`为`MinSizeRel`;若希望最快的执行速度,则可设置`CMAKE_BUILD_TYPE`为`Release`。
 
 ## 编译和安装
 
@@ -60,6 +59,4 @@ make install
 
 注意:如果你曾经在源码目录下编译过其他平台的PaddlePaddle库,请先使用`rm -rf`命令删除`third_party`目录和`build`目录,以确保所有的第三方依赖库和PaddlePaddle代码都是针对新的CMake配置重新编译的。
 
-执行完安装命令后,由于上一步cmake配置中`WITH_C_API`设置为`ON`,`your/path/to/install`目录中会包含`include`和`lib`目录,其中`include`中包含C-API的头文件,`lib`中包含一个Raspberry Pi版本的库。
-
-更多的编译配置见[源码编译PaddlePaddle](http://www.paddlepaddle.org/doc_cn/getstarted/build_and_install/cmake/build_from_source_cn.html)相关文档。
+执行完安装命令后,`your/path/to/install`目录中会包含`include`和`lib`目录,其中`include`中包含C-API的头文件,`lib`中包含一个Raspberry Pi版本的库。
diff --git a/doc/howto/cross_compiling/cross_compiling_for_raspberry_en.md b/doc/howto/cross_compiling/cross_compiling_for_raspberry_en.md
new file mode 100644
index 0000000000000000000000000000000000000000..09ac4733ec98c598dfd62f22beaf838320dc7531
--- /dev/null
+++ b/doc/howto/cross_compiling/cross_compiling_for_raspberry_en.md
@@ -0,0 +1,62 @@
+# Build PaddlePaddle for Raspberry Pi
+
+You may use either of the following two approaches to build the inference library of PaddlePaddle for Raspberry Pi:
+
+1. Build using SSH: Log in to a Raspberry Pi using SSH and build the library. The required development tools and third-party dependencies are listed in [`/Dockerfile`](https://github.com/PaddlePaddle/Paddle/blob/develop/Dockerfile).
+
+1. Cross-compile: This article explains in more detail how to cross-compile PaddlePaddle for Raspberry Pi on a Linux/x64 machine.
+
+## The Cross-Compiling Toolchain
+
+Step 1. Clone the GitHub repo by running the following command:
+
+```bash
+git clone https://github.com/raspberrypi/tools.git
+```
+
+Step 2. Use the pre-built cross-compiler found in `./tools/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64`. To run it on a Linux computer, glibc version >= 2.14 is needed.
+
+## CMake Arguments
+
+CMake supports [cross-compiling](https://cmake.org/cmake/help/v3.0/manual/cmake-toolchains.7.html#cross-compiling). All CMake configuration arguments required for the cross-compilation for Raspberry Pi can be found in [`cmake/cross_compiling/raspberry_pi.cmake`](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/raspberry_pi.cmake).
+
+Some important arguments that need to be set:
+
+- `CMAKE_SYSTEM_NAME`: The target platform. Must be `RPi`.
+
+- `RPI_TOOLCHAIN`: The absolute path of the cross-compiling toolchain.
+
+- `RPI_ARM_NEON`: Use ARM NEON Intrinsics. This is a required argument and is set to `ON` by default.
+
+- `HOST_C/CXX_COMPILER`: The C/C++ compiler for the host. It is used to build tools that run on the host, for example, protoc.
+
+A commonly-used CMake configuration is as follows:
+
+```
+cmake -DCMAKE_SYSTEM_NAME=RPi \
+      -DRPI_TOOLCHAIN=your/path/to/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64 \
+      -DRPI_ARM_NEON=ON \
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
+      -DWITH_GPU=OFF \
+      -DWITH_C_API=ON \
+      -DWITH_PYTHON=OFF \
+      -DWITH_SWIG_PY=OFF \
+      ..
+```
+
+To build the inference library, please make sure `WITH_C_API=ON` is set, as in the configuration above.
+
+You can add more arguments. For example, to minimize the size of the generated inference library, you may use `CMAKE_BUILD_TYPE=MinSizeRel`. For the fastest execution speed, you may use `CMAKE_BUILD_TYPE=Release`.
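+
+The configuration above is meant to be run from a separate build directory -- the trailing `..` points back at the source tree. A minimal sketch of the whole flow, assuming you start from the repository root:
+
+```bash
+# Keep generated files in ./build (out-of-source build).
+mkdir -p build
+cd build
+# ... run the cmake command shown above, then build and install:
+make
+make install
+```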
+
+## Build and Install
+
+The following commands build the inference library of PaddlePaddle for Raspberry Pi and its third-party dependencies:
+
+```bash
+make
+make install
+```
+
+The intermediate files will be stored in `build`. Third-party libraries will be located in `build/third_party`. If you have already built PaddlePaddle for other platforms, like Android or iOS, you may want to clear these directories by running the command `rm -rf build`.
+
+The inference library will be in `your/path/to/install/lib`, with related header files in `your/path/to/install/include`.
diff --git a/doc/howto/dev/contribute_to_paddle_en.md b/doc/howto/dev/contribute_to_paddle_en.md
deleted file mode 100644
index 40d1eb62d722244139cc84eb170c190d988f5626..0000000000000000000000000000000000000000
--- a/doc/howto/dev/contribute_to_paddle_en.md
+++ /dev/null
@@ -1,219 +0,0 @@
-# Contribute Code
-
-We sincerely appreciate your contributions. You can use fork and pull request
-workflow to merge your code.
-
-## Code Requirements
-- Your code comments must be fully documented by
-  [Doxygen](http://www.stack.nl/~dimitri/doxygen/) style.
-- Make sure the compiler option `WITH_STYLE_CHECK` is on and the compiler
-  passes the code style check.
-- All code must have unit test.
-- Pass all unit tests.
-
-The following tutorial guides you into submitting your contibution.
-
-## [Creating a Fork](https://help.github.com/articles/fork-a-repo/)
-
-Just head over to the GitHub page and click the "Fork" button.
-It's just that simple.
-
-## Clone
-
-Clone remote repository.
-
-```bash
-➜  git clone https://github.com/USERNAME/Paddle
-➜  cd Paddle
-```
-
-## Create a local branch
-
-Paddle is currently using [Git-flow branching model](http://nvie.com/posts/a-successful-git-branching-model/).
-
-All feature and bug fix development work should be done on a new branch, generally create new branch from `develop` branch .
-
-```bash
-➜  git checkout -b my-cool-stuff
-```
-
-Before the checkout, you need to keep the current branch directory clean, otherwise the untracked file will be brought to the new branch, which can be inspected by `git status`.
-
-## Using `pre-commit` hook
-
-Paddle developers use [pre-commit](http://pre-commit.com/) tool to manage git
-pre-commit hooks. It can help us format source codes (cpp, python), check some
-basic thing before commit (only one EOL for each file, do not add a huge file
-in git). `pre-commit` tests is a part of unit tests in Travis-CI now, every
-PR doesn't fit hook can not be merged into Paddle.
-
-To use [pre-commit](http://pre-commit.com/), you should install it by
-`pip install pre-commit`, and currently, Paddle uses `clang-format` to format
-c/cpp sources. Please make sure clang-format 3.8+ installed.
-
-Install and run it as follow:
-
-```bash
-➜  pip install pre-commit
-➜  pre-commit install
-```
-
-When you commit your code, the pre-commit hook will check the local code if there is
-anything not suitable to commit, and so on.
-
-## Start to develop
-
-In this tutorial, I delete a line in README.md and created a new file.
-
-We can use `git status` to inspect the changes of current directory, `git diff` to see difference.
-
-```bash
-➜  git status
-On branch test
-Changes not staged for commit:
-  (use "git add ..." to update what will be committed)
-  (use "git checkout -- ..." to discard changes in working directory)
-
-	modified:   README.md
-
-Untracked files:
-  (use "git add ..."
to include in what will be committed) - - test - -no changes added to commit (use "git add" and/or "git commit -a") -``` -## Build and Test - -We package PaddlePaddle's compile environment into a Docker image, called the develop image named `paddle:dev`, it contains all compiling tools that PaddlePaddle needs. - -If you want to build the develop image, just run: - -```bash -➜ docker build -t paddle:dev . -``` - -Then we can use the develop image to build PaddlePaddle source. For example: - -```bash -➜ docker run -v $(pwd):/paddle -e "WITH_GPU=OFF" -e "WITH_AVX=ON" -e "WITH_TEST=ON" paddle:dev -``` - -The above command will compile PaddlePaddle and create a Dockerfile for building production image. All the generated files are in the build directory. "WITH_GPU" controls if the generated production image supports GPU. "WITH_AVX" controls if the generated production image supports AVX. "WITH_TEST" controls if the unit test will be generated. - -Then we can generate the production image by copying the compiled PaddlePaddle program into the image by - -```bash -➜ docker build -t paddle:prod -f build/Dockerfile . -``` - -Run unit test finally: - -```bash -➜ docker run -it -v $(pwd):/paddle paddle:dev bash -c "cd /paddle/build && ctest" -``` - -For more details, you can read [this doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_en.rst). - -## Commit - -Next we cancel the changes to the README.md file and then commit our changes by following command lines: - -```bash -➜ git checkout -- README.md -➜ git status -On branch test -Untracked files: - (use "git add ..." to include in what will be committed) - - test - -nothing added to commit but untracked files present (use "git add" to track) -➜ git add test -``` - -We should write a description of each commit by `git commit` to allow others to know -the changes in these files. - -```bash -➜ git commit -CRLF end-lines remover...............................(no files to check)Skipped -yapf.................................................(no files to check)Skipped -Check for added large files..............................................Passed -Check for merge conflicts................................................Passed -Check for broken symlinks................................................Passed -Detect Private Key...................................(no files to check)Skipped -Fix End of Files.....................................(no files to check)Skipped -clang-formater.......................................(no files to check)Skipped -[my-cool-stuff c703c041] add test file - 1 file changed, 0 insertions(+), 0 deletions(-) - create mode 100644 233 -``` - -## Keeping Fork Up to Date - -Before pull your request, you should sync your code from the latest PaddlePaddle. -To do this, you'll need to add a remote at first: - -```bash -➜ git remote add upstream https://github.com/PaddlePaddle/Paddle -➜ git remote -origin -upstream -``` - -Update your fork with the latest upstream changes: - -```bash -➜ git fetch upstream -➜ git pull upstream develop -``` - -Now, your local master branch is up-to-date with everything modified upstream. - -## Push to GitHub - -```bash -# push to your repository in Github -➜ git push origin my-cool-stuff -``` - -## Create an issue and a Pull Request - -Create an Issue to describe the problem and record its number. - -Go to the page for your fork on GitHub, select your development branch, -and click the `New pull request`. 
- -screen shot 2017-04-26 at 9 09 28 pm - -Then select the target branch: - -screen shot 2017-04-26 at 9 11 52 pm - -We can add `resolve #Issue number` in PR description to close the issue automatically after the PR is merge. More details in . - -Then wait for review, if there need to modify, refer to the above steps to update the corresponding origin branch. - -## Delete origin branch - -After the PR is merge into the main repository, we can delete the remote branch on the PR page. - -screen shot 2017-04-26 at 9 18 24 pm - -Or just run: - -```bash -➜ git push origin :my-cool-stuff -``` - -## Delete local branch - -Finally, we delete local branch: - -```bash -➜ git checkout develop - -# delete my-cool-stuff branch -➜ git branch -D my-cool-stuff -``` diff --git a/doc/howto/dev/contribute_to_paddle_en.md b/doc/howto/dev/contribute_to_paddle_en.md new file mode 120000 index 0000000000000000000000000000000000000000..c97564d93a7f0a753a23cd97d2467d595bd154ff --- /dev/null +++ b/doc/howto/dev/contribute_to_paddle_en.md @@ -0,0 +1 @@ +../../../CONTRIBUTING.md \ No newline at end of file diff --git a/doc/howto/index_cn.rst b/doc/howto/index_cn.rst index 0608aa30968b0e8474eef330e4d2cc63c9def97d..76d3e0a0092f89005605a23e14e712530112a5ac 100644 --- a/doc/howto/index_cn.rst +++ b/doc/howto/index_cn.rst @@ -21,7 +21,6 @@ dev/build_cn.rst dev/write_docs_cn.rst - dev/contribute_to_paddle_cn.md 模型配置 -------- diff --git a/doc/howto/usage/cluster/cluster_train_cn.md b/doc/howto/usage/cluster/cluster_train_cn.md index 93c5544bcfa911f8bdcdaea39a75b3ab7ef218f8..2e98b3de3fe2284375f87e883ff4bac19255dbeb 100644 --- a/doc/howto/usage/cluster/cluster_train_cn.md +++ b/doc/howto/usage/cluster/cluster_train_cn.md @@ -19,7 +19,7 @@ * [启动集群作业](#启动集群作业-1) * [在Kubernetes集群中提交训练作业](#在kubernetes集群中提交训练作业) -# 概述 +## 概述 本文将介绍如何使用PaddlePaddle在不同的集群框架下完成分布式训练。分布式训练架构如下图所示: @@ -32,7 +32,7 @@ 在使用同步SGD训练神经网络时,PaddlePaddle使用同步屏障(barrier),使梯度的提交和参数的更新按照顺序方式执行。在异步SGD中,则并不会等待所有trainer提交梯度才更新参数,这样极大地提高了计算的并行性:参数服务器之间不相互依赖,并行地接收梯度和更新参数,参数服务器也不会等待计算节点全部都提交梯度之后才开始下一步,计算节点之间也不会相互依赖,并行地执行模型的训练。可以看出,虽然异步SGD方式会提高参数更新并行度, 但是并不能保证参数同步更新,在任意时间某一台参数服务器上保存的参数可能比另一台要更新,与同步SGD相比,梯度会有噪声。 -# 环境准备 +## 环境准备 1. 准备您的计算集群。计算集群通常由一组(几台到几千台规模)的Linux服务器组成。服务器之间可以通过局域网(LAN)联通,每台服务器具有集群中唯一的IP地址(或者可被DNS解析的主机名)。集群中的每台计算机通常被成为一个“节点”。 1. 
我们需要在集群的所有节点上安装 PaddlePaddle。 如果要启用GPU,还需要在节点上安装对应的GPU驱动以及CUDA。PaddlePaddle的安装可以参考[build_and_install](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/getstarted/build_and_install)的多种安装方式。我们推荐使用[Docker](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_cn.rst)安装方式来快速安装PaddlePaddle。 @@ -51,8 +51,8 @@ PaddlePaddle 0.10.0, compiled with 下面以`doc/howto/usage/cluster/src/word2vec`中的代码作为实例,介绍使用PaddlePaddle v2 API完成分布式训练。 -# 启动参数说明 -## 启动参数服务器 +## 启动参数说明 +### 启动参数服务器 执行以下的命令启动一个参数服务器并等待和计算节点的数据交互 ```bash $ paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 @@ -70,7 +70,7 @@ $ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num | ports_num_for_sparse | 必选 | 1 | 用于稀疏类型参数通信的端口个数 | | num_gradient_servers | 必选 | 1 | 当前训练任务pserver总数 | -## 启动计算节点 +### 启动计算节点 执行以下命令启动使用python编写的trainer程序(文件名为任意文件名,如train.py) ```bash $ python train.py @@ -117,7 +117,7 @@ paddle.init( | pservers | 必选 | 127.0.0.1 | 当前训练任务启动的pserver的IP列表,多个IP使用“,”隔开 | -## 准备数据集 +### 准备数据集 参考样例数据准备脚本[prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py),准备训练数据和验证数据集,我们使用paddle.dataset.imikolov数据集,并根据分布式训练并发数(trainer节点个数),在`prepare.py`开头部分指定`SPLIT_COUNT`将数据切分成多份。 @@ -149,7 +149,7 @@ test.txt-00002 对于不同的训练任务,训练数据格式和训练程序的`reader()`会大不相同,所以开发者需要根据自己训练任务的实际场景完成训练数据的分割和`reader()`的编写。 -## 准备训练程序 +### 准备训练程序 我们会对每个训练任务都会在每个节点上创建一个工作空间(workspace),其中包含了用户的训练程序、程序依赖、挂载或下载的训练数据分片。 @@ -184,7 +184,7 @@ test.txt-00002 - `train_data_dir`:包含训练数据的目录,可以是从分布式存储挂载过来的,也可以是在任务启动前下载到本地的。 - `test_data_dir`:包含测试数据集的目录。 -# 使用分布式计算平台或工具 +## 使用分布式计算平台或工具 PaddlePaddle可以使用多种分布式计算平台构建分布式计算任务,包括: - [Kubernetes](http://kubernetes.io) Google开源的容器集群的调度框架,支持大规模集群生产环境的完整集群方案。 @@ -195,12 +195,12 @@ PaddlePaddle可以使用多种分布式计算平台构建分布式计算任务 在使用分布式计算平台进行训练时,任务被调度在集群中时,分布式计算平台通常会通过API或者环境变量提供任务运行需要的参数,比如节点的ID、IP和任务节点个数等。 -## 使用Fabric启动集群作业 +### 使用Fabric启动集群作业 -### 准备一个Linux集群 +#### 准备一个Linux集群 可以在`paddle/scripts/cluster_train_v2/fabric/docker_cluster`目录下,执行`kubectl -f ssh_servers.yaml`启动一个测试集群,并使用`kubectl get po -o wide`获得这些节点的IP地址。 -### 启动集群作业 +#### 启动集群作业 `paddle.py` 提供了自动化脚本来启动不同节点中的所有 PaddlePaddle 集群进程。默认情况下,所有命令行选项可以设置为 `paddle.py` 命令选项并且 `paddle.py` 将透明、自动地将这些选项应用到 PaddlePaddle 底层进程。 @@ -216,10 +216,10 @@ sh run.sh 集群作业将会在几秒后启动。 -### 终止集群作业 +#### 终止集群作业 `paddle.py`能获取`Ctrl + C` SIGINT 信号来自动终止它启动的所有进程。只需中断 `paddle.py` 任务来终止集群作业。如果程序崩溃你也可以手动终止。 -### 检查集群训练结果 +#### 检查集群训练结果 详细信息请检查 $workspace/log 里的日志,每一个节点都有相同的日志结构。 `paddle_trainer.INFO` @@ -234,13 +234,13 @@ sh run.sh `train.log` 提供训练过程的 stderr 和 stdout。训练失败时可以检查错误日志。 -### 检查模型输出 +#### 检查模型输出 运行完成后,模型文件将被写入节点 0 的 `output` 目录中。 工作空间中的 `nodefile` 表示当前集群作业的节点 ID。 -## 在OpenMPI集群中提交训练作业 +### 在OpenMPI集群中提交训练作业 -### 准备OpenMPI集群 +#### 准备OpenMPI集群 执行下面的命令以启动3个节点的OpenMPI集群和一个"head"节点: @@ -252,7 +252,7 @@ kubectl create -f mpi-nodes.yaml 然后可以从head节点ssh无密码登录到OpenMPI的每个节点上。 -### 启动集群作业 +#### 启动集群作业 您可以按照下面的步骤在OpenMPI集群中提交paddle训练任务: @@ -280,6 +280,6 @@ scp train.txt-00002 test.txt-00002 [node3IP]:/home/tutorial mpirun -hostfile machines -n 3 /home/tutorial/start_mpi_train.sh ``` -## 在Kubernetes集群中提交训练作业 +### 在Kubernetes集群中提交训练作业 此部分的使用方法可以参考[here](../k8s/k8s_distributed_cn.md)。 diff --git a/doc/howto/usage/cluster/cluster_train_en.md b/doc/howto/usage/cluster/cluster_train_en.md index 1e8b4d54b9ffa99b3beef35ecaf95bbd0866535f..baa97c0c02ae490fff8587071bd2d4adfb5325e3 100644 --- a/doc/howto/usage/cluster/cluster_train_en.md +++ 
b/doc/howto/usage/cluster/cluster_train_en.md @@ -19,7 +19,7 @@ * [Launching Cluster Job](#launching-cluster-job-1) * [Cluster Training Using Kubernetes](#cluster-training-using-kubernetes) -# Introduction +## Introduction In this article, we'll explain how to run distributed training jobs with PaddlePaddle on different types of clusters. The diagram below shows the main architecture of a distributed trainning job: @@ -33,7 +33,7 @@ PaddlePaddle can support both synchronize stochastic gradient descent (SGD) and When training with synchronize SGD, PaddlePaddle uses an internal "synchronize barrier" which makes gradients update and parameter download in strict order. On the other hand, asynchronous SGD won't wait for all trainers to finish upload at a single step, this will increase the parallelism of distributed training: parameter servers do not depend on each other, they'll do parameter optimization concurrently. Parameter servers will not wait for trainers, so trainers will also do their work concurrently. But asynchronous SGD will introduce more randomness and noises in the gradient. -# Preparations +## Preparations 1. Prepare your computer cluster. It's normally a bunch of Linux servers connected by LAN. Each server will be assigned a unique IP address. The computers in the cluster can be called "nodes". 2. Install PaddlePaddle on every node. If you are going to take advantage of GPU cards, you'll also need to install proper driver and CUDA libraries. To install PaddlePaddle please read [this build and install](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/getstarted/build_and_install) document. We strongly recommend using [Docker installation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_en.rst). @@ -52,9 +52,9 @@ PaddlePaddle 0.10.0rc, compiled with We'll take `doc/howto/usage/cluster/src/word2vec` as an example to introduce distributed training using PaddlePaddle v2 API. -# Command-line arguments +## Command-line arguments -## Starting parameter server +### Starting parameter server Type the below command to start a parameter server which will wait for trainers to connect: @@ -74,7 +74,7 @@ $ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num | ports_num_for_sparse | required | 1 | number of ports which serves sparse parameter update | | num_gradient_servers | required | 1 | total number of gradient servers | -## Starting trainer +### Starting trainer Type the command below to start the trainer(name the file whatever you want, like "train.py") ```bash @@ -122,7 +122,7 @@ paddle.init( | trainer_id | required | 0 | ID for every trainer, start from 0 | | pservers | required | 127.0.0.1 | list of IPs of parameter servers, separated by "," | -## Prepare Training Dataset +### Prepare Training Dataset Here's some example code [prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py), it will download public `imikolov` dataset and split it into multiple files according to job parallelism(trainers count). Modify `SPLIT_COUNT` at the begining of `prepare.py` to change the count of output files. @@ -155,7 +155,7 @@ When job started, every trainer needs to get it's own part of data. In some dist Different training jobs may have different data format and `reader()` function, developers may need to write different data prepare scripts and `reader()` functions for their job. 
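A minimal sketch of the shard-picking idea described above (the environment variable is illustrative and platform-dependent; the file pattern matches the split files shown earlier):

```bash
# Each trainer reads only its own split of the training data,
# indexed by the trainer ID assigned by the cluster platform.
TRAINER_ID=${TRAINER_ID:-0}
SHARD=$(printf "train.txt-%05d" "${TRAINER_ID}")
echo "trainer ${TRAINER_ID} trains on ${SHARD}"
```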
-## Prepare Training program +### Prepare Training program We'll create a *workspace* directory on each node, storing your training program, dependencies, mounted or downloaded dataset directory. @@ -191,7 +191,7 @@ Your workspace may looks like: - `train_data_dir`: containing training data. Mount from storage service or copy trainning data to here. - `test_data_dir`: containing testing data. -# Use cluster platforms or cluster management tools +## Use cluster platforms or cluster management tools PaddlePaddle supports running jobs on several platforms including: - [Kubernetes](http://kubernetes.io) open-source system for automating deployment, scaling, and management of containerized applications from Google. @@ -202,13 +202,13 @@ We'll introduce cluster job management on these platforms. The examples can be f These cluster platforms provide API or environment variables for training processes, when the job is dispatched to different nodes. Like node ID, IP or total number of nodes etc. -## Cluster Training Using Fabric +### Cluster Training Using Fabric -### Prepare a Linux cluster +#### Prepare a Linux cluster Run `kubectl -f ssh_servers.yaml` under the directory: `paddle/scripts/cluster_train_v2/fabric/docker_cluster` will launch a demo cluster. Run `kubectl get po -o wide` to get IP addresses of these nodes. -### Launching Cluster Job +#### Launching Cluster Job `paddle.py` provides automatical scripts to start all PaddlePaddle cluster processes in different nodes. By default, all command line options can be set as `paddle.py` command options and `paddle.py` will transparently and automatically set these options to PaddlePaddle lower level processes. `paddle.py`provides two distinguished command option for easy job launching. @@ -224,10 +224,10 @@ sh run.sh The cluster Job will start in several seconds. -### Kill Cluster Job +#### Kill Cluster Job `paddle.py` can capture `Ctrl + C` SIGINT signal to automatically kill all processes launched by it. So just stop `paddle.py` to kill cluster job. You should manually kill the job if the program crashed. -### Check Cluster Training Result +#### Check Cluster Training Result Check log in $workspace/log for details, each node owns same log structure. `paddle_trainer.INFO` @@ -242,13 +242,13 @@ It provides stderr and stdout of parameter server process. Check error log if tr `train.log` It provides stderr and stdout of trainer process. Check error log if training crashes. -### Check Model Output +#### Check Model Output After one pass finished, model files will be written in `output` directory in node 0. `nodefile` in workspace indicates the node id of current cluster job. -## Cluster Training Using OpenMPI +### Cluster Training Using OpenMPI -### Prepare an OpenMPI cluster +#### Prepare an OpenMPI cluster Run the following command to start a 3-node MPI cluster and one "head" node. @@ -260,7 +260,7 @@ kubectl create -f mpi-nodes.yaml Then you can log in to every OpenMPI node using ssh without input any passwords. 
-### Launching Cluster Job +#### Launching Cluster Job Follow the steps to launch a PaddlePaddle training job in OpenMPI cluster:\ @@ -288,6 +288,6 @@ scp train.txt-00002 test.txt-00002 [node3IP]:/home/tutorial mpirun -hostfile machines -n 3 /home/tutorial/start_mpi_train.sh ``` -## Cluster Training Using Kubernetes +### Cluster Training Using Kubernetes The details can be found [here](../k8s/k8s_cn.md) diff --git a/go/.gitignore b/go/.gitignore index 000e1fd55b63b8e532308b787c2708a6c3e5ac87..398d70ca375ffceccdbfc82a4851a6830ca31264 100644 --- a/go/.gitignore +++ b/go/.gitignore @@ -1,2 +1,3 @@ vendor/ .glide/ +proto/*.go diff --git a/go/glide.lock b/go/glide.lock index ce654d36364f8078a493651d8d8b141532eea26d..d15fc934dbe511389cc92ce95cededa41ba32b4d 100644 --- a/go/glide.lock +++ b/go/glide.lock @@ -1,5 +1,5 @@ -hash: 51d9e2e46d7fd9173ff11ecada40f7b7728756be18d5e2f032535f66465e6e15 -updated: 2017-10-24T15:04:09.987751592-07:00 +hash: 107c058cf5c9163a75d40eef2273a793c36112683c25d72aa8288827fdde3a19 +updated: 2017-10-30T03:46:19.137696069Z imports: - name: github.com/alecthomas/gometalinter version: bae2f1293d092fd8167939d5108d1b025eaef9de diff --git a/go/glide.yaml b/go/glide.yaml index ba253f8bebef0ddab810a8303ab1fbe541defbdf..c5d66694acd0f45de5002391a7953b7491eaf2bc 100644 --- a/go/glide.yaml +++ b/go/glide.yaml @@ -30,3 +30,4 @@ import: version: v2.13 - package: github.com/go-stack/stack version: v1.6.0 +- package: github.com/golang/protobuf diff --git a/go/proto/.gitignore b/go/proto/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..5e7d2734cfc60289debf74293817c0a8f572ff32 --- /dev/null +++ b/go/proto/.gitignore @@ -0,0 +1,4 @@ +# Ignore everything in this directory +* +# Except this file +!.gitignore diff --git a/go/pserver/CMakeLists.txt b/go/pserver/CMakeLists.txt index 4fe0a8cb021e8dbf443c8f33bfb046e228a2fd8d..9ac05199e7ab76c21275838092c0afbdf2612b77 100644 --- a/go/pserver/CMakeLists.txt +++ b/go/pserver/CMakeLists.txt @@ -13,5 +13,5 @@ # limitations under the License. # if(WITH_TESTING) - go_test(pserver_test DEPS paddle_go_optimizer) + go_test(pserver_test DEPS paddle_go_optimizer gen_proto_go) endif() diff --git a/go/pserver/service.go b/go/pserver/service.go index f703d99a29ae9f5310ef36a7492b729c4c892937..7484ec90b1a3a9e67fa798741a9dfeb580c51f1a 100644 --- a/go/pserver/service.go +++ b/go/pserver/service.go @@ -17,6 +17,7 @@ package pserver import ( "bufio" "bytes" + "encoding/binary" "encoding/gob" "encoding/json" "errors" @@ -26,11 +27,15 @@ import ( "os" "path" "strconv" + "strings" "sync" "time" + "github.com/golang/protobuf/proto" uuid "github.com/satori/go.uuid" + pb "github.com/PaddlePaddle/Paddle/go/proto" + log "github.com/inconshreveable/log15" ) @@ -65,6 +70,46 @@ type Parameter struct { Content []byte } +func float32ToString(b []byte) string { + f := make([]float32, len(b)/4) + buf := bytes.NewReader(b) + err := binary.Read(buf, binary.LittleEndian, &f) + if err != nil { + return "" + } + return fmt.Sprintf("%v", f) +} + +func float32ByteToString(c []byte) string { + var a []byte + var b []byte + if len(c) <= 80 { + a = c + } else { + a = c[0:40] + b = c[len(c)-40:] + } + + var s string + s = float32ToString(a) + + if b == nil { + return s + } + + s = strings.Replace(s, "]", "", -1) + "..." 
+ strings.Replace(float32ToString(b), "[", "", -1) + return s +} + +func (p Parameter) String() string { + if p.ElementType != Float32 { + return fmt.Sprintf("name:%v ElementType:%v", + p.Name, p.ElementType) + } + + return float32ByteToString(p.Content) +} + // ParameterWithConfig contains the parameter and the configuration. type ParameterWithConfig struct { Param Parameter @@ -189,7 +234,9 @@ func (s *Service) InitParam(paramWithConfigs ParameterWithConfig, _ *int) error default: } - // TODO(helin): parse parameter config + c := &pb.OptimizerConfig{} + proto.Unmarshal(paramWithConfigs.Config, c) + log.Debug(fmt.Sprintf("OptimizerConfig:%v", c)) s.mu.Lock() defer s.mu.Unlock() @@ -239,7 +286,8 @@ func (s *Service) SendGrad(g Gradient, _ *int) error { select { case <-s.initialized: default: - log.Warn("received gradient before initialization.", "name", g.Name, "size", len(g.Content), "type", g.ElementType) + log.Warn("received gradient before initialization.", + "name", g.Name, "size", len(g.Content), "type", g.ElementType) return errors.New(Uninitialized) } @@ -248,10 +296,14 @@ func (s *Service) SendGrad(g Gradient, _ *int) error { o, ok := s.optMap[g.Name] if !ok { + log.Warn("received gradient but can't find name.", + "name", g.Name, "size", len(g.Content), "type", g.ElementType) return fmt.Errorf("parameter: %s does not exist", g.Name) } - log.Info("received gradient from trainer, updating gradient.", "name", g.Name, "size", len(g.Content), "type", g.ElementType) + log.Debug(Parameter(g).String()) + log.Info("received gradient from trainer, updating gradient.", + "name", g.Name, "size", len(g.Content), "type", g.ElementType) return o.UpdateParameter(g) } @@ -277,7 +329,7 @@ func (s *Service) GetParam(name string, parameter *Parameter) error { parameter.Name = name parameter.ElementType = opt.elementType parameter.Content = opt.GetWeights() - + log.Debug(parameter.String()) log.Info("sending parameter to the trainer", "name", parameter.Name, "size", len(parameter.Content), "type", parameter.ElementType) return nil } diff --git a/go/pserver/service_test.go b/go/pserver/service_test.go index b6f4566eb78cf797e3738afa5f86f5c4e8090d85..58a743e1fadff9d629f682d660e661013c33ac8a 100644 --- a/go/pserver/service_test.go +++ b/go/pserver/service_test.go @@ -15,6 +15,7 @@ package pserver_test import ( + "fmt" "io/ioutil" "reflect" "sync" @@ -178,3 +179,33 @@ func TestBlockUntilInitialized(t *testing.T) { wg.Wait() } + +func TestGradientString(t *testing.T) { + g := pserver.Parameter{} + g.ElementType = pserver.Float32 + g.Content = []byte{0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40} + if g.String() != "[3.3702806e+12 2.142699 3.3702806e+12 2.142699]" { + t.Fatal("get float data error!") + } + + g.Content = []byte{0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, + 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40, 
+ 0x18, 0x2d, 0x44, 0x54, 0xfb, 0x21, 0x09, 0x40} + if g.String() != "[3.3702806e+12 2.142699 3.3702806e+12 2.142699 3.3702806e+12 2.142699 3.3702806e+12 2.142699 3.3702806e+12 2.142699...3.3702806e+12 2.142699 3.3702806e+12 2.142699 3.3702806e+12 2.142699 3.3702806e+12 2.142699 3.3702806e+12 2.142699]" { + t.Fatal("get float data error!", g.String()) + } + fmt.Println(g) +} diff --git a/paddle/gserver/tests/CMakeLists.txt b/paddle/gserver/tests/CMakeLists.txt index 329536afaf6d69676e8c39fdf8b6b8cb87ade5fa..aa94ee406e27c86e6d49b6d2b5327a3f86bcacd6 100644 --- a/paddle/gserver/tests/CMakeLists.txt +++ b/paddle/gserver/tests/CMakeLists.txt @@ -1,24 +1,29 @@ # gserver pacakge unittests -if(NOT MOBILE_INFERENCE) -################### test_ProtoDataProvider ############ - add_unittest_without_exec(test_ProtoDataProvider - test_ProtoDataProvider.cpp) - - # test_ProtoDataProvider will mkdir as same name, - # so if WORKING_DIRECTORY is default directory, then - # mkdir will get error. - add_test(NAME test_ProtoDataProvider - COMMAND ${CMAKE_CURRENT_BINARY_DIR}/test_ProtoDataProvider - WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}/paddle) -endif() +add_simple_unittest(test_LinearChainCRF) +add_simple_unittest(test_MultinomialSampler) +add_simple_unittest(test_RecurrentLayer) -################# test_LayerGrad ####################### -add_unittest_without_exec(test_LayerGrad - test_LayerGrad.cpp - LayerGradUtil.cpp) -add_test(NAME test_LayerGrad - COMMAND test_LayerGrad) +function(gserver_test TARGET) + add_unittest_without_exec(${TARGET} + ${TARGET}.cpp + LayerGradUtil.cpp) + add_test(NAME ${TARGET} + COMMAND ${TARGET}) +endfunction() + +gserver_test(test_LayerGrad) +gserver_test(test_CRFLayerGrad) +gserver_test(test_CrossEntropyOverBeamGrad) +gserver_test(test_SeqSliceLayerGrad) +gserver_test(test_ActivationGrad) +gserver_test(test_ConvTrans) +gserver_test(test_PriorBox) +gserver_test(test_DetectionOutput) +gserver_test(test_ConvUnify) +gserver_test(test_BatchNorm) +gserver_test(test_KmaxSeqScore) +gserver_test(test_Expand) ########## test_Mkldnn layers and activations ########## if(WITH_MKLDNN) @@ -32,89 +37,6 @@ if(WITH_MKLDNN) WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}/paddle) endif() -################ test_CRFLayerGrad #################### -add_unittest_without_exec(test_CRFLayerGrad - test_CRFLayerGrad.cpp - LayerGradUtil.cpp) -add_test(NAME test_CRFLayerGrad - COMMAND test_CRFLayerGrad) - -################ test_CrossEntropyOverBeam #################### -add_unittest_without_exec(test_CrossEntropyOverBeam - test_CrossEntropyOverBeamGrad.cpp - LayerGradUtil.cpp) -add_test(NAME test_CrossEntropyOverBeam - COMMAND test_CrossEntropyOverBeam) - -################ test_SeqSliceLayerGrad #################### -add_unittest_without_exec(test_SeqSliceLayerGrad - test_SeqSliceLayerGrad.cpp - LayerGradUtil.cpp) -add_test(NAME test_SeqSliceLayerGrad - COMMAND test_SeqSliceLayerGrad) - -add_unittest_without_exec(test_ActivationGrad - test_ActivationGrad.cpp - LayerGradUtil.cpp) -add_test(NAME test_ActivationGrad - COMMAND test_ActivationGrad) -################# test_ConvTrans ####################### -add_unittest_without_exec(test_ConvTrans - test_ConvTrans.cpp - LayerGradUtil.cpp) - -add_test(NAME test_ConvTrans - COMMAND test_ConvTrans) -################# test_PriorBox ####################### -add_unittest_without_exec(test_PriorBox - test_PriorBox.cpp - LayerGradUtil.cpp) - -add_test(NAME test_PriorBox - COMMAND test_PriorBox) -################# test_DetectionOutput ####################### 
-add_unittest_without_exec(test_DetectionOutput - test_DetectionOutput.cpp - LayerGradUtil.cpp) - -add_test(NAME test_DetectionOutput - COMMAND test_DetectionOutput) -################# test_ConvUnify ####################### -add_unittest_without_exec(test_ConvUnify - test_ConvUnify.cpp - LayerGradUtil.cpp) - -add_test(NAME test_ConvUnify - COMMAND test_ConvUnify) -################# test_BatchNorm ####################### -add_unittest_without_exec(test_BatchNorm - test_BatchNorm.cpp - LayerGradUtil.cpp) - -add_test(NAME test_BatchNorm - COMMAND test_BatchNorm) - - -################# test_KmaxSeqScore ####################### -add_unittest_without_exec(test_KmaxSeqScore - test_KmaxSeqScore.cpp - LayerGradUtil.cpp) - -add_test(NAME test_KmaxSeqScore - COMMAND test_KmaxSeqScore) - -if(NOT MOBILE_INFERENCE) -################## test_Evaluator ####################### - add_unittest(test_Evaluator - test_Evaluator.cpp) -endif() - -################ test_LinearChainCRF #################### -add_simple_unittest(test_LinearChainCRF) - -############## test_MultinomialSampler ################### -add_simple_unittest(test_MultinomialSampler) - ############## test_PyDataProvider ######################## if(WITH_PYTHON) add_unittest_without_exec(test_PyDataProvider @@ -125,9 +47,6 @@ if(WITH_PYTHON) WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}/paddle) endif() -############### test_RecurrentLayer ####################### -add_simple_unittest(test_RecurrentLayer) - ############### test_WarpCTCLayer ####################### if(NOT WITH_DOUBLE) add_unittest_without_exec(test_WarpCTCLayer @@ -139,19 +58,33 @@ if(NOT WITH_DOUBLE) endif() if(NOT MOBILE_INFERENCE) -############### test_RecurrentGradientMachine ############### - # TODO(yuyang18): There is some bug in test_RecurrentGradientMachine - # I will fix it. - add_unittest_without_exec(test_RecurrentGradientMachine - test_RecurrentGradientMachine.cpp) - add_test(NAME test_RecurrentGradientMachine - COMMAND .set_python_path.sh -d - ${PADDLE_SOURCE_DIR}/python:${PADDLE_SOURCE_DIR}/paddle/gserver/tests - ${CMAKE_CURRENT_BINARY_DIR}/test_RecurrentGradientMachine - WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}/paddle) -endif() +################### test_ProtoDataProvider ############ + add_unittest_without_exec(test_ProtoDataProvider + test_ProtoDataProvider.cpp) -if(NOT MOBILE_INFERENCE) + # test_ProtoDataProvider will mkdir as same name, + # so if WORKING_DIRECTORY is default directory, then + # mkdir will get error. + add_test(NAME test_ProtoDataProvider + COMMAND ${CMAKE_CURRENT_BINARY_DIR}/test_ProtoDataProvider + WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}/paddle) + +################## test_Evaluator ####################### + add_unittest(test_Evaluator + test_Evaluator.cpp) + +############### test_RecurrentGradientMachine ############### + # TODO(yuyang18): There is some bug in test_RecurrentGradientMachine + # I will fix it. 
+ add_unittest_without_exec(test_RecurrentGradientMachine + test_RecurrentGradientMachine.cpp) + add_test(NAME test_RecurrentGradientMachine + COMMAND .set_python_path.sh -d + ${PADDLE_SOURCE_DIR}/python:${PADDLE_SOURCE_DIR}/paddle/gserver/tests + ${CMAKE_CURRENT_BINARY_DIR}/test_RecurrentGradientMachine + WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}/paddle) + +############### test_NetworkCompare ############### add_unittest_without_exec(test_NetworkCompare test_NetworkCompare.cpp) if(WITH_GPU) diff --git a/paddle/gserver/tests/test_Expand.cpp b/paddle/gserver/tests/test_Expand.cpp new file mode 100644 index 0000000000000000000000000000000000000000..d32bf0152f77bba098daa508fe448784ac013549 --- /dev/null +++ b/paddle/gserver/tests/test_Expand.cpp @@ -0,0 +1,127 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include + +#include "LayerGradUtil.h" +#include "paddle/testing/TestUtil.h" + +using namespace paddle; // NOLINT +using namespace std; // NOLINT + +// Do one forward pass of expand layer and check to see if its output +// matches the given result.(Test onlyCPU currently.) +void doOneExpandTest(string trans_type, + bool hasSubseq, + bool useGpu, + Argument& input1, + Argument& input2, + Argument& result) { + FLAGS_use_gpu = false; + // Setting up the expand layer + TestConfig config; + config.layerConfig.set_type("expand"); + + auto inputType1 = + trans_type == "non-seq" ? INPUT_DENSE_DIM_DATA : INPUT_SEQUENCE_DATA; + config.inputDefs.push_back({inputType1, "layer0", 1, 0}); + auto inputType2 = + hasSubseq ? INPUT_HASSUB_SEQUENCE_DATA : INPUT_SEQUENCE_DATA; + + config.inputDefs.push_back({inputType2, "layer1", 1, 0}); + config.layerConfig.add_inputs(); + config.layerConfig.add_inputs(); + config.layerConfig.set_trans_type(trans_type); + + // data layer initialize + std::vector dataLayers; + LayerMap layerMap; + vector datas; + initDataLayer( + config, &dataLayers, &datas, &layerMap, "expand", 1, false, useGpu); + dataLayers[0]->getOutput() = input1; + dataLayers[1]->getOutput() = input2; + + // test layer initialize + std::vector parameters; + LayerPtr expandLayer; + initTestLayer(config, &layerMap, ¶meters, &expandLayer); + expandLayer->forward(PASS_GC); + checkMatrixEqual(expandLayer->getOutputValue(), result.value); +} + +TEST(Layer, ExpandLayerFwd) { + bool useGpu = false; + + // Assume batch_size =3 in all cases. + + // CPU case 1. 
non-seq expand to seq + // input1 = 1,2,3 + // input2 = [4,5],[6],[7,8,9] + // result = [1,1],[2],[3,3,3] + Argument input1, input2, result; + input1.value = Matrix::create(3, 1, false, useGpu); + real input1Data[] = {1, 2, 3}; + input1.value->setData(input1Data); + + input2.value = Matrix::create(6, 1, false, useGpu); + real input2Data[] = {4, 5, 6, 7, 8, 9}; + input2.value->setData(input2Data); + input2.sequenceStartPositions = ICpuGpuVector::create(4, useGpu); + int input2Seq[] = {0, 2, 3, 6}; + input2.sequenceStartPositions->copyFrom(input2Seq, 4, useGpu); + + result.value = Matrix::create(6, 1, false, useGpu); + real resultData[] = {1, 1, 2, 3, 3, 3}; + result.value->setData(resultData); + + doOneExpandTest("non-seq", false, useGpu, input1, input2, result); + + // CPU case 2. non-seq expand to sub-seq + // NOTE: input1.batch_size == input2.sequencelength in this case. + // i.e, input1 expands by input2.sequence + // input1 = 1,2,3 + // input2 = [[4,5]],[[6]],[[7],[8,9]] + // result = [[1,1]],[[2]],[[3],[3,3]] + input2.subSequenceStartPositions = ICpuGpuVector::create(5, useGpu); + int input2SubSeq[] = {0, 2, 3, 4, 6}; + input2.subSequenceStartPositions->copyFrom(input2SubSeq, 5, useGpu); + + doOneExpandTest("non-seq", true, useGpu, input1, input2, result); + + // CPU case 3. seq expand to sub-seq + // input1 = [1,2],[3],[4] + // input2 = [[4,5]],[[6]],[[7],[8,9]] + // result = [[1,1]],[[2]],[[3],[4,4]] + Matrix::resizeOrCreate(input1.value, 4, 1, false, useGpu); + real input1Data_case3[] = {1, 2, 3, 4}; + input1.value->setData(input1Data_case3); + + input1.sequenceStartPositions = ICpuGpuVector::create(4, useGpu); + int input1Seq[] = {0, 2, 3, 4}; + input1.sequenceStartPositions->copyFrom(input1Seq, 4, useGpu); + + real resultData_case3[] = {1, 1, 2, 3, 4, 4}; + result.value->setData(resultData_case3); + + doOneExpandTest("seq", true, useGpu, input1, input2, result); +} + +int main(int argc, char** argv) { + testing::InitGoogleTest(&argc, argv); + initMain(argc, argv); + return RUN_ALL_TESTS(); +} diff --git a/paddle/operators/accuracy_op.cc b/paddle/operators/accuracy_op.cc index eb8bce8da70a128bd1e0d36540dce5e296540629..2a2a1e9cfd680ff983f54e4c12c34fbb5af69ca0 100644 --- a/paddle/operators/accuracy_op.cc +++ b/paddle/operators/accuracy_op.cc @@ -22,22 +22,35 @@ class AccuracyOp : public framework::OperatorWithKernel { using framework::OperatorWithKernel::OperatorWithKernel; void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("Inference"), - "Input(Inference) of AccuracyOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Out"), + "Input (Out) of accuracy op should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Indices"), + "Input (Indices) of accuracy op should not be null."); PADDLE_ENFORCE(ctx->HasInput("Label"), - "Input(Label) of AccuracyOp should not be null."); + "Input (Label) of accuracy op should not be null."); PADDLE_ENFORCE(ctx->HasOutput("Accuracy"), - "Output(Accuracy) of AccuracyOp should not be null."); + "Output (Accuracy) of AccuracyOp should not be null."); - auto inference_dim = ctx->GetInputDim("Inference"); + auto inference_dim = ctx->GetInputDim("Out"); auto label_dim = ctx->GetInputDim("Label"); + // Assume indices has same shape with infernece, because + // it's the output of topk. 
- PADDLE_ENFORCE_EQ(label_dim.size(), 1, "label must be a vector"); + PADDLE_ENFORCE_EQ(label_dim.size(), 2, "label's rank must be 2."); + PADDLE_ENFORCE_EQ(label_dim[1], 1, "label's second dimension must be 1"); PADDLE_ENFORCE_EQ(inference_dim[0], label_dim[0], - "inference size must be the same as label size"); + "the inference tensor's num_rows must be" + " the same as label."); ctx->SetOutputDim("Accuracy", {1}); - ctx->ShareLoD("Inference", /*->*/ "Accuracy"); + ctx->ShareLoD("Out", /*->*/ "Accuracy"); + } + + protected: + // IndicateDataType + framework::DataType IndicateDataType( + const framework::ExecutionContext &ctx) const override { + return framework::ToDataType(ctx.Input("Out")->type()); } }; @@ -47,7 +60,8 @@ class AccuracyOpMaker : public framework::OpProtoAndCheckerMaker { framework::OpAttrChecker *op_checker) : OpProtoAndCheckerMaker(proto, op_checker) { // TODO(typhoonzero): support both inference value and indices. - AddInput("Inference", "topk(indices) the network output"); + AddInput("Out", "topk (inferences) the network output"); + AddInput("Indices", "topk (indices) the network output"); AddInput("Label", "Label of the training data"); // TODO(typhoonzero): AddInput("Weight", ... AddOutput("Accuracy", "The accuracy of current batch"); @@ -58,7 +72,7 @@ The accuracy is: .. math:: accuracy = \\frac{NumOfCorrectPredicts}{NumOfAllSamples}) -Both the input `Inference` and `Label` can carry the LoD (Level of Details) +Both the input `Out` and `Label` can carry the LoD (Level of Details) information, or not. But the output only shares the LoD with input `Inference`. )DOC"); } @@ -68,7 +82,10 @@ information, or not. But the output only shares the LoD with input `Inference`. } // namespace paddle namespace ops = paddle::operators; -REGISTER_OP_WITHOUT_GRADIENT(accuracy, ops::AccuracyOp, ops::AccuracyOpMaker); -REGISTER_OP_CPU_KERNEL( - accuracy, ops::AccuracyKernel, - ops::AccuracyKernel); +REGISTER_OPERATOR(accuracy, ops::AccuracyOp, ops::AccuracyOpMaker, + paddle::framework::EmptyGradOpMaker); +// FIXME(typhoonzero): types of T is for infernece data. +// label data is always int. +REGISTER_OP_CPU_KERNEL(accuracy, + ops::AccuracyKernel, + ops::AccuracyKernel); diff --git a/paddle/operators/accuracy_op.cu b/paddle/operators/accuracy_op.cu index be58dfbd0305ba14488c2494f82a41ab6c0e8c19..a0483f367e1e3bcf4ad50ee9155f25461c1b4ed2 100644 --- a/paddle/operators/accuracy_op.cu +++ b/paddle/operators/accuracy_op.cu @@ -21,9 +21,10 @@ namespace paddle { namespace operators { using platform::PADDLE_CUDA_NUM_THREADS; -template -__global__ void AccuracyCudaKernel(const int N, const int D, const T* Xdata, - const T* labeldata, float* accuracy) { +template +__global__ void AccuracyCudaKernel(const int N, const int D, + const int64_t* Xdata, + const int64_t* labeldata, float* accuracy) { int count = 0; __shared__ int total[BlockSize]; @@ -52,13 +53,14 @@ class AccuracyOpCUDAKernel : public framework::OpKernel { void Compute(const framework::ExecutionContext& ctx) const override { PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), "It must use GPUPlace."); - auto* inference = ctx.Input("Inference"); + auto* inference = ctx.Input("Out"); + auto* indices = ctx.Input("Indices"); auto* label = ctx.Input("Label"); auto* accuracy = ctx.Output("Accuracy"); // FIXME(typhoonzero): only support indices currently // if add support for output values, how to detect the data type? 
- const T* inference_data = inference->data(); - const T* label_data = label->data(); + const int64_t* indices_data = indices->data(); + const int64_t* label_data = label->data(); float* accuracy_data = accuracy->mutable_data(ctx.GetPlace()); size_t num_samples = inference->dims()[0]; @@ -69,11 +71,11 @@ class AccuracyOpCUDAKernel : public framework::OpKernel { return; } - AccuracyCudaKernel<<< + AccuracyCudaKernel<<< 1, PADDLE_CUDA_NUM_THREADS, 0, reinterpret_cast( ctx.device_context()) - .stream()>>>(num_samples, infer_width, inference_data, label_data, + .stream()>>>(num_samples, infer_width, indices_data, label_data, accuracy_data); } }; @@ -81,5 +83,7 @@ class AccuracyOpCUDAKernel : public framework::OpKernel { } // namespace operators } // namespace paddle -REGISTER_OP_GPU_KERNEL(accuracy, paddle::operators::AccuracyOpCUDAKernel, - paddle::operators::AccuracyOpCUDAKernel); +// FIXME(typhoonzero): types of T is for infernece data. +// label data is always int +REGISTER_OP_GPU_KERNEL(accuracy, paddle::operators::AccuracyOpCUDAKernel, + paddle::operators::AccuracyOpCUDAKernel); diff --git a/paddle/operators/accuracy_op.h b/paddle/operators/accuracy_op.h index 12c6b9aac8819caedbc02017cee81b37322bb72a..1968b53d19acfddaa514eca6e24d98a298d8d311 100644 --- a/paddle/operators/accuracy_op.h +++ b/paddle/operators/accuracy_op.h @@ -38,14 +38,15 @@ template class AccuracyKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { - auto* inference = ctx.Input("Inference"); + auto* inference = ctx.Input("Out"); + auto* indices = ctx.Input("Indices"); auto* label = ctx.Input("Label"); auto* accuracy = ctx.Output("Accuracy"); float* accuracy_data = accuracy->mutable_data(ctx.GetPlace()); - const T* inference_data = inference->data(); - const T* label_data = label->data(); + const int64_t* indices_data = indices->data(); + const int64_t* label_data = label->data(); size_t num_samples = inference->dims()[0]; size_t class_dim = inference->dims()[1]; @@ -60,7 +61,7 @@ class AccuracyKernel : public framework::OpKernel { for (size_t i = 0; i < num_samples; ++i) { PADDLE_ENFORCE_GE(label_data[i], 0, "label must >= 0"); for (size_t j = 0; j < class_dim; ++j) { - if (inference_data[i * class_dim + j] == label_data[i]) { + if (indices_data[i * class_dim + j] == label_data[i]) { ++num_correct; break; } diff --git a/paddle/operators/activation_op.h b/paddle/operators/activation_op.h index e4c6b2e09cd71f00a2ef73173205b9066c34fcf5..ddd966e26c9abad0d83f8b5c6e3e7d9ad65158a8 100644 --- a/paddle/operators/activation_op.h +++ b/paddle/operators/activation_op.h @@ -547,6 +547,7 @@ struct ELUGradFunctor : public BaseActivationFunctor { } }; +// FIXME(qijun) https://github.com/PaddlePaddle/Paddle/issues/5198 template struct PowFunctor : public BaseActivationFunctor { float factor; diff --git a/paddle/operators/auc_op.cc b/paddle/operators/auc_op.cc index cf3dbc5d10c66cbb344ca8cf8c46432eabef4a07..f5784922af1cafcf4e6873c0e906bbd66ce8ae69 100644 --- a/paddle/operators/auc_op.cc +++ b/paddle/operators/auc_op.cc @@ -23,18 +23,26 @@ class AucOp : public framework::OperatorWithKernel { protected: void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("Inference"), - "Input of Inference must be initialized."); + PADDLE_ENFORCE(ctx->HasInput("Out"), "Input of Out must be initialized."); + PADDLE_ENFORCE(ctx->HasInput("Indices"), + "Input of Indices must be initialized."); PADDLE_ENFORCE(ctx->HasInput("Label"), "Input of Label must 
be initialized."); - auto inference_dim = ctx->GetInputDim("Inference"); - auto label_dim = ctx->GetInputDim("Label"); + auto inference_height = ctx->GetInputDim("Out")[0]; + auto label_height = ctx->GetInputDim("Label")[0]; - PADDLE_ENFORCE_EQ(inference_dim, label_dim, - "inference and label should have same shape"); + PADDLE_ENFORCE_EQ(inference_height, label_height, + "Out and Label should have same height."); ctx->SetOutputDim("AUC", {1}); - ctx->ShareLoD("Inference", /*->*/ "AUC"); + ctx->ShareLoD("Out", /*->*/ "AUC"); + } + + protected: + // IndicateDataType + framework::DataType IndicateDataType( + const framework::ExecutionContext &ctx) const override { + return framework::ToDataType(ctx.Input("Out")->type()); } }; @@ -42,12 +50,18 @@ class AucOpMaker : public framework::OpProtoAndCheckerMaker { public: AucOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker) : OpProtoAndCheckerMaker(proto, op_checker) { - AddInput("Inference", - "A floating point tensor of arbitrary shape and whose values" - "are in the range [0, 1]."); + AddInput("Out", + "A floating point 2D tensor, values are in the range [0, 1]." + "Each row is descend sorted. This input should be the" + "output of topk." + "Typically, this tensor indicates the probability of each label"); + AddInput("Indices", + "An int 2D tensor, indicating the indices of original" + "tensor before sort. Typically, this tensor indicates which label" + "the probability stands for."); AddInput("Label", - "A tensor whose shape matches " - "Inference. Will be cast to bool."); + "A 2D int tensor indicating the label of the training data." + "The height is batch size and width is always 1."); // TODO(typhoonzero): support weight input AddOutput("AUC", "A scalar representing the " diff --git a/paddle/operators/auc_op.h b/paddle/operators/auc_op.h index be6ef29d5f6cff5b9ebdf7d8564b2e2792c3b5cb..e5ac57b038ac32ed35bce35e477ede0cdb5da813 100644 --- a/paddle/operators/auc_op.h +++ b/paddle/operators/auc_op.h @@ -29,7 +29,7 @@ template class AucKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { - auto* inference = ctx.Input("Inference"); + auto* inference = ctx.Input("Out"); auto* label = ctx.Input("Label"); auto* auc = ctx.Output("AUC"); @@ -46,18 +46,11 @@ class AucKernel : public framework::OpKernel { thresholds_list[0] = 0.0f - kEpsilon; thresholds_list[num_thresholds - 1] = 1.0f + kEpsilon; - size_t num_samples = inference->numel(); + size_t batch_size = inference->dims()[0]; + size_t inference_width = inference->dims()[1]; const T* inference_data = inference->data(); - Tensor label_casted; - label_casted.Resize(label->dims()); - bool* label_casted_data = label_casted.mutable_data(ctx.GetPlace()); - - const int* label_data = label->data(); - // cast label_data to bool - for (size_t i = 0; i < num_samples; i++) { - label_casted_data[i] = static_cast(label_data[i]); - } + const int64_t* label_data = label->data(); // Create local tensor for storing the curve: TP, FN, TN, FP // TODO(typhoonzero): use eigen op to caculate these values. 
@@ -68,23 +61,27 @@ class AucKernel : public framework::OpKernel { true_negative.Resize({num_thresholds}); false_positive.Resize({num_thresholds}); - int* tp_data = true_positive.mutable_data(ctx.GetPlace()); - int* fn_data = false_negative.mutable_data(ctx.GetPlace()); - int* tn_data = true_negative.mutable_data(ctx.GetPlace()); - int* fp_data = false_positive.mutable_data(ctx.GetPlace()); + int64_t* tp_data = true_positive.mutable_data(ctx.GetPlace()); + int64_t* fn_data = false_negative.mutable_data(ctx.GetPlace()); + int64_t* tn_data = true_negative.mutable_data(ctx.GetPlace()); + int64_t* fp_data = false_positive.mutable_data(ctx.GetPlace()); for (int idx_thresh = 0; idx_thresh < num_thresholds; idx_thresh++) { // caculate TP, FN, TN, FP for current thresh - int tp = 0, fn = 0, tn = 0, fp = 0; - for (size_t i = 0; i < num_samples; i++) { - if (label_casted_data[i]) { - if (inference_data[i] >= (thresholds_list[idx_thresh])) { + int64_t tp = 0, fn = 0, tn = 0, fp = 0; + for (size_t i = 0; i < batch_size; i++) { + // NOTE: label_data used as bool, labels >0 will be treated as true. + if (label_data[i]) { + // use first(max) data in each row + if (inference_data[i * inference_width] >= + (thresholds_list[idx_thresh])) { tp++; } else { fn++; } } else { - if (inference_data[i] >= (thresholds_list[idx_thresh])) { + if (inference_data[i * inference_width] >= + (thresholds_list[idx_thresh])) { fp++; } else { tn++; diff --git a/paddle/operators/math/context_project.h b/paddle/operators/math/context_project.h index e37f3a5bf2bd59e46f66aa3a8284e05d79dbc790..e0283360414fbdfb3dae2e94b45c9c8daeed3c74 100644 --- a/paddle/operators/math/context_project.h +++ b/paddle/operators/math/context_project.h @@ -16,36 +16,36 @@ limitations under the License. */ #include "paddle/framework/eigen.h" #include "paddle/framework/lod_tensor.h" -#include "paddle/framework/tensor.h" #include "paddle/operators/math/im2col.h" namespace paddle { namespace operators { namespace math { +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; template using EigenMatrix = framework::EigenMatrix; + /* - * \brief Context projection concatenate features in adjacent time steps in + * \brief Context projection concatenates features in adjacent time-steps in * a sequence. The i-th row of the output is the concatenation of * context_length rows of the input. The context_length rows are the * consecutive rows from the i+shift_start row. - + * ContextProjectGradFunctor is the inverse process of ContextProjectFunctor. + * * \param in Input data. - * \param Shape The shape of Input data, - * [minibatch, number_of_input_features]. - * \param type A float LoDTensor. + * \param Shape The shape of Input data: + * [mini-batch, input_hidden_size]. * * \param padding_data Padding data. - * \param Shape The shape of Padding data, - * [up_pad + down_pad, number_of_input_features]. - * \param type A float Tensor. + * \param Shape The shape of Padding data: + * [up_pad + down_pad, input_hidden_size]. * * \param col Col data. - * \param Shape The shape of Col data, - * [minibatch, context_length * number_of_input_features]. - * \param type A float Tensor. + * \param Shape The shape of Col data: + * [mini-batch, context_length * input_hidden_size]. * * For a mini-batch of 2 variable lengths sentences, containing 3, and 1 * time-steps: @@ -63,72 +63,170 @@ using EigenMatrix = framework::EigenMatrix; * representation is 2. 
* * - Case1: - * If context_start is -1 and padding_trainable is false, we use zero to pad - * instead of learned weight to pad, - * and the context_lenth is 3, the output (Out) is: + * If context_start is -1 and padding_trainable is false, we use zero to pad + * instead of learned weight to pad, + * and the context_length is 3, the output (Out) is: * - * Out =[[0, 0, a1, a2, b1, b2; - * a1, a2, b1, b2, c1, c2; - * b1, b2, c1, c2, 0, 0 ] - * [0, 0, d1, d2, 0, 0 ]] + * Out =[[0, 0, a1, a2, b1, b2; + * a1, a2, b1, b2, c1, c2; + * b1, b2, c1, c2, 0, 0 ] + * [0, 0, d1, d2, 0, 0 ]] * * - Case2: - * If context_start is -1 and padding_trainable is true, we use learned weight - * to pad, - * and the context_lenth is 3, the output (Out) is: + * If context_start is -1 and padding_trainable is true, we use learned weight + * to pad, + * and the context_length is 3, the output (Out) is: * - * Out = [[w1, w2, a1, a2, b1, b2; - * a1, a2, b1, b2, c1, c2; - * b1, b2, c1, c2, w3, w4] - * [w1, w2, d1, d2, w3, w4]] + * Out = [[w1, w2, a1, a2, b1, b2; + * a1, a2, b1, b2, c1, c2; + * b1, b2, c1, c2, w3, w4] + * [w1, w2, d1, d2, w3, w4]] * */ template class ContextProjectFunctor { public: - void operator()(const platform::DeviceContext& context, - framework::LoDTensor& in, framework::Tensor& padding_data, - framework::Tensor& col, bool padding_trainable, + void operator()(const platform::DeviceContext& context, const LoDTensor& in, + const Tensor& padding_data, Tensor& col, + bool padding_trainable, int context_start, int context_length, + int context_stride, int up_pad, int down_pad) { + auto lod_level_0 = in.lod()[0]; + + math::Im2ColFunctor im2col_ocf; + + int input_row_begin, input_row_end; + int sequence_height, sequence_width; + sequence_width = in.dims()[1]; + + for (int i = 0; i < static_cast(lod_level_0.size()) - 1; ++i) { + input_row_begin = (context_start > 0) + ? static_cast(lod_level_0[i]) + context_start + : static_cast(lod_level_0[i]); + input_row_end = static_cast(lod_level_0[i + 1]); + + Tensor out_t = col.Slice(static_cast(lod_level_0[i]), + static_cast(lod_level_0[i + 1])); + + sequence_height = static_cast(out_t.dims()[0]); + + if (input_row_begin < input_row_end) { + Tensor in_t = in.Slice(input_row_begin, input_row_end); + + std::vector output_shape( + {sequence_height, 1, 1, context_length, + sequence_width}); // output_height, output_width, + // input_channels, filter_height, filter_width + out_t.Resize(framework::make_ddim(output_shape)); + + std::vector input_shape( + {1, input_row_end - input_row_begin, + sequence_width}); // input_channels, input_height, input_width + in_t.Resize(framework::make_ddim(input_shape)); + + im2col_ocf(context, in_t, out_t, + /*stride_height*/ context_stride, /*stride_width*/ 1, up_pad, + down_pad, 0, 0); + out_t.Resize({sequence_height, context_length * sequence_width}); + } + } + if (padding_trainable) { + for (int i = 0; i < static_cast(lod_level_0.size()) - 1; ++i) { + Tensor out_t = col.Slice(static_cast(lod_level_0[i]), + static_cast(lod_level_0[i + 1])); + + sequence_height = static_cast(out_t.dims()[0]); + + // add up trainable data + out_t.Resize({sequence_height * context_length, sequence_width}); + + if (up_pad > 0) { // add up pad + int padding_rows = std::min( + up_pad, static_cast(lod_level_0[i + 1] - lod_level_0[i])); + + for (int k = 0; k < padding_rows; ++k) { + int padding_size = + k + context_length < up_pad ? 
context_length : up_pad - k; + Tensor out_t_sub = out_t.Slice(k * context_length, + k * context_length + padding_size); + Tensor w_sub = padding_data.Slice(k, k + padding_size); + auto out_t_sub_e = EigenMatrix::From(out_t_sub); + auto w_sub_e = EigenMatrix::From(w_sub); + out_t_sub_e.device(*context.GetEigenDevice()) = w_sub_e; + } + } + if (down_pad > 0) { // add down pad + int down_pad_begin_row = + std::max(0, + (sequence_height - context_start - context_length) + 1) + + 1; + int padding_begin = std::max(0, context_start - sequence_height); + int padding_size = + sequence_height - context_start >= context_length + ? 1 + : context_length - (sequence_height - context_start); + if (context_start >= sequence_height) padding_size = context_length; + int padding_idx = padding_begin; + for (int t = 0; t + down_pad_begin_row <= sequence_height; + ++t, ++padding_size) { + if (context_start >= sequence_height) padding_size = context_length; + if (padding_size > context_length) { + padding_size = context_length; + padding_idx++; + } + if (padding_begin > 0 || sequence_height == context_start) + padding_idx = padding_begin + t; + + Tensor out_t_sub = out_t.Slice( + (down_pad_begin_row + t) * context_length - padding_size, + (down_pad_begin_row + t) * context_length); + Tensor w_sub = padding_data.Slice( + up_pad + padding_idx, up_pad + padding_idx + padding_size); + auto out_t_sub_e = EigenMatrix::From(out_t_sub); + auto w_sub_e = EigenMatrix::From(w_sub); + out_t_sub_e.device(*context.GetEigenDevice()) = w_sub_e; + } + } + out_t.Resize({sequence_height, context_length * sequence_width}); + } + } + } +}; + +template +class ContextProjectGradFunctor { + public: + void operator()(const platform::DeviceContext& context, LoDTensor& in, + Tensor& padding_data, Tensor& col, bool padding_trainable, int context_start, int context_length, int context_stride, - int up_pad, int down_pad, bool gradient, bool input_grad, - bool pad_grad) { + int up_pad, int down_pad, bool input_grad, bool pad_grad) { auto lod_level_0 = in.lod()[0]; - paddle::operators::math::Im2ColFunctor< - paddle::operators::math::ColFormat::kOCF, Place, float> - im2col_ocf; - paddle::operators::math::Col2ImFunctor< - paddle::operators::math::ColFormat::kOCF, Place, float> - col2im_ocf; + math::Col2ImFunctor col2im_ocf; int input_row_begin, input_row_end; int sequence_height, sequence_width; sequence_width = in.dims()[1]; - input_grad = gradient && input_grad; - pad_grad = gradient && pad_grad; - if (!gradient || input_grad) { + if (input_grad) { for (int i = 0; i < static_cast(lod_level_0.size()) - 1; ++i) { input_row_begin = (context_start > 0) ? 
static_cast(lod_level_0[i]) + context_start : static_cast(lod_level_0[i]); input_row_end = static_cast(lod_level_0[i + 1]); - framework::Tensor out_t = - col.Slice(static_cast(lod_level_0[i]), - static_cast(lod_level_0[i + 1])); + Tensor out_t = col.Slice(static_cast(lod_level_0[i]), + static_cast(lod_level_0[i + 1])); sequence_height = static_cast(out_t.dims()[0]); if (input_row_begin < input_row_end) { - framework::Tensor in_t = in.Slice(input_row_begin, input_row_end); + Tensor in_t = in.Slice(input_row_begin, input_row_end); std::vector output_shape( {sequence_height, 1, 1, context_length, sequence_width}); // output_height, output_width, // input_channels, filter_height, filter_width - out_t.Resize(framework::make_ddim(output_shape)); std::vector input_shape( @@ -136,53 +234,39 @@ class ContextProjectFunctor { sequence_width}); // input_channels, input_height, input_width in_t.Resize(framework::make_ddim(input_shape)); - if (gradient) { - col2im_ocf(context, in_t, out_t, - /*stride_height*/ context_stride, /*stride_width*/ 1, - up_pad, down_pad, 0, 0); - } else { - im2col_ocf(context, in_t, out_t, - /*stride_height*/ context_stride, /*stride_width*/ 1, - up_pad, down_pad, 0, 0); - } + col2im_ocf(context, in_t, out_t, + /*stride_height*/ context_stride, /*stride_width*/ 1, + up_pad, down_pad, 0, 0); out_t.Resize({sequence_height, context_length * sequence_width}); } } } - if (!gradient || pad_grad) { + if (pad_grad) { if (padding_trainable) { for (int i = 0; i < static_cast(lod_level_0.size()) - 1; ++i) { - framework::Tensor out_t = - col.Slice(static_cast(lod_level_0[i]), - static_cast(lod_level_0[i + 1])); + Tensor out_t = col.Slice(static_cast(lod_level_0[i]), + static_cast(lod_level_0[i + 1])); sequence_height = static_cast(out_t.dims()[0]); - - // add up trainable data out_t.Resize({sequence_height * context_length, sequence_width}); - if (up_pad > 0) { // add up pad + if (up_pad > 0) { int padding_rows = std::min( up_pad, static_cast(lod_level_0[i + 1] - lod_level_0[i])); for (int k = 0; k < padding_rows; ++k) { int padding_size = k + context_length < up_pad ? context_length : up_pad - k; - framework::Tensor out_t_sub = out_t.Slice( - k * context_length, k * context_length + padding_size); - framework::Tensor w_sub = padding_data.Slice(k, k + padding_size); - // in this block, using EigenVector::Flatten is ok too. 
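A reading of the projection these functors implement, in NumPy, restricted to the zero-padding path (Case1 in the comment above). `context_project_ref` is an illustrative name, not the library's API:

```python
import numpy as np

def context_project_ref(seq, context_start=-1, context_length=3):
    # seq: (T, N) features of one sequence; zeros stand in for padding,
    # i.e. the padding_trainable == false path
    T, N = seq.shape
    out = np.zeros((T, context_length * N), dtype=seq.dtype)
    for i in range(T):
        for j in range(context_length):
            t = i + context_start + j  # source time-step for output slot j
            if 0 <= t < T:
                out[i, j * N:(j + 1) * N] = seq[t]
    return out
```

For `seq = [[a1, a2], [b1, b2], [c1, c2]]` this reproduces the Case1 rows, and the gradient functor scatters the output's gradient back through the same index map.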
+ Tensor out_t_sub = out_t.Slice(k * context_length, + k * context_length + padding_size); + Tensor w_sub = padding_data.Slice(k, k + padding_size); auto out_t_sub_e = EigenMatrix::From(out_t_sub); auto w_sub_e = EigenMatrix::From(w_sub); - if (gradient) { - w_sub_e.device(*context.GetEigenDevice()) = - w_sub_e + out_t_sub_e; - } else { - out_t_sub_e.device(*context.GetEigenDevice()) = w_sub_e; - } + w_sub_e.device(*context.GetEigenDevice()) = + w_sub_e + out_t_sub_e; } } - if (down_pad > 0) { // add down pad + if (down_pad > 0) { int down_pad_begin_row = std::max( 0, (sequence_height - context_start - context_length) + 1) + @@ -204,19 +288,16 @@ class ContextProjectFunctor { } if (padding_begin > 0 || sequence_height == context_start) padding_idx = padding_begin + t; - framework::Tensor out_t_sub = out_t.Slice( + + Tensor out_t_sub = out_t.Slice( (down_pad_begin_row + t) * context_length - padding_size, (down_pad_begin_row + t) * context_length); - framework::Tensor w_sub = padding_data.Slice( + Tensor w_sub = padding_data.Slice( up_pad + padding_idx, up_pad + padding_idx + padding_size); auto out_t_sub_e = EigenMatrix::From(out_t_sub); auto w_sub_e = EigenMatrix::From(w_sub); - if (gradient) { - w_sub_e.device(*context.GetEigenDevice()) = - w_sub_e + out_t_sub_e; - } else { - out_t_sub_e.device(*context.GetEigenDevice()) = w_sub_e; - } + w_sub_e.device(*context.GetEigenDevice()) = + w_sub_e + out_t_sub_e; } } out_t.Resize({sequence_height, context_length * sequence_width}); diff --git a/paddle/operators/seq_expand_op.h b/paddle/operators/seq_expand_op.h index aa91e0f9296a7856f4723d413ca0de6876ab6f3b..8703105385183c1a0ee1a1b3831228f942c04dda 100644 --- a/paddle/operators/seq_expand_op.h +++ b/paddle/operators/seq_expand_op.h @@ -48,7 +48,7 @@ class SeqExpandKernel : public framework::OpKernel { x_t(x_data, 1, element_len); Eigen::TensorMap> out_t(out_data, scale, element_len); - Eigen::array cast({scale, 1}); + Eigen::array cast({{scale, 1}}); out_t.device(place) = x_t.broadcast(cast); x_data += element_len; out_data += element_len * scale; diff --git a/paddle/operators/sequence_conv_op.cc b/paddle/operators/sequence_conv_op.cc index 139000c561870c3bc49e01cdcb6cf4b787e64577..bdb52265a529f560b4622ee037dcb3160ac90dec 100644 --- a/paddle/operators/sequence_conv_op.cc +++ b/paddle/operators/sequence_conv_op.cc @@ -30,19 +30,20 @@ class SequenceConvOp : public framework::OperatorWithKernel { PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output(Out) of SequenceConvOp should not be null."); - int context_length = ctx->Attrs().Get("context_length"); - bool padding_trainable = ctx->Attrs().Get("padding_trainable"); - int context_start = ctx->Attrs().Get("context_start"); + int context_length = ctx->Attrs().Get("contextLength"); + int context_start = ctx->Attrs().Get("contextStart"); auto in_dims = ctx->GetInputDim("X"); auto filter_dims = ctx->GetInputDim("Filter"); + PADDLE_ENFORCE(ctx->Attrs().Get("contextStride") == 1, + "Currently, SequenceConvOp only supports contextStride=1."); PADDLE_ENFORCE(in_dims.size() == 2 && filter_dims.size() == 2, "Input(X, Filter) should be 2-D tensor."); PADDLE_ENFORCE(filter_dims[0] == context_length * in_dims[1], "Filter's height should be context_length * " - "number_of_input_features ."); + "input_hidden_size ."); - if (padding_trainable) { + if (ctx->Attrs().Get("paddingTrainable")) { PADDLE_ENFORCE( ctx->HasInput("PaddingData"), "Input(PaddingData) of SequenceConvOp should not be null."); @@ -54,7 +55,7 @@ class SequenceConvOp : public framework::OperatorWithKernel 
{ if (context_start == 0 && context_length == 1) { PADDLE_THROW( - "If context_start is 0 and context_length is 1, padding_trainable " + "If context_start is 0 and context_length is 1, paddingTrainable " "should be false."); } PADDLE_ENFORCE(padding_dim.size() == 2, @@ -81,13 +82,14 @@ class SequenceConvGradOp : public framework::OperatorWithKernel { "Gradient of output(Out) should not be null."); PADDLE_ENFORCE(ctx->HasInput("X"), "The input(X) should not be null."); - if (ctx->Attrs().Get("padding_trainable") && + if (ctx->Attrs().Get("paddingTrainable") && ctx->HasOutput(framework::GradVarName("PaddingData"))) { ctx->SetOutputDim(framework::GradVarName("PaddingData"), ctx->GetInputDim("PaddingData")); } if (ctx->HasOutput(framework::GradVarName("X"))) { ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X")); + ctx->ShareLoD(framework::GradVarName("X"), "X"); } if (ctx->HasOutput(framework::GradVarName("Filter"))) { ctx->SetOutputDim(framework::GradVarName("Filter"), @@ -105,54 +107,58 @@ class SequenceConvOpMaker : public framework::OpProtoAndCheckerMaker { "X", "(LoDTensor) the input(X) is a LodTensor, which support " "variable-time length input sequence. The underlying tensor in " - "this LoDTensor is a matrix with shape (T, D), where, T is the " - "total time steps in this mini-batch, D is the input feature size."); + "this LoDTensor is a matrix with shape (T, N), where, T is the " + "total time steps in this mini-batch, N is the input_hidden_size."); AddInput("PaddingData", "(Tensor, optional) the input(PaddingData) is an optional " "parameter, and it is learnable. " - "This is a tensor with shape (N, D), where N is the " - "top_pad + bottom_pad, D is the input feature size. In order to " + "This is a tensor with shape (P, N), where P is the " + "top_pad + bottom_pad, N is the input_hidden_size. In order to " "ensure the equal length of sequence before and after " "convolution, it is necessary to fill the top and bottom of each " "sequence according to context_length, context_stride and " "context_start") .AsDispensable(); - AddInput("Filter", - "(Tensor) the input(Filter) is an learnable parameter." - "This is a tensor with shape (N, D), where N is the " - "context_length, D is the output feature size."); + AddInput( + "Filter", + "(Tensor) the input(Filter) is an learnable parameter." + "This is a tensor with shape (K, M), where K is the " + "context_length * input_hidden_size, M is the output feature size."); AddOutput( "Out", "(LoDTensor) the output(Out) is a LodTensor, which support " "variable-time length output sequence. 
The underlying tensor in " - "this LoDTensor is a matrix with shape (T, D), where, T is the " - "total time steps in this mini-batch, D is the output feature size."); + "this LoDTensor is a matrix with shape (T, M), where, T is the " + "total time steps in this mini-batch, M is the output feature size."); - AddAttr("padding_trainable", - "(bool, default false) the padding data of SequenceConvOp " + AddAttr("paddingTrainable", + "(bool, default:false) the padding data of SequenceConvOp " "is trainable or not.") .SetDefault(false); - AddAttr("context_length", - "(int, default 3) the context_length of SequenceConvOp is the " + AddAttr("contextLength", + "(int) the contextLength of SequenceConvOp is the " "height of the convolution kernel.") - .SetDefault(3) .GreaterThan(0); - AddAttr("context_start", - "(int, default 0) the context_start of SequenceConvOp " + AddAttr("contextStart", + "(int, default:0) the contextStart of SequenceConvOp " "represents the beginning of the convolution of the number of " - "rows of sequence, which can be negative.") + "rows of sequence, which can be negative. The negative number " + "means to pad contextStart time-steps of zeros or learnable " + "parameters at the beginning of each instance. The positive " + "number means to skip contextStart time-steps of each " + "instance.") .SetDefault(0); - AddAttr("context_stride", - "(int, default 1) the context_stride of SequenceConvOp " - "represents the step length of convolution. " + AddAttr("contextStride", + "(int, default:1) the contextStride of SequenceConvOp " + "represents the stride length of convolution kernel. " "Currently, SequenceConvOp only supports" - "context_stride=1.") + "contextStride=1.") .SetDefault(1) .GreaterThan(0); AddComment(R"DOC( SequenceConvOp performs convolution operation on features of - context_length time-steps of each instance. + contextLength time-steps of each instance. The convolution operation calculates the output based on the input, filter and strides, paddings parameters. The size of each dimension of the parameters is checked in the infer-shape. In order to ensure the equal diff --git a/paddle/operators/sequence_conv_op.h b/paddle/operators/sequence_conv_op.h index cd8a8d4cea39161029602530cc75532b5f977d01..a57e1752bb8ed4844423f752bf0ad9f8e114486a 100644 --- a/paddle/operators/sequence_conv_op.h +++ b/paddle/operators/sequence_conv_op.h @@ -35,12 +35,11 @@ class SequenceConvKernel : public framework::OpKernel { out->mutable_data(context.GetPlace()); context.ShareLoD("X", "Out"); - int context_start = context.Attr("context_start"); - int context_length = context.Attr("context_length"); - int context_stride = context.Attr("context_stride"); - bool padding_trainable = context.Attr("padding_trainable"); + int context_start = context.Attr("contextStart"); + int context_length = context.Attr("contextLength"); + int context_stride = context.Attr("contextStride"); + bool padding_trainable = context.Attr("paddingTrainable"); - // InferShape by in_lod PADDLE_ENFORCE_EQ(in->lod().size(), 1UL, "Only support one level sequence now."); @@ -51,26 +50,21 @@ class SequenceConvKernel : public framework::OpKernel { int up_pad = std::max(0, -context_start); int down_pad = std::max(0, context_start + context_length - 1); - int sequence_width; - sequence_width = static_cast(in->dims()[1]); + int sequence_width = static_cast(in->dims()[1]); - // Use col_shape in the im2col calculation. 
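Continuing the sketch: the forward pass assembled below is just the context projection followed by one GEMM. Reusing the illustrative `context_project_ref` from earlier:

```python
def sequence_conv_ref(x, filter, context_start=-1, context_length=3):
    # x:      (T, N) one sequence; filter: (context_length * N, M)
    col = context_project_ref(x, context_start, context_length)  # (T, ctx*N)
    return col.dot(filter)                                       # (T, M)
```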
framework::DDim col_shape = {in->dims()[0], - sequence_width * context_length}; + context_length * sequence_width}; Tensor col; col.mutable_data(col_shape, context.GetPlace()); - math::SetConstant set_zero; // Because if padding_trainable is false, padding data should be zeros. + math::SetConstant set_zero; set_zero(context.device_context(), &col, static_cast(0)); - paddle::operators::math::ContextProjectFunctor - seq_project_functor; - LoDTensor* input = const_cast(in); - Tensor* pad_data = const_cast(padding_data); + math::ContextProjectFunctor seq_project_functor; - seq_project_functor(context.device_context(), *input, *pad_data, col, + seq_project_functor(context.device_context(), *in, *padding_data, col, padding_trainable, context_start, context_length, - context_stride, up_pad, down_pad, false, false, false); + context_stride, up_pad, down_pad); math::matmul(context.device_context(), col, false, filter, false, static_cast(1.0), out, static_cast(0.0)); @@ -81,18 +75,18 @@ template class SequenceConvGradKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& context) const override { - auto* out_g = context.Input(framework::GradVarName("Out")); auto* in_g = context.Output(framework::GradVarName("X")); + auto* out_g = context.Input(framework::GradVarName("Out")); auto* filter_g = context.Output(framework::GradVarName("Filter")); auto* padding_data_g = context.Output(framework::GradVarName("PaddingData")); auto* in = context.Input("X"); auto* filter = context.Input("Filter"); - int context_start = context.Attr("context_start"); - int context_length = context.Attr("context_length"); - int context_stride = context.Attr("context_stride"); - bool padding_trainable = context.Attr("padding_trainable"); + int context_start = context.Attr("contextStart"); + int context_length = context.Attr("contextLength"); + int context_stride = context.Attr("contextStride"); + bool padding_trainable = context.Attr("paddingTrainable"); PADDLE_ENFORCE_EQ(in->lod().size(), 1UL, "Only support one level sequence now."); @@ -115,17 +109,18 @@ class SequenceConvGradKernel : public framework::OpKernel { math::matmul(context.device_context(), *out_g, false, *filter, true, T(1.0), &col, T(1.0)); } - paddle::operators::math::ContextProjectFunctor - seq_project_functor; + math::ContextProjectFunctor seq_project_functor; + math::ContextProjectGradFunctor seq_project_grad_functor; if (in_g) { in_g->mutable_data(context.GetPlace()); in_g->set_lod(in->lod()); set_zero(context.device_context(), in_g, static_cast(0)); - seq_project_functor(context.device_context(), *in_g, *padding_data_g, col, - padding_trainable, context_start, context_length, - context_stride, up_pad, down_pad, true, true, false); + seq_project_grad_functor(context.device_context(), *in_g, *padding_data_g, + col, padding_trainable, context_start, + context_length, context_stride, up_pad, down_pad, + true, false); } if (padding_trainable && padding_data_g) { @@ -133,9 +128,10 @@ class SequenceConvGradKernel : public framework::OpKernel { set_zero(context.device_context(), padding_data_g, static_cast(0)); LoDTensor* input = const_cast(in); - seq_project_functor(context.device_context(), *input, *padding_data_g, - col, padding_trainable, context_start, context_length, - context_stride, up_pad, down_pad, true, false, true); + seq_project_grad_functor(context.device_context(), *input, + *padding_data_g, col, padding_trainable, + context_start, context_length, context_stride, + up_pad, down_pad, false, true); } if (filter_g) { 
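// A sketch of the branch below: with col recomputed by the forward
// projection, Filter's gradient accumulates col^T * d(Out) -- the
// transpose counterpart of the forward product col * Filter.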
@@ -150,15 +146,9 @@ class SequenceConvGradKernel : public framework::OpKernel { padding_data = context.Input("PaddingData"); } - sequence_width = static_cast(in->dims()[1]); - - LoDTensor* input = const_cast(in); - Tensor* pad_data = const_cast(padding_data); - - seq_project_functor(context.device_context(), *input, *pad_data, col, + seq_project_functor(context.device_context(), *in, *padding_data, col, padding_trainable, context_start, context_length, - context_stride, up_pad, down_pad, false, false, - false); + context_stride, up_pad, down_pad); math::matmul(context.device_context(), col, true, out_grad, false, T(1.0), &filter_grad, T(1.0)); diff --git a/paddle/operators/top_k_op.cc b/paddle/operators/top_k_op.cc index d5c2c91a5fb0f639ea84d13e27de8271218da54f..ac9257259548dc4223efac239d8362f69366850c 100644 --- a/paddle/operators/top_k_op.cc +++ b/paddle/operators/top_k_op.cc @@ -52,7 +52,11 @@ class TopkOpMaker : public framework::OpProtoAndCheckerMaker { AddOutput("Out", "The output tensor of Topk op"); AddOutput("Indices", "The indices of Topk elements of input"); AddComment( - R"DOC(If the input is a vector (1d tensor), finds the k largest entries in the vector and outputs their values and indices as vectors. Thus values[j] is the j-th largest entry in input, and its index is indices[j]. + R"DOC(If the input is a vector (1d tensor), + finds the k largest entries in the vector + and outputs their values and indices as vectors. + Thus values[j] is the j-th largest entry in input, + and its index is indices[j]. For matrices, computes the top k entries in each row. )DOC"); AddAttr("k", @@ -66,6 +70,7 @@ class TopkOpMaker : public framework::OpProtoAndCheckerMaker { } // namespace paddle namespace ops = paddle::operators; -REGISTER_OP_WITHOUT_GRADIENT(top_k, ops::TopkOp, ops::TopkOpMaker); +REGISTER_OPERATOR(top_k, ops::TopkOp, ops::TopkOpMaker, + paddle::framework::EmptyGradOpMaker); REGISTER_OP_CPU_KERNEL(top_k, ops::TopkKernel); diff --git a/paddle/operators/top_k_op.cu b/paddle/operators/top_k_op.cu index 7be6932f1e301d06e0e232367a38bfa673ff45be..7851c71bbe9fe73402968ce14f6db0df523cd6d3 100644 --- a/paddle/operators/top_k_op.cu +++ b/paddle/operators/top_k_op.cu @@ -23,9 +23,9 @@ using Tensor = framework::Tensor; template struct Pair { __device__ __forceinline__ Pair() {} - __device__ __forceinline__ Pair(T value, int id) : v(value), id(id) {} + __device__ __forceinline__ Pair(T value, int64_t id) : v(value), id(id) {} - __device__ __forceinline__ void set(T value, int id) { + __device__ __forceinline__ void set(T value, int64_t id) { v = value; id = id; } @@ -48,7 +48,7 @@ struct Pair { } T v; - int id; + int64_t id; }; template @@ -197,7 +197,7 @@ __device__ __forceinline__ void ThreadGetTopK(Pair topk[], int& beam, template __device__ __forceinline__ void BlockReduce(Pair* sh_topk, int* maxid, Pair topk[], T** topVal, - int** topIds, int& beam, int& k, + int64_t** topIds, int& beam, int& k, const int tid, const int warp) { while (true) { __syncthreads(); @@ -249,7 +249,7 @@ __device__ __forceinline__ void BlockReduce(Pair* sh_topk, int* maxid, * 4. go to the first setp, until get the topk value. 
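 * (Sketch: per row, the kernel's result should agree with a host-side
 *  reference such as order = np.argsort(-row); values = row[order[:k]];
 *  indices = order[:k], assuming ties are broken by scan order.)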
*/ template -__global__ void KeMatrixTopK(T* output, int output_stride, int* indices, +__global__ void KeMatrixTopK(T* output, int output_stride, int64_t* indices, const T* src, int lds, int dim, int k) { __shared__ Pair sh_topk[BlockSize]; __shared__ int maxid[BlockSize / 2]; @@ -293,7 +293,7 @@ class TopkOpCUDAKernel : public framework::OpKernel { T* output_data = output->mutable_data(ctx.GetPlace()); // FIXME(typhoonzero): data is always converted to type T? - int* indices_data = indices->mutable_data(ctx.GetPlace()); + int64_t* indices_data = indices->mutable_data(ctx.GetPlace()); size_t input_height = input->dims()[0]; size_t input_width = input->dims()[1]; diff --git a/paddle/operators/top_k_op.h b/paddle/operators/top_k_op.h index 4b248faa120bcfd20e70d288cce2d485d3e6371e..bc8563717a21bd5b3d8fc87f689657990066957b 100644 --- a/paddle/operators/top_k_op.h +++ b/paddle/operators/top_k_op.h @@ -40,7 +40,7 @@ class TopkKernel : public framework::OpKernel { const size_t k = static_cast(ctx.Attr("k")); T* output_data = output->mutable_data(ctx.GetPlace()); - T* indices_data = indices->mutable_data(ctx.GetPlace()); + int64_t* indices_data = indices->mutable_data(ctx.GetPlace()); auto eg_input = EigenMatrix::From(*input); @@ -66,7 +66,7 @@ class TopkKernel : public framework::OpKernel { }); for (size_t j = 0; j < k; j++) { output_data[i * k + j] = vec[j].first; - indices_data[i * k + j] = vec[j].second; + indices_data[i * k + j] = int64_t(vec[j].second); } } } diff --git a/paddle/platform/CMakeLists.txt b/paddle/platform/CMakeLists.txt index eb850b658583f2256629d63fdb64248dbf249937..bd86a9fe268c277065cd450f91b544def6c4d32f 100644 --- a/paddle/platform/CMakeLists.txt +++ b/paddle/platform/CMakeLists.txt @@ -9,7 +9,6 @@ cc_test(place_test SRCS place_test.cc DEPS place glog gflags) add_subdirectory(dynload) cc_test(enforce_test SRCS enforce_test.cc DEPS stringpiece) -cc_test(environment_test SRCS environment_test.cc DEPS stringpiece) IF(WITH_GPU) set(GPU_CTX_DEPS dynload_cuda dynamic_loader) diff --git a/paddle/platform/environment.h b/paddle/platform/environment.h deleted file mode 100644 index 4edcce932edc61453cef74f2c4ee0f72496b3677..0000000000000000000000000000000000000000 --- a/paddle/platform/environment.h +++ /dev/null @@ -1,60 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
*/ - -#pragma once - -#include -#include -#include - -#include "paddle/platform/enforce.h" -#include "paddle/string/piece.h" - -extern char** environ; // for environment variables - -namespace paddle { -namespace platform { - -inline void SetEnvVariable(const std::string& name, const std::string& value) { - PADDLE_ENFORCE_NE(setenv(name.c_str(), value.c_str(), 1), -1, - "Failed to set environment variable %s=%s", name, value); -} - -inline void UnsetEnvVariable(const std::string& name) { - PADDLE_ENFORCE_NE(unsetenv(name.c_str()), -1, - "Failed to unset environment variable %s", name); -} - -inline bool IsEnvVarDefined(const std::string& name) { - return std::getenv(name.c_str()) != nullptr; -} - -inline std::string GetEnvValue(const std::string& name) { - PADDLE_ENFORCE(IsEnvVarDefined(name), - "Tried to access undefined environment variable %s", name); - return std::getenv(name.c_str()); -} - -inline std::vector GetAllEnvVariables() { - std::vector vars; - for (auto var = environ; *var != nullptr; ++var) { - auto tail = string::Index(*var, "="); - auto name = string::SubStr(*var, 0, tail).ToString(); - vars.push_back(name); - } - return vars; -} - -} // namespace platform -} // namespace paddle diff --git a/paddle/platform/environment_test.cc b/paddle/platform/environment_test.cc deleted file mode 100644 index 5f136527215d6a676cfa1a3b08f09dfd3ab24a90..0000000000000000000000000000000000000000 --- a/paddle/platform/environment_test.cc +++ /dev/null @@ -1,54 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
*/ - -#include "paddle/platform/environment.h" - -#include "glog/logging.h" -#include "gtest/gtest.h" - -TEST(ENVIRONMENT, ACCESS) { - namespace platform = paddle::platform; - namespace string = paddle::string; - - platform::SetEnvVariable("PADDLE_USE_ENV", "TRUE"); - - EXPECT_TRUE(platform::IsEnvVarDefined("PADDLE_USE_ENV")); - EXPECT_EQ(platform::GetEnvValue("PADDLE_USE_ENV"), "TRUE"); - - platform::UnsetEnvVariable("PADDLE_USE_ENV"); - EXPECT_FALSE(platform::IsEnvVarDefined("PADDLE_USE_ENV")); - - platform::SetEnvVariable("PADDLE_USE_ENV1", "Hello "); - platform::SetEnvVariable("PADDLE_USE_ENV2", "World, "); - platform::SetEnvVariable("PADDLE_USE_ENV3", "PaddlePaddle!"); - - std::string env_info; - auto vars = platform::GetAllEnvVariables(); - for_each(vars.begin(), vars.end(), [&](const std::string& var) { - env_info += platform::GetEnvValue(var); - }); - - EXPECT_TRUE(string::Contains(env_info, "Hello World, PaddlePaddle!")); - platform::UnsetEnvVariable("PADDLE_USE_ENV1"); - platform::UnsetEnvVariable("PADDLE_USE_ENV2"); - platform::UnsetEnvVariable("PADDLE_USE_ENV3"); - - env_info.clear(); - vars = platform::GetAllEnvVariables(); - for_each(vars.begin(), vars.end(), [&](const std::string& var) { - env_info += platform::GetEnvValue(var); - }); - - EXPECT_FALSE(string::Contains(env_info, "Hello World, PaddlePaddle!")); - EXPECT_FALSE(platform::IsEnvVarDefined("PADDLE_USE_ENV1")); - EXPECT_FALSE(platform::IsEnvVarDefined("PADDLE_USE_ENV2")); - EXPECT_FALSE(platform::IsEnvVarDefined("PADDLE_USE_ENV3")); -} diff --git a/paddle/platform/gpu_info.cc b/paddle/platform/gpu_info.cc index 0cab5ffc5609bbd6fd08c74329d8370fb95f8102..f3455a8733862c91eaece629b6684d446672336c 100644 --- a/paddle/platform/gpu_info.cc +++ b/paddle/platform/gpu_info.cc @@ -17,7 +17,6 @@ limitations under the License. */ #include "gflags/gflags.h" #include "paddle/platform/enforce.h" -#include "paddle/platform/environment.h" DEFINE_double(fraction_of_gpu_memory_to_use, 0.95, "Default use 95% of GPU memory for PaddlePaddle," @@ -75,13 +74,6 @@ size_t GpuMaxChunkSize() { GpuMemoryUsage(available, total); - if (IsEnvVarDefined(kEnvFractionGpuMemoryToUse)) { - auto val = std::stod(GetEnvValue(kEnvFractionGpuMemoryToUse)); - PADDLE_ENFORCE_GT(val, 0.0); - PADDLE_ENFORCE_LE(val, 1.0); - FLAGS_fraction_of_gpu_memory_to_use = val; - } - // Reserving the rest memory for page tables, etc. size_t reserving = (1 - FLAGS_fraction_of_gpu_memory_to_use) * total; diff --git a/paddle/pybind/pybind.cc b/paddle/pybind/pybind.cc index bf6e12264269c7603484e0acf502adab25645856..2a0075356ed1e0f0b3501ac681c5e3a1bf37e2ca 100644 --- a/paddle/pybind/pybind.cc +++ b/paddle/pybind/pybind.cc @@ -14,6 +14,9 @@ limitations under the License. */ #include "paddle/pybind/protobuf.h" +#include // for call_once +#include +#include "gflags/gflags.h" #include "paddle/framework/backward.h" #include "paddle/framework/executor.h" #include "paddle/framework/feed_fetch_method.h" @@ -40,9 +43,27 @@ limitations under the License. 
*/ namespace paddle { namespace pybind { -static size_t UniqueIntegerGenerator() { - static std::atomic generator; - return generator.fetch_add(1); +static size_t UniqueIntegerGenerator(const std::string &prefix) { + static std::unordered_map> generators; + return generators[prefix].fetch_add(1); +} + +std::once_flag gflags_init_flag; + +// TODO(qijun) move init gflags to init.cc +void InitGflags(std::vector &argv) { + std::call_once(gflags_init_flag, [&]() { + int argc = argv.size(); + char **arr = new char *[argv.size()]; + std::string line; + for (size_t i = 0; i < argv.size(); i++) { + arr[i] = &argv[i][0]; + line += argv[i]; + line += ' '; + } + google::ParseCommandLineFlags(&argc, &arr, true); + VLOG(1) << "Init commandline: " << line; + }); } bool IsCompileGPU() { @@ -483,6 +504,7 @@ All parameter, weight, gradient are variables in Paddle. }); m.def("unique_integer", UniqueIntegerGenerator); + m.def("init_gflags", InitGflags); m.def("is_compile_gpu", IsCompileGPU); m.def("set_feed_variable", framework::SetFeedVariable); diff --git a/proto/CMakeLists.txt b/proto/CMakeLists.txt index 5d898d860cfc6dc26eaf5a81d8aed6d757ed5831..556bcd1d7e60c27fece43de666e9531ab4203414 100644 --- a/proto/CMakeLists.txt +++ b/proto/CMakeLists.txt @@ -27,3 +27,30 @@ foreach(filename ${proto_filenames}) endforeach() add_custom_target(gen_proto_py ALL DEPENDS ${PROTO_GEN_PY}) + + +if (WITH_GOLANG) + add_custom_target(protoc-gen-go) + add_custom_command(TARGET protoc-gen-go + COMMAND go + ARGS "get" "-u" "github.com/golang/protobuf/protoc-gen-go") + + set(PROTO_GEN_GO) + file(GLOB proto_filenames . OptimizerConfig.proto) + foreach(filename ${proto_filenames}) + message(STATUS ${filename}) + get_filename_component(ABS_FIL ${filename} ABSOLUTE) + get_filename_component(FIL_WE ${filename} NAME_WE) + set(CUR_PROTO_GEN_GO + ${PADDLE_SOURCE_DIR}/paddle/go/proto/${FIL_WE}.pb.go) + set(PROTO_GEN_GO + ${CUR_PROTO_GEN_GO} + ${PROTO_GEN_GO}) + add_custom_command(OUTPUT ${CUR_PROTO_GEN_GO} + COMMAND ${PROTOBUF_PROTOC_EXECUTABLE} + ARGS "--go_out=${PADDLE_SOURCE_DIR}/go/proto" + "-I" ${CMAKE_CURRENT_SOURCE_DIR} ${ABS_FIL} + DEPENDS ${ABS_FIL} protoc protoc-gen-go) + endforeach() + add_custom_target(gen_proto_go ALL DEPENDS ${PROTO_GEN_GO}) +endif() diff --git a/python/paddle/v2/framework/__init__.py b/python/paddle/v2/framework/__init__.py index c942373c667733f8aabe63026998a8915618130a..5df612bf3530c843c16b337f2b8f83445fcf39b5 100644 --- a/python/paddle/v2/framework/__init__.py +++ b/python/paddle/v2/framework/__init__.py @@ -1 +1,11 @@ +import sys +import core __all__ = ['proto'] +argv = [] +if core.is_compile_gpu(): + argv = list(sys.argv) + [ + "--tryfromenv=fraction_of_gpu_memory_to_use,use_pinned_memory" + ] +else: + argv = list(sys.argv) + ["--tryfromenv=use_pinned_memory"] +core.init_gflags(argv) diff --git a/python/paddle/v2/framework/framework.py b/python/paddle/v2/framework/framework.py index 43101c9ddad76b7c1c322130dc0362a5c8ea4336..f8d2f67410a6c06a1642180d2d62c881ec6bda3d 100644 --- a/python/paddle/v2/framework/framework.py +++ b/python/paddle/v2/framework/framework.py @@ -119,8 +119,9 @@ class Variable(object): @staticmethod def _unique_var_name_(): - uid = core.unique_integer() # unique during whole process. - return "_generated_var_%d" % uid + prefix = "_generated_var" + uid = core.unique_integer(prefix) # unique during whole process. 
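        # The C++ generator above keeps one atomic counter per prefix, so each
        # prefix gets its own 0, 1, 2, ... sequence (e.g. "_generated_var_0",
        # "fc_0", "fc_1").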
+ return "_".join([prefix, str(uid)]) @staticmethod def _convert_np_dtype_to_dtype_(np_dtype): diff --git a/python/paddle/v2/framework/layer_helper.py b/python/paddle/v2/framework/layer_helper.py index 1f72c9bc7b0ceda1dd954703fcc10c77a3e5ed25..d96dbe172c22617182e7ebf4aab175c6142352b7 100644 --- a/python/paddle/v2/framework/layer_helper.py +++ b/python/paddle/v2/framework/layer_helper.py @@ -8,7 +8,7 @@ from paddle.v2.framework.framework import Variable, g_program, \ def unique_name(prefix): - uid = core.unique_integer() # unique during whole process. + uid = core.unique_integer(prefix) # unique during whole process. return "_".join([prefix, str(uid)]) diff --git a/python/paddle/v2/framework/layers.py b/python/paddle/v2/framework/layers.py index 041a3b2c0b03c8171c2af9d856b33f461bb486c1..6451d11e2b68692527addb424b0cd716f23bd77a 100644 --- a/python/paddle/v2/framework/layers.py +++ b/python/paddle/v2/framework/layers.py @@ -5,7 +5,7 @@ import re __all__ = [ 'fc', 'data', 'cross_entropy', 'conv2d', 'pool2d', 'embedding', 'concat', - 'StaticRNN', 'cast' + 'StaticRNN', 'cast', 'sequence_conv', 'sequence_pool', 'accuracy' ] @@ -150,7 +150,7 @@ def _create_op_func_(op_type): outputs[name] = [helper.create_tmp_variable(dtype=dtype)] helper.append_op( type=op_type, inputs=inputs, outputs=outputs, attrs=kwargs) - return out + return helper.append_activation(out) func.__name__ = op_type globals()[op_type] = func @@ -160,10 +160,23 @@ def _create_op_func_(op_type): _create_op_func_('mean') _create_op_func_('mul') +_create_op_func_('elementwise_add') _create_op_func_('dropout') _create_op_func_('reshape') +def cast(x, data_type, program=None): + helper = LayerHelper('cast', **locals()) + out = helper.create_tmp_variable(dtype=data_type) + helper.append_op( + type='cast', + inputs={'X': [x]}, + outputs={'Out': [out]}, + attrs={'in_data_type': x.data_type, + 'out_data_type': out.data_type}) + return out + + def cast(x, data_type, program=None): helper = LayerHelper('cast', **locals()) out = helper.create_tmp_variable(dtype=data_type) @@ -212,13 +225,73 @@ def square_error_cost(input, label, **kwargs): square_out = helper.create_tmp_variable(dtype=input.data_type) helper.append_op( - type='pow', - inputs={'X': [minus_out]}, - outputs={'Y': [square_out]}, - attrs={'factor': 2.0}) + type='square', inputs={'X': [minus_out]}, outputs={'Y': [square_out]}) return square_out +def accuracy(input, label, k=1, **kwargs): + helper = LayerHelper("accuracy", **kwargs) + topk_out = helper.create_tmp_variable(dtype=input.data_type) + topk_indices = helper.create_tmp_variable(dtype="int64") + helper.append_op( + type="top_k", + inputs={"X": [input]}, + outputs={"Out": [topk_out], + "Indices": [topk_indices]}, + attrs={"k": k}) + acc_out_dtype = kwargs.get("out_dtype", "float32") + acc_out = helper.create_tmp_variable(dtype=acc_out_dtype) + helper.append_op( + type="accuracy", + inputs={ + "Out": [topk_out], + "Indices": [topk_indices], + "Label": [label] + }, + outputs={"Accuracy": [acc_out]}) + return acc_out + + +def sequence_conv(input, + num_filters, + name=None, + filter_size=3, + act=None, + stride=1, + padding=None, + bias_attr=None, + param_attr=None, + program=None, + init_program=None): + # FIXME(dzh) : want to unify the argument of python layer + # function. So we ignore some unecessary attributes. + # such as, padding_trainable, context_start. 
+ + helper = LayerHelper('sequence_conv', **locals()) + dtype = helper.input_dtype() + + filter_shape = [num_filters, filter_size] + filter = helper.create_parameter( + attr=helper.param_attr, shape=filter_shape, dtype=dtype) + pre_bias = helper.create_tmp_variable(dtype) + + helper.append_op( + type='sequence_conv', + inputs={ + 'X': [input], + 'Filter': filter, + }, + outputs={"Out": pre_bias}, + attrs={ + 'context_stride': stride, + 'context_start': 0, + 'context_length': filter_size + }) + + pre_act = helper.append_bias_op(pre_bias) + return helper.append_activation(pre_act) + + def conv2d(input, num_filters, name=None, @@ -271,6 +344,35 @@ def conv2d(input, return helper.append_activation(pre_act) +def sequence_pool(input, + pool_size, + pool_type, + pool_stride=1, + pool_padding=0, + global_pooling=False, + program=None, + init_program=None): + # FIXME(dzh) : want to unify the argument of python layer + # function. So we ignore some unecessary attributes + + ENUM_POOL_TYPE = set(["max", "avg", "sqrt", "last", "first"]) + if pool_type not in ENUM_POOL_TYPE: + raise ValueError("Unknown pool_type: '%s'. It can only be %s.", + str(pool_type), " ".join(ENUM_POOL_TYPE)) + + helper = LayerHelper('sequence_pool', **locals()) + dtype = helper.input_dtype() + pool_out = helper.create_tmp_variable(dtype) + + helper.append_op( + type="sequence_pool", + inputs={"X": [input]}, + outputs={"Out": pool_out}, + attrs={"strategy": pool_type}) + + return pool_out + + def pool2d(input, pool_size, pool_type, @@ -290,7 +392,7 @@ def pool2d(input, if isinstance(pool_padding, int): pool_padding = [pool_padding, pool_padding] - helper = LayerHelper('conv2d', **locals()) + helper = LayerHelper('pool2d', **locals()) dtype = helper.input_dtype() pool_out = helper.create_tmp_variable(dtype) diff --git a/python/paddle/v2/framework/nets.py b/python/paddle/v2/framework/nets.py index 803534fa391c49d646c5d98a442d35d06b98603e..a9998073e164a223e5d99fc26146ba48027d7a3e 100644 --- a/python/paddle/v2/framework/nets.py +++ b/python/paddle/v2/framework/nets.py @@ -1,9 +1,11 @@ import paddle.v2.framework.layers as layers +__all__ = ["simple_img_conv_pool", "sequence_conv_pool"] + def simple_img_conv_pool(input, - filter_size, num_filters, + filter_size, pool_size, pool_stride, act, @@ -94,3 +96,29 @@ def img_conv_group(input, program=program, init_program=init_program) return pool_out + + +def sequence_conv_pool(input, + num_filters, + filter_size, + pool_size, + pool_stride, + act, + program=None, + init_program=None): + conv_out = layers.sequence_conv( + input=input, + num_filters=num_filters, + filter_size=filter_size, + act=act, + program=program, + init_program=init_program) + + pool_out = layers.sequence_pool( + input=conv_out, + pool_size=pool_size, + pool_type='max', + pool_stride=pool_stride, + program=program, + init_program=init_program) + return pool_out diff --git a/python/paddle/v2/framework/tests/op_test.py b/python/paddle/v2/framework/tests/op_test.py index 50360e6e729df2957a5c7fe871100b5a53bd9305..2e6710b5fcfe5a531067498e38a4cb93d3165602 100644 --- a/python/paddle/v2/framework/tests/op_test.py +++ b/python/paddle/v2/framework/tests/op_test.py @@ -281,7 +281,8 @@ class OpTest(unittest.TestCase): type(sub_out)) for sub_out_name, expect in sub_out: idx = find_actual(sub_out_name, fetch_list) - actual_t = np.array(outs[idx]) + actual = outs[idx] + actual_t = np.array(actual) expect_t = expect[0] \ if isinstance(expect, tuple) else expect self.assertTrue( @@ -291,11 +292,12 @@ class OpTest(unittest.TestCase): 
str(place)) if isinstance(expect, tuple): self.assertListEqual( - actual_t.lod(), expect[1], "Output (" + sub_out_name - + ") has different lod at " + str(place)) + actual.lod(), expect[1], "Output (" + sub_out_name + + ") has different lod at " + str(place)) else: idx = find_actual(out_name, fetch_list) - actual_t = outs[idx] + actual = outs[idx] + actual_t = np.array(actual) expect = self.outputs[out_name] expect_t = expect[0] if isinstance(expect, tuple) else expect self.assertTrue( @@ -303,7 +305,7 @@ class OpTest(unittest.TestCase): actual_t, expect_t, atol=atol), "Output (" + out_name + ") has diff at " + str(place)) if isinstance(expect, tuple): - self.assertListEqual(actual_t.lod(), expect[1], + self.assertListEqual(actual.lod(), expect[1], "Output (" + out_name + ") has different lod at " + str(place)) diff --git a/python/paddle/v2/framework/tests/test_accuracy_op.py b/python/paddle/v2/framework/tests/test_accuracy_op.py index 02be9a02910bee3eae63e12cceaa51cf53591539..6536c297e8e559bf04fe6ef3b0e2dadd1914eb87 100644 --- a/python/paddle/v2/framework/tests/test_accuracy_op.py +++ b/python/paddle/v2/framework/tests/test_accuracy_op.py @@ -7,12 +7,13 @@ class TestAccuracyOp(OpTest): def setUp(self): self.op_type = "accuracy" n = 8192 - infer = np.random.randint(0, 2, (n, 1)).astype("int") - label = np.random.randint(0, 2, (n, )).astype("int") - self.inputs = {'Inference': infer, "Label": label} + infer = np.random.random((n, 1)).astype("float32") + indices = np.random.randint(0, 2, (n, 1)) + label = np.random.randint(0, 2, (n, 1)) + self.inputs = {'Out': infer, 'Indices': indices, "Label": label} num_correct = 0 for rowid in xrange(n): - for ele in infer[rowid]: + for ele in indices[rowid]: if ele == label[rowid]: num_correct += 1 break diff --git a/python/paddle/v2/framework/tests/test_auc_op.py b/python/paddle/v2/framework/tests/test_auc_op.py index 65f679cfccccae41b8924bc68833c1703dd3671d..26ea905d88093605dff820b178996a5724becf82 100644 --- a/python/paddle/v2/framework/tests/test_auc_op.py +++ b/python/paddle/v2/framework/tests/test_auc_op.py @@ -6,10 +6,11 @@ from op_test import OpTest class TestAucOp(OpTest): def setUp(self): self.op_type = "auc" - pred = np.random.random((128)).astype("float32") - labels = np.random.randint(0, 2, (128, )) + pred = np.random.random((128, 2)).astype("float32") + indices = np.random.randint(0, 2, (128, 2)) + labels = np.random.randint(0, 2, (128, 1)) num_thresholds = 200 - self.inputs = {'Inference': pred, 'Label': labels} + self.inputs = {'Out': pred, 'Indices': indices, 'Label': labels} self.attrs = {'curve': 'ROC', 'num_thresholds': num_thresholds} # NOTE: sklearn use a different way to generate thresholds # which will cause the result differs slightly: @@ -31,12 +32,12 @@ class TestAucOp(OpTest): tp, fn, tn, fp = 0, 0, 0, 0 for i, lbl in enumerate(labels): if lbl: - if pred[i] >= thresh: + if pred[i, 0] >= thresh: tp += 1 else: fn += 1 else: - if pred[i] >= thresh: + if pred[i, 0] >= thresh: fp += 1 else: tn += 1 @@ -62,6 +63,5 @@ class TestAucOp(OpTest): self.check_output() -# TODO(typhoonzero): add this back till we fix it -#if __name__ == "__main__": -# unittest.main() +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/v2/framework/tests/test_image_classification_layer.py b/python/paddle/v2/framework/tests/test_image_classification_layer.py index 908cf44b88a5de88690f5e17a1da1b5f8b1d8079..b4eda13552e60f009ec910e3d21e9d77107133a1 100644 --- a/python/paddle/v2/framework/tests/test_image_classification_layer.py +++ 
b/python/paddle/v2/framework/tests/test_image_classification_layer.py @@ -37,7 +37,7 @@ class TestLayer(unittest.TestCase): layers.batch_norm( input=images, program=program, init_program=init_program) - #print str(program) + # print str(program) def test_dropout_layer(self): program = Program() @@ -53,7 +53,7 @@ class TestLayer(unittest.TestCase): program=program, init_program=init_program) - #print str(program) + # print str(program) def test_img_conv_group(self): program = Program() @@ -70,6 +70,29 @@ class TestLayer(unittest.TestCase): # print str(program) + def test_elementwise_add_with_act(self): + program = Program() + init_program = Program() + image1 = layers.data( + name='pixel1', + shape=[3, 48, 48], + data_type='float32', + program=program, + init_program=init_program) + image2 = layers.data( + name='pixel2', + shape=[3, 48, 48], + data_type='float32', + program=program, + init_program=init_program) + out = layers.elementwise_add( + x=image1, + y=image2, + act='relu', + program=program, + init_program=init_program) + # print(program) + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/v2/framework/tests/test_image_classification_train.py b/python/paddle/v2/framework/tests/test_image_classification_train.py index 4eb9051261ee6786ba78f62ea3bfd89ae90e1d74..6b6dec4976d510fae7e987ad0276b049bbcb96fa 100644 --- a/python/paddle/v2/framework/tests/test_image_classification_train.py +++ b/python/paddle/v2/framework/tests/test_image_classification_train.py @@ -10,6 +10,120 @@ from paddle.v2.framework.executor import Executor import numpy as np +def resnet_cifar10(input, depth=32, program=None, init_program=None): + def conv_bn_layer(input, + ch_out, + filter_size, + stride, + padding, + act='relu', + program=None, + init_program=None): + tmp = layers.conv2d( + input=input, + filter_size=filter_size, + num_filters=ch_out, + stride=stride, + padding=padding, + act=None, + bias_attr=False, + program=program, + init_program=init_program) + return layers.batch_norm( + input=tmp, act=act, program=program, init_program=init_program) + + def shortcut(input, ch_in, ch_out, stride, program, init_program): + if ch_in != ch_out: + return conv_bn_layer(input, ch_out, 1, stride, 0, None, program, + init_program) + else: + return input + + def basicblock(input, + ch_in, + ch_out, + stride, + program=program, + init_program=init_program): + tmp = conv_bn_layer( + input, + ch_out, + 3, + stride, + 1, + program=program, + init_program=init_program) + tmp = conv_bn_layer( + tmp, + ch_out, + 3, + 1, + 1, + act=None, + program=program, + init_program=init_program) + short = shortcut(input, ch_in, ch_out, stride, program, init_program) + return layers.elementwise_add( + x=tmp, + y=short, + act='relu', + program=program, + init_program=init_program) + + def layer_warp(block_func, input, ch_in, ch_out, count, stride, program, + init_program): + tmp = block_func(input, ch_in, ch_out, stride, program, init_program) + for i in range(1, count): + tmp = block_func(tmp, ch_out, ch_out, 1, program, init_program) + return tmp + + assert (depth - 2) % 6 == 0 + n = (depth - 2) / 6 + conv1 = conv_bn_layer( + input=input, + ch_out=16, + filter_size=3, + stride=1, + padding=1, + program=program, + init_program=init_program) + res1 = layer_warp( + basicblock, + conv1, + 16, + 16, + n, + 1, + program=program, + init_program=init_program) + res2 = layer_warp( + basicblock, + res1, + 16, + 32, + n, + 2, + program=program, + init_program=init_program) + res3 = layer_warp( + basicblock, + res2, + 32, + 64, + n, 
+        2,
+        program=program,
+        init_program=init_program)
+    pool = layers.pool2d(
+        input=res3,
+        pool_size=8,
+        pool_type='avg',
+        pool_stride=1,
+        program=program,
+        init_program=init_program)
+    return pool
+
+
 def vgg16_bn_drop(input, program, init_program):
     def conv_block(input,
                    num_filter,
@@ -75,8 +189,16 @@ label = layers.data(
     data_type='int64',
     program=program,
     init_program=init_program)
-vgg_net = vgg16_bn_drop(images, program, init_program)
-predict = layers.fc(input=vgg_net,
+
+# Add neural network config
+# option 1. resnet
+net = resnet_cifar10(images, 32, program, init_program)
+# option 2. vgg
+# net = vgg16_bn_drop(images, program, init_program)
+
+# print(program)
+
+predict = layers.fc(input=net,
                     size=classdim,
                     act='softmax',
                     program=program,
@@ -123,8 +245,8 @@ for pass_id in range(PASS_NUM):
                        fetch_list=[avg_cost])
 
         loss = np.array(outs[0])
-        # print("pass_id:" + str(pass_id) + " batch_id:" + str(batch_id) +
-        #       " loss:" + str(loss))
+        print("pass_id:" + str(pass_id) + " batch_id:" + str(batch_id) +
+              " loss:" + str(loss))
         batch_id = batch_id + 1
 
         if batch_id > 1:
diff --git a/python/paddle/v2/framework/tests/test_recognize_digits_conv.py b/python/paddle/v2/framework/tests/test_recognize_digits_conv.py
index a9b6c8410e2af36e6928b2fac919398473611728..92b1d0542619b765cc32c98f59604cfc73d7d6d4 100644
--- a/python/paddle/v2/framework/tests/test_recognize_digits_conv.py
+++ b/python/paddle/v2/framework/tests/test_recognize_digits_conv.py
@@ -51,12 +51,14 @@ predict = layers.fc(input=conv_pool_2,
 cost = layers.cross_entropy(
     input=predict, label=label, program=program, init_program=init_program)
 avg_cost = layers.mean(x=cost, program=program)
+accuracy = layers.accuracy(
+    input=predict, label=label, program=program, init_program=init_program)
 
 sgd_optimizer = optimizer.SGDOptimizer(learning_rate=0.001)
 opts = sgd_optimizer.minimize(avg_cost)
 
 BATCH_SIZE = 50
-PASS_NUM = 1
+PASS_NUM = 3
 train_reader = paddle.batch(
     paddle.reader.shuffle(
         paddle.dataset.mnist.train(), buf_size=500),
@@ -83,10 +85,11 @@ for pass_id in range(PASS_NUM):
         outs = exe.run(program,
                        feed={"pixel": tensor_img,
                              "label": tensor_y},
-                       fetch_list=[avg_cost])
-
+                       fetch_list=[avg_cost, accuracy])
         loss = np.array(outs[0])
+        acc = np.array(outs[1])
 
-        if loss < 10.0:
-            exit(0)  # if avg cost less than 10.0, we think our code is good.
+        if loss < 10.0 and acc > 0.9:
+            # if avg cost is less than 10.0 and accuracy is larger than 0.9, we think our code is good.
+            exit(0)
 exit(1)
diff --git a/python/paddle/v2/framework/tests/test_seq_conv.py b/python/paddle/v2/framework/tests/test_seq_conv.py
index f0337c20a9e87fab971f9d9e2a113346feb20957..14edc5f953022ca05f5620c28bd7276d961dd4d0 100644
--- a/python/paddle/v2/framework/tests/test_seq_conv.py
+++ b/python/paddle/v2/framework/tests/test_seq_conv.py
@@ -45,10 +45,10 @@ class TestSeqProject(OpTest):
         self.inputs_val_no_f = ['PaddingData', 'X']
 
         self.attrs = {
-            'context_start': self.context_start,
-            'context_length': self.context_length,
-            'padding_trainable': self.padding_trainable,
-            'context_stride': self.context_stride
+            'contextStart': self.context_start,
+            'contextLength': self.context_length,
+            'paddingTrainable': self.padding_trainable,
+            'contextStride': self.context_stride
         }
         out = np.zeros(
             (self.input_size[0], self.output_represention)).astype('float32')
diff --git a/python/paddle/v2/framework/tests/test_top_k_op.py b/python/paddle/v2/framework/tests/test_top_k_op.py
index 694f37d612d4c46e673dc894b05a0a446190732c..6e8fbefa6eafa391cdb5e17c882ee74b5bdc6507 100644
--- a/python/paddle/v2/framework/tests/test_top_k_op.py
+++ b/python/paddle/v2/framework/tests/test_top_k_op.py
@@ -9,7 +9,7 @@ class TestTopkOp(OpTest):
         k = 1
         input = np.random.random((32, 84)).astype("float32")
         output = np.ndarray((32, k))
-        indices = np.ndarray((32, k))
+        indices = np.ndarray((32, k)).astype("int64")
 
         self.inputs = {'X': input}
         self.attrs = {'k': k}
@@ -32,7 +32,7 @@ class TestTopkOp3d(OpTest):
         input = np.random.random((32, 2, 84)).astype("float32")
         input_flat_2d = input.reshape(64, 84)
         output = np.ndarray((64, k))
-        indices = np.ndarray((64, k)).astype("int")
+        indices = np.ndarray((64, k)).astype("int64")
 
         # FIXME: should use 'X': input for a 3d input
         self.inputs = {'X': input_flat_2d}
diff --git a/python/paddle/v2/trainer.py b/python/paddle/v2/trainer.py
index b68fd0d5a97a7993ddd0a1d947304fa5428c01b8..db01ab7374eca18b6063dc634da5ef83c4bc9adc 100644
--- a/python/paddle/v2/trainer.py
+++ b/python/paddle/v2/trainer.py
@@ -205,7 +205,8 @@ class SGD(object):
         """
         Testing method. Will test input data.
 
-        :param reader: A reader that reads and yeilds data items.
+        :param reader: A batch reader that reads and yields data items; it
+            should be created by paddle.v2.batch.
         :type reader: collections.Iterable
         :param feeding: Feeding is a map of neural network input name and array
             index that reader returns.
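
For reference, the batch-reader requirement documented in the `trainer.py` hunk above can be illustrated as follows. This is a minimal sketch assuming the `paddle.v2` API already used elsewhere in this diff; `trainer` stands for an already-constructed `paddle.v2.trainer.SGD` instance, and the feeding map is hypothetical:

```python
import paddle.v2 as paddle

# An item-level reader yields one (features, label) tuple at a time;
# per the updated docstring, SGD.test must not receive it directly.
item_reader = paddle.dataset.mnist.test()

# Wrapping it with paddle.batch turns it into a batch reader that
# yields lists of items, which is what SGD.test expects.
batch_reader = paddle.batch(item_reader, batch_size=128)

# Hypothetical usage: 'trainer' is an existing paddle.v2.trainer.SGD
# instance, and the feeding map ties input names to reader columns.
result = trainer.test(reader=batch_reader, feeding={'pixel': 0, 'label': 1})
```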