Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into HEAD

37792e54 · Yu Yang · 23bbaada · 8a0dd240 · 37792e54 · 23bbaada
73 changed files
--- a/AUTHORS.md
+++ b/AUTHORS.md
@@ -2,7 +2,7 @@
 |---|---|
 | backyes | Yan-Fei Wang |
 | beckett1124 | Bin Qi |
-| Canpio | Jia-Yi Feng |
+| JiayiFeng | Jia-Yi Feng |
 | chengxiaohua1105 | Xiao-Hua Cheng |
 | cxwangyi, yiwangbaidu, wangkuiyi | Yi Wang |
 | cxysteven | Xing-Yi Cheng |

--- a/doc/build_and_install/build_cn.md
+++ b/doc/build_and_install/build_cn.md
-# 用Docker编译和测试PaddlePaddle
-
-## 需要的软硬件
-
-为了开发PaddlePaddle，我们需要
-
-1. 一台电脑，可以装的是 Linux, BSD, Windows 或者 MacOS 操作系统，以及
-1. Docker。
-
-不需要依赖其他任何软件了。即便是 Python 和 GCC 都不需要，因为我们会把所有编译工具都安装进一个 Docker image 里。
-
-## 总体流程
-
-1. 获取源码
-
-   ```bash
-   git clone https://github.com/paddlepaddle/paddle
-   ```
-
-2. 安装开发工具到 Docker image 里
-
-   ```bash
-   cd paddle; docker build -t paddle:dev .
-   ```
-
-   请注意这个命令结尾处的 `.`；它表示 `docker build` 应该读取当前目录下的 [`Dockerfile`文件](https://github.com/PaddlePaddle/Paddle/blob/develop/Dockerfile)，按照其内容创建一个名为 `paddle:dev` 的 Docker image，并且把各种开发工具安装进去。
-
-3. 编译
-
-   以下命令启动一个 Docker container 来执行 `paddle:dev` 这个 Docker image，同时把当前目录（源码树根目录）映射为 container 里的 `/paddle` 目录，并且运行 `Dockerfile` 描述的默认入口程序 [`build.sh`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh)。这个脚本调用 `cmake` 和 `make` 来编译 `/paddle` 里的源码，结果输出到 `/paddle/build`，也就是本地的源码树根目录里的 `build` 子目录。
-
-   ```bash
-   docker run --rm -v $PWD:/paddle paddle:dev
-   ```
-
-   上述命令编译出一个 CUDA-enabled 版本。如果我们只需要编译一个只支持 CPU 的版本，可以用
-
-   ```bash
-   docker run --rm -e WITH_GPU=OFF -v $PWD:/paddle paddle:dev
-   ```
-
-4. 运行单元测试
-
-   用本机的第一个 GPU 来运行包括 GPU 单元测试在内的所有单元测试：
-
-   ```bash
-   NV_GPU=0 nvidia-docker run --rm -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest"
-   ```
-
-   如果编译的时候我们用了 `WITH_GPU=OFF` 选项，那么编译过程只会产生 CPU-based 单元测试，那么我们也就不需要 nvidia-docker 来运行单元测试了。我们只需要：
-
-   ```bash
-   docker run --rm -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest"
-   ```
-
-   有时候我们只想运行一个特定的单元测试，比如 `memory_test`，我们可以
-
-   ```bash
-   nvidia-docker run --rm -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest -V -R memory_test"
-   ```
-
-5. 清理
-
-   有时候我们会希望清理掉已经下载的第三方依赖以及已经编译的二进制文件。此时只需要：
-
-   ```bash
-   rm -rf build
-   ```
-
-## 为什么要 Docker 呀？
-
- 什么是 Docker?
-
-  如果您没有听说 Docker，可以把它想象为一个类似 virtualenv 的系统，但是虚拟的不仅仅是 Python 的运行环境。
-
- Docker 还是虚拟机？
-
-  有人用虚拟机来类比 Docker。需要强调的是：Docker 不会虚拟任何硬件，Docker container 里运行的编译工具实际上都是在本机的 CPU 和操作系统上直接运行的，性能和把编译工具安装在本机运行一样。
-
- 为什么用 Docker?
-
-  把工具和配置都安装在一个 Docker image 里可以标准化编译环境。这样如果遇到问题，其他人可以复现问题以便帮助。
-
-  另外，对于习惯使用Windows和MacOS的开发者来说，使用Docker就不用配置交叉编译环境了。
-
- 我可以选择不用Docker吗？
-
-  当然可以。大家可以用把开发工具安装进入 Docker image 一样的方式，把这些工具安装到本机。这篇文档介绍基于 Docker 的开发流程，是因为这个流程比其他方法都更简便。
-
- 学习 Docker 有多难？
-
-  理解 Docker 并不难，大概花十分钟看一下[这篇文章](https://zhuanlan.zhihu.com/p/19902938)。这可以帮您省掉花一小时安装和配置各种开发工具，以及切换机器时需要新安装的辛苦。别忘了 PaddlePaddle 更新可能导致需要新的开发工具。更别提简化问题复现带来的好处了。
-
- 我可以用 IDE 吗？
-
-  当然可以，因为源码就在本机上。IDE 默认调用 make 之类的程序来编译源码，我们只需要配置 IDE 来调用 Docker 命令编译源码即可。
-
-  很多 PaddlePaddle 开发者使用 Emacs。他们在自己的 `~/.emacs` 配置文件里加两行
-
-  ```emacs
-  (global-set-key "\C-cc" 'compile)
-  (setq compile-command
-   "docker run --rm -it -v $(git rev-parse --show-toplevel):/paddle paddle:dev")
-  ```
-
-  就可以按 `Ctrl-C` 和 `c` 键来启动编译了。
-
- 可以并行编译吗？
-
-  是的。我们的 Docker image 运行一个 [Bash 脚本](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh)。这个脚本调用 `make -j$(nproc)` 来启动和 CPU 核一样多的进程来并行编译。
-
-## 可能碰到的问题
-
- Docker 需要 sudo
-
-  如果用自己的电脑开发，自然也就有管理员权限（sudo）了。如果用公用的电脑开发，需要请管理员安装和配置好 Docker。此外，PaddlePaddle 项目在努力开始支持其他不需要 sudo 的集装箱技术，比如 rkt。
-
- 在 Windows/MacOS 上编译很慢
-
-  Docker 在 Windows 和 MacOS 都可以运行。不过实际上是运行在一个 Linux 虚拟机上。可能需要注意给这个虚拟机多分配一些 CPU 和内存，以保证编译高效。具体做法请参考[这个issue](https://github.com/PaddlePaddle/Paddle/issues/627)。
-
- 磁盘不够
-
-  本文中的例子里，`docker run` 命令里都用了 `--rm` 参数，这样保证运行结束之后的 containers 不会保留在磁盘上。可以用 `docker ps -a` 命令看到停止后但是没有删除的 containers。`docker build` 命令有时候会产生一些中间结果，是没有名字的 images，也会占用磁盘。可以参考[这篇文章](https://zaiste.net/posts/removing_docker_containers/)来清理这些内容。
--- a/doc/build_and_install/build_en.md
+++ b/doc/build_and_install/build_en.md
-# Build using Docker
-
-## What Developers Need
-
-To contribute to PaddlePaddle, you need
-
-1. A computer -- Linux, BSD, Windows, MacOS, and
-1. Docker.
-
-Nothing else.  Not even Python and GCC, because you can install all build tools into a Docker image.  We run all the tools by running this image.
-
-## General Process
-
-1. Retrieve source code.
-
-   ```bash
-   git clone https://github.com/paddlepaddle/paddle
-   ```
-
-2. Install build tools into a Docker image.
-
-   ```bash
-   cd paddle; docker build -t paddle:dev .
-   ```
-
-   Please be aware of the `.` at the end of the command, which refers to the [`./Dockerfile` file](https://github.com/PaddlePaddle/Paddle/blob/develop/Dockerfile).  `docker build` follows instructions in this file to create a Docker image named `paddle:dev`, and installs building tools into it.
-
-3. Build from source.
-
-   This following command starts a Docker container that executes the Docker image `paddle:dev`, mapping the current directory to `/paddle/` in the container, and runs the default entry-point [`build.sh`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh) as specified in the Dockefile.  `build.sh` invokes `cmake` and `make` to build PaddlePaddle source code, which had been mapped to `/paddle`, and writes outputs to `/paddle/build`, which maps to `build` in the current source directory on the computer.
-
-   ```bash
-   docker run -v $PWD:/paddle paddle:dev
-   ```
-
-   Above command builds a CUDA-enabled version.  If we want to build a CPU-only version, we can type
-
-   ```bash
-   docker run -e WITH_GPU=OFF -v $PWD:/paddle paddle:dev
-   ```
-
-4. Run unit tests.
-
-   To run all unit tests using the first GPU of a node:
-
-   ```bash
-   NV_GPU=0 nvidia-docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest"
-   ```
-
-   If we used `WITH_GPU=OFF` at build time, it generates only CPU-based unit tests, and we don't need nvidia-docker to run them.  We can just run
-
-   ```bash
-   docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest"
-   ```
-
-   Sometimes we want to run a specific unit test, say `memory_test`, we can run
-
-   ```bash
-   nvidia-docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest -V -R memory_test"
-   ```
-
-5. Clean Build.
-
-   Sometimes, we might want to clean all thirt-party dependents and built binaries.  To do so, just
-
-   ```bash
-   rm -rf build
-   ```
-
-## Docker, Or Not?
-
- What is Docker?
-
-  If you haven't heard of it, consider it something like Python's virtualenv.
-
- Docker or virtual machine?
-
-  Some people compare Docker with VMs, but Docker doesn't virtualize any hardware nor running a guest OS, which means there is no compromise on the performance.
-
- Why Docker?
-
-  Using a Docker image of build tools standardizes the building environment, which makes it easier for others to reproduce your problems and to help.
-
-  Also, some build tools don't run on Windows or Mac or BSD, but Docker runs almost everywhere, so developers can use whatever computer they want.
-
- Can I choose not to use Docker?
-
-  Sure, you don't have to install build tools into a Docker image; instead, you can install them in your local computer.  This document exists because Docker would make the development way easier.
-
- How difficult is it to learn Docker?
-
-    It takes you ten minutes to read [an introductory article](https://docs.docker.com/get-started) and saves you more than one hour to install all required build tools, configure them, especially when new versions of PaddlePaddle require some new tools.  Not even to mention the time saved when other people trying to reproduce the issue you have.
-
- Can I use my favorite IDE?
-
-  Yes, of course.  The source code resides on your local computer, and you can edit it using whatever editor you like.
-
-  Many PaddlePaddle developers are using Emacs.  They add the following few lines into their `~/.emacs` configure file:
-
-  ```emacs
-  (global-set-key "\C-cc" 'compile)
-  (setq compile-command
-   "docker run --rm -it -v $(git rev-parse --show-toplevel):/paddle paddle:dev")
-  ```
-
-  so they could type `Ctrl-C` and `c` to build PaddlePaddle from source.
-
- Does Docker do parallel building?
-
-  Our building Docker image runs a [Bash script](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh), which calls `make -j$(nproc)` to starts as many processes as the number of your CPU cores.
-
-## Some Gotchas
-
- Docker requires sudo
-
-  An owner of a computer has the administrative privilege, a.k.a., sudo, and Docker requires this privilege to work properly.  If you use a shared computer for development, please ask the administrator to install and configure Docker.  We will do our best to support rkt, another container technology that doesn't require sudo.
-
- Docker on Windows/MacOS builds slowly
-
-  On Windows and MacOS, Docker containers run in a Linux VM.  You might want to give this VM some more memory and CPUs so to make the building efficient.  Please refer to [this issue](https://github.com/PaddlePaddle/Paddle/issues/627) for details.
-
- Not enough disk space
-
-  Examples in this article uses option `--rm` with the `docker run` command.  This option ensures that stopped containers do not exist on hard disks.  We can use `docker ps -a` to list all containers, including stopped.  Sometimes `docker build` generates some intermediate dangling images, which also take disk space.  To clean them, please refer to [this article](https://zaiste.net/posts/removing_docker_containers/).
--- a/doc/build_and_install/build_from_source_cn.rst
+++ b/doc/build_and_install/build_from_source_cn.rst
 从源码编译
 ======================

+.. _requirements:
+
+需要的软硬件
+----------------
+
+为了编译PaddlePaddle，我们需要
+
+1. 一台电脑，可以装的是 Linux, Windows 或者 MacOS 操作系统
+2. Docker
+
+不需要依赖其他任何软件了。即便是 Python 和 GCC 都不需要，因为我们会把所有编译工具都安装进一个 Docker 镜像里。
+
 .. _build_step:

 编译方法
 ----------------

-PaddlePaddle主要使用 `CMake <https://cmake.org>`_ 以及GCC, G++作为编译工具。
-我们推荐您使用PaddlePaddle Docker编译环境镜像完成编译，这样可以免去单独安装编译依赖的步骤，可选的不同编译环境Docker镜像
-可以在 `这里 <https://hub.docker.com/r/paddlepaddle/paddle_manylinux_devel/tags/>`_ 找到。
+PaddlePaddle需要使用Docker环境完成编译，这样可以免去单独安装编译依赖的步骤，可选的不同编译环境Docker镜像
+可以在 `这里 <https://hub.docker.com/r/paddlepaddle/paddle_manylinux_devel/tags/>`_ 找到。或者
+参考下述可选步骤，从源码中构建用于编译PaddlePaddle的Docker镜像。

 如果您选择不使用Docker镜像，则需要在本机安装下面章节列出的 `编译依赖`_ 之后才能开始编译的步骤。

@@ -16,15 +28,19 @@ PaddlePaddle主要使用 `CMake <https://cmake.org>`_ 以及GCC, G++作为编译

 .. code-block:: bash

+   # 1. 获取源码
   git clone https://github.com/PaddlePaddle/Paddle.git
   cd Paddle
-   # 如果使用Docker编译环境，执行下面的命令编译CPU-Only的二进制
+   # 2. 可选步骤：源码中构建用于编译PaddlePaddle的Docker镜像
+   docker build -t paddle:dev .
+   # 3. 执行下面的命令编译CPU-Only的二进制
   docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 bash -x /paddle/paddle/scripts/docker/build.sh
-   # 如果不使用Docker编译环境，执行下面的命令
-   mkdir build
-   cd build
-   cmake -DWITH_GPU=OFF -DWITH_TESTING=OFF ..
-   make
+   # 4. 或者也可以使用为上述可选步骤构建的镜像（必须先执行第2步）
+   docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" paddle:dev
+
+注：上述命令把当前目录（源码树根目录）映射为 container 里的 :code:`/paddle` 目录。如果使用自行
+构建的镜像（上述第4步）会执行 :code:`Dockerfile` 描述的默认入口程序 :code:`build.sh` 可以省略步骤3中
+最后的执行脚本的命令。

 编译完成后会在build/python/dist目录下生成输出的whl包，可以选在在当前机器安装也可以拷贝到目标机器安装：

@@ -50,28 +66,83 @@ PaddlePaddle主要使用 `CMake <https://cmake.org>`_ 以及GCC, G++作为编译

 如果您期望在编译完成后立即执行所有的单元测试，可以按照下面的方法：

-使用Docker的情况下，设置 :code:`RUN_TEST=ON` 和 :code:`WITH_TESTING=ON` 就会在完成编译之后，立即执行单元测试。
+设置 :code:`RUN_TEST=ON` 和 :code:`WITH_TESTING=ON` 就会在完成编译之后，立即执行单元测试。
 开启 :code:`WITH_GPU=ON` 可以指定同时执行GPU上的单元测试。

 .. code-block:: bash

   docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=ON" -e "RUN_TEST=ON" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 bash -x /paddle/paddle/scripts/docker/build.sh

-如果不使用Docker，可以执行ctest命令即可：
+如果期望执行其中一个单元测试，（比如 :code:`test_sum_op` ）：

 .. code-block:: bash

-   mkdir build
-   cd build
-   cmake -DWITH_GPU=OFF -DWITH_TESTING=OFF ..
-   make
-   ctest
-   # 指定执行其中一个单元测试 test_mul_op
-   ctest -R test_mul_op
+   docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=ON" -e "RUN_TEST=OFF" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 /bin/bash
+   bash /paddle/paddle/scripts/docker/build.sh
+   cd /paddle/build
+   ctest -R test_sum_op -V
+
+.. _faq_docker:
+
+常见问题
+----------------
+
+- 什么是 Docker?
+
+  如果您没有听说 Docker，可以把它想象为一个类似 virtualenv 的系统，但是虚拟的不仅仅是 Python 的运行环境。
+
+- Docker 还是虚拟机？
+
+  有人用虚拟机来类比 Docker。需要强调的是：Docker 不会虚拟任何硬件，Docker container 里运行的编译工具实际上都是在本机的 CPU 和操作系统上直接运行的，性能和把编译工具安装在本机运行一样。
+
+- 为什么用 Docker?
+
+  把工具和配置都安装在一个 Docker image 里可以标准化编译环境。这样如果遇到问题，其他人可以复现问题以便帮助。
+
+  另外，对于习惯使用Windows和MacOS的开发者来说，使用Docker就不用配置交叉编译环境了。
+
+- 我可以选择不用Docker吗？
+
+  当然可以。大家可以用把开发工具安装进入 Docker image 一样的方式，把这些工具安装到本机。这篇文档介绍基于 Docker 的开发流程，是因为这个流程比其他方法都更简便。
+
+- 学习 Docker 有多难？
+
+  理解 Docker 并不难，大概花十分钟看一下[这篇文章](https://zhuanlan.zhihu.com/p/19902938)。这可以帮您省掉花一小时安装和配置各种开发工具，以及切换机器时需要新安装的辛苦。别忘了 PaddlePaddle 更新可能导致需要新的开发工具。更别提简化问题复现带来的好处了。
+
+- 我可以用 IDE 吗？
+
+  当然可以，因为源码就在本机上。IDE 默认调用 make 之类的程序来编译源码，我们只需要配置 IDE 来调用 Docker 命令编译源码即可。
+
+  很多 PaddlePaddle 开发者使用 Emacs。他们在自己的 `~/.emacs` 配置文件里加两行
+
+  ```emacs
+  (global-set-key "\C-cc" 'compile)
+  (setq compile-command
+   "docker run --rm -it -v $(git rev-parse --show-toplevel):/paddle paddle:dev")
+  ```
+
+  就可以按 `Ctrl-C` 和 `c` 键来启动编译了。
+
+- 可以并行编译吗？
+
+  是的。我们的 Docker image 运行一个 [Bash 脚本](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh)。这个脚本调用 `make -j$(nproc)` 来启动和 CPU 核一样多的进程来并行编译。
+
+- Docker 需要 sudo
+
+  如果用自己的电脑开发，自然也就有管理员权限（sudo）了。如果用公用的电脑开发，需要请管理员安装和配置好 Docker。此外，PaddlePaddle 项目在努力开始支持其他不需要 sudo 的集装箱技术，比如 rkt。
+
+- 在 Windows/MacOS 上编译很慢
+
+  Docker 在 Windows 和 MacOS 都可以运行。不过实际上是运行在一个 Linux 虚拟机上。可能需要注意给这个虚拟机多分配一些 CPU 和内存，以保证编译高效。具体做法请参考[这个issue](https://github.com/PaddlePaddle/Paddle/issues/627)。
+
+- 磁盘不够
+
+  本文中的例子里，`docker run` 命令里都用了 `--rm` 参数，这样保证运行结束之后的 containers 不会保留在磁盘上。可以用 `docker ps -a` 命令看到停止后但是没有删除的 containers。`docker build` 命令有时候会产生一些中间结果，是没有名字的 images，也会占用磁盘。可以参考[这篇文章](https://zaiste.net/posts/removing_docker_containers/)来清理这些内容。
+

 .. _compile_deps:

-编译依赖
+附录：编译依赖
 ----------------

 PaddlePaddle编译需要使用到下面的依赖（包含但不限于），其他的依赖软件，会自动在编译时下载。
@@ -91,7 +162,7 @@ PaddlePaddle编译需要使用到下面的依赖（包含但不限于），其

 .. _build_options:

-编译选项
+附录：编译选项
 ----------------

 PaddlePaddle的编译选项，包括生成CPU/GPU二进制文件、链接何种BLAS库等。

--- a/doc/build_and_install/build_from_source_en.rst
+++ b/doc/build_and_install/build_from_source_en.rst
 Build from Sources
 ==========================

-.. _build_step:
+.. _requirements:

-How To Build
+Requirements
 ----------------

-PaddlePaddle mainly uses `CMake <https://cmake.org>`_ and GCC, G++ as compile
-tools. We recommend you to use our pre-built Docker image to run the build
-to avoid installing dependencies by yourself. We have several build environment
-Docker images `here <https://hub.docker.com/r/paddlepaddle/paddle_manylinux_devel/tags/>`_ .
+To build PaddlePaddle, you need
+
+1. A computer running Linux, Windows, or MacOS.
+2. Docker.
+
+Nothing else is needed, not even Python or GCC, because all build tools are installed into a Docker image.
+We run the tools by running this image.
+
+.. _build_step:

-If you choose not to use Docker image for your build, you need to install the
-below `Compile Dependencies`_ before run the build.
+How To Build
+----------------

-Then run:
+Build PaddlePaddle inside Docker so that you do not have to install
+the build dependencies yourself. Several pre-built Docker images are available
+`here <https://hub.docker.com/r/paddlepaddle/paddle_manylinux_devel/tags/>`_,
+or you can build your own image from source, as in the optional step below:

 .. code-block:: bash

+   # 1. clone the source code
   git clone https://github.com/PaddlePaddle/Paddle.git
   cd Paddle
-   # run the following command to build a CPU-Only binaries if you are using docker
+   # 2. Optional: build development docker image from source
+   docker build -t paddle:dev .
+   # 3. Run the following command to build CPU-only binaries
   docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 bash -x /paddle/paddle/scripts/docker/build.sh
-   # else run these commands
-   mkdir build
-   cd build
-   cmake -DWITH_GPU=OFF -DWITH_TESTING=OFF ..
-   make
+   # 4. Or, use your built Docker image to build PaddlePaddle (must run step 2)
+   docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" paddle:dev
+
+NOTE: The commands above mount the current working directory (the root of the source tree)
+at :code:`/paddle` inside the Docker container. If you use your own image
+(step 4), the container runs the default entry point :code:`build.sh`, so you can omit the
+trailing script invocation used in step 3.

 When the compile finishes, you can get the output whl package under
 build/python/dist, then you can choose to install the whl on local
@@ -61,22 +74,75 @@ Set :code:`WITH_GPU=ON` Can also run tests on GPU.

   docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=ON" -e "RUN_TEST=ON" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 bash -x paddle/paddle/scripts/docker/build.sh

-If you don't use Docker, just run ctest will start the tests:
+If you wish to run only one unit test, like :code:`test_sum_op`:

 .. code-block:: bash

-   mkdir build
-   cd build
-   cmake -DWITH_GPU=OFF -DWITH_TESTING=ON ..
-   make
-   ctest
-   # run a single test like test_mul_op
-   ctest -R test_mul_op
+   docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=ON" -e "RUN_TEST=OFF" paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 /bin/bash
+   bash /paddle/paddle/scripts/docker/build.sh
+   cd /paddle/build
+   ctest -R test_sum_op -V
+
+.. _faq_docker:
+
+Frequently Asked Questions
+--------------------------
+
+- What is Docker?
+
+  If you haven't heard of it, consider it something like Python's virtualenv.
+
+- Docker or virtual machine?
+
+  Some people compare Docker with VMs, but Docker neither virtualizes hardware nor runs a guest OS, so there is no compromise on performance.
+
+- Why Docker?
+
+  Using a Docker image of build tools standardizes the building environment, which makes it easier for others to reproduce your problems and to help.
+
+  Also, some build tools don't run on Windows or Mac or BSD, but Docker runs almost everywhere, so developers can use whatever computer they want.

+- Can I choose not to use Docker?
+
+  Sure, you don't have to install build tools into a Docker image; instead, you can install them on your local computer.  This document exists because Docker would make the development way easier.
+
+- How difficult is it to learn Docker?
+
+  It takes about ten minutes to read `an introductory article <https://docs.docker.com/get-started>`_, and it saves you more than an hour of installing and configuring the required build tools, especially when new versions of PaddlePaddle require new tools.  Not to mention the time saved when other people try to reproduce an issue you have.
+
+- Can I use my favorite IDE?
+
+  Yes, of course.  The source code resides on your local computer, and you can edit it using whatever editor you like.
+
+  Many PaddlePaddle developers use Emacs.  They add the following lines to their :code:`~/.emacs` configuration file:
+
+  .. code-block:: emacs
+
+     (global-set-key "\C-cc" 'compile)
+     (setq compile-command
+      "docker run --rm -it -v $(git rev-parse --show-toplevel):/paddle paddle:dev")
+
+  so they can type :code:`Ctrl-C` and :code:`c` to build PaddlePaddle from source.
+
+- Does Docker do parallel building?
+
+  Our build Docker image runs a `Bash script <https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh>`_, which calls :code:`make -j$(nproc)` to start as many processes as you have CPU cores.
+
+- Docker requires sudo
+
+  An owner of a computer has the administrative privilege, a.k.a., sudo, and Docker requires this privilege to work properly.  If you use a shared computer for development, please ask the administrator to install and configure Docker.  We will do our best to support rkt, another container technology that doesn't require sudo.
+
+- Docker on Windows/MacOS builds slowly
+
+  On Windows and MacOS, Docker containers run in a Linux VM.  You might want to give this VM more memory and CPUs to make the build efficient.  Please refer to `this issue <https://github.com/PaddlePaddle/Paddle/issues/627>`_ for details.
+
+- Not enough disk space
+
+  Passing the :code:`--rm` option to :code:`docker run` removes the container once it exits, so stopped containers do not accumulate on disk.  We can use :code:`docker ps -a` to list all containers, including stopped ones.  Sometimes :code:`docker build` generates intermediate dangling images, which also take disk space.  To clean them up, please refer to `this article <https://zaiste.net/posts/removing_docker_containers/>`_.

 .. _compile_deps:

-Compile Dependencies
+Appendix: Compile Dependencies
 ----------------

 PaddlePaddle need the following dependencies when compiling, other dependencies
@@ -97,17 +163,13 @@ will be downloaded automatically.

 .. _build_options:

-Build Options
+Appendix: Build Options
 ----------------

 Build options include whether build binaries for CPU or GPU, which BLAS
 library to use etc. You may pass these settings when running cmake.
 For detailed cmake tutorial please refer to `here <https://cmake.org/cmake-tutorial>`_ 。

-.. _build_options_bool:
-
-Bool Type Options
----------------

 You can add :code:`-D` argument to pass such options, like:


--- a/doc/build_and_install/index_cn.rst
+++ b/doc/build_and_install/index_cn.rst
@@ -13,7 +13,6 @@ PaddlePaddle提供pip和Docker的安装方式：

   pip_install_cn.rst
   docker_install_cn.rst
-   build_cn.md

 编译流程
 ++++++++

--- a/doc/build_and_install/index_en.rst
+++ b/doc/build_and_install/index_en.rst
@@ -13,8 +13,6 @@ You can choose either pip or Docker to complete your install:

   pip_install_en.rst
   docker_install_en.rst
-   build_en.md
-

 Build from Source
 -----------------

--- a/doc/design/switch_kernel.md
+++ b/doc/design/switch_kernel.md
--- a/doc/templates/conf.py.cn.in
+++ b/doc/templates/conf.py.cn.in
@@ -82,7 +82,7 @@ language = 'zh_CN'

 # List of patterns, relative to source directory, that match files and
 # directories to ignore when looking for source files.
-exclude_patterns = ['_build', '**/*_en*', '*_en*']
+exclude_patterns = ['_build', '**/*_en*', '*_en*', 'api/*']

 # The reST default role (used for this markup: `text`) to use for all
 # documents.

--- a/doc/templates/conf.py.en.in
+++ b/doc/templates/conf.py.en.in
@@ -82,7 +82,7 @@ language = None

 # List of patterns, relative to source directory, that match files and
 # directories to ignore when looking for source files.
-exclude_patterns = ['_build', '**/*_cn*', '*_cn*']
+exclude_patterns = ['_build', '**/*_cn*', '*_cn*', 'api/*']

 # The reST default role (used for this markup: `text`) to use for all
 # documents.

--- a/paddle/CMakeLists.txt
+++ b/paddle/CMakeLists.txt
@@ -11,7 +11,6 @@ if(MOBILE_INFERENCE)
 else()
  add_subdirectory(pserver)
  add_subdirectory(trainer)
-  add_subdirectory(string)
  add_subdirectory(scripts)

  if(WITH_C_API)

--- a/paddle/fluid/CMakeLists.txt
+++ b/paddle/fluid/CMakeLists.txt
@@ -4,3 +4,4 @@ add_subdirectory(framework)
 add_subdirectory(operators)
 add_subdirectory(pybind)
 add_subdirectory(inference)
+add_subdirectory(string)
--- a/paddle/fluid/framework/ddim.cc
+++ b/paddle/fluid/framework/ddim.cc
@@ -314,5 +314,15 @@ DDim stride(const DDim& ddim) {
  }
  return framework::make_ddim(strides);
 }
+
+DDim stride_numel(const framework::DDim& ddim) {
+  std::vector<int64_t> strides(ddim.size());
+  strides[ddim.size() - 1] = ddim[ddim.size() - 1];
+  for (int i = ddim.size() - 2; i >= 0; --i) {
+    strides[i] = strides[i + 1] * ddim[i];
+  }
+  return framework::make_ddim(strides);
+}
+
 }  // namespace framework
 }  // namespace paddle
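
The new `stride_numel` differs from the existing `stride` in what each entry means: entry `i` is the number of elements spanned by dimensions `i` through the last one, rather than the step between consecutive indices. The following is a minimal sketch on plain `std::vector` values, not the `DDim` implementation itself. `StrideNumel` and `Stride` are hypothetical names, and the body of `Stride` is an assumption about the conventional row-major definition, since the diff shows only the tail of `framework::stride`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for framework::stride_numel on a plain vector.
// Entry i is the element count of the sub-tensor covering dims i..rank-1.
std::vector<int64_t> StrideNumel(const std::vector<int64_t>& dims) {
  // The code in the diff indexes dims[size - 1], so a non-empty shape
  // is assumed here as well.
  assert(!dims.empty());
  std::vector<int64_t> strides(dims.size());
  strides.back() = dims.back();
  for (int i = static_cast<int>(dims.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * dims[i];
  }
  return strides;
}

// Assumed conventional row-major element strides, shown for comparison only.
std::vector<int64_t> Stride(const std::vector<int64_t>& dims) {
  assert(!dims.empty());
  std::vector<int64_t> strides(dims.size());
  strides.back() = 1;
  for (int i = static_cast<int>(dims.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * dims[i + 1];
  }
  return strides;
}
```

For a shape of `{2, 3, 4}`, `StrideNumel` yields `{24, 12, 4}` while `Stride` yields `{12, 4, 1}`. The first entry of the `stride_numel` result is therefore the total element count, which is what lets the concat kernels advance their offsets by `in_stride[axis]` directly.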
--- a/paddle/fluid/framework/ddim.h
+++ b/paddle/fluid/framework/ddim.h
@@ -125,6 +125,8 @@ DDim flatten_to_2d(const DDim& src, int num_col_dims);
 DDim flatten_to_1d(const DDim& src);

 DDim stride(const DDim& ddim);
+
+DDim stride_numel(const DDim& ddim);
 }  // namespace framework
 }  // namespace paddle


--- a/paddle/fluid/framework/init.cc
+++ b/paddle/fluid/framework/init.cc
@@ -20,7 +20,7 @@ limitations under the License. */
 #include "paddle/fluid/framework/operator.h"
 #include "paddle/fluid/platform/device_context.h"
 #include "paddle/fluid/platform/place.h"
-#include "paddle/string/piece.h"
+#include "paddle/fluid/string/piece.h"

 namespace paddle {
 namespace framework {

--- a/paddle/fluid/framework/mixed_vector.h
+++ b/paddle/fluid/framework/mixed_vector.h
@@ -37,9 +37,8 @@ class Vector {

  // Fill vector with value. The vector size is `count`.
  explicit Vector(size_t count, const T& value = T()) {
-    if (count == 0) {
-      InitEmpty();
-    } else {
+    InitEmpty();
+    if (count != 0) {
      resize(count);
      T* ptr = begin();
      for (size_t i = 0; i < count; ++i) {
@@ -107,9 +106,11 @@ class Vector {
  // std::vector iterator methods. Based on CPU data access method
  size_t size() const { return size_; }

-  T* begin() { return &this->operator[](0); }
+  T* begin() { return capacity() == 0 ? &EmptyDummy() : &this->operator[](0); }

-  T* end() { return &this->operator[](size()); }
+  T* end() {
+    return capacity() == 0 ? &EmptyDummy() : &this->operator[](size());
+  }

  T& front() { return *begin(); }

@@ -119,8 +120,17 @@ class Vector {
    return *it;
  }

-  const T* begin() const { return &this->operator[](0); }
-  const T* end() const { return &this->operator[](size()); }
+  const T* begin() const {
+    return capacity() == 0 ? &EmptyDummy() : &this->operator[](0);
+  }
+
+  const T* end() const {
+    return capacity() == 0 ? &EmptyDummy() : &this->operator[](size());
+  }
+
+  const T* cbegin() const { return begin(); }
+
+  const T* cend() const { return end(); }

  const T& back() const {
    auto it = end();
@@ -244,7 +254,9 @@ class Vector {

  bool operator==(const Vector<T>& other) const {
    if (size() != other.size()) return false;
-    for (auto it1 = begin(), it2 = other.begin(); it1 < end(); ++it1, ++it2) {
+    auto it1 = cbegin();
+    auto it2 = other.cbegin();
+    for (; it1 < cend(); ++it1, ++it2) {
      if (*it1 != *it2) {
        return false;
      }
@@ -353,6 +365,11 @@ class Vector {
    }
  }

+  static T& EmptyDummy() {
+    static T dummy = T();
+    return dummy;
+  }
+
  mutable int flag_;
  mutable Tensor cpu_vec_;
  mutable Tensor cuda_vec_;
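
The `EmptyDummy()` change makes `begin()` and `end()` safe to call on a vector that has no storage, which is what the new `ForEach` test exercises. The sketch below reproduces the pattern in a CPU-only miniature. `TinyVector` and `SumAll` are hypothetical names, and the class checks `empty()` on a wrapped `std::vector`, whereas the diff checks `capacity() == 0` on a tensor-backed buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of the pattern; this is not Paddle's Vector.
template <typename T>
class TinyVector {
 public:
  void push_back(const T& v) { data_.push_back(v); }
  size_t size() const { return data_.size(); }

  // With no storage there is no element 0 to take the address of, so both
  // iterators point at one shared dummy object and the range is empty.
  T* begin() { return data_.empty() ? &EmptyDummy() : data_.data(); }
  T* end() {
    return data_.empty() ? &EmptyDummy() : data_.data() + data_.size();
  }

  T& back() {
    assert(!data_.empty());  // back() has no meaning on an empty vector
    return *(end() - 1);
  }

 private:
  static T& EmptyDummy() {
    static T dummy = T();
    return dummy;
  }

  std::vector<T> data_;
};

// Sums with a range-for loop; on an empty container the body never runs.
inline int SumAll(TinyVector<int>& v) {
  int total = 0;
  for (int x : v) total += x;
  return total;
}
```

Because `begin()` and `end()` return the same address when the container is empty, a range-based `for` loop terminates immediately instead of dereferencing an unallocated buffer.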

--- a/paddle/fluid/framework/mixed_vector_test.cu
+++ b/paddle/fluid/framework/mixed_vector_test.cu
@@ -26,10 +26,10 @@ TEST(mixed_vector, CPU_VECTOR) {
  for (int i = 0; i < 10; ++i) {
    tmp.push_back(i);
  }
-  ASSERT_EQ(tmp.size(), 10);
+  ASSERT_EQ(tmp.size(), 10UL);
  vec<int> tmp2;
  tmp2 = tmp;
-  ASSERT_EQ(tmp2.size(), 10);
+  ASSERT_EQ(tmp2.size(), 10UL);
  for (int i = 0; i < 10; ++i) {
    ASSERT_EQ(tmp2[i], i);
    ASSERT_EQ(tmp2[i], tmp[i]);
@@ -58,7 +58,7 @@ TEST(mixed_vector, GPU_VECTOR) {
  for (int i = 0; i < 10; ++i) {
    tmp.push_back(i);
  }
-  ASSERT_EQ(tmp.size(), 10);
+  ASSERT_EQ(tmp.size(), 10UL);
  paddle::platform::CUDAPlace gpu(0);

  multiply_10<<<1, 1, 0, GetCUDAStream(gpu)>>>(tmp.MutableData(gpu));
@@ -79,7 +79,7 @@ TEST(mixed_vector, MultiGPU) {
  for (int i = 0; i < 10; ++i) {
    tmp.push_back(i);
  }
-  ASSERT_EQ(tmp.size(), 10);
+  ASSERT_EQ(tmp.size(), 10UL);
  paddle::platform::CUDAPlace gpu0(0);
  paddle::platform::SetDeviceId(0);
  multiply_10<<<1, 1, 0, GetCUDAStream(gpu0)>>>(tmp.MutableData(gpu0));
@@ -91,3 +91,16 @@ TEST(mixed_vector, MultiGPU) {
    ASSERT_EQ(tmp[i], i * 100);
  }
 }
+
+TEST(mixed_vector, InitWithCount) {
+  paddle::framework::Vector<int> vec(10, 10);
+  for (int i = 0; i < 10; ++i) {
+    ASSERT_EQ(vec[i], 10);
+  }
+}
+
+TEST(mixed_vector, ForEach) {
+  vec<int> tmp;
+  for (auto& v : tmp) {
+  }
+}
--- a/paddle/fluid/framework/scope.cc
+++ b/paddle/fluid/framework/scope.cc
@@ -18,7 +18,7 @@ limitations under the License. */
 #include <mutex>   // for call_once
 #include "glog/logging.h"
 #include "paddle/fluid/framework/threadpool.h"
-#include "paddle/string/printf.h"
+#include "paddle/fluid/string/printf.h"

 DEFINE_bool(benchmark, false,
            "Doing memory benchmark. It will make deleting scope synchronized, "

--- a/paddle/fluid/inference/tests/book/CMakeLists.txt
+++ b/paddle/fluid/inference/tests/book/CMakeLists.txt
@@ -29,6 +29,6 @@ inference_test(image_classification ARGS vgg resnet)
 inference_test(label_semantic_roles)
 inference_test(recognize_digits ARGS mlp)
 inference_test(recommender_system)
-inference_test(rnn_encoder_decoder)
+#inference_test(rnn_encoder_decoder)
 inference_test(understand_sentiment)
 inference_test(word2vec)
--- a/paddle/fluid/operators/compare_op.cc
+++ b/paddle/fluid/operators/compare_op.cc
@@ -102,3 +102,5 @@ REGISTER_LOGICAL_OP(less_equal, "Out = X <= Y");
 REGISTER_LOGICAL_KERNEL(less_equal, CPU, paddle::operators::LessEqualFunctor);
 REGISTER_LOGICAL_OP(equal, "Out = X == Y");
 REGISTER_LOGICAL_KERNEL(equal, CPU, paddle::operators::EqualFunctor);
+REGISTER_LOGICAL_OP(not_equal, "Out = X != Y");
+REGISTER_LOGICAL_KERNEL(not_equal, CPU, paddle::operators::NotEqualFunctor);
--- a/paddle/fluid/operators/compare_op.cu
+++ b/paddle/fluid/operators/compare_op.cu
@@ -17,3 +17,4 @@ limitations under the License. */
 REGISTER_LOGICAL_KERNEL(less_than, CUDA, paddle::operators::LessThanFunctor);
 REGISTER_LOGICAL_KERNEL(less_equal, CUDA, paddle::operators::LessEqualFunctor);
 REGISTER_LOGICAL_KERNEL(equal, CUDA, paddle::operators::EqualFunctor);
+REGISTER_LOGICAL_KERNEL(not_equal, CUDA, paddle::operators::NotEqualFunctor);
--- a/paddle/fluid/operators/compare_op.h
+++ b/paddle/fluid/operators/compare_op.h
@@ -48,6 +48,14 @@ struct EqualFunctor {
  }
 };

+template <typename T>
+struct NotEqualFunctor {
+  using ELEM_TYPE = T;
+  HOSTDEVICE bool operator()(const T& a, const T& b) const {
+    return !EqualFunctor<T>()(a, b);
+  }
+};
+
 template <typename DeviceContext, typename Functor>
 class CompareOpKernel
    : public framework::OpKernel<typename Functor::ELEM_TYPE> {
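
`NotEqualFunctor` is defined by negating `EqualFunctor`, so any special handling inside `EqualFunctor` carries over automatically. A minimal host-only sketch of that composition follows. The body of `EqualFunctor` here is an assumption (plain `==`), because the diff does not show it, and `CompareElementwise` is a hypothetical helper standing in for the kernel's element-wise loop.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumed body: plain operator==. The real EqualFunctor is not in the diff.
template <typename T>
struct EqualFunctor {
  bool operator()(const T& a, const T& b) const { return a == b; }
};

// Same shape as the functor added in compare_op.h, minus HOSTDEVICE.
template <typename T>
struct NotEqualFunctor {
  bool operator()(const T& a, const T& b) const {
    return !EqualFunctor<T>()(a, b);
  }
};

// Hypothetical helper: applies a binary comparison functor element by element.
template <typename T, typename Functor>
std::vector<bool> CompareElementwise(const std::vector<T>& x,
                                     const std::vector<T>& y, Functor f) {
  assert(x.size() == y.size());
  std::vector<bool> out(x.size());
  std::transform(x.begin(), x.end(), y.begin(), out.begin(), f);
  return out;
}
```

Calling `CompareElementwise(x, y, NotEqualFunctor<int>())` on `{1, 2, 3}` and `{1, 5, 3}` marks only the middle position as different.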

--- a/paddle/fluid/operators/concat_op.h
+++ b/paddle/fluid/operators/concat_op.h
@@ -28,17 +28,18 @@ class ConcatKernel : public framework::OpKernel<T> {
    auto ins = ctx.MultiInput<framework::Tensor>("X");
    auto* out = ctx.Output<framework::Tensor>("Out");
    int64_t axis = static_cast<int64_t>(ctx.Attr<int>("axis"));
-    const size_t n = ins.size();
+    auto place = ctx.GetPlace();
+    out->mutable_data<T>(place);
+
+    auto out_stride = framework::stride_numel(out->dims());
+
    size_t output_offset = 0;
-    out->mutable_data<T>(ctx.GetPlace());
-    auto out_stride = framework::stride(out->dims());
-    for (size_t i = 0; i < n; i++) {
-      auto& in = ins[i];
-      auto axis_dim = in->dims()[axis];
-      auto in_stride = framework::stride(in->dims());
-      StridedMemcpy<T>(ctx.device_context(), in->data<T>(), in_stride,
-                       in->dims(), out_stride, out->data<T>() + output_offset);
-      output_offset += axis_dim * in_stride[axis];
+    for (auto* in : ins) {
+      auto in_stride = framework::stride_numel(in->dims());
+      StridedNumelCopyWithAxis<T>(ctx.device_context(), axis,
+                                  out->data<T>() + output_offset, out_stride,
+                                  in->data<T>(), in_stride);
+      output_offset += in_stride[axis];
    }
  }
 };
@@ -50,17 +51,16 @@ class ConcatGradKernel : public framework::OpKernel<T> {
    auto* in = ctx.Input<framework::Tensor>(framework::GradVarName("Out"));
    auto outs = ctx.MultiOutput<framework::Tensor>(framework::GradVarName("X"));
    int64_t axis = static_cast<int64_t>(ctx.Attr<int>("axis"));
-    const size_t n = outs.size();
    size_t input_offset = 0;
-    auto in_stride = framework::stride(in->dims());
-    for (size_t i = 0; i < n; i++) {
-      auto& out = outs[i];
+    auto in_stride = framework::stride_numel(in->dims());
+
+    for (auto& out : outs) {
      out->mutable_data<T>(ctx.GetPlace());
-      size_t axis_dim = out->dims()[axis];
-      auto out_stride = framework::stride(out->dims());
-      StridedMemcpy<T>(ctx.device_context(), in->data<T>() + input_offset,
-                       in_stride, out->dims(), out_stride, out->data<T>());
-      input_offset += axis_dim * in_stride[axis];
+      auto out_stride = framework::stride_numel(out->dims());
+      StridedNumelCopyWithAxis<T>(ctx.device_context(), axis, out->data<T>(),
+                                  out_stride, in->data<T>() + input_offset,
+                                  in_stride);
+      input_offset += out_stride[axis];
    }
  }
 };
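The rewritten concat kernels above rely on "numel strides", where entry `i` is the number of elements spanned by dimensions `i..rank-1`. The following is a minimal CPU-only sketch of that idea, not the Paddle implementation: `StrideNumel`, `CopyWithAxis` and `Concat` are hypothetical stand-ins written for illustration, and the sketch assumes row-major data and a trivially copyable element type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a numel-stride helper: entry i is the product of
// dims[i], dims[i + 1], ..., dims[rank - 1].
std::vector<int64_t> StrideNumel(const std::vector<int64_t>& dims) {
  assert(!dims.empty());
  std::vector<int64_t> strides(dims.size());
  strides.back() = dims.back();
  for (int i = static_cast<int>(dims.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * dims[i];
  }
  return strides;
}

// Hypothetical CPU analogue of an axis-aware strided copy: for every index
// combination in front of `axis`, copy one contiguous chunk of the source
// into the matching (wider) slot of the destination.
template <typename T>
void CopyWithAxis(int64_t axis, T* dst, const std::vector<int64_t>& dst_stride,
                  const T* src, const std::vector<int64_t>& src_stride) {
  const int64_t before = dst_stride[0] / dst_stride[axis];
  const int64_t src_after = src_stride[axis];
  const int64_t dst_after = dst_stride[axis];
  for (int64_t i = 0; i < before; ++i) {
    std::memcpy(dst + i * dst_after, src + i * src_after,
                sizeof(T) * src_after);
  }
}

// Concatenate row-major tensors along `axis`; all other dims must agree.
template <typename T>
std::vector<T> Concat(const std::vector<std::vector<T>>& inputs,
                      const std::vector<std::vector<int64_t>>& input_dims,
                      int64_t axis) {
  std::vector<int64_t> out_dims = input_dims[0];
  out_dims[axis] = 0;
  for (const auto& d : input_dims) out_dims[axis] += d[axis];

  const auto out_stride = StrideNumel(out_dims);
  std::vector<T> out(static_cast<size_t>(out_stride[0]));
  size_t offset = 0;
  for (size_t k = 0; k < inputs.size(); ++k) {
    const auto in_stride = StrideNumel(input_dims[k]);
    CopyWithAxis<T>(axis, out.data() + offset, out_stride, inputs[k].data(),
                    in_stride);
    // Same bookkeeping as the kernel: advance by the input's numel at `axis`.
    offset += static_cast<size_t>(in_stride[axis]);
  }
  return out;
}
```

For example, concatenating a 2x2 tensor `{1, 2, 3, 4}` with a 2x1 tensor `{5, 6}` along axis 1 yields the 2x3 tensor `{1, 2, 5, 3, 4, 6}`. The offset advances by `in_stride[axis]` alone because that value already includes the size of the concatenated dimension.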

--- a/paddle/fluid/operators/detection_map_op.cc
+++ b/paddle/fluid/operators/detection_map_op.cc
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/fluid/operators/detection_map_op.h"
+
+namespace paddle {
+namespace operators {
+
+using Tensor = framework::Tensor;
+
+class DetectionMAPOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+  void InferShape(framework::InferShapeContext* ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("DetectRes"),
+                   "Input(DetectRes) of DetectionMAPOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Label"),
+                   "Input(Label) of DetectionMAPOp should not be null.");
+    PADDLE_ENFORCE(
+        ctx->HasOutput("AccumPosCount"),
+        "Output(AccumPosCount) of DetectionMAPOp should not be null.");
+    PADDLE_ENFORCE(
+        ctx->HasOutput("AccumTruePos"),
+        "Output(AccumTruePos) of DetectionMAPOp should not be null.");
+    PADDLE_ENFORCE(
+        ctx->HasOutput("AccumFalsePos"),
+        "Output(AccumFalsePos) of DetectionMAPOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("MAP"),
+                   "Output(MAP) of DetectionMAPOp should not be null.");
+
+    auto det_dims = ctx->GetInputDim("DetectRes");
+    PADDLE_ENFORCE_EQ(det_dims.size(), 2UL,
+                      "The rank of Input(DetectRes) must be 2, "
+                      "the shape is [N, 6].");
+    PADDLE_ENFORCE_EQ(det_dims[1], 6UL,
+                      "The shape of Input(DetectRes) must be [N, 6].");
+    auto label_dims = ctx->GetInputDim("Label");
+    PADDLE_ENFORCE_EQ(label_dims.size(), 2UL,
+                      "The rank of Input(Label) must be 2, "
+                      "the shape is [N, 6].");
+    PADDLE_ENFORCE_EQ(label_dims[1], 6UL,
+                      "The shape of Input(Label) must be [N, 6].");
+
+    if (ctx->HasInput("PosCount")) {
+      PADDLE_ENFORCE(ctx->HasInput("TruePos"),
+                     "Input(TruePos) of DetectionMAPOp should not be null when "
+                     "Input(PosCount) is not null.");
+      PADDLE_ENFORCE(
+          ctx->HasInput("FalsePos"),
+          "Input(FalsePos) of DetectionMAPOp should not be null when "
+          "Input(PosCount) is not null.");
+    }
+
+    ctx->SetOutputDim("MAP", framework::make_ddim({1}));
+  }
+
+ protected:
+  framework::OpKernelType GetExpectedKernelType(
+      const framework::ExecutionContext& ctx) const override {
+    return framework::OpKernelType(
+        framework::ToDataType(
+            ctx.Input<framework::Tensor>("DetectRes")->type()),
+        ctx.device_context());
+  }
+};
+
+class DetectionMAPOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  DetectionMAPOpMaker(OpProto* proto, OpAttrChecker* op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("DetectRes",
+             "(LoDTensor) A 2-D LoDTensor with shape [M, 6] that represents "
+             "the detections. Each row has 6 values: "
+             "[label, confidence, xmin, ymin, xmax, ymax]. M is the total "
+             "number of detection results in this mini-batch. The level-0 LoD "
+             "holds batch_size + 1 offsets into the first dimension, one "
+             "range per instance; LoD[i + 1] - LoD[i] == 0 means instance i "
+             "has no detections.");
+    AddInput("Label",
+             "(LoDTensor) A 2-D LoDTensor with shape [N, 6] that represents "
+             "the labeled ground-truth data. Each row has 6 values: "
+             "[label, is_difficult, xmin, ymin, xmax, ymax]. N is the total "
+             "number of ground-truth boxes in this mini-batch. The level-0 "
+             "LoD holds batch_size + 1 offsets into the first dimension, one "
+             "range per instance; LoD[i + 1] - LoD[i] == 0 means instance i "
+             "has no ground-truth data.");
+    AddInput("PosCount",
+             "(Tensor) A tensor with shape [Ncls, 1] that stores the "
+             "positive example count of each class, where Ncls is the number "
+             "of classes. "
+             "This input passes in the AccumPosCount generated by the "
+             "previous mini-batch when accumulating over multiple "
+             "mini-batches. "
+             "When Input(PosCount) is empty, no accumulation is carried out "
+             "and only the results of the current mini-batch are "
+             "calculated.")
+        .AsDispensable();
+    AddInput("TruePos",
+             "(LoDTensor) A 2-D LoDTensor with shape [Ntp, 2] that stores the "
+             "true positive examples of each class. "
+             "This input passes in the AccumTruePos generated by the "
+             "previous mini-batch when accumulating over multiple "
+             "mini-batches.")
+        .AsDispensable();
+    AddInput("FalsePos",
+             "(LoDTensor) A 2-D LoDTensor with shape [Nfp, 2] that stores the "
+             "false positive examples of each class. "
+             "This input passes in the AccumFalsePos generated by the "
+             "previous mini-batch when accumulating over multiple "
+             "mini-batches.")
+        .AsDispensable();
+    AddOutput("AccumPosCount",
+              "(Tensor) A tensor with shape [Ncls, 1] that stores the "
+              "positive example count of each class. It combines "
+              "Input(PosCount) with the positive example count computed from "
+              "Input(DetectRes) and Input(Label).");
+    AddOutput("AccumTruePos",
+              "(LoDTensor) A LoDTensor with shape [Ntp', 2] that stores the "
+              "true positive examples of each class. It combines "
+              "Input(TruePos) with the true positive examples computed from "
+              "Input(DetectRes) and Input(Label).");
+    AddOutput("AccumFalsePos",
+              "(LoDTensor) A LoDTensor with shape [Nfp', 2] that stores the "
+              "false positive examples of each class. It combines "
+              "Input(FalsePos) with the false positive examples computed from "
+              "Input(DetectRes) and Input(Label).");
+    AddOutput("MAP",
+              "(Tensor) A tensor with shape [1] that stores the mAP "
+              "evaluation result of the detection.");
+
+    AddAttr<float>(
+        "overlap_threshold",
+        "(float, default 0.3) "
+        "The lower bound of the Jaccard overlap between a detection and a "
+        "ground-truth box for the two to be considered a match.")
+        .SetDefault(.3f);
+    AddAttr<bool>("evaluate_difficult",
+                  "(bool, default true) "
+                  "Switch to control whether the difficult data is evaluated.")
+        .SetDefault(true);
+    AddAttr<std::string>("ap_type",
+                         "(string, default 'integral') "
+                         "The AP algorithm type, 'integral' or '11point'.")
+        .SetDefault("integral")
+        .InEnum({"integral", "11point"})
+        .AddCustomChecker([](const std::string& ap_type) {
+          PADDLE_ENFORCE_NE(GetAPType(ap_type), APType::kNone,
+                            "The ap_type should be 'integral' or '11point'.");
+        });
+    AddComment(R"DOC(
+Detection mAP evaluation operator.
+The general steps are as follows. First, calculate the true positives and
+ false positives from the input detections and labels, then
+ calculate the mAP value.
+ Both the '11point' and 'integral' mAP algorithms are supported. For more
+ information, see the following articles:
+ https://sanchom.wordpress.com/tag/average-precision/
+ https://arxiv.org/abs/1512.02325
+
+)DOC");
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP_WITHOUT_GRADIENT(detection_map, ops::DetectionMAPOp,
+                             ops::DetectionMAPOpMaker);
+REGISTER_OP_CPU_KERNEL(
+    detection_map, ops::DetectionMAPOpKernel<paddle::platform::CPUPlace, float>,
+    ops::DetectionMAPOpKernel<paddle::platform::CPUPlace, double>);
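The two `ap_type` values registered above correspond to two ways of reducing a precision/recall curve to a single number. The sketch below is a simplified, standalone illustration under stated assumptions, not the operator's kernel: `ApKind` and `AveragePrecision` are hypothetical names, each detection is a `(score, is_true_positive)` pair for one class (so every non-true-positive counts as a false positive), and the 11-point branch uses the textbook definition (maximum precision at recall >= 0, 0.1, ..., 1.0) rather than any particular scan order.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

enum class ApKind { kIntegral, k11Point };  // hypothetical, for illustration

// Average precision for a single class.
// dets:    (score, is_true_positive) per detection, in any order.
// num_pos: number of ground-truth boxes of this class (must be positive).
double AveragePrecision(std::vector<std::pair<double, bool>> dets, int num_pos,
                        ApKind kind) {
  assert(num_pos > 0);
  // Rank detections from most to least confident.
  std::stable_sort(dets.begin(), dets.end(),
                   [](const std::pair<double, bool>& a,
                      const std::pair<double, bool>& b) {
                     return a.first > b.first;
                   });

  // Cumulative precision and recall after each ranked detection.
  std::vector<double> precision, recall;
  int tp = 0;
  for (size_t i = 0; i < dets.size(); ++i) {
    if (dets[i].second) ++tp;
    precision.push_back(static_cast<double>(tp) / static_cast<double>(i + 1));
    recall.push_back(static_cast<double>(tp) / num_pos);
  }

  double ap = 0.0;
  if (kind == ApKind::kIntegral) {
    // Sum precision weighted by each increase in recall.
    double prev_recall = 0.0;
    for (size_t i = 0; i < recall.size(); ++i) {
      ap += precision[i] * (recall[i] - prev_recall);
      prev_recall = recall[i];
    }
  } else {
    // Mean of the best precision reachable at 11 evenly spaced recall levels.
    for (int j = 0; j <= 10; ++j) {
      double best = 0.0;
      for (size_t i = 0; i < recall.size(); ++i) {
        if (recall[i] >= j / 10.0) best = std::max(best, precision[i]);
      }
      ap += best / 11.0;
    }
  }
  return ap;
}
```

With detections `(0.9, TP)`, `(0.8, FP)`, `(0.7, TP)` and two ground-truth boxes, the ranked precision values are 1, 1/2, 2/3 and the recall values are 1/2, 1/2, 1. The integral form gives 1 * 1/2 + 2/3 * 1/2 = 5/6, and the 11-point form gives (6 * 1 + 5 * 2/3) / 11 = 28/33. mAP is then the mean of these per-class values.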
--- a/paddle/fluid/operators/detection_map_op.h
+++ b/paddle/fluid/operators/detection_map_op.h
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
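+#pragma once
+#include <algorithm>
+#include <cmath>
+#include <map>
+#include <string>
+#include <utility>
+#include <vector>
+#include "paddle/fluid/framework/eigen.h"
+#include "paddle/fluid/framework/op_registry.h"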
+
+namespace paddle {
+namespace operators {
+
+enum APType { kNone = 0, kIntegral, k11point };
+
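+// Defined in a header, so it must be inline to avoid multiple-definition
+// errors when the header is included from more than one translation unit.
+inline APType GetAPType(const std::string& str) {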
+  if (str == "integral") {
+    return APType::kIntegral;
+  } else if (str == "11point") {
+    return APType::k11point;
+  } else {
+    return APType::kNone;
+  }
+}
+
+template <typename T>
+inline bool SortScorePairDescend(const std::pair<float, T>& pair1,
+                                 const std::pair<float, T>& pair2) {
+  return pair1.first > pair2.first;
+}
+
+template <typename T>
+inline void GetAccumulation(std::vector<std::pair<T, int>> in_pairs,
+                            std::vector<int>* accu_vec) {
+  std::stable_sort(in_pairs.begin(), in_pairs.end(), SortScorePairDescend<int>);
+  accu_vec->clear();
+  size_t sum = 0;
+  for (size_t i = 0; i < in_pairs.size(); ++i) {
+    auto count = in_pairs[i].second;
+    sum += count;
+    accu_vec->push_back(sum);
+  }
+}
+
+template <typename Place, typename T>
+class DetectionMAPOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto* in_detect = ctx.Input<framework::LoDTensor>("DetectRes");
+    auto* in_label = ctx.Input<framework::LoDTensor>("Label");
+    auto* out_map = ctx.Output<framework::Tensor>("MAP");
+
+    auto* in_pos_count = ctx.Input<framework::Tensor>("PosCount");
+    auto* in_true_pos = ctx.Input<framework::LoDTensor>("TruePos");
+    auto* in_false_pos = ctx.Input<framework::LoDTensor>("FalsePos");
+
+    auto* out_pos_count = ctx.Output<framework::Tensor>("AccumPosCount");
+    auto* out_true_pos = ctx.Output<framework::LoDTensor>("AccumTruePos");
+    auto* out_false_pos = ctx.Output<framework::LoDTensor>("AccumFalsePos");
+
+    float overlap_threshold = ctx.Attr<float>("overlap_threshold");
+    bool evaluate_difficult = ctx.Attr<bool>("evaluate_difficult");
+    auto ap_type = GetAPType(ctx.Attr<std::string>("ap_type"));
+
+    auto label_lod = in_label->lod();
+    auto detect_lod = in_detect->lod();
+    PADDLE_ENFORCE_EQ(label_lod.size(), 1UL,
+                      "Only support one level sequence now.");
+    PADDLE_ENFORCE_EQ(detect_lod.size(), 1UL,
+                      "Only support one level sequence now.");
+    PADDLE_ENFORCE_EQ(label_lod[0].size(), detect_lod[0].size(),
+                      "The batch_size of input(Label) and input(DetectRes) "
+                      "must be the same.");
+
+    std::vector<std::map<int, std::vector<Box>>> gt_boxes;
+    std::vector<std::map<int, std::vector<std::pair<T, Box>>>> detect_boxes;
+
+    GetBoxes(*in_label, *in_detect, gt_boxes, detect_boxes);
+
+    std::map<int, int> label_pos_count;
+    std::map<int, std::vector<std::pair<T, int>>> true_pos;
+    std::map<int, std::vector<std::pair<T, int>>> false_pos;
+
+    if (in_pos_count != nullptr) {
+      GetInputPos(*in_pos_count, *in_true_pos, *in_false_pos, label_pos_count,
+                  true_pos, false_pos);
+    }
+
+    CalcTrueAndFalsePositive(gt_boxes, detect_boxes, evaluate_difficult,
+                             overlap_threshold, label_pos_count, true_pos,
+                             false_pos);
+
+    T map = CalcMAP(ap_type, label_pos_count, true_pos, false_pos);
+
+    GetOutputPos(ctx, label_pos_count, true_pos, false_pos, *out_pos_count,
+                 *out_true_pos, *out_false_pos);
+
+    T* map_data = out_map->mutable_data<T>(ctx.GetPlace());
+    map_data[0] = map;
+  }
+
+ protected:
+  struct Box {
+    Box(T xmin, T ymin, T xmax, T ymax)
+        : xmin(xmin), ymin(ymin), xmax(xmax), ymax(ymax), is_difficult(false) {}
+
+    T xmin, ymin, xmax, ymax;
+    bool is_difficult;
+  };
+
+  inline T JaccardOverlap(const Box& box1, const Box& box2) const {
+    if (box2.xmin > box1.xmax || box2.xmax < box1.xmin ||
+        box2.ymin > box1.ymax || box2.ymax < box1.ymin) {
+      return 0.0;
+    } else {
+      T inter_xmin = std::max(box1.xmin, box2.xmin);
+      T inter_ymin = std::max(box1.ymin, box2.ymin);
+      T inter_xmax = std::min(box1.xmax, box2.xmax);
+      T inter_ymax = std::min(box1.ymax, box2.ymax);
+
+      T inter_width = inter_xmax - inter_xmin;
+      T inter_height = inter_ymax - inter_ymin;
+      T inter_area = inter_width * inter_height;
+
+      T bbox_area1 = (box1.xmax - box1.xmin) * (box1.ymax - box1.ymin);
+      T bbox_area2 = (box2.xmax - box2.xmin) * (box2.ymax - box2.ymin);
+
+      return inter_area / (bbox_area1 + bbox_area2 - inter_area);
+    }
+  }
+
+  void GetBoxes(const framework::LoDTensor& input_label,
+                const framework::LoDTensor& input_detect,
+                std::vector<std::map<int, std::vector<Box>>>& gt_boxes,
+                std::vector<std::map<int, std::vector<std::pair<T, Box>>>>&
+                    detect_boxes) const {
+    auto labels = framework::EigenTensor<T, 2>::From(input_label);
+    auto detect = framework::EigenTensor<T, 2>::From(input_detect);
+
+    auto label_lod = input_label.lod();
+    auto detect_lod = input_detect.lod();
+
+    int batch_size = label_lod[0].size() - 1;
+    auto label_index = label_lod[0];
+
+    for (int n = 0; n < batch_size; ++n) {
+      std::map<int, std::vector<Box>> boxes;
+      for (int i = label_index[n]; i < label_index[n + 1]; ++i) {
+        Box box(labels(i, 2), labels(i, 3), labels(i, 4), labels(i, 5));
+        int label = labels(i, 0);
+        auto is_difficult = labels(i, 1);
+        if (std::abs(is_difficult - 0.0) < 1e-6)
+          box.is_difficult = false;
+        else
+          box.is_difficult = true;
+        boxes[label].push_back(box);
+      }
+      gt_boxes.push_back(boxes);
+    }
+
+    auto detect_index = detect_lod[0];
+    for (int n = 0; n < batch_size; ++n) {
+      std::map<int, std::vector<std::pair<T, Box>>> boxes;
+      for (int i = detect_index[n]; i < detect_index[n + 1]; ++i) {
+        Box box(detect(i, 2), detect(i, 3), detect(i, 4), detect(i, 5));
+        int label = detect(i, 0);
+        auto score = detect(i, 1);
+        boxes[label].push_back(std::make_pair(score, box));
+      }
+      detect_boxes.push_back(boxes);
+    }
+  }
+
+  void GetOutputPos(
+      const framework::ExecutionContext& ctx,
+      const std::map<int, int>& label_pos_count,
+      const std::map<int, std::vector<std::pair<T, int>>>& true_pos,
+      const std::map<int, std::vector<std::pair<T, int>>>& false_pos,
+      framework::Tensor& output_pos_count,
+      framework::LoDTensor& output_true_pos,
+      framework::LoDTensor& output_false_pos) const {
+    int max_class_id = 0;
+    int true_pos_count = 0;
+    int false_pos_count = 0;
+    for (auto it = label_pos_count.begin(); it != label_pos_count.end(); ++it) {
+      int label = it->first;
+      if (label > max_class_id) max_class_id = label;
+      int label_num_pos = it->second;
+      if (label_num_pos == 0 || true_pos.find(label) == true_pos.end())
+        continue;
+      auto label_true_pos = true_pos.find(label)->second;
+      auto label_false_pos = false_pos.find(label)->second;
+      true_pos_count += label_true_pos.size();
+      false_pos_count += label_false_pos.size();
+    }
+
+    int* pos_count_data = output_pos_count.mutable_data<int>(
+        framework::make_ddim({max_class_id + 1, 1}), ctx.GetPlace());
+    T* true_pos_data = output_true_pos.mutable_data<T>(
+        framework::make_ddim({true_pos_count, 2}), ctx.GetPlace());
+    T* false_pos_data = output_false_pos.mutable_data<T>(
+        framework::make_ddim({false_pos_count, 2}), ctx.GetPlace());
+    true_pos_count = 0;
+    false_pos_count = 0;
+    std::vector<size_t> true_pos_starts = {0};
+    std::vector<size_t> false_pos_starts = {0};
+    for (int i = 0; i <= max_class_id; ++i) {
+      auto it_count = label_pos_count.find(i);
+      pos_count_data[i] = 0;
+      if (it_count != label_pos_count.end()) {
+        pos_count_data[i] = it_count->second;
+      }
+      auto it_true_pos = true_pos.find(i);
+      if (it_true_pos != true_pos.end()) {
+        const std::vector<std::pair<T, int>>& true_pos_vec =
+            it_true_pos->second;
+        for (const std::pair<T, int>& tp : true_pos_vec) {
+          true_pos_data[true_pos_count * 2] = tp.first;
+          true_pos_data[true_pos_count * 2 + 1] = static_cast<T>(tp.second);
+          true_pos_count++;
+        }
+      }
+      true_pos_starts.push_back(true_pos_count);
+
+      auto it_false_pos = false_pos.find(i);
+      if (it_false_pos != false_pos.end()) {
+        const std::vector<std::pair<T, int>>& false_pos_vec =
+            it_false_pos->second;
+        for (const std::pair<T, int>& fp : false_pos_vec) {
+          false_pos_data[false_pos_count * 2] = fp.first;
+          false_pos_data[false_pos_count * 2 + 1] = static_cast<T>(fp.second);
+          false_pos_count++;
+        }
+      }
+      false_pos_starts.push_back(false_pos_count);
+    }
+
+    framework::LoD true_pos_lod;
+    true_pos_lod.emplace_back(true_pos_starts);
+    framework::LoD false_pos_lod;
+    false_pos_lod.emplace_back(false_pos_starts);
+
+    output_true_pos.set_lod(true_pos_lod);
+    output_false_pos.set_lod(false_pos_lod);
+    return;
+  }
+
+  void GetInputPos(
+      const framework::Tensor& input_pos_count,
+      const framework::LoDTensor& input_true_pos,
+      const framework::LoDTensor& input_false_pos,
+      std::map<int, int>& label_pos_count,
+      std::map<int, std::vector<std::pair<T, int>>>& true_pos,
+      std::map<int, std::vector<std::pair<T, int>>>& false_pos) const {
+    constexpr T kEPS = static_cast<T>(1e-6);
+    int class_number = input_pos_count.dims()[0];
+    const int* pos_count_data = input_pos_count.data<int>();
+    for (int i = 0; i < class_number; ++i) {
+      label_pos_count[i] = pos_count_data[i];
+    }
+
+    auto SetData = [](const framework::LoDTensor& pos_tensor,
+                      std::map<int, std::vector<std::pair<T, int>>>& pos) {
+      const T* pos_data = pos_tensor.data<T>();
+      // Level 0 of the LoD holds one [start, end) offset range per class, so
+      // iterate over its entries rather than over the number of LoD levels.
+      const auto& pos_data_lod = pos_tensor.lod()[0];
+      for (size_t i = 0; i + 1 < pos_data_lod.size(); ++i) {
+        for (size_t j = pos_data_lod[i]; j < pos_data_lod[i + 1]; ++j) {
+          T score = pos_data[j * 2];
+          int flag = 1;
+          if (pos_data[j * 2 + 1] < kEPS) flag = 0;
+          pos[static_cast<int>(i)].push_back(std::make_pair(score, flag));
+        }
+      }
+    };
+
+    SetData(input_true_pos, true_pos);
+    SetData(input_false_pos, false_pos);
+    return;
+  }
+
+  void CalcTrueAndFalsePositive(
+      const std::vector<std::map<int, std::vector<Box>>>& gt_boxes,
+      const std::vector<std::map<int, std::vector<std::pair<T, Box>>>>&
+          detect_boxes,
+      bool evaluate_difficult, float overlap_threshold,
+      std::map<int, int>& label_pos_count,
+      std::map<int, std::vector<std::pair<T, int>>>& true_pos,
+      std::map<int, std::vector<std::pair<T, int>>>& false_pos) const {
+    int batch_size = gt_boxes.size();
+    for (int n = 0; n < batch_size; ++n) {
+      auto image_gt_boxes = gt_boxes[n];
+      for (auto it = image_gt_boxes.begin(); it != image_gt_boxes.end(); ++it) {
+        size_t count = 0;
+        auto labeled_bboxes = it->second;
+        if (evaluate_difficult) {
+          count = labeled_bboxes.size();
+        } else {
+          for (size_t i = 0; i < labeled_bboxes.size(); ++i)
+            if (!(labeled_bboxes[i].is_difficult)) ++count;
+        }
+        if (count == 0) {
+          continue;
+        }
+        int label = it->first;
+        if (label_pos_count.find(label) == label_pos_count.end()) {
+          label_pos_count[label] = count;
+        } else {
+          label_pos_count[label] += count;
+        }
+      }
+    }
+
+    for (size_t n = 0; n < detect_boxes.size(); ++n) {
+      auto image_gt_boxes = gt_boxes[n];
+      auto detections = detect_boxes[n];
+
+      if (image_gt_boxes.size() == 0) {
+        for (auto it = detections.begin(); it != detections.end(); ++it) {
+          auto pred_boxes = it->second;
+          int label = it->first;
+          for (size_t i = 0; i < pred_boxes.size(); ++i) {
+            auto score = pred_boxes[i].first;
+            true_pos[label].push_back(std::make_pair(score, 0));
+            false_pos[label].push_back(std::make_pair(score, 1));
+          }
+        }
+        continue;
+      }
+
+      for (auto it = detections.begin(); it != detections.end(); ++it) {
+        int label = it->first;
+        auto pred_boxes = it->second;
+        if (image_gt_boxes.find(label) == image_gt_boxes.end()) {
+          for (size_t i = 0; i < pred_boxes.size(); ++i) {
+            auto score = pred_boxes[i].first;
+            true_pos[label].push_back(std::make_pair(score, 0));
+            false_pos[label].push_back(std::make_pair(score, 1));
+          }
+          continue;
+        }
+
+        auto matched_bboxes = image_gt_boxes.find(label)->second;
+        std::vector<bool> visited(matched_bboxes.size(), false);
+        // Sort detections in descending order of score.
+        std::sort(pred_boxes.begin(), pred_boxes.end(),
+                  SortScorePairDescend<Box>);
+        for (size_t i = 0; i < pred_boxes.size(); ++i) {
+          T max_overlap = -1.0;
+          size_t max_idx = 0;
+          auto score = pred_boxes[i].first;
+          for (size_t j = 0; j < matched_bboxes.size(); ++j) {
+            T overlap = JaccardOverlap(pred_boxes[i].second, matched_bboxes[j]);
+            if (overlap > max_overlap) {
+              max_overlap = overlap;
+              max_idx = j;
+            }
+          }
+          if (max_overlap > overlap_threshold) {
+            bool match_evaluate_difficult =
+                evaluate_difficult || !matched_bboxes[max_idx].is_difficult;
+            if (match_evaluate_difficult) {
+              if (!visited[max_idx]) {
+                true_pos[label].push_back(std::make_pair(score, 1));
+                false_pos[label].push_back(std::make_pair(score, 0));
+                visited[max_idx] = true;
+              } else {
+                true_pos[label].push_back(std::make_pair(score, 0));
+                false_pos[label].push_back(std::make_pair(score, 1));
+              }
+            }
+          } else {
+            true_pos[label].push_back(std::make_pair(score, 0));
+            false_pos[label].push_back(std::make_pair(score, 1));
+          }
+        }
+      }
+    }
+  }
+
+  T CalcMAP(
+      APType ap_type, const std::map<int, int>& label_pos_count,
+      const std::map<int, std::vector<std::pair<T, int>>>& true_pos,
+      const std::map<int, std::vector<std::pair<T, int>>>& false_pos) const {
+    T mAP = 0.0;
+    int count = 0;
+    for (auto it = label_pos_count.begin(); it != label_pos_count.end(); ++it) {
+      int label = it->first;
+      int label_num_pos = it->second;
+      if (label_num_pos == 0 || true_pos.find(label) == true_pos.end())
+        continue;
+      auto label_true_pos = true_pos.find(label)->second;
+      auto label_false_pos = false_pos.find(label)->second;
+      // Compute average precision.
+      std::vector<int> tp_sum;
+      GetAccumulation<T>(label_true_pos, &tp_sum);
+      std::vector<int> fp_sum;
+      GetAccumulation<T>(label_false_pos, &fp_sum);
+      std::vector<T> precision, recall;
+      size_t num = tp_sum.size();
+      // Compute precision and recall.
+      for (size_t i = 0; i < num; ++i) {
+        precision.push_back(static_cast<T>(tp_sum[i]) /
+                            static_cast<T>(tp_sum[i] + fp_sum[i]));
+        recall.push_back(static_cast<T>(tp_sum[i]) / label_num_pos);
+      }
+      // VOC2007 style
+      if (ap_type == APType::k11point) {
+        std::vector<T> max_precisions(11, 0.0);
+        int start_idx = num - 1;
+        for (int j = 10; j >= 0; --j)
+          for (int i = start_idx; i >= 0; --i) {
+            if (recall[i] < j / 10.) {
+              start_idx = i;
+              if (j > 0) max_precisions[j - 1] = max_precisions[j];
+              break;
+            } else {
+              if (max_precisions[j] < precision[i])
+                max_precisions[j] = precision[i];
+            }
+          }
+        for (int j = 10; j >= 0; --j) mAP += max_precisions[j] / 11;
+        ++count;
+      } else if (ap_type == APType::kIntegral) {
+        // Natural integral
+        float average_precisions = 0.;
+        float prev_recall = 0.;
+        for (size_t i = 0; i < num; ++i) {
+          if (fabs(recall[i] - prev_recall) > 1e-6)
+            average_precisions += precision[i] * fabs(recall[i] - prev_recall);
+          prev_recall = recall[i];
+        }
+        mAP += average_precisions;
+        ++count;
+      } else {
+        LOG(FATAL) << "Unknown ap type: " << ap_type;
+      }
+    }
+    if (count != 0) mAP /= count;
+    return mAP * 100;
+  }
+};  // class DetectionMAPOpKernel
+
+}  // namespace operators
+}  // namespace paddle
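The matching step in the kernel above depends on the Jaccard overlap (intersection over union) of two axis-aligned boxes. A minimal standalone sketch with a worked example follows; `Rect` is a hypothetical type for illustration, and the guard against a zero-area union is an addition that the kernel's `JaccardOverlap` does not have.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical axis-aligned box, corners given as (xmin, ymin)-(xmax, ymax).
struct Rect {
  double xmin, ymin, xmax, ymax;
};

// Intersection area divided by union area; 0 when the boxes do not overlap.
double JaccardOverlap(const Rect& a, const Rect& b) {
  if (b.xmin > a.xmax || b.xmax < a.xmin || b.ymin > a.ymax ||
      b.ymax < a.ymin) {
    return 0.0;
  }
  const double inter_w = std::min(a.xmax, b.xmax) - std::max(a.xmin, b.xmin);
  const double inter_h = std::min(a.ymax, b.ymax) - std::max(a.ymin, b.ymin);
  const double inter = inter_w * inter_h;
  const double area_a = (a.xmax - a.xmin) * (a.ymax - a.ymin);
  const double area_b = (b.xmax - b.xmin) * (b.ymax - b.ymin);
  const double uni = area_a + area_b - inter;
  // Assumption added for this sketch: treat degenerate boxes as no overlap
  // instead of dividing by zero.
  return uni > 0.0 ? inter / uni : 0.0;
}
```

For boxes (0, 0)-(2, 2) and (1, 1)-(3, 3), the intersection is a unit square and the union is 4 + 4 - 1 = 7, so the overlap is 1/7, roughly 0.143. That is below the operator's default `overlap_threshold` of 0.3, so such a detection would be counted as a false positive.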
--- a/paddle/fluid/operators/listen_and_serv_op.cc
+++ b/paddle/fluid/operators/listen_and_serv_op.cc
@@ -27,7 +27,7 @@ limitations under the License. */
 #include "paddle/fluid/operators/detail/grpc_server.h"
 #include "paddle/fluid/operators/detail/sendrecvop_utils.h"
 #include "paddle/fluid/operators/detail/simple_block_queue.h"
-#include "paddle/string/printf.h"
+#include "paddle/fluid/string/printf.h"

 namespace paddle {
 namespace operators {
@@ -101,11 +101,15 @@ class ListenAndServOp : public framework::OperatorBase {

    // TODO(typhoonzero): change this to a while_op for every cluster-batch.
    bool exit_flag = false;
+    // Record the received sparse variables so that
+    // we can reset them after executing the optimize program.
+    std::vector<framework::Variable *> sparse_vars;
    while (!exit_flag) {
      // Get from multiple trainers, we don't care about the order in which
      // the gradients arrives, just add suffix 0~n and merge the gradient.
      rpc_service_->SetCond(0);
      size_t recv_var_cnt = 0;
+      size_t update_param_cnt = 0;
      int batch_barrier = 0;
      while (batch_barrier != fan_in) {
        const detail::MessageWithName &v = rpc_service_->Get();
@@ -126,13 +130,14 @@ class ListenAndServOp : public framework::OperatorBase {
          std::string param_var_name;
          if (it != grad_list.end()) {
            param_var_name = param_list[it - grad_list.begin()];
+            update_param_cnt++;
+            VLOG(3) << "received grad: " << grad_var_name
+                    << " updating param: " << param_var_name;
          } else {
-            LOG(ERROR) << "grad has no paired param:" << grad_var_name;
+            VLOG(3) << "received variable: " << grad_var_name
+                    << " no need to update param";
          }
-          VLOG(3) << "received grad: " << grad_var_name
-                  << " updating param: " << param_var_name;
-
-          if (fan_in > 1) {
+          if (fan_in > 1 && !param_var_name.empty()) {
            grad_var_name = this->GetGradVarNameForTrainer(grad_var_name);
          }
          auto *var = recv_scope.FindVar(grad_var_name);
@@ -141,23 +146,35 @@ class ListenAndServOp : public framework::OperatorBase {
            PADDLE_THROW("Can not find server side var");
          }
          detail::DeserializeFromMessage(v.second, dev_ctx, var);
+          if (var->IsType<framework::SelectedRows>()) {
+            sparse_vars.push_back(var);
+          }
        }
      }
      VLOG(3) << "recv " << recv_var_cnt << " parmeters for one barrier.";
-      // TODO(Yancey1989): merge SelectedRows variables here
      if (exit_flag) {
        rpc_service_->ShutDown();
      }
-
+      VLOG(3) << "run optimize graph...";
      try {
        executor.Run(*program, &recv_scope, block->ID(), /*global_block*/
                     false /*create_local_scope*/, false /*create_vars*/);
      } catch (std::exception &e) {
        LOG(ERROR) << "run sub program error " << e.what();
      }
+
+      // Reset the received sparse variables; the sum operator does not
+      // sum input sparse variables whose rows are empty in the next
+      // mini-batch.
+      // TODO(Yancey1989): move the reset action into an operator; we should
+      // not have any hidden logic in the operator.
+      for (auto &var : sparse_vars) {
+        var->GetMutable<framework::SelectedRows>()->mutable_rows()->clear();
+      }
      rpc_service_->SetCond(1);
-      rpc_service_->WaitClientGet(recv_var_cnt);
+      rpc_service_->WaitClientGet(update_param_cnt);
      grads_counter_.clear();
+      sparse_vars.clear();
    }  // while(true)
  }


--- a/paddle/fluid/operators/multiclass_nms_op.cc
+++ b/paddle/fluid/operators/multiclass_nms_op.cc
@@ -38,22 +38,22 @@ class MultiClassNMSOp : public framework::OperatorWithKernel {
    auto box_dims = ctx->GetInputDim("BBoxes");
    auto score_dims = ctx->GetInputDim("Scores");

-    PADDLE_ENFORCE_EQ(box_dims.size(), 2,
-                      "The rank of Input(BBoxes) must be 2.");
+    PADDLE_ENFORCE_EQ(box_dims.size(), 3,
+                      "The rank of Input(BBoxes) must be 3.");
    PADDLE_ENFORCE_EQ(score_dims.size(), 3,
                      "The rank of Input(Scores) must be 3.");
-    PADDLE_ENFORCE_EQ(box_dims[1], 4,
+    PADDLE_ENFORCE_EQ(box_dims[2], 4,
                      "The 2nd dimension of Input(BBoxes) must be 4, "
                      "represents the layout of coordinate "
                      "[xmin, ymin, xmax, ymax]");
-    PADDLE_ENFORCE_EQ(box_dims[0], score_dims[2],
+    PADDLE_ENFORCE_EQ(box_dims[1], score_dims[2],
                      "The 1st dimensiong of Input(BBoxes) must be equal to "
                      "3rd dimension of Input(Scores), which represents the "
                      "predicted bboxes.");

    // Here the box_dims[0] is not the real dimension of output.
    // It will be rewritten in the computing kernel.
-    ctx->SetOutputDim("Out", {box_dims[0], 6});
+    ctx->SetOutputDim("Out", {box_dims[1], 6});
  }

 protected:
@@ -260,15 +260,20 @@ class MultiClassNMSKernel : public framework::OpKernel<T> {
    int64_t batch_size = score_dims[0];
    int64_t class_num = score_dims[1];
    int64_t predict_dim = score_dims[2];
+    int64_t box_dim = boxes->dims()[2];

    std::vector<std::map<int, std::vector<int>>> all_indices;
    std::vector<size_t> batch_starts = {0};
    for (int64_t i = 0; i < batch_size; ++i) {
      Tensor ins_score = scores->Slice(i, i + 1);
      ins_score.Resize({class_num, predict_dim});
+
+      Tensor ins_boxes = boxes->Slice(i, i + 1);
+      ins_boxes.Resize({predict_dim, box_dim});
+
      std::map<int, std::vector<int>> indices;
      int num_nmsed_out = 0;
-      MultiClassNMS(ctx, ins_score, *boxes, indices, num_nmsed_out);
+      MultiClassNMS(ctx, ins_score, ins_boxes, indices, num_nmsed_out);
      all_indices.push_back(indices);
      batch_starts.push_back(batch_starts.back() + num_nmsed_out);
    }
@@ -282,11 +287,15 @@ class MultiClassNMSKernel : public framework::OpKernel<T> {
      for (int64_t i = 0; i < batch_size; ++i) {
        Tensor ins_score = scores->Slice(i, i + 1);
        ins_score.Resize({class_num, predict_dim});
+
+        Tensor ins_boxes = boxes->Slice(i, i + 1);
+        ins_boxes.Resize({predict_dim, box_dim});
+
        int64_t s = batch_starts[i];
        int64_t e = batch_starts[i + 1];
        if (e > s) {
          Tensor out = outs->Slice(s, e);
-          MultiClassOutput(ins_score, *boxes, all_indices[i], &out);
+          MultiClassOutput(ins_score, ins_boxes, all_indices[i], &out);
        }
      }
    }
@@ -303,9 +312,9 @@ class MultiClassNMSOpMaker : public framework::OpProtoAndCheckerMaker {
  MultiClassNMSOpMaker(OpProto* proto, OpAttrChecker* op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("BBoxes",
-             "(Tensor) A 2-D Tensor with shape [M, 4] represents the "
-             "predicted locations of M bounding bboxes. Each bounding box "
-             "has four coordinate values and the layout is "
+             "(Tensor) A 3-D Tensor with shape [N, M, 4] represents the "
+             "predicted locations of M bounding boxes, N is the batch size. "
+             "Each bounding box has four coordinate values and the layout is "
             "[xmin, ymin, xmax, ymax].");
    AddInput("Scores",
             "(Tensor) A 3-D Tensor with shape [N, C, M] represents the "

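The kernel change above slices one instance out of the batched `BBoxes` tensor before running NMS. A minimal sketch of that indexing, assuming a flattened row-major `[N, M, box_dim]` buffer held in a `std::vector` instead of a Paddle `Tensor` (the helper name `InstanceBoxes` is hypothetical and not part of Paddle):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns a pointer to the [M, box_dim] block that
// belongs to instance i inside a flattened [N, M, box_dim] buffer.
// This mirrors boxes->Slice(i, i + 1) followed by Resize({M, box_dim}).
const float* InstanceBoxes(const std::vector<float>& boxes, int64_t i,
                           int64_t M, int64_t box_dim) {
  return boxes.data() + i * M * box_dim;
}
```

Each instance therefore sees only its own `M` boxes, which matches the per-instance `ins_score` slice that was already used for `Scores`.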
--- a/paddle/fluid/operators/send_op.cc
+++ b/paddle/fluid/operators/send_op.cc
@@ -24,6 +24,22 @@ limitations under the License. */

 namespace paddle {
 namespace operators {
+static bool IsVariableInitialized(const framework::Scope& scope,
+                                  const std::string& varname) {
+  auto* var = scope.FindVar(varname);
+  PADDLE_ENFORCE_NOT_NULL(var, "Can not find variable '%s' in the send side.",
+                          varname);
+  if (var->IsType<framework::LoDTensor>()) {
+    return var->Get<framework::LoDTensor>().IsInitialized();
+  } else if (var->IsType<framework::SelectedRows>()) {
+    return var->Get<framework::SelectedRows>().value().IsInitialized();
+  } else {
+    PADDLE_THROW(
+        "Variable type in send side should be in "
+        "[LoDTensor, SelectedRows]");
+  }
+  return false;
+}

 class SendOp : public framework::OperatorBase {
 public:
@@ -51,8 +67,12 @@ class SendOp : public framework::OperatorBase {
    detail::RPCClient* rpc_client = client_var->GetMutable<detail::RPCClient>();

    for (size_t i = 0; i < ins.size(); i++) {
-      VLOG(3) << "sending " << ins[i] << " to " << epmap[i];
-      rpc_client->AsyncSendVariable(epmap[i], ctx, scope, ins[i]);
+      if (IsVariableInitialized(scope, ins[i])) {
+        VLOG(3) << "sending " << ins[i] << " to " << epmap[i];
+        rpc_client->AsyncSendVariable(epmap[i], ctx, scope, ins[i]);
+      } else {
+        VLOG(3) << "don't send uninitialized variable: " << ins[i];
+      }
    }
    PADDLE_ENFORCE(rpc_client->Wait());


--- a/paddle/fluid/operators/send_recv_op_test.cc
+++ b/paddle/fluid/operators/send_recv_op_test.cc
@@ -22,7 +22,7 @@ limitations under the License. */
 #include "paddle/fluid/framework/program_desc.h"
 #include "paddle/fluid/operators/math/math_function.h"
 #include "paddle/fluid/operators/math/selected_rows_functor.h"
-#include "paddle/string/printf.h"
+#include "paddle/fluid/string/printf.h"

 USE_NO_KERNEL_OP(send);
 USE_NO_KERNEL_OP(listen_and_serv);

--- a/paddle/fluid/operators/sequence_expand_op.cc
+++ b/paddle/fluid/operators/sequence_expand_op.cc
@@ -29,7 +29,9 @@ class SequenceExpandOp : public framework::OperatorWithKernel {
    PADDLE_ENFORCE(ctx->HasOutput("Out"));
    PADDLE_ENFORCE(ctx->HasInput("Y"));
    framework::DDim out_dim;
-    out_dim = ctx->GetInputDim("Y");
+    auto y_dim = ctx->GetInputDim("Y");
+    out_dim = ctx->GetInputDim("X");
+    out_dim[0] = y_dim[0];
    ctx->ShareLoD("Y", "Out");
    ctx->SetOutputDim("Out", out_dim);
  }

--- a/paddle/fluid/operators/smooth_l1_loss_op.cc
+++ b/paddle/fluid/operators/smooth_l1_loss_op.cc
@@ -44,7 +44,6 @@ class SmoothL1LossOp : public framework::OperatorWithKernel {
  }
 };

-template <typename AttrType>
 class SmoothL1LossOpMaker : public framework::OpProtoAndCheckerMaker {
 public:
  SmoothL1LossOpMaker(OpProto* proto, OpAttrChecker* op_checker)
@@ -73,10 +72,10 @@ class SmoothL1LossOpMaker : public framework::OpProtoAndCheckerMaker {
    AddOutput("Out",
              "(Tensor, default Tensor<float>) A tensor with rank be 2. "
              "The output smooth l1 loss with shape [batch_size, 1].");
-    AddAttr<AttrType>("sigma",
-                      "Hyper parameter of smooth l1 loss op."
-                      "A float scalar with default value 3.0.")
-        .SetDefault(3.0);
+    AddAttr<float>("sigma",
+                   "Hyper parameter of smooth l1 loss op. "
+                   "A float scalar with default value 1.0.")
+        .SetDefault(1.0);
    AddComment(R"DOC(
 Smooth L1 Loss Operator.

@@ -133,9 +132,8 @@ class SmoothL1LossGradOp : public framework::OperatorWithKernel {
 }  // namespace paddle

 namespace ops = paddle::operators;
-REGISTER_OP(smooth_l1_loss, ops::SmoothL1LossOp,
-            ops::SmoothL1LossOpMaker<float>, smooth_l1_loss_grad,
-            ops::SmoothL1LossGradOp);
+REGISTER_OP(smooth_l1_loss, ops::SmoothL1LossOp, ops::SmoothL1LossOpMaker,
+            smooth_l1_loss_grad, ops::SmoothL1LossGradOp);
 REGISTER_OP_CPU_KERNEL(
    smooth_l1_loss,
    ops::SmoothL1LossKernel<paddle::platform::CPUDeviceContext, float>);

--- a/paddle/fluid/operators/split_op.h
+++ b/paddle/fluid/operators/split_op.h
@@ -14,6 +14,7 @@ limitations under the License. */

 #pragma once

+#include <chrono>
 #include <vector>
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/operators/strided_memcpy.h"
@@ -27,18 +28,18 @@ class SplitOpKernel : public framework::OpKernel<T> {
  void Compute(const framework::ExecutionContext& ctx) const override {
    auto* in = ctx.Input<framework::Tensor>("X");
    auto outs = ctx.MultiOutput<framework::Tensor>("Out");
-    auto in_stride = framework::stride(in->dims());
+    auto in_stride = framework::stride_numel(in->dims());
    int64_t axis = static_cast<int64_t>(ctx.Attr<int>("axis"));
-    const size_t n = outs.size();
+    auto place = ctx.GetPlace();
+
    size_t input_offset = 0;
-    for (size_t i = 0; i < n; i++) {
-      auto& out = outs[i];
+    for (auto& out : outs) {
      out->mutable_data<T>(ctx.GetPlace());
-      size_t axis_dim = out->dims()[axis];
-      auto out_stride = framework::stride(out->dims());
-      StridedMemcpy<T>(ctx.device_context(), in->data<T>() + input_offset,
-                       in_stride, out->dims(), out_stride, out->data<T>());
-      input_offset += axis_dim * in_stride[axis];
+      auto out_stride = framework::stride_numel(out->dims());
+      StridedNumelCopyWithAxis<T>(ctx.device_context(), axis, out->data<T>(),
+                                  out_stride, in->data<T>() + input_offset,
+                                  in_stride);
+      input_offset += out_stride[axis];
    }
  }
 };

--- a/paddle/fluid/operators/split_selected_rows_op.cc
+++ b/paddle/fluid/operators/split_selected_rows_op.cc
@@ -22,7 +22,7 @@ class SplitSelectedRowsOpMaker : public framework::OpProtoAndCheckerMaker {
  SplitSelectedRowsOpMaker(OpProto *proto, OpAttrChecker *op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "The input SelectedRows.");
-    AddOutput("Out", "The outputs of input SelectedRows.").AsDuplicable();
+    AddOutput("Out", "The outputs of the input SelectedRows.").AsDuplicable();
    AddAttr<std::vector<int>>("height_sections",
                              "Height for each output SelectedRows.")
        .SetDefault(std::vector<int>({}));
@@ -56,27 +56,6 @@ class SplitSelectedRowsOp : public framework::OperatorWithKernel {
    PADDLE_ENFORCE(ctx->HasInput("X"), "SplitSelectedRowsOp must has input X.");
    PADDLE_ENFORCE(ctx->HasOutputs("Out"),
                   "SplitSelectedRowsOp must has output Out.");
-
-    std::vector<int> height_sections =
-        ctx->Attrs().Get<std::vector<int>>("height_sections");
-    int64_t n = ctx->Outputs("Out").size();
-
-    std::vector<framework::DDim> outs_dims;
-    outs_dims.reserve(n);
-
-    // make output dims
-    for (int64_t i = 0; i < n; ++i) {
-      auto dims = ctx->GetInputDim("X");
-      if (height_sections.size()) {
-        PADDLE_ENFORCE_EQ(
-            height_sections.size(), static_cast<size_t>(n),
-            "The size of height section should be the same with height"
-            " section size.");
-        dims[0] = height_sections[i];
-      }
-      outs_dims.push_back(dims);
-    }
-    ctx->SetOutputsDim("Out", outs_dims);
  }
 };


--- a/paddle/fluid/operators/split_selected_rows_op.h
+++ b/paddle/fluid/operators/split_selected_rows_op.h
@@ -55,6 +55,7 @@ class SplitSelectedRowsOpKernel : public framework::OpKernel<T> {

    for (size_t i = 0; i < outs_rows_idx.size(); ++i) {
      auto rows_idx = outs_rows_idx[i];
+      outs[i]->set_height(height_sections[i]);
      if (rows_idx.size() > 0) {
        auto dims = x->GetCompleteDims();
        dims[0] = rows_idx.size();

--- a/paddle/fluid/operators/strided_memcpy.h
+++ b/paddle/fluid/operators/strided_memcpy.h
@@ -41,5 +41,62 @@ inline void StridedMemcpy(const platform::DeviceContext& dev_ctx, const T* src,
  StridedCopyDimVisitor<T> func(dev_ctx, src, src_stride, dst_stride, dst);
  boost::apply_visitor(func, dst_dim);
 }
+
+// Strided numel memory copy from src to dst along the specified axis.
+//
+// For example, for a tensor with dims [4, 20, 100], the strided numel is
+// [8000, 2000, 100].
+//
+// NOTE: The src and dst tensors should have the same dimensions
+// except along the specified axis.
+template <typename T>
+inline void StridedNumelCopyWithAxis(const platform::DeviceContext& ctx,
+                                     int64_t axis, T* dst,
+                                     const framework::DDim& dst_stride_numel,
+                                     const T* src,
+                                     const framework::DDim& src_stride_numel) {
+  int64_t before = dst_stride_numel[0] / dst_stride_numel[axis];
+  int64_t src_after = src_stride_numel[axis];
+  int64_t dst_after = dst_stride_numel[axis];
+  auto place = ctx.GetPlace();
+
+  PADDLE_ENFORCE_EQ(src_stride_numel.size(), dst_stride_numel.size(),
+                    "src and dst tensor should have the same dims size.");
+
+  for (int64_t i = 0; i < src_stride_numel.size(); ++i) {
+    if (i < axis) {
+      PADDLE_ENFORCE_EQ(src_stride_numel[i] / src_stride_numel[axis],
+                        dst_stride_numel[i] / dst_stride_numel[axis],
+                        "src and dst should have the same elements "
+                        "except the specified axis.");
+    } else if (i == axis) {
+      continue;
+    } else {
+      PADDLE_ENFORCE_EQ(src_stride_numel[i], dst_stride_numel[i],
+                        "src and dst should have the same elements "
+                        "except the specified axis.");
+    }
+  }
+
+  for (int64_t i = 0; i < before; ++i) {
+    if (platform::is_cpu_place(place)) {
+      auto& cpu_place = boost::get<platform::CPUPlace>(place);
+      memory::Copy(cpu_place, dst + i * dst_after, cpu_place,
+                   src + i * src_after, sizeof(T) * src_after);
+    } else {
+#ifdef PADDLE_WITH_CUDA
+      auto& gpu_place = boost::get<platform::CUDAPlace>(place);
+      auto& cuda_ctx =
+          reinterpret_cast<const platform::CUDADeviceContext&>(ctx);
+      memory::Copy(gpu_place, dst + i * dst_after, gpu_place,
+                   src + i * src_after, sizeof(T) * src_after,
+                   cuda_ctx.stream());
+#else
+      PADDLE_THROW("Paddle is not compiled with GPU");
+#endif
+    }
+  }
+}
+
 }  // namespace operators
 }  // namespace paddle
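As a rough illustration of the strided-numel copy used by the split kernel, here is a standalone CPU-only sketch that uses `std::vector` in place of Paddle tensors. The names `StrideNumel`, `CopyWithAxis` and `SplitDemo` are hypothetical and are not Paddle APIs. Unlike the signature in the diff, this sketch passes the per-chunk element count (`chunk`) explicitly; that is an assumption of the example, chosen so the copy never writes past the smaller of the two buffers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: stride_numel[i] is the product of dims[i..end].
// For dims [4, 20, 100] this yields [8000, 2000, 100].
std::vector<int64_t> StrideNumel(const std::vector<int64_t>& dims) {
  std::vector<int64_t> s(dims.size());
  int64_t acc = 1;
  for (int i = static_cast<int>(dims.size()) - 1; i >= 0; --i) {
    acc *= dims[i];
    s[i] = acc;
  }
  return s;
}

// Hypothetical CPU analogue of the copy loop: for every index combination
// in front of `axis`, copy `chunk` contiguous elements from src to dst.
template <typename T>
void CopyWithAxis(int64_t axis, T* dst, const std::vector<int64_t>& dst_sn,
                  const T* src, const std::vector<int64_t>& src_sn,
                  int64_t chunk) {
  const int64_t before = dst_sn[0] / dst_sn[axis];
  for (int64_t i = 0; i < before; ++i) {
    std::memcpy(dst + i * dst_sn[axis], src + i * src_sn[axis],
                sizeof(T) * chunk);
  }
}

// Splits a [2, 4] tensor along axis 1 into two [2, 2] tensors.
std::vector<std::vector<int>> SplitDemo() {
  const std::vector<int> in = {0, 1, 2, 3, 4, 5, 6, 7};
  const auto in_sn = StrideNumel({2, 4});   // [8, 4]
  const auto out_sn = StrideNumel({2, 2});  // [4, 2]
  std::vector<std::vector<int>> outs(2, std::vector<int>(4));
  int64_t offset = 0;
  for (auto& out : outs) {
    CopyWithAxis<int>(1, out.data(), out_sn, in.data() + offset, in_sn,
                      out_sn[1]);
    offset += out_sn[1];  // advance along the split axis
  }
  return outs;
}
```

In this sketch the first output receives columns 0 and 1 of every row and the second receives columns 2 and 3, which is the same offset bookkeeping (`input_offset += out_stride[axis]`) that the split kernel performs.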
--- a/paddle/fluid/operators/sum_op.h
+++ b/paddle/fluid/operators/sum_op.h
@@ -116,7 +116,9 @@ class SumKernel : public framework::OpKernel<T> {
      int64_t offset = 0;
      for (int i = 0; i < N; i++) {
        auto &sel_row = get_selected_row(i);
-
+        if (!sel_row.value().IsInitialized() || sel_row.rows().size() == 0) {
+          continue;
+        }
        PADDLE_ENFORCE_EQ(out->height(), sel_row.height());
        functor(context.template device_context<DeviceContext>(), sel_row,
                offset, out);

--- a/paddle/fluid/operators/target_assign_op.cc
+++ b/paddle/fluid/operators/target_assign_op.cc
@@ -22,69 +22,43 @@ class TargetAssignOp : public framework::OperatorWithKernel {
  using framework::OperatorWithKernel::OperatorWithKernel;

  void InferShape(framework::InferShapeContext* ctx) const override {
-    // checkout inputs
-    PADDLE_ENFORCE(ctx->HasInput("EncodedGTBBox"),
-                   "Input(EncodedGTBBox) of TargetAssignOp should not be null");
-    PADDLE_ENFORCE(ctx->HasInput("GTScoreLabel"),
-                   "Input(GTScoreLabel) of TargetAssignOp should not be null");
+    PADDLE_ENFORCE(ctx->HasInput("X"),
+                   "Input(X) of TargetAssignOp should not be null");
    PADDLE_ENFORCE(ctx->HasInput("MatchIndices"),
                   "Input(MatchIndices) of TargetAssignOp should not be null");
-    PADDLE_ENFORCE(ctx->HasInput("NegIndices"),
-                   "Input(NegIndices) of TargetAssignOp should not be null");
-
-    // checkout outputs
-    PADDLE_ENFORCE(
-        ctx->HasOutput("PredBBoxLabel"),
-        "Output(PredBBoxLabel) of TargetAssignOp should not be null.");
-    PADDLE_ENFORCE(
-        ctx->HasOutput("PredBBoxWeight"),
-        "Output(PredBBoxWeight) of TargetAssignOp should not be null.");
-    PADDLE_ENFORCE(
-        ctx->HasOutput("PredScoreLabel"),
-        "Output(PredScoreLabel) of TargetAssignOp should not be null.");
-    PADDLE_ENFORCE(
-        ctx->HasOutput("PredScoreWeight"),
-        "Output(PredScoreWeight) of TargetAssignOp should not be null.");
-
-    auto blabel_dims = ctx->GetInputDim("EncodedGTBBox");
-    auto slabel_dims = ctx->GetInputDim("GTScoreLabel");
+
+    PADDLE_ENFORCE(ctx->HasOutput("Out"),
+                   "Output(Out) of TargetAssignOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("OutWeight"),
+                   "Output(OutWeight) of TargetAssignOp should not be null.");
+
+    auto in_dims = ctx->GetInputDim("X");
    auto mi_dims = ctx->GetInputDim("MatchIndices");
-    auto neg_dims = ctx->GetInputDim("NegIndices");

-    PADDLE_ENFORCE_EQ(blabel_dims.size(), 3UL,
-                      "The rank of Input(EncodedGTBBox) must be 3.");
-    PADDLE_ENFORCE_EQ(slabel_dims.size(), 2UL,
-                      "The rank of Input(GTScoreLabel) must be 2.");
-    PADDLE_ENFORCE_EQ(mi_dims.size(), 2UL,
+    PADDLE_ENFORCE_EQ(in_dims.size(), 3, "The rank of Input(X) must be 3.");
+    PADDLE_ENFORCE_EQ(mi_dims.size(), 2,
                      "The rank of Input(MatchIndices) must be 2.");
-    PADDLE_ENFORCE_EQ(neg_dims.size(), 2UL,
-                      "The rank of Input(NegIndices) must be 2.");
-
-    PADDLE_ENFORCE_EQ(blabel_dims[0], slabel_dims[0],
-                      "The 1st dimension (means the total number of "
-                      "ground-truth bounding boxes) of Input(EncodedGTBBox) "
-                      "and Input(GTScoreLabel) must be the same.");
-    PADDLE_ENFORCE_EQ(blabel_dims[1], mi_dims[1],
-                      "The 2nd dimension (means the number of priod boxes) "
-                      "of Input(EncodedGTBBox) and "
-                      "Input(MatchIndices) must be the same.");
-    PADDLE_ENFORCE_EQ(blabel_dims[2], 4,
-                      "The 3rd dimension of Input(EncodedGTBBox) must be 4.");
+
+    if (ctx->HasInput("NegIndices")) {
+      auto neg_dims = ctx->GetInputDim("NegIndices");
+      PADDLE_ENFORCE_EQ(neg_dims.size(), 2,
+                        "The rank of Input(NegIndices) must be 2.");
+      PADDLE_ENFORCE_EQ(neg_dims[1], 1,
+                        "The last dimension of Input(NegIndices) must be 1.");
+    }

    auto n = mi_dims[0];
-    auto np = mi_dims[1];
-    ctx->SetOutputDim("PredBBoxLabel", {n, np, 4});
-    ctx->SetOutputDim("PredBBoxWeight", {n, np, 1});
-    ctx->SetOutputDim("PredScoreLabel", {n, np, 1});
-    ctx->SetOutputDim("PredScoreWeight", {n, np, 1});
+    auto m = mi_dims[1];
+    auto k = in_dims[in_dims.size() - 1];
+    ctx->SetOutputDim("Out", {n, m, k});
+    ctx->SetOutputDim("OutWeight", {n, m, 1});
  }

 protected:
  framework::OpKernelType GetExpectedKernelType(
      const framework::ExecutionContext& ctx) const override {
    return framework::OpKernelType(
-        framework::ToDataType(
-            ctx.Input<framework::LoDTensor>("EncodedGTBBox")->type()),
+        framework::ToDataType(ctx.Input<framework::LoDTensor>("X")->type()),
        ctx.device_context());
  }
 };
@@ -93,102 +67,87 @@ class TargetAssignOpMaker : public framework::OpProtoAndCheckerMaker {
 public:
  TargetAssignOpMaker(OpProto* proto, OpAttrChecker* op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
-    AddInput("EncodedGTBBox",
-             "(LoDTensor), The encoded ground-truth bounding boxes with shape "
-             "[Ng, Np, 4], where Ng is the total number of ground-truth boxes "
-             "in this mini-batch, Np the number of predictions, 4 is the "
-             "number of coordinate in [xmin, ymin, xmax, ymax] layout.");
-    AddInput("GTScoreLabel",
-             "(LoDTensor, default LoDTensor<int>),  The input ground-truth "
-             "labels with shape [Ng, 1], where the Ng is the same as it in "
-             "the input of EncodedGTBBox.");
+    AddInput("X",
+             "(LoDTensor), This input is a 3D LoDTensor with shape [M, P, K]. "
+             "Some elements in X will be assigned to Out based on the "
+             "MatchIndices and NegIndices.");
    AddInput("MatchIndices",
             "(Tensor, default Tensor<int>), The input matched indices "
-             "with shape [N, Np], where N is the batch size, Np is the same "
-             "as it in the input of EncodedGTBBox. If MatchIndices[i][j] "
-             "is -1, the j-th prior box is not matched to any ground-truh "
-             "box in i-th instance.");
+             "with shape [N, P]. If MatchIndices[i][j] is -1, the j-th entity "
+             "of column is not matched to any entity of row in i-th instance.");
    AddInput("NegIndices",
             "(LoDTensor, default LoDTensor<int>), The input negative example "
-             "indices with shape [Neg, 1], where is the total number of "
-             "negative example indices.");
-    AddAttr<int>("background_label",
-                 "(int, default 0), Label index of background class.")
+             "indices are an optional input with shape [Neg, 1], where Neg is "
+             "the total number of negative example indices.")
+        .AsDispensable();
+    AddAttr<int>("mismatch_value",
+                 "(int, default 0), Fill this value to the "
+                 "mismatched location.")
        .SetDefault(0);
-    AddOutput("PredBBoxLabel",
-              "(Tensor), The output encoded ground-truth labels "
-              "with shape [N, Np, 4], N is the batch size and Np, 4 is the "
-              "same as they in input of EncodedGTBBox. If MatchIndices[i][j] "
-              "is -1, the PredBBoxLabel[i][j][:] is the encoded ground-truth "
-              "box for background_label in i-th instance.");
-    AddOutput("PredBBoxWeight",
-              "(Tensor), The weight for PredBBoxLabel with the shape "
-              "of [N, Np, 1]");
-    AddOutput("PredScoreLabel",
-              "(Tensor, default Tensor<int>), The output score labels for "
-              "each predictions with shape [N, Np, 1]. If MatchIndices[i][j] "
-              "is -1, PredScoreLabel[i][j] = background_label.");
-    AddOutput("PredScoreWeight",
-              "(Tensor), The weight for PredScoreLabel with the shape "
-              "of [N, Np, 1]");
+    AddOutput("Out",
+              "(Tensor), The output is a 3D Tensor with shape [N, P, K], "
+              "N and P are the same as they are in MatchIndices, K is the "
+              "same as it is in Input(X). If MatchIndices[i][j] "
+              "is -1, the Out[i][j][0 : K] is the mismatch_value.");
+    AddOutput("OutWeight",
+              "(Tensor), The weight for output with the shape of [N, P, 1]");
    AddComment(R"DOC(
-This operator is, for given the encoded boxes between prior boxes and
-ground-truth boxes and ground-truth class labels, to assign classification
-and regression targets to each prior box as well as weights to each
-prior box. The weights is used to specify which prior box would not contribute
-to training loss.
-
-For each instance, the output `PredBBoxLabel`, `PredBBoxWeight`,
-`PredScoreLabel` and `PredScoreWeight` are assigned based on `MatchIndices`.
-Assumed that the row offset for each instance in `EncodedGTBBox` is called lod,
-this operato assigns classification/regression targets by performing the
+Given the target bounding boxes or labels, this operator assigns
+classification and regression targets to each prediction, as well as a weight
+for each prediction. The weights specify which predictions do not contribute
+to the training loss.
+
+For each instance, the outputs `Out` and `OutWeight` are assigned based on
+`MatchIndices` and `NegIndices`.
+Assuming that the row offset for each instance in `X` is called lod,
+this operator assigns classification/regression targets by performing the
 following steps:

 1. Assigning all outpts based on `MatchIndices`:

 If id = MatchIndices[i][j] > 0,

-    PredBBoxLabel[i][j] = EncodedGTBBox[lod[i] + id][j]
-    PredBBoxWeight[i][j] = 1.
-    PredScoreLabel[i][j] = GTScoreLabel[lod[i] + id]
-    PredScoreWeight[i][j] = 1.
+    Out[i][j][0 : K] = X[lod[i] + id][j % P][0 : K]
+    OutWeight[i][j] = 1.

 Otherwise, 

-    PredBBoxLabel[j][j] = [0., 0., 0., 0.]
-    PredBBoxWeight[i][j] = 0.
-    PredScoreLabel[i][j] = background_label
-    PredScoreWeight[i][j] = 0.
+    Out[i][j][0 : K] = {mismatch_value, mismatch_value, ...}
+    OutWeight[i][j] = 0.

-2. Assigning PredScoreWeight based on `NegIndices`:
+2. Assigning OutWeight based on `NegIndices` if `NegIndices` is provided:

-Assumed that the row offset for each instance in `NegIndices` is caleed neg_lod,
-for i-th instance and all ids of NegIndices in this instance:
+Assumed that the row offset for each instance in `NegIndices` is called neg_lod,
+for i-th instance and each `id` of NegIndices in this instance:

-    PredScoreLabel[i][id] = background_label
-    PredScoreWeight[i][id] = 1.0
+    Out[i][id][0 : K] = {mismatch_value, mismatch_value, ...}
+    OutWeight[i][id] = 1.0

    )DOC");
  }
 };

-template <typename T>
-struct NegTargetAssignFunctor<platform::CPUDeviceContext, T> {
+template <typename T, typename WT>
+struct NegTargetAssignFunctor<platform::CPUDeviceContext, T, WT> {
  void operator()(const platform::CPUDeviceContext& ctx, const int* neg_indices,
-                  const size_t* lod, const int num, const int num_prior_box,
-                  const int background_label, int* out_label, T* out_label_wt) {
-    for (int i = 0; i < num; ++i) {
+                  const size_t* lod, const int N, const int M, const int K,
+                  const int mismatch_value, T* out, WT* out_wt) {
+    for (int i = 0; i < N; ++i) {
      for (size_t j = lod[i]; j < lod[i + 1]; ++j) {
        int id = neg_indices[j];
-        out_label[i * num_prior_box + id] = background_label;
-        out_label_wt[i * num_prior_box + id] = static_cast<T>(1.0);
+        int off = (i * M + id) * K;
+        for (int k = 0; k < K; ++k) {
+          out[off + k] = mismatch_value;
+          out_wt[off + k] = static_cast<WT>(1.0);
+        }
      }
    }
  }
 };

-template struct NegTargetAssignFunctor<platform::CPUDeviceContext, float>;
-template struct NegTargetAssignFunctor<platform::CPUDeviceContext, double>;
+template struct NegTargetAssignFunctor<platform::CPUDeviceContext, int, float>;
+template struct NegTargetAssignFunctor<platform::CPUDeviceContext, float,
+                                       float>;

 }  // namespace operators
 }  // namespace paddle
@@ -198,5 +157,5 @@ REGISTER_OP_WITHOUT_GRADIENT(target_assign, ops::TargetAssignOp,
                             ops::TargetAssignOpMaker);
 REGISTER_OP_CPU_KERNEL(
    target_assign,
-    ops::TargetAssignKernel<paddle::platform::CPUDeviceContext, float>,
-    ops::TargetAssignKernel<paddle::platform::CPUDeviceContext, double>);
+    ops::TargetAssignKernel<paddle::platform::CPUDeviceContext, int, float>,
+    ops::TargetAssignKernel<paddle::platform::CPUDeviceContext, float, float>);
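The matching step described in the operator comment can be sketched outside of Paddle as follows. This is a simplified CPU-only illustration under stated assumptions: tensors are flattened row-major `std::vector`s, the element type is fixed to `float`, and the number of columns in `MatchIndices` equals `P` (so the `j % P` wrap is a no-op). `AssignResult` and `TargetAssignSketch` are hypothetical names, and the `NegIndices` step is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result holder: out is [N, P, K], out_wt is [N, P, 1].
struct AssignResult {
  std::vector<float> out;
  std::vector<float> out_wt;
};

// x:             flattened [M, P, K]; rows of instance i start at lod[i]
// match_indices: flattened [N, P]; -1 means "no match"
AssignResult TargetAssignSketch(const std::vector<float>& x,
                                const std::vector<int>& match_indices,
                                const std::vector<size_t>& lod, int64_t N,
                                int64_t P, int64_t K, int mismatch_value) {
  AssignResult r;
  // Start from the "mismatched" state: fill value and zero weight.
  r.out.assign(static_cast<size_t>(N * P * K),
               static_cast<float>(mismatch_value));
  r.out_wt.assign(static_cast<size_t>(N * P), 0.0f);
  for (int64_t i = 0; i < N; ++i) {
    for (int64_t j = 0; j < P; ++j) {
      const int id = match_indices[i * P + j];
      if (id < 0) continue;  // unmatched: keep mismatch_value and weight 0
      const int64_t row = static_cast<int64_t>(lod[i]) + id;
      const float* in = x.data() + (row * P + j) * K;
      for (int64_t k = 0; k < K; ++k) {
        r.out[(i * P + j) * K + k] = in[k];
      }
      r.out_wt[i * P + j] = 1.0f;
    }
  }
  return r;
}
```

With `N = 2`, `P = 2`, `K = 2`, `lod = {0, 1, 3}`, `x` holding the values 0 to 11 and `match_indices = {0, -1, 1, 0}`, the unmatched slot keeps `mismatch_value` with weight 0, while every matched slot copies `K` values from the selected row of `x` and gets weight 1.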
--- a/paddle/fluid/operators/target_assign_op.cu
+++ b/paddle/fluid/operators/target_assign_op.cu
@@ -17,39 +17,41 @@ limitations under the License. */
 namespace paddle {
 namespace operators {

-template <typename T>
+template <typename T, typename WT>
 __global__ void NegTargetAssignKernel(const int* neg_indices, const size_t* lod,
-                                      const int num, const int num_prior_box,
-                                      const int background_label,
-                                      int* out_label, T* out_label_wt) {
+                                      const int N, const int M, const int K,
+                                      const int mismatch_value, T* out,
+                                      WT* out_wt) {
  int bidx = blockIdx.x;
  int st = lod[bidx];
  int ed = lod[bidx + 1];

-  int row_start = bidx * num_prior_box;
+  int row_start = bidx * M;
  for (int i = st + threadIdx.x; i < ed; i += blockDim.x) {
    int id = row_start + neg_indices[i];
-    out_label[id] = background_label;
-    out_label_wt[id] = 1.;
+    for (int k = 0; k < K; ++k) {
+      out[id * K + k] = T(mismatch_value);
+      out_wt[id * K + k] = WT(1.);
+    }
  }
 }

-template <typename T>
-struct NegTargetAssignFunctor<platform::CUDADeviceContext, T> {
+template <typename T, typename WT>
+struct NegTargetAssignFunctor<platform::CUDADeviceContext, T, WT> {
  void operator()(const platform::CUDADeviceContext& ctx,
-                  const int* neg_indices, const size_t* lod, const int num,
-                  const int num_prior_box, const int background_label,
-                  int* out_label, T* out_label_wt) {
+                  const int* neg_indices, const size_t* lod, const int N,
+                  const int M, const int K, const int mismatch_value, T* out,
+                  WT* out_wt) {
    const int block_size = 256;
-    const int grid_size = num;
-    NegTargetAssignKernel<T><<<grid_size, block_size, 0, ctx.stream()>>>(
-        neg_indices, lod, num, num_prior_box, background_label, out_label,
-        out_label_wt);
+    const int grid_size = N;
+    NegTargetAssignKernel<T, WT><<<grid_size, block_size, 0, ctx.stream()>>>(
+        neg_indices, lod, N, M, K, mismatch_value, out, out_wt);
  }
 };

-template struct NegTargetAssignFunctor<platform::CUDADeviceContext, float>;
-template struct NegTargetAssignFunctor<platform::CUDADeviceContext, double>;
+template struct NegTargetAssignFunctor<platform::CUDADeviceContext, int, float>;
+template struct NegTargetAssignFunctor<platform::CUDADeviceContext, float,
+                                       float>;

 }  // namespace operators
 }  // namespace paddle
@@ -57,5 +59,5 @@ template struct NegTargetAssignFunctor<platform::CUDADeviceContext, double>;
 namespace ops = paddle::operators;
 REGISTER_OP_CUDA_KERNEL(
    target_assign,
-    ops::TargetAssignKernel<paddle::platform::CUDADeviceContext, float>,
-    ops::TargetAssignKernel<paddle::platform::CUDADeviceContext, double>);
+    ops::TargetAssignKernel<paddle::platform::CUDADeviceContext, int, float>,
+    ops::TargetAssignKernel<paddle::platform::CUDADeviceContext, float, float>);
--- a/paddle/fluid/operators/target_assign_op.h
+++ b/paddle/fluid/operators/target_assign_op.h
@@ -19,140 +19,113 @@ limitations under the License. */

 namespace paddle {
 namespace operators {
-
-template <typename T>
+template <typename T, typename WT>
 struct TargetAssignFunctor {
-  const T* gt_box_;
-  const int* gt_label_;
+  const T* in_;
  const int* match_indices_;
  const size_t* lod_;
-  const int background_label_;
-  const int64_t num_;
-  const int64_t num_prior_box_;
-
-  T* out_box_;
-  T* out_box_wt_;
-  int* out_label_;
-  T* out_label_wt_;
-
-  TargetAssignFunctor(const T* gt_box, const int* gt_label,
-                      const int* match_indices, const size_t* lod,
-                      const int background_label, const int64_t num,
-                      const int64_t np, T* out_box, T* out_box_wt,
-                      int* out_label, T* out_label_wt)
-      : gt_box_(gt_box),
-        gt_label_(gt_label),
+  const int mismatch_value_;
+  const int64_t N_;
+  const int64_t M_;
+  const int64_t P_;
+  const int64_t K_;
+
+  T* out_;
+  WT* out_wt_;
+
+  TargetAssignFunctor(const T* input, const int* match_indices,
+                      const size_t* lod, const int mismatch_value,
+                      const int64_t N, const int64_t M, const int64_t P,
+                      const int64_t K, T* out, WT* out_wt)
+      : in_(input),
        match_indices_(match_indices),
        lod_(lod),
-        background_label_(background_label),
-        num_(num),
-        num_prior_box_(np),
-        out_box_(out_box),
-        out_box_wt_(out_box_wt),
-        out_label_(out_label),
-        out_label_wt_(out_label_wt) {}
+        mismatch_value_(mismatch_value),
+        N_(N),
+        M_(M),
+        P_(P),
+        K_(K),
+        out_(out),
+        out_wt_(out_wt) {}

  HOSTDEVICE void operator()(size_t i) const {
-    int row = i / num_prior_box_;
-    int col = i - row * num_prior_box_;
+    int h = i / M_;
+    int w = i - h * M_;

-    size_t row_off = lod_[row];
-    int offset = row * num_prior_box_ + col;
+    size_t off = lod_[h];
+    int id = match_indices_[i];

-    int id = match_indices_[offset];
-    T* obox = out_box_ + offset * 4;
-    int* olabel = out_label_ + offset;
-    T* obox_wt = out_box_wt_ + offset;
-    T* olabel_wt = out_label_wt_ + offset;
+    T* out = out_ + i * K_;
+    WT* out_wt = out_wt_ + i;

    if (id > -1) {
-      const T* gtbox = gt_box_ + ((row_off + id) * num_prior_box_ + col) * 4;
-
-      obox[0] = gtbox[0];
-      obox[1] = gtbox[1];
-      obox[2] = gtbox[2];
-      obox[3] = gtbox[3];
-
-      olabel[0] = gt_label_[row_off + id];
-      obox_wt[0] = static_cast<T>(1.);
-      olabel_wt[0] = static_cast<T>(1.);
+      int w_off = w % P_;
+      const T* in = in_ + ((off + id) * P_ + w_off) * K_;
+      for (int64_t k = 0; k < K_; ++k) {
+        out[k] = in[k];
+      }
+      out_wt[0] = static_cast<WT>(1.);
    } else {
-      obox[0] = static_cast<T>(0.);
-      obox[1] = static_cast<T>(0.);
-      obox[2] = static_cast<T>(0.);
-      obox[3] = static_cast<T>(0.);
-
-      olabel[0] = background_label_;
-      obox_wt[0] = static_cast<T>(0.);
-      olabel_wt[0] = static_cast<T>(0.);
+      for (int64_t k = 0; k < K_; ++k) {
+        out[k] = static_cast<T>(mismatch_value_);
+      }
+      out_wt[0] = static_cast<WT>(0.);
    }
  }
 };

-template <typename DeviceContext, typename T>
+template <typename DeviceContext, typename T, typename WT>
 struct NegTargetAssignFunctor {
  void operator()(const platform::DeviceContext& ctx, const int* neg_indices,
-                  const size_t* lod, const int num, const int num_prior_box,
-                  const int background_label, int* out_label,
-                  T* out_label_wt) const;
+                  const size_t* lod, const int N, const int M, const int K,
+                  const int mismatch_value, T* out, WT* out_wt) const;
 };

-template <typename DeviceContext, typename T>
+template <typename DeviceContext, typename T, typename WT>
 class TargetAssignKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
-    auto* enc_gt_box = ctx.Input<framework::LoDTensor>("EncodedGTBBox");
-    auto* gt_label = ctx.Input<framework::LoDTensor>("GTScoreLabel");
+    auto* x = ctx.Input<framework::LoDTensor>("X");
    auto* match_indices = ctx.Input<framework::Tensor>("MatchIndices");
-    auto* neg_indices = ctx.Input<framework::LoDTensor>("NegIndices");
-
-    auto* out_box = ctx.Output<framework::Tensor>("PredBBoxLabel");
-    auto* out_box_wt = ctx.Output<framework::Tensor>("PredBBoxWeight");
-    auto* out_label = ctx.Output<framework::Tensor>("PredScoreLabel");
-    auto* out_label_wt = ctx.Output<framework::Tensor>("PredScoreWeight");

-    PADDLE_ENFORCE_EQ(enc_gt_box->lod().size(), 1UL);
-    PADDLE_ENFORCE_EQ(gt_label->lod().size(), 1UL);
-    PADDLE_ENFORCE_EQ(neg_indices->lod().size(), 1UL);
+    auto* out = ctx.Output<framework::Tensor>("Out");
+    auto* out_wt = ctx.Output<framework::Tensor>("OutWeight");

-    int background_label = ctx.Attr<int>("background_label");
+    PADDLE_ENFORCE_EQ(x->lod().size(), 1UL);
+    int mismatch_value = ctx.Attr<int>("mismatch_value");

-    const T* box_data = enc_gt_box->data<T>();
-    const int* label_data = gt_label->data<int>();
+    const T* x_data = x->data<T>();
    const int* match_idx_data = match_indices->data<int>();
-    const int* neg_idx_data = neg_indices->data<int>();

-    T* obox_data = out_box->mutable_data<T>(ctx.GetPlace());
-    T* obox_wt_data = out_box_wt->mutable_data<T>(ctx.GetPlace());
-    int* olabel_data = out_label->mutable_data<int>(ctx.GetPlace());
-    T* olabel_wt_data = out_label_wt->mutable_data<T>(ctx.GetPlace());
+    T* out_data = out->mutable_data<T>(ctx.GetPlace());
+    WT* out_wt_data = out_wt->mutable_data<WT>(ctx.GetPlace());

-    int64_t num = match_indices->dims()[0];
-    int64_t num_prior_box = match_indices->dims()[1];
+    int64_t n = match_indices->dims()[0];
+    int64_t m = match_indices->dims()[1];
+    int64_t p = x->dims()[1];
+    int64_t k = x->dims()[2];

-    auto gt_lod = enc_gt_box->lod().back();
-    auto gt_label_lod = gt_label->lod().back();
-    auto neg_lod = neg_indices->lod().back();
-    for (size_t i = 0; i < gt_lod.size(); ++i) {
-      PADDLE_ENFORCE_EQ(gt_lod.data()[i], gt_label_lod.data()[i]);
-    }
-
-    size_t* gt_lod_data = gt_lod.MutableData(ctx.GetPlace());
-    size_t* neg_lod_data = neg_lod.MutableData(ctx.GetPlace());
+    auto x_lod = x->lod().back();
+    size_t* x_lod_data = x_lod.MutableData(ctx.GetPlace());

-    TargetAssignFunctor<T> functor(box_data, label_data, match_idx_data,
-                                   gt_lod_data, background_label, num,
-                                   num_prior_box, obox_data, obox_wt_data,
-                                   olabel_data, olabel_wt_data);
+    TargetAssignFunctor<T, WT> functor(x_data, match_idx_data, x_lod_data,
+                                       mismatch_value, n, m, p, k, out_data,
+                                       out_wt_data);

    auto& device_ctx = ctx.template device_context<DeviceContext>();
-    platform::ForRange<DeviceContext> for_range(device_ctx,
-                                                num * num_prior_box);
+    platform::ForRange<DeviceContext> for_range(device_ctx, n * m);
    for_range(functor);

-    NegTargetAssignFunctor<DeviceContext, T> neg_trg_functor;
-    neg_trg_functor(device_ctx, neg_idx_data, neg_lod_data, num, num_prior_box,
-                    background_label, olabel_data, olabel_wt_data);
+    auto* neg_indices = ctx.Input<framework::LoDTensor>("NegIndices");
+    if (neg_indices) {
+      PADDLE_ENFORCE_EQ(neg_indices->lod().size(), 1UL);
+      const int* neg_idx_data = neg_indices->data<int>();
+      auto neg_lod = neg_indices->lod().back();
+      size_t* neg_lod_data = neg_lod.MutableData(ctx.GetPlace());
+      NegTargetAssignFunctor<DeviceContext, T, WT> neg_trg_functor;
+      neg_trg_functor(device_ctx, neg_idx_data, neg_lod_data, n, m, k,
+                      mismatch_value, out_data, out_wt_data);
+    }
  }
 };


--- a/paddle/fluid/platform/cpu_info_test.cc
+++ b/paddle/fluid/platform/cpu_info_test.cc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 #include "paddle/fluid/platform/cpu_info.h"
-#include "paddle/string/printf.h"
+#include "paddle/fluid/string/printf.h"

 #include <ostream>
 #include <sstream>

--- a/paddle/fluid/platform/enforce.h
+++ b/paddle/fluid/platform/enforce.h
@@ -23,8 +23,8 @@ limitations under the License. */
 #include <string>

 #include "paddle/fluid/platform/macros.h"
-#include "paddle/string/printf.h"
-#include "paddle/string/to_string.h"
+#include "paddle/fluid/string/printf.h"
+#include "paddle/fluid/string/to_string.h"

 #ifdef __GNUC__
 #include <cxxabi.h>  // for __cxa_demangle

--- a/paddle/fluid/platform/enforce_test.cc
+++ b/paddle/fluid/platform/enforce_test.cc
@@ -15,7 +15,7 @@ limitations under the License. */

 #include "gtest/gtest.h"
 #include "paddle/fluid/platform/enforce.h"
-#include "paddle/string/piece.h"
+#include "paddle/fluid/string/piece.h"

 using StringPiece = paddle::string::Piece;
 using paddle::string::HasPrefix;

--- a/paddle/fluid/pybind/pybind.cc
+++ b/paddle/fluid/pybind/pybind.cc
@@ -35,7 +35,7 @@ limitations under the License. */
 #include "paddle/fluid/pybind/exception.h"
 #include "paddle/fluid/pybind/pybind.h"
 #include "paddle/fluid/pybind/tensor_py.h"
-#include "paddle/string/to_string.h"
+#include "paddle/fluid/string/to_string.h"

 #ifdef PADDLE_WITH_CUDA
 #include "paddle/fluid/operators/nccl/nccl_gpu_common.h"

--- a/paddle/string/.clang-format
+++ b/paddle/string/.clang-format
--- a/paddle/string/CMakeLists.txt
+++ b/paddle/string/CMakeLists.txt
--- a/paddle/string/piece.cc
+++ b/paddle/string/piece.cc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.

-#include "paddle/string/piece.h"
+#include "piece.h"

 #include <string.h>


--- a/paddle/string/piece.h
+++ b/paddle/string/piece.h
@@ -28,7 +28,7 @@ namespace string {
 // its syntax is simple as it doesn't own/manage the string, it is
 // cheap to construct Pieces and pass them around.
 class Piece {
-public:
+ public:
  static const size_t npos = static_cast<size_t>(-1);

  // We provide non-explicit singleton constructors so users can
@@ -55,7 +55,7 @@ public:
  // Return a string that contains the copy of the referenced data.
  std::string ToString() const { return std::string(data_, size_); }

-private:
+ private:
  const char* data_;
  size_t size_;


--- a/paddle/string/piece_test.cc
+++ b/paddle/string/piece_test.cc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.

-#include "paddle/string/piece.h"
+#include "paddle/fluid/string/piece.h"

 #include <sstream>


--- a/paddle/string/printf.h
+++ b/paddle/string/printf.h
@@ -71,7 +71,7 @@

 #include <iostream>
 #include <sstream>
-#include "paddle/string/tinyformat/tinyformat.h"  // https://github.com/c42f/tinyformat
+#include "tinyformat/tinyformat.h"  // https://github.com/c42f/tinyformat

 namespace paddle {
 namespace string {

--- a/paddle/string/printf_test.cc
+++ b/paddle/string/printf_test.cc
@@ -11,7 +11,7 @@
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.
-#include "paddle/string/printf.h"
+#include "printf.h"

 #include <string>

@@ -24,6 +24,6 @@ TEST(StringPrintf, StringPrintf) {
  long hour = 14;
  int min = 44;
  EXPECT_EQ(std::string("Wednesday, July 27, 14:44"),
-            paddle::string::Sprintf(
-                "%s, %s %d, %.2d:%.2d", weekday, month, day, hour, min));
+            paddle::string::Sprintf("%s, %s %d, %.2d:%.2d", weekday, month, day,
+                                    hour, min));
 }
--- a/paddle/string/tinyformat/tinyformat.h
+++ b/paddle/string/tinyformat/tinyformat.h
@@ -147,7 +147,7 @@ namespace detail {
 // Test whether type T1 is convertible to type T2
 template <typename T1, typename T2>
 struct is_convertible {
-private:
+ private:
  // two types of different size
  struct fail {
    char dummy[2];
@@ -160,7 +160,7 @@ private:
  static succeed tryConvert(const T2 &);
  static const T1 &makeT1();

-public:
+ public:
  // Standard trick: the (...) version of tryConvert will be chosen from
  // the overload set only if the version taking a T2 doesn't match.
  // Then we compare the sizes of the return types to check which
@@ -170,8 +170,7 @@ public:

 // Format the value by casting to type fmtT.  This default implementation
 // should never be called.
-template <typename T,
-          typename fmtT,
+template <typename T, typename fmtT,
          bool convertible = is_convertible<T, fmtT>::value>
 struct formatValueAsType {
  static void invoke(std::ostream & /*out*/, const T & /*value*/) { assert(0); }
@@ -241,11 +240,8 @@ TINYFORMAT_DEFINE_FORMAT_TRUNCATED_CSTR(char)
 /// operator<< to format the type T, with special cases for the %c and %p
 /// conversions.
 template <typename T>
-inline void formatValue(std::ostream &out,
-                        const char * /*fmtBegin*/,
-                        const char *fmtEnd,
-                        int ntrunc,
-                        const T &value) {
+inline void formatValue(std::ostream &out, const char * /*fmtBegin*/,
+                        const char *fmtEnd, int ntrunc, const T &value) {
  // The mess here is to support the %c and %p conversions: if these
  // conversions are active we try to convert the type to a char or const
  // void* respectively and format that instead of the value itself.  For the
@@ -267,25 +263,22 @@ inline void formatValue(std::ostream &out,
 }

 // Overloaded version for char types to support printing as an integer
-#define TINYFORMAT_DEFINE_FORMATVALUE_CHAR(charType) \
-  inline void formatValue(std::ostream &out,         \
-                          const char * /*fmtBegin*/, \
-                          const char *fmtEnd,        \
-                          int /**/,                  \
-                          charType value) {          \
-    switch (*(fmtEnd - 1)) {                         \
-      case 'u':                                      \
-      case 'd':                                      \
-      case 'i':                                      \
-      case 'o':                                      \
-      case 'X':                                      \
-      case 'x':                                      \
-        out << static_cast<int>(value);              \
-        break;                                       \
-      default:                                       \
-        out << value;                                \
-        break;                                       \
-    }                                                \
+#define TINYFORMAT_DEFINE_FORMATVALUE_CHAR(charType)                      \
+  inline void formatValue(std::ostream &out, const char * /*fmtBegin*/,   \
+                          const char *fmtEnd, int /**/, charType value) { \
+    switch (*(fmtEnd - 1)) {                                              \
+      case 'u':                                                           \
+      case 'd':                                                           \
+      case 'i':                                                           \
+      case 'o':                                                           \
+      case 'X':                                                           \
+      case 'x':                                                           \
+        out << static_cast<int>(value);                                   \
+        break;                                                            \
+      default:                                                            \
+        out << value;                                                     \
+        break;                                                            \
+    }                                                                     \
  }
 // per 3.9.1: char, signed char and unsigned char are all distinct types
 TINYFORMAT_DEFINE_FORMATVALUE_CHAR(char)
@@ -482,7 +475,7 @@ namespace detail {
 // each argument to be allocated as a homogenous array inside FormatList
 // whereas a naive implementation based on inheritance does not.
 class FormatArg {
-public:
+ public:
  FormatArg() {}

  template <typename T>
@@ -491,22 +484,17 @@ public:
        m_formatImpl(&formatImpl<T>),
        m_toIntImpl(&toIntImpl<T>) {}

-  void format(std::ostream &out,
-              const char *fmtBegin,
-              const char *fmtEnd,
+  void format(std::ostream &out, const char *fmtBegin, const char *fmtEnd,
              int ntrunc) const {
    m_formatImpl(out, fmtBegin, fmtEnd, ntrunc, m_value);
  }

  int toInt() const { return m_toIntImpl(m_value); }

-private:
+ private:
  template <typename T>
-  static void formatImpl(std::ostream &out,
-                         const char *fmtBegin,
-                         const char *fmtEnd,
-                         int ntrunc,
-                         const void *value) {
+  static void formatImpl(std::ostream &out, const char *fmtBegin,
+                         const char *fmtEnd, int ntrunc, const void *value) {
    formatValue(out, fmtBegin, fmtEnd, ntrunc, *static_cast<const T *>(value));
  }

@@ -516,11 +504,8 @@ private:
  }

  const void *m_value;
-  void (*m_formatImpl)(std::ostream &out,
-                       const char *fmtBegin,
-                       const char *fmtEnd,
-                       int ntrunc,
-                       const void *value);
+  void (*m_formatImpl)(std::ostream &out, const char *fmtBegin,
+                       const char *fmtEnd, int ntrunc, const void *value);
  int (*m_toIntImpl)(const void *value);
 };

@@ -569,12 +554,10 @@ inline const char *printFormatStringLiteral(std::ostream &out,
 // necessary to pull out variable width and precision .  The function returns a
 // pointer to the character after the end of the current format spec.
 inline const char *streamStateFromFormat(std::ostream &out,
-                                         bool &spacePadPositive,
-                                         int &ntrunc,
+                                         bool &spacePadPositive, int &ntrunc,
                                         const char *fmtStart,
                                         const detail::FormatArg *formatters,
-                                         int &argIndex,
-                                         int numFormatters) {
+                                         int &argIndex, int numFormatters) {
  if (*fmtStart != '%') {
    TINYFORMAT_ERROR(
        "tinyformat: Not enough conversion specifiers in format string");
@@ -750,10 +733,8 @@ inline const char *streamStateFromFormat(std::ostream &out,
 }

 //------------------------------------------------------------------------------
-inline void formatImpl(std::ostream &out,
-                       const char *fmt,
-                       const detail::FormatArg *formatters,
-                       int numFormatters) {
+inline void formatImpl(std::ostream &out, const char *fmt,
+                       const detail::FormatArg *formatters, int numFormatters) {
  // Saved stream state
  std::streamsize origWidth = out.width();
  std::streamsize origPrecision = out.precision();
@@ -765,13 +746,9 @@ inline void formatImpl(std::ostream &out,
    fmt = printFormatStringLiteral(out, fmt);
    bool spacePadPositive = false;
    int ntrunc = -1;
-    const char *fmtEnd = streamStateFromFormat(out,
-                                               spacePadPositive,
-                                               ntrunc,
-                                               fmt,
-                                               formatters,
-                                               argIndex,
-                                               numFormatters);
+    const char *fmtEnd =
+        streamStateFromFormat(out, spacePadPositive, ntrunc, fmt, formatters,
+                              argIndex, numFormatters);
    if (argIndex >= numFormatters) {
      // Check args remain after reading any variable width/precision
      TINYFORMAT_ERROR("tinyformat: Not enough format arguments");
@@ -820,15 +797,14 @@ inline void formatImpl(std::ostream &out,
 /// information has been stripped from the arguments, leaving just enough of a
 /// common interface to perform formatting as required.
 class FormatList {
-public:
+ public:
  FormatList(detail::FormatArg *formatters, int N)
      : m_formatters(formatters), m_N(N) {}

-  friend void vformat(std::ostream &out,
-                      const char *fmt,
+  friend void vformat(std::ostream &out, const char *fmt,
                      const FormatList &list);

-private:
+ private:
  const detail::FormatArg *m_formatters;
  int m_N;
 };
@@ -841,7 +817,7 @@ namespace detail {
 // Format list subclass with fixed storage to avoid dynamic allocation
 template <int N>
 class FormatListN : public FormatList {
-public:
+ public:
  template <typename... Args>
  FormatListN(const Args &... args)
      : FormatList(&m_formatterStore[0], N),
@@ -849,14 +825,14 @@ public:
    static_assert(sizeof...(args) == N, "Number of args must be N");
  }

-private:
+ private:
  FormatArg m_formatterStore[N];
 };

 // Special 0-arg version - MSVC says zero-sized C array in struct is nonstandard
 template <>
 class FormatListN<0> : public FormatList {
-public:
+ public:
  FormatListN() : FormatList(0, 0) {}
 };


--- a/paddle/string/to_string.h
+++ b/paddle/string/to_string.h
--- a/paddle/string/to_string_test.cc
+++ b/paddle/string/to_string_test.cc
@@ -12,12 +12,12 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */

-#include "paddle/string/to_string.h"
+#include "to_string.h"
 #include <gtest/gtest.h>

 constexpr char kOutputString[] = "User Defined Output";
 class UserDefinedClass {
-public:
+ public:
 };

 std::ostream& operator<<(std::ostream& s, const UserDefinedClass& ins) {

--- a/paddle/scripts/docker/build.sh
+++ b/paddle/scripts/docker/build.sh
@@ -117,8 +117,8 @@ EOF
            -DWITH_AVX=${WITH_AVX:-ON} \
            -DWITH_SWIG_PY=ON \
            -DWITH_STYLE_CHECK=OFF
-        make -j `nproc` gen_proto_py
-        make -j `nproc` paddle_python
+        make -j `nproc` gen_proto_py framework_py_proto
+        make -j `nproc` copy_paddle_pybind
        make -j `nproc` paddle_docs paddle_docs_cn paddle_api_docs
        popd
    fi

--- a/paddle/scripts/travis/build_doc.sh
+++ b/paddle/scripts/travis/build_doc.sh
@@ -6,9 +6,9 @@ mkdir -p $TRAVIS_BUILD_DIR/build
 cd $TRAVIS_BUILD_DIR/build

 # Compile Documentation only.
-cmake .. -DCMAKE_BUILD_TYPE=Debug -DWITH_GPU=OFF -DWITH_MKL=OFF -DWITH_DOC=ON
-make -j `nproc` gen_proto_py
-make -j `nproc` paddle_python
+cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_GPU=OFF -DWITH_MKL=OFF -DWITH_DOC=ON -DWITH_STYLE_CHECK=OFF
+make -j `nproc` gen_proto_py framework_py_proto
+make -j `nproc` copy_paddle_pybind
 make -j `nproc` paddle_docs paddle_docs_cn paddle_api_docs

 # check websites for broken links

--- a/python/paddle/v2/fluid/distribute_transpiler.py
+++ b/python/paddle/v2/fluid/distribute_transpiler.py
@@ -33,6 +33,57 @@ class VarBlock:
        return "%s:%d:%d" % (self.varname, self.offset, self.size)


+class UnionFind(object):
+    """ Union-find data structure.
+
+    Union-find is a data structure that keeps track of a set of elements
+    partitioned into a number of disjoint (non-overlapping) subsets.
+
+    Reference:
+    https://en.wikipedia.org/wiki/Disjoint-set_data_structure
+
+    Args:
+      elements(list): The initial list of elements.
+    """
+
+    def __init__(self, elementes=None):
+        self._parents = []  # index -> parent index
+        self._index = {}  # element -> index
+        self._curr_idx = 0
+        if not elementes:
+            elementes = []
+        for ele in elementes:
+            self._parents.append(self._curr_idx)
+            self._index.update({ele: self._curr_idx})
+            self._curr_idx += 1
+
+    def find(self, x):
+        # Find the root index of the given element x,
+        # compressing the path while finding the root index.
+        if not x in self._index:
+            return -1
+        idx = self._index[x]
+        while idx != self._parents[idx]:
+            t = self._parents[idx]
+            self._parents[idx] = self._parents[t]
+            idx = t
+        return idx
+
+    def union(self, x, y):
+        # Union the sets that contain the two given elements
+        x_root = self.find(x)
+        y_root = self.find(y)
+
+        if x_root == y_root:
+            return
+        self._parents[x_root] = y_root
+
+    def is_connected(self, x, y):
+        # If two given elements have the same root index,
+        # then they are connected.
+        return self.find(x) == self.find(y)
+
+
 def same_or_split_var(p_name, var_name):
    return p_name == var_name or p_name.startswith(var_name + ".block")
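
As a rough usage sketch, the `UnionFind` class added above can be exercised as follows. The class is restated here in Python 3 so the block runs on its own; the element names (`"sgd_w"`, `"scale"`, and so on) are hypothetical placeholders, not names from the transpiler.

```python
class UnionFind(object):
    """Standalone restatement of the class in the diff, for illustration."""

    def __init__(self, elements=None):
        self._parents = []   # index -> parent index
        self._index = {}     # element -> index
        for ele in elements or []:
            self._index[ele] = len(self._parents)
            self._parents.append(len(self._parents))

    def find(self, x):
        # Return the root index of x, or -1 if x was never registered.
        if x not in self._index:
            return -1
        idx = self._index[x]
        while idx != self._parents[idx]:
            t = self._parents[idx]
            self._parents[idx] = self._parents[t]  # shorten the path
            idx = t
        return idx

    def union(self, x, y):
        x_root = self.find(x)
        y_root = self.find(y)
        if x_root == y_root:
            return
        self._parents[x_root] = y_root

    def is_connected(self, x, y):
        return self.find(x) == self.find(y)


uf = UnionFind(["sgd_w", "sgd_b", "scale", "clip"])
uf.union("sgd_w", "scale")
uf.union("scale", "clip")
print(uf.is_connected("sgd_w", "clip"))   # True
print(uf.is_connected("sgd_b", "clip"))   # False
print(uf.find("missing"))                 # -1
```

Because `find` returns -1 for any unregistered element, two unregistered elements compare as connected, and `union` with an unregistered element would write to `_parents[-1]`. Callers are presumably expected to pass only elements given to the constructor.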

@@ -140,6 +191,7 @@ class DistributeTranspiler:
        for b in param_blocks:
            varname, block_id, _ = b.split(":")
            send_outputs.append(param_var_mapping[varname][int(block_id)])
+
        # let send_op know which endpoint to send which var to, eplist has the same
        # order as send_inputs.
        eplist = split_method(send_inputs, pserver_endpoints)
@@ -178,6 +230,21 @@ class DistributeTranspiler:
                outputs={"Out": [orig_param]},
                attrs={"axis": 0})

+        self.lr_param_mapping = self._create_lr_param_mapping()
+
+    def _create_lr_param_mapping(self):
+        lr_mapping = dict()
+        for _, opt_op in enumerate(self.optimize_ops):
+            if not opt_op.inputs or not opt_op.inputs.has_key("LearningRate") \
+              or not opt_op.inputs.has_key("Param"):
+                continue
+            lr = opt_op.inputs["LearningRate"].name
+            param = opt_op.inputs["Param"].name
+            if not lr_mapping.has_key(lr):
+                lr_mapping.update({lr: list()})
+            lr_mapping[lr].append(param)
+        return lr_mapping
+
    def _create_vars_from_blocklist(self, program, block_list):
        # Create respective variables using the block_list
        block_map = dict()
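
The grouping done by `_create_lr_param_mapping` can be sketched in isolation. This is a simplified illustration: ops are modelled as plain dicts whose inputs map directly to variable names, which is an assumption and not the real operator interface.

```python
from collections import defaultdict


def create_lr_param_mapping(optimize_ops):
    """Map each learning-rate variable name to the parameters it updates."""
    lr_mapping = defaultdict(list)
    for op in optimize_ops:
        inputs = op.get("inputs") or {}
        # Skip ops that are not optimizer updates.
        if "LearningRate" not in inputs or "Param" not in inputs:
            continue
        lr_mapping[inputs["LearningRate"]].append(inputs["Param"])
    return dict(lr_mapping)


# Hypothetical ops: two SGD updates sharing one learning rate, one unrelated op.
ops = [
    {"type": "sgd", "inputs": {"Param": "fc_0.w", "LearningRate": "lr_0"}},
    {"type": "sgd", "inputs": {"Param": "fc_0.b", "LearningRate": "lr_0"}},
    {"type": "scale", "inputs": {"X": "lr_0"}},
]
mapping = create_lr_param_mapping(ops)
print(mapping)  # {'lr_0': ['fc_0.w', 'fc_0.b']}
```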
@@ -208,6 +275,7 @@ class DistributeTranspiler:
                    name="%s.block%d" % (varname, i),
                    psersistable=False,
                    dtype=orig_var.dtype,
+                    type=orig_var.type,
                    shape=splited_shape)  # flattend splited var
                var_mapping[varname].append(var)
        return var_mapping
@@ -269,6 +337,7 @@ class DistributeTranspiler:
                name="%s.trainer_%d" % (var.name, i),
                psersistable=var.persistable,
                dtype=var.dtype,
+                type=var.type,
                shape=var.shape)
            var_list.append(var_each)
        return var_list
@@ -300,52 +369,15 @@ class DistributeTranspiler:
            pass
        return orig_shape

-    def _op_input_var(self, op, varname):
-        pass
-
-    def _is_op_on_pserver(self, endpoint, all_ops, idx):
-        """
-        Recursively check if the op need to run on current server.
-        Assume that ops are in the execution order.
-        """
-        param_names = [
-            p.name for p in self.param_grad_ep_mapping[endpoint]["params"]
-        ]
-        op = all_ops[idx]
-        input_names = set(op.input_names)
-        # TODO(typhoonzero): using Param and Grad input name to identify
-        # that the operator is an optimization operator, need a better way.
-        if "Param" in input_names:
-            if op.input("Param")[0] in param_names:
-                return True
-            else:
-                for n in param_names:
-                    if same_or_split_var(n, op.input("Param")[0]) \
-                            and n != op.input("Param")[0]:
-                        return True
-                return False
-        else:
-            j = idx - 1
-            while j >= 0:
-                prev_op = all_ops[j]
-                # prev_output_names = [o.name for o in prev_op.outputs.values()]
-                # prev_input_names = [o.name for o in prev_op.inputs.values()]
-                # NOTE(typhoonzero): consider list input/output
-                prev_output_names = prev_op.desc.output_arg_names()
-                prev_input_names = prev_op.desc.input_arg_names()
-                found1 = False
-                found2 = False
-                for varname in op.desc.input_arg_names():
-                    if varname in prev_output_names:
-                        found1 = self._is_op_on_pserver(endpoint, all_ops, j)
-                # later ops may produce output for prev op's next batch use.
-                for varname in op.desc.output_arg_names():
-                    if varname in prev_input_names:
-                        found2 = self._is_op_on_pserver(endpoint, all_ops, j)
-                if found1 or found2:
-                    return True
-                j -= 1
-            return False
+    def _fetch_var_names(self, param_dict):
+        res = []
+        if not param_dict:
+            return res
+        for _, values in param_dict.iteritems():
+            if not isinstance(values, list):
+                values = [values]
+            res += [v.name for v in values]
+        return res

    def _append_pserver_ops(self, optimize_block, opt_op, endpoint):
        program = optimize_block.program
@@ -363,11 +395,7 @@ class DistributeTranspiler:
                    # do not append this op if current endpoint
                    # is not dealing with this grad block
                    return
-                merged_var = program.global_block().create_var(
-                    name=grad_block.name,
-                    persistable=grad_block.persistable,
-                    dtype=grad_block.dtype,
-                    shape=grad_block.shape)
+                merged_var = program.global_block().vars[grad_block.name]
                # append merging ops if trainers > 1
                if self.trainers > 1:
                    vars2merge = self._create_var_for_trainers(
@@ -398,13 +426,19 @@ class DistributeTranspiler:
                    shape=param_block.shape)

                new_inputs[key] = tmpvar
+            elif key == "LearningRate":
+                # The learning rate variable has already been created by a
+                # non-optimize op, so don't create it again.
+                new_inputs[key] = program.global_block().vars[opt_op.input(key)[
+                    0]]

        for key in opt_op.input_names:
-            if key in ["Param", "Grad"]:
+            new_shape = None
+            if key in ["Param", "Grad", "LearningRate"]:
                continue
+            var = program.global_block().vars[opt_op.input(key)[0]]
            # update accumulator variable shape
            param_shape = new_inputs["Param"].shape
-            var = program.global_block().vars[opt_op.input(key)[0]]
            new_shape = self._get_optimizer_input_shape(opt_op.type, key,
                                                        var.shape, param_shape)
            tmpvar = program.global_block().create_var(
@@ -415,12 +449,11 @@ class DistributeTranspiler:
            new_inputs[key] = tmpvar

        # change output's ParamOut variable
-        outputs = self._get_output_map_from_op(program.global_block(), opt_op)
-        outputs["ParamOut"] = new_inputs["Param"]
+        opt_op.outputs["ParamOut"] = new_inputs["Param"]
        optimize_block.append_op(
            type=opt_op.type,
            inputs=new_inputs,
-            outputs=outputs,
+            outputs=opt_op.outputs,
            attrs=opt_op.attrs)

    def _append_pserver_non_opt_ops(self, optimize_block, opt_op):
@@ -428,11 +461,10 @@ class DistributeTranspiler:
        # Append the ops for parameters that do not need to be optimized/updated
        inputs = self._get_input_map_from_op(self.program.global_block().vars,
                                             opt_op)
-        for var in inputs.itervalues():
-            if type(var) == list:
-                varlist = var
-            else:
-                varlist = [var]
+        for varlist in inputs.itervalues():
+            if not isinstance(varlist, list):
+                varlist = [varlist]
+
            for var in varlist:
                if not program.global_block().vars.has_key(var.name):
                    program.global_block().create_var(
@@ -444,12 +476,70 @@ class DistributeTranspiler:
        outputs = self._get_output_map_from_op(self.program.global_block().vars,
                                               opt_op)

+        for varlist in outputs.itervalues():
+            if not isinstance(varlist, list):
+                varlist = [varlist]
+
+            for var in varlist:
+                program.global_block().create_var(
+                    name=var.name,
+                    persistable=var.persistable,
+                    dtype=var.dtype,
+                    shape=var.shape)
+
        optimize_block.append_op(
            type=opt_op.type,
            inputs=inputs,
            outputs=outputs,
            attrs=opt_op.attrs)

+    def _is_op_connected(self, op1, op2):
+        # If one op's input is another op's output or
+        # one op's output is another op's input, we say
+        # the two operators are connected.
+        op1_input_names = self._fetch_var_names(op1.inputs)
+        op1_output_names = self._fetch_var_names(op1.outputs)
+
+        op2_input_names = self._fetch_var_names(op2.inputs)
+        op2_output_names = self._fetch_var_names(op2.outputs)
+        if set(op1_output_names) & set(op2_input_names) or \
+           set(op1_input_names) & set(op2_output_names):
+            return True
+        return False
+
+    def _create_ufind(self, optimize_ops):
+        # Create a union-find data structure from the optimize ops
+        ufind = UnionFind(optimize_ops)
+        for i in xrange(len(optimize_ops)):
+            for j in xrange(i, len(optimize_ops)):
+                op1 = optimize_ops[i]
+                op2 = optimize_ops[j]
+                if self._is_op_connected(op1, op2):
+                    ufind.union(op1, op2)
+        return ufind
+
+    def _is_opt_op(self, op):
+        # NOTE: This is a hack: any op with both "Param" and "LearningRate"
+        # inputs is treated as an optimize op (SGD, Momentum, Adam, etc.).
+        if op.inputs and op.inputs.has_key("Param") \
+          and op.inputs.has_key("LearningRate"):
+            return True
+        return False
+
+    def _is_opt_op_on_pserver(self, endpoint, op):
+        # True if the op's "Param" input matches a param assigned to this
+        # endpoint, either by exact name or according to same_or_split_var.
+        param_names = [
+            p.name for p in self.param_grad_ep_mapping[endpoint]["params"]
+        ]
+        param = op.inputs["Param"].name
+        if param in param_names:
+            return True
+        for n in param_names:
+            if same_or_split_var(n, param) and n != param:
+                return True
+        return False
+
    def get_pserver_program(self, endpoint):
        """
        Get pserver side program using the endpoint
@@ -469,26 +559,38 @@ class DistributeTranspiler:
            pserver_program.global_block().create_var(
                name=v.name, persistable=True, dtype=v.dtype, shape=v.shape)
            for trainer_id in xrange(self.trainers):
-                print("create variable for program: %s.trainer_%d" %
-                      (v.name, trainer_id))
                pserver_program.global_block().create_var(
                    name="%s.trainer_%d" % (v.name, trainer_id),
                    persistable=True,
                    dtype=v.dtype,
                    shape=v.shape)
+
        # step6
        optimize_block = pserver_program.create_block(0)
-        # Iterate through the ops and append ops as needed
-        for idx, opt_op in enumerate(self.optimize_ops):
-            is_op_on_pserver = self._is_op_on_pserver(endpoint,
-                                                      self.optimize_ops, idx)
-            if not is_op_on_pserver:
-                continue
-            if "Grad" in opt_op.desc.input_arg_names():
-                self._append_pserver_ops(optimize_block, opt_op, endpoint)
-            else:
-                self._append_pserver_non_opt_ops(optimize_block, opt_op)
-
+        # step 6.1
+        # Create a union-find data structure from the optimize ops.
+        # If two ops are connected, they are merged into
+        # one set.
+        ufind = self._create_ufind(self.optimize_ops)
+        # step 6.2
+        # Iterate through the ops and collect the optimize ops
+        # located on the current pserver.
+        opt_op_on_pserver = []
+        for op in self.optimize_ops:
+            if self._is_opt_op(op) and self._is_opt_op_on_pserver(endpoint, op):
+                opt_op_on_pserver.append(op)
+        # step 6.3
+        # Iterate through the ops; if an op is in the same set as an
+        # optimize op located on the current pserver, append it to
+        # the sub-program.
+        for op in self.optimize_ops:
+            for opt_op in opt_op_on_pserver:
+                if ufind.is_connected(op, opt_op):
+                    if self._is_opt_op(op):
+                        self._append_pserver_ops(optimize_block, op, endpoint)
+                    else:
+                        self._append_pserver_non_opt_ops(optimize_block, op)
+                    break
        # Append the listen_and_serv op
        pserver_program.global_block().append_op(
            type="listen_and_serv",

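Note on the grouping logic in steps 6.1 to 6.3 above: a minimal, self-contained sketch of the idea. `SimpleUnionFind`, the toy `ops` table, and the variable names in it are hypothetical stand-ins for illustration only; they are not the `UnionFind` class or the op objects used by the transpiler, and only mirror the `union` / `is_connected` calls visible in the hunk.

```python
class SimpleUnionFind(object):
    """Toy union-find over hashable items (hypothetical stand-in)."""

    def __init__(self, items):
        # Every item starts as the root of its own set.
        self._parent = dict((item, item) for item in items)

    def find(self, item):
        # Walk up to the root, halving the path along the way.
        while self._parent[item] != item:
            self._parent[item] = self._parent[self._parent[item]]
            item = self._parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def is_connected(self, a, b):
        return self.find(a) == self.find(b)


# Hypothetical ops: name -> (input variable names, output variable names).
ops = {
    "scale_grad": ({"w@GRAD"}, {"w@GRAD.scaled"}),
    "sgd_w": ({"w", "w@GRAD.scaled", "lr"}, {"w"}),
    "sgd_b": ({"b", "b@GRAD", "lr2"}, {"b"}),
}


def is_op_connected(op1, op2):
    # Connected if one op's output feeds the other op's input.
    in1, out1 = ops[op1]
    in2, out2 = ops[op2]
    return bool(out1 & in2) or bool(in1 & out2)


names = sorted(ops)
ufind = SimpleUnionFind(names)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if is_op_connected(a, b):
            ufind.union(a, b)

# scale_grad feeds sgd_w, so both belong to the same set;
# sgd_b shares no variables with either of them.
grouped = ufind.is_connected("scale_grad", "sgd_w")
separate = ufind.is_connected("sgd_w", "sgd_b")
```

In this toy setup, a pserver that owns `w` would pick up both `sgd_w` and the helper op `scale_grad`, while `sgd_b` would be left for whichever pserver owns `b`.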
--- a/python/paddle/v2/fluid/layers/__init__.py
+++ b/python/paddle/v2/fluid/layers/__init__.py
@@ -16,6 +16,8 @@ import ops
 from ops import *
 import nn
 from nn import *
+import detection
+from detection import *
 import io
 from io import *
 import tensor
@@ -28,6 +30,7 @@ import math_op_patch
 from math_op_patch import *

 __all__ = []
+__all__ += detection.__all__
 __all__ += nn.__all__
 __all__ += io.__all__
 __all__ += tensor.__all__

--- a/python/paddle/v2/fluid/layers/detection.py
+++ b/python/paddle/v2/fluid/layers/detection.py
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+All layers related to detection neural networks.
+"""
+
+from ..layer_helper import LayerHelper
+
+__all__ = ['detection_output', ]
+
+
+def detection_output(scores,
+                     loc,
+                     prior_box,
+                     prior_box_var,
+                     background_label=0,
+                     nms_threshold=0.3,
+                     nms_top_k=400,
+                     keep_top_k=200,
+                     score_threshold=0.01,
+                     nms_eta=1.0):
+    """
+    **Detection Output Layer**
+
+    This layer applies NMS (non-maximum suppression) to the network output
+    and computes the predicted bounding box locations. The output of this
+    layer could be empty if there is no valid bounding box.
+
+    Args:
+        scores(Variable): A 3-D Tensor with shape [N, C, M] that holds the
+            predicted confidence scores. N is the batch size, C is the
+            class number, and M is the number of bounding boxes. For each
+            category there are M scores, one per bounding box.
+        loc(Variable): A 3-D Tensor with shape [N, M, 4] that holds the
+            predicted locations of M bounding boxes. N is the batch size,
+            and each bounding box has four coordinate values laid out as
+            [xmin, ymin, xmax, ymax].
+        prior_box(Variable): A 2-D Tensor with shape [M, 4] that holds M boxes.
+            Each box is represented as [xmin, ymin, xmax, ymax].
+            [xmin, ymin] is the top-left coordinate of the anchor box;
+            if the input is an image feature map, it is close to the origin
+            of the coordinate system. [xmax, ymax] is the bottom-right
+            coordinate of the anchor box.
+        prior_box_var(Variable): A 2-D Tensor with shape [M, 4] that holds M
+            groups of variances.
+        background_label(int): The index of the background label; the
+            background label will be ignored. If set to -1, all
+            categories will be considered.
+        nms_threshold(float): The threshold to be used in NMS.
+        nms_top_k(int): Maximum number of detections to be kept according
+            to their confidences after filtering detections based on
+            score_threshold.
+        keep_top_k(int): Number of total bboxes to be kept per image after
+            the NMS step. -1 means keeping all bboxes after the NMS step.
+        score_threshold(float): Threshold to filter out bounding boxes with
+            low confidence scores. If not provided, consider all boxes.
+        nms_eta(float): The parameter for adaptive NMS.
+
+    Returns:
+        Variable: A Tensor holding the detected bounding boxes.
+
+    Examples:
+        .. code-block:: python
+
+            pb = fluid.layers.data(name='prior_box', shape=[10, 4],
+                                   append_batch_size=False, dtype='float32')
+            pbv = fluid.layers.data(name='prior_box_var', shape=[10, 4],
+                                    append_batch_size=False, dtype='float32')
+            loc = fluid.layers.data(name='target_box', shape=[21, 4],
+                                    append_batch_size=False, dtype='float32')
+            scores = fluid.layers.data(name='scores', shape=[2, 21, 10],
+                                       append_batch_size=False, dtype='float32')
+            nmsed_outs = fluid.layers.detection_output(scores=scores,
+                                                       loc=loc,
+                                                       prior_box=pb,
+                                                       prior_box_var=pbv)
+    """
+
+    helper = LayerHelper("detection_output", **locals())
+    decoded_box = helper.create_tmp_variable(dtype=loc.dtype)
+    helper.append_op(
+        type="box_coder",
+        inputs={
+            'PriorBox': prior_box,
+            'PriorBoxVar': prior_box_var,
+            'TargetBox': loc
+        },
+        outputs={'OutputBox': decoded_box},
+        attrs={'code_type': 'decode_center_size'})
+    nmsed_outs = helper.create_tmp_variable(dtype=decoded_box.dtype)
+
+    helper.append_op(
+        type="multiclass_nms",
+        inputs={'Scores': scores,
+                'BBoxes': decoded_box},
+        outputs={'Out': nmsed_outs},
+        attrs={
+            'background_label': background_label,
+            'nms_threshold': nms_threshold,
+            'nms_top_k': nms_top_k,
+            'keep_top_k': keep_top_k,
+            'score_threshold': score_threshold,
+            'nms_eta': nms_eta
+        })
+    return nmsed_outs
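
Note on the NMS step used by `detection_output` above: the following is a rough pure-Python sketch of plain greedy non-maximum suppression for a single class. It is an illustration under simplifying assumptions (no per-class loop, no `keep_top_k`, no adaptive `nms_eta`), not the `multiclass_nms` operator itself; the helper names `iou` and `greedy_nms` are made up for this note, and the defaults only echo the layer's default arguments.

```python
def iou(a, b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, score_threshold=0.01, nms_threshold=0.3,
               top_k=400):
    """Return indices of the boxes that survive greedy NMS."""
    # 1. Drop low-confidence boxes, then rank the rest by score.
    candidates = [i for i in range(len(boxes)) if scores[i] > score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    candidates = candidates[:top_k]

    # 2. Keep a box only if it does not overlap too much with a
    #    higher-scoring box that was already kept.
    keep = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [[0.0, 0.0, 1.0, 1.0],   # highest score
         [0.0, 0.0, 1.0, 0.9],   # overlaps heavily with the first box
         [2.0, 2.0, 3.0, 3.0]]   # disjoint from both
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

Here the second box is suppressed because its IoU with the first box is well above the threshold, while the third box survives because it overlaps nothing.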
--- a/python/paddle/v2/fluid/layers/math_op_patch.py
+++ b/python/paddle/v2/fluid/layers/math_op_patch.py
@@ -117,6 +117,7 @@ def monkey_patch_variable():

            tmp_name = unique_tmp_name()
            out = self.block.create_var(name=tmp_name, dtype=lhs_dtype)
+
            self.block.append_op(
                type=op_type,
                inputs={'X': [self],
@@ -151,7 +152,12 @@ def monkey_patch_variable():
        ("__div__", "elementwise_div", False),
        ("__rdiv__", "elementwise_div", True),
        ("__pow__", "elementwise_pow", False),
-        ("__rpow__", "elementwise_pow", True)):
+        ("__rpow__", "elementwise_pow", True),
+            # for logical compare
+        ("__eq__", "equal", False),
+        ("__ne__", "not_equal", False),
+        ("__lt__", "less_than", False),
+        ("__le__", "less_equal", False)):
        setattr(Variable, method_name,
                _elemwise_method_creator_(method_name, op_type, reverse))


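Note on the comparison operators patched in above: a toy sketch of the `setattr` pattern, assuming a stand-in class. `ToyVariable`, `recorded_ops`, and `_make_method` are hypothetical names for this note only; the real patch goes through `_elemwise_method_creator_` and appends ops to a block.

```python
recorded_ops = []  # stand-in for the ops a real block would collect


class ToyVariable(object):
    def __init__(self, name):
        self.name = name


def _make_method(op_type):
    # Build a method that records an op instead of comparing eagerly.
    def method(self, other):
        out = ToyVariable("tmp_%d" % len(recorded_ops))
        recorded_ops.append((op_type, self.name, other.name, out.name))
        return out

    return method


for method_name, op_type in (("__eq__", "equal"),
                             ("__ne__", "not_equal"),
                             ("__lt__", "less_than"),
                             ("__le__", "less_equal")):
    setattr(ToyVariable, method_name, _make_method(op_type))

global_step = ToyVariable("global_step")
boundary = ToyVariable("boundary")
cond = global_step < boundary  # records a "less_than" op
```

One consequence worth keeping in mind with this pattern: `a == b` on such objects no longer yields a Python `bool`, so identity checks (`is None`) are the safer way to test for a missing value.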
--- a/python/paddle/v2/fluid/layers/nn.py
+++ b/python/paddle/v2/fluid/layers/nn.py
@@ -66,6 +66,8 @@ __all__ = [
    'row_conv',
    'multiplex',
    'layer_norm',
+    'softmax_with_cross_entropy',
+    'smooth_l1',
 ]


@@ -3091,3 +3093,122 @@ def multiplex(inputs, index):
                'Ids': index},
        outputs={'Out': [out]})
    return out
+
+
+def softmax_with_cross_entropy(logits, label, soft_label=False):
+    """
+    **Softmax With Cross Entropy Operator.**
+    
+    Cross entropy loss with softmax is used as the output layer extensively. This
+    operator computes the softmax normalized values for each row of the input
+    tensor, after which cross-entropy loss is computed. This provides a more
+    numerically stable gradient.
+    
+    Because this operator performs a softmax on logits internally, it expects
+    unscaled logits. This operator should not be used with the output of
+    softmax operator since that would produce incorrect results.
+    
+    When the attribute soft_label is set to false, this operator expects
+    mutually exclusive hard labels: each sample in a batch belongs to exactly
+    one class with a probability of 1.0, so each sample has a single label.
+    
+    The equation is as follows:
+    
+    1) Hard label (one-hot label, so every sample has exactly one class)
+    
+    .. math::
+
+        loss_j =  -\\text{logit}_{label_j} +
+        \\log\\left(\\sum_{i=1}^{K}\\exp(\\text{logit}_i)\\right), j = 1,..., N
+    
+    2) Soft label (each sample can have a distribution over all classes)
+
+    .. math::
+    
+        loss_j =  -\\sum_{i=1}^{K}\\text{label}_i
+        \\left(\\text{logit}_i - \\log\\left(\\sum_{k=1}^{K}
+        \\exp(\\text{logit}_k)\\right)\\right), j = 1,...,N
+
+    Args:
+        logits (Variable): The unscaled log probabilities, a 2-D tensor
+            with shape [N x K]. N is the batch size, and K is the class number.
+        label (Variable): The ground truth, a 2-D tensor. If soft_label is
+            false, label is a Tensor<int64> with shape [N x 1]. If soft_label
+            is true, label is a Tensor<float/double> with shape [N x K].
+        soft_label (bool): A flag to indicate whether to interpret the given
+            labels as soft labels. By default, `soft_label` is set to False.
+    Returns:
+        Variable: The cross entropy loss is a 2-D tensor with shape [N x 1].
+
+    Examples:
+        .. code-block:: python
+
+            data = fluid.layers.data(name='data', shape=[128], dtype='float32')
+            label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+            fc = fluid.layers.fc(input=data, size=100)
+            out = fluid.layers.softmax_with_cross_entropy(logits=fc, label=label)
+    """
+    helper = LayerHelper('softmax_with_cross_entropy', **locals())
+    softmax = helper.create_tmp_variable(dtype=logits.dtype)
+    loss = helper.create_tmp_variable(dtype=logits.dtype)
+    helper.append_op(
+        type='softmax_with_cross_entropy',
+        inputs={'Logits': logits,
+                'Label': label},
+        outputs={'Softmax': softmax,
+                 'Loss': loss},
+        attrs={'soft_label': soft_label})
+    return loss
+
+
+def smooth_l1(x, y, inside_weight=None, outside_weight=None, sigma=None):
+    """
+    **Smooth L1 Loss Operator.**
+
+    This operator computes the smooth l1 loss for X and Y.
+    The operator takes the first dimension of X and Y as batch size.
+    For each instance, it computes the smooth l1 loss element by element first
+    and then sums all the losses. So the shape of Out is [batch_size, 1].
+    
+    Args:
+        x (Variable): A tensor with rank at least 2. The input value of smooth
+            l1 loss op with shape [batch_size, dim1, ..., dimN].
+        y (Variable): A tensor with rank at least 2. The target value of smooth
+            l1 loss op with same shape as x.
+        inside_weight (Variable|None):  A tensor with rank at least 2. This
+            input is optional and should have the same shape as x. If provided,
+            the result of (x - y) will be multiplied by this tensor element by
+            element.
+        outside_weight (Variable|None): A tensor with rank at least 2. This
+            input is optional and should have the same shape as x. If provided,
+            the output smooth l1 loss will be multiplied by this tensor element
+            by element.
+        sigma (float|None): Hyper parameter of smooth l1 loss op. A float scalar
+            with default value 1.0.
+    Returns:
+        Variable: A tensor of rank 2. The output smooth l1 loss with
+            shape [batch_size, 1].
+
+    Examples:
+        .. code-block:: python
+
+            data = fluid.layers.data(name='data', shape=[128], dtype='float32')
+            label = fluid.layers.data(name='label', shape=[100], dtype='float32')
+            fc = fluid.layers.fc(input=data, size=100)
+            out = fluid.layers.smooth_l1(x=fc, y=label)
+    """
+    helper = LayerHelper('smooth_l1_loss', **locals())
+    diff = helper.create_tmp_variable(dtype=x.dtype)
+    loss = helper.create_tmp_variable(dtype=x.dtype)
+    helper.append_op(
+        type='smooth_l1_loss',
+        inputs={
+            'X': x,
+            'Y': y,
+            'InsideWeight': inside_weight,
+            'OutsideWeight': outside_weight
+        },
+        outputs={'Diff': diff,
+                 'Out': loss},
+        attrs={'sigma': sigma})
+    return loss
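
Note on the two losses added to `nn.py` above: a small pure-Python reference may help when checking shapes and values by hand. It is only a sketch. `softmax_xent_reference` follows the hard-label formula from the docstring, using the log-sum-exp shift for numerical stability. `smooth_l1_reference` assumes the common Fast R-CNN style definition with `sigma` and no inside/outside weights; whether the `smooth_l1_loss` operator matches it term by term is an assumption, not something this diff confirms. Both function names are made up for this note.

```python
import math


def softmax_xent_reference(logits, labels):
    """Hard-label loss per row: -logit[label] + log(sum(exp(logit)))."""
    losses = []
    for row, label in zip(logits, labels):
        peak = max(row)  # shift by the max so exp() cannot overflow
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        losses.append(log_sum_exp - row[label])
    return losses


def smooth_l1_reference(x, y, sigma=1.0):
    """Per-row sum of element-wise smooth L1 (assumed definition)."""
    sigma2 = sigma * sigma
    losses = []
    for row_x, row_y in zip(x, y):
        total = 0.0
        for a, b in zip(row_x, row_y):
            d = abs(a - b)
            if d < 1.0 / sigma2:
                total += 0.5 * sigma2 * d * d  # quadratic near zero
            else:
                total += d - 0.5 / sigma2      # linear for large errors
        losses.append(total)
    return losses


# Two samples, two classes; the second row would overflow a naive exp().
xent = softmax_xent_reference([[0.0, 0.0], [1000.0, 0.0]], [0, 0])

# One sample with one small and one large error.
l1 = smooth_l1_reference([[0.5, 3.0]], [[0.0, 0.0]])
```

The second cross-entropy row illustrates why the operator expects unscaled logits: subtracting the row maximum before exponentiating keeps the computation finite.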
--- a/python/paddle/v2/fluid/learning_rate_decay.py
+++ b/python/paddle/v2/fluid/learning_rate_decay.py
@@ -179,7 +179,7 @@ def polynomial_decay(learning_rate,
                shape=[1], dtype='float32', value=1.0)

            with layers.Switch() as switch:
-                with switch.case(layers.equal(x=global_step, y=zero_var)):
+                with switch.case(global_step == zero_var):
                    layers.assign(input=one_var, output=div_res)
            decay_steps = decay_steps * div_res
        else:
@@ -229,7 +229,7 @@ def piecewise_decay(global_step, boundaries, values):
                    shape=[1], dtype='float32', value=float(boundaries[i]))
                value_var = layers.fill_constant(
                    shape=[1], dtype='float32', value=float(values[i]))
-                with switch.case(layers.less_than(global_step, boundary_val)):
+                with switch.case(global_step < boundary_val):
                    layers.assign(value_var, lr)
            last_value_var = layers.fill_constant(
                shape=[1],

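Note on the `Switch` in `piecewise_decay` above: in plain Python the schedule can be read as the sketch below. It assumes that the first matching `switch.case` wins and that `values` has exactly one more entry than `boundaries`; the function name `piecewise_decay_reference` is hypothetical.

```python
def piecewise_decay_reference(global_step, boundaries, values):
    """Return the learning rate for global_step (illustrative only)."""
    # Assumption: one value per interval, so one more value than boundaries.
    assert len(values) == len(boundaries) + 1
    for boundary, value in zip(boundaries, values):
        # Mirrors `switch.case(global_step < boundary_val)`.
        if global_step < boundary:
            return value
    # No boundary matched: fall through to the last value.
    return values[-1]


boundaries = [100, 200]
values = [1.0, 0.5, 0.1]
early = piecewise_decay_reference(50, boundaries, values)
middle = piecewise_decay_reference(100, boundaries, values)
late = piecewise_decay_reference(250, boundaries, values)
```

Because the comparison is strict, a step equal to a boundary already uses the next interval's value.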
--- a/python/paddle/v2/fluid/tests/book/test_rnn_encoder_decoder.py
+++ b/python/paddle/v2/fluid/tests/book/test_rnn_encoder_decoder.py
--- a/python/paddle/v2/fluid/tests/book_distribute/notest_dist_word2vec.py
+++ b/python/paddle/v2/fluid/tests/book_distribute/notest_dist_word2vec.py
@@ -99,7 +99,7 @@ elif training_role == "TRAINER":
    exe.run(fluid.default_startup_program())
    for pass_id in range(PASS_NUM):
        for data in train_reader():
-            avg_cost_np = exe.run(fluid.default_main_program(),
+            avg_cost_np = exe.run(t.get_trainer_program(),
                                  feed=feeder.feed(data),
                                  fetch_list=[avg_cost])
            print("avg_cost_np", avg_cost_np)

--- a/python/paddle/v2/fluid/tests/test_cpp_reader.py
+++ b/python/paddle/v2/fluid/tests/test_cpp_reader.py
@@ -64,9 +64,7 @@ exe = fluid.Executor(place)

 [res1, res2] = exe.run(prog, fetch_list=[out1, out2])

-test_pass = res1.shape == (10, 2) and res2.shape == (10, 1)
-
-if not test_pass:
+if not (res1.shape == (10, 2) and res2.shape == (10, 1)):
    exit(1)

 exit(0)
--- a/python/paddle/v2/fluid/tests/test_detection.py
+++ b/python/paddle/v2/fluid/tests/test_detection.py
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import unittest
+
+import paddle.v2.fluid.layers as layers
+from paddle.v2.fluid.framework import Program, program_guard
+
+
+class TestBook(unittest.TestCase):
+    def test_detection_output(self):
+        program = Program()
+        with program_guard(program):
+            pb = layers.data(
+                name='prior_box',
+                shape=[10, 4],
+                append_batch_size=False,
+                dtype='float32')
+            pbv = layers.data(
+                name='prior_box_var',
+                shape=[10, 4],
+                append_batch_size=False,
+                dtype='float32')
+            loc = layers.data(
+                name='target_box',
+                shape=[20, 4],
+                append_batch_size=False,
+                dtype='float32')
+            scores = layers.data(
+                name='scores',
+                shape=[2, 20, 10],
+                append_batch_size=False,
+                dtype='float32')
+            out = layers.detection_output(
+                scores=scores, loc=loc, prior_box=pb, prior_box_var=pbv)
+            self.assertIsNotNone(out)
+        print(str(program))
+
+
+if __name__ == '__main__':
+    unittest.main()
--- a/python/paddle/v2/fluid/tests/test_detection_map_op.py
+++ b/python/paddle/v2/fluid/tests/test_detection_map_op.py
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+import numpy as np
+import sys
+import collections
+import math
+from op_test import OpTest
+
+
+class TestDetectionMAPOp(OpTest):
+    def set_data(self):
+        self.init_test_case()
+
+        self.mAP = [self.calc_map(self.tf_pos, self.tf_pos_lod)]
+        self.label = np.array(self.label).astype('float32')
+        self.detect = np.array(self.detect).astype('float32')
+        self.mAP = np.array(self.mAP).astype('float32')
+
+        if (len(self.class_pos_count) > 0):
+            self.class_pos_count = np.array(self.class_pos_count).astype(
+                'int32')
+            self.true_pos = np.array(self.true_pos).astype('float32')
+            self.false_pos = np.array(self.false_pos).astype('float32')
+
+            self.inputs = {
+                'Label': (self.label, self.label_lod),
+                'DetectRes': (self.detect, self.detect_lod),
+                'PosCount': self.class_pos_count,
+                'TruePos': (self.true_pos, self.true_pos_lod),
+                'FalsePos': (self.false_pos, self.false_pos_lod)
+            }
+        else:
+            self.inputs = {
+                'Label': (self.label, self.label_lod),
+                'DetectRes': (self.detect, self.detect_lod),
+            }
+
+        self.attrs = {
+            'overlap_threshold': self.overlap_threshold,
+            'evaluate_difficult': self.evaluate_difficult,
+            'ap_type': self.ap_type
+        }
+
+        self.out_class_pos_count = np.array(self.out_class_pos_count).astype(
+            'int')
+        self.out_true_pos = np.array(self.out_true_pos).astype('float32')
+        self.out_false_pos = np.array(self.out_false_pos).astype('float32')
+
+        self.outputs = {
+            'MAP': self.mAP,
+            'AccumPosCount': self.out_class_pos_count,
+            'AccumTruePos': (self.out_true_pos, self.out_true_pos_lod),
+            'AccumFalsePos': (self.out_false_pos, self.out_false_pos_lod)
+        }
+
+    def init_test_case(self):
+        self.overlap_threshold = 0.3
+        self.evaluate_difficult = True
+        self.ap_type = "integral"
+
+        self.label_lod = [[0, 2, 4]]
+        # label difficult xmin ymin xmax ymax
+        self.label = [[1, 0, 0.1, 0.1, 0.3, 0.3], [1, 1, 0.6, 0.6, 0.8, 0.8],
+                      [2, 0, 0.3, 0.3, 0.6, 0.5], [1, 0, 0.7, 0.1, 0.9, 0.3]]
+
+        # label score xmin ymin xmax ymax
+        self.detect_lod = [[0, 3, 7]]
+        self.detect = [
+            [1, 0.3, 0.1, 0.0, 0.4, 0.3], [1, 0.7, 0.0, 0.1, 0.2, 0.3],
+            [1, 0.9, 0.7, 0.6, 0.8, 0.8], [2, 0.8, 0.2, 0.1, 0.4, 0.4],
+            [2, 0.1, 0.4, 0.3, 0.7, 0.5], [1, 0.2, 0.8, 0.1, 1.0, 0.3],
+            [3, 0.2, 0.8, 0.1, 1.0, 0.3]
+        ]
+
+        # label score true_pos false_pos
+        self.tf_pos_lod = [[0, 3, 7]]
+        self.tf_pos = [[1, 0.9, 1, 0], [1, 0.7, 1, 0], [1, 0.3, 0, 1],
+                       [1, 0.2, 1, 0], [2, 0.8, 0, 1], [2, 0.1, 1, 0],
+                       [3, 0.2, 0, 1]]
+
+        self.class_pos_count = []
+        self.true_pos_lod = [[]]
+        self.true_pos = [[]]
+        self.false_pos_lod = [[]]
+        self.false_pos = [[]]
+
+    def calc_map(self, tf_pos, tf_pos_lod):
+        mAP = 0.0
+        count = 0
+
+        def get_input_pos(class_pos_count, true_pos, true_pos_lod, false_pos,
+                          false_pos_lod):
+            class_pos_count_dict = collections.Counter()
+            true_pos_dict = collections.defaultdict(list)
+            false_pos_dict = collections.defaultdict(list)
+            for i, count in enumerate(class_pos_count):
+                class_pos_count_dict[i] = count
+
+            for i in range(len(true_pos_lod[0]) - 1):
+                start = true_pos_lod[0][i]
+                end = true_pos_lod[0][i + 1]
+                for j in range(start, end):
+                    true_pos_dict[i].append(true_pos[j])
+
+            for i in range(len(false_pos_lod[0]) - 1):
+                start = false_pos_lod[0][i]
+                end = false_pos_lod[0][i + 1]
+                for j in range(start, end):
+                    false_pos_dict[i].append(false_pos[j])
+
+            return class_pos_count_dict, true_pos_dict, false_pos_dict
+
+        def get_output_pos(label_count, true_pos, false_pos):
+            max_label = 0
+            for (label, label_pos_num) in label_count.items():
+                if max_label < label:
+                    max_label = label
+
+            label_number = max_label + 1
+
+            out_class_pos_count = []
+            out_true_pos_lod = [0]
+            out_true_pos = []
+            out_false_pos_lod = [0]
+            out_false_pos = []
+
+            for i in range(label_number):
+                out_class_pos_count.append([label_count[i]])
+                true_pos_list = true_pos[i]
+                out_true_pos += true_pos_list
+                out_true_pos_lod.append(len(out_true_pos))
+                false_pos_list = false_pos[i]
+                out_false_pos += false_pos_list
+                out_false_pos_lod.append(len(out_false_pos))
+
+            return out_class_pos_count, out_true_pos, [
+                out_true_pos_lod
+            ], out_false_pos, [out_false_pos_lod]
+
+        def get_accumulation(pos_list):
+            sorted_list = sorted(pos_list, key=lambda pos: pos[0], reverse=True)
+            sum = 0
+            accu_list = []
+            for (score, count) in sorted_list:
+                sum += count
+                accu_list.append(sum)
+            return accu_list
+
+        label_count, true_pos, false_pos = get_input_pos(
+            self.class_pos_count, self.true_pos, self.true_pos_lod,
+            self.false_pos, self.false_pos_lod)
+        for (label, difficult, xmin, ymin, xmax, ymax) in self.label:
+            if self.evaluate_difficult:
+                label_count[label] += 1
+            elif not difficult:
+                label_count[label] += 1
+
+        true_pos = collections.defaultdict(list)
+        false_pos = collections.defaultdict(list)
+        for (label, score, tp, fp) in tf_pos:
+            true_pos[label].append([score, tp])
+            false_pos[label].append([score, fp])
+
+        for (label, label_pos_num) in label_count.items():
+            if label_pos_num == 0 or label not in true_pos: continue
+            label_true_pos = true_pos[label]
+            label_false_pos = false_pos[label]
+
+            accu_tp_sum = get_accumulation(label_true_pos)
+            accu_fp_sum = get_accumulation(label_false_pos)
+
+            precision = []
+            recall = []
+
+            for i in range(len(accu_tp_sum)):
+                precision.append(
+                    float(accu_tp_sum[i]) /
+                    float(accu_tp_sum[i] + accu_fp_sum[i]))
+                recall.append(float(accu_tp_sum[i]) / label_pos_num)
+
+            if self.ap_type == "11point":
+                max_precisions = [0.0] * 11
+                start_idx = len(accu_tp_sum) - 1
+                for j in range(10, -1, -1):
+                    for i in range(start_idx, -1, -1):
+                        if recall[i] < float(j) / 10.0:
+                            start_idx = i
+                            if j > 0:
+                                max_precisions[j - 1] = max_precisions[j]
+                                break
+                        else:
+                            if max_precisions[j] < precision[i]:
+                                max_precisions[j] = precision[i]
+                for j in range(10, -1, -1):
+                    mAP += max_precisions[j] / 11
+                count += 1
+            elif self.ap_type == "integral":
+                average_precisions = 0.0
+                prev_recall = 0.0
+                for i in range(len(accu_tp_sum)):
+                    if math.fabs(recall[i] - prev_recall) > 1e-6:
+                        average_precisions += precision[i] * \
+                            math.fabs(recall[i] - prev_recall)
+                        prev_recall = recall[i]
+
+                mAP += average_precisions
+                count += 1
+        self.out_class_pos_count, self.out_true_pos, self.out_true_pos_lod, self.out_false_pos, self.out_false_pos_lod = get_output_pos(
+            label_count, true_pos, false_pos)
+        if count != 0:
+            mAP /= count
+        return mAP * 100.0
+
+    def setUp(self):
+        self.op_type = "detection_map"
+        self.set_data()
+
+    def test_check_output(self):
+        self.check_output()
+
+
+class TestDetectionMAPOpSkipDiff(TestDetectionMAPOp):
+    def init_test_case(self):
+        super(TestDetectionMAPOpSkipDiff, self).init_test_case()
+
+        self.evaluate_difficult = False
+
+        self.tf_pos_lod = [[0, 2, 6]]
+        # label score true_pos false_pos
+        self.tf_pos = [[1, 0.7, 1, 0], [1, 0.3, 0, 1], [1, 0.2, 1, 0],
+                       [2, 0.8, 0, 1], [2, 0.1, 1, 0], [3, 0.2, 0, 1]]
+
+
+class TestDetectionMAPOp11Point(TestDetectionMAPOp):
+    def init_test_case(self):
+        super(TestDetectionMAPOp11Point, self).init_test_case()
+
+        self.ap_type = "11point"
+
+
+class TestDetectionMAPOpMultiBatch(TestDetectionMAPOp):
+    def init_test_case(self):
+        super(TestDetectionMAPOpMultiBatch, self).init_test_case()
+        self.class_pos_count = [0, 2, 1]
+        self.true_pos_lod = [[0, 0, 3, 5]]
+        self.true_pos = [[0.7, 1.], [0.3, 0.], [0.2, 1.], [0.8, 0.], [0.1, 1.]]
+        self.false_pos_lod = [[0, 0, 3, 5]]
+        self.false_pos = [[0.7, 0.], [0.3, 1.], [0.2, 0.], [0.8, 1.], [0.1, 0.]]
+
+
+if __name__ == '__main__':
+    unittest.main()
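
Note on the "integral" branch of `calc_map` above: for a single class, the accumulation can be sketched as follows. The function name `integral_average_precision` is made up for this note; the sample data is the label-1 slice of `tf_pos` from `init_test_case`, with three ground-truth boxes for that label.

```python
def integral_average_precision(detections, num_positives):
    """AP for one class from (score, is_true_positive) pairs."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = 0
    fp = 0
    ap = 0.0
    prev_recall = 0.0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / float(tp + fp)
        recall = tp / float(num_positives)
        # Only add area when recall actually moves, as in calc_map.
        if abs(recall - prev_recall) > 1e-6:
            ap += precision * abs(recall - prev_recall)
            prev_recall = recall
    return ap


# (score, true_pos) for label 1, taken from the test data above.
label1 = [(0.9, 1), (0.7, 1), (0.3, 0), (0.2, 1)]
ap_label1 = integral_average_precision(label1, num_positives=3)
```

Worked by hand: precision is 1, 1, 2/3, 3/4 and recall is 1/3, 2/3, 2/3, 1, so the sum is 1/3 + 1/3 + (3/4)(1/3) = 11/12.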
--- a/python/paddle/v2/fluid/tests/test_layers.py
+++ b/python/paddle/v2/fluid/tests/test_layers.py
@@ -161,8 +161,8 @@ class TestBook(unittest.TestCase):
                label=label,
                chunk_scheme="IOB",
                num_chunk_types=(label_dict_len - 1) / 2)
-            self.assertNotEqual(crf, None)
-            self.assertNotEqual(crf_decode, None)
+            self.assertIsNotNone(crf)
+            self.assertIsNotNone(crf_decode)

        print(str(program))

@@ -309,6 +309,24 @@ class TestBook(unittest.TestCase):
            self.assertIsNotNone(out)
        print(str(program))

+    def test_softmax_with_cross_entropy(self):
+        program = Program()
+        with program_guard(program):
+            x = layers.data(name='x', shape=[16], dtype='float32')
+            y = layers.data(name='label', shape=[1], dtype='int64')
+            loss = layers.softmax_with_cross_entropy(x, y)
+            self.assertIsNotNone(loss)
+        print(str(program))
+
+    def test_smooth_l1(self):
+        program = Program()
+        with program_guard(program):
+            x = layers.data(name='x', shape=[4], dtype='float32')
+            y = layers.data(name='label', shape=[4], dtype='float32')
+            loss = layers.smooth_l1(x, y)
+            self.assertIsNotNone(loss)
+        print(str(program))
+

 if __name__ == '__main__':
    unittest.main()
--- a/python/paddle/v2/fluid/tests/test_multiclass_nms_op.py
+++ b/python/paddle/v2/fluid/tests/test_multiclass_nms_op.py
@@ -137,7 +137,7 @@ def batched_multiclass_nms(boxes, scores, background, score_threshold,
    det_outs = []
    lod = [0]
    for n in range(batch_size):
-        nmsed_outs, nmsed_num = multiclass_nms(boxes, scores[n], background,
+        nmsed_outs, nmsed_num = multiclass_nms(boxes[n], scores[n], background,
                                               score_threshold, nms_threshold,
                                               nms_top_k, keep_top_k)
        lod.append(lod[-1] + nmsed_num)
@@ -145,7 +145,7 @@ def batched_multiclass_nms(boxes, scores, background, score_threshold,

        for c, indices in nmsed_outs.iteritems():
            for idx in indices:
-                xmin, ymin, xmax, ymax = boxes[idx][:]
+                xmin, ymin, xmax, ymax = boxes[n][idx][:]
                det_outs.append([c, scores[n][c][idx], xmin, ymin, xmax, ymax])

    return det_outs, lod
@@ -179,9 +179,9 @@ class TestMulticlassNMSOp(OpTest):
        scores = np.reshape(scores, (N, M, C))
        scores = np.transpose(scores, (0, 2, 1))

-        boxes = np.random.random((M, BOX_SIZE)).astype('float32')
-        boxes[:, 0:2] = boxes[:, 0:2] * 0.5
-        boxes[:, 2:4] = boxes[:, 2:4] * 0.5 + 0.5
+        boxes = np.random.random((N, M, BOX_SIZE)).astype('float32')
+        boxes[:, :, 0:2] = boxes[:, :, 0:2] * 0.5
+        boxes[:, :, 2:4] = boxes[:, :, 2:4] * 0.5 + 0.5

        nmsed_outs, lod = batched_multiclass_nms(boxes, scores, background,
                                                 score_threshold, nms_threshold,

--- a/python/paddle/v2/fluid/tests/test_python_operator_overriding.py
+++ b/python/paddle/v2/fluid/tests/test_python_operator_overriding.py
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+
+import paddle.v2.fluid.layers as layers
+import paddle.v2.fluid.framework as framework
+import paddle.v2.fluid as fluid
+
+
+class TestPythonOperatorOverride(unittest.TestCase):
+    def check_result(self, fn, place, dtype):
+        shape = [9, 10]
+
+        x_data = np.random.random(size=shape).astype(dtype)
+        y_data = np.random.random(size=shape).astype(dtype)
+        python_out = fn(x_data, y_data)
+
+        x_var = layers.create_global_var(
+            name='x', shape=shape, value=0.0, dtype=dtype, persistable=True)
+        y_var = layers.create_global_var(
+            name='y', shape=shape, value=0.0, dtype=dtype, persistable=True)
+        out = fn(x_var, y_var)
+
+        exe = fluid.Executor(place)
+
+        exe.run(fluid.default_startup_program())
+        fluid_out = exe.run(fluid.default_main_program(),
+                            feed={'x': x_data,
+                                  'y': y_data},
+                            fetch_list=[out])
+
+        np.testing.assert_array_equal(python_out, fluid_out[0])
+
+    def test_override(self):
+        # comparison operators to check
+        compare_fns = [
+            lambda _a, _b: _a == _b,
+            lambda _a, _b: _a != _b,
+            lambda _a, _b: _a < _b,
+            lambda _a, _b: _a <= _b,
+            lambda _a, _b: _a > _b,
+            lambda _a, _b: _a >= _b,
+        ]
+
+        # places to check
+        places = [fluid.CPUPlace()]
+        if fluid.core.is_compiled_with_cuda():
+            places.append(fluid.CUDAPlace(0))
+
+        # dtypes to check
+        dtypes = ['int32', 'float32']
+
+        for place in places:
+            for dtype in dtypes:
+                for compare_fn in compare_fns:
+                    with framework.program_guard(framework.Program(),
+                                                 framework.Program()):
+                        self.check_result(compare_fn, place, dtype)
+
+
+if __name__ == '__main__':
+    unittest.main()
--- a/python/paddle/v2/fluid/tests/test_sequence_expand.py
+++ b/python/paddle/v2/fluid/tests/test_sequence_expand.py
@@ -73,5 +73,20 @@ class TestSequenceExpandCase3(TestSequenceExpand):
        self.inputs = {'X': (x_data, x_lod), 'Y': (y_data, y_lod)}


+class TestSequenceExpandCase4(TestSequenceExpand):
+    def set_data(self):
+        x_data = np.array(
+            [0.1, 0.3, 0.2, 0.15, 0.25, 0.2, 0.15, 0.25, 0.1, 0.3]).reshape(
+                [2, 5]).astype('float32')
+        x_lod = [[
+            0,
+            1,
+            2,
+        ]]
+        y_data = np.random.uniform(0.1, 1, [2, 1]).astype('float32')
+        y_lod = [[0, 1, 2], [0, 1, 2]]
+        self.inputs = {'X': (x_data, x_lod), 'Y': (y_data, y_lod)}
+
+
 if __name__ == '__main__':
    unittest.main()
--- a/python/paddle/v2/fluid/tests/test_split_op.py
+++ b/python/paddle/v2/fluid/tests/test_split_op.py
@@ -20,11 +20,11 @@ from op_test import OpTest
 class TestSplitOp(OpTest):
    def setUp(self):
        self.op_type = "split"
-        axis = 0
-        x = np.random.random((4, 2, 5)).astype('float32')
-        out = np.split(x, [1, 3], axis)
+        axis = 1
+        x = np.random.random((4, 5, 6)).astype('float32')
+        out = np.split(x, [2, 3], axis)
        self.inputs = {'X': x}
-        self.attrs = {'axis': axis, 'sections': [1, 2, 1]}
+        self.attrs = {'axis': axis, 'sections': [2, 1, 2]}
        self.outputs = {'Out': [('out%d' % i, out[i]) \
            for i in xrange(len(out))]}


--- a/python/paddle/v2/fluid/tests/test_target_assign_op.py
+++ b/python/paddle/v2/fluid/tests/test_target_assign_op.py
@@ -43,7 +43,7 @@ def gen_match_and_neg_indices(num_prior, gt_lod, neg_lod):


 def target_assign(encoded_box, gt_label, match_indices, neg_indices, gt_lod,
-                  neg_lod, background_label):
+                  neg_lod, mismatch_value):
    batch_size, num_prior = match_indices.shape

    # init target bbox
@@ -52,7 +52,7 @@ def target_assign(encoded_box, gt_label, match_indices, neg_indices, gt_lod,
    trg_box_wt = np.zeros((batch_size, num_prior, 1)).astype('float32')
    # init target label
    trg_label = np.ones((batch_size, num_prior, 1)).astype('int32')
-    trg_label = trg_label * background_label
+    trg_label = trg_label * mismatch_value
    # init weight for target label
    trg_label_wt = np.zeros((batch_size, num_prior, 1)).astype('float32')

@@ -65,53 +65,90 @@ def target_assign(encoded_box, gt_label, match_indices, neg_indices, gt_lod,
        # target bbox
        for v, c in zip(col_val + gt_start, col_ids[0].tolist()):
            trg_box[i][c][:] = encoded_box[v][c][:]
-
        # weight for target bbox
        trg_box_wt[i][col_ids] = 1.0

        trg_label[i][col_ids] = gt_label[col_val + gt_start]
-
        trg_label_wt[i][col_ids] = 1.0
        # set target label weight to 1.0 for the negative samples
-        neg_ids = neg_indices[neg_lod[i]:neg_lod[i + 1]]
-        trg_label_wt[i][neg_ids] = 1.0
+        if neg_indices is not None:
+            neg_ids = neg_indices[neg_lod[i]:neg_lod[i + 1]]
+            trg_label_wt[i][neg_ids] = 1.0

    return trg_box, trg_box_wt, trg_label, trg_label_wt


-class TestTargetAssginOp(OpTest):
+class TestTargetAssginFloatType(OpTest):
    def setUp(self):
        self.op_type = "target_assign"
+        num_prior = 120
+        num_class = 21
+        gt_lod = [0, 5, 11, 23]
+        neg_lod = [0, 4, 7, 13]
+        mismatch_value = 0
+        batch_size = len(gt_lod) - 1
+        num_gt = gt_lod[-1]
+
+        encoded_box = np.random.random((num_gt, num_prior, 4)).astype('float32')
+        gt_label = np.random.randint(
+            num_class, size=(num_gt, 1)).astype('int32')
+
+        match_indices, neg_indices = gen_match_and_neg_indices(num_prior,
+                                                               gt_lod, neg_lod)

+        out, out_wt, _, _ = target_assign(encoded_box, gt_label, match_indices,
+                                          neg_indices, gt_lod, neg_lod,
+                                          mismatch_value)
+
+        # assign regression targets
+        x = encoded_box
+        self.inputs = {
+            'X': (x, [gt_lod]),
+            'MatchIndices': match_indices,
+        }
+        self.attrs = {'mismatch_value': mismatch_value}
+        self.outputs = {
+            'Out': out,
+            'OutWeight': out_wt,
+        }
+
+    def test_check_output(self):
+        self.check_output()
+
+
+class TestTargetAssginIntType(OpTest):
+    def setUp(self):
+        self.op_type = "target_assign"
        num_prior = 120
        num_class = 21
        gt_lod = [0, 5, 11, 23]
        neg_lod = [0, 4, 7, 13]
+        mismatch_value = 0
        batch_size = len(gt_lod) - 1
        num_gt = gt_lod[-1]
-        background_label = 0

        encoded_box = np.random.random((num_gt, num_prior, 4)).astype('float32')
        gt_label = np.random.randint(
            num_class, size=(num_gt, 1)).astype('int32')
+
        match_indices, neg_indices = gen_match_and_neg_indices(num_prior,
                                                               gt_lod, neg_lod)
-        trg_box, trg_box_wt, trg_label, trg_label_wt = target_assign(
-            encoded_box, gt_label, match_indices, neg_indices, gt_lod, neg_lod,
-            background_label)

+        _, _, out, out_wt = target_assign(encoded_box, gt_label, match_indices,
+                                          neg_indices, gt_lod, neg_lod,
+                                          mismatch_value)
+
+        # assign classification targets
+        x = np.reshape(gt_label, (num_gt, 1, 1))
        self.inputs = {
-            'EncodedGTBBox': (encoded_box, [gt_lod]),
-            'GTScoreLabel': (gt_label, [gt_lod]),
-            'MatchIndices': (match_indices),
+            'X': (x, [gt_lod]),
+            'MatchIndices': match_indices,
            'NegIndices': (neg_indices, [neg_lod]),
        }
-        self.attrs = {'background_label': background_label}
+        self.attrs = {'mismatch_value': mismatch_value}
        self.outputs = {
-            'PredBBoxLabel': (trg_box),
-            'PredBBoxWeight': (trg_box_wt),
-            'PredScoreLabel': (trg_label),
-            'PredScoreWeight': (trg_label_wt),
+            'Out': out,
+            'OutWeight': out_wt,
        }

    def test_check_output(self):

--- a/tools/manylinux1/Dockerfile.x64
+++ b/tools/manylinux1/Dockerfile.x64
@@ -52,3 +52,5 @@ RUN wget -O /opt/swig-2.0.12.tar.gz https://sourceforge.net/projects/swig/files/

 RUN mkdir -p /src && cd /src && git clone https://github.com/NVIDIA/nccl.git nccl && cd nccl &&\
    make -j `nproc` install <NCCL_MAKE_OPTS>  && cd .. && rm -rf nccl
+
+CMD ["bash", "/paddle/paddle/scripts/docker/build.sh"]