Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add-communicator

381f3839 · Qiao Longfei · 657a4f94 · bec68fa0 · 381f3839 · 381f3839
271 changed file
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -25,12 +25,18 @@ message(STATUS "CXX compiler: ${CMAKE_CXX_COMPILER}, version: "
 message(STATUS "C compiler: ${CMAKE_C_COMPILER}, version: "
        "${CMAKE_C_COMPILER_ID} ${CMAKE_C_COMPILER_VERSION}")
 if(WIN32)
+    set(CMAKE_SUPPRESS_REGENERATION ON)
    set(CMAKE_STATIC_LIBRARY_PREFIX lib)
    add_definitions("/DGOOGLE_GLOG_DLL_DECL=")
    set(CMAKE_C_FLAGS_DEBUG   "${CMAKE_C_FLAGS_DEBUG} /bigobj /MTd")
    set(CMAKE_C_FLAGS_RELEASE  "${CMAKE_C_FLAGS_RELEASE} /bigobj /MT")
    set(CMAKE_CXX_FLAGS_DEBUG  "${CMAKE_CXX_FLAGS_DEBUG} /bigobj /MTd")
    set(CMAKE_CXX_FLAGS_RELEASE   "${CMAKE_CXX_FLAGS_RELEASE} /bigobj /MT")
+    add_compile_options(/wd4068 /wd4129 /wd4244 /wd4267 /wd4297 /wd4530 /wd4577 /wd4819 /wd4838)
+    set(PADDLE_LINK_FLAGS "/IGNORE:4006 /IGNORE:4098 /IGNORE:4217 /IGNORE:4221")
+    set(CMAKE_STATIC_LINKER_FLAGS  "${CMAKE_STATIC_LINKER_FLAGS} ${PADDLE_LINK_FLAGS}")
+    set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} ${PADDLE_LINK_FLAGS}")
+    set(CMAKE_EXE_LINKER_FLAGS  "${CMAKE_EXE_LINKER_FLAGS} ${PADDLE_LINK_FLAGS}")
 endif(WIN32)
 find_package(CUDA QUIET)
@@ -212,7 +218,7 @@ endif()
 if (WITH_JEMALLOC)
    find_package(JeMalloc REQUIRED)
    include_directories(${JEMALLOC_INCLUDE_DIR})
-    add_definitions(-DWITH_JEMALLOC)
+    add_definitions(-DPADDLE_WITH_JEMALLOC)
 endif()
 include(generic)            # simplify cmake module

--- a/README.md
+++ b/README.md
 # PaddlePaddle
+English | [简体中文](./README_cn.md)
 [![Build Status](https://travis-ci.org/PaddlePaddle/Paddle.svg?branch=develop)](https://travis-ci.org/PaddlePaddle/Paddle)
 [![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](http://paddlepaddle.org/documentation/docs/en/1.2/getstarted/index_en.html)
@@ -7,7 +8,6 @@
 [![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle.svg)](https://github.com/PaddlePaddle/Paddle/releases)
 [![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)
 Welcome to the PaddlePaddle GitHub.
 PaddlePaddle (PArallel Distributed Deep LEarning) is an easy-to-use,
@@ -18,16 +18,6 @@ learning to many products at Baidu.
 Our vision is to enable deep learning for everyone via PaddlePaddle.
 Please refer to our [release announcement](https://github.com/PaddlePaddle/Paddle/releases) to track the latest feature of PaddlePaddle.
-欢迎来到 PaddlePaddle GitHub
-PaddlePaddle (PArallel Distributed Deep LEarning) 是一个简单易用、高效灵活、可扩展的深度学习平台，最初由百度科学家和工程师共同开发，目的是将深度学习技术应用到百度的众多产品中。
-我们的愿景是让每个人都能通过PaddlePaddle接触深度学习
-跟进PaddlePaddle最新特性请参考我们的[版本说明](https://github.com/PaddlePaddle/Paddle/releases)
 ### Latest PaddlePaddle Release: [Fluid 1.2.0](https://github.com/PaddlePaddle/Paddle/tree/release/1.2)
 ### Install Latest Stable Release:
 ```
@@ -43,23 +33,6 @@ pip install paddlepaddle-gpu==1.2.0.post85
 # For installation on other platform, refer to http://paddlepaddle.org/
 ```
-### PaddlePaddle最新版本: [Fluid 1.2.0](https://github.com/PaddlePaddle/Paddle/tree/release/1.2)
-### 安装最新稳定版本:
-```
-# Linux CPU
-pip install paddlepaddle
-# Linux GPU cuda9cudnn7
-pip install paddlepaddle-gpu
-# Linux GPU cuda8cudnn7
-pip install paddlepaddle-gpu==1.2.0.post87
-# Linux GPU cuda8cudnn5
-pip install paddlepaddle-gpu==1.2.0.post85
-# 其他平台上的安装指引请参考 http://paddlepaddle.org/
-```
 ## Features
 - **Flexibility**
@@ -100,38 +73,10 @@ pip install paddlepaddle-gpu==1.2.0.post85
    Baidu and it has achieved a significant impact. We hope you can also explore
    the capability of PaddlePaddle to make an impact on your product.
-## 特点
- **灵活性**
-    PaddlePaddle支持丰富的神经网络架构和优化算法。易于配置复杂模型，例如带有注意力机制或复杂记忆连接的神经网络机器翻译模型。
-  **高效性**
-    为了高效使用异步计算资源，PaddlePaddle对框架的不同层进行优化，包括计算、存储、架构和通信。下面是一些样例：
-    - 通过SSE/AVX 内置函数、BLAS库(例如MKL、OpenBLAS、cuBLAS)或定制的CPU/GPU内核优化数学操作。
-    - 通过MKL-DNN库优化CNN网络
-    - 高度优化循环网络，无需执行 `padding` 操作即可处理 **变长** 序列
-    - 针对高维稀疏数据模型，优化了局部和分布式训练。
- **稳定性**
-    有了 PaddlePaddle，使得利用各种CPU/GPU和机器来加速训练变得简单。PaddlePaddle 通过优化通信可以实现巨大吞吐量和快速执行。
- **连接产品**
-    另外，PaddlePaddle 的设计也易于部署。在百度，PaddlePaddle 已经部署到含有巨大用户量的产品和服务上，包括广告点击率（CTR）预测、大规模图像分类、光学字符识别（OCR）、搜索排序，计算机病毒检测、推荐系统等等。PaddlePaddle广泛应用于百度产品中，产生了非常重要的影响。我们希望您也能探索 PaddlePaddle 的能力，为您的产品创造新的影响力和效果。
 ## Installation
 It is recommended to read [this doc](http://paddlepaddle.org/documentation/docs/zh/1.2/beginners_guide/install/index_cn.html) on our website.
-## 安装
-推荐阅读官网上的[安装说明](http://paddlepaddle.org/documentation/docs/zh/1.2/beginners_guide/install/index_cn.html) 
 ## Documentation
 We provide [English](http://paddlepaddle.org/documentation/docs/en/1.2/getstarted/index_en.html) and
@@ -153,37 +98,9 @@ We provide [English](http://paddlepaddle.org/documentation/docs/en/1.2/getstarte
   We appreciate your contributions!
-## 文档
-我们提供[英文](http://paddlepaddle.org/documentation/docs/en/1.2/getstarted/index_en.html)和
-[中文](http://paddlepaddle.org/documentation/docs/zh/1.2/beginners_guide/index.html) 文档
- [深度学习101](https://github.com/PaddlePaddle/book)
-  或许您想从这个在线交互式书籍开始，可以在Jupyter Notebook中运行
- [分布式训练](http://paddlepaddle.org/documentation/docs/zh/1.2/user_guides/howto/training/cluster_howto.html)
-  可以在MPI集群上运行分布式训练任务
- [Python API](http://paddlepaddle.org/documentation/docs/zh/1.2/api_cn/index_cn.html)
-   新的API支持代码更少更简洁的程序
- [贡献方式](http://paddlepaddle.org/documentation/docs/zh/1.2/advanced_usage/development/contribute_to_paddle/index_cn.html)
-   欢迎您的贡献!
 ## Ask Questions
 You are welcome to submit questions and bug reports as [Github Issues](https://github.com/PaddlePaddle/Paddle/issues).
-## 答疑
-欢迎您将问题和bug报告以[Github Issues](https://github.com/PaddlePaddle/Paddle/issues)的形式提交
 ## Copyright and License
 PaddlePaddle is provided under the [Apache-2.0 license](LICENSE).
-## 版权和许可证
-PaddlePaddle由[Apache-2.0 license](LICENSE)提供
--- a/README_cn.md
+++ b/README_cn.md
+# PaddlePaddle
+[English](./README.md) | 简体中文
+[![Build Status](https://travis-ci.org/PaddlePaddle/Paddle.svg?branch=develop)](https://travis-ci.org/PaddlePaddle/Paddle)
+[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](http://paddlepaddle.org/documentation/docs/en/1.2/getstarted/index_en.html)
+[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](http://paddlepaddle.org/documentation/docs/zh/1.2/beginners_guide/index.html)
+[![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle.svg)](https://github.com/PaddlePaddle/Paddle/releases)
+[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)
+欢迎来到 PaddlePaddle GitHub
+PaddlePaddle (PArallel Distributed Deep LEarning) 是一个简单易用、高效灵活、可扩展的深度学习平台，最初由百度科学家和工程师共同开发，目的是将深度学习技术应用到百度的众多产品中。
+我们的愿景是让每个人都能通过PaddlePaddle接触深度学习
+跟进PaddlePaddle最新特性请参考我们的[版本说明](https://github.com/PaddlePaddle/Paddle/releases)
+### PaddlePaddle最新版本: [Fluid 1.2.0](https://github.com/PaddlePaddle/Paddle/tree/release/1.2)
+### 安装最新稳定版本:
+```
+# Linux CPU
+pip install paddlepaddle
+# Linux GPU cuda9cudnn7
+pip install paddlepaddle-gpu
+# Linux GPU cuda8cudnn7
+pip install paddlepaddle-gpu==1.2.0.post87
+# Linux GPU cuda8cudnn5
+pip install paddlepaddle-gpu==1.2.0.post85
+# 其他平台上的安装指引请参考 http://paddlepaddle.org/
+```
+## 特性
+- **灵活性**
+    PaddlePaddle支持丰富的神经网络架构和优化算法。易于配置复杂模型，例如带有注意力机制或复杂记忆连接的神经网络机器翻译模型。
+-  **高效性**
+    为了高效使用异步计算资源，PaddlePaddle对框架的不同层进行优化，包括计算、存储、架构和通信。下面是一些样例：
+    - 通过SSE/AVX 内置函数、BLAS库(例如MKL、OpenBLAS、cuBLAS)或定制的CPU/GPU内核优化数学操作。
+    - 通过MKL-DNN库优化CNN网络
+    - 高度优化循环网络，无需执行 `padding` 操作即可处理 **变长** 序列
+    - 针对高维稀疏数据模型，优化了局部和分布式训练。
+- **稳定性**
+    有了 PaddlePaddle，使得利用各种CPU/GPU和机器来加速训练变得简单。PaddlePaddle 通过优化通信可以实现巨大吞吐量和快速执行。
+- **与产品相连**
+    另外，PaddlePaddle 的设计也易于部署。在百度，PaddlePaddle 已经部署到含有巨大用户量的产品和服务上，包括广告点击率（CTR）预测、大规模图像分类、光学字符识别（OCR）、搜索排序，计算机病毒检测、推荐系统等等。PaddlePaddle广泛应用于百度产品中，产生了非常重要的影响。我们希望您也能探索 PaddlePaddle 的能力，为您的产品创造新的影响力和效果。
+## 安装
+推荐阅读官网上的[安装说明](http://paddlepaddle.org/documentation/docs/zh/1.2/beginners_guide/install/index_cn.html)
+## 文档
+我们提供[英文](http://paddlepaddle.org/documentation/docs/en/1.2/getstarted/index_en.html)和
+[中文](http://paddlepaddle.org/documentation/docs/zh/1.2/beginners_guide/index.html) 文档
+- [深度学习101](https://github.com/PaddlePaddle/book)
+  或许您想从这个在线交互式书籍开始，可以在Jupyter Notebook中运行
+- [分布式训练](http://paddlepaddle.org/documentation/docs/zh/1.2/user_guides/howto/training/cluster_howto.html)
+  可以在MPI集群上运行分布式训练任务
+- [Python API](http://paddlepaddle.org/documentation/docs/zh/1.2/api_cn/index_cn.html)
+   新的API支持代码更少更简洁的程序
+- [贡献方式](http://paddlepaddle.org/documentation/docs/zh/1.2/advanced_usage/development/contribute_to_paddle/index_cn.html)
+   欢迎您的贡献!
+## 答疑
+欢迎您将问题和bug报告以[Github Issues](https://github.com/PaddlePaddle/Paddle/issues)的形式提交
+## 版权和许可证
+PaddlePaddle由[Apache-2.0 license](LICENSE)提供
--- a/cmake/configure.cmake
+++ b/cmake/configure.cmake
@@ -152,7 +152,12 @@ endif()
 if (WITH_MKLML AND MKLML_IOMP_LIB)
    message(STATUS "Enable Intel OpenMP with ${MKLML_IOMP_LIB}")
-    set(OPENMP_FLAGS "-fopenmp")
+    if(WIN32)
+        # openmp not support well for now on windows
+        set(OPENMP_FLAGS "")
+    else(WIN32)
+        set(OPENMP_FLAGS "-fopenmp")
+    endif(WIN32)
    set(CMAKE_C_CREATE_SHARED_LIBRARY_FORBIDDEN_FLAGS ${OPENMP_FLAGS})
    set(CMAKE_CXX_CREATE_SHARED_LIBRARY_FORBIDDEN_FLAGS ${OPENMP_FLAGS})
    set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OPENMP_FLAGS}")

--- a/cmake/cuda.cmake
+++ b/cmake/cuda.cmake
@@ -203,25 +203,26 @@ list(APPEND CUDA_NVCC_FLAGS "-w")
 list(APPEND CUDA_NVCC_FLAGS "--expt-relaxed-constexpr")
 if (NOT WIN32)
-if(CMAKE_BUILD_TYPE  STREQUAL "Debug")
+  if(CMAKE_BUILD_TYPE  STREQUAL "Debug")
-    list(APPEND CUDA_NVCC_FLAGS  ${CMAKE_CXX_FLAGS_DEBUG})
+      list(APPEND CUDA_NVCC_FLAGS  ${CMAKE_CXX_FLAGS_DEBUG})
-elseif(CMAKE_BUILD_TYPE  STREQUAL "Release")
+  elseif(CMAKE_BUILD_TYPE  STREQUAL "Release")
-    list(APPEND CUDA_NVCC_FLAGS  ${CMAKE_CXX_FLAGS_RELEASE})
+      list(APPEND CUDA_NVCC_FLAGS  ${CMAKE_CXX_FLAGS_RELEASE})
-elseif(CMAKE_BUILD_TYPE  STREQUAL "RelWithDebInfo")
+  elseif(CMAKE_BUILD_TYPE  STREQUAL "RelWithDebInfo")
-    list(APPEND CUDA_NVCC_FLAGS  ${CMAKE_CXX_FLAGS_RELWITHDEBINFO})
+      list(APPEND CUDA_NVCC_FLAGS  ${CMAKE_CXX_FLAGS_RELWITHDEBINFO})
-elseif(CMAKE_BUILD_TYPE  STREQUAL "MinSizeRel")
+  elseif(CMAKE_BUILD_TYPE  STREQUAL "MinSizeRel")
-    # nvcc 9 does not support -Os. Use Release flags instead
+      # nvcc 9 does not support -Os. Use Release flags instead
-    list(APPEND CUDA_NVCC_FLAGS  ${CMAKE_CXX_FLAGS_RELEASE})
+      list(APPEND CUDA_NVCC_FLAGS  ${CMAKE_CXX_FLAGS_RELEASE})
-endif()
+  endif()
 else(NOT WIN32)
-list(APPEND CUDA_NVCC_FLAGS  "--compiler-options;/bigobj")
+  list(APPEND CUDA_NVCC_FLAGS  "-Xcompiler \"/wd 4244 /wd 4267 /wd 4819\"")
-if(CMAKE_BUILD_TYPE  STREQUAL "Debug")
+  list(APPEND CUDA_NVCC_FLAGS  "--compiler-options;/bigobj")
-  list(APPEND CUDA_NVCC_FLAGS  "-g -G")
+  if(CMAKE_BUILD_TYPE  STREQUAL "Debug")
-  # match the cl's _ITERATOR_DEBUG_LEVEL
+    list(APPEND CUDA_NVCC_FLAGS  "-g -G")
-  list(APPEND CUDA_NVCC_FLAGS  "-D_DEBUG")
+    # match the cl's _ITERATOR_DEBUG_LEVEL
-elseif(CMAKE_BUILD_TYPE STREQUAL "Release")
+    list(APPEND CUDA_NVCC_FLAGS  "-D_DEBUG")
-  list(APPEND CUDA_NVCC_FLAGS "-O3 -DNDEBUG")
+  elseif(CMAKE_BUILD_TYPE STREQUAL "Release")
-else()
+    list(APPEND CUDA_NVCC_FLAGS "-O3 -DNDEBUG")
+  else()
  message(FATAL "Windows only support Release or Debug build now. Please set visual studio build type to Release/Debug, x64 build.")
 endif()
 endif(NOT WIN32)

--- a/cmake/external/glog.cmake
+++ b/cmake/external/glog.cmake
@@ -20,8 +20,10 @@ SET(GLOG_INCLUDE_DIR "${GLOG_INSTALL_DIR}/include" CACHE PATH "glog include dire
 IF(WIN32)
  SET(GLOG_LIBRARIES "${GLOG_INSTALL_DIR}/lib/libglog.lib" CACHE FILEPATH "glog library." FORCE)
+  SET(GLOG_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4267 /wd4530")
 ELSE(WIN32)
  SET(GLOG_LIBRARIES "${GLOG_INSTALL_DIR}/lib/libglog.a" CACHE FILEPATH "glog library." FORCE)
+  SET(GLOG_CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS})
 ENDIF(WIN32)
 INCLUDE_DIRECTORIES(${GLOG_INCLUDE_DIR})
@@ -39,7 +41,7 @@ ExternalProject_Add(
    UPDATE_COMMAND  ""
    CMAKE_ARGS      -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}
                    -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}
-                    -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS}
+                    -DCMAKE_CXX_FLAGS=${GLOG_CMAKE_CXX_FLAGS}
                    -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE}
                    -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG}
                    -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS}

--- a/cmake/external/mkldnn.cmake
+++ b/cmake/external/mkldnn.cmake
@@ -49,6 +49,8 @@ IF(NOT WIN32)
    SET(MKLDNN_FLAG "${MKLDNN_FLAG} -Wno-unused-result -Wno-unused-value")
    SET(MKLDNN_CFLAG "${CMAKE_C_FLAGS} ${MKLDNN_FLAG}")
    SET(MKLDNN_CXXFLAG "${CMAKE_CXX_FLAGS} ${MKLDNN_FLAG}")
+ELSE()
+    SET(MKLDNN_CXXFLAG "${CMAKE_CXX_FLAGS} /EHsc")
 ENDIF(NOT WIN32)
 ExternalProject_Add(
@@ -61,7 +63,6 @@ ExternalProject_Add(
    UPDATE_COMMAND      ""
    CMAKE_ARGS          -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}
    CMAKE_ARGS          -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}
-    CMAKE_ARGS          -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS}
    CMAKE_ARGS          -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE}
    CMAKE_ARGS          -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG}
    CMAKE_ARGS          -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS}

--- a/cmake/external/snappy.cmake
+++ b/cmake/external/snappy.cmake
@@ -20,6 +20,12 @@ set(SNAPPY_SOURCES_DIR ${THIRD_PARTY_PATH}/snappy)
 set(SNAPPY_INSTALL_DIR ${THIRD_PARTY_PATH}/install/snappy)
 set(SNAPPY_INCLUDE_DIR "${SNAPPY_INSTALL_DIR}/include" CACHE PATH "snappy include directory." FORCE)
+if(WIN32)
+    SET(SNAPPY_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4244 /wd4267")
+else()
+    SET(SNAPPY_CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS})
+endif()
 ExternalProject_Add(
    extern_snappy
    GIT_REPOSITORY "https://github.com/google/snappy"
@@ -31,7 +37,7 @@ ExternalProject_Add(
                    -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS}
                    -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG}
                    -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE}
-                    -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS}
+                    -DCMAKE_CXX_FLAGS=${SNAPPY_CMAKE_CXX_FLAGS}
                    -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE}
                    -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG}
                    -DCMAKE_INSTALL_PREFIX=${SNAPPY_INSTALL_DIR}

--- a/cmake/flags.cmake
+++ b/cmake/flags.cmake
@@ -147,12 +147,6 @@ set(GPU_COMMON_FLAGS
    -Wno-error=unused-function  # Warnings in Numpy Header.
    -Wno-error=array-bounds # Warnings in Eigen::array
 )
-else(NOT WIN32)
-set(COMMON_FLAGS
-    "/w") #disable all warnings.
-set(GPU_COMMON_FLAGS
-    "/w") #disable all warnings
 endif(NOT WIN32)
 if (APPLE)
@@ -193,8 +187,7 @@ safe_set_static_flag()
        CMAKE_CXX_FLAGS_MINSIZEREL CMAKE_CXX_FLAGS_RELWITHDEBINFO
        CMAKE_C_FLAGS CMAKE_C_FLAGS_DEBUG CMAKE_C_FLAGS_RELEASE
        CMAKE_C_FLAGS_MINSIZEREL CMAKE_C_FLAGS_RELWITHDEBINFO)
-      if(${flag_var} MATCHES "/W3")
+        string(REGEX REPLACE "(^| )/W[0-9]( |$)" " " ${flag_var} "${${flag_var}}")
-        string(REGEX REPLACE "/W3" "/w" ${flag_var} "${${flag_var}}")
+        set(flag_var "${flag_var} /w")
-      endif(${flag_var} MATCHES "/W3")
    endforeach(flag_var)
 endif(WIN32)
--- a/cmake/operators.cmake
+++ b/cmake/operators.cmake
@@ -52,8 +52,8 @@ function(op_library TARGET)
        endif()
        if(WITH_MKLDNN)
            string(REPLACE "_op" "_mkldnn_op" MKLDNN_FILE "${TARGET}")
-            if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${MKLDNN_FILE}.cc)
+            if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/mkldnn/${MKLDNN_FILE}.cc)
-                list(APPEND mkldnn_cc_srcs ${MKLDNN_FILE}.cc)
+                list(APPEND mkldnn_cc_srcs mkldnn/${MKLDNN_FILE}.cc)
            endif()
        endif()
    else()

--- a/cmake/version.cmake
+++ b/cmake/version.cmake
@@ -31,8 +31,23 @@ while ("${PADDLE_VERSION}" STREQUAL "")
          set(tmp_version "${GIT_TAG_NAME}~1")
        endif()
      else()
-        # otherwise, we always set PADDLE_VERSION to 0.0.0 to represent latest
+        execute_process(
-        set(PADDLE_VERSION "0.0.0")
+          COMMAND ${GIT_EXECUTABLE} describe --exact-match --tags ${tmp_version}
+          WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}
+          OUTPUT_VARIABLE GIT_EXACT_TAG_NAME
+          RESULT_VARIABLE GIT_EXACT_TAG_RESULT
+          ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE)
+        if (NOT ${GIT_EXACT_TAG_NAME})
+          # Check if current branch is tag branch
+          if (${GIT_EXACT_TAG_NAME} MATCHES "v${TAG_VERSION_REGEX}")
+            string(REPLACE "v" "" PADDLE_VERSION ${GIT_EXACT_TAG_NAME})
+          else()
+            set(PADDLE_VERSION "0.0.0")
+          endif()
+        else()
+          # otherwise, we always set PADDLE_VERSION to 0.0.0 to represent latest
+          set(PADDLE_VERSION "0.0.0")
+        endif()
      endif()
    else()
      set(PADDLE_VERSION "0.0.0")

--- a/paddle/fluid/API.spec
+++ b/paddle/fluid/API.spec
@@ -122,7 +122,7 @@ paddle.fluid.layers.transpose ArgSpec(args=['x', 'perm', 'name'], varargs=None,
 paddle.fluid.layers.im2sequence ArgSpec(args=['input', 'filter_size', 'stride', 'padding', 'input_image_size', 'out_stride', 'name'], varargs=None, keywords=None, defaults=(1, 1, 0, None, 1, None))
 paddle.fluid.layers.nce ArgSpec(args=['input', 'label', 'num_total_classes', 'sample_weight', 'param_attr', 'bias_attr', 'num_neg_samples', 'name', 'sampler', 'custom_dist', 'seed', 'is_sparse'], varargs=None, keywords=None, defaults=(None, None, None, None, None, 'uniform', None, 0, False))
 paddle.fluid.layers.hsigmoid ArgSpec(args=['input', 'label', 'num_classes', 'param_attr', 'bias_attr', 'name', 'path_table', 'path_code', 'is_custom', 'is_sparse'], varargs=None, keywords=None, defaults=(None, None, None, None, None, False, False))
-paddle.fluid.layers.beam_search ArgSpec(args=['pre_ids', 'pre_scores', 'ids', 'scores', 'beam_size', 'end_id', 'level', 'is_accumulated', 'name'], varargs=None, keywords=None, defaults=(0, True, None))
+paddle.fluid.layers.beam_search ArgSpec(args=['pre_ids', 'pre_scores', 'ids', 'scores', 'beam_size', 'end_id', 'level', 'is_accumulated', 'name', 'return_parent_idx'], varargs=None, keywords=None, defaults=(0, True, None, False))
 paddle.fluid.layers.row_conv ArgSpec(args=['input', 'future_context_size', 'param_attr', 'act'], varargs=None, keywords=None, defaults=(None, None))
 paddle.fluid.layers.multiplex ArgSpec(args=['inputs', 'index'], varargs=None, keywords=None, defaults=None)
 paddle.fluid.layers.layer_norm ArgSpec(args=['input', 'scale', 'shift', 'begin_norm_axis', 'epsilon', 'param_attr', 'bias_attr', 'act', 'name'], varargs=None, keywords=None, defaults=(True, True, 1, 1e-05, None, None, None, None))
@@ -142,10 +142,10 @@ paddle.fluid.layers.label_smooth ArgSpec(args=['label', 'prior_dist', 'epsilon',
 paddle.fluid.layers.roi_pool ArgSpec(args=['input', 'rois', 'pooled_height', 'pooled_width', 'spatial_scale'], varargs=None, keywords=None, defaults=(1, 1, 1.0))
 paddle.fluid.layers.roi_align ArgSpec(args=['input', 'rois', 'pooled_height', 'pooled_width', 'spatial_scale', 'sampling_ratio', 'name'], varargs=None, keywords=None, defaults=(1, 1, 1.0, -1, None))
 paddle.fluid.layers.dice_loss ArgSpec(args=['input', 'label', 'epsilon'], varargs=None, keywords=None, defaults=(1e-05,))
-paddle.fluid.layers.image_resize ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'resample', 'actual_shape'], varargs=None, keywords=None, defaults=(None, None, None, 'BILINEAR', None))
+paddle.fluid.layers.image_resize ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'resample', 'actual_shape', 'align_corners', 'align_mode'], varargs=None, keywords=None, defaults=(None, None, None, 'BILINEAR', None, True, 1))
 paddle.fluid.layers.image_resize_short ArgSpec(args=['input', 'out_short_len', 'resample'], varargs=None, keywords=None, defaults=('BILINEAR',))
-paddle.fluid.layers.resize_bilinear ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'actual_shape'], varargs=None, keywords=None, defaults=(None, None, None, None))
+paddle.fluid.layers.resize_bilinear ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'actual_shape', 'align_corners', 'align_mode'], varargs=None, keywords=None, defaults=(None, None, None, None, True, 1))
-paddle.fluid.layers.resize_nearest ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'actual_shape'], varargs=None, keywords=None, defaults=(None, None, None, None))
+paddle.fluid.layers.resize_nearest ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'actual_shape', 'align_corners'], varargs=None, keywords=None, defaults=(None, None, None, None, True))
 paddle.fluid.layers.gather ArgSpec(args=['input', 'index'], varargs=None, keywords=None, defaults=None)
 paddle.fluid.layers.scatter ArgSpec(args=['input', 'index', 'updates', 'name'], varargs=None, keywords=None, defaults=(None,))
 paddle.fluid.layers.sequence_scatter ArgSpec(args=['input', 'index', 'updates', 'name'], varargs=None, keywords=None, defaults=(None,))
@@ -322,9 +322,11 @@ paddle.fluid.layers.generate_proposal_labels ArgSpec(args=['rpn_rois', 'gt_class
 paddle.fluid.layers.generate_proposals ArgSpec(args=['scores', 'bbox_deltas', 'im_info', 'anchors', 'variances', 'pre_nms_top_n', 'post_nms_top_n', 'nms_thresh', 'min_size', 'eta', 'name'], varargs=None, keywords=None, defaults=(6000, 1000, 0.5, 0.1, 1.0, None))
 paddle.fluid.layers.generate_mask_labels ArgSpec(args=['im_info', 'gt_classes', 'is_crowd', 'gt_segms', 'rois', 'labels_int32', 'num_classes', 'resolution'], varargs=None, keywords=None, defaults=None)
 paddle.fluid.layers.iou_similarity ArgSpec(args=['x', 'y', 'name'], varargs=None, keywords=None, defaults=(None,))
-paddle.fluid.layers.box_coder ArgSpec(args=['prior_box', 'prior_box_var', 'target_box', 'code_type', 'box_normalized', 'name'], varargs=None, keywords=None, defaults=('encode_center_size', True, None))
+paddle.fluid.layers.box_coder ArgSpec(args=['prior_box', 'prior_box_var', 'target_box', 'code_type', 'box_normalized', 'name', 'axis'], varargs=None, keywords=None, defaults=('encode_center_size', True, None, 0))
 paddle.fluid.layers.polygon_box_transform ArgSpec(args=['input', 'name'], varargs=None, keywords=None, defaults=(None,))
-paddle.fluid.layers.yolov3_loss ArgSpec(args=['x', 'gtbox', 'gtlabel', 'anchors', 'class_num', 'ignore_thresh', 'loss_weight_xy', 'loss_weight_wh', 'loss_weight_conf_target', 'loss_weight_conf_notarget', 'loss_weight_class', 'name'], varargs=None, keywords=None, defaults=(None, None, None, None, None, None))
+paddle.fluid.layers.yolov3_loss ArgSpec(args=['x', 'gtbox', 'gtlabel', 'anchors', 'anchor_mask', 'class_num', 'ignore_thresh', 'downsample_ratio', 'name'], varargs=None, keywords=None, defaults=(None,))
+paddle.fluid.layers.box_clip ArgSpec(args=['input', 'im_info', 'name'], varargs=None, keywords=None, defaults=(None,))
+paddle.fluid.layers.multiclass_nms ArgSpec(args=['bboxes', 'scores', 'score_threshold', 'nms_top_k', 'keep_top_k', 'nms_threshold', 'normalized', 'nms_eta', 'background_label', 'name'], varargs=None, keywords=None, defaults=(0.3, True, 1.0, 0, None))
 paddle.fluid.layers.accuracy ArgSpec(args=['input', 'label', 'k', 'correct', 'total'], varargs=None, keywords=None, defaults=(1, None, None))
 paddle.fluid.layers.auc ArgSpec(args=['input', 'label', 'curve', 'num_thresholds', 'topk', 'slide_steps'], varargs=None, keywords=None, defaults=('ROC', 4095, 1, 1))
 paddle.fluid.layers.exponential_decay ArgSpec(args=['learning_rate', 'decay_steps', 'decay_rate', 'staircase'], varargs=None, keywords=None, defaults=(False,))
@@ -360,6 +362,9 @@ paddle.fluid.contrib.QuantizeTranspiler.__init__ ArgSpec(args=['self', 'weight_b
 paddle.fluid.contrib.QuantizeTranspiler.convert_to_int8 ArgSpec(args=['self', 'program', 'place', 'scope'], varargs=None, keywords=None, defaults=(None,))
 paddle.fluid.contrib.QuantizeTranspiler.freeze_program ArgSpec(args=['self', 'program', 'place', 'fuse_bn', 'scope'], varargs=None, keywords=None, defaults=(False, None))
 paddle.fluid.contrib.QuantizeTranspiler.training_transpile ArgSpec(args=['self', 'program', 'startup_program'], varargs=None, keywords=None, defaults=(None, None))
+paddle.fluid.contrib.Calibrator.__init__ ArgSpec(args=['self'], varargs='args', keywords='kwargs', defaults=None)
+paddle.fluid.contrib.Calibrator.sample_data ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None)
+paddle.fluid.contrib.Calibrator.save_int8_model ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None)
 paddle.fluid.contrib.reader.ctr_reader.ctr_reader ArgSpec(args=['feed_dict', 'file_type', 'file_format', 'dense_slot_index', 'sparse_slot_index', 'capacity', 'thread_num', 'batch_size', 'file_list', 'slots', 'name'], varargs=None, keywords=None, defaults=(None,))
 paddle.fluid.contrib.build_compressor ArgSpec(args=['place', 'data_reader', 'data_feeder', 'scope', 'metrics', 'epoch', 'config'], varargs=None, keywords=None, defaults=(None, None, None, None, None, None, None))
 paddle.fluid.contrib.CompressPass.__init__ ArgSpec(args=['self', 'place', 'data_reader', 'data_feeder', 'scope', 'metrics', 'epoch', 'program_exe'], varargs=None, keywords=None, defaults=(None, None, None, None, None, None, None))

--- a/paddle/fluid/framework/CMakeLists.txt
+++ b/paddle/fluid/framework/CMakeLists.txt
@@ -128,7 +128,7 @@ cc_test(version_test SRCS version_test.cc DEPS version)
 cc_library(proto_desc SRCS var_desc.cc op_desc.cc block_desc.cc program_desc.cc DEPS shape_inference op_info operator glog version)
-cc_library(op_registry SRCS op_registry.cc DEPS op_proto_maker op_info operator glog proto_desc)
+cc_library(op_registry SRCS op_registry.cc DEPS op_proto_maker op_info operator glog proto_desc memory_optimize_helper)
 nv_test(op_registry_test SRCS op_registry_test.cc DEPS op_registry)
 py_proto_compile(framework_py_proto SRCS framework.proto data_feed.proto)
@@ -192,6 +192,7 @@ cc_library(prune SRCS prune.cc DEPS framework_proto)
 cc_test(prune_test SRCS prune_test.cc DEPS op_info prune recurrent_op device_context)
 cc_test(var_type_inference_test SRCS var_type_inference_test.cc DEPS op_registry
        proto_desc)
+cc_test(inplace_op_inference_test SRCS inplace_op_inference_test.cc DEPS op_registry proto_desc op_info memory_optimize_helper)
 cc_library(selected_rows SRCS selected_rows.cc DEPS tensor)
 cc_test(selected_rows_test SRCS selected_rows_test.cc DEPS selected_rows)

--- a/paddle/fluid/framework/details/CMakeLists.txt
+++ b/paddle/fluid/framework/details/CMakeLists.txt
@@ -50,7 +50,9 @@ cc_library(data_balance_op_handle SRCS data_balance_op_handle.cc DEPS op_handle_
 cc_library(gather_op_handle SRCS gather_op_handle.cc DEPS op_handle_base scope ddim memory variable_visitor)
 cc_library(fuse_vars_op_handle SRCS fuse_vars_op_handle.cc DEPS op_handle_base scope)
-cc_library(memory_optimize_pass SRCS analysis_var_pass.cc memory_reuse_types.cc DEPS graph graph_helper pass)
+cc_library(memory_optimize_helper SRCS memory_optimize_helper.cc DEPS graph graph_helper)
+cc_library(memory_optimize_pass SRCS memory_optimize_pass.cc DEPS memory_optimize_helper pass)
+cc_library(inplace_op_pass SRCS inplace_op_pass.cc DEPS memory_optimize_pass op_info)
 cc_library(modify_op_lock_and_record_event_pass SRCS modify_op_lock_and_record_event_pass.cc DEPS computation_op_handle op_graph_view multi_devices_helper)
 cc_library(memory_early_delete_pass SRCS memory_early_delete_pass.cc DEPS memory_optimize_pass computation_op_handle scale_loss_grad_op_handle rpc_op_handle
        all_reduce_op_handle reduce_op_handle broadcast_op_handle data_balance_op_handle graph graph_helper pass)
@@ -65,12 +67,12 @@ cc_library(all_reduce_deps_pass SRCS all_reduce_deps_pass.cc DEPS graph graph_he
 cc_library(multi_devices_graph_pass SRCS multi_devices_graph_pass.cc DEPS multi_devices_helper computation_op_handle
        scale_loss_grad_op_handle rpc_op_handle all_reduce_op_handle reduce_op_handle broadcast_op_handle data_balance_op_handle fused_broadcast_op_handle)
-set(SSA_GRAPH_EXECUTOR_DEPS graph framework_proto sequential_execution_pass modify_op_lock_and_record_event_pass all_reduce_deps_pass reference_count_pass eager_deletion_pass memory_optimize_pass memory_early_delete_pass)
+set(SSA_GRAPH_EXECUTOR_DEPS graph framework_proto sequential_execution_pass modify_op_lock_and_record_event_pass all_reduce_deps_pass reference_count_pass eager_deletion_pass memory_optimize_pass memory_early_delete_pass inplace_op_pass)
 if (WITH_GPU)
  list(APPEND SSA_GRAPH_EXECUTOR_DEPS reference_count_pass)
 endif()
-cc_test(memory_reuse_types_test SRCS memory_reuse_types_test.cc memory_reuse_types.cc DEPS framework_proto graph)
+cc_test(memory_optimize_helper_test SRCS memory_optimize_helper_test.cc memory_optimize_helper.cc DEPS framework_proto graph)
-cc_test(analysis_var_pass_test SRCS analysis_var_pass_test.cc analysis_var_pass.cc memory_reuse_types.cc DEPS framework_proto graph graph_helper op_registry pass)
+cc_test(memory_optimize_pass_test SRCS memory_optimize_pass_test.cc memory_optimize_pass.cc memory_optimize_helper.cc DEPS framework_proto graph graph_helper op_registry pass)
 cc_library(ssa_graph_executor SRCS ssa_graph_executor.cc DEPS ${SSA_GRAPH_EXECUTOR_DEPS})

--- a/paddle/fluid/framework/details/build_strategy.cc
+++ b/paddle/fluid/framework/details/build_strategy.cc
@@ -17,7 +17,7 @@ limitations under the License. */
 #include <glog/logging.h>
 #include <memory>
-#include "paddle/fluid/framework/details/memory_reuse_types.h"
+#include "paddle/fluid/framework/details/memory_optimize_helper.h"
 #include "paddle/fluid/framework/details/multi_devices_graph_pass.h"
 #include "paddle/fluid/framework/details/multi_devices_graph_print_pass.h"
 #include "paddle/fluid/framework/details/reduce_op_handle.h"
@@ -47,6 +47,22 @@ class ParallelExecutorPassBuilder : public ir::PassBuilder {
      AppendPass("sequential_execution_pass");
    }
+    // Add op fusion.
+    if (strategy.fuse_relu_depthwise_conv_) {
+      AppendPass("fuse_relu_depthwise_conv_pass");
+    }
+    // NOTE(dzhwinter): A note for automatical inplace.
+    // 1. modify program desc passes should put
+    // before inplace pass.
+    // 2. manually configured inplace should put
+    // before inplace_pass
+    // Add automatically inplace.
+    if (strategy_.enable_inplace_) {
+      AppendPass("inplace_pass");
+    }
    // Add a graph viz pass to record a graph.
    if (!strategy_.debug_graphviz_path_.empty()) {
      auto viz_pass = AppendPass("graph_viz_pass");
@@ -55,10 +71,6 @@ class ParallelExecutorPassBuilder : public ir::PassBuilder {
      viz_pass->Set<std::string>("graph_viz_path", new std::string(graph_path));
    }
-    // Add op fusion.
-    if (strategy.fuse_relu_depthwise_conv_) {
-      AppendPass("fuse_relu_depthwise_conv_pass");
-    }
    if (strategy.fuse_elewise_add_act_ops_) {
      auto fuse_elewise_add_act_pass = AppendPass("fuse_elewise_add_act_pass");
      // Add a graph viz pass to record a graph.
@@ -88,7 +100,7 @@ class ParallelExecutorPassBuilder : public ir::PassBuilder {
    // A side-effect of that, memory optimize cannot forsee the fetched vars
    // , so fetchlist should be set persistable before call the Run interface.
    if (strategy.memory_optimize_) {
-      auto analysis_var_pass = AppendPass("analysis_var_pass");
+      auto memory_optimize_pass = AppendPass("memory_optimize_pass");
    }
    AppendMultiDevPass(strategy);
@@ -186,8 +198,10 @@ std::unique_ptr<ir::Graph> BuildStrategy::Apply(
      pass->Erase("nccl_ctxs");
      pass->SetNotOwned<platform::NCCLContextMap>("nccl_ctxs", nctx);
 #endif
+    } else if (pass->Type() == "memory_optimize_pass") {
-    } else if (pass->Type() == "analysis_var_pass") {
+      if (graph->Has(kAllOpDescs)) {
+        graph->Erase(kAllOpDescs);
+      }
      const std::vector<OpDesc *> *all_op_descs =
          new std::vector<OpDesc *>(main_program.Block(0).AllOps());
      graph->Set<const std::vector<OpDesc *>>(kAllOpDescs,
@@ -214,6 +228,13 @@ std::unique_ptr<ir::Graph> BuildStrategy::Apply(
      pass->Set<const std::vector<OpDesc *>>(
          kAllOpDescs,
          new std::vector<OpDesc *>(main_program.Block(0).AllOps()));
+    } else if (pass->Type() == "inplace_pass") {
+      if (graph->Has(kAllOpDescs)) {
+        graph->Erase(kAllOpDescs);
+      }
+      graph->Set<const std::vector<OpDesc *>>(
+          kAllOpDescs,
+          new std::vector<OpDesc *>(main_program.Block(0).AllOps()));
    } else if (pass->Type() == "fuse_relu_depthwise_conv_pass") {
      if (!use_cuda) {
        LOG(WARNING) << "fuse_relu_depthwise_conv_pass is only supported on "
@@ -239,9 +260,10 @@ USE_PASS(allreduce_mode_multi_devices_pass);
 USE_PASS(dist_multi_devices_pass);
 USE_PASS(multi_devices_check_pass);
 USE_PASS(multi_devices_print_pass);
-USE_PASS(analysis_var_pass);
+USE_PASS(memory_optimize_pass);
 USE_PASS(sequential_execution_pass);
 USE_PASS(all_reduce_deps_pass);
 USE_PASS(modify_op_lock_and_record_event_pass);
+USE_PASS(inplace_pass);
 USE_PASS(lock_free_optimize_pass);
 USE_PASS(graph_to_program_pass);
--- a/paddle/fluid/framework/details/build_strategy.h
+++ b/paddle/fluid/framework/details/build_strategy.h
@@ -80,6 +80,11 @@ struct BuildStrategy {
  bool memory_early_delete_{false};
+  // TODO(dzhwinter):
+  // make enable_inplace, memory_optimize_
+  // memory_early_delete_ true by default
+  bool enable_inplace_{false};
  bool enable_sequential_execution_{false};
  bool fuse_broadcast_op_{false};

--- a/paddle/fluid/framework/details/graph_test_base.h
+++ b/paddle/fluid/framework/details/graph_test_base.h
+// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+#pragma once
+#include <algorithm>
+#include <iostream>
+#include <iterator>
+#include <string>
+#include "glog/logging.h"
+#include "gtest/gtest.h"
+#include "paddle/fluid/framework/ir/graph.h"
+#include "paddle/fluid/framework/ir/graph_helper.h"
+#include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/framework/program_desc.h"
+namespace paddle {
+namespace framework {
+class DummyOp : public OperatorBase {
+ public:
+  DummyOp(const std::string& type, const VariableNameMap& inputs,
+          const VariableNameMap& outputs, const AttributeMap& attrs)
+      : OperatorBase(type, inputs, outputs, attrs) {}
+ private:
+  void RunImpl(const Scope& scope,
+               const platform::Place& place) const override {}
+};
+class SumOpMaker : public OpProtoAndCheckerMaker {
+ public:
+  void Make() {
+    AddInput("X", "").AsDuplicable();
+    AddOutput("Out", "");
+    AddComment("");
+  }
+};
+class AssignOpMaker : public OpProtoAndCheckerMaker {
+ public:
+  void Make() {
+    AddInput("X", "").AsDuplicable();
+    AddOutput("Out", "");
+    AddComment("");
+  }
+};
+class SplitOpMaker : public OpProtoAndCheckerMaker {
+ public:
+  void Make() {
+    AddInput("X", "");
+    AddOutput("Out", "").AsDuplicable();
+    AddComment("");
+  }
+};
+class DummyVarTypeInference : public VarTypeInference {
+ public:
+  void operator()(const OpDesc& op_desc, BlockDesc* block) const override {
+    auto& inputs = op_desc.Input("X");
+    auto type = block->Var(inputs.front())->GetType();
+    auto out_var_name = op_desc.Output("Out").front();
+    block->Var(out_var_name)->SetType(type);
+  }
+};
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/fluid/framework/details/inplace_op_pass.cc
+++ b/paddle/fluid/framework/details/inplace_op_pass.cc
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+#include "paddle/fluid/framework/details/inplace_op_pass.h"
+#include <algorithm>
+#include <deque>
+#include <iterator>
+#include <stack>
+#include <string>
+#include <unordered_map>
+#include <unordered_set>
+#include <vector>
+#include "paddle/fluid/framework/details/memory_optimize_pass.h"
+#include "paddle/fluid/framework/ir/graph_helper.h"
+#include "paddle/fluid/framework/op_info.h"
+// NOTE(dzhwinter): inplace means one op output variable reuse the input space.
+// By our design, one operator only can read its input(const Variable),
+// write its output(non-const Variable). If one operator is inplaced, means
+// user have chance to write the space before reading happens.
+// Especially when some optimize code writing style is applied.
+//
+//
+// /* wrong case in operator */
+// /*In this case, a larger allocation is allocated, input content is lost*/
+// const Tensor* in = ctx.Input<Tensor>("In")
+// Tensor* out = ctx.Output<Tensor>("Out");
+// auto* out_ptr = out->mutable_data<T>(ctx.GetPlace());
+// out_ptr[0] = 0;  // input contect is overwrited.
+// NOTE(dzhwinter):
+// Only for backward compacity and stable. if enable_inplace_whitelist is turn
+// on.
+// only the ops in whitelist will be use inplace strategy.
+// if not, all the op will be inplaced if it registered with InplaceClass
+DEFINE_bool(
+    enable_inplace_whitelist, false,
+    "If this option turns on, only these op in whitelist can be inplaced."
+    "If it turns off, all of the running op can be candidate of inplaced op."
+    "Such as scale, elementwise_add"
+    "By default, it's turned on");
+DECLARE_string(memory_optimize_debug);
+// clang-format off
+const std::string kInplacedOpWhiteList[] = { // NOLINT
+    "sigmoid",
+    "exp",
+    "relu",
+    "tanh",
+    "sqrt",
+    "ceil",
+    "floor",
+    "reciprocal",
+    "relu6",
+    "soft_relu",
+    "hard_sigmoid",
+    "batch_norm",
+    "batch_norm_grad",
+    "sum",
+    "sum_grad",
+    "scale",
+    "reshape",
+    "elementwise_add",
+    "elementwise_add_grad",
+};
+// clang-format on
+namespace paddle {
+namespace framework {
+namespace details {
+static inline ir::Node* GetNextCascadeInplacedVar(ir::Node* var) {
+  // if next op is inplaced, then return the output var
+  // otherwise return nullptr
+  PADDLE_ENFORCE(var && var->IsVar() && !var->IsCtrlVar());
+  ir::Node* inplaced_var = nullptr;
+  for (auto* next_op : var->outputs) {
+    for (auto* output : next_op->outputs) {
+      if (output->IsVar() && !output->IsCtrlVar() &&
+          output->Name() == var->Name()) {
+        inplaced_var = output;
+      }
+    }
+  }
+  return inplaced_var;
+}
+static inline ir::Node* GetPrevCascadeInplacedVar(ir::Node* var) {
+  PADDLE_ENFORCE(var && var->IsVar() && !var->IsCtrlVar());
+  if (var->inputs.empty()) return nullptr;
+  auto* prev_op = var->inputs.at(0);
+  auto input_it = std::find_if(prev_op->inputs.begin(), prev_op->inputs.end(),
+                               [&](ir::Node* node) {
+                                 if (node->IsVar() && !node->IsCtrlVar() &&
+                                     node->Name() == var->Name()) {
+                                   return true;
+                                 } else {
+                                   return false;
+                                 }
+                               });
+  return input_it == prev_op->inputs.end() ? nullptr : *input_it;
+}
+InplacePass::InplacePass() : Pass() {
+  if (FLAGS_enable_inplace_whitelist) {
+    for (auto& s : kInplacedOpWhiteList) {
+      whitelist_.emplace(s);
+    }
+  }
+}
+void InplacePass::InitSSAGraphNodes() const {
+  std::unordered_map<std::string, std::unordered_set<ir::Node*>> all_vars;
+  for (auto* op : view_.AllOps()) {
+    for (auto* node : op->inputs) {
+      if (!node->IsVar() || node->IsCtrlVar()) continue;
+      if (all_vars[node->Name()].count(node) == 0) {
+        all_vars[node->Name()].emplace(node);
+        var_nodes_[node->Name()].emplace_back(node);
+      }
+    }
+    for (auto* node : op->outputs) {
+      if (!node->IsVar() || node->IsCtrlVar()) continue;
+      if (all_vars[node->Name()].count(node) == 0) {
+        all_vars[node->Name()].emplace(node);
+        var_nodes_[node->Name()].emplace_back(node);
+      }
+    }
+  }
+}
+std::unique_ptr<ir::Graph> InplacePass::ApplyImpl(
+    std::unique_ptr<ir::Graph> graph) const {
+  var_nodes_.clear();
+  view_.Build(graph.get());
+  InitSSAGraphNodes();
+  for (auto* op : view_.AllOps()) {
+    if (FLAGS_enable_inplace_whitelist && !whitelist_.count(op->Name()))
+      continue;
+    TryInplaceOpInputOutput(op, graph.get());
+  }
+  graph->ResolveHazard(var_nodes_);
+  return graph;
+}
+void InplacePass::InplaceModifyDesc(const std::string& var,
+                                    const std::string& cache_var,
+                                    const size_t& idx) const {
+  for (size_t i = idx; i < view_.AllOps().size(); ++i) {
+    ir::Node* op = view_.AllOps()[i];
+    PADDLE_ENFORCE(op->IsOp() && op->Op());
+    auto* op_desc = op->Op();
+    op_desc->RenameInput(var, cache_var);
+    op_desc->RenameOutput(var, cache_var);
+    if (op_desc->Block()->HasVar(var)) op_desc->Block()->RemoveVar(var);
+    op_desc->Flush();
+  }
+}
+const SSANodePair InplacePass::TryInplaceModifyVar(const std::string& var,
+                                                   const std::string& cache_var,
+                                                   const size_t& idx,
+                                                   ir::Graph* graph) const {
+  PADDLE_ENFORCE(var_nodes_[var].size() >= 1 &&
+                 var_nodes_[var].at(0)->Var() != nullptr);
+  std::unique_ptr<VarDesc> var_desc(new VarDesc(*var_nodes_[var].at(0)->Var()));
+  var_desc->SetName(cache_var);
+  SSANodePair swap_nodes;
+  for (size_t i = idx; i < view_.AllOps().size(); ++i) {
+    auto* op = view_.AllOps()[i];
+    // redirect the input to the latest version of cache_var
+    for (auto* node : op->inputs) {
+      if (node->Name() == var) {
+        ir::Node* cache_node = graph->CreateVarNode(var_desc.get());
+        // swap node to cache_node
+        cache_node->outputs.insert(cache_node->outputs.end(),
+                                   node->outputs.begin(), node->outputs.end());
+        PADDLE_ENFORCE(node->inputs.size() == 1 && node->inputs[0]->IsOp());
+        auto* prev_op = node->inputs[0];
+        std::replace(prev_op->outputs.begin(), prev_op->outputs.end(), node,
+                     cache_node);
+        cache_node->inputs.emplace_back(prev_op);
+        for (auto* next_op : node->outputs) {
+          std::replace(next_op->inputs.begin(), next_op->inputs.end(), node,
+                       cache_node);
+        }
+        swap_nodes.emplace_back(std::make_pair(node, cache_node));
+      }
+    }
+    // if we need to rename the output,
+    // always create a newer version of cache_var
+    for (auto* node : op->outputs) {
+      if (node->Name() == var) {
+        ir::Node* cache_node = graph->CreateVarNode(var_desc.get());
+        // swap node to cache node
+        cache_node->outputs.insert(cache_node->outputs.end(),
+                                   node->outputs.begin(), node->outputs.end());
+        cache_node->inputs.emplace_back(op);
+        std::replace(op->outputs.begin(), op->outputs.end(), node, cache_node);
+        for (auto* next_op : node->outputs) {
+          std::replace(next_op->inputs.begin(), next_op->inputs.end(), node,
+                       cache_node);
+        }
+        swap_nodes.emplace_back(std::make_pair(node, cache_node));
+      }
+    }
+  }
+  return swap_nodes;
+}
+void InplacePass::CommitModify(const SSANodePair& swap_nodes,
+                               ir::Graph* graph) const {
+  for (auto& pair : swap_nodes) {
+    auto *node = pair.first, *cache_node = pair.second;
+    const std::string var = node->Name(), cache_var = cache_node->Name();
+    var_nodes_[cache_var].emplace_back(cache_node);
+    graph->RemoveNode(node);
+    auto& nodes = var_nodes_.at(var);
+    // release unused var in graph. Because python side memory optimize
+    // may reused the var in same name, so we only clear the var node
+    // after current inplaced index.
+    nodes.erase(std::remove(nodes.begin(), nodes.end(), node), nodes.end());
+  }
+}
+void InplacePass::WithdrawModify(const SSANodePair& nodes,
+                                 ir::Graph* graph) const {
+  for (auto& pair : nodes) {
+    auto *node = pair.first, *cache_node = pair.second;
+    const std::string var = node->Name(), cache_var = cache_node->Name();
+    auto* prev_op = node->inputs[0];
+    std::replace(prev_op->outputs.begin(), prev_op->outputs.end(), cache_node,
+                 node);
+    for (auto* next_op : node->outputs) {
+      std::replace(next_op->inputs.begin(), next_op->inputs.end(), cache_node,
+                   node);
+    }
+    graph->RemoveNode(cache_node);
+  }
+}
+void InplacePass::TryInplaceOpInputOutput(ir::Node* op,
+                                          ir::Graph* graph) const {
+  VLOG(4) << "Try to inplace op " << op->Name();
+  PADDLE_ENFORCE(op->Op() != nullptr && op->Op()->Block() != nullptr,
+                 "op_desc is nullptr");
+  // some pre-requirments need to meet if the op want to inplaced.
+  auto* op_desc = op->Op();
+  auto& infer_inplace =
+      OpInfoMap::Instance().Get(op_desc->Type()).infer_inplace_;
+  // 1. infer_inplace_ is registered.
+  if (!static_cast<bool>(infer_inplace)) return;
+  PADDLE_ENFORCE(static_cast<bool>(infer_inplace),
+                 "%s's infer_inplace has not been registered", op_desc->Type());
+  auto* block = op_desc->Block();
+  auto in_to_outs = infer_inplace(*op_desc, block);
+  auto& all_ops = view_.AllOps();
+  auto cursor = std::find(all_ops.begin(), all_ops.end(), op);
+  size_t idx = std::distance(all_ops.begin(), cursor);
+  for (auto& pair : in_to_outs) {
+    auto& in_var_name = pair.first;
+    auto& out_var_name = pair.second;
+    auto* in_node = view_.GetNodeByName(in_var_name, op->inputs);
+    auto* out_node = view_.GetNodeByName(out_var_name, op->outputs);
+    // 2. there is no external pending op on the input node
+    if (view_.PendingOpsOnVar(in_node).size() > 1) {
+      VLOG(4) << string::Sprintf(
+          "Skiped pair %s => %s. %s input has external dependency."
+          "inplace such pair will overwrite the memory.",
+          out_var_name, in_var_name, op->Name());
+      continue;
+    }
+    // 3. if output has been memory optimize by python(fluid.memory_optmize()).
+    // this candidate  can not be inplaced. Will be deprecated in the future.
+    if (view_.InSkipSet(out_node->Name())) {
+      VLOG(4) << string::Sprintf(
+          "Skiped %s => %s reused previous memory block in python memory "
+          "optmize,"
+          "it inplace may generate a circle",
+          out_var_name, in_var_name, op->Name());
+      continue;
+    }
+    // Debug Interface. Which would be skipped by the pass.
+    if (out_node->Name() == FLAGS_memory_optimize_debug) {
+      VLOG(3) << "Skiped var by force. FLAGS_memory_optimize_debug="
+              << out_node->Name();
+      continue;
+    }
+    // NOTE(dzhwinter):
+    // two stage commit of inplaced process. if after inplace happens generate a
+    // circle,
+    // then withdraw the changes. Otherwise, safely add the node.
+    auto swap_nodes =
+        TryInplaceModifyVar(out_var_name, in_var_name, idx, graph);
+    if (!ir::HasCircle(*graph)) {
+      VLOG(3) << string::Sprintf("!!! %s,  %s => %s inplaced", op->Name(),
+                                 out_var_name, in_var_name);
+      InplaceModifyDesc(out_var_name, in_var_name, idx);
+      CommitModify(swap_nodes, graph);
+    } else {
+      VLOG(3) << string::Sprintf(
+          "Skiped pair %s => %s, inplace will generate a circle. withdraw %s",
+          out_var_name, in_var_name, op->Name());
+      WithdrawModify(swap_nodes, graph);
+    }
+  }
+}
+ir::Node* GraphView::GetNodeByName(const std::string& name,
+                                   const std::vector<ir::Node*>& nodes) const {
+  // nodes should be op->inputs/outputs
+  // node in same node do have different name.
+  std::unordered_set<std::string> nodes_in_op;
+  bool has_dup_node =
+      std::all_of(nodes.begin(), nodes.end(), [&nodes_in_op](ir::Node* node) {
+        if (!node->IsVar() || node->IsCtrlVar() || node->Var() == nullptr) {
+          if (nodes_in_op.count(node->Name())) return true;
+          nodes_in_op.emplace(node->Name());
+        }
+        return false;
+      });
+  PADDLE_ENFORCE(has_dup_node == false, "nodes has same name!");
+  ir::Node* node = nullptr;
+  for (auto* it : nodes) {
+    if (!it->IsVar() || it->IsCtrlVar() || it->Var() == nullptr) continue;
+    if (it->Name() == name) {
+      node = it;
+      break;
+    }
+  }
+  PADDLE_ENFORCE(node != nullptr,
+                 string::Sprintf("Not found var %s in nodes!", name));
+  return node;
+}
+std::vector<ir::Node*> GraphView::PendingOpsOnVar(ir::Node* node) {
+  // get the pending ops depends on same var node.
+  // because node also maybe a inplaced variable, so need to backtrack all the
+  // previous inplaced vars.
+  std::vector<ir::Node*> pending_ops;
+  ir::Node* p = node;
+  while (p != nullptr) {
+    pending_ops.insert(pending_ops.end(), p->outputs.begin(), p->outputs.end());
+    p = GetPrevCascadeInplacedVar(p);
+  }
+  return pending_ops;
+}
+void GraphView::Build(ir::Graph* g) {
+  // track the var nodes in correct order.
+  // Because we insert some new created node. Which may have data race between
+  // nodes.
+  // resolve data harzards depends on the var nodes in right order.
+  ops_ = SortOpLikeDescOrder(*g);
+  // 1. track the nodes which reused previous node in Python memory optimize.
+  // these node can not be inplaced, otherwise may generate a circle in graph.
+  std::unordered_set<std::string> all_vars;
+  for (auto& node : g->Nodes()) {
+    if (node->IsVar()) continue;
+    for (auto& out : node->outputs) {
+      if (out->IsCtrlVar() || out->Var() == nullptr) continue;
+      if (all_vars.count(out->Name())) {
+        dup_nodes_.emplace(out->Name());
+      } else {
+        all_vars.emplace(out->Name());
+      }
+    }
+  }
+  // 2. track the nodes which used by parameter server.
+  // these node can not be inplaced, otherwise trainer
+  // pserver can not find each other name.
+  auto update_skip_set = [&](ir::Node* node) {
+    for (auto& in : node->inputs) {
+      if (in->IsVar() && in->Var() != nullptr) dup_nodes_.emplace(in->Name());
+    }
+    for (auto& out : node->outputs) {
+      if (out->IsVar() && out->Var() != nullptr)
+        dup_nodes_.emplace(out->Name());
+    }
+  };
+  for (auto& node : g->Nodes()) {
+    if (!node->IsOp()) continue;
+    if (node->Name() == "send") update_skip_set(node);
+    if (node->Name() == "recv") update_skip_set(node);
+    if (node->Name() == "prefetch") update_skip_set(node);
+  }
+}
+const std::vector<ir::Node*>& GraphView::AllOps() { return ops_; }
+bool GraphView::InSkipSet(const std::string& var) const {
+  return dup_nodes_.count(var);
+}
+}  // namespace details
+}  // namespace framework
+}  // namespace paddle
+REGISTER_PASS(inplace_pass, paddle::framework::details::InplacePass);
--- a/paddle/fluid/framework/details/inplace_op_pass.h
+++ b/paddle/fluid/framework/details/inplace_op_pass.h
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may abtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+#pragma once
+#include <map>
+#include <string>
+#include <unordered_map>
+#include <unordered_set>
+#include <utility>
+#include <vector>
+#include "paddle/fluid/framework/details/memory_optimize_helper.h"
+#include "paddle/fluid/framework/ir/graph.h"
+#include "paddle/fluid/framework/ir/pass.h"
+namespace paddle {
+namespace framework {
+namespace details {
+class GraphView {
+ public:
+  GraphView() = default;
+  void Build(ir::Graph* g);
+  const std::vector<ir::Node*>& AllOps();
+  ir::Node* GetNodeByName(const std::string& name,
+                          const std::vector<ir::Node*>& nodes) const;
+  std::vector<ir::Node*> PendingOpsOnVar(ir::Node* var);
+  // Will Deperated in the future.
+  // NOTE(dzhwinter) :
+  // 1. Python memory optimize will reuse
+  // memory based var name, so different op output may
+  // have the same variable name. enable inplace on such node
+  // will generate a circle in ssa graph.
+  // 2. DistributeTranspiler will use unique name to
+  // map the parameter and gradient, must be skipped.
+  bool InSkipSet(const std::string& var) const;
+ private:
+  std::vector<ir::Node*> ops_;
+  std::unordered_set<std::string> dup_nodes_;  // mem opt affect nodes
+  std::map<ir::Node*, std::unordered_set<ir::Node*>> adj_list_;
+};
+typedef std::vector<std::pair<ir::Node*, ir::Node*>> SSANodePair;
+class InplacePass : public ir::Pass {
+ public:
+  InplacePass();
+ protected:
+  std::unique_ptr<ir::Graph> ApplyImpl(
+      std::unique_ptr<ir::Graph> graph) const override;
+  void InitSSAGraphNodes() const;
+ private:
+  const SSANodePair TryInplaceModifyVar(const std::string& var,
+                                        const std::string& cache_var,
+                                        const size_t& idx,
+                                        ir::Graph* graph) const;
+  void CommitModify(const SSANodePair&, ir::Graph* graph) const;
+  void WithdrawModify(const SSANodePair& nodes, ir::Graph* graph) const;
+  void InplaceModifyDesc(const std::string& in_var, const std::string& out_var,
+                         const size_t& idx) const;
+  void TryInplaceOpInputOutput(ir::Node* op, ir::Graph* graph) const;
+  mutable std::map<std::string, std::vector<ir::Node*>> var_nodes_;
+  mutable std::unordered_set<std::string> whitelist_;
+  mutable GraphView view_;
+};
+}  // namespace details
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/fluid/framework/details/memory_early_delete_pass.cc
+++ b/paddle/fluid/framework/details/memory_early_delete_pass.cc
@@ -16,7 +16,7 @@
 #include <queue>
 #include <string>
 #include <vector>
-#include "paddle/fluid/framework/details/memory_reuse_types.h"
+#include "paddle/fluid/framework/details/memory_optimize_helper.h"
 #include "paddle/fluid/framework/details/multi_devices_helper.h"
 #include "paddle/fluid/framework/details/reference_count_pass_helper.h"
 #include "paddle/fluid/framework/ir/graph_helper.h"

--- a/paddle/fluid/framework/details/memory_reuse_types.cc
+++ b/paddle/fluid/framework/details/memory_reuse_types.cc
@@ -12,8 +12,10 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
-#include "paddle/fluid/framework/details/memory_reuse_types.h"
+#include "paddle/fluid/framework/details/memory_optimize_helper.h"
+#include <functional>
 #include <iostream>
+#include <numeric>
 #include <sstream>
 #include <string>
@@ -21,15 +23,17 @@ namespace paddle {
 namespace framework {
 namespace details {
+size_t NodeSizeInBytes(const VarDesc& node) {
+  auto shape = node.GetShape();
+  int size =
+      std::accumulate(shape.begin(), shape.end(), 1, std::multiplies<int>());
+  size_t type_size = SizeOfType(node.GetDataType());
+  return type_size * std::abs(size);
+}
 size_t NodeSizeInBytes(ir::Node* n) {
  auto* desc = FindVarDescInBlock(n);
-  auto shape = desc->GetShape();
+  return NodeSizeInBytes(*desc);
-  size_t type_size = SizeOfType(desc->GetDataType());
-  int size = 1;
-  for (auto& s : shape) {
-    size *= s;
-  }
-  return type_size * std::abs(size);
 }
 std::string DebugStringImpl(VarDesc* var) {
@@ -83,7 +87,7 @@ struct NodeComparator {
  }
 };
-void OrderedNodePairPool::Insert(ir::Node* var, ir::Node* op) {
+void OrderedNodeList::Insert(ir::Node* var, ir::Node* op) {
  PADDLE_ENFORCE(var->IsVar() && !var->IsCtrlVar());
  PADDLE_ENFORCE(op->IsOp());
  if (mark_table_.count(var->Name()) != 0) {
@@ -119,11 +123,11 @@ void OrderedNodePairPool::Insert(ir::Node* var, ir::Node* op) {
  mark_table_[var->Name()] = it;
 }
-int OrderedNodePairPool::GetIndex(ir::Node* var) {
+int OrderedNodeList::GetIndex(ir::Node* var) {
  return std::distance(nodes_.begin(), mark_table_[var->Name()]);
 }
-ir::Node* OrderedNodePairPool::NodeMatch(ir::Node* var) const {
+ir::Node* OrderedNodeList::NodeMatch(ir::Node* var) const {
  ir::Node* found_node = nullptr;
  NodeComparator compare_node;
@@ -136,13 +140,15 @@ ir::Node* OrderedNodePairPool::NodeMatch(ir::Node* var) const {
  return found_node;
 }
-void OrderedNodePairPool::Erase(ir::Node* var) {
+void OrderedNodeList::Erase(ir::Node* var) { Erase(var->Name()); }
-  PADDLE_ENFORCE(mark_table_.count(var->Name()));
-  nodes_.erase(mark_table_[var->Name()]);
+void OrderedNodeList::Erase(const std::string& var) {
-  mark_table_.erase(var->Name());
+  PADDLE_ENFORCE(mark_table_.count(var));
+  nodes_.erase(mark_table_[var]);
+  mark_table_.erase(var);
 }
-std::string OrderedNodePairPool::ToString() const {
+std::string OrderedNodeList::ToString() const {
  std::stringstream ss;
  for (auto it = nodes_.begin(); it != nodes_.end(); ++it) {
    ss << DebugString(it->first) << " ";
@@ -150,6 +156,43 @@ std::string OrderedNodePairPool::ToString() const {
  return ss.str();
 }
+bool NodeCanReused(ir::Node* node) {
+  if (node == nullptr || !node->IsVar() || node->IsCtrlVar()) return false;
+  // auto* desc = node->Var();
+  bool flag = NodeCanReused(*node->Var());
+  for (auto* op : node->inputs) {
+    if (op->Op()->HasAttr("force_cpu")) {
+      // op output force generated in cpu, can not be reused.
+      flag &= framework::AttrReader(op->Op()->GetAttrMap())
+                  .Get<bool>("force_cpu") == 0;
+    }
+  }
+  return flag;
+}
+bool NodeCanReused(const VarDesc& node) {
+  auto type = node.GetType();
+  if (node.Persistable() || type != proto::VarType::LOD_TENSOR ||
+      node.GetShape().empty()) {
+    return false;
+  }
+  // vars can be @EMPTY@, @LR_DECAY_REUSE_ID@. For example, while_grad
+  std::string name = node.Name();
+  if (!name.empty() && name[0] == '@' && name[name.size() - 1] == '@')
+    return false;
+  return true;
+}
+bool OpHasSubBlock(OpDesc* desc) {
+  const AttributeMap& attrs = desc->GetAttrMap();
+  for (auto& attr : attrs) {
+    if (attr.second.type() == typeid(BlockDesc*) ||             // NOLINT
+        attr.second.type() == typeid(std::vector<BlockDesc*>))  // NOLINT
+      return true;
+  }
+  return false;
+}
 }  // namespace details
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/fluid/framework/details/memory_reuse_types.h
+++ b/paddle/fluid/framework/details/memory_reuse_types.h
@@ -43,7 +43,7 @@ using GraphNodePool = std::vector<
 // For example,
 // node0[-1, 1] node1[-1, 1, 1], node2[1,1], node3[1,1024], ..
 // O(1) insert, delete
-class OrderedNodePairPool {
+class OrderedNodeList {
 public:
  using NodePair = std::pair<ir::Node*, std::unordered_set<ir::Node*>>;
  using Iter = typename std::list<NodePair>::iterator;
@@ -53,8 +53,12 @@ class OrderedNodePairPool {
  void Erase(ir::Node* var);
+  void Erase(const std::string& var);
  bool Has(ir::Node* var) { return mark_table_.count(var->Name()); }
+  bool Has(const std::string& var) { return mark_table_.count(var); }
  ir::Node* NodeMatch(ir::Node* var) const;
  // map store non-const iterator, can not promise const
  int GetIndex(ir::Node* var);
@@ -67,6 +71,11 @@ class OrderedNodePairPool {
  ConstIter end() const { return nodes_.end(); }
  size_t size() const { return nodes_.size(); }
+  void Clear() {
+    mark_table_.clear();
+    nodes_.clear();
+  }
 private:
  // for searching.
  std::unordered_map<std::string, Iter> mark_table_;
@@ -74,14 +83,53 @@ class OrderedNodePairPool {
  std::list<NodePair> nodes_;
 };
+// valid a tensor can be reuse or not
+bool NodeCanReused(ir::Node* node);
+// valid a tensor can be reuse or not.
+bool NodeCanReused(const VarDesc& node);
+// check op has subblock or not
+bool OpHasSubBlock(OpDesc* desc);
 // node memory size in bytes
 size_t NodeSizeInBytes(ir::Node* n);
+// node memory size in bytes
+size_t NodeSizeInBytes(const VarDesc&);
 std::string DebugString(ir::Node* var);
-// std::string DebugString(VarDesc* var);
 VarDesc* FindVarDescInBlock(ir::Node* n);
+template <typename Container, typename Callback>
+class FilterVariableImpl {
+ public:
+  void operator()(const Container& nodes, Callback callback) {
+    for (auto* node : nodes) {
+      callback(node);
+    }
+  }
+};
+// filter var node for op->inputs/outputs
+template <typename Callback>
+class FilterVariableImpl<std::vector<ir::Node*>, Callback> {
+ public:
+  void operator()(const std::vector<ir::Node*>& nodes, Callback callback) {
+    for (auto* var : nodes) {
+      if (var->IsVar() && !var->IsCtrlVar()) {
+        callback(var);
+      }
+    }
+  }
+};
+template <typename Container, typename Callback>
+void FilterVariables(const Container& nodes, Callback callback) {
+  FilterVariableImpl<Container, Callback>()(nodes, callback);
+}
 }  // namespace details
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/fluid/framework/details/memory_reuse_types_test.cc
+++ b/paddle/fluid/framework/details/memory_reuse_types_test.cc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
-#include "paddle/fluid/framework/details/memory_reuse_types.h"
+#include "paddle/fluid/framework/details/memory_optimize_helper.h"
 #include <algorithm>
 #include <iostream>
 #include <memory>
@@ -27,8 +27,8 @@ namespace paddle {
 namespace framework {
 namespace details {
-TEST(OrderedNodePairPool, Normal) {
+TEST(OrderedNodeList, Normal) {
-  OrderedNodePairPool pool;
+  OrderedNodeList pool;
  std::vector<std::unique_ptr<ir::Node>> nodes;
  // clang-format off

--- a/paddle/fluid/framework/details/analysis_var_pass.cc
+++ b/paddle/fluid/framework/details/analysis_var_pass.cc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
-#include "paddle/fluid/framework/details/analysis_var_pass.h"
+#include "paddle/fluid/framework/details/memory_optimize_pass.h"
 #include <algorithm>
 #include <atomic>
 #include <deque>
@@ -48,39 +48,10 @@ static inline bool IsSameDesc(OpDesc* op1, OpDesc* op2) {
         op1->Outputs() == op2->Outputs();
 }
-template <typename Container, typename Callback>
+std::unique_ptr<ir::Graph> MemoryOptimizePass::ApplyImpl(
-class FilterVariableImpl {
- public:
-  void operator()(const Container& nodes, Callback callback) {
-    for (auto* node : nodes) {
-      callback(node);
-    }
-  }
-};
-// filter var node for op->inputs/outputs
-template <typename Callback>
-class FilterVariableImpl<std::vector<ir::Node*>, Callback> {
- public:
-  void operator()(const std::vector<ir::Node*>& nodes, Callback callback) {
-    for (auto* var : nodes) {
-      if (var->IsVar() && !var->IsCtrlVar()) {
-        callback(var);
-      }
-    }
-  }
-};
-template <typename Container, typename Callback>
-void FilterVariables(const Container& nodes, Callback callback) {
-  FilterVariableImpl<Container, Callback>()(nodes, callback);
-}
-std::unique_ptr<ir::Graph> AnalysisVarPass::ApplyImpl(
    std::unique_ptr<ir::Graph> graph) const {
  auto nodes = graph->Nodes();
-  auto subblock_vars = GetSubBlockVars(nodes);
+  CollectSkipVarsSet(nodes);
-  skip_set_.insert(subblock_vars.begin(), subblock_vars.end());
  cfg_.reset(new details::ControlFlowGraph(*graph));
  cfg_->LiveVariableAnalysis();
@@ -103,48 +74,53 @@ std::unique_ptr<ir::Graph> AnalysisVarPass::ApplyImpl(
    }
    for (auto& var : op->outputs) {
-      if (NodeCanReused(var) && cfg_->Use(op).count(var->Name()) == 0) {
+      if (!NodeCanReused(var) || cfg_->Use(op).count(var->Name()) == 0 ||
-        ir::Node* cache = pool_.NodeMatch(var);
+          skip_set_.count(var->Name()))
-        if (var->Name() == FLAGS_memory_optimize_debug) {
+        continue;
-          VLOG(3) << "start match var " << DebugString(var) << " of op "
+      ir::Node* cache = pool_.NodeMatch(var);
-                  << op->Name();
-          VLOG(3) << pool_.ToString();
+      if (var->Name() == FLAGS_memory_optimize_debug) {
-          VLOG(3) << "matched in pool : "
+        VLOG(3) << "start match var " << DebugString(var) << " of op "
-                  << ((cache == nullptr) ? "False" : "True");
+                << op->Name();
-        }
+        VLOG(3) << pool_.ToString();
-        if (cache != nullptr) {
+        VLOG(3) << "matched in pool : "
-          if (var->Name() == cache->Name()) {
+                << ((cache == nullptr) ? "False" : "True");
-            VLOG(3) << "The same cache variable is cascade reused."
+      }
-                    << var->Name() << " is re-filled to the pool after"
-                    << "the reused op is finished. Current op can not "
-                    << "replace it again. Skip this candidate.";
-            continue;
-          }
-          int node_idx_in_pool = pool_.GetIndex(cache);
+      if (cache == nullptr) continue;
-          VLOG(3) << string::Sprintf(
+      if (var->Name() == cache->Name()) {
-              "!!! %s,  %s => %s, cache idx %d, pool size %d",
+        VLOG(3) << "The same cache variable is cascade reused." << var->Name()
-              std::to_string(reuse_id++), DebugString(var), DebugString(cache),
+                << " is re-filled to the pool after"
-              node_idx_in_pool, static_cast<int>(pool_.size()));
+                << "the reused op is finished. Current op can not "
-          // update CFG Graph on the fly.
+                << "replace it again. Skip this candidate.";
-          // reused var maybe re-fill into the pool
+        continue;
-          cfg_->RenameVarInCFGGraph(var->Name(), cache->Name(), idx);
-          // NOTE(dzhwinter): we need to both update the ProgramDesc
+        int node_idx_in_pool = pool_.GetIndex(cache);
-          // and IR Graph. because op_desc/var_desc is used in CreateOp,
+        VLOG(3) << string::Sprintf(
-          // CreateVar when running happens. But IR Graph
+            "!!! %s,  %s => %s, cache idx %d, pool size %d",
-          // define the dependence relationship between nodes.
+            std::to_string(reuse_id++), DebugString(var), DebugString(cache),
-          RenameVarInGraphDesc(var->Name(), cache->Name(), idx);
+            node_idx_in_pool, static_cast<int>(pool_.size()));
-          RenameVarInGraphNode(var->Name(), cache->Name(), idx, graph.get());
+        // update CFG Graph on the fly.
+        // reused var maybe re-fill into the pool
-          pool_.Erase(cache);
+        cfg_->RenameVarInCFGGraph(var->Name(), cache->Name(), idx);
+        // NOTE(dzhwinter): we need to both update the ProgramDesc
+        // and IR Graph. because op_desc/var_desc is used in CreateOp,
+        // CreateVar when running happens. But IR Graph
+        // define the dependence relationship between nodes.
+        RenameVarInGraphDesc(var->Name(), cache->Name(), idx);
+        RenameVarInGraphNode(var->Name(), cache->Name(), idx, graph.get());
+        pool_.Erase(cache);
+      }
+      // fill the pool
+      std::unordered_set<std::string> unlived_vars;
+      for (auto var : cfg_->LiveIn(op)) {
+        if (cfg_->LiveOut(op).count(var) == 0) {
+          unlived_vars.emplace(var);
        }
      }
-    }
+      for (auto var : unlived_vars) {
-    // fill the pool
-    for (auto var : cfg_->LiveIn(op)) {
-      if (cfg_->LiveOut(op).count(var) == 0) {
        ir::Node* var_node = cfg_->GetNodeFromVarName(var, op);
-        if (var_node == nullptr) continue;
        if (NodeCanReused(var_node) && !pool_.Has(var_node)) {
          pool_.Insert(var_node, op);
        }
@@ -177,7 +153,7 @@ std::unique_ptr<ir::Graph> AnalysisVarPass::ApplyImpl(
  return graph;
 }
-void AnalysisVarPass::SubGraphOptimize(OpDesc* op_desc) const {
+void MemoryOptimizePass::SubGraphOptimize(OpDesc* op_desc) const {
  // conditional block, while op and their grad op
  auto* sub_block_desc =
      AttrReader(op_desc->GetAttrMap()).Get<BlockDesc*>("sub_block");
@@ -247,25 +223,32 @@ void AnalysisVarPass::SubGraphOptimize(OpDesc* op_desc) const {
  }
 }
-std::unordered_set<std::string> AnalysisVarPass::GetSubBlockVars(
+void MemoryOptimizePass::CollectSkipVarsSet(
    const std::unordered_set<ir::Node*>& nodes) const {
-  std::unordered_set<std::string> vars;
+  auto update_skip_set = [&](OpDesc* op_desc) {
+    auto inputs = op_desc->InputArgumentNames();
+    auto outputs = op_desc->OutputArgumentNames();
+    skip_set_.insert(inputs.begin(), inputs.end());
+    skip_set_.insert(outputs.begin(), outputs.end());
+  };
  for (auto& op : nodes) {
    if (!op->IsOp() || op->Op() == nullptr) continue;
    auto* op_desc = op->Op();
-    if (OpHasSubBlock(op_desc)) {
+    // NOTE(dzhwinter):
-      auto inputs = op_desc->InputArgumentNames();
+    // current block can not reuse next level block vars.
-      auto outputs = op_desc->OutputArgumentNames();
+    if (OpHasSubBlock(op_desc)) update_skip_set(op_desc);
-      vars.insert(inputs.begin(), inputs.end());
+    // NOTE(dzhwinter):
-      vars.insert(outputs.begin(), outputs.end());
+    // distributed ops input/output name need to
-    }
+    // keep same bettwen trainer/pserver
+    if (op_desc->Type() == "send") update_skip_set(op_desc);
+    if (op_desc->Type() == "recv") update_skip_set(op_desc);
+    if (op_desc->Type() == "prefetch") update_skip_set(op_desc);
  }
-  return vars;
 }
-void AnalysisVarPass::RenameVarInGraphDesc(const std::string& var,
+void MemoryOptimizePass::RenameVarInGraphDesc(const std::string& var,
-                                           const std::string& cache_var,
+                                              const std::string& cache_var,
-                                           size_t idx) const {
+                                              size_t idx) const {
  for (size_t i = idx; i < cfg_->Ops().size(); ++i) {
    auto* op = cfg_->Ops()[i];
    PADDLE_ENFORCE(op->IsOp() && op->Op());
@@ -277,7 +260,7 @@ void AnalysisVarPass::RenameVarInGraphDesc(const std::string& var,
  }
 }
-void AnalysisVarPass::InitSSAGraphNodes() const {
+void MemoryOptimizePass::InitSSAGraphNodes() const {
  std::unordered_map<std::string, std::unordered_set<ir::Node*>> all_vars;
  if (var_nodes_.empty()) {
    for (auto* op : cfg_->Ops()) {
@@ -297,9 +280,10 @@ void AnalysisVarPass::InitSSAGraphNodes() const {
  }
 }
-void AnalysisVarPass::RenameVarInGraphNode(const std::string& var,
+void MemoryOptimizePass::RenameVarInGraphNode(const std::string& var,
-                                           const std::string& cache_var,
+                                              const std::string& cache_var,
-                                           size_t idx, ir::Graph* graph) const {
+                                              size_t idx,
+                                              ir::Graph* graph) const {
  // if replace happens, we need to create a newer version cache_var
  // but use the same dims/data_type with var.
  PADDLE_ENFORCE(var_nodes_[var].size() >= 1 &&
@@ -358,39 +342,6 @@ void AnalysisVarPass::RenameVarInGraphNode(const std::string& var,
  var_nodes_.at(var).clear();
 }
-bool AnalysisVarPass::NodeCanReused(ir::Node* node) const {
-  if (!node->IsVar() || node->IsCtrlVar()) return false;
-  auto* desc = node->Var();
-  auto type = desc->GetType();
-  if (desc->Persistable() || type != proto::VarType::LOD_TENSOR ||
-      desc->GetShape().empty()) {
-    return false;
-  }
-  // vars can be @EMPTY@, @LR_DECAY_REUSE_ID@. For example, while_grad
-  std::string name = node->Name();
-  if (!name.empty() && name[0] == '@' && name[name.size() - 1] == '@')
-    return false;
-  if (skip_set_.count(name)) return false;
-  for (auto* op : node->inputs) {
-    if (op->Op()->HasAttr("force_cpu")) {
-      // op output force generated in cpu, can not be reused.
-      return framework::AttrReader(op->Op()->GetAttrMap())
-                 .Get<bool>("force_cpu") == 0;
-    }
-  }
-  return true;
-}
-bool AnalysisVarPass::OpHasSubBlock(OpDesc* desc) const {
-  const AttributeMap& attrs = desc->GetAttrMap();
-  for (auto& attr : attrs) {
-    if (attr.second.type() == typeid(BlockDesc*) ||             // NOLINT
-        attr.second.type() == typeid(std::vector<BlockDesc*>))  // NOLINT
-      return true;
-  }
-  return false;
-}
 std::vector<ir::Node*> SortOpLikeDescOrder(const ir::Graph& graph) {
  PADDLE_ENFORCE(graph.Has(kAllOpDescs),
                 "Graph has no attribute of kAllOpDescs.");
@@ -651,6 +602,7 @@ ir::Node* ControlFlowGraph::GetNodeFromVarName(const std::string& name,
 }  // namespace framework
 }  // namespace paddle
-REGISTER_PASS(analysis_var_pass, paddle::framework::details::AnalysisVarPass)
+REGISTER_PASS(memory_optimize_pass,
+              paddle::framework::details::MemoryOptimizePass)
    .RequireGraphAttr(paddle::framework::details::kGraphNodePool)
    .RequireGraphAttr(paddle::framework::details::kAllOpDescs);
--- a/paddle/fluid/framework/details/analysis_var_pass.h
+++ b/paddle/fluid/framework/details/analysis_var_pass.h
@@ -25,7 +25,7 @@
 #include <vector>
 #include "paddle/fluid/framework/data_type.h"
-#include "paddle/fluid/framework/details/memory_reuse_types.h"
+#include "paddle/fluid/framework/details/memory_optimize_helper.h"
 #include "paddle/fluid/framework/ir/graph.h"
 #include "paddle/fluid/framework/ir/pass.h"
@@ -35,12 +35,10 @@ namespace details {
 constexpr char kAllOpDescs[] = "all_op_descs";
 std::vector<ir::Node*> SortOpLikeDescOrder(const ir::Graph& graph);
-// sort op in bfs order
-std::vector<ir::Node*> BFSSortGraphOps(const ir::Graph& graph);
 class ControlFlowGraph;
-class AnalysisVarPass : public ir::Pass {
+class MemoryOptimizePass : public ir::Pass {
 protected:
  std::unique_ptr<ir::Graph> ApplyImpl(
      std::unique_ptr<ir::Graph> graph) const override;
@@ -57,17 +55,14 @@ class AnalysisVarPass : public ir::Pass {
                            ir::Graph* graph) const;
  void SubGraphOptimize(OpDesc* op_desc) const;
-  // valid a tensor can be reuse or not
+  // 1. scan op with subblock and collect the output/input vars.
-  bool NodeCanReused(ir::Node* node) const;
+  // while, while_grad, conditional_block
-  // scan subblock and collect the output/input variables.
+  // 2. scan distributed ops and collect the output/input vars
-  std::unordered_set<std::string> GetSubBlockVars(
+  void CollectSkipVarsSet(const std::unordered_set<ir::Node*>&) const;
-      const std::unordered_set<ir::Node*>&) const;
-  // check op has subblock or not
-  bool OpHasSubBlock(OpDesc* desc) const;
 private:
  // Reuse Node Pool, Owned.
-  mutable OrderedNodePairPool pool_;
+  mutable OrderedNodeList pool_;
  // controlflow Graph
  mutable std::unique_ptr<ControlFlowGraph> cfg_;
  // skip set

--- a/paddle/fluid/framework/details/analysis_var_pass_test.cc
+++ b/paddle/fluid/framework/details/analysis_var_pass_test.cc
@@ -12,63 +12,19 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
-#include "paddle/fluid/framework/details/analysis_var_pass.h"
+#include "paddle/fluid/framework/details/memory_optimize_pass.h"
 #include <algorithm>
 #include <iostream>
 #include <iterator>
 #include "glog/logging.h"
 #include "gtest/gtest.h"
+#include "paddle/fluid/framework/details/graph_test_base.h"
 #include "paddle/fluid/framework/ir/graph.h"
 #include "paddle/fluid/framework/ir/graph_helper.h"
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/framework/operator.h"
 #include "paddle/fluid/framework/program_desc.h"
-namespace paddle {
-namespace framework {
-class DummyOp : public OperatorBase {
- public:
-  DummyOp(const std::string& type, const VariableNameMap& inputs,
-          const VariableNameMap& outputs, const AttributeMap& attrs)
-      : OperatorBase(type, inputs, outputs, attrs) {}
- private:
-  void RunImpl(const Scope& scope,
-               const platform::Place& place) const override {}
-};
-class SumOpMaker : public OpProtoAndCheckerMaker {
- public:
-  void Make() {
-    AddInput("X", "").AsDuplicable();
-    AddOutput("Out", "");
-    AddComment("");
-  }
-};
-class AssignOpMaker : public OpProtoAndCheckerMaker {
- public:
-  void Make() {
-    AddInput("X", "").AsDuplicable();
-    AddOutput("Out", "");
-    AddComment("");
-  }
-};
-class DummyVarTypeInference : public VarTypeInference {
- public:
-  void operator()(const OpDesc& op_desc, BlockDesc* block) const override {
-    auto& inputs = op_desc.Input("X");
-    auto type = block->Var(inputs.front())->GetType();
-    auto out_var_name = op_desc.Output("Out").front();
-    block->Var(out_var_name)->SetType(type);
-  }
-};
-}  // namespace framework
-}  // namespace paddle
 REGISTER_OPERATOR(sum, paddle::framework::DummyOp,
                  paddle::framework::SumOpMaker,
                  paddle::framework::DummyVarTypeInference);
@@ -141,15 +97,6 @@ inline static ProgramDesc FillProgramDesc() {
  return prog;
 }
-template <typename Container>
-inline static std::string DebugString(const Container& c) {
-  std::stringstream ss;
-  for (auto& item : c) {
-    ss << item << " ";
-  }
-  return ss.str();
-}
 TEST(CFGGraph, IRGraph) {
  // prepare ir graph
  auto prog = FillProgramDesc();

--- a/paddle/fluid/framework/details/op_registry.h
+++ b/paddle/fluid/framework/details/op_registry.h
@@ -18,6 +18,7 @@ limitations under the License. */
 #include <tuple>
 #include <vector>
 #include "paddle/fluid/framework/grad_op_desc_maker.h"
+#include "paddle/fluid/framework/inplace_op_inference.h"
 #include "paddle/fluid/framework/op_info.h"
 #include "paddle/fluid/framework/op_proto_maker.h"
 #include "paddle/fluid/framework/operator.h"
@@ -32,7 +33,8 @@ enum OpInfoFillType {
  kOpProtoAndCheckerMaker = 1,
  kGradOpDescMaker = 2,
  kVarTypeInference = 3,
-  kShapeInference = 4
+  kShapeInference = 4,
+  kInplaceOpInference = 5
 };
 template <typename T>
@@ -48,8 +50,11 @@ struct OpInfoFillTypeID {
                                    ? kVarTypeInference
                                    : (std::is_base_of<InferShapeBase, T>::value
                                           ? kShapeInference
-                                           : static_cast<OpInfoFillType>(
+                                           : (std::is_base_of<
-                                                 -1)))));
+                                                  InplaceOpInference, T>::value
+                                                  ? kInplaceOpInference
+                                                  : static_cast<OpInfoFillType>(
+                                                        -1))))));
  }
 };
@@ -139,6 +144,16 @@ struct OpInfoFiller<T, kShapeInference> {
  }
 };
+template <typename T>
+struct OpInfoFiller<T, kInplaceOpInference> {
+  void operator()(const char* op_type, OpInfo* info) const {
+    info->infer_inplace_ = [](const OpDesc& op_desc, BlockDesc* block) {
+      T infer;
+      return infer(op_desc, block);
+    };
+  }
+};
 }  // namespace details
 }  // namespace framework

--- a/paddle/fluid/framework/inplace_op_inference.h
+++ b/paddle/fluid/framework/inplace_op_inference.h
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+#pragma once
+#include <functional>
+#include <numeric>
+#include <string>
+#include <unordered_map>
+#include "glog/logging.h"
+#include "paddle/fluid/framework/block_desc.h"
+#include "paddle/fluid/framework/details/memory_optimize_helper.h"
+#include "paddle/fluid/framework/op_desc.h"
+#include "paddle/fluid/framework/type_defs.h"
+namespace paddle {
+namespace framework {
+/*
+  Inplace Inference for create In->Out pairs for inplaced operator.
+  If we specify a pair of corresponding names. For example, X->Out.
+  then Out will inplaced use X's memory. The base class will do
+  legality validation for both variables.
+*/
+class InplaceOpInference {
+ public:
+  virtual ~InplaceOpInference() {}
+  virtual std::unordered_map<std::string, std::string> operator()(
+      const OpDesc& op_desc, BlockDesc* block) const = 0;
+};
+class InplaceInToOut : public InplaceOpInference {
+ public:
+  std::unordered_map<std::string, std::string> operator()(
+      const OpDesc& op_desc, BlockDesc* block) const {
+    std::unordered_map<std::string, std::string> ret;
+    auto in_out_var_names_pair = this->Apply(op_desc, block);
+    for (auto& pair : in_out_var_names_pair) {
+      PADDLE_ENFORCE(!op_desc.Input(pair.first).empty(),
+                     string::Sprintf("op %s do not have input of %s!",
+                                     op_desc.Type(), pair.first));
+      PADDLE_ENFORCE(!op_desc.Output(pair.second).empty(),
+                     string::Sprintf("op %s do not have output of %s!",
+                                     op_desc.Type(), pair.second));
+      auto& in_name = op_desc.Input(pair.first).at(0);
+      auto& out_name = op_desc.Output(pair.second).at(0);
+      auto in = block->FindRecursiveOrCreateVar(in_name);
+      auto out = block->FindRecursiveOrCreateVar(out_name);
+      if (TryInplaceInputOutput(in, out)) ret.insert({in_name, out_name});
+    }
+    return ret;
+  }
+ protected:
+  virtual std::unordered_map<std::string, std::string> Apply(
+      const OpDesc& op_desc, BlockDesc* block) const = 0;
+  bool TryInplaceInputOutput(const VarDesc& in, const VarDesc& out) const {
+    return in.Name() != out.Name() && details::NodeCanReused(in) &&
+           details::NodeCanReused(out) &&
+           details::NodeSizeInBytes(out) <= details::NodeSizeInBytes(in);
+  }
+};
+/*
+  Inplace In and Out for operator only have an Input and an Output.
+  For example, activation op.
+ */
+class SingleOpInplaceInToOut : public InplaceInToOut {
+ protected:
+  std::unordered_map<std::string, std::string> Apply(
+      const OpDesc& op_desc, BlockDesc* block) const override {
+    PADDLE_ENFORCE(!op_desc.InputNames().empty(),
+                   "Op inputs must not be empty");
+    PADDLE_ENFORCE(!op_desc.OutputNames().empty(),
+                   "Op outputs must not be empty");
+    auto x_name = op_desc.InputNames().at(0);
+    auto out_name = op_desc.OutputNames().at(0);
+    return std::unordered_map<std::string, std::string>{{x_name, out_name}};
+  }
+};
+/*
+  Gradient op. Inplace output use it's Input.
+  For example, Input@Grad->Input reuse strategy.
+ */
+class GradOpInplaceInToOut : public InplaceInToOut {
+ protected:
+  std::unordered_map<std::string, std::string> Apply(
+      const OpDesc& op_desc, BlockDesc* block) const override {
+    std::unordered_map<std::string, std::string> ret;
+    std::unordered_set<std::string> output_names(op_desc.OutputNames().begin(),
+                                                 op_desc.OutputNames().end());
+    for (auto& input_name : op_desc.InputNames()) {
+      if (output_names.count(GradVarName(input_name))) {
+        ret.insert({input_name, GradVarName(input_name)});
+      }
+    }
+    return ret;
+  }
+};
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/fluid/framework/inplace_op_inference_test.cc
+++ b/paddle/fluid/framework/inplace_op_inference_test.cc
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+   http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+#include <iterator>
+#include <string>
+#include "gtest/gtest.h"
+#include "paddle/fluid/framework/op_info.h"
+#include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/framework/operator.h"
+#include "paddle/fluid/framework/program_desc.h"
+#include "paddle/fluid/framework/var_type_inference.h"
+namespace paddle {
+namespace framework {
+class NOP : public OperatorBase {
+ public:
+  NOP(const std::string& type, const VariableNameMap& inputs,
+      const VariableNameMap& outputs, const AttributeMap& attrs)
+      : OperatorBase(type, inputs, outputs, attrs) {}
+ private:
+  void RunImpl(const Scope& scope,
+               const platform::Place& place) const override {}
+};
+class SingleOpMaker : public OpProtoAndCheckerMaker {
+ public:
+  void Make() {
+    AddInput("X", "").AsDuplicable();
+    AddOutput("Out", "");
+    AddComment("");
+  }
+};
+class SingleGradOpMaker : public framework::SingleGradOpDescMaker {
+ public:
+  using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
+ protected:
+  std::unique_ptr<framework::OpDesc> Apply() const override {
+    auto* op = new framework::OpDesc();
+    op->SetType("single_op_grad");
+    op->SetInput("Out", OutputGrad("Out"));
+    op->SetOutput(framework::GradVarName("X"), InputGrad("X"));
+    return std::unique_ptr<OpDesc>(op);
+  }
+};
+class SingleOpShapeInference : public framework::InferShapeBase {
+ public:
+  void operator()(framework::InferShapeContext* ctx) const override {
+    ctx->HasInput("X");
+    ctx->HasOutput("Out");
+    ctx->SetOutputDim("Out", ctx->GetInputDim("X"));
+  }
+};
+class SingleGradOpShapeInference : public framework::InferShapeBase {
+ public:
+  void operator()(framework::InferShapeContext* ctx) const override {
+    ctx->HasInput(framework::GradVarName("Out"));
+    ctx->HasOutput(framework::GradVarName("X"));
+    ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("Out"));
+  }
+};
+class MultiOutOpMaker : public OpProtoAndCheckerMaker {
+ public:
+  void Make() {
+    AddInput("X", "").AsDuplicable();
+    AddInput("Y", "").AsDuplicable();
+    AddInput("Z", "").AsDuplicable();
+    AddOutput("Out", "");
+    AddOutput("YOut", "");
+    AddOutput("ZOut", "");
+    AddOutput("NotReuseOut", "");
+    AddComment("");
+  }
+};
+class MultiOutShapeInference : public framework::InferShapeBase {
+ public:
+  void operator()(framework::InferShapeContext* ctx) const override {
+    ctx->ShareDim("X", "Out");
+    ctx->ShareDim("Y", "YOut");
+    ctx->ShareDim("Z", "ZOut");
+  }
+};
+class MultiGradOpMaker : public framework::SingleGradOpDescMaker {
+ public:
+  using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
+ protected:
+  std::unique_ptr<framework::OpDesc> Apply() const override {
+    auto* op = new framework::OpDesc();
+    op->SetType("multi_out_grad");
+    op->SetInput("X", Input("X"));
+    op->SetOutput(framework::GradVarName("Y"), OutputGrad("YOut"));
+    op->SetOutput(framework::GradVarName("X"), OutputGrad("Out"));
+    op->SetOutput(framework::GradVarName("Z"), OutputGrad("ZOut"));
+    return std::unique_ptr<framework::OpDesc>(op);
+  }
+};
+class MultiOutGradShapeInference : public framework::InferShapeBase {
+ public:
+  void operator()(framework::InferShapeContext* ctx) const override {
+    ctx->SetOutputDim(framework::GradVarName("Y"),
+                      ctx->GetInputDim(framework::GradVarName("YOut")));
+    ctx->SetOutputDim(framework::GradVarName("X"),
+                      ctx->GetInputDim(framework::GradVarName("Out")));
+    ctx->SetOutputDim(framework::GradVarName("Z"),
+                      ctx->GetInputDim(framework::GradVarName("ZOut")));
+  }
+};
+class MultiOutInplaceInToOut : public framework::InplaceInToOut {
+ public:
+  using framework::InplaceInToOut::InplaceInToOut;
+ protected:
+  std::unordered_map<std::string, std::string> Apply(
+      const OpDesc& op_desc, BlockDesc* block) const override {
+    return std::unordered_map<std::string, std::string>{
+        {"X", "Out"}, {"Y", "YOut"}, {"Z", "ZOut"},
+    };
+  }
+};
+class MultiOutGradInplaceInToOut : public framework::InplaceInToOut {
+ public:
+  using framework::InplaceInToOut::InplaceInToOut;
+ protected:
+  std::unordered_map<std::string, std::string> Apply(
+      const OpDesc& op_desc, BlockDesc* block) const override {
+    return std::unordered_map<std::string, std::string>{
+        {framework::GradVarName("YOut"), framework::GradVarName("Y")},
+        {framework::GradVarName("Out"), framework::GradVarName("X")},
+        {framework::GradVarName("ZOut"), framework::GradVarName("Z")},
+    };
+  }
+};
+}  // namespace framework
+}  // namespace paddle
+namespace f = paddle::framework;
+REGISTER_OPERATOR(single_op, f::NOP, f::SingleOpMaker, f::SingleGradOpMaker,
+                  f::SingleOpInplaceInToOut, f::SingleOpShapeInference);
+REGISTER_OPERATOR(single_op_grad, f::NOP, f::SingleOpInplaceInToOut,
+                  f::SingleGradOpShapeInference);
+REGISTER_OPERATOR(multi_out_op, f::NOP, f::MultiOutOpMaker, f::MultiGradOpMaker,
+                  f::MultiOutInplaceInToOut, f::MultiOutShapeInference);
+REGISTER_OPERATOR(multi_out_grad, f::NOP, f::MultiOutGradInplaceInToOut,
+                  f::MultiOutGradShapeInference);
+namespace paddle {
+namespace framework {
+TEST(InferInplace, SingleOpInplaceInToOut) {
+  ProgramDesc prog;
+  auto* op = prog.MutableBlock(0)->AppendOp();
+  op->SetType("single_op");
+  op->SetInput("X", {"test2_a", "test2_b", "test2_c"});
+  op->SetOutput("Out", {"test2_out"});
+  prog.MutableBlock(0)->Var("test2_a")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("test2_a")->SetShape({32, 64});
+  prog.MutableBlock(0)->Var("test2_b")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("test2_c")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("test2_out");
+  prog.MutableBlock(0)->Var("test2_out")->SetShape({32, 16});
+  auto& infer_inplace = OpInfoMap::Instance().Get(op->Type()).infer_inplace_;
+  auto in_to_outs = infer_inplace(*op, op->Block());
+  EXPECT_EQ(in_to_outs.size(), 1ul);
+  auto it = in_to_outs.begin();
+  EXPECT_EQ(it->first, "test2_a");
+  EXPECT_EQ(it->second, "test2_out");
+}
+TEST(InferInplace, SingleGradOpInplaceInToOut) {
+  ProgramDesc prog;
+  auto* op = prog.MutableBlock(0)->AppendOp();
+  op->SetType("single_op_grad");
+  op->SetInput(GradVarName("Out"), {"test2_out"});
+  op->SetOutput(GradVarName("X"), {"test2_a", "test2_b", "test2_c"});
+  prog.MutableBlock(0)->Var("test2_a")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("test2_a")->SetShape({32, 16});
+  prog.MutableBlock(0)->Var("test2_b")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("test2_c")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("test2_out");
+  prog.MutableBlock(0)->Var("test2_out")->SetShape({32, 16});
+  auto& infer_inplace = OpInfoMap::Instance().Get(op->Type()).infer_inplace_;
+  auto in_to_outs = infer_inplace(*op, op->Block());
+  EXPECT_EQ(in_to_outs.size(), 1ul);
+  auto it = in_to_outs.begin();
+  EXPECT_EQ(it->first, "test2_out");
+  EXPECT_EQ(it->second, "test2_a");
+}
+TEST(InferInplace, MultiOutInplaceInToOut) {
+  ProgramDesc prog;
+  auto* op = prog.MutableBlock(0)->AppendOp();
+  op->SetType("multi_out_op");
+  op->SetInput("X", {"a0", "a1"});
+  op->SetInput("Y", {"b0"});
+  op->SetInput("Z", {"c0", "c1"});
+  op->SetOutput("Out", {"o0"});
+  op->SetOutput("YOut", {"y0"});
+  op->SetOutput("ZOut", {"z0"});
+  prog.MutableBlock(0)->Var("a0")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("b0")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("c0")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("c1")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("o0");
+  prog.MutableBlock(0)->Var("y0");
+  prog.MutableBlock(0)->Var("z0");
+  prog.MutableBlock(0)->Var("a0")->SetShape({32, 16});
+  prog.MutableBlock(0)->Var("b0")->SetShape({32, 16});
+  prog.MutableBlock(0)->Var("c0")->SetShape({32, 16});
+  prog.MutableBlock(0)->Var("o0")->SetShape({32, 16});
+  prog.MutableBlock(0)->Var("y0")->SetShape({32, 16});
+  prog.MutableBlock(0)->Var("z0")->SetShape({32, 16});
+  auto& infer_inplace = OpInfoMap::Instance().Get(op->Type()).infer_inplace_;
+  auto in_to_outs = infer_inplace(*op, op->Block());
+  EXPECT_EQ(in_to_outs.size(), 3ul);
+  std::unordered_map<std::string, std::string> expects = {
+      {"a0", "o0"}, {"b0", "y0"}, {"c0", "z0"},
+  };
+  EXPECT_TRUE(expects == in_to_outs);
+}
+TEST(InferInplace, MultiGradInplaceInToOut) {
+  ProgramDesc prog;
+  auto* op = prog.MutableBlock(0)->AppendOp();
+  op->SetType("multi_out_grad");
+  op->SetInput(GradVarName("Out"), {"o0"});
+  op->SetInput(GradVarName("YOut"), {"y0"});
+  op->SetInput(GradVarName("ZOut"), {"z0"});
+  op->SetOutput(GradVarName("X"), {"a0", "a1"});
+  op->SetOutput(GradVarName("Y"), {"b0"});
+  op->SetOutput(GradVarName("Z"), {"c0", "c1"});
+  prog.MutableBlock(0)->Var("a0")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("b0")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("c0")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("c1")->SetType(proto::VarType::LOD_TENSOR);
+  prog.MutableBlock(0)->Var("o0");
+  prog.MutableBlock(0)->Var("y0");
+  prog.MutableBlock(0)->Var("z0");
+  prog.MutableBlock(0)->Var("a0")->SetShape({32, 16});
+  prog.MutableBlock(0)->Var("b0")->SetShape({32, 16});
+  prog.MutableBlock(0)->Var("c0")->SetShape({32, 16});
+  prog.MutableBlock(0)->Var("o0")->SetShape({32, 16});
+  prog.MutableBlock(0)->Var("y0")->SetShape({32, 16});
+  prog.MutableBlock(0)->Var("z0")->SetShape({32, 16});
+  auto& infer_inplace = OpInfoMap::Instance().Get(op->Type()).infer_inplace_;
+  auto in_to_outs = infer_inplace(*op, op->Block());
+  EXPECT_EQ(in_to_outs.size(), 3ul);
+  std::unordered_map<std::string, std::string> expects = {
+      {"o0", "a0"}, {"y0", "b0"}, {"z0", "c0"},
+  };
+  EXPECT_TRUE(expects == in_to_outs);
+}
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/fluid/framework/ir/CMakeLists.txt
+++ b/paddle/fluid/framework/ir/CMakeLists.txt
@@ -10,8 +10,22 @@ function(pass_library TARGET DEST)
    set(options "")
    set(oneValueArgs "")
    set(multiValueArgs SRCS DEPS)
+    set(targetPrefix "")
+    # Get optional argument
+    set(extraMacroArgs ${ARGN})
+    list(LENGTH extraMacroArgs numExtraMacroArgs)
+    if(numExtraMacroArgs GREATER 0)
+        list(GET extraMacroArgs 0 targetPrefix)
+    endif()
    cmake_parse_arguments(op_library "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
-    cc_library(${TARGET} SRCS ${TARGET}.cc DEPS graph_pattern_detector pass fuse_pass_base ${op_library_DEPS})
+    if(targetPrefix)
+        cc_library(${TARGET} SRCS ${targetPrefix}/${TARGET}.cc DEPS graph_pattern_detector pass fuse_pass_base ${op_library_DEPS})
+    else()
+        cc_library(${TARGET} SRCS ${TARGET}.cc DEPS graph_pattern_detector pass fuse_pass_base ${op_library_DEPS})
+    endif()
    # add more DEST here, such as train, dist and collect USE_PASS into a file automatically.
    if (${DEST} STREQUAL "base" OR ${DEST} STREQUAL "inference")
        message(STATUS "add pass ${TARGET} ${DEST}")
@@ -51,6 +65,7 @@ pass_library(conv_elementwise_add2_act_fuse_pass inference)
 pass_library(conv_elementwise_add_fuse_pass inference)
 pass_library(conv_affine_channel_fuse_pass inference)
 pass_library(transpose_flatten_concat_fuse_pass inference)
+pass_library(identity_scale_op_clean_pass base)
 # There may be many transpose-flatten structures in a model, and the output of
 # these structures will be used as inputs to the concat Op. This pattern will
@@ -62,11 +77,11 @@ foreach (index RANGE 3 6)
 endforeach()
 if(WITH_MKLDNN)
-    pass_library(mkldnn_placement_pass base)
+    pass_library(mkldnn_placement_pass base mkldnn)
-    pass_library(depthwise_conv_mkldnn_pass base)
+    pass_library(depthwise_conv_mkldnn_pass base mkldnn)
-    pass_library(conv_bias_mkldnn_fuse_pass inference)
+    pass_library(conv_bias_mkldnn_fuse_pass inference mkldnn)
-    pass_library(conv_relu_mkldnn_fuse_pass inference)
+    pass_library(conv_relu_mkldnn_fuse_pass inference mkldnn)
-    pass_library(conv_elementwise_add_mkldnn_fuse_pass inference)
+    pass_library(conv_elementwise_add_mkldnn_fuse_pass inference mkldnn)
 endif()
 cc_library(fuse_elewise_add_act_pass SRCS fuse_elewise_add_act_pass.cc DEPS pass graph_pattern_detector )
@@ -86,7 +101,7 @@ cc_test(test_fc_fuse_pass SRCS fc_fuse_pass_tester.cc DEPS fc_fuse_pass framewor
 cc_test(test_seqpool_concat_fuse_pass SRCS seqpool_concat_fuse_pass_tester.cc DEPS seqpool_concat_fuse_pass framework_proto)
 cc_test(test_is_test_pass SRCS is_test_pass_tester.cc DEPS is_test_pass)
 if (WITH_MKLDNN)
-    cc_test(test_depthwise_conv_mkldnn_pass SRCS depthwise_conv_mkldnn_pass_tester.cc DEPS depthwise_conv_mkldnn_pass)
+    cc_test(test_depthwise_conv_mkldnn_pass SRCS mkldnn/depthwise_conv_mkldnn_pass_tester.cc DEPS depthwise_conv_mkldnn_pass)
-    cc_test(test_conv_relu_mkldnn_fuse_pass SRCS conv_relu_mkldnn_fuse_pass_tester.cc DEPS conv_relu_mkldnn_fuse_pass)
+    cc_test(test_conv_relu_mkldnn_fuse_pass SRCS mkldnn/conv_relu_mkldnn_fuse_pass_tester.cc DEPS conv_relu_mkldnn_fuse_pass)
-    cc_test(test_conv_elementwise_add_mkldnn_fuse_pass SRCS conv_elementwise_add_mkldnn_fuse_pass_tester.cc DEPS conv_elementwise_add_mkldnn_fuse_pass)
+    cc_test(test_conv_elementwise_add_mkldnn_fuse_pass SRCS mkldnn/conv_elementwise_add_mkldnn_fuse_pass_tester.cc DEPS conv_elementwise_add_mkldnn_fuse_pass)
 endif ()
--- a/paddle/fluid/framework/ir/graph.h
+++ b/paddle/fluid/framework/ir/graph.h
@@ -141,7 +141,8 @@ class Graph {
  ir::Node *CreateControlDepVar() {
    // TODO(panyx0718): control var name should be really unique.
    const std::string name = string::Sprintf(
-        "%s@%llu", ir::Node::kControlDepVarName, node_set_.size());
+        "%s@%llu", static_cast<const char *>(ir::Node::kControlDepVarName),
+        node_set_.size());
    auto *x = AddNode(new ir::Node(name, ir::Node::Type::kVariable));
    x->SetId(num_node_created_++);
    return x;

--- a/paddle/fluid/framework/ir/graph_helper.cc
+++ b/paddle/fluid/framework/ir/graph_helper.cc
@@ -52,16 +52,29 @@ bool HasCircleHelper(
    ir::Node *node,
    const std::map<ir::Node *, std::unordered_set<ir::Node *>> &adj_list,
    std::unordered_set<ir::Node *> *visited,
-    std::unordered_set<ir::Node *> *in_trace) {
+    std::unordered_set<ir::Node *> *in_trace,
+    std::vector<std::vector<ir::Node *>> *circles) {
  if (visited->find(node) == visited->end()) {
    visited->insert(node);
    in_trace->insert(node);
    for (ir::Node *in : adj_list.at(node)) {
      if (visited->find(in) == visited->end() &&
-          HasCircleHelper(in, adj_list, visited, in_trace)) {
+          HasCircleHelper(in, adj_list, visited, in_trace, circles)) {
        return true;
      } else if (in_trace->find(in) != in_trace->end()) {
+        if (circles != nullptr) {
+          std::vector<ir::Node *> circle;
+          circle.emplace_back(in);
+          ir::Node *p = in;
+          for (auto &adj : adj_list.at(p)) {
+            if (in_trace->count(adj)) {
+              circle.emplace_back(adj);
+              p = adj;
+            }
+          }
+          circles->emplace_back(circle);
+        }
        return true;
      }
    }
@@ -71,11 +84,12 @@ bool HasCircleHelper(
 }
 bool HasCircleInternal(
-    const std::map<ir::Node *, std::unordered_set<ir::Node *>> &adj_list) {
+    const std::map<ir::Node *, std::unordered_set<ir::Node *>> &adj_list,
+    std::vector<std::vector<ir::Node *>> *circles) {
  std::unordered_set<ir::Node *> visited;
  std::unordered_set<ir::Node *> in_trace;
  for (auto &adj : adj_list) {
-    if (HasCircleHelper(adj.first, adj_list, &visited, &in_trace)) {
+    if (HasCircleHelper(adj.first, adj_list, &visited, &in_trace, circles)) {
      return true;
    }
  }
@@ -84,13 +98,18 @@ bool HasCircleInternal(
 }  // namespace
 bool HasCircle(const Graph &graph) {
-  return HasCircleInternal(BuildOperationAdjList(graph));
+  return HasCircleInternal(BuildOperationAdjList(graph), nullptr);
+}
+bool FindCircleSubGraph(const Graph &graph,
+                        std::vector<std::vector<ir::Node *>> *circles) {
+  return HasCircleInternal(BuildOperationAdjList(graph), circles);
 }
 std::vector<ir::Node *> TopologySortOperations(const Graph &graph) {
  std::map<ir::Node *, std::unordered_set<ir::Node *>> adj_list =
      BuildOperationAdjList(graph);
-  PADDLE_ENFORCE(!HasCircleInternal(adj_list));
+  PADDLE_ENFORCE(!HasCircleInternal(adj_list, nullptr));
  std::unordered_set<ir::Node *> visited;
  std::vector<ir::Node *> ret;
  for (auto adj : adj_list) {

--- a/paddle/fluid/framework/ir/graph_helper.h
+++ b/paddle/fluid/framework/ir/graph_helper.h
@@ -28,6 +28,11 @@ namespace ir {
 // Test if the graph contains circle.
 bool HasCircle(const Graph &graph);
+// Find All Circles for debugging,
+// store all subgraph in circles.
+bool FindCircleSubGraph(const Graph &graph,
+                        std::vector<std::vector<ir::Node *>> *circles);
 size_t GraphNum(const Graph &graph);
 // Topology Sort the operations in the graph from inputs to outputs.

--- a/paddle/fluid/framework/ir/graph_helper_test.cc
+++ b/paddle/fluid/framework/ir/graph_helper_test.cc
@@ -195,6 +195,17 @@ void BuildTwoGraphs(Graph* g) {
  //  v4->outputs.push_back(o5);
 }
+TEST(GraphHelperTest, Circles) {
+  ProgramDesc prog;
+  Graph g(prog);
+  BuildCircleGraph(&g);
+  std::vector<std::vector<ir::Node*>> circles;
+  ASSERT_TRUE(FindCircleSubGraph(g, &circles));
+  ASSERT_EQ(circles.size(), 1UL);
+}
 TEST(GraphHelperTest, GraphNum) {
  ProgramDesc prog;

--- a/paddle/fluid/framework/ir/graph_pattern_detector.cc
+++ b/paddle/fluid/framework/ir/graph_pattern_detector.cc
@@ -117,11 +117,6 @@ bool GraphPatternDetector::MarkPDNodesInGraph(const ir::Graph &graph) {
      // return false;
    }
  }
-  for (auto &item : pdnodes2nodes_) {
-    for (auto &n : item.second) {
-      GetMarkedNodes(const_cast<Graph *>(&graph)).insert(n);
-    }
-  }
  VLOG(3) << pdnodes2nodes_.size() << " nodes marked";
  return !pdnodes2nodes_.empty();

--- a/paddle/fluid/framework/ir/identity_scale_op_clean_pass.cc
+++ b/paddle/fluid/framework/ir/identity_scale_op_clean_pass.cc
+// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+#include "paddle/fluid/framework/ir/identity_scale_op_clean_pass.h"
+#include <string>
+#include "paddle/fluid/framework/ir/graph_pattern_detector.h"
+namespace paddle {
+namespace framework {
+namespace ir {
+std::unique_ptr<ir::Graph> IdentityScaleOpCleanPass::ApplyImpl(
+    std::unique_ptr<ir::Graph> graph) const {
+  FusePassBase::Init("identity_scale_op_clean", graph.get());
+  // pre_op -> scale_in -> scale_op -> scale_out
+  // ->
+  // pre_op -> scale_out
+  GraphPatternDetector detector;
+  auto pre_op = detector.mutable_pattern()->NewNode("pre_op")->assert_is_op();
+  auto scale_in = detector.mutable_pattern()
+                      ->NewNode("scale_in")
+                      ->assert_is_op_input("scale")
+                      ->AsIntermediate();
+  auto scale_op = detector.mutable_pattern()
+                      ->NewNode("scale_fuse")
+                      ->assert_is_op("scale")
+                      ->assert_op_attr<float>("scale", 1.)
+                      ->assert_op_attr<float>("bias", 0.);
+  auto scale_out = detector.mutable_pattern()
+                       ->NewNode("scale_out")
+                       ->assert_is_op_output("scale");
+  pre_op->LinksTo({scale_in});
+  scale_op->LinksFrom({scale_in}).LinksTo({scale_out});
+  GraphPatternDetector::handle_t handler = [&](
+      const GraphPatternDetector::subgraph_t& subgraph, Graph* graph) {
+    Node* scale_op_var = subgraph.at(scale_op);
+    Node* scale_in_var = subgraph.at(scale_in);
+    Node* scale_out_var = subgraph.at(scale_out);
+    Node* pre_op_var = subgraph.at(pre_op);
+    // Link pre_op directly to scale_out
+    const std::string scale_in_name = scale_in_var->Name();
+    const std::string scale_out_name = scale_out_var->Name();
+    // Remove links in graph
+    GraphSafeRemoveNodes(graph, {scale_in_var, scale_op_var});
+    // Modify proto message
+    auto* pre_op_desc = pre_op_var->Op();
+    for (auto& parameter : *pre_op_desc->Proto()->mutable_outputs()) {
+      auto* arguments = parameter.mutable_arguments();
+      auto it = std::find(arguments->begin(), arguments->end(), scale_in_name);
+      PADDLE_ENFORCE(it != arguments->end());
+      *it = scale_out_name;
+    }
+    IR_NODE_LINK_TO(pre_op_var, scale_out_var);
+  };
+  detector(graph.get(), handler);
+  return graph;
+}
+}  // namespace ir
+}  // namespace framework
+}  // namespace paddle
+REGISTER_PASS(identity_scale_op_clean_pass,
+              paddle::framework::ir::IdentityScaleOpCleanPass);
--- a/paddle/fluid/framework/ir/identity_scale_op_clean_pass.h
+++ b/paddle/fluid/framework/ir/identity_scale_op_clean_pass.h
+// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+#pragma once
+#include "paddle/fluid/framework/ir/fuse_pass_base.h"
+namespace paddle {
+namespace framework {
+namespace ir {
+class IdentityScaleOpCleanPass : public FusePassBase {
+ protected:
+  std::unique_ptr<ir::Graph> ApplyImpl(std::unique_ptr<ir::Graph> graph) const;
+ private:
+  virtual ~IdentityScaleOpCleanPass() = default;
+};
+}  // namespace ir
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/fluid/framework/ir/conv_bias_mkldnn_fuse_pass.cc
+++ b/paddle/fluid/framework/ir/conv_bias_mkldnn_fuse_pass.cc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
-#include "paddle/fluid/framework/ir/conv_bias_mkldnn_fuse_pass.h"
+#include "paddle/fluid/framework/ir/mkldnn/conv_bias_mkldnn_fuse_pass.h"
 #include <functional>
 #include <string>
 #include <vector>

--- a/paddle/fluid/framework/ir/conv_bias_mkldnn_fuse_pass.h
+++ b/paddle/fluid/framework/ir/conv_bias_mkldnn_fuse_pass.h
--- a/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.cc
+++ b/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.cc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
-#include "paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.h"
+#include "paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.h"
 #include <functional>
 #include <list>
 #include <map>

--- a/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.h
+++ b/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.h
--- a/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass_tester.cc
+++ b/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass_tester.cc
@@ -15,8 +15,8 @@
 #include <gtest/gtest.h>
 #include <string>
-#include "paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.h"
 #include "paddle/fluid/framework/ir/graph_traits.h"
+#include "paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.h"
 namespace paddle {
 namespace framework {

--- a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.cc
+++ b/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.cc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
-#include "paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h"
+#include "paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass.h"
 #include <string>
 #include <vector>
 #include "paddle/fluid/platform/enforce.h"

--- a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h
+++ b/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h
--- a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass_tester.cc
+++ b/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass_tester.cc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
-#include "paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h"
+#include "paddle/fluid/framework/ir/mkldnn/conv_relu_mkldnn_fuse_pass.h"
 #include <gtest/gtest.h>
 #include "paddle/fluid/framework/op_proto_maker.h"

--- a/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.cc
+++ b/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.cc
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
-#include "paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.h"
+#include "paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass.h"
 #include "paddle/fluid/framework/ir/graph_pattern_detector.h"
 namespace paddle {

--- a/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.h
+++ b/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.h
--- a/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass_tester.cc
+++ b/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass_tester.cc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
-#include "paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.h"
+#include "paddle/fluid/framework/ir/mkldnn/depthwise_conv_mkldnn_pass.h"
 #include <gtest/gtest.h>

--- a/paddle/fluid/framework/ir/mkldnn_placement_pass.cc
+++ b/paddle/fluid/framework/ir/mkldnn_placement_pass.cc
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
-#include "paddle/fluid/framework/ir/mkldnn_placement_pass.h"
+#include "paddle/fluid/framework/ir/mkldnn/mkldnn_placement_pass.h"
 #include <string>
 namespace paddle {

--- a/paddle/fluid/framework/ir/mkldnn_placement_pass.h
+++ b/paddle/fluid/framework/ir/mkldnn_placement_pass.h
--- a/paddle/fluid/framework/op_info.h
+++ b/paddle/fluid/framework/op_info.h
@@ -38,6 +38,7 @@ struct OpInfo {
  OpAttrChecker* checker_{nullptr};
  InferVarTypeFN infer_var_type_;
  InferShapeFN infer_shape_;
+  InferInplaceOpFN infer_inplace_;
  bool HasOpProtoAndChecker() const {
    return proto_ != nullptr && checker_ != nullptr;

--- a/paddle/fluid/framework/operator.cc
+++ b/paddle/fluid/framework/operator.cc
@@ -555,18 +555,17 @@ Tensor* ExecutionContext::LegacyOutput<Tensor>(const std::string& name) const {
 template <>
 std::vector<Tensor*> ExecutionContext::MultiOutput<Tensor>(
    const std::string& name) const {
-  auto names = op().Outputs(name);
+  auto it = ctx_.outputs.find(name);
+  if (it == ctx_.outputs.end()) {
+    return {};
+  }
+  const std::vector<Variable*>& vars = it->second;
  std::vector<Tensor*> res;
-  res.reserve(names.size());
+  res.reserve(vars.size());
-  std::transform(names.begin(), names.end(), std::back_inserter(res),
+  std::transform(vars.begin(), vars.end(), std::back_inserter(res),
-                 [&](const std::string& sub_name) -> Tensor* {
+                 [&](Variable* var) -> Tensor* {
-                   auto var = scope_.FindVar(sub_name);
+                   return var == nullptr ? nullptr
-                   if (var == nullptr) return nullptr;
+                                         : var->GetMutable<LoDTensor>();
-                   PADDLE_ENFORCE(
-                       var->IsType<LoDTensor>(),
-                       "%s should be LoDTensor, but the received type is %s",
-                       sub_name, ToTypeName(var->Type()));
-                   return var->GetMutable<LoDTensor>();
                 });
  return res;
 }

--- a/paddle/fluid/framework/type_defs.h
+++ b/paddle/fluid/framework/type_defs.h
@@ -57,5 +57,8 @@ using InferVarTypeFN =
 using InferShapeFN = std::function<void(InferShapeContext*)>;
+using InplacePair = std::unordered_map<std::string, std::string>;
+using InferInplaceOpFN = std::function<InplacePair(const OpDesc&, BlockDesc*)>;
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/fluid/imperative/CMakeLists.txt
+++ b/paddle/fluid/imperative/CMakeLists.txt
 if(WITH_PYTHON)
-cc_library(layer SRCS layer.cc DEPS proto_desc operator device_context blas)
+cc_library(layer SRCS layer.cc DEPS proto_desc operator device_context blas pybind)
-cc_library(tracer SRCS tracer.cc DEPS proto_desc device_context)
+cc_library(tracer SRCS tracer.cc DEPS proto_desc device_context pybind)
 cc_library(engine SRCS engine.cc)
 endif()
--- a/paddle/fluid/imperative/layer.cc
+++ b/paddle/fluid/imperative/layer.cc
@@ -156,6 +156,8 @@ class Autograd {
      for (auto it : candidate->pre_ops_) {
        for (OpBase* pre_op : it.second) {
          if (!pre_op) continue;
+          VLOG(5) << "op dep " << candidate->op_desc_->Type() << " <---- "
+                  << it.first << " <---- " << pre_op->op_desc_->Type();
          if (visited.find(pre_op) == visited.end()) {
            visited.insert(pre_op);
            queue.push_back(pre_op);

--- a/paddle/fluid/imperative/layer.h
+++ b/paddle/fluid/imperative/layer.h
@@ -28,6 +28,7 @@
 #include "paddle/fluid/framework/var_desc.h"
 #include "paddle/fluid/platform/enforce.h"
 #include "paddle/fluid/platform/device_context.h"
+#include "paddle/fluid/operators/math/math_function.h"
 #include "paddle/fluid/imperative/type_defs.h"
@@ -140,16 +141,24 @@ class VarBase {
  void RunBackward();
  void TrackPreOp(OpBase* pre_op, const std::string& pre_op_out_name,
-                  int pre_op_out_idx, bool stop_gradient) {
+                  int pre_op_out_idx, bool pre_op_stop_gradient) {
    pre_op_ = pre_op;
    pre_op_out_name_ = pre_op_out_name;
    pre_op_out_idx_ = pre_op_out_idx;
-    stop_gradient_ = stop_gradient;
+    if (pre_op_stop_gradient) {
+      stop_gradient_ = pre_op_stop_gradient;
+    }
  }
  void ClearGradient() {
-    delete grads_;
+    VLOG(1) << "clear gradient of " << var_desc_->Name();
-    grads_ = new VarBase(true);
+    if (grads_ && grads_->var_ && grads_->var_->IsInitialized()) {
+      auto grads_t = grads_->var_->GetMutable<framework::LoDTensor>();
+      operators::math::set_constant(
+          *(platform::DeviceContextPool::Instance().Get(
+              grads_->var_->Get<framework::LoDTensor>().place())),
+          grads_t, 0.0);
+    }
  }
  framework::LoDTensor& GradValue();

--- a/paddle/fluid/imperative/tracer.cc
+++ b/paddle/fluid/imperative/tracer.cc
@@ -31,6 +31,7 @@ void CreateGradOp(const framework::OpDesc& op_desc,
      framework::OpInfoMap::Instance()
          .Get(op_desc.Type())
          .GradOpMaker()(op_desc, no_grad_set, grad_to_var, grad_sub_block);
  for (auto& desc : descs) {
    grad_op_descs->emplace_back(desc.release());
  }
@@ -84,11 +85,12 @@ void Tracer::Trace(OpBase* op, const VarBasePtrMap& inputs,
  op->input_vars_ = inputs;
  for (auto it : op->input_vars_) {
    auto& invars = invars_map[it.first];
+    invars.reserve(it.second.size());
    for (VarBase* inp : it.second) {
      PADDLE_ENFORCE_NOT_NULL(inp->var_, "op %s input %s nullptr",
                              op->op_desc_->Type(), inp->var_desc_->Name());
-      invars.push_back(inp->var_);
+      invars.emplace_back(inp->var_);
      vars[inp->var_desc_->Name()] = inp;
      if (inp->PreOp()) {
        op->pre_ops_[it.first].push_back(inp->PreOp());
@@ -105,9 +107,10 @@ void Tracer::Trace(OpBase* op, const VarBasePtrMap& inputs,
  for (auto it : op->output_vars_) {
    auto& outvars = outvars_map[it.first];
    const std::vector<VarBase*>& outputs = it.second;
+    outvars.reserve(outputs.size());
    for (size_t i = 0; i < outputs.size(); ++i) {
      VarBase* out = outputs[i];
-      outvars.push_back(out->var_);
+      outvars.emplace_back(out->var_);
      vars[out->var_desc_->Name()] = out;
      framework::VarDesc* var_desc = block->FindVar(out->var_desc_->Name());

--- a/paddle/fluid/inference/CMakeLists.txt
+++ b/paddle/fluid/inference/CMakeLists.txt
@@ -58,12 +58,13 @@ if(WIN32)
  sep_library(paddle_fluid_shared SHARED SRCS ${SHARED_INFERENCE_SRCS}
              DEPS ${fluid_modules} paddle_fluid_api reset_tensor_array
                   analysis_config paddle_pass_builder)
-  target_link_libraries(paddle_fluid_shared shlwapi)
 else(WIN32)
  cc_library(paddle_fluid_shared SHARED SRCS ${SHARED_INFERENCE_SRCS}
             DEPS ${fluid_modules} paddle_fluid_api reset_tensor_array
                  analysis_config paddle_pass_builder)
 endif()
+get_property(os_dependency_modules GLOBAL PROPERTY OS_DEPENDENCY_MODULES)
+target_link_libraries(paddle_fluid_shared ${os_dependency_modules})
 set_target_properties(paddle_fluid_shared PROPERTIES OUTPUT_NAME paddle_fluid)
 if(NOT APPLE AND NOT WIN32)

--- a/paddle/fluid/inference/analysis/argument.h
+++ b/paddle/fluid/inference/analysis/argument.h
@@ -132,7 +132,7 @@ struct Argument {
  DECL_ARGUMENT_FIELD(tensorrt_workspace_size, TensorRtWorkspaceSize, int);
  DECL_ARGUMENT_FIELD(tensorrt_min_subgraph_size, TensorRtMinSubgraphSize, int);
  DECL_ARGUMENT_FIELD(tensorrt_precision_mode, TensorRtPrecisionMode,
-                      contrib::AnalysisConfig::Precision);
+                      AnalysisConfig::Precision);
  // Memory optimized related.
  DECL_ARGUMENT_FIELD(enable_memory_optim, EnableMemoryOptim, bool);

--- a/paddle/fluid/inference/analysis/helper.h
+++ b/paddle/fluid/inference/analysis/helper.h
@@ -32,7 +32,7 @@ limitations under the License. */
 #ifdef _WIN32
 #include <direct.h>
 #include <io.h>
-#define GCC_ATTRIBUTE(attr__) ;
+#define GCC_ATTRIBUTE(attr__)
 #define MKDIR(path) _mkdir(path)
 #else
 #include <unistd.h>

--- a/paddle/fluid/inference/analysis/ir_pass_manager.cc
+++ b/paddle/fluid/inference/analysis/ir_pass_manager.cc
@@ -71,7 +71,7 @@ void IRPassManager::CreatePasses(Argument *argument,
                new framework::ProgramDesc *(&argument->main_program()));
      bool enable_int8 = argument->tensorrt_precision_mode() ==
-                         contrib::AnalysisConfig::Precision::kInt8;
+                         AnalysisConfig::Precision::kInt8;
      pass->Set("enable_int8", new bool(enable_int8));
      std::string model_opt_cache_dir =
@@ -83,7 +83,6 @@ void IRPassManager::CreatePasses(Argument *argument,
          new std::string(GetOrCreateModelOptCacheDir(model_opt_cache_dir)));
    }
-    // graph_ = pass->Apply(std::move(graph_));
    pre_pass = pass_name;
    passes_.emplace_back(std::move(pass));
@@ -97,8 +96,9 @@ std::unique_ptr<Graph> IRPassManager::Apply(std::unique_ptr<Graph> graph) {
  PADDLE_ENFORCE(graph.get());
  // Apply all the passes
  for (const auto &pass : passes_) {
-    if (pass->Type() == "graph_viz_pass") continue;
+    if (pass->Type() != "graph_viz_pass") {
-    PrettyLogEndl(Style::H2(), "--- Running IR pass [%s]", pass->Type());
+      PrettyLogEndl(Style::H2(), "--- Running IR pass [%s]", pass->Type());
+    }
    graph = pass->Apply(std::move(graph));
  }
  return std::move(graph);

--- a/paddle/fluid/inference/analysis/ir_passes/CMakeLists.txt
+++ b/paddle/fluid/inference/analysis/ir_passes/CMakeLists.txt
 cc_library(subgraph_detector SRCS subgraph_detector.cc DEPS proto_desc)
+if(WITH_TESTING)
+  add_dependencies(subgraph_detector gtest)
+endif()
 if (WITH_GPU AND TENSORRT_FOUND)
  cc_library(tensorrt_subgraph_pass SRCS tensorrt_subgraph_pass.cc DEPS subgraph_detector tensorrt_op_teller)

--- a/paddle/fluid/inference/analysis/passes/memory_optimize_pass.cc
+++ b/paddle/fluid/inference/analysis/passes/memory_optimize_pass.cc
@@ -18,6 +18,7 @@
 #include <limits>
 #include <map>
 #include <string>
+#include <type_traits>
 #include <utility>
 #include <vector>
 #include "paddle/fluid/framework/ir/graph_helper.h"
@@ -168,7 +169,11 @@ bool FindSuitableTensorToReuse(
    if (!cluster->count(candidate)) continue;
    size_t space = space_table.at(candidate);
-    size_t space_diff = std::abs<size_t>(space - space_required);
+    PADDLE_ENFORCE(
+        space <= std::numeric_limits<std::make_signed<size_t>::type>::max(),
+        "space overload");
+    size_t space_diff =
+        std::abs((std::make_signed<size_t>::type)space - space_required);
    if (space_diff < best_fit.second) {
      best_fit.first = candidate;
      best_fit.second = space_diff;

--- a/paddle/fluid/inference/analysis/passes/memory_optimize_pass.h
+++ b/paddle/fluid/inference/analysis/passes/memory_optimize_pass.h
@@ -13,7 +13,9 @@
 // limitations under the License.
 #pragma once
+#include <string>
+#include <utility>
+#include <vector>
 #include "paddle/fluid/inference/analysis/analysis_pass.h"
 #include "paddle/fluid/platform/port.h"

--- a/paddle/fluid/inference/api/analysis_config.cc
+++ b/paddle/fluid/inference/api/analysis_config.cc
@@ -22,7 +22,7 @@
 namespace paddle {
-PassStrategy *contrib::AnalysisConfig::pass_builder() const {
+PassStrategy *AnalysisConfig::pass_builder() const {
  if (!pass_builder_.get()) {
    if (use_gpu_) {
      LOG(INFO) << "Create GPU IR passes";
@@ -42,27 +42,27 @@ PassStrategy *contrib::AnalysisConfig::pass_builder() const {
  return pass_builder_.get();
 }
-contrib::AnalysisConfig::AnalysisConfig(const std::string &model_dir) {
+AnalysisConfig::AnalysisConfig(const std::string &model_dir) {
  model_dir_ = model_dir;
  Update();
 }
-contrib::AnalysisConfig::AnalysisConfig(const std::string &prog_file,
+AnalysisConfig::AnalysisConfig(const std::string &prog_file,
-                                        const std::string &params_file) {
+                               const std::string &params_file) {
  prog_file_ = prog_file;
  params_file_ = params_file;
  Update();
 }
-void contrib::AnalysisConfig::SetModel(const std::string &prog_file_path,
+void AnalysisConfig::SetModel(const std::string &prog_file_path,
-                                       const std::string &params_file_path) {
+                              const std::string &params_file_path) {
  prog_file_ = prog_file_path;
  params_file_ = params_file_path;
  Update();
 }
-void contrib::AnalysisConfig::EnableUseGpu(uint64_t memory_pool_init_size_mb,
+void AnalysisConfig::EnableUseGpu(uint64_t memory_pool_init_size_mb,
-                                           int device_id) {
+                                  int device_id) {
 #ifdef PADDLE_WITH_CUDA
  use_gpu_ = true;
  memory_pool_init_size_mb_ = memory_pool_init_size_mb;
@@ -74,13 +74,13 @@ void contrib::AnalysisConfig::EnableUseGpu(uint64_t memory_pool_init_size_mb,
  Update();
 }
-void contrib::AnalysisConfig::DisableGpu() {
+void AnalysisConfig::DisableGpu() {
  use_gpu_ = false;
  Update();
 }
-contrib::AnalysisConfig::AnalysisConfig(const contrib::AnalysisConfig &other) {
+AnalysisConfig::AnalysisConfig(const AnalysisConfig &other) {
 #define CP_MEMBER(member__) member__ = other.member__;
  // Model related.
@@ -130,7 +130,7 @@ contrib::AnalysisConfig::AnalysisConfig(const contrib::AnalysisConfig &other) {
  Update();
 }
-void contrib::AnalysisConfig::EnableMKLDNN() {
+void AnalysisConfig::EnableMKLDNN() {
 #ifdef PADDLE_WITH_MKLDNN
  pass_builder()->EnableMKLDNN();
  use_mkldnn_ = true;
@@ -142,9 +142,9 @@ void contrib::AnalysisConfig::EnableMKLDNN() {
  Update();
 }
-void contrib::AnalysisConfig::EnableTensorRtEngine(
+void AnalysisConfig::EnableTensorRtEngine(
    int workspace_size, int max_batch_size, int min_subgraph_size,
-    contrib::AnalysisConfig::Precision precision_mode) {
+    AnalysisConfig::Precision precision_mode) {
 #ifdef PADDLE_WITH_CUDA
  if (!use_gpu()) {
    LOG(ERROR) << "To use TensorRT engine, please call EnableGpu() first";
@@ -165,7 +165,7 @@ void contrib::AnalysisConfig::EnableTensorRtEngine(
 }
 // TODO(Superjomn) refactor this, buggy.
-void contrib::AnalysisConfig::Update() {
+void AnalysisConfig::Update() {
  auto info = SerializeInfoCache();
  if (info == serialized_info_cache_) return;
@@ -225,7 +225,7 @@ void contrib::AnalysisConfig::Update() {
  }
 }
-std::string contrib::AnalysisConfig::SerializeInfoCache() {
+std::string AnalysisConfig::SerializeInfoCache() {
  std::stringstream ss;
  ss << model_dir_;
  ss << prog_file_;
@@ -260,14 +260,14 @@ std::string contrib::AnalysisConfig::SerializeInfoCache() {
  return ss.str();
 }
-void contrib::AnalysisConfig::SetCpuMathLibraryNumThreads(
+void AnalysisConfig::SetCpuMathLibraryNumThreads(
    int cpu_math_library_num_threads) {
  cpu_math_library_num_threads_ = cpu_math_library_num_threads;
  Update();
 }
-float contrib::AnalysisConfig::fraction_of_gpu_memory_for_pool() const {
+float AnalysisConfig::fraction_of_gpu_memory_for_pool() const {
 #ifdef PADDLE_WITH_CUDA
  // Get the GPU memory details and calculate the fraction of memory for the
  // GPU memory pool.
@@ -282,8 +282,8 @@ float contrib::AnalysisConfig::fraction_of_gpu_memory_for_pool() const {
 #endif
 }
-void contrib::AnalysisConfig::EnableMemoryOptim(
+void AnalysisConfig::EnableMemoryOptim(bool static_optim,
-    bool static_optim, bool force_update_static_cache) {
+                                       bool force_update_static_cache) {
  enable_memory_optim_ = true;
  static_memory_optim_ = static_optim;
  static_memory_optim_force_update_ = force_update_static_cache;
@@ -291,14 +291,14 @@ void contrib::AnalysisConfig::EnableMemoryOptim(
  Update();
 }
-bool contrib::AnalysisConfig::enable_memory_optim() const {
+bool AnalysisConfig::enable_memory_optim() const {
  return enable_memory_optim_;
 }
-void contrib::AnalysisConfig::SetModelBuffer(const char *prog_buffer,
+void AnalysisConfig::SetModelBuffer(const char *prog_buffer,
-                                             size_t prog_buffer_size,
+                                    size_t prog_buffer_size,
-                                             const char *param_buffer,
+                                    const char *param_buffer,
-                                             size_t param_buffer_size) {
+                                    size_t param_buffer_size) {
  prog_file_ = std::string(prog_buffer, prog_buffer + prog_buffer_size);
  params_file_ = std::string(param_buffer, param_buffer + param_buffer_size);
  model_from_memory_ = true;
@@ -306,7 +306,7 @@ void contrib::AnalysisConfig::SetModelBuffer(const char *prog_buffer,
  Update();
 }
-NativeConfig contrib::AnalysisConfig::ToNativeConfig() const {
+NativeConfig AnalysisConfig::ToNativeConfig() const {
  NativeConfig config;
  config.model_dir = model_dir_;
  config.prog_file = prog_file_;
@@ -318,4 +318,9 @@ NativeConfig contrib::AnalysisConfig::ToNativeConfig() const {
  return config;
 }
+void AnalysisConfig::SwitchIrDebug(int x) {
+  ir_debug_ = x;
+  Update();
+}
 }  // namespace paddle
--- a/paddle/fluid/inference/api/analysis_predictor.cc
+++ b/paddle/fluid/inference/api/analysis_predictor.cc
@@ -47,7 +47,6 @@ DECLARE_bool(profile);
 namespace paddle {
-using contrib::AnalysisConfig;
 using inference::Singleton;
 #if PADDLE_WITH_TENSORRT
 using inference::tensorrt::TRTInt8Calibrator;
@@ -59,7 +58,8 @@ namespace {
 bool IsPersistable(const framework::VarDesc *var) {
  if (var->Persistable() &&
      var->GetType() != framework::proto::VarType::FEED_MINIBATCH &&
-      var->GetType() != framework::proto::VarType::FETCH_LIST) {
+      var->GetType() != framework::proto::VarType::FETCH_LIST &&
+      var->GetType() != framework::proto::VarType::RAW) {
    return true;
  }
  return false;
@@ -731,10 +731,10 @@ std::string AnalysisPredictor::GetSeriazlizedProgram() const {
 }
 template <>
-std::unique_ptr<PaddlePredictor> CreatePaddlePredictor<contrib::AnalysisConfig>(
+std::unique_ptr<PaddlePredictor> CreatePaddlePredictor<AnalysisConfig>(
-    const contrib::AnalysisConfig &config) {
+    const AnalysisConfig &config) {
-  return CreatePaddlePredictor<contrib::AnalysisConfig,
+  return CreatePaddlePredictor<AnalysisConfig, PaddleEngineKind::kAnalysis>(
-                               PaddleEngineKind::kAnalysis>(config);
+      config);
 }
 }  // namespace paddle

--- a/paddle/fluid/inference/api/analysis_predictor.h
+++ b/paddle/fluid/inference/api/analysis_predictor.h
@@ -33,7 +33,6 @@ using inference::analysis::Argument;
 using inference::analysis::Analyzer;
 using framework::proto::ProgramDesc;
 using framework::NaiveExecutor;
-using contrib::AnalysisConfig;
 /** \brief This predictor is based on the original native predictor with IR and
 * Analysis support.
@@ -123,7 +122,7 @@ class AnalysisPredictor : public PaddlePredictor {
 #endif
 private:
-  contrib::AnalysisConfig config_;
+  AnalysisConfig config_;
  Argument argument_;
  std::unique_ptr<NaiveExecutor> executor_;
  platform::Place place_;

--- a/paddle/fluid/inference/api/analysis_predictor_tester.cc
+++ b/paddle/fluid/inference/api/analysis_predictor_tester.cc
@@ -24,7 +24,6 @@
 DEFINE_string(dirname, "", "dirname to tests.");
 namespace paddle {
-using contrib::AnalysisConfig;
 TEST(AnalysisPredictor, analysis_off) {
  AnalysisConfig config;
@@ -197,7 +196,7 @@ TEST(AnalysisPredictor, memory_optim) {
  AnalysisConfig config(FLAGS_dirname);
  config.DisableGpu();
  config.EnableMemoryOptim(true);
-  config.pass_builder()->TurnOnDebug();
+  config.SwitchIrDebug();
  auto native_predictor =
      CreatePaddlePredictor<NativeConfig>(config.ToNativeConfig());

--- a/paddle/fluid/inference/api/api_impl_tester.cc
+++ b/paddle/fluid/inference/api/api_impl_tester.cc
@@ -295,7 +295,7 @@ TEST(inference_api_native, image_classification_gpu) {
 #endif
 TEST(PassBuilder, Delete) {
-  contrib::AnalysisConfig config;
+  AnalysisConfig config;
  config.DisableGpu();
  config.pass_builder()->DeletePass("attention_lstm_fuse_pass");
  const auto& passes = config.pass_builder()->AllPasses();

--- a/paddle/fluid/inference/api/demo_ci/trt_mobilenet_demo.cc
+++ b/paddle/fluid/inference/api/demo_ci/trt_mobilenet_demo.cc
@@ -36,7 +36,7 @@ namespace demo {
 */
 void Main() {
  std::unique_ptr<PaddlePredictor> predictor;
-  paddle::contrib::AnalysisConfig config;
+  paddle::AnalysisConfig config;
  config.EnableUseGpu(100, 0);
  config.SetModel(FLAGS_modeldir + "/__model__",
                  FLAGS_modeldir + "/__params__");

--- a/paddle/fluid/inference/api/demo_ci/vis_demo.cc
+++ b/paddle/fluid/inference/api/demo_ci/vis_demo.cc
@@ -34,7 +34,6 @@ DEFINE_bool(use_gpu, false, "Whether use gpu.");
 namespace paddle {
 namespace demo {
-using contrib::AnalysisConfig;
 /*
 * Use the native and analysis fluid engine to inference the demo.
 */

--- a/paddle/fluid/inference/api/paddle_analysis_config.h
+++ b/paddle/fluid/inference/api/paddle_analysis_config.h
@@ -29,11 +29,6 @@
 namespace paddle {
 class AnalysisPredictor;
-// ==
-//
-// -----------------------------------------------------------------------------------
-// NOTE: The following APIs are not mature yet, we are still working on them.
-namespace contrib {
 // NOTE WIP, not stable yet.
 struct AnalysisConfig {
@@ -145,9 +140,12 @@ struct AnalysisConfig {
   */
  bool tensorrt_engine_enabled() const { return use_tensorrt_; }
-  /** Control whther to debug IR graph analysis phase.
+  /** \brief Control whether to debug IR graph analysis phase.
+   *
+   * This will generate DOT files for visualizing the computation graph after
+   * each analysis pass applied.
   */
-  void SwitchIrDebug(int x = true) { ir_debug_ = x; }
+  void SwitchIrDebug(int x = true);
  /** Turn on MKLDNN.
   */
@@ -260,5 +258,4 @@ struct AnalysisConfig {
  mutable std::unique_ptr<PassStrategy> pass_builder_;
 };
-}  // namespace contrib
 }  // namespace paddle
--- a/paddle/fluid/inference/api/paddle_api.h
+++ b/paddle/fluid/inference/api/paddle_api.h
@@ -221,7 +221,7 @@ class PaddlePredictor {
  virtual std::string GetSeriazlizedProgram() const {
    assert(false);  // Force raise error.
    return "NotImplemented";
-  };
+  }
  /** The common configs for all the predictors.
   */

--- a/paddle/fluid/inference/api/paddle_pass_builder.h
+++ b/paddle/fluid/inference/api/paddle_pass_builder.h
@@ -117,6 +117,7 @@ class CpuPassStrategy : public PassStrategy {
        "conv_bn_fuse_pass",             //
        "conv_eltwiseadd_bn_fuse_pass",  //
        "is_test_pass",                  //
+        "identity_scale_op_clean_pass",  //
    });
    use_gpu_ = false;
  }
@@ -155,6 +156,7 @@ class GpuPassStrategy : public PassStrategy {
  GpuPassStrategy() : PassStrategy({}) {
    passes_.assign({
      "infer_clean_graph_pass",                        //
+          "identity_scale_op_clean_pass",              //
          "conv_affine_channel_fuse_pass",             //
          "conv_eltwiseadd_affine_channel_fuse_pass",  //
          "conv_bn_fuse_pass",                         //

--- a/paddle/fluid/inference/tensorrt/trt_int8_calibrator.h
+++ b/paddle/fluid/inference/tensorrt/trt_int8_calibrator.h
@@ -13,16 +13,16 @@
 // limitations under the License.
 #pragma once
+#include <NvInfer.h>
+#include <cuda_runtime_api.h>
 #include <atomic>
 #include <memory>
-#include <mutex>
+#include <mutex>  // NOLINT
 #include <string>
 #include <unordered_map>
 #include <utility>
 #include <vector>
-#include <NvInfer.h>
-#include <cuda_runtime_api.h>
 #include "paddle/fluid/framework/tensor.h"
 #include "paddle/fluid/inference/tensorrt/engine.h"
 #include "paddle/fluid/platform/place.h"

--- a/paddle/fluid/inference/tests/api/CMakeLists.txt
+++ b/paddle/fluid/inference/tests/api/CMakeLists.txt
@@ -128,6 +128,11 @@ inference_analysis_api_test_with_fake_data(test_analyzer_resnet50
 inference_analysis_api_test_with_fake_data(test_analyzer_mobilenet_depthwise_conv
  "${INFERENCE_DEMO_INSTALL_DIR}/mobilenet_depthwise_conv" analyzer_resnet50_tester.cc "mobilenet_model.tar.gz" SERIAL)
+# bert, max_len=20, embedding_dim=128
+set(BERT_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/bert_emb128")
+download_model_and_data(${BERT_INSTALL_DIR} "bert_emb128_model.tar.gz" "bert_data_len20.txt.tar.gz")
+inference_analysis_api_test(test_analyzer_bert ${BERT_INSTALL_DIR} analyzer_bert_tester.cc SERIAL)
 # anakin
 if (WITH_ANAKIN AND WITH_MKL) # only needed in CI
    # anakin rnn1

--- a/paddle/fluid/inference/tests/api/analyzer_bert_tester.cc
+++ b/paddle/fluid/inference/tests/api/analyzer_bert_tester.cc
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+#include "paddle/fluid/inference/tests/api/tester_helper.h"
+namespace paddle {
+namespace inference {
+using paddle::PaddleTensor;
+template <typename T>
+void GetValueFromStream(std::stringstream *ss, T *t) {
+  (*ss) >> (*t);
+}
+template <>
+void GetValueFromStream<std::string>(std::stringstream *ss, std::string *t) {
+  *t = ss->str();
+}
+// Split string to vector
+template <typename T>
+void Split(const std::string &line, char sep, std::vector<T> *v) {
+  std::stringstream ss;
+  T t;
+  for (auto c : line) {
+    if (c != sep) {
+      ss << c;
+    } else {
+      GetValueFromStream<T>(&ss, &t);
+      v->push_back(std::move(t));
+      ss.str({});
+      ss.clear();
+    }
+  }
+  if (!ss.str().empty()) {
+    GetValueFromStream<T>(&ss, &t);
+    v->push_back(std::move(t));
+    ss.str({});
+    ss.clear();
+  }
+}
+template <typename T>
+constexpr paddle::PaddleDType GetPaddleDType();
+template <>
+constexpr paddle::PaddleDType GetPaddleDType<int64_t>() {
+  return paddle::PaddleDType::INT64;
+}
+template <>
+constexpr paddle::PaddleDType GetPaddleDType<float>() {
+  return paddle::PaddleDType::FLOAT32;
+}
+// Parse tensor from string
+template <typename T>
+bool ParseTensor(const std::string &field, paddle::PaddleTensor *tensor) {
+  std::vector<std::string> data;
+  Split(field, ':', &data);
+  if (data.size() < 2) return false;
+  std::string shape_str = data[0];
+  std::vector<int> shape;
+  Split(shape_str, ' ', &shape);
+  std::string mat_str = data[1];
+  std::vector<T> mat;
+  Split(mat_str, ' ', &mat);
+  tensor->shape = shape;
+  auto size =
+      std::accumulate(shape.begin(), shape.end(), 1, std::multiplies<int>()) *
+      sizeof(T);
+  tensor->data.Resize(size);
+  std::copy(mat.begin(), mat.end(), static_cast<T *>(tensor->data.data()));
+  tensor->dtype = GetPaddleDType<T>();
+  return true;
+}
+// Parse input tensors from string
+bool ParseLine(const std::string &line,
+               std::vector<paddle::PaddleTensor> *tensors) {
+  std::vector<std::string> fields;
+  Split(line, ';', &fields);
+  if (fields.size() < 5) return false;
+  tensors->clear();
+  tensors->reserve(5);
+  int i = 0;
+  // src_id
+  paddle::PaddleTensor src_id;
+  ParseTensor<int64_t>(fields[i++], &src_id);
+  tensors->push_back(src_id);
+  // pos_id
+  paddle::PaddleTensor pos_id;
+  ParseTensor<int64_t>(fields[i++], &pos_id);
+  tensors->push_back(pos_id);
+  // segment_id
+  paddle::PaddleTensor segment_id;
+  ParseTensor<int64_t>(fields[i++], &segment_id);
+  tensors->push_back(segment_id);
+  // self_attention_bias
+  paddle::PaddleTensor self_attention_bias;
+  ParseTensor<float>(fields[i++], &self_attention_bias);
+  tensors->push_back(self_attention_bias);
+  // next_segment_index
+  paddle::PaddleTensor next_segment_index;
+  ParseTensor<int64_t>(fields[i++], &next_segment_index);
+  tensors->push_back(next_segment_index);
+  return true;
+}
+bool LoadInputData(std::vector<std::vector<paddle::PaddleTensor>> *inputs) {
+  if (FLAGS_infer_data.empty()) {
+    LOG(ERROR) << "please set input data path";
+    return false;
+  }
+  std::ifstream fin(FLAGS_infer_data);
+  std::string line;
+  int sample = 0;
+  // The unit-test dataset only have 10 samples, each sample have 5 feeds.
+  while (std::getline(fin, line)) {
+    std::vector<paddle::PaddleTensor> feed_data;
+    ParseLine(line, &feed_data);
+    inputs->push_back(std::move(feed_data));
+    sample++;
+    if (!FLAGS_test_all_data && sample == FLAGS_batch_size) break;
+  }
+  LOG(INFO) << "number of samples: " << sample;
+  return true;
+}
+void SetConfig(AnalysisConfig *config) { config->SetModel(FLAGS_infer_model); }
+void profile(bool use_mkldnn = false) {
+  AnalysisConfig config;
+  SetConfig(&config);
+  if (use_mkldnn) {
+    config.EnableMKLDNN();
+  }
+  std::vector<PaddleTensor> outputs;
+  std::vector<std::vector<PaddleTensor>> inputs;
+  LoadInputData(&inputs);
+  TestPrediction(reinterpret_cast<const PaddlePredictor::Config *>(&config),
+                 inputs, &outputs, FLAGS_num_threads);
+}
+TEST(Analyzer_bert, profile) { profile(); }
+#ifdef PADDLE_WITH_MKLDNN
+TEST(Analyzer_bert, profile_mkldnn) { profile(true); }
+#endif
+// Check the fuse status
+TEST(Analyzer_bert, fuse_statis) {
+  AnalysisConfig cfg;
+  SetConfig(&cfg);
+  int num_ops;
+  auto predictor = CreatePaddlePredictor<AnalysisConfig>(cfg);
+  auto fuse_statis = GetFuseStatis(
+      static_cast<AnalysisPredictor *>(predictor.get()), &num_ops);
+  LOG(INFO) << "num_ops: " << num_ops;
+}
+// Compare result of NativeConfig and AnalysisConfig
+void compare(bool use_mkldnn = false) {
+  AnalysisConfig cfg;
+  SetConfig(&cfg);
+  if (use_mkldnn) {
+    cfg.EnableMKLDNN();
+  }
+  std::vector<std::vector<PaddleTensor>> inputs;
+  LoadInputData(&inputs);
+  CompareNativeAndAnalysis(
+      reinterpret_cast<const PaddlePredictor::Config *>(&cfg), inputs);
+}
+TEST(Analyzer_bert, compare) { compare(); }
+#ifdef PADDLE_WITH_MKLDNN
+TEST(Analyzer_bert, compare_mkldnn) { compare(true /* use_mkldnn */); }
+#endif
+// Compare Deterministic result
+TEST(Analyzer_bert, compare_determine) {
+  AnalysisConfig cfg;
+  SetConfig(&cfg);
+  std::vector<std::vector<PaddleTensor>> inputs;
+  LoadInputData(&inputs);
+  CompareDeterministic(reinterpret_cast<const PaddlePredictor::Config *>(&cfg),
+                       inputs);
+}
+}  // namespace inference
+}  // namespace paddle
--- a/paddle/fluid/inference/tests/api/analyzer_dam_tester.cc
+++ b/paddle/fluid/inference/tests/api/analyzer_dam_tester.cc
@@ -19,7 +19,6 @@ DEFINE_int32(max_turn_num, 9,
 namespace paddle {
 namespace inference {
-using contrib::AnalysisConfig;
 constexpr int32_t kMaxTurnLen = 50;
@@ -165,7 +164,7 @@ void PrepareInputs(std::vector<PaddleTensor> *input_slots, DataRecord *data,
  input_slots->push_back(std::move(response_mask_tensor));
 }
-void SetConfig(contrib::AnalysisConfig *cfg) {
+void SetConfig(AnalysisConfig *cfg) {
  cfg->SetModel(FLAGS_infer_model + "/__model__", FLAGS_infer_model + "/param");
  cfg->SwitchSpecifyInputNames();
  cfg->SwitchIrOptim(true);
@@ -187,7 +186,7 @@ void SetInput(std::vector<std::vector<PaddleTensor>> *inputs) {
 // Easy for profiling independently.
 void profile(bool use_mkldnn = false) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  if (use_mkldnn) {
@@ -223,7 +222,7 @@ TEST(Analyzer_dam, profile_mkldnn) { profile(true /* use_mkldnn */); }
 // Check the fuse status
 TEST(Analyzer_dam, fuse_statis) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  int num_ops;
@@ -256,7 +255,7 @@ void compare(bool use_mkldnn = false) {
 TEST(Analyzer_dam, compare_with_static_memory_optim) {
  // The small dam will core in CI, but works in local.
  if (FLAGS_max_turn_num == 9) {
-    contrib::AnalysisConfig cfg, cfg1;
+    AnalysisConfig cfg, cfg1;
    DataRecord data(FLAGS_infer_data, FLAGS_batch_size);
    std::vector<std::vector<PaddleTensor>> input_slots_all;
@@ -282,7 +281,7 @@ TEST(Analyzer_dam, compare_with_static_memory_optim) {
 TEST(Analyzer_dam, compare_with_dynamic_memory_optim) {
  // The small dam will core in CI, but works in local.
  if (FLAGS_max_turn_num == 9) {
-    contrib::AnalysisConfig cfg, cfg1;
+    AnalysisConfig cfg, cfg1;
    DataRecord data(FLAGS_infer_data, FLAGS_batch_size);
    std::vector<std::vector<PaddleTensor>> input_slots_all;

--- a/paddle/fluid/inference/tests/api/analyzer_lac_tester.cc
+++ b/paddle/fluid/inference/tests/api/analyzer_lac_tester.cc
@@ -18,8 +18,6 @@ namespace paddle {
 namespace inference {
 namespace analysis {
-using contrib::AnalysisConfig;
 struct DataRecord {
  std::vector<int64_t> data;
  std::vector<size_t> lod;

--- a/paddle/fluid/inference/tests/api/analyzer_mm_dnn_tester.cc
+++ b/paddle/fluid/inference/tests/api/analyzer_mm_dnn_tester.cc
@@ -16,7 +16,6 @@
 namespace paddle {
 namespace inference {
-using contrib::AnalysisConfig;
 struct DataRecord {
  std::vector<std::vector<int64_t>> query, title;
@@ -75,7 +74,7 @@ void PrepareInputs(std::vector<PaddleTensor> *input_slots, DataRecord *data,
  }
 }
-void SetConfig(contrib::AnalysisConfig *cfg) {
+void SetConfig(AnalysisConfig *cfg) {
  cfg->SetModel(FLAGS_infer_model);
  cfg->DisableGpu();
  cfg->SwitchSpecifyInputNames();
@@ -95,7 +94,7 @@ void SetInput(std::vector<std::vector<PaddleTensor>> *inputs) {
 // Easy for profiling independently.
 void profile(bool use_mkldnn = false) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  std::vector<PaddleTensor> outputs;
@@ -130,7 +129,7 @@ TEST(Analyzer_MM_DNN, profile_mkldnn) { profile(true /* use_mkldnn */); }
 // Check the fuse status
 TEST(Analyzer_MM_DNN, fuse_statis) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  int num_ops;
@@ -141,7 +140,7 @@ TEST(Analyzer_MM_DNN, fuse_statis) {
 // Compare result of NativeConfig and AnalysisConfig
 void compare(bool use_mkldnn = false) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  if (use_mkldnn) {

--- a/paddle/fluid/inference/tests/api/analyzer_ner_tester.cc
+++ b/paddle/fluid/inference/tests/api/analyzer_ner_tester.cc
@@ -16,7 +16,6 @@
 namespace paddle {
 namespace inference {
-using contrib::AnalysisConfig;
 struct DataRecord {
  std::vector<std::vector<int64_t>> word, mention;
@@ -76,7 +75,7 @@ void PrepareInputs(std::vector<PaddleTensor> *input_slots, DataRecord *data) {
  }
 }
-void SetConfig(contrib::AnalysisConfig *cfg, bool memory_load = false) {
+void SetConfig(AnalysisConfig *cfg, bool memory_load = false) {
  if (memory_load) {
    std::string buffer_prog, buffer_param;
    ReadBinaryFile(FLAGS_infer_model + "/__model__", &buffer_prog);
@@ -105,7 +104,7 @@ void SetInput(std::vector<std::vector<PaddleTensor>> *inputs) {
 // Easy for profiling independently.
 void profile(bool memory_load = false) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg, memory_load);
  std::vector<PaddleTensor> outputs;
@@ -136,7 +135,7 @@ TEST(Analyzer_Chinese_ner, profile_memory_load) {
 // Check the fuse status
 TEST(Analyzer_Chinese_ner, fuse_statis) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  int num_ops;
@@ -152,7 +151,7 @@ TEST(Analyzer_Chinese_ner, fuse_statis) {
 // Compare result of NativeConfig and AnalysisConfig
 TEST(Analyzer_Chinese_ner, compare) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  std::vector<std::vector<PaddleTensor>> input_slots_all;

--- a/paddle/fluid/inference/tests/api/analyzer_pyramid_dnn_tester.cc
+++ b/paddle/fluid/inference/tests/api/analyzer_pyramid_dnn_tester.cc
@@ -16,7 +16,6 @@
 namespace paddle {
 namespace inference {
-using contrib::AnalysisConfig;
 struct DataRecord {
  std::vector<std::vector<int64_t>> query_basic, query_phrase, title_basic,
@@ -103,7 +102,7 @@ void PrepareInputs(std::vector<PaddleTensor> *input_slots, DataRecord *data,
  }
 }
-void SetConfig(contrib::AnalysisConfig *cfg) {
+void SetConfig(AnalysisConfig *cfg) {
  cfg->SetModel(FLAGS_infer_model);
  cfg->DisableGpu();
  cfg->SwitchSpecifyInputNames();
@@ -123,7 +122,7 @@ void SetInput(std::vector<std::vector<PaddleTensor>> *inputs) {
 // Easy for profiling independently.
 TEST(Analyzer_Pyramid_DNN, profile) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  std::vector<PaddleTensor> outputs;
@@ -147,7 +146,7 @@ TEST(Analyzer_Pyramid_DNN, profile) {
 // Check the fuse status
 TEST(Analyzer_Pyramid_DNN, fuse_statis) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  int num_ops;
@@ -158,7 +157,7 @@ TEST(Analyzer_Pyramid_DNN, fuse_statis) {
 // Compare result of NativeConfig and AnalysisConfig
 TEST(Analyzer_Pyramid_DNN, compare) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  std::vector<std::vector<PaddleTensor>> input_slots_all;

--- a/paddle/fluid/inference/tests/api/analyzer_rnn1_tester.cc
+++ b/paddle/fluid/inference/tests/api/analyzer_rnn1_tester.cc
@@ -20,7 +20,6 @@ namespace paddle {
 namespace inference {
 using namespace framework;  // NOLINT
-using namespace contrib;    // NOLINT
 struct DataRecord {
  std::vector<std::vector<std::vector<float>>> link_step_data_all;
@@ -223,7 +222,7 @@ void SetInput(std::vector<std::vector<PaddleTensor>> *inputs) {
 // Easy for profiling independently.
 TEST(Analyzer_rnn1, profile) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  cfg.DisableGpu();
  cfg.SwitchIrDebug();
@@ -237,7 +236,7 @@ TEST(Analyzer_rnn1, profile) {
 // Check the fuse status
 TEST(Analyzer_rnn1, fuse_statis) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  int num_ops;
@@ -254,7 +253,7 @@ TEST(Analyzer_rnn1, fuse_statis) {
 // Compare result of NativeConfig and AnalysisConfig
 TEST(Analyzer_rnn1, compare) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  std::vector<std::vector<PaddleTensor>> input_slots_all;
@@ -276,7 +275,7 @@ TEST(Analyzer_rnn1, compare_determine) {
 // Test Multi-Thread.
 TEST(Analyzer_rnn1, multi_thread) {
-  contrib::AnalysisConfig cfg;
+  AnalysisConfig cfg;
  SetConfig(&cfg);
  std::vector<PaddleTensor> outputs;

--- a/paddle/fluid/inference/tests/api/analyzer_seq_pool1_tester.cc
+++ b/paddle/fluid/inference/tests/api/analyzer_seq_pool1_tester.cc
@@ -142,7 +142,7 @@ void SetConfig(AnalysisConfig *cfg, bool use_mkldnn = false) {
  cfg->SetModel(FLAGS_infer_model + "/model", FLAGS_infer_model + "/params");
  cfg->DisableGpu();
  cfg->SwitchSpecifyInputNames();
-  cfg->pass_builder()->TurnOnDebug();
+  cfg->SwitchIrDebug();
  cfg->SetCpuMathLibraryNumThreads(FLAGS_paddle_num_threads);
  if (use_mkldnn) {
    cfg->EnableMKLDNN();

--- a/paddle/fluid/inference/tests/api/analyzer_text_classification_tester.cc
+++ b/paddle/fluid/inference/tests/api/analyzer_text_classification_tester.cc
@@ -69,7 +69,7 @@ void SetInput(std::vector<std::vector<PaddleTensor>> *inputs) {
 TEST(Analyzer_Text_Classification, profile) {
  AnalysisConfig cfg;
  SetConfig(&cfg);
-  cfg.pass_builder()->TurnOnDebug();
+  cfg.SwitchIrDebug();
  std::vector<PaddleTensor> outputs;
  std::vector<std::vector<PaddleTensor>> input_slots_all;

--- a/paddle/fluid/inference/tests/api/analyzer_vis_tester.cc
+++ b/paddle/fluid/inference/tests/api/analyzer_vis_tester.cc
@@ -20,7 +20,6 @@ limitations under the License. */
 namespace paddle {
 namespace inference {
 namespace analysis {
-using contrib::AnalysisConfig;
 struct Record {
  std::vector<float> data;

--- a/paddle/fluid/inference/tests/api/config_printer.h
+++ b/paddle/fluid/inference/tests/api/config_printer.h
@@ -58,9 +58,8 @@ std::ostream &operator<<(std::ostream &os, const NativeConfig &config) {
  return os;
 }
-std::ostream &operator<<(std::ostream &os,
+std::ostream &operator<<(std::ostream &os, const AnalysisConfig &config) {
-                         const contrib::AnalysisConfig &config) {
+  os << GenSpaces(num_spaces) << "AnalysisConfig {\n";
-  os << GenSpaces(num_spaces) << "contrib::AnalysisConfig {\n";
  num_spaces++;
  os << config.ToNativeConfig();
  if (!config.model_from_memory()) {

--- a/paddle/fluid/inference/tests/api/tester_helper.h
+++ b/paddle/fluid/inference/tests/api/tester_helper.h
@@ -56,16 +56,9 @@ DECLARE_int32(paddle_num_threads);
 namespace paddle {
 namespace inference {
-float Random(float low, float high) {
-  static std::random_device rd;
-  static std::mt19937 mt(rd());
-  std::uniform_real_distribution<double> dist(low, high);
-  return dist(mt);
-}
 void PrintConfig(const PaddlePredictor::Config *config, bool use_analysis) {
  const auto *analysis_config =
-      reinterpret_cast<const contrib::AnalysisConfig *>(config);
+      reinterpret_cast<const AnalysisConfig *>(config);
  if (use_analysis) {
    LOG(INFO) << *analysis_config;
    return;
@@ -109,9 +102,9 @@ void CompareResult(const std::vector<PaddleTensor> &outputs,
 std::unique_ptr<PaddlePredictor> CreateTestPredictor(
    const PaddlePredictor::Config *config, bool use_analysis = true) {
  const auto *analysis_config =
-      reinterpret_cast<const contrib::AnalysisConfig *>(config);
+      reinterpret_cast<const AnalysisConfig *>(config);
  if (use_analysis) {
-    return CreatePaddlePredictor<contrib::AnalysisConfig>(*analysis_config);
+    return CreatePaddlePredictor<AnalysisConfig>(*analysis_config);
  }
  auto native_config = analysis_config->ToNativeConfig();
  return CreatePaddlePredictor<NativeConfig>(native_config);
@@ -146,7 +139,8 @@ void SetFakeImageInput(std::vector<std::vector<PaddleTensor>> *inputs,
                       const std::string &dirname, bool is_combined = true,
                       std::string model_filename = "model",
                       std::string params_filename = "params",
-                       const std::vector<std::string> *feed_names = nullptr) {
+                       const std::vector<std::string> *feed_names = nullptr,
+                       const int continuous_inuput_index = 0) {
  // Set fake_image_data
  PADDLE_ENFORCE_EQ(FLAGS_test_all_data, 0, "Only have single batch of data.");
  std::vector<std::vector<int64_t>> feed_target_shapes = GetFeedTargetShapes(
@@ -183,7 +177,8 @@ void SetFakeImageInput(std::vector<std::vector<PaddleTensor>> *inputs,
    float *input_data = static_cast<float *>(input.data.data());
    // fill input data, for profile easily, do not use random data here.
    for (size_t j = 0; j < len; ++j) {
-      *(input_data + j) = Random(0.0, 1.0) / 10.;
+      *(input_data + j) =
+          static_cast<float>((j + continuous_inuput_index) % len) / len;
    }
  }
  (*inputs).emplace_back(input_slots);

--- a/paddle/fluid/inference/tests/api/trt_models_tester.cc
+++ b/paddle/fluid/inference/tests/api/trt_models_tester.cc
@@ -42,9 +42,9 @@ void SetConfig(ConfigType* config, std::string model_dir, bool use_gpu,
 }
 template <>
-void SetConfig<contrib::AnalysisConfig>(contrib::AnalysisConfig* config,
+void SetConfig<AnalysisConfig>(AnalysisConfig* config, std::string model_dir,
-                                        std::string model_dir, bool use_gpu,
+                               bool use_gpu, bool use_tensorrt,
-                                        bool use_tensorrt, int batch_size) {
+                               int batch_size) {
  if (!FLAGS_prog_filename.empty() && !FLAGS_param_filename.empty()) {
    config->SetModel(model_dir + "/" + FLAGS_prog_filename,
                     model_dir + "/" + FLAGS_param_filename);
@@ -75,11 +75,11 @@ void profile(std::string model_dir, bool use_analysis, bool use_tensorrt) {
  std::vector<PaddleTensor> outputs;
  if (use_analysis || use_tensorrt) {
-    contrib::AnalysisConfig config;
+    AnalysisConfig config;
    config.EnableUseGpu(100, 0);
    config.pass_builder()->TurnOnDebug();
-    SetConfig<contrib::AnalysisConfig>(&config, model_dir, true, use_tensorrt,
+    SetConfig<AnalysisConfig>(&config, model_dir, true, use_tensorrt,
-                                       FLAGS_batch_size);
+                              FLAGS_batch_size);
    TestPrediction(reinterpret_cast<PaddlePredictor::Config*>(&config),
                   inputs_all, &outputs, FLAGS_num_threads, true);
  } else {
@@ -99,18 +99,18 @@ void compare(std::string model_dir, bool use_tensorrt) {
    SetFakeImageInput(&inputs_all, model_dir, false, "__model__", "");
  }
-  contrib::AnalysisConfig analysis_config;
+  AnalysisConfig analysis_config;
-  SetConfig<contrib::AnalysisConfig>(&analysis_config, model_dir, true,
+  SetConfig<AnalysisConfig>(&analysis_config, model_dir, true, use_tensorrt,
-                                     use_tensorrt, FLAGS_batch_size);
+                            FLAGS_batch_size);
  CompareNativeAndAnalysis(
      reinterpret_cast<const PaddlePredictor::Config*>(&analysis_config),
      inputs_all);
 }
 void compare_continuous_input(std::string model_dir, bool use_tensorrt) {
-  contrib::AnalysisConfig analysis_config;
+  AnalysisConfig analysis_config;
-  SetConfig<contrib::AnalysisConfig>(&analysis_config, model_dir, true,
+  SetConfig<AnalysisConfig>(&analysis_config, model_dir, true, use_tensorrt,
-                                     use_tensorrt, FLAGS_batch_size);
+                            FLAGS_batch_size);
  auto config =
      reinterpret_cast<const PaddlePredictor::Config*>(&analysis_config);
  auto native_pred = CreateTestPredictor(config, false);
@@ -119,9 +119,10 @@ void compare_continuous_input(std::string model_dir, bool use_tensorrt) {
    std::vector<std::vector<PaddleTensor>> inputs_all;
    if (!FLAGS_prog_filename.empty() && !FLAGS_param_filename.empty()) {
      SetFakeImageInput(&inputs_all, model_dir, true, FLAGS_prog_filename,
-                        FLAGS_param_filename);
+                        FLAGS_param_filename, nullptr, i);
    } else {
-      SetFakeImageInput(&inputs_all, model_dir, false, "__model__", "");
+      SetFakeImageInput(&inputs_all, model_dir, false, "__model__", "", nullptr,
+                        i);
    }
    CompareNativeAndAnalysis(native_pred.get(), analysis_pred.get(),
                             inputs_all);

--- a/paddle/fluid/inference/utils/benchmark_tester.cc
+++ b/paddle/fluid/inference/utils/benchmark_tester.cc
@@ -34,6 +34,6 @@ TEST(Benchmark, PersistToFile) {
  benchmark.SetLatency(220);
  benchmark.PersistToFile("1.log");
-  benchmark.PersistToFile("1.log");
+  benchmark.PersistToFile("2.log");
-  benchmark.PersistToFile("1.log");
+  benchmark.PersistToFile("3.log");
 }
--- a/paddle/fluid/memory/allocation/legacy_allocator.cc
+++ b/paddle/fluid/memory/allocation/legacy_allocator.cc
@@ -13,9 +13,15 @@
 // limitations under the License.
 #include "paddle/fluid/memory/allocation/legacy_allocator.h"
 #include <string>
 #include <utility>
 #include <vector>
+#ifdef PADDLE_WITH_JEMALLOC
+#include <jemalloc/jemalloc.h>
+#endif
 #include "glog/logging.h"
 #include "paddle/fluid/memory/detail/buddy_allocator.h"
 #include "paddle/fluid/memory/detail/system_allocator.h"
@@ -53,11 +59,6 @@ size_t memory_usage(const platform::Place &p);
 using BuddyAllocator = detail::BuddyAllocator;
-std::unordered_map</*device id*/ int,
-                   std::pair</*current memory usage*/ uint64_t,
-                             /*peak memory usage*/ uint64_t>>
-    gpu_mem_info;
 BuddyAllocator *GetCPUBuddyAllocator() {
  // We tried thread_local for inference::RNN1 model, but that not works much
  // for multi-thread test.
@@ -95,7 +96,11 @@ struct NaiveAllocator {
 template <>
 void *Alloc<platform::CPUPlace>(const platform::CPUPlace &place, size_t size) {
  VLOG(10) << "Allocate " << size << " bytes on " << platform::Place(place);
+#ifdef PADDLE_WITH_JEMALLOC
+  void *p = malloc(size);
+#else
  void *p = GetCPUBuddyAllocator()->Alloc(size);
+#endif
  if (FLAGS_init_allocated_mem) {
    memset(p, 0xEF, size);
  }
@@ -107,12 +112,21 @@ template <>
 void Free<platform::CPUPlace>(const platform::CPUPlace &place, void *p,
                              size_t size) {
  VLOG(10) << "Free pointer=" << p << " on " << platform::Place(place);
+#ifdef PADDLE_WITH_JEMALLOC
+  free(p);
+#else
  GetCPUBuddyAllocator()->Free(p);
+#endif
 }
 template <>
 size_t Used<platform::CPUPlace>(const platform::CPUPlace &place) {
+#ifdef PADDLE_WITH_JEMALLOC
+  // fake the result of used memory when PADDLE_WITH_JEMALLOC is ON
+  return 0U;
+#else
  return GetCPUBuddyAllocator()->Used();
+#endif
 }
 #ifdef PADDLE_WITH_CUDA
@@ -125,6 +139,8 @@ BuddyAllocator *GetGPUBuddyAllocator(int gpu_id) {
    devices = platform::GetSelectedDevices();
    int gpu_num = devices.size();
+    allocation::GPUMemMonitor.Initialize(devices.size());
    a_arr = new BuddyAllocator *[gpu_num];
    for (size_t i = 0; i < devices.size(); ++i) {
      int dev_id = devices[i];
@@ -171,25 +187,19 @@ void *Alloc<platform::CUDAPlace>(const platform::CUDAPlace &place,
    platform::SetDeviceId(place.device);
    size_t avail, total;
    platform::GpuMemoryUsage(&avail, &total);
-    LOG(WARNING) << "Cannot allocate " << string::HumanReadableSize(size)
+    LOG(FATAL) << "Cannot allocate " << string::HumanReadableSize(size)
-                 << " in GPU " << place.device << ", available "
+               << " in GPU " << place.device << ", available "
-                 << string::HumanReadableSize(avail);
+               << string::HumanReadableSize(avail) << "total " << total
-    LOG(WARNING) << "total " << total;
+               << "GpuMinChunkSize "
-    LOG(WARNING) << "GpuMinChunkSize "
+               << string::HumanReadableSize(buddy_allocator->GetMinChunkSize())
-                 << string::HumanReadableSize(
+               << "GpuMaxChunkSize "
-                        buddy_allocator->GetMinChunkSize());
+               << string::HumanReadableSize(buddy_allocator->GetMaxChunkSize())
-    LOG(WARNING) << "GpuMaxChunkSize "
+               << "GPU memory used: "
-                 << string::HumanReadableSize(
+               << string::HumanReadableSize(Used<platform::CUDAPlace>(place));
-                        buddy_allocator->GetMaxChunkSize());
-    LOG(WARNING) << "GPU memory used: "
-                 << string::HumanReadableSize(Used<platform::CUDAPlace>(place));
    platform::SetDeviceId(cur_dev);
  } else {
-    gpu_mem_info[place.device].first += size;
+    if (VLOG_IS_ON(3)) {
-    if (gpu_mem_info[place.device].first > gpu_mem_info[place.device].second) {
+      allocation::GPUMemMonitor.Add(place.device, size);
-      gpu_mem_info[place.device].second = gpu_mem_info[place.device].first;
-      VLOG(3) << "device: " << place.device << " peak memory usage : "
-              << (gpu_mem_info[place.device].second >> 20) << " MiB";
    }
    if (FLAGS_init_allocated_mem) {
      cudaMemset(ptr, 0xEF, size);
@@ -206,7 +216,9 @@ void Free<platform::CUDAPlace>(const platform::CUDAPlace &place, void *p,
                               size_t size) {
 #ifdef PADDLE_WITH_CUDA
  GetGPUBuddyAllocator(place.device)->Free(p);
-  gpu_mem_info[place.device].first -= size;
+  if (VLOG_IS_ON(3)) {
+    allocation::GPUMemMonitor.Minus(place.device, size);
+  }
 #else
  PADDLE_THROW("'CUDAPlace' is not supported in CPU only device.");
 #endif
@@ -316,6 +328,8 @@ size_t Usage::operator()(const platform::CUDAPinnedPlace &cuda_pinned) const {
 namespace allocation {
+LegacyMemMonitor GPUMemMonitor;
 Allocation *LegacyAllocator::AllocateImpl(size_t size, Allocator::Attr attr) {
  void *ptr = boost::apply_visitor(legacy::AllocVisitor(size), place_);
  return new Allocation(ptr, size, place_);
@@ -327,6 +341,63 @@ void LegacyAllocator::Free(Allocation *allocation) {
      allocation->place());
  delete allocation;
 }
+bool MemInfo::Add(const size_t &size) {
+  std::lock_guard<std::mutex> lock(mutex_);
+  usage_ += size;
+  bool peak_point = usage_ > peak_usage_;
+  if (peak_point) peak_usage_ = usage_;
+  return peak_point;
+}
+void MemInfo::Minus(const size_t &size) {
+  std::lock_guard<std::mutex> lock(mutex_);
+  usage_ -= size;
+}
+uint64_t MemInfo::GetPeakUsage() { return peak_usage_; }
+LegacyMemMonitor::~LegacyMemMonitor() {
+  for (auto &item : gpu_mem_info_) delete item.second;
+}
+void LegacyMemMonitor::Initialize(const int &device_num) {
+  for (auto i = 0; i < device_num; ++i) {
+    gpu_mem_info_[i] = new MemInfo();
+  }
+}
+void LegacyMemMonitor::Add(const int &device, const size_t &size) {
+  if (gpu_mem_info_[device]->Add(size)) {
+    VLOG(3) << "#LegacyMemMonitor# device: " << device
+            << " peak memory usage : "
+            << (gpu_mem_info_[device]->GetPeakUsage() >> 20) << " MiB";
+  }
+}
+void LegacyMemMonitor::Minus(const int &device, const size_t &size) {
+  gpu_mem_info_[device]->Minus(size);
+}
+uint64_t LegacyMemMonitor::GetMemUsage(const int &device) {
+  return gpu_mem_info_.find(device) == gpu_mem_info_.end()
+             ? 0
+             : gpu_mem_info_[device]->GetPeakUsage();
+}
+void LegacyMemMonitor::PrintMemUsage() {
+  std::vector<int> devices;
+  for (const auto &item : gpu_mem_info_) {
+    devices.emplace_back(item.first);
+  }
+  std::sort(devices.begin(), devices.end());
+  for (const auto &device : devices) {
+    std::cout << "Device : " << device << " Peak Memory Usage : "
+              << (gpu_mem_info_[device]->GetPeakUsage() >> 20) << " MiB"
+              << std::endl;
+  }
+}
 }  // namespace allocation
 }  // namespace memory
 }  // namespace paddle
--- a/paddle/fluid/memory/allocation/legacy_allocator.h
+++ b/paddle/fluid/memory/allocation/legacy_allocator.h
@@ -13,12 +13,59 @@
 // limitations under the License.
 #pragma once
+#include <algorithm>
+#include <mutex>  // NOLINT
+#include <unordered_map>
+#include <utility>
+#include <vector>
 #include "paddle/fluid/memory/allocation/allocator.h"
 #include "paddle/fluid/platform/place.h"
 namespace paddle {
 namespace memory {
 namespace allocation {
+class MemInfo {
+ public:
+  MemInfo() : usage_(0), peak_usage_(0) {}
+  MemInfo(const MemInfo &) = delete;
+  MemInfo &operator=(const MemInfo &) = delete;
+  // return a flag to indicate current operation will create a peak point or not
+  bool Add(const size_t &);
+  void Minus(const size_t &);
+  uint64_t GetPeakUsage();
+ private:
+  /* current memory usage*/
+  uint64_t usage_;
+  uint64_t peak_usage_;
+  std::mutex mutex_;
+};
+class LegacyMemMonitor {
+ public:
+  // used to store the GPU memory usage of each devices
+  using MemUsage = std::unordered_map</*device id*/ int,
+                                      /*mem usage info node*/ MemInfo *>;
+  MemUsage GetMemUsageInfo() { return gpu_mem_info_; }
+  ~LegacyMemMonitor();
+  void Initialize(const int &);
+  void Add(const int &, const size_t &);
+  void Minus(const int &, const size_t &);
+  uint64_t GetMemUsage(const int &);
+  void PrintMemUsage();
+ protected:
+  MemUsage gpu_mem_info_;
+};
+extern LegacyMemMonitor GPUMemMonitor;
 class LegacyAllocatorPrivate;
 class LegacyAllocator : public Allocator {
 public:

--- a/paddle/fluid/operators/activation_op.cc
+++ b/paddle/fluid/operators/activation_op.cc
@@ -14,7 +14,7 @@ limitations under the License. */
 #include "paddle/fluid/operators/activation_op.h"
 #include <string>
-#include "paddle/fluid/operators/mkldnn_activation_op.h"
+#include "paddle/fluid/operators/mkldnn/mkldnn_activation_op.h"
 #include "paddle/fluid/platform/port.h"
 namespace paddle {
@@ -547,12 +547,14 @@ namespace ops = paddle::operators;
  __macro(Swish, swish);             \
  __macro(ThresholdedRelu, thresholded_relu);
-#define REGISTER_INPLACE_ACTIVATION_OP(OP_NAME, KERNEL_TYPE)        \
+#define REGISTER_INPLACE_ACTIVATION_OP(OP_NAME, KERNEL_TYPE)                   \
-  REGISTER_OPERATOR(KERNEL_TYPE, ::paddle::operators::ActivationOp, \
+  REGISTER_OPERATOR(KERNEL_TYPE, ::paddle::operators::ActivationOp,            \
-                    ::paddle::operators::OP_NAME##OpMaker,          \
+                    ::paddle::operators::OP_NAME##OpMaker,                     \
-                    ::paddle::operators::ActivationOpInferVarType,  \
+                    ::paddle::operators::ActivationOpInferVarType,             \
-                    ::paddle::operators::OP_NAME##GradMaker);       \
+                    ::paddle::operators::OP_NAME##GradMaker,                   \
-  REGISTER_OPERATOR(KERNEL_TYPE##_grad, ::paddle::operators::ActivationOpGrad)
+                    ::paddle::framework::SingleOpInplaceInToOut);              \
+  REGISTER_OPERATOR(KERNEL_TYPE##_grad, ::paddle::operators::ActivationOpGrad, \
+                    ::paddle::framework::SingleOpInplaceInToOut)
 #define REGISTER_ACTIVATION_OP(OP_NAME, KERNEL_TYPE)                    \
  REGISTER_OPERATOR(KERNEL_TYPE, ::paddle::operators::ActivationOp,     \

--- a/paddle/fluid/operators/batch_norm_op.cc
+++ b/paddle/fluid/operators/batch_norm_op.cc
@@ -589,8 +589,10 @@ class BatchNormGradMaker : public framework::SingleGradOpDescMaker {
    op->SetInput("SavedVariance", Output("SavedVariance"));
    // used when setting use_global_stats True during training
-    op->SetInput("Mean", Output("MeanOut"));
+    if (boost::get<bool>(GetAttr("use_global_stats"))) {
-    op->SetInput("Variance", Output("VarianceOut"));
+      op->SetInput("Mean", Output("MeanOut"));
+      op->SetInput("Variance", Output("VarianceOut"));
+    }
    op->SetAttrMap(Attrs());
@@ -602,13 +604,48 @@ class BatchNormGradMaker : public framework::SingleGradOpDescMaker {
  }
 };
+class BatchNormInplaceInToOut : public framework::InplaceInToOut {
+ public:
+  using InplaceInToOut::InplaceInToOut;
+ protected:
+  std::unordered_map<std::string, std::string> Apply(
+      const framework::OpDesc &op_desc,
+      framework::BlockDesc *block) const override {
+    std::unordered_map<std::string, std::string> inplace_in_to_out = {
+        {"Mean", "MeanOut"}, {"Variance", "VarianceOut"}, {"X", "Y"},
+    };
+    return inplace_in_to_out;
+  }
+};
+class BatchNormGradInplaceInToOut : public framework::InplaceInToOut {
+ public:
+  using InplaceInToOut::InplaceInToOut;
+ protected:
+  std::unordered_map<std::string, std::string> Apply(
+      const framework::OpDesc &op_desc,
+      framework::BlockDesc *block) const override {
+    std::unordered_map<std::string, std::string> inplace_in_to_out = {
+        // Scale, Bias, SavedMean, SavedVariance shape is [batch_size, C]
+        {framework::GradVarName("Y"), framework::GradVarName("X")},
+        {"SavedMean", framework::GradVarName("Scale")},
+        {"SavedVariance", framework::GradVarName("Bias")},
+    };
+    return inplace_in_to_out;
+  }
+};
 }  // namespace operators
 }  // namespace paddle
 namespace ops = paddle::operators;
 REGISTER_OPERATOR(batch_norm, ops::BatchNormOp, ops::BatchNormOpMaker,
-                  ops::BatchNormOpInferVarType, ops::BatchNormGradMaker);
+                  ops::BatchNormOpInferVarType, ops::BatchNormGradMaker,
-REGISTER_OPERATOR(batch_norm_grad, ops::BatchNormGradOp);
+                  ops::BatchNormInplaceInToOut);
+REGISTER_OPERATOR(batch_norm_grad, ops::BatchNormGradOp,
+                  ops::BatchNormGradInplaceInToOut);
 REGISTER_OP_CPU_KERNEL(
    batch_norm, ops::BatchNormKernel<paddle::platform::CPUDeviceContext, float>,

--- a/paddle/fluid/operators/beam_search_op.cc
+++ b/paddle/fluid/operators/beam_search_op.cc
@@ -51,6 +51,9 @@ class BeamSearchOpMaker : public framework::OpProtoAndCheckerMaker {
    AddOutput("selected_scores",
              "A LoDTensor containing the accumulated scores corresponding to "
              "Output(selected_ids).");
+    AddOutput(
+        "parent_idx",
+        "A Tensor preserving the selected_ids' parent indice in pre_ids.");
    // Attributes stored in AttributeMap
    AddAttr<int>("level", "the level of LoDTensor");

--- a/paddle/fluid/operators/beam_search_op.h
+++ b/paddle/fluid/operators/beam_search_op.h
@@ -41,13 +41,15 @@ class BeamSearchOpKernel : public framework::OpKernel<T> {
    auto selected_ids = context.Output<framework::LoDTensor>("selected_ids");
    auto selected_scores =
        context.Output<framework::LoDTensor>("selected_scores");
+    auto* parent_idx = context.Output<framework::Tensor>("parent_idx");
    PADDLE_ENFORCE_NOT_NULL(selected_ids);
    PADDLE_ENFORCE_NOT_NULL(selected_scores);
+    PADDLE_ENFORCE_NOT_NULL(parent_idx);
    math::BeamSearchFunctor<DeviceContext, T> alg;
    alg(context.template device_context<DeviceContext>(), pre_ids, pre_scores,
-        ids, scores, selected_ids, selected_scores, level, beam_size, end_id,
+        ids, scores, selected_ids, selected_scores, parent_idx, level,
-        is_accumulated);
+        beam_size, end_id, is_accumulated);
  }
 };

--- a/paddle/fluid/operators/detection/CMakeLists.txt
+++ b/paddle/fluid/operators/detection/CMakeLists.txt
@@ -31,6 +31,8 @@ detection_library(polygon_box_transform_op SRCS polygon_box_transform_op.cc
 polygon_box_transform_op.cu)
 detection_library(rpn_target_assign_op SRCS rpn_target_assign_op.cc)
 detection_library(generate_proposal_labels_op SRCS generate_proposal_labels_op.cc)
+detection_library(box_clip_op SRCS box_clip_op.cc box_clip_op.cu)
+detection_library(yolov3_loss_op SRCS yolov3_loss_op.cc)
 if(WITH_GPU)
  detection_library(generate_proposals_op SRCS generate_proposals_op.cc generate_proposals_op.cu DEPS memory cub)

--- a/paddle/fluid/operators/detection/bbox_util.h
+++ b/paddle/fluid/operators/detection/bbox_util.h
@@ -99,5 +99,29 @@ void BboxOverlaps(const framework::Tensor& r_boxes,
  }
 }
+template <class T>
+void ClipTiledBoxes(const platform::DeviceContext& ctx,
+                    const framework::Tensor& im_info,
+                    const framework::Tensor& input_boxes,
+                    framework::Tensor* out) {
+  T* out_data = out->mutable_data<T>(ctx.GetPlace());
+  const T* im_info_data = im_info.data<T>();
+  const T* input_boxes_data = input_boxes.data<T>();
+  T zero(0);
+  T im_w = round(im_info_data[1] / im_info_data[2]);
+  T im_h = round(im_info_data[0] / im_info_data[2]);
+  for (int64_t i = 0; i < input_boxes.numel(); ++i) {
+    if (i % 4 == 0) {
+      out_data[i] = std::max(std::min(input_boxes_data[i], im_w - 1), zero);
+    } else if (i % 4 == 1) {
+      out_data[i] = std::max(std::min(input_boxes_data[i], im_h - 1), zero);
+    } else if (i % 4 == 2) {
+      out_data[i] = std::max(std::min(input_boxes_data[i], im_w - 1), zero);
+    } else {
+      out_data[i] = std::max(std::min(input_boxes_data[i], im_h - 1), zero);
+    }
+  }
+}
 }  // namespace operators
 }  // namespace paddle
--- a/paddle/fluid/operators/detection/box_clip_op.cc
+++ b/paddle/fluid/operators/detection/box_clip_op.cc
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#include "paddle/fluid/operators/detection/box_clip_op.h"
+#include "paddle/fluid/framework/op_registry.h"
+namespace paddle {
+namespace operators {
+class BoxClipOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+ protected:
+  void InferShape(framework::InferShapeContext* ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("Input"),
+                   "Input(Input) of BoxClipOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("ImInfo"),
+                   "Input(ImInfo) of BoxClipOp should not be null.");
+    auto input_box_dims = ctx->GetInputDim("Input");
+    auto im_info_dims = ctx->GetInputDim("ImInfo");
+    if (ctx->IsRuntime()) {
+      auto input_box_size = input_box_dims.size();
+      PADDLE_ENFORCE_EQ(input_box_dims[input_box_size - 1], 4,
+                        "The last dimension of Input must be 4");
+      PADDLE_ENFORCE_EQ(im_info_dims.size(), 2,
+                        "The rank of Input(Input) in BoxClipOp must be 2");
+      PADDLE_ENFORCE_EQ(im_info_dims[1], 3,
+                        "The last dimension of ImInfo must be 3");
+    }
+    ctx->ShareDim("Input", /*->*/ "Output");
+    ctx->ShareLoD("Input", /*->*/ "Output");
+  }
+};
+class BoxClipOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  void Make() override {
+    AddInput("Input",
+             "(LoDTensor) "
+             "Input is a LoDTensor with shape [..., 4] holds 4 points"
+             "in last dimension in format [xmin, ymin, xmax, ymax]");
+    AddInput("ImInfo",
+             "(Tensor) Information for image reshape is in shape (N, 3), "
+             "in format (height, width, im_scale)");
+    AddOutput("Output",
+              "(LoDTensor) "
+              "Output is a LoDTensor with the same shape as Input"
+              "and it is the result after clip");
+    AddComment(R"DOC(
+This operator clips input boxes to original input images.
+For each input box, The formula is given as follows:
+       $$xmin = \max(\min(xmin, im_w - 1), 0)$$
+       $$ymin = \max(\min(ymin, im_h - 1), 0)$$     
+       $$xmax = \max(\min(xmax, im_w - 1), 0)$$
+       $$ymax = \max(\min(ymax, im_h - 1), 0)$$
+where im_w and im_h are computed from ImInfo, the formula is given as follows:
+       $$im_w = \round(width / im_scale)$$
+       $$im_h = \round(height / im_scale)$$ 
+)DOC");
+  }
+};
+}  // namespace operators
+}  // namespace paddle
+namespace ops = paddle::operators;
+REGISTER_OPERATOR(box_clip, ops::BoxClipOp, ops::BoxClipOpMaker,
+                  paddle::framework::EmptyGradOpMaker);
+REGISTER_OP_CPU_KERNEL(
+    box_clip, ops::BoxClipKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::BoxClipKernel<paddle::platform::CPUDeviceContext, double>);
--- a/paddle/fluid/operators/detection/box_clip_op.cu
+++ b/paddle/fluid/operators/detection/box_clip_op.cu
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#include <algorithm>
+#include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/operators/detection/box_clip_op.h"
+#include "paddle/fluid/operators/math/math_function.h"
+#include "paddle/fluid/platform/cuda_primitives.h"
+#include "paddle/fluid/platform/hostdevice.h"
+namespace paddle {
+namespace operators {
+using Tensor = framework::Tensor;
+using LoDTenso = framework::LoDTensor;
+static constexpr int ImInfoSize = 3;
+template <typename T, int BlockSize>
+static __global__ void GPUBoxClip(const T *input, const size_t *lod,
+                                  const size_t width, const T *im_info,
+                                  T *output) {
+  T im_w = round(im_info[blockIdx.x * ImInfoSize + 1] /
+                 im_info[blockIdx.x * ImInfoSize + 2]);
+  T im_h = round(im_info[blockIdx.x * ImInfoSize] /
+                 im_info[blockIdx.x * ImInfoSize + 2]);
+  for (int i = threadIdx.x; i < (lod[blockIdx.x + 1] - lod[blockIdx.x]) * width;
+       i += BlockSize) {
+    int idx = lod[blockIdx.x] * width + i;
+    T im_size = (idx % 2 == 0) ? im_w : im_h;
+    output[idx] = max(min(input[idx], im_size - 1), T(0.));
+  }
+}
+template <typename DeviceContext, typename T>
+class GPUBoxClipKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext &context) const override {
+    PADDLE_ENFORCE(platform::is_gpu_place(context.GetPlace()),
+                   "This kernel only runs on GPU device.");
+    auto *input = context.Input<LoDTensor>("Input");
+    auto *im_info = context.Input<Tensor>("ImInfo");
+    auto *output = context.Output<LoDTensor>("Output");
+    const int64_t num = input->dims()[0];
+    const int64_t bbox_width = input->numel() / num;
+    auto lod = input->lod();
+    framework::LoD abs_offset_lod = framework::ToAbsOffset(lod);
+    auto &dev_ctx = context.template device_context<DeviceContext>();
+    auto stream = dev_ctx.stream();
+    const size_t batch_size = lod.back().size() - 1;
+    T *output_data = output->mutable_data<T>(dev_ctx.GetPlace());
+    GPUBoxClip<T, 512><<<batch_size, 512, 0, stream>>>(
+        input->data<T>(), abs_offset_lod[0].CUDAMutableData(dev_ctx.GetPlace()),
+        bbox_width, im_info->data<T>(), output_data);
+  }
+};
+}  // namespace operators
+}  // namespace paddle
+namespace ops = paddle::operators;
+REGISTER_OP_CUDA_KERNEL(
+    box_clip, ops::GPUBoxClipKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::GPUBoxClipKernel<paddle::platform::CUDADeviceContext, double>);
--- a/paddle/fluid/operators/detection/box_clip_op.h
+++ b/paddle/fluid/operators/detection/box_clip_op.h
--- a/paddle/fluid/operators/detection/box_coder_op.cc
+++ b/paddle/fluid/operators/detection/box_coder_op.cc
--- a/paddle/fluid/operators/detection/box_coder_op.cu
+++ b/paddle/fluid/operators/detection/box_coder_op.cu
--- a/paddle/fluid/operators/detection/box_coder_op.h
+++ b/paddle/fluid/operators/detection/box_coder_op.h
--- a/paddle/fluid/operators/detection/density_prior_box_op.h
+++ b/paddle/fluid/operators/detection/density_prior_box_op.h
--- a/paddle/fluid/operators/detection/multiclass_nms_op.cc
+++ b/paddle/fluid/operators/detection/multiclass_nms_op.cc
--- a/paddle/fluid/operators/yolov3_loss_op.cc
+++ b/paddle/fluid/operators/yolov3_loss_op.cc
--- a/paddle/fluid/operators/detection/yolov3_loss_op.h
+++ b/paddle/fluid/operators/detection/yolov3_loss_op.h
--- a/paddle/fluid/operators/distributed/proto_encoder_helper.h
+++ b/paddle/fluid/operators/distributed/proto_encoder_helper.h
--- a/paddle/fluid/operators/dropout_op.cu
+++ b/paddle/fluid/operators/dropout_op.cu
--- a/paddle/fluid/operators/elementwise/elementwise_add_op.cc
+++ b/paddle/fluid/operators/elementwise/elementwise_add_op.cc
--- a/paddle/fluid/operators/elementwise/elementwise_op.h
+++ b/paddle/fluid/operators/elementwise/elementwise_op.h
--- a/paddle/fluid/operators/elementwise/elementwise_add_mkldnn_op.cc
+++ b/paddle/fluid/operators/elementwise/elementwise_add_mkldnn_op.cc
--- a/paddle/fluid/operators/elementwise/elementwise_mul_mkldnn_op.cc
+++ b/paddle/fluid/operators/elementwise/elementwise_mul_mkldnn_op.cc
--- a/paddle/fluid/operators/flatten_op.cc
+++ b/paddle/fluid/operators/flatten_op.cc
--- a/paddle/fluid/operators/fused/fusion_repeated_fc_relu_op.cc
+++ b/paddle/fluid/operators/fused/fusion_repeated_fc_relu_op.cc
--- a/paddle/fluid/operators/fused/fusion_squared_mat_sub_op.cc
+++ b/paddle/fluid/operators/fused/fusion_squared_mat_sub_op.cc
--- a/paddle/fluid/operators/gather_op.cu
+++ b/paddle/fluid/operators/gather_op.cu
--- a/paddle/fluid/operators/gather_op.h
+++ b/paddle/fluid/operators/gather_op.h
--- a/paddle/fluid/operators/hierarchical_sigmoid_op.h
+++ b/paddle/fluid/operators/hierarchical_sigmoid_op.h
--- a/paddle/fluid/operators/interpolate_op.cc
+++ b/paddle/fluid/operators/interpolate_op.cc
--- a/paddle/fluid/operators/interpolate_op.cu
+++ b/paddle/fluid/operators/interpolate_op.cu
--- a/paddle/fluid/operators/interpolate_op.h
+++ b/paddle/fluid/operators/interpolate_op.h
--- a/paddle/fluid/operators/jit/benchmark.cc
+++ b/paddle/fluid/operators/jit/benchmark.cc
--- a/paddle/fluid/operators/jit/gen/CMakeLists.txt
+++ b/paddle/fluid/operators/jit/gen/CMakeLists.txt
--- a/paddle/fluid/operators/jit/gen/act.cc
+++ b/paddle/fluid/operators/jit/gen/act.cc
--- a/paddle/fluid/operators/jit/gen/blas.cc
+++ b/paddle/fluid/operators/jit/gen/blas.cc
--- a/paddle/fluid/operators/jit/gen/blas.h
+++ b/paddle/fluid/operators/jit/gen/blas.h
--- a/paddle/fluid/operators/jit/gen/hopv.cc
+++ b/paddle/fluid/operators/jit/gen/hopv.cc
--- a/paddle/fluid/operators/jit/gen/hopv.h
+++ b/paddle/fluid/operators/jit/gen/hopv.h
--- a/paddle/fluid/operators/jit/gen/jitcode.h
+++ b/paddle/fluid/operators/jit/gen/jitcode.h
--- a/paddle/fluid/operators/jit/gen/matmul.cc
+++ b/paddle/fluid/operators/jit/gen/matmul.cc
--- a/paddle/fluid/operators/jit/gen/matmul.h
+++ b/paddle/fluid/operators/jit/gen/matmul.h
--- a/paddle/fluid/operators/jit/gen_base.cc
+++ b/paddle/fluid/operators/jit/gen_base.cc
--- a/paddle/fluid/operators/jit/gen_base.h
+++ b/paddle/fluid/operators/jit/gen_base.h
--- a/paddle/fluid/operators/jit/helper.cc
+++ b/paddle/fluid/operators/jit/helper.cc
--- a/paddle/fluid/operators/jit/helper.h
+++ b/paddle/fluid/operators/jit/helper.h
--- a/paddle/fluid/operators/jit/kernel_base.h
+++ b/paddle/fluid/operators/jit/kernel_base.h
--- a/paddle/fluid/operators/jit/kernel_key.cc
+++ b/paddle/fluid/operators/jit/kernel_key.cc
--- a/paddle/fluid/operators/jit/more/mix/CMakeLists.txt
+++ b/paddle/fluid/operators/jit/more/mix/CMakeLists.txt
--- a/paddle/fluid/operators/jit/more/mix/mix.cc
+++ b/paddle/fluid/operators/jit/more/mix/mix.cc
--- a/paddle/fluid/operators/jit/more/mix/mix.h
+++ b/paddle/fluid/operators/jit/more/mix/mix.h
--- a/paddle/fluid/operators/jit/more/mkl/CMakeLists.txt
+++ b/paddle/fluid/operators/jit/more/mkl/CMakeLists.txt
--- a/paddle/fluid/operators/jit/more/mkl/mkl.cc
+++ b/paddle/fluid/operators/jit/more/mkl/mkl.cc
--- a/paddle/fluid/operators/jit/more/mkl/mkl.h
+++ b/paddle/fluid/operators/jit/more/mkl/mkl.h
--- a/paddle/fluid/operators/jit/refer/CMakeLists.txt
+++ b/paddle/fluid/operators/jit/refer/CMakeLists.txt
--- a/paddle/fluid/operators/jit/refer/refer.cc
+++ b/paddle/fluid/operators/jit/refer/refer.cc
--- a/paddle/fluid/operators/jit/refer/refer.h
+++ b/paddle/fluid/operators/jit/refer/refer.h
--- a/paddle/fluid/operators/jit/test.cc
+++ b/paddle/fluid/operators/jit/test.cc
--- a/paddle/fluid/operators/lookup_table_op.cu
+++ b/paddle/fluid/operators/lookup_table_op.cu
--- a/paddle/fluid/operators/math/CMakeLists.txt
+++ b/paddle/fluid/operators/math/CMakeLists.txt
--- a/paddle/fluid/operators/math/beam_search.cc
+++ b/paddle/fluid/operators/math/beam_search.cc
--- a/paddle/fluid/operators/math/beam_search.cu
+++ b/paddle/fluid/operators/math/beam_search.cu
--- a/paddle/fluid/operators/math/beam_search.h
+++ b/paddle/fluid/operators/math/beam_search.h
--- a/paddle/fluid/operators/math/beam_search_test.cc
+++ b/paddle/fluid/operators/math/beam_search_test.cc
--- a/paddle/fluid/operators/math/fc_compute.h
+++ b/paddle/fluid/operators/math/fc_compute.h
--- a/paddle/fluid/operators/math/softmax_impl.h
+++ b/paddle/fluid/operators/math/softmax_impl.h
--- a/paddle/fluid/operators/activation_mkldnn_op.cc
+++ b/paddle/fluid/operators/activation_mkldnn_op.cc
--- a/paddle/fluid/operators/batch_norm_mkldnn_op.cc
+++ b/paddle/fluid/operators/batch_norm_mkldnn_op.cc
--- a/paddle/fluid/operators/concat_mkldnn_op.cc
+++ b/paddle/fluid/operators/concat_mkldnn_op.cc
--- a/paddle/fluid/operators/conv_mkldnn_op.cc
+++ b/paddle/fluid/operators/conv_mkldnn_op.cc
--- a/paddle/fluid/operators/conv_transpose_mkldnn_op.cc
+++ b/paddle/fluid/operators/conv_transpose_mkldnn_op.cc
--- a/paddle/fluid/operators/dequantize_mkldnn_op.cc
+++ b/paddle/fluid/operators/dequantize_mkldnn_op.cc
--- a/paddle/fluid/operators/fc_mkldnn_op.cc
+++ b/paddle/fluid/operators/fc_mkldnn_op.cc
--- a/paddle/fluid/operators/gaussian_random_mkldnn_op.cc
+++ b/paddle/fluid/operators/gaussian_random_mkldnn_op.cc
--- a/paddle/fluid/operators/lrn_mkldnn_op.cc
+++ b/paddle/fluid/operators/lrn_mkldnn_op.cc
--- a/paddle/fluid/operators/mkldnn_activation_op.h
+++ b/paddle/fluid/operators/mkldnn_activation_op.h
--- a/paddle/fluid/operators/pool_mkldnn_op.cc
+++ b/paddle/fluid/operators/pool_mkldnn_op.cc
--- a/paddle/fluid/operators/quantize_mkldnn_op.cc
+++ b/paddle/fluid/operators/quantize_mkldnn_op.cc
--- a/paddle/fluid/operators/softmax_mkldnn_op.cc
+++ b/paddle/fluid/operators/softmax_mkldnn_op.cc
--- a/paddle/fluid/operators/sum_mkldnn_op.cc
+++ b/paddle/fluid/operators/sum_mkldnn_op.cc
--- a/paddle/fluid/operators/transpose_mkldnn_op.cc
+++ b/paddle/fluid/operators/transpose_mkldnn_op.cc
--- a/paddle/fluid/operators/ngraph/ngraph_bridge.cc
+++ b/paddle/fluid/operators/ngraph/ngraph_bridge.cc
--- a/paddle/fluid/operators/ngraph/ngraph_ops.h
+++ b/paddle/fluid/operators/ngraph/ngraph_ops.h
--- a/paddle/fluid/operators/ngraph/ops/accuracy_op.h
+++ b/paddle/fluid/operators/ngraph/ops/accuracy_op.h
--- a/paddle/fluid/operators/ngraph/ops/binary_unnary_op.h
+++ b/paddle/fluid/operators/ngraph/ops/binary_unnary_op.h
--- a/paddle/fluid/operators/ngraph/ops/conv2d_op.h
+++ b/paddle/fluid/operators/ngraph/ops/conv2d_op.h
--- a/paddle/fluid/operators/ngraph/ops/pool2d_op.h
+++ b/paddle/fluid/operators/ngraph/ops/pool2d_op.h
--- a/paddle/fluid/operators/ngraph/ops/top_k_op.h
+++ b/paddle/fluid/operators/ngraph/ops/top_k_op.h
--- a/paddle/fluid/operators/norm_op.h
+++ b/paddle/fluid/operators/norm_op.h
--- a/paddle/fluid/operators/pool_op.cc
+++ b/paddle/fluid/operators/pool_op.cc
--- a/paddle/fluid/operators/reader/ctr_reader.cc
+++ b/paddle/fluid/operators/reader/ctr_reader.cc
--- a/paddle/fluid/operators/reader/ctr_reader_test.cc
+++ b/paddle/fluid/operators/reader/ctr_reader_test.cc
--- a/paddle/fluid/operators/reduce_ops/CMakeLists.txt
+++ b/paddle/fluid/operators/reduce_ops/CMakeLists.txt
--- a/paddle/fluid/operators/reshape_op.cc
+++ b/paddle/fluid/operators/reshape_op.cc
--- a/paddle/fluid/operators/scale_op.cc
+++ b/paddle/fluid/operators/scale_op.cc
--- a/paddle/fluid/operators/slice_op.cc
+++ b/paddle/fluid/operators/slice_op.cc
--- a/paddle/fluid/operators/softmax_op.cc
+++ b/paddle/fluid/operators/softmax_op.cc
--- a/paddle/fluid/operators/stack_op.cu
+++ b/paddle/fluid/operators/stack_op.cu
--- a/paddle/fluid/operators/transpose_op.cu.cc
+++ b/paddle/fluid/operators/transpose_op.cu.cc
--- a/paddle/fluid/operators/yolov3_loss_op.h
+++ b/paddle/fluid/operators/yolov3_loss_op.h
--- a/paddle/fluid/platform/CMakeLists.txt
+++ b/paddle/fluid/platform/CMakeLists.txt
--- a/paddle/fluid/platform/cuda_device_function.h
+++ b/paddle/fluid/platform/cuda_device_function.h
--- a/paddle/fluid/platform/dynload/mklml.h
+++ b/paddle/fluid/platform/dynload/mklml.h
--- a/paddle/fluid/platform/enforce.h
+++ b/paddle/fluid/platform/enforce.h
--- a/paddle/fluid/platform/nccl_helper.h
+++ b/paddle/fluid/platform/nccl_helper.h
--- a/paddle/fluid/platform/ngraph_helper.h
+++ b/paddle/fluid/platform/ngraph_helper.h
--- a/paddle/fluid/pybind/CMakeLists.txt
+++ b/paddle/fluid/pybind/CMakeLists.txt
--- a/paddle/fluid/pybind/inference_api.cc
+++ b/paddle/fluid/pybind/inference_api.cc
--- a/paddle/fluid/pybind/pybind.cc
+++ b/paddle/fluid/pybind/pybind.cc
--- a/paddle/fluid/string/printf.h
+++ b/paddle/fluid/string/printf.h
--- a/paddle/scripts/fast_install.sh
+++ b/paddle/scripts/fast_install.sh
--- a/python/CMakeLists.txt
+++ b/python/CMakeLists.txt
--- a/python/paddle/fluid/__init__.py
+++ b/python/paddle/fluid/__init__.py
--- a/python/paddle/fluid/compiler.py
+++ b/python/paddle/fluid/compiler.py
--- a/python/paddle/fluid/contrib/__init__.py
+++ b/python/paddle/fluid/contrib/__init__.py
--- a/python/paddle/fluid/contrib/int8_inference/README.md
+++ b/python/paddle/fluid/contrib/int8_inference/README.md
--- a/python/paddle/fluid/contrib/int8_inference/__init__.py
+++ b/python/paddle/fluid/contrib/int8_inference/__init__.py
--- a/python/paddle/fluid/contrib/int8_inference/utility.py
+++ b/python/paddle/fluid/contrib/int8_inference/utility.py
--- a/python/paddle/fluid/contrib/tests/test_calibration.py
+++ b/python/paddle/fluid/contrib/tests/test_calibration.py
--- a/python/paddle/fluid/framework.py
+++ b/python/paddle/fluid/framework.py
--- a/python/paddle/fluid/imperative/layers.py
+++ b/python/paddle/fluid/imperative/layers.py
--- a/python/paddle/fluid/imperative/nn.py
+++ b/python/paddle/fluid/imperative/nn.py
--- a/python/paddle/fluid/initializer.py
+++ b/python/paddle/fluid/initializer.py
--- a/python/paddle/fluid/io.py
+++ b/python/paddle/fluid/io.py
--- a/python/paddle/fluid/layer_helper.py
+++ b/python/paddle/fluid/layer_helper.py
--- a/python/paddle/fluid/layers/detection.py
+++ b/python/paddle/fluid/layers/detection.py
--- a/python/paddle/fluid/layers/nn.py
+++ b/python/paddle/fluid/layers/nn.py
--- a/python/paddle/fluid/layers/ops.py
+++ b/python/paddle/fluid/layers/ops.py
--- a/python/paddle/fluid/optimizer.py
+++ b/python/paddle/fluid/optimizer.py
--- a/python/paddle/fluid/parallel_executor.py
+++ b/python/paddle/fluid/parallel_executor.py
--- a/python/paddle/fluid/tests/test_detection.py
+++ b/python/paddle/fluid/tests/test_detection.py
--- a/python/paddle/fluid/tests/unittests/CMakeLists.txt
+++ b/python/paddle/fluid/tests/unittests/CMakeLists.txt
--- a/python/paddle/fluid/tests/unittests/mkldnn/CMakeLists.txt
+++ b/python/paddle/fluid/tests/unittests/mkldnn/CMakeLists.txt
--- a/python/paddle/fluid/tests/unittests/mkldnn/__init__.py
+++ b/python/paddle/fluid/tests/unittests/mkldnn/__init__.py
--- a/python/paddle/fluid/tests/unittests/test_activation_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_activation_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_batch_norm_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_batch_norm_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_concat_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_concat_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_conv2d_int8_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_conv2d_int8_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_conv2d_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_conv2d_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_conv2d_transpose_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_conv2d_transpose_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_conv3d_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_conv3d_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_dequantize_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_dequantize_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_elementwise_add_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_elementwise_add_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_elementwise_mul_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_elementwise_mul_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_fc_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_fc_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_gaussian_random_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_gaussian_random_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_lrn_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_lrn_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_pool2d_int8_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_pool2d_int8_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_pool2d_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_pool2d_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_quantize_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_quantize_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_sum_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_sum_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/test_transpose_mkldnn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_transpose_mkldnn_op.py
--- a/python/paddle/fluid/tests/unittests/ngraph/test_accuracy_ngraph_op.py
+++ b/python/paddle/fluid/tests/unittests/ngraph/test_accuracy_ngraph_op.py
--- a/python/paddle/fluid/tests/unittests/ngraph/test_conv2d_ngraph_op.py
+++ b/python/paddle/fluid/tests/unittests/ngraph/test_conv2d_ngraph_op.py
--- a/python/paddle/fluid/tests/unittests/ngraph/test_pool2d_ngraph_op.py
+++ b/python/paddle/fluid/tests/unittests/ngraph/test_pool2d_ngraph_op.py
--- a/python/paddle/fluid/tests/unittests/parallel_executor_test_base.py
+++ b/python/paddle/fluid/tests/unittests/parallel_executor_test_base.py
--- a/python/paddle/fluid/tests/unittests/test_beam_search_op.py
+++ b/python/paddle/fluid/tests/unittests/test_beam_search_op.py
--- a/python/paddle/fluid/tests/unittests/test_bilinear_interp_op.py
+++ b/python/paddle/fluid/tests/unittests/test_bilinear_interp_op.py
--- a/python/paddle/fluid/tests/unittests/test_box_clip_op.py
+++ b/python/paddle/fluid/tests/unittests/test_box_clip_op.py
--- a/python/paddle/fluid/tests/unittests/test_box_coder_op.py
+++ b/python/paddle/fluid/tests/unittests/test_box_coder_op.py
--- a/python/paddle/fluid/tests/unittests/test_eager_deletion_transformer.py
+++ b/python/paddle/fluid/tests/unittests/test_eager_deletion_transformer.py
--- a/python/paddle/fluid/tests/unittests/test_imperative.py
+++ b/python/paddle/fluid/tests/unittests/test_imperative.py
--- a/python/paddle/fluid/tests/unittests/test_imperative_optimizer.py
+++ b/python/paddle/fluid/tests/unittests/test_imperative_optimizer.py
--- a/python/paddle/fluid/tests/unittests/test_imperative_ptb_rnn.py
+++ b/python/paddle/fluid/tests/unittests/test_imperative_ptb_rnn.py
--- a/python/paddle/fluid/tests/unittests/test_imperative_resnet.py
+++ b/python/paddle/fluid/tests/unittests/test_imperative_resnet.py
--- a/python/paddle/fluid/tests/unittests/test_inference_model_io.py
+++ b/python/paddle/fluid/tests/unittests/test_inference_model_io.py
--- a/python/paddle/fluid/tests/unittests/test_ir_inplace_pass.py
+++ b/python/paddle/fluid/tests/unittests/test_ir_inplace_pass.py
--- a/python/paddle/fluid/tests/unittests/test_layers.py
+++ b/python/paddle/fluid/tests/unittests/test_layers.py
--- a/python/paddle/fluid/tests/unittests/test_multiclass_nms_op.py
+++ b/python/paddle/fluid/tests/unittests/test_multiclass_nms_op.py
--- a/python/paddle/fluid/tests/unittests/test_nearest_interp_op.py
+++ b/python/paddle/fluid/tests/unittests/test_nearest_interp_op.py
--- a/python/paddle/fluid/tests/unittests/test_parallel_executor_seresnext.py
+++ b/python/paddle/fluid/tests/unittests/test_parallel_executor_seresnext.py
--- a/python/paddle/fluid/tests/unittests/test_parallel_executor_transformer.py
+++ b/python/paddle/fluid/tests/unittests/test_parallel_executor_transformer.py
--- a/python/paddle/fluid/tests/unittests/test_peak_gpumem_monitor.py
+++ b/python/paddle/fluid/tests/unittests/test_peak_gpumem_monitor.py
--- a/python/paddle/fluid/tests/unittests/test_yolov3_loss_op.py
+++ b/python/paddle/fluid/tests/unittests/test_yolov3_loss_op.py
--- a/python/paddle/fluid/tests/unittests/transformer_model.py
+++ b/python/paddle/fluid/tests/unittests/transformer_model.py
--- a/python/paddle/fluid/transpiler/details/__init__.py
+++ b/python/paddle/fluid/transpiler/details/__init__.py
--- a/python/paddle/fluid/transpiler/details/vars_distributed.py
+++ b/python/paddle/fluid/transpiler/details/vars_distributed.py
--- a/python/paddle/fluid/transpiler/distribute_transpiler.py
+++ b/python/paddle/fluid/transpiler/distribute_transpiler.py
--- a/python/paddle/fluid/transpiler/memory_optimization_transpiler.py
+++ b/python/paddle/fluid/transpiler/memory_optimization_transpiler.py
--- a/python/setup.py.in
+++ b/python/setup.py.in