Merge branch 'develop' of https://github.com/PaddlePaddle/FluidDoc into develop

73b63859 · tink2123 · 7340dc87 · abf5ac22 · 73b63859 · 73b63859
53 changed files
--- a/.travis.yml
+++ b/.travis.yml
@@ -14,11 +14,6 @@ services:
  - docker
 os:
  - linux
-env:
-  - JOB=doc
-  - JOB=lite_lib
-  - JOB=lite_lib2
-  - JOB=en_external_doc
 addons:
  apt:
@@ -32,20 +27,17 @@ before_install:
  -  sudo pip install pylint pytest astroid isort 
  # Load cached docker images
  #- if [[ -d $HOME/docker ]]; then ls $HOME/docker/*.tar.gz | xargs -I {file} sh -c "zcat {file} | docker load"; fi
-script:
+jobs:
-  - |
+  include:
-     if [ $JOB == "doc" ]; then scripts/deploy_docs.sh full
+    - script: scripts/deploy_docs.sh full 
-     fi
+      name: Generate Docs
+    - script: scripts/deploy_docs.sh pybind
-     if [ $JOB == "lite_lib" ]; then scripts/deploy_docs.sh pybind 
+      name: Cache pybind build
-     fi 
+    - script: scripts/deploy_docs.sh proto
+      name: Cache proto build
-     if [ $JOB == "lite_lib2" ]; then scripts/deploy_docs.sh proto
+    - script: scripts/deploy_en_external_docs.sh
-     fi 
+      name: Generate EN external docs
-     if [ $JOB == "en_external_doc" ]; then scripts/deploy_en_external_docs.sh 
-     fi 
 #before_cache:
 #  # Save tagged docker images

--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ When a release branch is pushed to Github, Travis-CI will start automatically to
 FluidDoc needs the Paddle Python module to compile the API documents. Unfortunately, compiling the Paddle Python module takes longer than Travis CI permits, so the build usually fails with a timeout. That's why there are three jobs on Travis; two of them build libraries. Once the libraries are cached on Travis, the next build will be a lot faster.
 ## Preview with PPO
-To preview documents constructed by FluidDoc, please follow the regular preview step, but replace the path to paddle with the path to FluidDoc
+To preview documents constructed by FluidDoc, please follow the [regular preview step](https://github.com/PaddlePaddle/PaddlePaddle.org/blob/develop/README.md), but replace the path to paddle with the path to FluidDoc
 `./runserver --paddle <path_to_FluidDoc_dir>`
 # Publish New release
@@ -21,5 +21,5 @@ To preview documents constructured by FluidDoc. Please follow the regular previe
 1. Make sure all the submodules are ready for release. Paddle, book, model, mobile and Anakin should all have stable commits. Note: the Paddle repo should update the API RST files accordingly if Paddle changes the included modules/classes. 
 1. Update the submodules under the `external` folder and commit the changes.
 1. Git push the branch to GitHub; Travis CI will start several builds to publish the documents to the PaddlePaddle.org server
-1. Please notify the PaddlePaddle.org team that the release content is ready. PaddlePaddl.org team should enable the version and update the default version to the latest one. PaddlePaddl.org should also update the search index accordingly (Until the search server is up)
+1. Please notify the PaddlePaddle.org team that the release content is ready. The PaddlePaddle.org team should enable the version and update the default version to the latest one. PaddlePaddle.org should also update the search index accordingly (until the search server is up)
--- a/doc/fluid/advanced_usage/deploy/build_and_install_lib_cn.rst
+++ b/doc/fluid/advanced_usage/deploy/build_and_install_lib_cn.rst
-.. _install_or_build_cpp_inference_lib:
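-Install and Build the C++ Inference Library
-=============================================
-Download and Install Directly
--------------------------------
-======================   ========================================
-Version                  C++ inference library
-======================   ========================================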
-cpu_avx_mkl              `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxCp27cp27mu/.lastSuccessful/fluid.tgz>`_ 
-cpu_avx_openblas         `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxOpenblas/.lastSuccessful/fluid.tgz>`_
-cpu_noavx_openblas       `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuNoavxOpenblas/.lastSuccessful/fluid.tgz>`_
-cuda7.5_cudnn5_avx_mkl   `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda75cudnn5cp27cp27mu/.lastSuccessful/fluid.tgz>`_
-cuda8.0_cudnn5_avx_mkl   `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda80cudnn5cp27cp27mu/.lastSuccessful/fluid.tgz>`_
-cuda8.0_cudnn7_avx_mkl   `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda8cudnn7cp27cp27mu/.lastSuccessful/fluid.tgz>`_
-cuda9.0_cudnn7_avx_mkl   `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda90cudnn7avxMkl/.lastSuccessful/fluid.tgz>`_
-======================   ========================================
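-Build from Source
--------------------
-Users can also build the C++ inference library from the PaddlePaddle core code; just set the following build options when compiling:
-=================   =================
-Option              Value
-=================   =================
-CMAKE_BUILD_TYPE    Release
-FLUID_INSTALL_DIR   installation path
-WITH_FLUID_ONLY     ON (recommended)
-WITH_SWIG_PY        OFF (recommended)
-WITH_PYTHON         OFF (recommended)
-WITH_GPU            ON/OFF
-WITH_MKL            ON/OFF
-=================   =================
-Using the recommended values is advised, to avoid linking unnecessary libraries. Set the other optional build options as needed.
-The snippet below pulls the latest code from GitHub and configures the build options (replace PADDLE_ROOT with the installation path of the PaddlePaddle inference library):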
-  .. code-block:: bash
-     pip install paddlepaddle-gpu
-     PADDLE_ROOT=/path/of/capi
-     git clone https://github.com/PaddlePaddle/Paddle.git
-     cd Paddle
-     mkdir build
-     cd build
-     cmake -DFLUID_INSTALL_DIR=$PADDLE_ROOT \
-           -DCMAKE_BUILD_TYPE=Release \
-           -DWITH_FLUID_ONLY=ON \
-           -DWITH_SWIG_PY=OFF \
-           -DWITH_PYTHON=OFF \
-           -DWITH_MKL=OFF \
-           -DWITH_GPU=OFF  \
-           ..
-     make
-     make inference_lib_dist
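-After a successful build, everything needed to use the C++ inference library (including: (1) the compiled PaddlePaddle inference library and headers; (2) third-party link libraries and headers; (3) version information and build options)
-is stored in the PADDLE_ROOT directory. The directory layout is as follows: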
-  .. code-block:: text
-     PaddleRoot/
-     ├── CMakeCache.txt
-     ├── paddle
-     │   └── fluid
-     │       ├── framework
-     │       ├── inference
-     │       ├── memory
-     │       ├── platform
-     │       ├── pybind
-     │       └── string
-     ├── third_party
-     │   ├── boost
-     │   │   └── boost
-     │   ├── eigen3
-     │   │   ├── Eigen
-     │   │   └── unsupported
-     │   └── install
-     │       ├── gflags
-     │       ├── glog
-     │       ├── mklml
-     │       ├── protobuf
-     │       ├── snappy
-     │       ├── snappystream
-     │       └── zlib
-     └── version.txt
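-version.txt records the version information of the inference library, including the Git commit ID, whether OpenBLAS or MKL is used as the math library, and the CUDA/cuDNN version numbers, for example: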
-  .. code-block:: text
-     GIT COMMIT ID: c95cd4742f02bb009e651a00b07b21c979637dc8
-     WITH_MKL: ON
-     WITH_GPU: ON
-     CUDA version: 8.0
-     CUDNN version: v5
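As an illustration of the version.txt format shown above, here is a minimal Python sketch that reads such a file into a dictionary. The function name `parse_version_txt` and the assumption that every line has the form `KEY: value` are hypothetical; the library itself does not ship such a helper.

```python
def parse_version_txt(text):
    """Parse 'KEY: value' lines (assumed format) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        info[key.strip()] = value.strip()
    return info


# Sample content copied from the example in the removed document.
sample = """GIT COMMIT ID: c95cd4742f02bb009e651a00b07b21c979637dc8
WITH_MKL: ON
WITH_GPU: ON
CUDA version: 8.0
CUDNN version: v5
"""

info = parse_version_txt(sample)
uses_gpu = info.get("WITH_GPU") == "ON"
```

A deployment script could use a check like `uses_gpu` to decide which runtime libraries to link, assuming the file keeps this layout.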
--- a/doc/fluid/advanced_usage/deploy/index_mobile.rst
+++ b/doc/fluid/advanced_usage/deploy/index_mobile.rst
@@ -4,6 +4,7 @@
 .. toctree::
   :maxdepth: 2
+   mobile_readme.md
   mobile_build.md
   mobile_dev.md
--- a/doc/fluid/advanced_usage/deploy/mobile_readme.md
+++ b/doc/fluid/advanced_usage/deploy/mobile_readme.md
+# Paddle-Mobile
+[![Build Status](https://travis-ci.org/PaddlePaddle/paddle-mobile.svg?branch=develop&longCache=true&style=flat-square)](https://travis-ci.org/PaddlePaddle/paddle-mobile)
+[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](https://github.com/PaddlePaddle/paddle-mobile/tree/develop/doc)
+[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)
+<!--[![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle-Mobile.svg)](https://github.com/PaddlePaddle/Paddle-Mobile/releases)
+[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)-->
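+Welcome to the Paddle-Mobile GitHub project.
+Paddle-Mobile is a project under the PaddlePaddle organization: a deep learning framework dedicated to embedded platforms. Its design is highly consistent with fluid, the latest version of PaddlePaddle, and it is heavily optimized for embedded use. Performance, binary size, power consumption, and hardware platform coverage on embedded devices were considered from the very start of the design.
+## Online results in the Simple Search app
+The GIF below shows main-object detection running in production in the Simple Search app
+![ezgif-1-050a733dfb](http://otkwwi4x8.bkt.clouddn.com/2018-07-05-ezgif-1-050a733dfb.gif)
+## Demo directory
+[Click here](https://github.com/PaddlePaddle/paddle-mobile/tree/develop/demo)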
+## Features
+- **ARM CPU**
+|mobilenet arm v7|1 thread|2 threads|4 threads|
+|------------|----|-----|-----|
+|Kirin 970 (ms)|108.180|63.935|37.545|
+|Kirin 960 (ms)|108.588|63.073|36.822|
+|Snapdragon 845 (ms)|85.952|48.890|28.641|
+|Snapdragon 835 (ms)|105.434|62.752|37.131|
+|||||
+|mobilenetssd arm v7|1 thread|2 threads|4 threads|
+|Kirin 970 (ms)|212.686|127.205|77.485|
+|Kirin 960 (ms)|212.641|125.338|75.250|
+|Snapdragon 845 (ms)|182.863|95.671|56.857|
+|Snapdragon 835 (ms)|213.849|127.717|77.006|
+|||||
+|googlenet(v1) arm v7|1 thread|2 threads|4 threads|
+|Kirin 970 (ms)|335.288|234.559|161.295|
+|Kirin 960 (ms)|354.443|232.642|157.815|
+|Snapdragon 845 (ms)|282.007|173.146|122.148|
+|Snapdragon 835 (ms)|341.250|233.354|158.554|
+|||||
+|squeezenet arm v7|1 thread|2 threads|4 threads|
+|Kirin 970 (ms)|83.726|57.944|36.923|
+|Kirin 960 (ms)|85.835|55.762|36.496|
+|Snapdragon 845 (ms)|71.301|41.618|28.785|
+|Snapdragon 835 (ms)|82.407|56.176|36.455|
+|||||
+|yolo arm v7|1 thread|2 threads|4 threads|
+|Kirin 970 (ms)|129.658|79.993|49.969|
+|Kirin 960 (ms)|130.208|78.791|48.390|
+|Snapdragon 845 (ms)|109.244|61.736|40.600|
+|Snapdragon 835 (ms)|130.402|80.863|50.359|
+    Test device information:
+    Kirin 970: Honor V10          (2.36GHz * 4 + 1.8GHz * 4)
+    Kirin 960: Huawei Mate 9      (2.36GHz * 4 + 1.8GHz * 4)
+    Snapdragon 835: Xiaomi Mi 6   (2.45GHz * 4 + 1.9GHz * 4)
+    Snapdragon 845: OPPO FindX    (2.80GHz * 4 + 1.8GHz * 4)
+- **Mali GPU**
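+    Mali GPU support is developed jointly by Baidu and ARM; both teams have recently been working to make Paddle ops run seamlessly on ACL (ARM Compute Library). Several network models, including squeezenet, googlenet, and resnet, are already supported, and the effort will keep growing so that all mobile Paddle ops run efficiently on Mali GPUs.
+- **GPU Metal implementation for Apple devices**
+    A Metal-based GPU inference library for Apple devices is also being implemented, and a runnable version will be available soon.
+- **FPGA**
+    The FPGA implementation is in progress and targets the Xilinx ZU5 development board.
+- **Flexibility**
+    * The CPU version of paddle-mobile depends on no third-party libraries and can be integrated quickly.
+    * Platform switching uses template specialization, so you can flexibly switch between CPU, GPU, and other coprocessors.
+    * Only the ops required by specific common networks need to be compiled, which reduces build time and package size.
+    * Builds run in docker, providing a uniform build environment.
+    * Highly extensible: other coprocessors are easy to add, and high-performance ARM operator implementations are provided so that coprocessor developers can integrate them easily.
+    * Directly compatible with paddle-fluid models; no extra conversion is needed.
+- **Size**
+    paddle-mobile has carefully considered package size on mobile devices since the start of its design; the CPU implementation has no external dependencies. During the build, ops that the network does not need are not packaged at all. Optimized build options also help reduce the size.
+    Beyond binary size, we also work hard to keep the code size down. The code size of the whole repository is very small as well.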
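+## Documentation
+### Design documentation
+The paddle-mobile design document is linked below. If you want to learn more, the [issues](https://github.com/PaddlePaddle/paddle-mobile/issues) contain many early design notes and discussions.
+[Design document](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/design_doc.md)
+### Development documentation
+The development documentation mainly covers building, running, and related topics. As a developer, you can use it together with the contribution guide.
+[Development document](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/doc/development_doc.md)
+### Contribution guide
+- [Contribution guide](https://github.com/PaddlePaddle/paddle-mobile/blob/develop/CONTRIBUTING.md)
+- The document above covers the main workflow for contributing code. If you run into other problems in practice, please open an [issue](https://github.com/PaddlePaddle/paddle-mobile/issues). We will handle it as soon as we see it.
+## Obtaining models
+Paddle-Mobile currently supports only models trained with Paddle fluid. If your model is of a different kind, it must be converted before it can run.
+### 1. Train directly with Paddle Fluid
+This is the most reliable approach and the recommended one
+### 2. Convert caffe models to Paddle Fluid models
+[Link](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/caffe2fluid)
+### 3. ONNX
+ONNX stands for "Open Neural Network Exchange". The goal of the project is to make different neural network frameworks interoperable.
+Besides training fluid models directly with PaddlePaddle, you can also obtain some Paddle fluid models through ONNX conversion.
+Baidu is currently working on ONNX support as well. The conversion project is here: [paddle-onnx](https://github.com/PaddlePaddle/paddle-onnx).
+![](http://7xop3k.com1.z0.glb.clouddn.com/15311951836000.jpg)
+### 4. Download some test models and test images
+[Download link](http://mms-graph.bj.bcebos.com/paddle-mobile%2FmodelsAndImages.zip)
+## Troubleshooting
+You are welcome to report problems or help solve them; if you have questions, open an issue. [Github Issues](https://github.com/PaddlePaddle/paddle-mobile/issues).
+## Copyright and License
+Paddle-Mobile is released under the relatively permissive Apache-2.0 open source license [Apache-2.0 license](LICENSE).
+## Legacy Mobile-Deep-Learning
+The original MDL (Mobile-Deep-Learning) project has been moved here: [Mobile-Deep-Learning](https://github.com/allonli/mobile-deep-learning)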
--- a/doc/fluid/advanced_usage/development/cpu_profiling_cn.md
+++ b/doc/fluid/advanced_usage/development/cpu_profiling_cn.md
-# CPU Performance Tuning
+../../../howto/optimization/cpu_profiling_cn.md
\ No newline at end of file
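-This tutorial shows how to use Python's cProfile package, the Python library yep, and Google perftools for profiling and performance tuning.
-Profiling means finding performance bottlenecks. The real bottlenecks in a system can be very different from what programmers imagine during development. Tuning means removing the bottlenecks. Performance optimization is usually a repeated cycle of profiling and tuning.
-PaddlePaddle users generally write deep learning programs by calling the Python API. Most Python API calls go into libpaddle.so, which is written in C++. Profiling and tuning PaddlePaddle therefore falls into two parts:
-* Profiling Python code
-* Profiling mixed Python and C++ code
-## Profiling Python code
-### Generating the profile file
-The Python standard library provides a profiling package, [cProfile](https://docs.python.org/2/library/profile.html). The command to generate a Python profile is: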
-```bash
-python -m cProfile -o profile.out main.py
-```
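-Here `main.py` is the program to profile, and `-o` names the output file that stores the result of this profiling run. If no file is given, `cProfile` prints to standard output.
-### Viewing the profile file
-`cProfile` writes `profile.out` after main.py finishes. We can use [`cprofilev`](https://github.com/ymichael/cprofilev) to view the result. `cprofilev` is a third-party Python library; it starts an HTTP service that presents the profiling result as a web page: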
-```bash
-cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
-```
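-Here `-a` is the IP address the HTTP service binds to; using `0.0.0.0` allows access from external networks. `-p` is the port of the HTTP service. `-f` is the profiling result file. `main.py` is the source file being profiled.
-Open the corresponding URL in a web browser to see the profiling result: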
-```
-   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
-        1    0.284    0.284   29.514   29.514 main.py:1(<module>)
-     4696    0.128    0.000   15.748    0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/executor.py:20(run)
-     4696   12.040    0.003   12.040    0.003 {built-in method run}
-        1    0.144    0.144    6.534    6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
-```
-The meaning of each column is:
-<table>
-<thead>
-<tr>
-<th>Column</th>
-<th>Meaning</th>
-</tr>
-</thead>
-<tbody>
-<tr>
-<td> ncalls</td>
-<td> Number of times the function was called</td>
-</tr>
-<tr>
-<td>tottime</td>
-<td> Total time actually spent in the function itself, excluding time spent in the functions it calls</td>
-</tr>
-<tr>
-<td> percall </td>
-<td> Average tottime per call</td>
-</tr>
-<tr>
-<td> cumtime</td>
-<td> Cumulative time of the function, including time spent in the functions it calls</td>
-</tr>
-<tr>
-<td> percall</td>
-<td> Average cumtime per call</td>
-</tr>
-<tr>
-<td> filename:lineno(function) </td>
-<td> File name, line number, function name </td>
-</tr>
-</tbody>
-</table>
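-### Finding bottlenecks
-`tottime` and `cumtime` are usually the key metrics for finding bottlenecks. These two metrics represent the actual running time of a function.
-Sorting the profiling result by tottime gives: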
-```text
-     4696   12.040    0.003   12.040    0.003 {built-in method run}
-   300005    0.874    0.000    1.681    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
-   107991    0.676    0.000    1.519    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:219(__init__)
-     4697    0.626    0.000    2.291    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)
-        1    0.618    0.618    0.618    0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/__init__.py:1(<module>)
-```
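-The most time-consuming function is the `run` function on the C++ side. Tuning it requires the mixed `Python` and `C++` profiling described in the second part. The `sync_with_cpp` function also has a long total time and a long time per call, so we can click on the details of `sync_with_cpp` to see its call relationships.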
-```text
-Called By:
-   Ordered by: internal time
-   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
-Function                                                                                                 was called by...
-                                                                                                             ncalls  tottime  cumtime
-/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)  <-    4697    0.626    2.291  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)
-/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)  <-    4696    0.019    2.316  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:487(clone)
-                                                                                                                  1    0.000    0.001  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:534(append_backward)
-Called:
-   Ordered by: internal time
-   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
-```
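-Looking at the call relationships between hot functions, together with the code on the corresponding lines, is usually enough to locate the problematic code. After making a performance fix, profile again to check whether the fix actually improves the program's performance.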
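The workflow above (profile, then sort by `tottime`) can also be done entirely from Python with the standard `cProfile` and `pstats` modules. The sketch below is an assumption-laden illustration written for Python 3, whereas the removed document targets Python 2.7; `busy` and `main` are hypothetical stand-ins for a real training script.

```python
import cProfile
import io
import pstats


def busy(n):
    # Hypothetical hot function standing in for real work.
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    return [busy(20000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Sort by time spent inside each function itself and keep the top 5 rows.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats("tottime").print_stats(5)
report = buf.getvalue()
print(report)
```

Replacing `"tottime"` with `"cumtime"` sorts by cumulative time instead, which helps when the cost is spread across callees.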
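-## Profiling mixed Python and C++ code
-### Generating the profile file
-There are many profiling tools for C++. Common ones include `gprof`, `valgrind`, and `google-perftools`. However, debugging a shared library used from Python is much more complex than debugging a plain binary directly. Fortunately, the third-party Python library `yep` provides a convenient way to interact with `google-perftools`, so `yep` is used here to profile mixed Python and C++ code
-Before using `yep`, install `google-perftools` and the `yep` package. On Ubuntu the installation commands are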
-```bash
-apt update
-apt install libgoogle-perftools-dev
-pip install yep
-```
-After installation, we can run
-```bash
-python -m yep -v main.py
-```
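-to generate the profile file. The generated profile file is `main.py.prof`.
-The `-v` option prints the analysis result on the command line after the profile file is generated, which gives a quick look at the output. Unlike Python, C++ code may have its debug information stripped at compile time, and multithreading at run time can produce confusing, unreadable profiling results. To get a more readable result, take the following measures:
-1. Compile with `-g` to generate debug information. With cmake, set CMAKE_BUILD_TYPE to `RelWithDebInfo`.
-2. Always enable optimization when compiling. A plain `Debug` build performs very differently from `-O2` or `-O3`; performance tests in `Debug` mode are meaningless.
-3. When profiling, start with a single thread, then enable multithreading, and then move to multiple machines, since single-threaded debugging is easier. Setting the environment variable `OMP_NUM_THREADS=1` turns off the openmp optimization.
-### Viewing the profile file
-Running the profiler produces a profiling result file. We can use [`pprof`](https://github.com/google/pprof) to display the result. Note that this is the `pprof` rewritten in `Go`, because that tool has a web interface and presents the result better.
-`pprof` is installed the same way as any other `Go` program, with the following command: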
-```bash
-go get github.com/google/pprof
-```
-We can then start an HTTP service with the following command:
-```bash
-pprof -http=0.0.0.0:3213 `which python`  ./main.py.prof
-```
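-In this command, `-http` starts an HTTP service. `which python` yields the full path of the current Python binary and thereby specifies the path of the Python executable. `./main.py.prof` supplies the profiling result.
-Visit the corresponding URL to view the profiling result, which looks like the figure below: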
-![result](./pprof_1.png)
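-### Finding bottlenecks
-As with pure Python code, finding bottlenecks in mixed Python and C++ code means looking at `tottime` and `cumtime`. The call graph shown by `pprof` can also help uncover performance problems.
-For example, in the figure below,
-![kernel_perf](./pprof_2.png)
-during one training run, multiplication and its gradient take about 2%-4% of the compute time, while `MomentumOp` takes about 17%. Clearly, `MomentumOp` has a performance problem.
-`pprof` marks the performance-critical path in red. Checking the critical path first and the other parts afterwards makes the optimization work more orderly.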
--- a/doc/fluid/advanced_usage/development/host_memory_profiling_cn.md
+++ b/doc/fluid/advanced_usage/development/host_memory_profiling_cn.md
-# Heap Memory Profiling and Optimization
+../../../howto/optimization/host_memory_profiling_cn.md
\ No newline at end of file
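-All computer programs run the risk of memory leaks. A **memory leak** generally happens when a program allocates memory on the heap and never frees it. As the program runs, it occupies more and more memory, which affects the stability of the program, may make it run slower and slower or cause an OOM, and can even affect the stability of the machine running it and bring the machine down.
-There are many memory-leak analysis tools; classic ones include [valgrind](http://valgrind.org/docs/manual/quick-start.html#quick-start.intro) and [gperftools](https://gperftools.github.io/gperftools/).
-Because Fluid runs by driving the C++ core from Python, analyzing it directly with valgrind is very difficult: you need to build a dedicated debug version of Python with valgrind support, most of the output consists of Python's own symbols and call information and is hard to analyze, and valgrind also makes the program run very slowly. It is therefore not recommended.
-This tutorial mainly covers the use of [gperftools](https://gperftools.github.io/gperftools/).
-gperftools mainly supports the following four features:
-- thread-caching malloc
-- heap-checking using tcmalloc
-- heap-profiling using tcmalloc
-- CPU profiler
-Paddle also provides a [CPU profiling tutorial](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/howto/optimization/cpu_profiling_cn.md) based on gperftools.
-Heap memory analysis mainly uses thread-caching malloc and heap-profiling using tcmalloc.
-## Environment
-This tutorial is based on the Docker development environment provided by paddle, paddlepaddle/paddle:latest-dev, which is based on Ubuntu 16.04.4 LTS.
-## Workflow
-- Install google-perftools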
-```
-apt-get install libunwind-dev 
-apt-get install google-perftools
-```
-- Install pprof
-```
-go get -u github.com/google/pprof
-```
-- Set up the runtime environment
-```
-export PPROF_PATH=/root/gopath/bin/pprof
-export PPROF_BINARY_PATH=/root/gopath/bin/pprof
-export LD_PRELOAD=/usr/lib/libtcmalloc.so.4
-```
-- Run the Python program with the heap profiler. In essence, this periodically takes a snapshot of the heap allocation state.
-```
-# HEAPPROFILE sets the directory and file-name prefix of the generated heap profile files
-# HEAP_PROFILE_ALLOCATION_INTERVAL sets how many bytes are allocated between dumps; the default is 1GB
-env HEAPPROFILE="./perf_log/test.log" HEAP_PROFILE_ALLOCATION_INTERVAL=209715200 python trainer.py
-```
-As the program runs, many files are generated in the perf_log folder, as follows:
-```
--rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0001.heap
--rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0002.heap
--rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0003.heap
--rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0004.heap
--rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0005.heap
--rw-r--r-- 1 root root 1.0M Jun  1 15:00 test.log.0006.heap
-```
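-- Analyze the heap files with pprof. There are two analysis modes:
-	- Full mode. Analyzes the current heap and shows some of the call paths that currently allocate memory.
-	```
-	pprof --pdf python test.log.0012.heap
-	```
-	The command above generates a file named profile00x.pdf that can be opened directly, for example: [memory_cpu_allocator](https://github.com/jacquesqiao/Paddle/blob/bd2ea0e1f84bb6522a66d44a072598153634cade/doc/fluid/howto/optimization/memory_cpu_allocator.pdf). The figure below shows that, when running the CPU version of fluid, the module that allocates the most memory is CPUAllocator. The other modules allocate comparatively little memory and are therefore omitted, which is inconvenient for analyzing memory leaks: a leak is a slow process and cannot be seen in this kind of graph.
-	![result](https://user-images.githubusercontent.com/3048612/40964027-a54033e4-68dc-11e8-836a-144910c4bb8c.png)
-	- Diff mode. Diffs the heap at two points in time, dropping the modules whose memory allocation did not change and showing only the increments.
-	```
-	pprof --pdf --base test.log.0010.heap python test.log.1045.heap
-	```
-	The generated result is: [`memory_leak_protobuf`](https://github.com/jacquesqiao/Paddle/blob/bd2ea0e1f84bb6522a66d44a072598153634cade/doc/fluid/howto/optimization/memory_leak_protobuf.pdf)
-	The graph shows that the ProgramDesc structure grew by more than 200MB between the two snapshots, so a memory leak here is very likely; the final result did confirm that this was the cause of the leak.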
-	![result](https://user-images.githubusercontent.com/3048612/40964057-b434d5e4-68dc-11e8-894b-8ab62bcf26c2.png)
-	![result](https://user-images.githubusercontent.com/3048612/40964063-b7dbee44-68dc-11e8-9719-da279f86477f.png)
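The heap profiler settings used above are plain environment variables, so they can be assembled programmatically. The following minimal sketch shows where the value 209715200 comes from (200 MiB expressed in bytes); the helper name `heap_profile_env` is hypothetical, and it assumes the `HEAPPROFILE` and `HEAP_PROFILE_ALLOCATION_INTERVAL` variables behave as the comments in the removed document describe.

```python
import os

MIB = 1024 * 1024


def heap_profile_env(prefix, interval_mib, base_env=None):
    """Return an environment dict that enables heap profiling (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["HEAPPROFILE"] = prefix  # directory and file-name prefix of the dumps
    # The interval is given in bytes: dump once per interval_mib MiB allocated.
    env["HEAP_PROFILE_ALLOCATION_INTERVAL"] = str(interval_mib * MIB)
    return env


env = heap_profile_env("./perf_log/test.log", 200, base_env={})
print(env["HEAP_PROFILE_ALLOCATION_INTERVAL"])
```

Such a dictionary could be passed as the `env` argument of `subprocess.run(["python", "trainer.py"], env=...)`, with `LD_PRELOAD` set as in the setup step.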
--- a/doc/fluid/advanced_usage/development/new_op.md
+++ b/doc/fluid/advanced_usage/development/new_op.md
-# How to Write a New Operator
+../../../dev/new_op_cn.md
\ No newline at end of file
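- - [Concepts](#concepts)
- - [Implementing the C++ classes](#implementing-the-c-classes)
-   - [Defining the ProtoMaker class](#defining-the-protomaker-class)
-   - [Defining the Operator class](#defining-the-operator-class)
-   - [Defining the OpKernel class](#defining-the-opkernel-class)
-   - [Registering the Operator](#registering-the-operator)
-   - [Building](#building)
- - [Python binding](#python-binding)
- - [Implementing unit tests](#implementing-unit-tests)
-   - [Forward Operator unit test](#forward-operator-unit-test)
-   - [Backward Operator unit test](#backward-operator-unit-test)
-   - [Building and running](#building-and-running)
- - [Notes](#notes)
-## Concepts
-This section briefly introduces the base classes involved; see the design documents for details.
-- `framework::OperatorBase`: the base class of Operator (Op for short).
-- `framework::OpKernel`: the base class of an Op's compute function, called the Kernel.
-- `framework::OperatorWithKernel`: inherits from OperatorBase; an Op with a compute function is said to have a Kernel.
-- `class OpProtoAndCheckerMaker`: describes the Op's inputs, outputs, attributes, and comments; mainly used to generate the Python API.
-Depending on whether they include a kernel, Ops fall into two kinds: Ops with a Kernel and Ops without one. The former are defined by inheriting from `OperatorWithKernel`, the latter by inheriting from `OperatorBase`. This tutorial mainly explains how to write an Op with a Kernel. In brief, an Op needs to contain the following: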
-<table>
-<thead>
-<tr>
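-<th>Content</th>
-<th>Where it is defined</th>
-</tr>
-</thead>
-<tbody>
-<tr>
-<td>OpProtoMaker definition </td>
-<td>.cc file; a Backward Op does not need an OpProtoMaker definition </td>
-</tr>
-<tr>
-<td>Op definition </td>
-<td> .cc file</td>
-</tr>
-<tr>
-<td>Kernel implementation </td>
-<td> A Kernel shared by CPU and CUDA is implemented in the .h file; otherwise the CPU implementation goes in the .cc file and the CUDA implementation in the .cu file.</td>
-</tr>
-<tr>
-<td>Op registration </td>
-<td> The Op is registered in the .cc file; the CPU Kernel is registered in the .cc file and the CUDA Kernel in the .cu file</td>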
-</tr>
-</tbody>
-</table>
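-New ops are all added under the directory [paddle/fluid/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators), with file names ending in `*_op.h` (if any), `*_op.cc`, and `*_op.cu` (if any). **The system automatically builds the op and its Python extension based on the file names.**
-The following uses the matrix multiplication operation, [MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc), as an example of how to write an Operator with a Kernel.
-## Implementing the C++ classes
-### Defining the ProtoMaker class
-The formula for matrix multiplication is $Out = X * Y$, so the computation consists of two inputs and one output.
-First define a `ProtoMaker` to describe the Op's inputs and outputs and to add comments: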
-```cpp
-class MulOpMaker : public framework::OpProtoAndCheckerMaker {
- public:
-  MulOpMaker(OpProto *proto, OpAttrChecker *op_checker)
-      : OpProtoAndCheckerMaker(proto, op_checker) {
-    AddInput("X", "(Tensor), 2D tensor of size (M x K)");
-    AddInput("Y", "(Tensor), 2D tensor of size (K x N)");
-    AddOutput("Out", "(Tensor), 2D tensor of size (M x N)");
-    AddComment(R"DOC(
-Two Element Mul Operator.
-The equation is: Out = X * Y
-)DOC");
-  }
-};
-```
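-[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L76-L127) inherits from `framework::OpProtoAndCheckerMaker`; its constructor takes 2 parameters:
-   - `framework::OpProto`: stores the Op's inputs, outputs, and attributes, and is used to generate the Python API.
-   - `framework::OpAttrChecker`: checks the validity of the attributes.
-Inside the constructor, `AddInput` adds an input parameter, `AddOutput` adds an output parameter, and `AddComment` adds the Op's comment. These functions add the corresponding content to `OpProto`.
-The code above adds two inputs `X` and `Y` and one output `Out` to `MulOp` and explains what each means. Please follow the [naming convention](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/dev/name_convention.md) when naming them.
-Take [`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/scale_op.cc#L38-L55) as another example: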
-```cpp
-template <typename AttrType>
-class ScaleOpMaker : public framework::OpProtoAndCheckerMaker {
- public:
-  ScaleOpMaker(OpProto *proto, OpAttrChecker *op_checker)
-      : OpProtoAndCheckerMaker(proto, op_checker) {
-    AddInput("X", "(Tensor) Input tensor of scale operator.");
-    AddOutput("Out", "(Tensor) Output tensor of scale operator.");
-    AddComment(R"DOC(
-Scale operator
-$$Out = scale*X$$
-)DOC");
-    AddAttr<AttrType>("scale",
-                      "(float, default 1.0)"
-                      "The scaling factor of the scale operator.")
-        .SetDefault(1.0);
-  }
-};
-```
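-This example also contains `AddAttr<AttrType>("scale", "...").SetDefault(1.0);`, which adds the `scale` factor as an attribute and sets its default value to 1.0.
-### Defining the GradProtoMaker class
-Every Op must have a corresponding GradProtoMaker. If no GradProtoMaker is customized for the forward Op, fluid provides DefaultGradProtoMaker; the default registration uses all inputs and outputs, including Input, Output, Output@Grad, and so on, and using variables that are not needed wastes GPU memory.
-The example below defines the GradProtoMaker of ScaleOp.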
-```cpp
-class ScaleGradMaker : public framework::SingleGradOpDescMaker {
- public:
-  using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
-  std::unique_ptr<framework::OpDesc> Apply() const override {
-    auto *grad_op = new framework::OpDesc();
-    grad_op->SetType("scale");
-    grad_op->SetInput("X", OutputGrad("Out"));
-    grad_op->SetOutput("Out", InputGrad("X"));
-    grad_op->SetAttr("scale", GetAttr("scale"));
-    return std::unique_ptr<framework::OpDesc>(grad_op);
-  }
-};
-```
-### Defining the Operator class
-The definition of MulOp is implemented below:
-```cpp
-class MulOp : public framework::OperatorWithKernel {
- public:
-  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
-  void InferShape(const framework::InferShapeContext &ctx) const override {
-    auto dim0 = ctx.Input<Tensor>("X")->dims();
-    auto dim1 = ctx.Input<Tensor>("Y")->dims();
-    PADDLE_ENFORCE_EQ(dim0.size(), 2,
-                      "input X(%s) should be a tensor with 2 dims, a matrix",
-                      ctx.op_.Input("X"));
-    PADDLE_ENFORCE_EQ(dim1.size(), 2,
-                      "input Y(%s) should be a tensor with 2 dims, a matrix",
-                      ctx.op_.Input("Y"));
-    PADDLE_ENFORCE_EQ(
-        dim0[1], dim1[0],
-        "First matrix's width must be equal with second matrix's height.");
-    ctx.Output<Tensor>("Out")->Resize({dim0[0], dim1[1]});
-  }
-};
-```
-[`MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L22) inherits from `OperatorWithKernel`. Its `public` member:
-```cpp
-using framework::OperatorWithKernel::OperatorWithKernel;
-```
-This statement means the constructors of the base class `OperatorWithKernel` are used; it can also be written as:
-```cpp
-MulOp(const std::string &type, const framework::VariableNameMap &inputs,
-      const framework::VariableNameMap &outputs,
-      const framework::AttributeMap &attrs)
-  : OperatorWithKernel(type, inputs, outputs, attrs) {}
-```
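-The `InferShape` interface must also be overridden. `InferShape` is a const function and cannot modify the Op's member variables; its parameter is `const framework::InferShapeContext &ctx`, through which the inputs, outputs, and attributes can be obtained. Its job is to:
-  - 1). Validate and fail early: check whether the dimensions, types, and so on of the input data are valid.
-  - 2). Set the shape of the output Tensor.
-The definitions of the `OpProtoMaker` and `Op` classes are usually written in the `.cc` file, together with the registration functions introduced below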
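The two duties of `InferShape` described above (validate early, then set the output shape) can be illustrated outside the framework. The Python sketch below mirrors the checks of the `MulOp` example; `infer_mul_shape` is a hypothetical name, and the code is not the Paddle implementation.

```python
def infer_mul_shape(x_dims, y_dims):
    """Return the shape of Out = X * Y, raising early on invalid input shapes."""
    if len(x_dims) != 2:
        raise ValueError(
            "Input X should be a tensor with 2 dims (a matrix), got %d dims" % len(x_dims))
    if len(y_dims) != 2:
        raise ValueError(
            "Input Y should be a tensor with 2 dims (a matrix), got %d dims" % len(y_dims))
    if x_dims[1] != y_dims[0]:
        raise ValueError(
            "Width of X (%d) must equal height of Y (%d)" % (x_dims[1], y_dims[0]))
    # Output shape is (rows of X, columns of Y).
    return (x_dims[0], y_dims[1])


out_shape = infer_mul_shape((32, 84), (84, 100))
print(out_shape)
```

Raising before any computation happens matches the "fail early" intent: a shape mismatch is reported with both offending values instead of surfacing later inside the kernel.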
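-### Defining the OpKernel class
-`MulKernel` inherits from `framework::OpKernel` and takes the following two template parameters:
-- `typename DeviceContext`: the device type. This template parameter is needed when different devices (CPU, CUDA) share the same Kernel and is omitted when they do not; an example that is not shared is [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.h#L43).
-- `typename T`: the data type, such as `float` or `double`.
-The `MulKernel` class must override the `Compute` interface.
-- `Compute` takes one input parameter: `const framework::ExecutionContext& context`.
-- Compared with `InferShapeContext`, `ExecutionContext` adds the device type; it likewise gives access to the inputs, outputs, and attributes.
-- The `Compute` function implements the actual computation logic of the `OpKernel`.
-Below is the implementation of `Compute` for `MulKernel`: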
-  ```cpp
-  template <typename DeviceContext, typename T>
-  class MulKernel : public framework::OpKernel<T> {
-  public:
-  void Compute(const framework::ExecutionContext& context) const override {
-    auto* X = context.Input<Tensor>("X");
-    auto* Y = context.Input<Tensor>("Y");
-    auto* Z = context.Output<Tensor>("Out");
-    Z->mutable_data<T>(context.GetPlace());
-    auto& device_context = context.template device_context<DeviceContext>();
-    math::matmul<DeviceContext, T>(*X, false, *Y, false, 1, Z, 0, device_context);
-  }
-  };
-  ```
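-Note: **different devices (CPU, CUDA) share one Op definition; whether they also share the same `OpKernel` depends on whether the functions called by `Compute` support different devices.**
-The CPU and CUDA implementations of `MulOp` share the same `Kernel`. For an example where the `OpKernel` is not shared, see [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.h#L43).
-To make the computation in an `OpKernel` easier to write and to allow the CPU and CUDA code to be reused, we usually implement the `Compute` interface with the Eigen unsupported Tensor module. For how to use the Eigen library in PaddlePaddle, see the [usage document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/dev/use_eigen_cn.md).
-This completes the forward Op. Next, the op and kernel must be registered in the `.cc` file.
-The definitions of the backward Op class and the backward OpKernel are similar to those of the forward Op and are not repeated here. **Note, however, that a backward Op has no `ProtoMaker`**.
-### Registering the Operator
-- Register the forward and backward Op classes and the CPU Kernel in the `.cc` file.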
-    ```cpp
-    namespace ops = paddle::operators;
-    REGISTER_OPERATOR(mul, ops::MulOp, ops::MulOpMaker,
-                  paddle::framework::DefaultGradOpDescMaker<true>);
-    REGISTER_OPERATOR(mul_grad, ops::MulGradOp);
-    REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel<paddle::platform::CPUDeviceContext, float>);
-    REGISTER_OP_CPU_KERNEL(mul_grad,
-                  ops::MulGradKernel<paddle::platform::CPUDeviceContext, float>);
-    ```
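-   In the code above:
-    - `REGISTER_OPERATOR`: registers the `ops::MulOp` class under the type name `mul`, with `ops::MulOpMaker` as its `ProtoMaker`, and registers `ops::MulGradOp` under the type name `mul_grad`.
-    - `REGISTER_OP_CPU_KERNEL`: registers the `ops::MulKernel` class, specializing its template parameters to `paddle::platform::CPUDeviceContext` and `float`; `ops::MulGradKernel` is registered in the same way.
-- Register the CUDA Kernel in the `.cu` file.
-    - Note that if the CUDA Kernel implementation is based on the Eigen unsupported module, the macro definition `#define EIGEN_USE_GPU` must be added at the beginning of the `.cu` file, as in the following example: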
-    ```cpp
-    // Define this before including header files if the Eigen unsupported module is used
-    #define EIGEN_USE_GPU
-    namespace ops = paddle::operators;
-    REGISTER_OP_CUDA_KERNEL(mul, ops::MulKernel<paddle::platform::CUDADeviceContext, float>);
-    REGISTER_OP_CUDA_KERNEL(mul_grad,
-                           ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>);
-    ```
-### Building
-Run the following command to build:
-```
-make mul_op
-```
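-## Python binding
-The system automatically creates the Python binding for a newly added op and links it into the generated library.
-## Implementing unit tests
-The unit tests compare the implementations of the forward Op on different devices (CPU, CUDA), compare the implementations of the backward Op on different devices (CPU, CUDA), and test the gradient of the backward Op. The [unit test of `MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_mul_op.py) is described below.
-### Forward Operator unit test
-An Op unit test inherits from `OpTest`. The more specific tests are done in `TestMulOp`. To test an Operator, you need to:
-1. Define the inputs, outputs, and related attributes in the `setUp` function.
-2. Generate random input data.
-3. Implement the same computation as the forward operator in the Python script to obtain the output values, and compare them with the output of the operator's forward computation.
-4. The backward computation is already integrated into the test framework; simply call the corresponding interface.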
-  ```python
-  import unittest
-  import numpy as np
-  from op_test import OpTest
-  class TestMulOp(OpTest):
-      def setUp(self):
-          self.op_type = "mul"
-          self.inputs = {
-              'X': np.random.random((32, 84)).astype("float32"),
-              'Y': np.random.random((84, 100)).astype("float32")
-          }
-          self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])}
-      def test_check_output(self):
-          self.check_output()
-      def test_check_grad_normal(self):
-          self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
-      def test_check_grad_ingore_x(self):
-          self.check_grad(
-              ['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
-      def test_check_grad_ingore_y(self):
-          self.check_grad(
-              ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
-  ```
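-The code above first imports the required packages. The important variables set in the `setUp` function are explained below:
-- `self.op_type = "mul" `: defines the type, which must match the type registered for the operator.
-- `self.inputs`: defines the inputs, of type `numpy.array`, and initializes them.
-- `self.outputs`: defines the outputs, performs the same computation as the operator in the Python script, and returns the result computed on the Python side.
-### Backward Operator unit test
-In the backward test:
-- `test_check_grad_normal` calls `check_grad`, which uses a numerical method to check the correctness and stability of the gradient.
-  - The first argument `["X", "Y"]`: specifies that the gradients of the input variables `X` and `Y` are checked.
-  - The second argument `"Out"`: specifies the final output target variable `Out` of the forward network.
-  - The third argument `max_relative_error`: specifies the maximum error tolerated when checking the gradient.
-- The `test_check_grad_ingore_x` and `test_check_grad_ingore_y` branches test the cases where the gradient of only one input needs to be computed.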
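To make the idea of a numerical gradient check concrete, the sketch below compares a central-difference gradient with the analytic gradient for `Out = X * Y`, using the sum of all entries of `Out` as the scalar objective. This is an independent illustration in pure Python, not the algorithm `OpTest.check_grad` actually uses; all function names and the step size `delta` are assumptions.

```python
import random


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def loss(x, y):
    # Scalar objective: sum of all entries of Out = X * Y.
    return sum(sum(row) for row in matmul(x, y))


def numeric_grad_x(x, y, delta=1e-3):
    # Central finite differences with respect to every entry of X.
    grad = [[0.0] * len(x[0]) for _ in x]
    for i in range(len(x)):
        for k in range(len(x[0])):
            orig = x[i][k]
            x[i][k] = orig + delta
            plus = loss(x, y)
            x[i][k] = orig - delta
            minus = loss(x, y)
            x[i][k] = orig  # restore the perturbed entry
            grad[i][k] = (plus - minus) / (2 * delta)
    return grad


def analytic_grad_x(x, y):
    # d(sum(X * Y)) / dX[i][k] equals the sum of row k of Y.
    row_sums = [sum(row) for row in y]
    return [list(row_sums) for _ in x]


def max_relative_error(a, b):
    return max(abs(p - q) / max(abs(p), abs(q), 1e-12)
               for ra, rb in zip(a, b) for p, q in zip(ra, rb))


random.seed(0)
x = [[random.random() for _ in range(4)] for _ in range(3)]
y = [[random.random() for _ in range(5)] for _ in range(4)]
err = max_relative_error(numeric_grad_x(x, y), analytic_grad_x(x, y))
print(err)
```

A test would then assert that `err` stays below a chosen tolerance, which is the role `max_relative_error` plays in the calls above.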
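-### Building and running
-New `test_*.py` unit tests added under `python/paddle/fluid/tests/unittests/` are automatically added to the project and built.
-Note that, **unlike building and testing an Op, running the unit tests requires building the whole project**, and `WITH_TESTING` must be turned on at build time, that is, `cmake paddle_dir -DWITH_TESTING=ON`. After a successful build, run the unit test with the following command: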
-```bash
-make test ARGS="-R test_mul_op -V"
-```
-Or:
-```bash
-ctest -R test_mul_op
-```
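-## Notes
- The type name used when registering an Op must match the Op's name. Registering `REGISTER_OPERATOR(B, ...)` inside `A_op.cc`, for example, is not allowed and will cause unit tests to fail.
- If an Op has no CUDA Kernel, do not create an empty `*_op.cu`; it will cause unit tests to fail.
- If several Ops depend on shared functions, put them in a file that does not match the `*_op.*` pattern, such as `gather.h`.
-### Notes on using PADDLE_ENFORCE
-When implementing an Op, validate data with macros such as PADDLE_ENFORCE and PADDLE_ENFORCE_EQ. The basic form is: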
-```
-PADDLE_ENFORCE(expression, error message)
-PADDLE_ENFORCE_EQ(operand A, operand B, error message)
-```
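-If the expression is true, or operand A equals operand B, the check passes; otherwise the program terminates and reports the error message to the user.
-To keep the messages friendly and easy to understand, developers should pay attention to how these macros are used.
-#### General principle
-Every check that uses `PADDLE_ENFORCE` or one of the `PADDLE_ENFORCE_*` macros must come with a suitably detailed explanation. **The error message must not be empty!**
-#### Error message writing standard
-1. [required] What went wrong, and why?
-    - Example: `ValueError: Mismatched label shape`
-2. [optional] What input was expected, and what was actually received?
-    - Example: `Expected labels dimension=1. Received 4.`
-3. [optional] Can a fix be suggested?
-    - Example: `Suggested Fix: If your classifier expects one-hot encoding label, check your n_classes argument to the estimator and/or the shape of your label. Otherwise, check the shape of your label.`
-If these points are not all necessary, or a concise description already conveys them, write the message as the situation requires.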
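The three-part structure above can be illustrated with a small language-neutral sketch, written here in Python. `build_enforce_message` is a hypothetical helper introduced only for illustration; it is not part of Paddle, and in C++ the resulting string would simply be passed as the message argument of the macro.

```python
# Hypothetical helper: assembles an error message from the required
# "what went wrong" part and the two optional parts described above.

def build_enforce_message(what, expected=None, received=None, suggestion=None):
    # The "what went wrong" part is required and must not be empty.
    if not what or not what.strip():
        raise ValueError("the error message must not be empty")
    parts = [what.strip()]
    # Optional: what was expected versus what was actually received.
    if expected is not None and received is not None:
        parts.append("Expected {}. Received {}.".format(expected, received))
    # Optional: a suggested fix.
    if suggestion:
        parts.append("Suggested Fix: {}".format(suggestion))
    return " ".join(parts)

message = build_enforce_message(
    "ValueError: Mismatched label shape.",
    expected="labels dimension=1",
    received=4,
)
```

Rejecting an empty `what` mirrors the general principle that the error message must never be empty.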
-##### FAQ: typical problems
-1. No error message, or one too terse to give the user any useful hint!
-Bad example 1: no message written
-```
-PADDLE_ENFORCE(ctx->HasInput("X"), "");
-```
-Bad example 2: message too terse
-```
-PADDLE_ENFORCE(i != nullptr, "i must be set"); // what is i?
-```
-2. Developer-defined variable abbreviations in the error message are hard to understand!
-Bad example:
-```
-PADDLE_ENFORCE(forward_pd != nullptr,
-                    "Fail to find eltwise_fwd_pd in device context");  // users may not understand eltwise_fwd_pd
-```
-3. Calling an illegal interface inside an Op: Output = ShareDataWith(Input) appears inside the Op
-Bad example:
-```cpp
-auto *out = ctx.Output<framework::LoDTensor>("Out");
-auto *in = ctx.Input<framework::LoDTensor>("X");
-out->ShareDataWith(*in);
-```
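-If Output = ShareDataWith(Input) appears inside an Op, it amounts to a hidden edge in the operator graph connecting Input and Output. Graph analysis cannot express this edge, which leads to errors in graph-based optimizations.
-4. Performance practice for Op implementations
-Calling eigen operations such as broadcast and chop can be several times slower than a hand-written cuda kernel. In that case, the cpu implementation can reuse eigen while the gpu implementation provides its own cuda kernel.
-#### Special notes on error messages for Op InferShape checks
- When checking input and output variables, follow this format consistently
-`Input(variable name) of <Op name> operator should not be null.`  
-Correct example: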
-```
-PADDLE_ENFORCE(ctx->HasInput("Input"),
-                        "Input(Input) of LSTMP operator should not be null.");
-```
- For input/output checks of a backward Op, state the name of the backward Op
-Correct example:
-```
-PADDLE_ENFORCE(ctx->HasInput("X"),
-                        "Input(X) of LoDResetGrad operator should not be null.");
-```
--- a/doc/fluid/advanced_usage/development/timeline_cn.md
+++ b/doc/fluid/advanced_usage/development/timeline_cn.md
-# How to use the timeline tool for performance analysis
+../../../howto/optimization/timeline_cn.md
\ No newline at end of file
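-1. Add `profiler.start_profiler(...)` and `profiler.stop_profiler(...)` around the main training loop. After the run, the code writes a profile record file under `/tmp/profile`.
-	**Tip:**
-	Do not run too many iterations while timeline is recording, because the number of timeline records is proportional to the number of iterations.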
-	```python
-    for pass_id in range(pass_num):
-        for batch_id, data in enumerate(train_reader()):
-            if pass_id == 0 and batch_id == 5:
-                profiler.start_profiler("All")
-            elif pass_id == 0 and batch_id == 10:
-                profiler.stop_profiler("total", "/tmp/profile")
-            exe.run(fluid.default_main_program(),
-                    feed=feeder.feed(data),
-                    fetch_list=[])
-	            ...
-	```
-1. Run `python paddle/tools/timeline.py` to process `/tmp/profile`. By default it generates a `/tmp/timeline` file; the path can be changed with command-line arguments, see [timeline.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/timeline.py).
-```bash
-python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=timeline
-```
-1. Open the Chrome browser, visit <chrome://tracing/>, and use the `load` button to load the generated `timeline` file.
-	![chrome tracing](./tracing.jpeg)
-1. The result is shown below; zoom in to inspect the details of the timeline.
-	![chrome timeline](./timeline.jpeg)
--- a/doc/fluid/advanced_usage/development/write_docs.rst
+++ b/doc/fluid/advanced_usage/development/write_docs.rst
-###############################
+../../../dev/write_docs_cn.rst
-How to contribute documentation
\ No newline at end of file
-###############################
-PaddlePaddle's documentation has a Chinese part and an English part. Both are generated by ``sphinx`` driven by ``cmake``; the PaddlePaddle.org tool can run this build for us and gives a better preview.
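-How to build the documentation
-==============================
-There are two ways to build the PaddlePaddle documentation: with the PaddlePaddle.org tool and without it. Each has its advantages: the former makes previewing easy, the latter makes debugging easy for developers. Each way can in turn be done with or without Docker.
-We recommend building the documentation with the PaddlePaddle.org tool.
-Using the PaddlePaddle.org tool
--------------------------------
-This is the currently recommended method. Besides building the documentation automatically, it lets you preview it directly in a web page. Note that the other methods described later can also preview the documentation, but their styling differs from the official site; only a build with the PaddlePaddle.org tool produces a preview that matches the official site's style.
-The PaddlePaddle.org tool can be used together with Docker, which must be installed on the system first. For installation, see the `Docker website <https://docs.docker.com/>`_ . Once Docker is installed, start the tool with the following commands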
-..  code-block:: bash
-    mkdir paddlepaddle # Create paddlepaddle working directory
-    cd paddlepaddle
-    # Clone the content repositories
-    git clone https://github.com/PaddlePaddle/Paddle.git
-    git clone https://github.com/PaddlePaddle/book.git
-    git clone https://github.com/PaddlePaddle/models.git
-    git clone https://github.com/PaddlePaddle/Mobile.git
-    # Please specify the working directory through -v
-    docker run -it -p 8000:8000 -v `pwd`:/var/content paddlepaddle/paddlepaddle.org:latest
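-Note: PaddlePaddle.org runs its commands in the content repositories specified by -v (volume)
-Then open http://localhost:8000 in a browser to generate the required documentation on the web page
-The built files are stored in the working directory <paddlepaddle working directory>/.ppo_workspace/content.
-If you prefer not to use Docker, you can start the tool's server directly through the Django framework. Run it with the following commands.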
-..  code-block:: bash
-    mkdir paddlepaddle # Create paddlepaddle working directory
-    cd paddlepaddle
-    # Clone the content repositories and PaddlePaddle.org
-    git clone https://github.com/PaddlePaddle/Paddle.git
-    git clone https://github.com/PaddlePaddle/book.git
-    git clone https://github.com/PaddlePaddle/models.git
-    git clone https://github.com/PaddlePaddle/Mobile.git
-    git clone https://github.com/PaddlePaddle/PaddlePaddle.org.git
-    # Please specify the PaddlePaddle working directory. In the current setting, it should be pwd
-    export CONTENT_DIR=<path_to_paddlepaddle_working_directory>
-    export ENV=''
-    cd PaddlePaddle.org/portal/
-    pip install -r requirements.txt
-    python manage.py runserver
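-The tool server reads the environment variable CONTENT_DIR to locate the repositories. Set CONTENT_DIR to the PaddlePaddle working directory.
-Then open http://localhost:8000 in a browser to generate the required documentation on the web page.
-The built files are stored in the working directory <paddlepaddle working directory>/.ppo_workspace/content.
-For more details on the PaddlePaddle.org tool, `click here <https://github.com/PaddlePaddle/PaddlePaddle.org/blob/develop/README.cn.md>`_ .
-Without the PaddlePaddle.org tool
----------------------------------
-To build the PaddlePaddle documentation with Docker, Docker must be installed on the system first. For installation, see the `Docker website <https://docs.docker.com/>`_ . This method is similar to `building PaddlePaddle from source <http://paddlepaddle.org/docs/develop/documentation/zh/build_and_install/build_from_source_cn.html>`_ : build a Docker image from the source tree that can compile the PaddlePaddle documentation, run it, and inside the container use the script from the source tree to build the documentation. The steps are: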
-.. code-block:: bash
-   git clone https://github.com/PaddlePaddle/Paddle.git
-   cd Paddle
-   # Build a Docker image from source that can compile the PaddlePaddle documentation
-   docker build -t paddle:dev .
-   docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" -e "WITH_DOC=ON" paddle:dev /bin/bash
-   # Inside the Docker container, build the PaddlePaddle documentation with the build.sh script
-   bash -x /paddle/paddle/scripts/docker/build.sh
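-Note: the command above maps the current directory (the source root) to the :code:`/paddle` directory inside the container.
-After the build, two directories ``doc/v2`` and ``doc/fluid`` are produced, each containing three subdirectories: ``cn/html/`` , ``en/html`` and ``api/en/html`` . Enter each of these directories and run: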
-.. code-block:: bash
-   python -m SimpleHTTPServer 8088
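-Open http://localhost:8088 in a browser to see the generated Chinese/English documentation pages and English API pages for both the ``v2`` and ``fluid`` versions.
-If you prefer not to use Docker, you can also build the PaddlePaddle documentation directly with the following commands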
-.. code-block:: bash
-   git clone https://github.com/PaddlePaddle/Paddle.git
-   cd Paddle
-   mkdir -p build
-   cd build
-   cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_GPU=OFF -DWITH_MKL=OFF -DWITH_DOC=ON
-   # To build only the user documentation, run
-   make -j $processors paddle_docs
-   # To build only the API documentation, run
-   make -j $processors paddle_apis
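-Here $processors means launching as many parallel build processes as there are CPU cores; set it according to the number of cores on your machine.
-After the build, the two directories ``doc/v2`` and ``doc/fluid`` are produced as before. Building the documentation generates the ``cn/html/`` and ``en/html`` subdirectories in each of them; building the API generates the ``api/en/html`` directory in each. Enter each of these subdirectories and run: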
-.. code-block:: bash
-   python -m SimpleHTTPServer 8088
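-Open http://localhost:8088 in a browser to see the generated Chinese/English documentation pages and English API pages for both the ``v2`` and ``fluid`` versions. The figure below shows the generated ``v2`` English documentation home page as an example. Note that the example uses the stock sphinx theme, so the page style differs from the official site, but this does not hinder debugging.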
-..  image:: src/doc_en.png
-    :align: center
-    :scale: 60 %
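-How to write documentation
-==========================
-The PaddlePaddle documentation is generated automatically with `sphinx`_ ; see the sphinx tutorial for how to write it.
-How to update www.paddlepaddle.org
-==================================
-Updated documentation is submitted to github as a PR; for how to submit, see `How to contribute documentation <http://www.paddlepaddle.org/docs/develop/documentation/zh/dev/write_docs_cn.html>`_ .
-The documentation of PaddlePaddle's develop branch is currently updated automatically; the latest versions are available as `Chinese documentation <http://www.paddlepaddle.org/docs/develop/documentation/zh/getstarted/index_cn.html>`_ and
-`English documentation <http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/index_en.html>`_ .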
-..  _cmake: https://cmake.org/
-..  _sphinx: http://www.sphinx-doc.org/en/1.4.8/
--- a/doc/fluid/api/fluid.rst
+++ b/doc/fluid/api/fluid.rst
@@ -64,6 +64,14 @@ get_var
 ..  autofunction:: paddle.fluid.get_var
    :noindex:
+.. _api_fluid_name_scope:
+name_scope
+----------
+..  autofunction:: paddle.fluid.name_scope
+    :noindex:
 .. _api_fluid_Executor:
 Executor
@@ -97,69 +105,6 @@ _switch_scope
 ..  autofunction:: paddle.fluid._switch_scope
    :noindex:
-.. _api_fluid_Trainer:
-Trainer
-------
-..  autoclass:: paddle.fluid.Trainer
-    :members:
-    :noindex:
-.. _api_fluid_BeginEpochEvent:
-BeginEpochEvent
---------------
-..  autoclass:: paddle.fluid.BeginEpochEvent
-    :members:
-    :noindex:
-.. _api_fluid_EndEpochEvent:
-EndEpochEvent
-------------
-..  autoclass:: paddle.fluid.EndEpochEvent
-    :members:
-    :noindex:
-.. _api_fluid_BeginStepEvent:
-BeginStepEvent
--------------
-..  autoclass:: paddle.fluid.BeginStepEvent
-    :members:
-    :noindex:
-.. _api_fluid_EndStepEvent:
-EndStepEvent
------------
-..  autoclass:: paddle.fluid.EndStepEvent
-    :members:
-    :noindex:
-.. _api_fluid_CheckpointConfig:
-CheckpointConfig
----------------
-..  autoclass:: paddle.fluid.CheckpointConfig
-    :members:
-    :noindex:
-.. _api_fluid_Inferencer:
-Inferencer
----------
-..  autoclass:: paddle.fluid.Inferencer
-    :members:
-    :noindex:
 .. _api_fluid_DistributeTranspiler:
 DistributeTranspiler
@@ -169,15 +114,6 @@ DistributeTranspiler
    :members:
    :noindex:
-.. _api_fluid_InferenceTranspiler:
-InferenceTranspiler
-------------------
-..  autoclass:: paddle.fluid.InferenceTranspiler
-    :members:
-    :noindex:
 .. _api_fluid_memory_optimize:
 memory_optimize

--- a/doc/fluid/api/gen_doc.sh
+++ b/doc/fluid/api/gen_doc.sh
 #!/bin/bash
-python gen_doc.py layers --submodules control_flow device io nn ops tensor learning_rate_scheduler detection metric_op tensor > layers.rst
+python gen_doc.py layers --submodules control_flow device io nn ops tensor learning_rate_scheduler detection metric_op > layers.rst
 for module in data_feeder clip metrics executor initializer io nets optimizer param_attr profiler regularizer transpiler recordio_writer backward average profiler
 do

--- a/doc/fluid/api/initializer.rst
+++ b/doc/fluid/api/initializer.rst
@@ -32,6 +32,15 @@ Normal
    :members:
    :noindex:
+.. _api_fluid_initializer_TruncatedNormal:
+TruncatedNormal
+---------------
+..  autoclass:: paddle.fluid.initializer.TruncatedNormal
+    :members:
+    :noindex:
 .. _api_fluid_initializer_Xavier:
 Xavier
@@ -102,6 +111,15 @@ NormalInitializer
    :members:
    :noindex:
+.. _api_fluid_initializer_TruncatedNormalInitializer:
+TruncatedNormalInitializer
+--------------------------
+..  autoclass:: paddle.fluid.initializer.TruncatedNormalInitializer
+    :members:
+    :noindex:
 .. _api_fluid_initializer_XavierInitializer:
 XavierInitializer

--- a/doc/fluid/api/io.rst
+++ b/doc/fluid/api/io.rst
@@ -69,11 +69,3 @@ load_inference_model
 ..  autofunction:: paddle.fluid.io.load_inference_model
    :noindex:
-.. _api_fluid_io_get_inference_program:
-get_inference_program
---------------------
-..  autofunction:: paddle.fluid.io.get_inference_program
-    :noindex:
--- a/doc/fluid/api/layers.rst
+++ b/doc/fluid/api/layers.rst
@@ -117,15 +117,6 @@ reorder_lod_tensor_by_rank
 ..  autofunction:: paddle.fluid.layers.reorder_lod_tensor_by_rank
    :noindex:
-.. _api_fluid_layers_ParallelDo:
-ParallelDo
----------
-..  autoclass:: paddle.fluid.layers.ParallelDo
-    :members:
-    :noindex:
 .. _api_fluid_layers_Print:
 Print
@@ -156,14 +147,6 @@ data
 ..  autofunction:: paddle.fluid.layers.data
    :noindex:
-.. _api_fluid_layers_open_recordio_file:
-open_recordio_file
------------------
-..  autofunction:: paddle.fluid.layers.open_recordio_file
-    :noindex:
 .. _api_fluid_layers_open_files:
 open_files
@@ -440,6 +423,14 @@ sequence_expand
 ..  autofunction:: paddle.fluid.layers.sequence_expand
    :noindex:
+.. _api_fluid_layers_sequence_expand_as:
+sequence_expand_as
+------------------
+..  autofunction:: paddle.fluid.layers.sequence_expand_as
+    :noindex:
 .. _api_fluid_layers_sequence_pad:
 sequence_pad
@@ -688,6 +679,22 @@ reshape
 ..  autofunction:: paddle.fluid.layers.reshape
    :noindex:
+.. _api_fluid_layers_squeeze:
+squeeze
+-------
+..  autofunction:: paddle.fluid.layers.squeeze
+    :noindex:
+.. _api_fluid_layers_unsqueeze:
+unsqueeze
+---------
+..  autofunction:: paddle.fluid.layers.unsqueeze
+    :noindex:
 .. _api_fluid_layers_lod_reset:
 lod_reset
@@ -712,6 +719,14 @@ pad
 ..  autofunction:: paddle.fluid.layers.pad
    :noindex:
+.. _api_fluid_layers_pad_constant_like:
+pad_constant_like
+-----------------
+..  autofunction:: paddle.fluid.layers.pad_constant_like
+    :noindex:
 .. _api_fluid_layers_label_smooth:
 label_smooth
@@ -768,6 +783,22 @@ gather
 ..  autofunction:: paddle.fluid.layers.gather
    :noindex:
+.. _api_fluid_layers_scatter:
+scatter
+-------
+..  autofunction:: paddle.fluid.layers.scatter
+    :noindex:
+.. _api_fluid_layers_sequence_scatter:
+sequence_scatter
+----------------
+..  autofunction:: paddle.fluid.layers.sequence_scatter
+    :noindex:
 .. _api_fluid_layers_random_crop:
 random_crop
@@ -816,6 +847,54 @@ rank_loss
 ..  autofunction:: paddle.fluid.layers.rank_loss
    :noindex:
+.. _api_fluid_layers_elu:
+elu
+---
+..  autofunction:: paddle.fluid.layers.elu
+    :noindex:
+.. _api_fluid_layers_relu6:
+relu6
+-----
+..  autofunction:: paddle.fluid.layers.relu6
+    :noindex:
+.. _api_fluid_layers_pow:
+pow
+---
+..  autofunction:: paddle.fluid.layers.pow
+    :noindex:
+.. _api_fluid_layers_stanh:
+stanh
+-----
+..  autofunction:: paddle.fluid.layers.stanh
+    :noindex:
+.. _api_fluid_layers_hard_sigmoid:
+hard_sigmoid
+------------
+..  autofunction:: paddle.fluid.layers.hard_sigmoid
+    :noindex:
+.. _api_fluid_layers_swish:
+swish
+-----
+..  autofunction:: paddle.fluid.layers.swish
+    :noindex:
 .. _api_fluid_layers_prelu:
 prelu
@@ -824,6 +903,30 @@ prelu
 ..  autofunction:: paddle.fluid.layers.prelu
    :noindex:
+.. _api_fluid_layers_brelu:
+brelu
+-----
+..  autofunction:: paddle.fluid.layers.brelu
+    :noindex:
+.. _api_fluid_layers_leaky_relu:
+leaky_relu
+----------
+..  autofunction:: paddle.fluid.layers.leaky_relu
+    :noindex:
+.. _api_fluid_layers_soft_relu:
+soft_relu
+---------
+..  autofunction:: paddle.fluid.layers.soft_relu
+    :noindex:
 .. _api_fluid_layers_flatten:
 flatten
@@ -832,39 +935,68 @@ flatten
 ..  autofunction:: paddle.fluid.layers.flatten
    :noindex:
-ops
+.. _api_fluid_layers_sequence_mask:
-===
-.. _api_fluid_layers_mean:
+sequence_mask
+-------------
-mean
----
-..  autofunction:: paddle.fluid.layers.mean
+..  autofunction:: paddle.fluid.layers.sequence_mask
    :noindex:
-.. _api_fluid_layers_mul:
+.. _api_fluid_layers_stack:
-mul
+stack
---
+-----
-..  autofunction:: paddle.fluid.layers.mul
+..  autofunction:: paddle.fluid.layers.stack
    :noindex:
-.. _api_fluid_layers_scale:
+.. _api_fluid_layers_pad2d:
-scale
+pad2d
 -----
-..  autofunction:: paddle.fluid.layers.scale
+..  autofunction:: paddle.fluid.layers.pad2d
    :noindex:
-.. _api_fluid_layers_sigmoid_cross_entropy_with_logits:
+.. _api_fluid_layers_unstack:
-sigmoid_cross_entropy_with_logits
+unstack
---------------------------------
+-------
-..  autofunction:: paddle.fluid.layers.sigmoid_cross_entropy_with_logits
+..  autofunction:: paddle.fluid.layers.unstack
+    :noindex:
+.. _api_fluid_layers_sequence_enumerate:
+sequence_enumerate
+------------------
+..  autofunction:: paddle.fluid.layers.sequence_enumerate
+    :noindex:
+.. _api_fluid_layers_expand:
+expand
+------
+..  autofunction:: paddle.fluid.layers.expand
+    :noindex:
+.. _api_fluid_layers_sequence_concat:
+sequence_concat
+---------------
+..  autofunction:: paddle.fluid.layers.sequence_concat
+    :noindex:
+.. _api_fluid_layers_scale:
+scale
+-----
+..  autofunction:: paddle.fluid.layers.scale
    :noindex:
 .. _api_fluid_layers_elementwise_add:
@@ -923,6 +1055,33 @@ elementwise_pow
 ..  autofunction:: paddle.fluid.layers.elementwise_pow
    :noindex:
+ops
+===
+.. _api_fluid_layers_mean:
+mean
+----
+..  autofunction:: paddle.fluid.layers.mean
+    :noindex:
+.. _api_fluid_layers_mul:
+mul
+---
+..  autofunction:: paddle.fluid.layers.mul
+    :noindex:
+.. _api_fluid_layers_sigmoid_cross_entropy_with_logits:
+sigmoid_cross_entropy_with_logits
+---------------------------------
+..  autofunction:: paddle.fluid.layers.sigmoid_cross_entropy_with_logits
+    :noindex:
 .. _api_fluid_layers_clip:
 clip
@@ -987,20 +1146,20 @@ gaussian_random
 ..  autofunction:: paddle.fluid.layers.gaussian_random
    :noindex:
-.. _api_fluid_layers_gaussian_random_batch_size_like:
+.. _api_fluid_layers_sampling_id:
-gaussian_random_batch_size_like
+sampling_id
-------------------------------
+-----------
-..  autofunction:: paddle.fluid.layers.gaussian_random_batch_size_like
+..  autofunction:: paddle.fluid.layers.sampling_id
    :noindex:
-.. _api_fluid_layers_scatter:
+.. _api_fluid_layers_gaussian_random_batch_size_like:
-scatter
+gaussian_random_batch_size_like
-------
+-------------------------------
-..  autofunction:: paddle.fluid.layers.scatter
+..  autofunction:: paddle.fluid.layers.gaussian_random_batch_size_like
    :noindex:
 .. _api_fluid_layers_sum:
@@ -1171,78 +1330,6 @@ softsign
 ..  autofunction:: paddle.fluid.layers.softsign
    :noindex:
-.. _api_fluid_layers_brelu:
-brelu
-----
-..  autofunction:: paddle.fluid.layers.brelu
-    :noindex:
-.. _api_fluid_layers_leaky_relu:
-leaky_relu
----------
-..  autofunction:: paddle.fluid.layers.leaky_relu
-    :noindex:
-.. _api_fluid_layers_soft_relu:
-soft_relu
---------
-..  autofunction:: paddle.fluid.layers.soft_relu
-    :noindex:
-.. _api_fluid_layers_elu:
-elu
---
-..  autofunction:: paddle.fluid.layers.elu
-    :noindex:
-.. _api_fluid_layers_relu6:
-relu6
-----
-..  autofunction:: paddle.fluid.layers.relu6
-    :noindex:
-.. _api_fluid_layers_pow:
-pow
---
-..  autofunction:: paddle.fluid.layers.pow
-    :noindex:
-.. _api_fluid_layers_stanh:
-stanh
-----
-..  autofunction:: paddle.fluid.layers.stanh
-    :noindex:
-.. _api_fluid_layers_hard_sigmoid:
-hard_sigmoid
------------
-..  autofunction:: paddle.fluid.layers.hard_sigmoid
-    :noindex:
-.. _api_fluid_layers_swish:
-swish
-----
-..  autofunction:: paddle.fluid.layers.swish
-    :noindex:
 .. _api_fluid_layers_uniform_random:
 uniform_random
@@ -1532,6 +1619,30 @@ anchor_generator
 ..  autofunction:: paddle.fluid.layers.anchor_generator
    :noindex:
+.. _api_fluid_layers_roi_perspective_transform:
+roi_perspective_transform
+-------------------------
+..  autofunction:: paddle.fluid.layers.roi_perspective_transform
+    :noindex:
+.. _api_fluid_layers_generate_proposal_labels:
+generate_proposal_labels
+------------------------
+..  autofunction:: paddle.fluid.layers.generate_proposal_labels
+    :noindex:
+.. _api_fluid_layers_generate_proposals:
+generate_proposals
+------------------
+..  autofunction:: paddle.fluid.layers.generate_proposals
+    :noindex:
 .. _api_fluid_layers_iou_similarity:
 iou_similarity
@@ -1575,126 +1686,3 @@ auc
 ..  autofunction:: paddle.fluid.layers.auc
    :noindex:
-tensor
-======
-.. _api_fluid_layers_create_tensor:
-create_tensor
-------------
-..  autofunction:: paddle.fluid.layers.create_tensor
-    :noindex:
-.. _api_fluid_layers_create_parameter:
-create_parameter
----------------
-..  autofunction:: paddle.fluid.layers.create_parameter
-    :noindex:
-.. _api_fluid_layers_create_global_var:
-create_global_var
-----------------
-..  autofunction:: paddle.fluid.layers.create_global_var
-    :noindex:
-.. _api_fluid_layers_cast:
-cast
----
-..  autofunction:: paddle.fluid.layers.cast
-    :noindex:
-.. _api_fluid_layers_concat:
-concat
------
-..  autofunction:: paddle.fluid.layers.concat
-    :noindex:
-.. _api_fluid_layers_sums:
-sums
----
-..  autofunction:: paddle.fluid.layers.sums
-    :noindex:
-.. _api_fluid_layers_assign:
-assign
------
-..  autofunction:: paddle.fluid.layers.assign
-    :noindex:
-.. _api_fluid_layers_fill_constant_batch_size_like:
-fill_constant_batch_size_like
-----------------------------
-..  autofunction:: paddle.fluid.layers.fill_constant_batch_size_like
-    :noindex:
-.. _api_fluid_layers_fill_constant:
-fill_constant
-------------
-..  autofunction:: paddle.fluid.layers.fill_constant
-    :noindex:
-.. _api_fluid_layers_argmin:
-argmin
------
-..  autofunction:: paddle.fluid.layers.argmin
-    :noindex:
-.. _api_fluid_layers_argmax:
-argmax
------
-..  autofunction:: paddle.fluid.layers.argmax
-    :noindex:
-.. _api_fluid_layers_argsort:
-argsort
-------
-..  autofunction:: paddle.fluid.layers.argsort
-    :noindex:
-.. _api_fluid_layers_ones:
-ones
----
-..  autofunction:: paddle.fluid.layers.ones
-    :noindex:
-.. _api_fluid_layers_zeros:
-zeros
-----
-..  autofunction:: paddle.fluid.layers.zeros
-    :noindex:
-.. _api_fluid_layers_reverse:
-reverse
-------
-..  autofunction:: paddle.fluid.layers.reverse
-    :noindex:
--- a/doc/fluid/api/nets.rst
+++ b/doc/fluid/api/nets.rst
@@ -37,3 +37,11 @@ scaled_dot_product_attention
 ..  autofunction:: paddle.fluid.nets.scaled_dot_product_attention
    :noindex:
+.. _api_fluid_nets_img_conv_group:
+img_conv_group
+--------------
+..  autofunction:: paddle.fluid.nets.img_conv_group
+    :noindex:
--- a/doc/fluid/api/transpiler.rst
+++ b/doc/fluid/api/transpiler.rst
@@ -14,15 +14,6 @@ DistributeTranspiler
    :members:
    :noindex:
-.. _api_fluid_transpiler_InferenceTranspiler:
-InferenceTranspiler
-------------------
-..  autoclass:: paddle.fluid.transpiler.InferenceTranspiler
-    :members:
-    :noindex:
 .. _api_fluid_transpiler_memory_optimize:
 memory_optimize

--- a/doc/fluid/api_guides/high_low_level_api.md
+++ b/doc/fluid/api_guides/high_low_level_api.md
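+## Introduction to the High/Low-level APIs
+Paddle currently has 2 sets of APIs:
+- Low-level API:
+  - Flexible and relatively mature; models trained with it can be served directly for C++ inference.
+  - Comes with a large number of models as usage examples, including chapters 7 and 8 of [Book](https://github.com/PaddlePaddle/book),
+    and all chapters of [models](https://github.com/PaddlePaddle/models).
+  - Intended for: users with some understanding of deep learning who need custom networks for training, inference, or production deployment.
+- High-level API:
+  - Simple to use; the first six chapters of [Book](https://github.com/PaddlePaddle/book) provide examples.
+  - Not yet mature; the interfaces currently live under [paddle.fluid.contrib](https://github.com/PaddlePaddle/Paddle/tree/develop/python/paddle/fluid/contrib).
+  - Intended for: beginners who want to learn the basics of deep learning through the Book course.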
--- a/doc/fluid/api_guides/low_level/optimizer/optimizer_all.rst
+++ b/doc/fluid/api_guides/low_level/optimizer/optimizer_all.rst
+..  _api_guide_optimizer:
+Optimizer
+#########
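+Training a neural network is ultimately an `optimization problem <https://en.wikipedia.org/wiki/Optimization_problem>`_ :
+after the `forward computation and backpropagation <https://zh.wikipedia.org/zh-hans/反向传播算法>`_ ,
+an :code:`Optimizer` uses the backpropagated gradients to optimize the parameters of the network.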
+1.SGD/SGDOptimizer
+------------------
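+:code:`SGD` is an :code:`Optimizer` subclass that implements `stochastic gradient descent <https://arxiv.org/pdf/1609.04747.pdf>`_ , one method in the broader `gradient descent <https://zh.wikipedia.org/zh-hans/梯度下降法>`_ family.
+When training on a large number of samples, :code:`SGD` is often chosen to make the loss function converge faster.  
+For the API Reference, see api_fluid_optimizer_SGDOptimizer_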
+.. _api_fluid_optimizer_SGDOptimizer: http://www.paddlepaddle.org/docs/0.14.0/api/fluid/en/optimizer.html#permalink-8-sgdoptimizer
+2.Momentum/MomentumOptimizer
+----------------------------
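+The :code:`Momentum` optimizer adds momentum on top of :code:`SGD`, reducing the noise inherent in stochastic gradient descent.
+Set the :code:`use_nesterov` parameter to False or True to select the classic `Momentum (Section 4.1 of the paper)
+<https://arxiv.org/pdf/1609.04747.pdf>`_  algorithm or the `Nesterov accelerated gradient (Section 4.2 of the paper)
+<https://arxiv.org/pdf/1609.04747.pdf>`_ algorithm, respectively.
+For the API Reference, see api_fluid_optimizer_MomentumOptimizer_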
+.. _api_fluid_optimizer_MomentumOptimizer: http://www.paddlepaddle.org/docs/0.14.0/api/fluid/en/optimizer.html#permalink-9-momentumoptimizer
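The difference between plain SGD and the two momentum variants can be sketched on a single scalar parameter. This is a minimal illustration under stated assumptions, not Fluid's kernel: `sgd_step` and `momentum_step` are hypothetical helpers, and the update rule is a common formulation in which the velocity accumulates raw gradients and the learning rate is applied when the parameter is updated.

```python
# Hypothetical scalar sketch of SGD and momentum updates (not Paddle code).

def sgd_step(param, grad, lr):
    """Plain SGD: move against the gradient."""
    return param - lr * grad

def momentum_step(param, grad, velocity, lr, mu, use_nesterov=False):
    """One momentum update; returns the new (param, velocity) pair."""
    velocity = mu * velocity + grad  # accumulate past gradients
    if use_nesterov:
        # Nesterov variant: look ahead along the updated velocity.
        param = param - lr * (grad + mu * velocity)
    else:
        # Classic momentum: step along the velocity.
        param = param - lr * velocity
    return param, velocity

# Minimize f(w) = w**2, whose gradient is 2 * w, starting from w = 1.
w_classic, v_classic = 1.0, 0.0
w_nesterov, v_nesterov = 1.0, 0.0
for _ in range(100):
    w_classic, v_classic = momentum_step(
        w_classic, 2 * w_classic, v_classic, lr=0.1, mu=0.5)
    w_nesterov, v_nesterov = momentum_step(
        w_nesterov, 2 * w_nesterov, v_nesterov, lr=0.1, mu=0.5,
        use_nesterov=True)
```

With these settings both variants drive `w` toward the minimum at 0; the flag only changes how the velocity enters the parameter update.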
+3. Adagrad/AdagradOptimizer
+---------------------------
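+The `Adagrad <http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf>`_ optimizer adaptively assigns a different learning rate to each parameter, addressing the problem of parameters being updated by uneven numbers of samples.
+For the API Reference, see api_fluid_optimizer_AdagradOptimizer_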
+.. _api_fluid_optimizer_AdagradOptimizer: http://www.paddlepaddle.org/docs/0.14.0/api/fluid/en/optimizer.html#permalink-10-adagradoptimizer
+4.RMSPropOptimizer
+------------------
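+The `RMSProp optimizer <http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf>`_ is a method that adapts the learning rate.
+It mainly addresses the sharp drop in learning rate that Adagrad suffers in the middle and late stages of training.
+For the API Reference, see api_fluid_optimizer_RMSPropOptimizer_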
+.. _api_fluid_optimizer_RMSPropOptimizer: http://www.paddlepaddle.org/docs/0.14.0/api/fluid/en/optimizer.html#permalink-14-rmspropoptimizer
+5.Adam/AdamOptimizer
+--------------------
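+The `Adam <https://arxiv.org/abs/1412.6980>`_ optimizer is a method that adapts the learning rate.
+It suits most non-convex optimization problems (see `convex optimization <https://zh.wikipedia.org/zh/凸優化>`_ ), large datasets, and high-dimensional spaces. In practice, :code:`Adam` is among the most commonly used optimization methods.
+For the API Reference, see api_fluid_optimizer_AdamOptimizer_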
+.. _api_fluid_optimizer_AdamOptimizer: http://www.paddlepaddle.org/docs/0.14.0/api/fluid/en/optimizer.html#permalink-11-adamoptimizer
+6.Adamax/AdamaxOptimizer
+------------------------
+`Adamax <https://arxiv.org/abs/1412.6980>`_ is a variant of the :code:`Adam` algorithm that gives a simpler bound on the upper limit of the learning rate.
+For the API Reference, see api_fluid_optimizer_AdamxOptimizer_
+.. _api_fluid_optimizer_AdamxOptimizer: http://www.paddlepaddle.org/docs/0.14.0/api/fluid/en/optimizer.html#permalink-12-adamaxoptimizer
+7.DecayedAdagrad/ DecayedAdagradOptimizer
+-------------------------------------------
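+The `DecayedAdagrad <http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf>`_ optimizer can be seen as the :code:`Adagrad` algorithm with a decay rate added, which addresses the sharp drop in learning rate that Adagrad suffers in the middle and late stages of training.
+For the API Reference, see api_fluid_optimizer_DecayedAdagrad_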
+.. _api_fluid_optimizer_DecayedAdagrad: http://www.paddlepaddle.org/docs/0.14.0/api/fluid/en/optimizer.html#permalink-13-decayedadagradoptimizer
+8. Ftrl/FtrlOptimizer
+----------------------
+The `FtrlOptimizer <https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf>`_ optimizer combines the high accuracy of the `FOBOS algorithm <https://stanford.edu/~jduchi/projects/DuchiSi09b.pdf>`_ with the sparsity of the `RDA algorithm
+<http://www1.se.cuhk.edu.hk/~sqma/SEEM5121_Spring2015/dual-averaging.pdf>`_ , and is currently a very effective `Online Learning <https://en.wikipedia.org/wiki/Online_machine_learning>`_ algorithm.
+For the API Reference, see api_fluid_optimizer_FtrlOptimizer_
+.. _api_fluid_optimizer_FtrlOptimizer: http://www.paddlepaddle.org/docs/0.14.0/api/fluid/en/optimizer.html#permalink-15-ftrloptimizer
+9.ModelAverage
+-----------------
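+The :code:`ModelAverage` optimizer accumulates historical parameters over a window during training and uses the averaged parameters at inference time, improving overall prediction accuracy.
+For the API Reference, see api_fluid_optimizer_ModelAverage_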
+.. _api_fluid_optimizer_ModelAverage: http://www.paddlepaddle.org/docs/0.14.0/api/fluid/en/optimizer.html#permalink-17-modelaverage
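The windowed averaging idea can be sketched as follows. This is a deliberate simplification under stated assumptions: `WindowedAverage` is a hypothetical class that keeps a fixed-size window of parameter snapshots, whereas the actual :code:`ModelAverage` may size and reset its window differently.

```python
from collections import deque

# Hypothetical sketch of averaging parameters over a sliding window.
class WindowedAverage:
    def __init__(self, window_size):
        # Only the most recent `window_size` snapshots are kept.
        self.history = deque(maxlen=window_size)

    def accumulate(self, param):
        """Record a copy of the current parameter vector (training time)."""
        self.history.append(list(param))

    def average(self):
        """Element-wise mean of the recorded snapshots (inference time)."""
        if not self.history:
            raise ValueError("no parameters accumulated yet")
        count = len(self.history)
        return [sum(values) / count for values in zip(*self.history)]

avg = WindowedAverage(window_size=2)
for snapshot in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    avg.accumulate(snapshot)
averaged = avg.average()
```

Only the last two snapshots remain in the window here, so the oldest one no longer contributes to the average.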
+10.Optimizer
+--------------
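+The :code:`Optimizer` class is the base class of optimizers in :code:`Fluid`. It defines the common interface of optimizers; users invoke the classic optimization algorithms above through this class.
+For the API Reference, see api_fluid_optimizer_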
+.. _api_fluid_optimizer: http://www.paddlepaddle.org/docs/0.14.0/api/fluid/en/optimizer.html#permalink-18-optimizer
--- a/doc/fluid/beginners_guide/image/tensor.jpg
+++ b/doc/fluid/beginners_guide/image/tensor.jpg
--- a/doc/fluid/dev/contribute_to_paddle_cn.md
+++ b/doc/fluid/dev/contribute_to_paddle_cn.md
-# How to contribute code
+../../v2/dev/contribute_to_paddle_cn.md
\ No newline at end of file
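-We sincerely appreciate your contribution. You are welcome to submit code through GitHub's fork and pull request workflow.
-## Code requirements
- Code comments must follow the [Doxygen](http://www.stack.nl/~dimitri/doxygen/) style.
- Make sure the compiler option `WITH_STYLE_CHECK` is on and the build passes the code style check.
- All code must have unit tests.
- All unit tests must pass.
- Please follow [some conventions for submitting code](#提交代码的一些约定).
-The following tutorial guides you through submitting code.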
-## [Fork](https://help.github.com/articles/fork-a-repo/)
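-Go to the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) GitHub home page and click the `Fork` button to create a repository under your own account, for example <https://github.com/USERNAME/Paddle>.
-## Clone
-Clone the remote repository to your local machine: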
-```bash
-➜  git clone https://github.com/USERNAME/Paddle
-➜  cd Paddle
-```
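-## Create a local branch
-Paddle currently uses the [Git flow branching model](http://nvie.com/posts/a-successful-git-branching-model/) for development, testing, release and maintenance; see the [Paddle branching conventions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/releasing_process.md#paddle-分支规范) for details.
-All feature and bug fix work should be done on a new branch, usually created from the `develop` branch.
-Use `git checkout -b` to create and switch to a new branch.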
-```bash
-➜  git checkout -b my-cool-stuff
-```
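-Note that the working directory of the current branch must be clean before checkout; otherwise untracked files are carried over to the new branch. Check this with `git status`.
-## Use the `pre-commit` hook
-Paddle developers use the [pre-commit](http://pre-commit.com/) tool to manage Git pre-commit hooks. It helps format source code (C++, Python) and automatically checks some basics before each commit (such as each file having only one EOL, and no large files being added to Git).
-The `pre-commit` test is part of the unit tests in Travis-CI; a PR that does not satisfy the hooks cannot be submitted to Paddle. First install it and run it in the current directory: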
-```bash
-➜  pip install pre-commit
-➜  pre-commit install
-```
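-Paddle uses `clang-format` to format C/C++ source code; make sure your `clang-format` version is 3.8 or above.
-Note: the `yapf` installed via `pip install pre-commit` differs slightly from the one installed via `conda install -c conda-forge pre-commit`; Paddle developers use `pip install pre-commit`.
-## Start developing
-In this example, I deleted a line from README.md and created a new file.
-Use `git status` to view the current state, which reports the changes in the current directory; `git diff` shows exactly what was modified in each file.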
-```bash
-➜  git status
-On branch test
-Changes not staged for commit:
-  (use "git add <file>..." to update what will be committed)
-  (use "git checkout -- <file>..." to discard changes in working directory)
-	modified:   README.md
-Untracked files:
-  (use "git add <file>..." to include in what will be committed)
-	test
-no changes added to commit (use "git add" and/or "git commit -a")
-```
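-## Build and test
-Compiling the PaddlePaddle source and generating the documentation require a variety of development tools. For convenience, our standard workflow packs all of them into a Docker image called the *development image*, usually named `paddle:latest-dev` or `paddle:[version tag]-dev`, e.g. `paddle:0.11.0-dev`. Then everywhere `cmake && make` would be used (for example in IDE configuration), use `docker run paddle:latest-dev` instead.
-To build this development image, run the following in the root of the source tree: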
-```bash
-➜  docker build -t paddle:latest-dev .
-```
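-You can then use this development image to build the PaddlePaddle source. For example, to build a PaddlePaddle that does not depend on GPU but supports the AVX instruction set and includes unit tests: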
-```bash
-➜  docker run -v $(pwd):/paddle -e "WITH_GPU=OFF" -e "WITH_AVX=ON" -e "WITH_TESTING=ON" paddle:latest-dev
-```
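-Besides compiling PaddlePaddle into `./build/libpaddle.so` and producing a `./build/paddle.deb` file, this process also outputs a `build/Dockerfile`. We only need to run the following command to package the compiled PaddlePaddle into a *production image* (`paddle:prod`):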
-```bash
-➜  docker build -t paddle:prod -f build/Dockerfile .
-```
-To run all the unit tests, use the following command:
-```bash
-➜  docker run -it -v $(pwd):/paddle paddle:latest-dev bash -c "cd /paddle/build && ctest"
-```
-For more information on building and testing, see [Install and run with Docker](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/v2/build_and_install/docker_install_cn.rst).
-## Commit
-Next we discard the change to README.md and then commit the newly added test file.
-```bash
-➜  git checkout -- README.md
-➜  git status
-On branch test
-Untracked files:
-  (use "git add <file>..." to include in what will be committed)
-	test
-nothing added to commit but untracked files present (use "git add" to track)
-➜  git add test
-```
-Every Git commit needs a commit message that tells others what the commit changed; this is done with `git commit`.
-```bash
-➜  git commit
-CRLF end-lines remover...............................(no files to check)Skipped
-yapf.................................................(no files to check)Skipped
-Check for added large files..............................................Passed
-Check for merge conflicts................................................Passed
-Check for broken symlinks................................................Passed
-Detect Private Key...................................(no files to check)Skipped
-Fix End of Files.....................................(no files to check)Skipped
-clang-formater.......................................(no files to check)Skipped
-[my-cool-stuff c703c041] add test file
- 1 file changed, 0 insertions(+), 0 deletions(-)
- create mode 100644 233
-```
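-## Keep the local repository up to date
-Before opening a Pull Request, sync the latest code from the original repository (<https://github.com/PaddlePaddle/Paddle>).
-First check the names of the current remotes with `git remote`.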
-```bash
-➜  git remote
-origin
-➜  git remote -v
-origin	https://github.com/USERNAME/Paddle (fetch)
-origin	https://github.com/USERNAME/Paddle (push)
-```
-Here origin is the name of the remote repository we cloned, that is, the Paddle fork under our own username. Next, we add the original Paddle repository as a remote named upstream.
-```bash
-➜  git remote add upstream https://github.com/PaddlePaddle/Paddle
-➜  git remote
-origin
-upstream
-```
-Fetch the latest code from upstream and update the current branch.
-```bash
-➜  git fetch upstream
-➜  git pull upstream develop
-```
-## Push to the remote repository
-Push your local changes to GitHub, that is, to https://github.com/USERNAME/Paddle.
-```bash
-# Push to the my-cool-stuff branch of the remote repository origin
-➜  git push origin my-cool-stuff
-```
-## Create an Issue and complete the Pull Request
-Create an Issue that describes the problem, and note its number.
-Switch to the branch you created, then click `New pull request`.
-<img width="295" alt="screen shot 2017-04-26 at 9 09 28 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436054/a6d98c66-2ac4-11e7-9cb1-18dd13150230.png">
-Choose the target branch:
-<img width="750" alt="screen shot 2017-04-26 at 9 11 52 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436139/f83b1e6c-2ac4-11e7-8c0e-add499023c46.png">
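-In the PR description, writing `resolve #<Issue number>` closes the corresponding Issue automatically once the PR is merged. For details, see <https://help.github.com/articles/closing-issues-via-commit-messages/>.
-Then wait for review. If changes are requested, follow the steps above to update the corresponding branch in origin.
-## Delete the remote branch
-After the PR is merged into the main repository, we can delete the remote branch from the PR page.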
-<img width="775" alt="screen shot 2017-04-26 at 9 18 24 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436457/e4cdd472-2ac5-11e7-9272-badc76c4a23e.png">
-You can also delete the remote branch with `git push origin :<branch name>`, for example:
-```bash
-➜  git push origin :my-cool-stuff
-```
-## Delete the local branch
-Finally, delete the local branch.
-```bash
-# Switch to the develop branch
-➜  git checkout develop 
-# Delete the my-cool-stuff branch
-➜  git branch -D my-cool-stuff
-```
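-This completes one round of contributing code.
-## Conventions for submitting code
-To help reviewers focus on the code itself, please follow these conventions every time you submit code:
-1. Make sure the unit tests in Travis-CI pass. If they fail, the submitted code has a problem, and reviewers generally will not review it.
-2. Before submitting a Pull Request:
-   - Mind the number of commits:
-     - Reason: if you modify only one file but submit a dozen commits, each making only a small change, reviewers are burdened. They have to look at every commit to learn what changed, and changes in different commits may even overwrite one another.
-     - Suggestion: keep the number of commits as small as possible. You can use `git commit --amend` to add to the previous commit. For multiple commits already pushed to the remote repository, see [squash commits after push](http://stackoverflow.com/questions/5667884/how-to-squash-commits-in-git-after-they-have-been-pushed).
-   - Mind the name of each commit: it should reflect what the commit contains and should not be careless.
-3. If the Pull Request resolves an Issue, add `fix #issue_number` to the **first** comment box of the Pull Request, so that the Issue is closed automatically when the Pull Request is merged. The keywords include: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved; choose an appropriate one. For details, see [Closing issues via commit messages](https://help.github.com/articles/closing-issues-via-commit-messages).
-In addition, please follow these conventions when replying to reviewers' comments:
-1. Reply to every reviewer comment (this is basic courtesy in the open source community: when someone helps you, say thank you):
-   - If you agree with a comment and have made the change, a simple `Done` is enough;
-   - If you disagree with a comment, please give your reasons.
-2. If there are many review comments:
-   - Give an overall summary of the changes.
-   - Reply using [start a review](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/) rather than replying directly. Each direct reply sends an email, which floods reviewers' inboxes.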
--- a/doc/fluid/dev/contribute_to_paddle_en.md
+++ b/doc/fluid/dev/contribute_to_paddle_en.md
-# Contribute Code
+../../v2/dev/contribute_to_paddle_en.md
\ No newline at end of file
-You are welcome to contribute to project PaddlePaddle. To contribute to PaddlePaddle, you have to agree with the 
-[PaddlePaddle Contributor License Agreement](https://gist.github.com/wangkuiyi/0c22c7b1bd3bb7eb27d76f85c3a3e329).
-We sincerely appreciate your contribution.  This document explains our workflow and work style.
-## Workflow
-PaddlePaddle uses this [Git branching model](http://nvie.com/posts/a-successful-git-branching-model/).  The following steps guide usual contributions.
-1. Fork
-   Our development community has been growing fast; it doesn't make sense for everyone to write into the official repo.  So, please file Pull Requests from your fork.  To make a fork,  just head over to the GitHub page and click the ["Fork" button](https://help.github.com/articles/fork-a-repo/).
-1. Clone
-   To make a copy of your fork on your local computer, please run
-   ```bash
-   git clone https://github.com/your-github-account/paddle
-   cd paddle
-   ```
-1. Create the local feature branch
-   For daily work such as adding a new feature or fixing a bug, please open your feature branch before coding:
-   ```bash
-   git checkout -b my-cool-stuff
-   ```
-1. Commit
-   Before issuing your first `git commit` command, please install [`pre-commit`](http://pre-commit.com/) by running the following commands:
-   ```bash
-   pip install pre-commit
-   pre-commit install
-   ```
-   Our pre-commit configuration requires clang-format 3.8 for auto-formatting C/C++ code and yapf for Python.
-   Once installed, `pre-commit` checks the style of code and documentation in every commit.  We will see something like the following when you run `git commit`:
-   ```
-   ➜  git commit
-   CRLF end-lines remover...............................(no files to check)Skipped
-   yapf.................................................(no files to check)Skipped
-   Check for added large files..............................................Passed
-   Check for merge conflicts................................................Passed
-   Check for broken symlinks................................................Passed
-   Detect Private Key...................................(no files to check)Skipped
-   Fix End of Files.....................................(no files to check)Skipped
-   clang-formater.......................................(no files to check)Skipped
-   [my-cool-stuff c703c041] add test file
-    1 file changed, 0 insertions(+), 0 deletions(-)
-    create mode 100644 233
-   ```
-	NOTE: The `yapf` installed by `pip install pre-commit` and `conda install -c conda-forge pre-commit` is slightly different. Paddle developers use `pip install pre-commit`.
-1. Build and test
-   Users can build PaddlePaddle natively on Linux and Mac OS X.  But to unify the building environment and to make it easy for debugging, the recommended way is [using Docker](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/build_en.md).
-1. Keep pulling
-   An experienced Git user pulls from the official repo often -- daily or even hourly, so they notice conflicts with others' work early, and it's easier to resolve smaller conflicts.
-   ```bash
-   git remote add upstream https://github.com/PaddlePaddle/Paddle
-   git pull upstream develop
-   ```
-1. Push and file a pull request
-   You can "push" your local work into your forked repo:
-   ```bash
-   git push origin my-cool-stuff
-   ```
-   The push allows you to create a pull request, requesting owners of this [official repo](https://github.com/PaddlePaddle/Paddle) to pull your change into the official one.
-   To create a pull request, please follow [these steps](https://help.github.com/articles/creating-a-pull-request/).
-   If your change is for fixing an issue, please write ["Fixes <issue-URL>"](https://help.github.com/articles/closing-issues-using-keywords/) in the description section of your pull request.  GitHub will close the issue when the owners merge your pull request.
-   Please remember to specify some reviewers for your pull request.  If you don't know who the right ones are, please follow GitHub's recommendation.
-1. Delete local and remote branches
-   To keep your local workspace and your fork clean, you might want to remove merged branches:
-   ```bash
-   git push origin :my-cool-stuff
-   git checkout develop
-   git pull upstream develop
-   git branch -d my-cool-stuff
-   ```
-### Code Review
-  Please feel free to ping your reviewers by sending them the URL of your pull request via IM or email.  Please do this after your pull request passes the CI.
- Please answer every reviewer comment.  If you follow the comment, please write "Done"; otherwise, please give a reason.
- If you don't want your reviewers to get overwhelmed by email notifications, you might reply to their comments [in a batch](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/).
- Reduce unnecessary commits.  Some developers commit often.  It is recommended to fold a sequence of small changes into one commit by running `git commit --amend` instead of `git commit`.
-## Coding Standard
-### Code Style
-Our C/C++ code follows the [Google style guide](http://google.github.io/styleguide/cppguide.html).
-Our Python code follows the [PEP8 style guide](https://www.python.org/dev/peps/pep-0008/).
-Our build process helps to check the code style.  In [`build.sh`](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/paddle/scripts/docker/build.sh#L42), the entry point of our [builder Docker image](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/Dockerfile#L88), the CMake argument `WITH_STYLE_CHECK` is set to `ON` by default.  This flag is on
-Please install pre-commit, which automatically reformats the changes to C/C++ and Python code whenever we run `git commit`.  To check the whole codebase, we can run the command `pre-commit run -a`, as in the [`check_style.sh` file](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/paddle/scripts/travis/check_style.sh#L30), which is invoked by [our Travis CI configuration](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/.travis.yml#L43).
-### Unit Tests
-Please remember to add related unit tests.
- For C/C++ code, please follow [`google-test` Primer](https://github.com/google/googletest/blob/master/googletest/docs/Primer.md).
- For Python code, please use [Python's standard `unittest` package](http://pythontesting.net/framework/unittest/unittest-introduction/).
-### Writing Logs
-We use [glog](https://github.com/google/glog) for logging in our C/C++ code.
-For general information, please use `LOG`.  For debug information, please use [`VLOG`](http://htmlpreview.github.io/?https://github.com/google/glog/blob/master/doc/glog.html#verbose).  The reason is explained [here](https://groups.google.com/a/chromium.org/d/msg/chromium-dev/3NDNd1KzXeY/AZKMMx37fdQJ).
-`VLOG` requires a *verbose level* parameter.  For example:
-```c++
-VLOG(3) << "Operator FC is taking " << num_inputs << " inputs.";
-```
-When we run a PaddlePaddle application or test, we can specify a verbose threshold.  For example:
-```bash
-GLOG_vmodule=buddy_allocator=2 \
-GLOG_v=10 \
-python \
-../python/paddle/v2/framework/tests/test_recurrent_op.py
-```
-This enables VLOG messages from `buddy_allocator.{h,cc}` in the verbose range 0 to 2 and, for all other files, in the range 0 to 10, so you will see the example VLOG message above, which is at level 3.  This suggests that we output overall messages at lower verbose levels, so they are more likely to be displayed.  When coding in C++, please follow the verbose level convention below:
- verbose level 1: [framework](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/framework)
- verbose level 3: [operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators)
- verbose level 5: [memory](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/memory), [platform](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/platform)
- verbose level 7: [math](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/legacy/math)
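As a rough illustration of the verbosity rules described above, the sketch below models how a global threshold (`GLOG_v`) and a per-module override (`GLOG_vmodule`) decide whether a `VLOG(n)` message is printed. It is a simplified Python model, not glog's implementation: the function name `vlog_is_on` and the module name `fc_op` are hypothetical, and real glog also accepts glob patterns in `GLOG_vmodule`, which this sketch ignores.

```python
def vlog_is_on(level, module, global_v=0, vmodule=None):
    """Return True if VLOG(level) in `module` would be printed.

    Simplified model (assumption): a per-module entry in `vmodule`
    overrides the global threshold; otherwise `global_v` applies.
    """
    overrides = vmodule or {}
    threshold = overrides.get(module, global_v)
    return level <= threshold


# Settings mirroring the example: GLOG_vmodule=buddy_allocator=2, GLOG_v=10
settings = {"global_v": 10, "vmodule": {"buddy_allocator": 2}}

# A level-3 message in a hypothetical operator file is shown (3 <= 10).
print(vlog_is_on(3, "fc_op", **settings))
# A level-3 message in buddy_allocator is hidden (3 > 2).
print(vlog_is_on(3, "buddy_allocator", **settings))
```

Under this model, messages at low verbose levels pass more thresholds, which is why the convention reserves low levels for high-level components.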
--- a/doc/fluid/dev/releasing_process_cn.md
+++ b/doc/fluid/dev/releasing_process_cn.md
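 # PaddlePaddle Release Process
-PaddlePaddle manages branches with the git-flow branching model and uses the [Semantic Versioning](http://semver.org/) standard for PaddlePaddle version numbers.
+PaddlePaddle uses Trunk Based Development and the [Semantic Versioning](http://semver.org/) standard for PaddlePaddle version numbers.
 Every new PaddlePaddle release follows the process below:
 1. Create a new branch from the `develop` branch, named `release/[version]`. For example, `release/0.10.0`.
-1. Tag the new branch; the tag is `[version]rc.[patch]`. The first tag is `0.10.0rc1`, the second is `0.10.0rc2`, and so on.
+2. Tag the new branch; the tag is `[version]rc-[patch]`. For example, the first tag is `0.10.0-rc0`.
-1. For commits on this version, do the following:
+3. The new branch generally does not accept new features or optimizations. QA tests on the release branch. Developers keep working on the latest develop.
-  * Use the Regression Test List as a checklist to verify the correctness of this release.
+4. Bugs found by QA and developers are fixed and verified on develop, then the fixes are cherry-picked to the release branch, until the release branch is relatively stable.
-	  * If it fails, record all failing cases, fix all the bugs in this `release/[version]` branch, increment the patch number, and go back to step 2.
+5. If needed, put a new tag on the latest code of the release branch, such as `0.10.0-rc1`, so that more users can join the testing. Repeat steps 3-4.
-	* Update the version information in `python/setup.py.in` and set the `istaged` field to `True`.
+6. Once the release branch is stable, put the official release tag on it, such as `0.10.0`.
-	* Publish the python wheel package of this version to pypi.
+7. Publish the python wheel package of this version to pypi.
-	* Update the Docker images (see the operational details below).
+8. Update the Docker images (see the operational details below).
-1. After step 3 is complete, merge the `release/[version]` branch into the master branch and tag the merge commit on master with `[version]`. Then merge the `master` branch into the `develop` branch.
-1. Collaborate on writing the Release Note.
 Notes:
-* Once the `release/[version]` branch is created, merging from `develop` into `release/[version]` is generally not allowed. This keeps the feature set of the `release/[version]` branch closed, which makes it easier for testers to test PaddlePaddle's behavior.
+* Bug fixes must be made on develop first and then brought into the release branch, rather than being developed directly on the release branch.
-* While the `release/[version]` branch exists, any bugfix branch must be merged into all three branches: `master`, `develop`, and `release/[version]`.
+* In principle, the release branch only accepts fixes and does not accept new features.
 ## Publishing wheel packages to pypi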
@@ -61,24 +60,21 @@ docker push [镜像]:[version]
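 ## PaddlePaddle Branching Conventions
-PaddlePaddle development follows the [git-flow](http://nvie.com/posts/a-successful-git-branching-model/) branching conventions, with some differences to fit GitHub's features.
+PaddlePaddle development follows the [Trunk Based Development](https://trunkbaseddevelopment.com/) conventions.
-* The main PaddlePaddle repository follows the [git-flow](http://nvie.com/posts/a-successful-git-branching-model/) branching conventions, where:
-	* The `master` branch is the stable branch. Every version on the `master` branch has passed unit tests and regression tests.
-	* The `develop` branch is the development branch. Every version on the `develop` branch has passed unit tests but not regression tests.
-	* The `release/[version]` branch is a temporary branch created for each release. Code at this stage is undergoing regression testing.
-* Other users' forks do not need to strictly follow the [git-flow](http://nvie.com/posts/a-successful-git-branching-model/) branching conventions, but every branch of every fork is effectively a feature branch.
+* The `develop` branch is the development branch. Every version on the `develop` branch has passed unit tests and also goes through model regression tests.
-	* Recommendation: in a developer's fork, use the `develop` branch to sync with the main repository's `develop` branch.
+* The `release/[version]` branch is a temporary branch created for each release. It is mainly used for testing, bug fixing, and the final release.
-	* Recommendation: in a developer's fork, create your own feature branches from `develop`.
+* The `master` branch has been deprecated for historical reasons.
-	* When a feature branch is finished, submit a `Pull Request` to the main PaddlePaddle repository for code review.
-		* During review, developers who revise their code can keep committing to their own feature branch.
-* BugFix branches are also maintained in the developer's own fork. Unlike feature branches, a BugFix branch needs to open a `Pull Request` against each of the main repository's `master`, `develop`, and, if present, `release/[version]` branches.
+* Feature branches in other developers' forks.
+	* Recommendation: a developer's feature branch should stay in sync with the main repository's `develop` branch.
+	* Recommendation: a developer's feature branch should be based on the main repository's `develop` branch.
+	* When a feature branch is finished, submit a `Pull Request` to the main PaddlePaddle repository for code review.
+		* During review, developers who revise their code can keep committing to their own feature branch.
 ## PaddlePaddle Regression Test List
-This list describes the features that must be tested before a PaddlePaddle release.
+TODO
 ### All chapters of PaddlePaddle Book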

--- a/doc/fluid/dev/releasing_process_en.md
+++ b/doc/fluid/dev/releasing_process_en.md
 # PaddlePaddle Releasing Process
-PaddlePaddle manages its branches using "git-flow branching model", and [Semantic Versioning](http://semver.org/) as its version number semantics.
+PaddlePaddle manages its branches using Trunk Based Development, and [Semantic Versioning](http://semver.org/) as its version number semantics.
 Each time we release a new PaddlePaddle version, we should follow the below steps:
-1. Fork a new branch from `develop` named `release/[version]`, e.g. `release/0.10.0`.
+1. Create a new release branch from `develop`, named `release/[version]`. E.g., `release/0.10.0`.
-1. Push a new tag on the release branch, the tag name should be like `[version]rc.patch`. The
+2. Create a new tag for the release branch, tag format: `version-rc.Patch`. E.g., the first tag is `0.10.0-rc0`.
-   first tag should be `0.10.0rc1`, and the second should be `0.10.0.rc2` and so on.
+3. The new release branch normally doesn't accept new features or optimizations. QA tests on the release branch. Developers keep developing on the `develop` branch.
-1. After that, we should do:
+4. If QA or developers find bugs, they should first fix and verify them on the `develop` branch, then cherry-pick the fixes to the release branch. Repeat until the release branch is stable.
-  * Run all regression test on the Regression Test List (see PaddlePaddle TeamCity CI), to confirm
+5. If necessary, create a new tag on the release branch, e.g. `0.10.0-rc1`. Involve more users to try it and repeat steps 3-4.
-      that this release has no major bugs.
+6. After the release branch is stable, create the official release tag, such as `0.10.0`.
-        * If regression test fails, we must fix those bugs and create a new `release/[version]`
+7. Release the python wheel package to pypi.
-          branch from previous release branch.
+8. Update the docker image (More details below).
-    * Modify `python/setup.py.in`, change the version number and change `ISTAGED` to `True`.
-    * Publish PaddlePaddle release wheel packages to pypi (see below instructions for detail).
+NOTE:
-    * Update the Docker images (see below instructions for detail).
-1. After above step, merge `release/[version]` branch to master and push a tag on the master commit,
+* Bug fixes should happen on the `develop` branch and then be cherry-picked to the release branch. Avoid developing directly on the release branch.
-   then merge `master` to `develop`.
-1. Update the Release Note.          
+* The release branch normally only accepts bug fixes. Don't add new features.
-***NOTE:***
-* Do ***NOT*** merge commits from develop branch to release branches to keep the release branch contain
-  features only for current release, so that we can test on that version.
-* If we want to fix bugs on release branches, we must merge the fix to master, develop and release branch.
 ## Publish Wheel Packages to pypi
@@ -97,26 +92,22 @@ You can then checkout the latest pushed tags at https://hub.docker.com/r/paddlep
 ## Branching Model
-We use [git-flow](http://nvie.com/posts/a-successful-git-branching-model/) as our branching model,
+PaddlePaddle uses [Trunk Based Development](https://trunkbaseddevelopment.com/) as our branching model.
-with some modifications:
+* `develop` branch is used for development. Each commit to the `develop` branch goes through unit tests and model regression tests.
-* `master` branch is the stable branch. Each version on the master branch is tested and guaranteed.
+* `release/[version]` branch is used for each release. The release branch is used for tests, bug fixes and the eventual release.
-* `develop` branch is for development. Each commit on develop branch has passed CI unit test, but no
+* `master` branch has been deprecated for historical reasons.
-  regression tests are run.
-* `release/[version]` branch is used to publish each release. Latest release version branches have
+* Developer's feature branch.
-  bugfix only for that version, but no feature updates.
+	* Developer's feature branch should sync with upstream `develop` branch.
-* Developer forks are not required to follow
+	* Developer's feature branch should be forked from upstream `develop` branch.
-  [git-flow](http://nvie.com/posts/a-successful-git-branching-model/)
+	* After feature branch is ready, create a `Pull Request` against the Paddle repo and go through code review.
-  branching model, all forks is like a feature branch.
+	   * During the review process, developers modify their code and push to their own feature branch.
-    * Advise: developer fork's develop branch is used to sync up with main repo's develop branch.
-    * Advise: developer use it's fork's develop branch to for new branch to start developing.
-  * Use that branch on developer's fork to create pull requests and start reviews.
-      * developer can push new commits to that branch when the pull request is open.
-* Bug fixes are also started from developers forked repo. And, bug fixes branch can merge to
-  `master`, `develop` and `releases`.
 ## PaddlePaddle Regression Test List
+TODO
 ### All Chapters of PaddlePaddle Book
 We need to guarantee that all the chapters of PaddlePaddle Book can run correctly. Including

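The release-candidate tagging in steps 2 and 5 of the new release process can be sketched as a small helper. This is only an illustration, assuming the `[version]-rc[N]` form shown in the examples (`0.10.0-rc0`, `0.10.0-rc1`); the function name `next_rc_tag` is hypothetical and is not part of any PaddlePaddle tooling.

```python
import re


def next_rc_tag(version, existing_tags):
    """Return the next release-candidate tag for `version`.

    Assumes tags look like '<version>-rc<N>' with N starting at 0,
    as in the examples '0.10.0-rc0' and '0.10.0-rc1'.
    """
    pattern = re.compile(r"^" + re.escape(version) + r"-rc(\d+)$")
    numbers = []
    for tag in existing_tags:
        match = pattern.match(tag)
        if match:
            numbers.append(int(match.group(1)))
    next_number = max(numbers) + 1 if numbers else 0
    return "{}-rc{}".format(version, next_number)


# First candidate on a fresh release branch
print(next_rc_tag("0.10.0", []))
# Next candidate after cherry-picking fixes (steps 3-4 repeated)
print(next_rc_tag("0.10.0", ["0.10.0-rc0"]))
```

Once the release branch is stable, the official tag is simply the bare version (for example `0.10.0`), so no helper is needed for step 6.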
--- a/doc/fluid/dev/versioning_en.md
+++ b/doc/fluid/dev/versioning_en.md
+# Versioning (Work In Progress)
+PaddlePaddle framework follows Semantic Versioning 2.0 (semver).
+Each release has version of the following format: MAJOR.MINOR.PATCH
+(e.g. 1.2.0). Some key points:
+ * A major version number change can introduce backward-incompatible changes. Code that works in the old version doesn’t necessarily work in the new version. In addition, data generated by the previous major version, such as program models and checkpointed parameters, might not work in the new version. We will try to build tools to help with release migration.
+ * A minor version number change always maintains backward compatibility. It normally contains compatible improvements and bug fixes.
+ * A patch number change is for bug fixes.
+ * Violations of this policy are considered bugs and should be fixed.
+### What is Covered
+* All public documented Python APIs, excluding those that live in the contrib namespace.
+### What is Not Covered
+* If an API’s implementation has bugs, we reserve the right to fix the bugs and change the behavior.
+* The Python APIs in contrib namespace.
+* The Python functions and classes that start with ‘_’.
+* The offline tools.
+* The data generated by the framework, such as serialized Program model files and checkpointed variables, is subject to a different versioning scheme described below.
+* C++ Inference APIs. (To be covered)
+## Data
+Data refers to the artifacts generated by the framework. Here, we specifically mean the model Program file and the checkpointed variables.
+* Backward Compatibility: A user sometimes generates Data with PaddlePaddle version 1.1 and expects it to be consumed by PaddlePaddle version 1.2.
+  This can happen when a new online system wants to serve an old model trained previously.
+* Forward Compatibility: A user sometimes generates Data with PaddlePaddle version 1.2 and expects it to be consumed by PaddlePaddle version 1.1.
+  This can happen when a new, successful research model needs to be served by an old online system that is not frequently upgraded.
+### Versioning
+Data version: Data is assigned an integer version number. The version is increased when an incompatible change is introduced.
+The PaddlePaddle framework supports an interval of Data versions. Within the same major version (semver), the PaddlePaddle framework cannot drop support for a lower Data version. Hence, a minor version change cannot drop support for a Data version.
+For example, PaddlePaddle version 1.1 supports Program versions 3 to 5. Later, the Program version is increased from 5 to 6 due to the addition of an attribute. As a result, PaddlePaddle version 1.1 won’t be able to consume it. PaddlePaddle 1.2 should support Program versions 3 to 6. PaddlePaddle cannot drop support for Program version 3 until PaddlePaddle version 2.0.
+### Known Issues
+Currently, forward compatibility for new Data version is best-effort.
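The Data versioning rule above can be made concrete with a small sketch. It is an illustration only: the table `SUPPORTED_PROGRAM_VERSIONS` and the functions `can_consume` and `keeps_backward_compat` are hypothetical names, and the numbers simply mirror the example in the text (1.1 supports Program versions 3 to 5, 1.2 supports 3 to 6); they are not real PaddlePaddle data.

```python
# Hypothetical table: (major, minor) framework version -> inclusive
# (lowest, highest) Program version it can read. Values mirror the
# example in the text only.
SUPPORTED_PROGRAM_VERSIONS = {
    (1, 1): (3, 5),
    (1, 2): (3, 6),
}


def can_consume(framework, program_version, table=SUPPORTED_PROGRAM_VERSIONS):
    """True if `framework` can read Data saved with `program_version`."""
    low, high = table[framework]
    return low <= program_version <= high


def keeps_backward_compat(table):
    """Check the rule: within one major version, a later minor version
    must not raise the lowest supported Data version."""
    previous_low = {}
    for major, minor in sorted(table):
        low, _ = table[(major, minor)]
        if major in previous_low and low > previous_low[major]:
            return False
        previous_low[major] = low
    return True


# Forward compatibility is best-effort: 1.1 cannot read Program version 6.
print(can_consume((1, 1), 6))
# Backward compatibility: 1.2 still reads Program version 3.
print(can_consume((1, 2), 3))
```

A check like `keeps_backward_compat` could run in CI against whatever table the framework actually maintains, so that a minor release cannot silently drop an older Data version.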
--- a/doc/fluid/dev/write_docs_cn.rst
+++ b/doc/fluid/dev/write_docs_cn.rst
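-################################
+../../v2/dev/write_docs_cn.rst
-How to Contribute Documentation
\ No newline at end of file
-################################
-PaddlePaddle's documentation has a Chinese part and an English part. Both are built by ``cmake`` driving ``sphinx``. The PaddlePaddle.org tool can help with this build process and provides a better preview.
-How to Build the Documentation
-================================
-There are two ways to build the PaddlePaddle documentation: with the paddlepaddle.org tool and without it. Each has its own advantages: the former makes previewing easy, the latter makes debugging easy for developers. Each way can in turn be used with or without docker.
-We recommend building the documentation with the PaddlePaddle.org tool.
-Using the PaddlePaddle.org Tool
---------------------------------
-This is the currently recommended method. Besides compiling the documentation automatically, it lets you preview the documentation directly in a web page. Note that although the other methods described later can also preview the documentation, its style will not match the official website; only compiling with the PaddlePaddle.org tool produces a preview consistent with the official website's style.
-The PaddlePaddle.org tool can be used together with Docker, which must be installed first. For installing Docker, see `Docker's official website <https://docs.docker.com/>`_ . Once Docker is installed, start the tool with the following commands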
-..  code-block:: bash
-    mkdir paddlepaddle # Create paddlepaddle working directory
-    cd paddlepaddle
-    # Clone the content repositories
-    git clone https://github.com/PaddlePaddle/Paddle.git
-    git clone https://github.com/PaddlePaddle/book.git
-    git clone https://github.com/PaddlePaddle/models.git
-    git clone https://github.com/PaddlePaddle/Mobile.git
-    # Please specify the working directory through -v
-    docker run -it -p 8000:8000 -v `pwd`:/var/content paddlepaddle/paddlepaddle.org:latest
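-Note: PaddlePaddle.org runs commands against the content repositories specified by -v (volume).
-Then open http://localhost:8000 in a web browser to generate the documentation you need from the web page.
-The compiled files are stored in the working directory <paddlepaddle working directory>/.ppo_workspace/content.
-If you do not want to use Docker, you can also start the tool's server directly by running the Django framework. Use the commands below to run it.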
-..  code-block:: bash
-    mkdir paddlepaddle # Create paddlepaddle working directory
-    cd paddlepaddle
-    # Clone the content repositories and PaddlePaddle.org
-    git clone https://github.com/PaddlePaddle/Paddle.git
-    git clone https://github.com/PaddlePaddle/book.git
-    git clone https://github.com/PaddlePaddle/models.git
-    git clone https://github.com/PaddlePaddle/Mobile.git
-    git clone https://github.com/PaddlePaddle/PaddlePaddle.org.git
-    # Please specify the PaddlePaddle working directory. In the current setting, it should be pwd
-    export CONTENT_DIR=<path_to_paddlepaddle_working_directory>
-    export ENV=''
-    cd PaddlePaddle.org/portal/
-    pip install -r requirements.txt
-    python manage.py runserver
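-The tool's server reads the environment variable CONTENT_DIR to locate the repositories. Set CONTENT_DIR to the PaddlePaddle working directory.
-Then open http://localhost:8000 in a web browser to generate the documentation you need from the web page.
-The compiled files are stored in the working directory <paddlepaddle working directory>/.ppo_workspace/content.
-For more details about the PaddlePaddle.org tool, `click here <https://github.com/PaddlePaddle/PaddlePaddle.org/blob/develop/README.cn.md>`_ .
-Without the PaddlePaddle.org Tool
------------------------------------
-Building the PaddlePaddle documentation with Docker requires Docker to be installed first. For installing Docker, see `Docker's official website <https://docs.docker.com/>`_ . This method is similar to `building PaddlePaddle from source <http://paddlepaddle.org/docs/develop/documentation/zh/build_and_install/build_from_source_cn.html>`_ : build from source a Docker image that can compile the PaddlePaddle documentation, run it, and then, inside the Docker container, build the documentation with the script in the source tree. The steps are as follows: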
-.. code-block:: bash
-   git clone https://github.com/PaddlePaddle/Paddle.git
-   cd Paddle
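-   # Build from source a Docker image that can compile the PaddlePaddle documentation
-   docker build -t paddle:dev .
-   docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" -e "WITH_DOC=ON" paddle:dev /bin/bash
-   # Inside the Docker container, build the PaddlePaddle documentation with the build.sh script
-   bash -x /paddle/paddle/scripts/docker/build.sh
-Note: the commands above map the current directory (the source root) to the :code:`/paddle` directory in the container.
-After the build finishes, two directories, ``doc/v2`` and ``doc/fluid`` , are produced. Each contains three generated subdirectories: ``cn/html/`` , ``en/html`` and ``api/en/html`` . Enter each of these directories and run the following command: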
-.. code-block:: bash
-   python -m SimpleHTTPServer 8088
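-Open http://localhost:8088 in a browser to see the generated Chinese/English documentation pages and the English API pages for both ``v2`` and ``fluid`` .
-If you do not want to use Docker, you can also build the PaddlePaddle documentation directly with the following commands: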
-.. code-block:: bash
-   git clone https://github.com/PaddlePaddle/Paddle.git
-   cd Paddle
-   mkdir -p build
-   cd build
-   cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_GPU=OFF -DWITH_MKL=OFF -DWITH_DOC=ON
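-   # To build only the usage documentation, run the following command
-   make -j $processors paddle_docs
-   # To build only the API documentation, run the following command
-   make -j $processors paddle_apis
-Here $processors is the number of parallel build processes, typically as many as there are CPU cores; set it according to the number of CPU cores on your machine.
-After the build finishes, the ``doc/v2`` and ``doc/fluid`` directories are likewise produced. If you built the documentation, each contains the ``cn/html/`` and ``en/html`` subdirectories; if you built the API, each contains the ``api/en/html`` directory. Enter each of these subdirectories and run the following command: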
-.. code-block:: bash
-   python -m SimpleHTTPServer 8088
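-Open http://localhost:8088 in a browser to see the generated Chinese/English documentation pages and the English API pages for both ``v2`` and ``fluid`` . The figure below shows the home page of the generated ``v2`` English documentation. Note that the example uses the original sphinx theme, so the page style differs from the official website, but this does not affect debugging.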
-..  image:: src/doc_en.png
-    :align: center
-    :scale: 60 %
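-How to Write Documentation
-============================
-The PaddlePaddle documentation is generated automatically with `sphinx`_ ; see the sphinx tutorial for how to write it.
-How to Update www.paddlepaddle.org
-====================================
-Documentation updates are submitted to github as PRs; see `How to Contribute Documentation <http://www.paddlepaddle.org/docs/develop/documentation/zh/dev/write_docs_cn.html>`_ for how to submit.
-Documentation for PaddlePaddle's develop branch is currently updated automatically. You can view the latest `Chinese documentation <http://www.paddlepaddle.org/docs/develop/documentation/zh/getstarted/index_cn.html>`_ and
-`English documentation <http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/index_en.html>`_ .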
-..  _cmake: https://cmake.org/
-..  _sphinx: http://www.sphinx-doc.org/en/1.4.8/
--- a/doc/fluid/dev/write_docs_en.rst
+++ b/doc/fluid/dev/write_docs_en.rst
-########################
+../../v2/dev/write_docs_en.rst
-Contribute Documentation
\ No newline at end of file
-########################
-PaddlePaddle's documentation includes both Chinese and English versions. The documentation is built using the ``cmake`` command to drive the ``sphinx`` compiler. The PaddlePaddle.org tool helps us to implement this compilation process and provides better preview results.
-How to build Documentation
-===========================
-PaddlePaddle's documentation is built in two ways: using the PaddlePaddle.org tool and without using it. Both methods have their own advantages. The former facilitates previewing, while the latter facilitates debugging by the developer. We could choose to build the documentation with Docker or without it in each of the above ways.
-We recommend using PaddlePaddle.org tool to build documentation.
-Using PaddlePaddle.org tool
-----------------------------
-This is the recommended method to build documentation, because it can automatically compile the documentation and preview the documentation directly in a web page. Note that, although you can preview the documentation in other ways, its style may not be consistent with the official website. Compiling with the PaddlePaddle.org tool produces a preview that will be consistent with the official website documentation style.
-The PaddlePaddle.org tool can be used with Docker and Docker needs to be installed first. Please refer to `Docker's official website <https://docs.docker.com/>`_ on how to install Docker. After installing Docker, you may use the following commands to activate the tool
-..  code-block:: bash
-    mkdir paddlepaddle # Create paddlepaddle working directory
-    cd paddlepaddle
-    # Clone the content repositories. You may only clone the contents you need
-    git clone https://github.com/PaddlePaddle/Paddle.git
-    git clone https://github.com/PaddlePaddle/book.git
-    git clone https://github.com/PaddlePaddle/models.git
-    git clone https://github.com/PaddlePaddle/Mobile.git
-    # Please specify the working directory through -v
-    docker run -it -p 8000:8000 -v `pwd`:/var/content paddlepaddle/paddlepaddle.org:latest
-Note: PaddlePaddle.org will read the content repos specified in the -v (volume) flag of the docker run commands
-Use a web browser and navigate to http://localhost:8000. Click the buttons to compile the documentation.
-The compiled documentation will be stored in <paddlepaddle working directory>/.ppo_workspace/content
-If you don't wish to use Docker, you can also activate the tool through Django. Use the following commands to set it up
-..  code-block:: bash
-    mkdir paddlepaddle # Create paddlepaddle working directory
-    cd paddlepaddle
-    # Clone the content repositories and PaddlePaddle.org
-    git clone https://github.com/PaddlePaddle/Paddle.git
-    git clone https://github.com/PaddlePaddle/book.git
-    git clone https://github.com/PaddlePaddle/models.git
-    git clone https://github.com/PaddlePaddle/Mobile.git
-    git clone https://github.com/PaddlePaddle/PaddlePaddle.org.git
-    # Please specify the PaddlePaddle working directory. In the current setting, it should be pwd
-    export CONTENT_DIR=<path_to_paddlepaddle_working_directory>
-    export ENV=''
-    cd PaddlePaddle.org/portal/
-    pip install -r requirements.txt
-    python manage.py runserver
-Specify the PaddlePaddle working directory for the environment variable CONTENT_DIR so that the tool could find where the working directory is.
-Use a web browser and navigate to http://localhost:8000. Click the buttons to compile the documentation
-The compiled documentation will be stored in <paddlepaddle working directory>/.ppo_workspace/content
-Please `click here <https://github.com/PaddlePaddle/PaddlePaddle.org/blob/develop/README.md>`_ for more information about the PaddlePaddle.org tool.
-Manually Building the Documentation
-------------------------------------
-To build PaddlePaddle's documentation with Docker, you need to install Docker first. Please refer to `Docker's official website <https://docs.docker.com/>`_ on how to install Docker. This method is quite similar to `Build From Sources <http://paddlepaddle.org/docs/develop/documentation/en/build_and_install/build_from_source_en.html>`_ : construct, from source code, a Docker image that can be used to build the PaddlePaddle documentation, then enter the Docker container and use the script ``build.sh`` in the source directory to build the documentation. The specific steps are as follows:
-.. code-block:: bash
-   git clone https://github.com/PaddlePaddle/Paddle.git
-   cd Paddle
-   # Construct a docker image from source code
-   docker build -t paddle:dev .
-   docker run -it -v $PWD:/paddle -e "WITH_GPU=OFF" -e "WITH_TESTING=OFF" -e "WITH_DOC=ON" paddle:dev /bin/bash
-   # Use build.sh to build PaddlePaddle documentation
-   bash -x /paddle/paddle/scripts/docker/build.sh
-Note: The above commands map the current directory (source root directory) to the :code:`/paddle` directory in the container.
-After compiling, there should be two generated directories: ``doc/v2`` and ``doc/fluid``, each containing three generated subdirectories: ``cn/html/``, ``en/html`` and ``api/en/html``. Enter each of these directories and execute the following command:
-.. code-block:: bash
-   python -m SimpleHTTPServer 8088
-Use a web browser and navigate to http://localhost:8088 to see the compiled Chinese/English documents pages and English API pages of ``v2`` and ``fluid``.
-If you do not wish to use Docker, you can also use the following commands to directly build the PaddlePaddle documentation.
-.. code-block:: bash
-   git clone https://github.com/PaddlePaddle/Paddle.git
-   cd Paddle
-   mkdir -p build
-   cd build
-   cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_GPU=OFF -DWITH_MKL=OFF -DWITH_DOC=ON
-   # If you only need to build documents, use the following commands
-   make -j $processors paddle_docs
-   # If you only need to build APIs, use the following commands
-   make -j $processors paddle_apis
-$processors is the number of parallel compile processes; set it to the number of CPU cores of your machine.
-After compiling, there should again be two generated directories: ``doc/v2`` and ``doc/fluid``. If you chose to build documents, two subdirectories, ``cn/html/`` and ``en/html``, will be generated in each of them. If you chose to build APIs, a subdirectory ``api/en/html`` will be generated. Enter each of these directories and execute the following command:
-.. code-block:: bash
-   python -m SimpleHTTPServer 8088
-Use a web browser and navigate to http://localhost:8088 to see the compiled Chinese/English documents pages and English API pages of ``v2`` and ``fluid``. The following figure shows an example of the built English documents home page of ``v2``. Note that because the example uses sphinx's original theme, the page style differs from the official website, but this does not affect debugging.
-..  image:: src/doc_en.png
-    :align: center
-    :scale: 60 %
-How to write Documentation
-===========================
-PaddlePaddle uses `sphinx`_ to compile documentation. Please check the sphinx official website for more details.
-How to update www.paddlepaddle.org
-===================================
-Please create PRs and submit them to GitHub; see `Contribute Code <http://www.paddlepaddle.org/docs/develop/documentation/en/howto/dev/contribute_to_paddle_en.html>`_ .
-The documentation is updated from the PaddlePaddle develop branch once the PR is merged. Users may check the latest `Chinese Docs <http://www.paddlepaddle.org/docs/develop/documentation/zh/getstarted/index_cn.html>`_ and
-`English Docs <http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/index_en.html>`_ .
-..  _cmake: https://cmake.org/
-..  _sphinx: http://www.sphinx-doc.org/en/1.4.8/
--- a/doc/fluid/user_guides/howto/debug/visualdl.md
+++ b/doc/fluid/user_guides/howto/debug/visualdl.md
@@ -104,6 +104,7 @@ visualDL --logdir=scratch_log --port=8080
 # Visit http://127.0.0.1:8080
 ```
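+If you see `TypeError: __init__() got an unexpected keyword argument 'file'`, the installed protobuf is older than 3.5; running `pip install --upgrade protobuf` resolves it.
 If you still run into installation problems inside a virtual environment, try the following.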

--- a/doc/fluid/user_guides/howto/inference/build_and_install_lib_cn.rst
+++ b/doc/fluid/user_guides/howto/inference/build_and_install_lib_cn.rst
@@ -9,13 +9,13 @@
 ======================   ========================================
 Version                  C++ inference library
 ======================   ========================================
-cpu_avx_mkl              `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxCp27cp27mu/.lastSuccessful/fluid.tgz/?branch=0.15.0>`_ 
+cpu_avx_mkl              `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxCp27cp27mu/.lastSuccessful/fluid.tgz>`_ 
-cpu_avx_openblas         `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxOpenblas/.lastSuccessful/fluid.tgz/?branch=0.15.0>`_
+cpu_avx_openblas         `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuAvxOpenblas/.lastSuccessful/fluid.tgz>`_
-cpu_noavx_openblas       `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuNoavxOpenblas/.lastSuccessful/fluid.tgz/?branch=0.15.0>`_
+cpu_noavx_openblas       `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_CpuNoavxOpenblas/.lastSuccessful/fluid.tgz>`_
-cuda7.5_cudnn5_avx_mkl   `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda75cudnn5cp27cp27mu/.lastSuccessful/fluid.tgz/?branch=0.15.0>`_
+cuda7.5_cudnn5_avx_mkl   `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda75cudnn5cp27cp27mu/.lastSuccessful/fluid.tgz>`_
-cuda8.0_cudnn5_avx_mkl   `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda80cudnn5cp27cp27mu/.lastSuccessful/fluid.tgz/?branch=0.15.0>`_
+cuda8.0_cudnn5_avx_mkl   `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda80cudnn5cp27cp27mu/.lastSuccessful/fluid.tgz>`_
-cuda8.0_cudnn7_avx_mkl   `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda8cudnn7cp27cp27mu/.lastSuccessful/fluid.tgz/?branch=0.15.0>`_
+cuda8.0_cudnn7_avx_mkl   `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda8cudnn7cp27cp27mu/.lastSuccessful/fluid.tgz>`_
-cuda9.0_cudnn7_avx_mkl   `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda90cudnn7avxMkl/.lastSuccessful/fluid.tgz/?branch=0.15.0>`_
+cuda9.0_cudnn7_avx_mkl   `fluid.tgz <https://guest:@paddleci.ngrok.io/repository/download/Manylinux1_Cuda90cudnn7avxMkl/.lastSuccessful/fluid.tgz>`_
 ======================   ========================================
 从源码编译
@@ -40,6 +40,7 @@ WITH_MKL            ON/OFF
  .. code-block:: bash
+     pip install paddlepaddle-gpu
     PADDLE_ROOT=/path/of/capi
     git clone https://github.com/PaddlePaddle/Paddle.git
     cd Paddle

--- a/doc/fluid/user_guides/howto/inference/native_infer.rst
+++ b/doc/fluid/user_guides/howto/inference/native_infer.rst
@@ -4,7 +4,7 @@ Paddle 预测 API
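 To make inference deployment simpler and more convenient, Fluid provides a set of high-level APIs
 that hide the different underlying optimization implementations.
-`Inference library code <https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/inference/api>`__
+`Inference library code <https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/inference/api>`_
 which includes
 -  the header file ``paddle_inference_api.h``, which defines all the interfaces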

--- a/doc/fluid/user_guides/howto/modification/foo.rst
+++ b/doc/fluid/user_guides/howto/modification/foo.rst
-###
-FAQ
-###
--- a/doc/fluid/user_guides/howto/prepare_data/index.rst
+++ b/doc/fluid/user_guides/howto/prepare_data/index.rst
@@ -38,7 +38,6 @@ PaddlePaddle Fluid支持两种传入数据的方式:
   :maxdepth: 2
   feeding_data
-   use_recordio_reader
 Python Reader
 #############
@@ -50,3 +49,14 @@ Python Reader
   :maxdepth: 2
   reader.md
+PyReader
+#############
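+Python Reader is a pure-Python interface: data feeding is synchronous with model training/inference, so it is relatively inefficient.
+Fluid provides PyReader for asynchronous data feeding. For details, see: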
+.. toctree::
+   :maxdepth: 2
+   use_py_reader.rst
--- a/doc/fluid/user_guides/howto/prepare_data/use_py_reader.rst
+++ b/doc/fluid/user_guides/howto/prepare_data/use_py_reader.rst
+.. _user_guide_use_py_reader:
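+############################################
+Reading Training and Test Data with PyReader
+############################################
+Paddle Fluid supports PyReader, which feeds data from the Python side to the C++ side. Unlike :ref:`user_guide_use_numpy_array_as_train_data`, with PyReader the Python-side data feeding and the C++-side :code:`Executor::Run()` data reading run asynchronously, and PyReader can work together with :code:`double_buffer_reader` to further improve data-reading performance.
+Creating a PyReader Object
+################################
+A PyReader object is created as follows: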
+.. code-block:: python
+    import paddle.fluid as fluid
+    py_reader = fluid.layers.py_reader(capacity=64,
+                                       shapes=[(-1,3,224,224), (-1,1)],
+                                       dtypes=['float32', 'int64'],
+                                       name='py_reader',
+                                       use_double_buffer=True)
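+Here, capacity is the buffer size of the PyReader object; shapes gives the shapes of the batch components (such as image and label in an image classification task); dtypes gives their data types; name is the name of the PyReader object; use_double_buffer defaults to True and enables :code:`double_buffer_reader` .
+To create several different PyReader objects (for example, the training and testing phases usually need two), each one must be given a different name. For example, the PyReader objects for the training and testing phases of the same task are created as follows: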
+.. code-block:: python
+    import paddle.fluid as fluid
+    train_py_reader = fluid.layers.py_reader(capacity=64,
+                                             shapes=[(-1,3,224,224), (-1,1)],
+                                             dtypes=['float32', 'int64'],
+                                             name='train',
+                                             use_double_buffer=True)
+    test_py_reader = fluid.layers.py_reader(capacity=64,
+                                            shapes=[(-1,3,224,224), (-1,1)],
+                                            dtypes=['float32', 'int64'],
+                                            name='test',
+                                            use_double_buffer=True)
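+Note that :code:`Program.clone()` cannot copy PyReader objects, so separate PyReader objects for the training and testing phases
+must be created as shown above.
+Because :code:`Program.clone()` cannot copy PyReader objects, users need :code:`fluid.unique_name.guard()`
+to share model parameters between the training and testing phases, as follows: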
+.. code-block:: python
+    import paddle.fluid as fluid
+    import paddle.dataset.mnist as mnist
+    import paddle.v2
+    import numpy
+    def network(is_train):
+        reader = fluid.layers.py_reader(
+            capacity=10,
+            shapes=((-1, 784), (-1, 1)),
+            dtypes=('float32', 'int64'),
+            name="train_reader" if is_train else "test_reader",
+            use_double_buffer=True)
+        img, label = fluid.layers.read_file(reader)
+        ...
+        # Here, we omitted the definition of loss of the model
+        return loss , reader
+    train_prog = fluid.Program()
+    train_startup = fluid.Program()
+    with fluid.program_guard(train_prog, train_startup):
+        with fluid.unique_name.guard():
+            train_loss, train_reader = network(True)
+            adam = fluid.optimizer.Adam(learning_rate=0.01)
+            adam.minimize(train_loss)
+    test_prog = fluid.Program()
+    test_startup = fluid.Program()
+    with fluid.program_guard(test_prog, test_startup):
+        with fluid.unique_name.guard():
+            test_loss, test_reader = network(False)
+Setting the Data Source of a PyReader Object
+############################################
+A PyReader object provides the :code:`decorate_tensor_provider` and :code:`decorate_paddle_reader` methods. Both take a Python generator :code:`generator` as the data source; they differ as follows:
+1. :code:`decorate_tensor_provider`: each item produced by :code:`generator` must be a :code:`list` or :code:`tuple` whose elements are of type :code:`LoDTensor` or Numpy arrays, and the :code:`shape` of each :code:`LoDTensor` or Numpy array must exactly match the :code:`shapes` argument given when the PyReader object was created.
+2. :code:`decorate_paddle_reader`: each item produced by :code:`generator` must be a :code:`list` or :code:`tuple` whose elements are Numpy arrays, but their :code:`shape` need not exactly match the :code:`shapes` argument given when the PyReader object was created; :code:`decorate_paddle_reader` applies a :code:`reshape` internally.
+Training and Testing a Model with PyReader
+##########################################
+The procedure (continuing the code above) is:
+.. code-block:: python
+    place = fluid.CUDAPlace(0)
+    startup_exe = fluid.Executor(place)
+    startup_exe.run(train_startup)
+    startup_exe.run(test_startup)
+    trainer = fluid.ParallelExecutor(
+        use_cuda=True, loss_name=train_loss.name, main_program=train_prog)
+    tester = fluid.ParallelExecutor(
+        use_cuda=True, share_vars_from=trainer, main_program=test_prog)
+    train_reader.decorate_paddle_reader(
+        paddle.v2.reader.shuffle(paddle.batch(mnist.train(), 512), buf_size=8192))
+    test_reader.decorate_paddle_reader(paddle.batch(mnist.test(), 512))
+    for epoch_id in xrange(10):
+        train_reader.start()
+        try:
+            while True:
+                print 'train_loss', numpy.array(
+                    trainer.run(fetch_list=[train_loss.name]))
+        except fluid.core.EOFException:
+            print 'End of epoch', epoch_id
+            train_reader.reset()
+        test_reader.start()
+        try:
+            while True:
+                print 'test loss', numpy.array(
+                    tester.run(fetch_list=[test_loss.name]))
+        except fluid.core.EOFException:
+            print 'End of testing'
+            test_reader.reset()
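+The steps are:
+1. Before each epoch starts, call :code:`start()` to launch the PyReader object;
+2. At the end of each epoch, :code:`read_file` raises :code:`fluid.core.EOFException`; after catching it, call :code:`reset()` to reset the state of the PyReader object so that the next epoch can start.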
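The start / read-until-EOF / reset cycle described in ``use_py_reader.rst`` can be illustrated without Paddle. The sketch below is a simplified, hypothetical model only: ``ToyAsyncReader``, ``EndOfData`` and ``run_epochs`` are invented names, a background thread and a bounded queue stand in for the asynchronous feeding, and ``EndOfData`` stands in for ``fluid.core.EOFException``. It is not how PyReader is implemented.

```python
import queue
import threading


class EndOfData(Exception):
    """Stand-in for fluid.core.EOFException in this sketch."""


class ToyAsyncReader:
    """Hypothetical, simplified model of the PyReader start/read/reset cycle."""

    _SENTINEL = object()

    def __init__(self, generator_fn, capacity=4):
        self._generator_fn = generator_fn  # callable returning a fresh iterator
        self._capacity = capacity          # plays the role of `capacity`
        self._queue = None
        self._thread = None

    def start(self):
        # A fresh bounded queue per epoch; a background thread fills it
        # while the consumer is busy with earlier batches.
        self._queue = queue.Queue(maxsize=self._capacity)
        self._thread = threading.Thread(
            target=self._produce, args=(self._queue,), daemon=True)
        self._thread.start()

    def _produce(self, q):
        for batch in self._generator_fn():
            q.put(batch)
        q.put(self._SENTINEL)  # marks the end of the epoch

    def read(self):
        item = self._queue.get()
        if item is self._SENTINEL:
            raise EndOfData()
        return item

    def reset(self):
        # Wait for the producer to finish and drop the per-epoch state.
        self._thread.join()
        self._queue = None
        self._thread = None


def run_epochs(reader, num_epochs):
    """Consume every batch of every epoch, mirroring the loop in the document."""
    seen = []
    for epoch_id in range(num_epochs):
        reader.start()
        try:
            while True:
                seen.append((epoch_id, reader.read()))
        except EndOfData:
            reader.reset()
    return seen
```

Calling ``run_epochs(ToyAsyncReader(lambda: iter([10, 20, 30])), 2)`` reads the three items twice, once per epoch. The point of the pattern is that the end of data is signalled by an exception rather than by the loop condition, which is why ``reset()`` belongs in the ``except`` branch.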
--- a/doc/fluid/user_guides/howto/prepare_data/use_recordio_reader.rst
+++ b/doc/fluid/user_guides/howto/prepare_data/use_recordio_reader.rst
-.. _user_guide_use_recordio_as_train_data:
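-#####################################
-Using RecordIO Files as Training Data
-#####################################
-Compared with :ref:`user_guide_use_numpy_array_as_train_data`,
-:ref:`user_guide_use_recordio_as_train_data` performs better;
-however, users must first convert the training dataset to the RecordIO file format and then use the
-:code:`fluid.layers.open_files()` layer to load the RecordIO files in the network configuration.
-Users can also use :code:`fluid.layers.double_buffer()` to speed up copying data from host memory to GPU memory,
-and the :code:`fluid.layers.Preprocessor` tool for data augmentation.
-Converting Training Data to the RecordIO Format
-###############################################
-In :code:`fluid.recordio_writer`, each record is a
-:code:`vector<LoDTensor>`, that is, an array of Tensors that supports sequence information. This array contains all the features
-needed for training. For image classification, for example, it can contain the image and the class label.
-Users can call :code:`fluid.recordio_writer.convert_reader_to_recordio_file()` to convert a
-:ref:`user_guide_reader` into a single RecordIO file, or
-:code:`fluid.recordio_writer.convert_reader_to_recordio_files()` to convert a
-:ref:`user_guide_reader` into multiple RecordIO files.
-Usage is as follows: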
-.. code-block:: python
-   import paddle
-   import paddle.fluid as fluid
-   import numpy
-   def reader_creator():
-       def __impl__():
-           for i in range(1000):
-               yield [
-                        # numpy.random.random() takes no dtype argument, so cast afterwards
-                        numpy.random.random(size=[3,224,224]).astype("float32"),
-                        numpy.random.random(size=[1]).astype("int64")
-                     ]
-       return __impl__
-   img = fluid.layers.data(name="image", shape=[3, 224, 224])
-   label = fluid.layers.data(name="label", shape=[1], dtype="int64")
-   feeder = fluid.DataFeeder(feed_list=[img, label], place=fluid.CPUPlace())
-   BATCH_SIZE = 32
-   reader = paddle.batch(reader_creator(), batch_size=BATCH_SIZE)
-   fluid.recordio_writer.convert_reader_to_recordio_file(
-      "train.recordio", feeder=feeder, reader_creator=reader)
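-Here :code:`reader_creator` creates a :code:`Reader`.
-:ref:`api_fluid_data_feeder_DataFeeder`
-is the tool that converts a :code:`Reader` into :code:`LoDTensor`. For details, see
-:ref:`user_guide_reader` .
-The program above converts the data from :code:`reader_creator` into the file :code:`train.recordio`,
-in which each record holds 32 samples. If the batch size will be adjusted during training,
-users can set the number of samples per record to 1 and refer to
-:ref:`user_guide_use_recordio_as_train_data_use_op_create_batch`.
-Configuring the Network and Opening RecordIO Files
-##################################################
-Once the RecordIO files are converted, users can call :code:`fluid.layers.open_files()`
-to open them and :code:`fluid.layers.read_file` to read their contents.
-Basic usage is as follows: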
-.. code-block:: python
-   import paddle.fluid as fluid
-   file_obj = fluid.layers.open_files(
-     filenames=["train.recordio"],
-     shape=[[3, 224, 224], [1]],
-     lod_levels=[0, 0],
-     dtypes=["float32", "int64"],
-     pass_num=100
-   )
-   image, label = fluid.layers.read_file(file_obj)
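-If :code:`pass_num` is set, the data is read again after all of it has been consumed,
-until it has been read :code:`pass_num` times.
-Advanced Usage
-##############
-Using :code:`fluid.layers.double_buffer()`
-------------------------------------------
-:code:`Double buffer` uses double buffering to copy training data from host memory to GPU memory. To configure double buffering,
-decorate the file object with :code:`fluid.layers.double_buffer()`. For example: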
-.. code-block:: python
-   import paddle.fluid as fluid
-   file_obj = fluid.layers.open_files(...)
-   file_obj = fluid.layers.double_buffer(file_obj)
-   image, label = fluid.layers.read_file(file_obj)
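-For background on double buffering, see
-`Multiple buffering <https://en.wikipedia.org/wiki/Multiple_buffering>`_ .
-Configuring Data Augmentation
------------------------------
-Use :code:`fluid.layers.Preprocessor` to configure data augmentation for the files. For example: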
-.. code-block:: python
-   import paddle.fluid as fluid
-   file_obj = fluid.layers.open_files(...)
-   preprocessor = fluid.layers.Preprocessor(reader=file_obj)
-   with preprocessor.block():
-       image, label = preprocessor.inputs()
-       image = image / 2
-       label = label + 1
-       preprocessor.outputs(image, label)
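-As the code above shows, :code:`Preprocessor` defines a data augmentation module, and the concrete augmentation operations are defined inside
-:code:`with preprocessor.block()`. Users obtain the fields of the data file through
-:code:`preprocessor.inputs()` and mark the preprocessed outputs with
-:code:`preprocessor.outputs()`.
-.. _user_guide_use_recordio_as_train_data_use_op_create_batch:
-Batching with an Op
--------------------
-Use :code:`fluid.layers.batch()` to assemble batches dynamically during training. For example: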
-.. code-block:: python
-   import paddle.fluid as fluid
-   file_obj = fluid.layers.open_files(...)
-   file_obj = fluid.layers.batch(file_obj, batch_size=32)
-   img, label = fluid.layers.read_file(file_obj)
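-Note that if the last few samples in the dataset cannot fill a batch of size :code:`batch_size`,
-they are grouped directly into one smaller batch for training.
-Shuffling Input Data
---------------------
-Use :code:`fluid.layers.shuffle()` to reorder training data dynamically during training. For example: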
-.. code-block:: python
-   import paddle.fluid as fluid
-   file_obj = fluid.layers.open_files(...)
-   file_obj = fluid.layers.shuffle(file_obj, buffer_size=8192)
-   img, label = fluid.layers.read_file(file_obj)
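-Note that:
-1. :code:`shuffle` works by
-first reading :code:`buffer_size` samples and then randomly selecting samples from them for training.
-2. The :code:`buffer_size` of :code:`shuffle` consumes training memory, so make sure there is enough memory during training
-to cache :code:`buffer_size` samples.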
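The batching and shuffling behaviour described in the removed ``use_recordio_reader.rst`` (a smaller final batch, and a shuffle that works on a buffer of ``buffer_size`` samples) can be sketched in plain Python. This is an illustration under assumptions, not the implementation of ``fluid.layers.batch()`` or ``fluid.layers.shuffle()``: the function names ``batch`` and ``buffered_shuffle`` and the ``rng`` parameter are hypothetical, and the shuffle is assumed to emit one full buffer at a time.

```python
import random


def batch(sample_iter, batch_size):
    """Group samples into lists of batch_size; the last batch may be smaller."""
    current = []
    for sample in sample_iter:
        current.append(sample)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        # Leftover samples form one final, smaller batch.
        yield current


def buffered_shuffle(sample_iter, buffer_size, rng=None):
    """Read up to buffer_size samples, shuffle them, emit them, then repeat."""
    rng = rng or random.Random()
    buffer = []
    for sample in sample_iter:
        buffer.append(sample)
        if len(buffer) == buffer_size:
            rng.shuffle(buffer)
            yield from buffer
            buffer = []
    if buffer:
        rng.shuffle(buffer)
        yield from buffer
```

For example, ``list(batch(range(7), 3))`` yields two batches of three samples followed by one batch holding the single remaining sample. Under this scheme a sample never moves outside its own buffer window, so memory use is bounded by ``buffer_size`` samples, which is the trade-off the note on memory refers to.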
--- a/doc/fluid/user_guides/howto/training/cluster_howto.rst
+++ b/doc/fluid/user_guides/howto/training/cluster_howto.rst
@@ -37,7 +37,7 @@ Fluid分布式训练使用手册
  完整的模型，并使用一部分数据进行训练，然后向pserver发送梯度，最后从pserver拉取更新后的参数。
  pserver进程可以在和trainer完全不同的计算节点上，也可以和trainer公用节点。一个分布式任务所需要的\
-  pserver进程个数通常需要根据实际情况调整，已达到最佳的性能，然而通常来说pserver的进程不会比trainer\
+  pserver进程个数通常需要根据实际情况调整，以达到最佳的性能，然而通常来说pserver的进程不会比trainer\
  更多。
  在使用GPU训练时，pserver可以选择使用GPU或只使用CPU，如果pserver也使用GPU，则会增加一次从CPU拷贝\
@@ -54,7 +54,7 @@ Fluid分布式训练使用手册
 使用parameter server方式的训练
 ------------------------------
-使用 :code:`trainer` API，程序可以自动的通过识别环境变量决定是否已分布式方式执行。
+使用 :code:`trainer` API，程序可以自动地通过识别环境变量决定是否以分布式方式执行。
 .. csv-table:: 需要在您的分布式环境中配置的环境变量包括：
   :header: "环境变量", "说明"

--- a/doc/fluid/user_guides/howto/training/cluster_quick_start.rst
+++ b/doc/fluid/user_guides/howto/training/cluster_quick_start.rst
@@ -9,110 +9,170 @@
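 In this article we show how to quickly launch a PaddlePaddle
 distributed training job on a cluster. Before starting, please prepare as follows:
-1. Prepare a cluster of at least 4 nodes with network connectivity between them. In this article we use
+1. Prepare a training cluster with network connectivity. In this article we use 4 training nodes and use ``*.paddlepaddle.com``
-   ``*.paddlepaddle.com`` to denote the hostname of each node; you can change it to match your cluster.
+   to denote the node hostnames; you can change it to match your setup.
-2. Before starting, make sure you have read :ref:`how_to_install`
+2. Before starting, make sure you have read :ref:`install_steps`
    and that PaddlePaddle runs correctly on every node of the cluster.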
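-Launching a Cluster Training Job
---------------------------------
-When launching the cluster training script, different environment variables must be set on different nodes, as follows:
-+----------------------+------+--------------------------------------------+--------------------------------------------------------------------+
-| Environment variable | Type | Example                                    | Description                                                        |
-+======================+======+============================================+====================================================================+
-| PADDLE_TRAINING_ROLE | str  | PSERVER,TRAINER                            | Role of the training node                                          |
-+----------------------+------+--------------------------------------------+--------------------------------------------------------------------+
-| PADDLE_PSERVER_IPS   | str  | ps0.paddlepaddle.com,ps1.paddlepaddle.com… | IP addresses or hostnames of all pserver nodes, comma-separated    |
-+----------------------+------+--------------------------------------------+--------------------------------------------------------------------+
-| PADDLE_PSERVER_PORT  | int  | 6174                                       | Port the pserver nodes listen on                                   |
-+----------------------+------+--------------------------------------------+--------------------------------------------------------------------+
-| PADDLE_TRAINERS      | int  | 2                                          | Number of trainer nodes in the training job                        |
-+----------------------+------+--------------------------------------------+--------------------------------------------------------------------+
-| PADDLE_CURRENT_IP    | str  | ps0.paddlepaddle.com                       | IP address or hostname of the current pserver node                 |
-+----------------------+------+--------------------------------------------+--------------------------------------------------------------------+
-| PADDLE_TRAINER_ID    | int  | 0                                          | Unique ID of the current trainer node, from 0 to PADDLE_TRAINERS-1 |
-+----------------------+------+--------------------------------------------+--------------------------------------------------------------------+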
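 Sample Code
-~~~~~~~~~~~
+-----------
-Save the following program as ``fluid_dist.py``
+The following uses a very simple linear regression model as an example to explain how to launch a distributed training job with 2 pserver nodes and
+2 trainer nodes. You can save this code as ``dist_train.py``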
 .. code:: python
-   import paddle
+    import os
-   import paddle.fluid as fluid
+    import paddle
-   import contextlib
+    import paddle.fluid as fluid
-   import numpy
-   import unittest
+    # train reader
+    BATCH_SIZE = 20
-   # train reader
+    EPOCH_NUM = 30
-   BATCH_SIZE = 20
+    BATCH_SIZE = 8
-   train_reader = paddle.batch(
+    train_reader = paddle.batch(
-       paddle.reader.shuffle(
+        paddle.reader.shuffle(
-           paddle.dataset.uci_housing.train(), buf_size=500),
+            paddle.dataset.uci_housing.train(), buf_size=500),
-       batch_size=BATCH_SIZE)
+        batch_size=BATCH_SIZE)
-   test_reader = paddle.batch(
+    def train():
-       paddle.reader.shuffle(
+        y = fluid.layers.data(name='y', shape=[1], dtype='float32')
-           paddle.dataset.uci_housing.test(), buf_size=500),
+        x = fluid.layers.data(name='x', shape=[13], dtype='float32')
-       batch_size=BATCH_SIZE)
+        y_predict = fluid.layers.fc(input=x, size=1, act=None)
+        loss = fluid.layers.square_error_cost(input=y_predict, label=y)
+        avg_loss = fluid.layers.mean(loss)
+        opt = fluid.optimizer.SGD(learning_rate=0.001)
+        opt.minimize(avg_loss)
+        place = fluid.CPUPlace()
+        feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
+        exe = fluid.Executor(place)
+        # fetch distributed training environment setting
+        training_role = os.getenv("PADDLE_TRAINING_ROLE", None)
+        port = os.getenv("PADDLE_PSERVER_PORT", "6174")
+        pserver_ips = os.getenv("PADDLE_PSERVER_IPS", "")
+        trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
+        eplist = []
+        for ip in pserver_ips.split(","):
+            eplist.append(':'.join([ip, port]))
+        pserver_endpoints = ",".join(eplist)
+        trainers = int(os.getenv("PADDLE_TRAINERS"))
+        current_endpoint = os.getenv("PADDLE_CURRENT_IP", "") + ":" + port
+        t = fluid.DistributeTranspiler()
+        t.transpile(
+            trainer_id = trainer_id,
+            pservers = pserver_endpoints,
+            trainers = trainers)
+        if training_role == "PSERVER":
+            pserver_prog = t.get_pserver_program(current_endpoint)
+            startup_prog = t.get_startup_program(current_endpoint, pserver_prog)
+            exe.run(startup_prog)
+            exe.run(pserver_prog)
+        elif training_role == "TRAINER":
+            trainer_prog = t.get_trainer_program()
+            exe.run(fluid.default_startup_program())
+            for epoch in range(EPOCH_NUM):
+                for batch_id, batch_data in enumerate(train_reader()):
+                    avg_loss_value, = exe.run(trainer_prog,
+                                          feed=feeder.feed(batch_data),
+                                          fetch_list=[avg_loss])
+                    if (batch_id + 1) % 10 == 0:
+                        print("Epoch: {0}, Batch: {1}, loss: {2}".format(
+                            epoch, batch_id, avg_loss_value[0]))
+            # destroy the resources of the current trainer node on the pserver nodes
+            exe.close()
+        else:
+            raise AssertionError("PADDLE_TRAINING_ROLE should be one of [TRAINER, PSERVER]")
+    train()
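+Environment Variables
+---------------------
+When launching a distributed training job, different environment variables indicate the different node roles, as follows:
+.. list-table::
+  :header-rows: 1
+  * - Environment variable
+    - Type
+    - Example
+    - Description
+  * - :code:`PADDLE_TRAINING_ROLE`
+    - str
+    - :code:`PSERVER,TRAINER`
+    - Role of the current training node
+  * - :code:`PADDLE_PSERVER_IPS`
+    - str
+    - :code:`ps0.paddlepaddle.com,ps1.paddlepaddle.com`
+    - IP addresses or hostnames of all pserver nodes in the distributed training job, separated by ","
+  * - :code:`PADDLE_PSERVER_PORT`
+    - int
+    - 6174
+    - Port the pserver process listens on
+  * - :code:`PADDLE_TRAINERS`
+    - int
+    - 2
+    - Number of trainer nodes in the distributed training job
+  * - :code:`PADDLE_CURRENT_IP`
+    - str
+    - :code:`ps0.paddlepaddle.com`
+    - IP address or hostname of the current pserver node
+  * - :code:`PADDLE_TRAINER_ID`
+    - int
+    - 0
+    - Unique ID of the current trainer node, in the range [0, PADDLE_TRAINERS)
+Note: Environment variables are only one way to obtain runtime information; in real jobs, command-line arguments and other means can be used instead.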
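+Distributed Training APIs
+-------------------------
+DistributeTranspiler
+~~~~~~~~~~~~~~~~~~~~~~
+A distributed training job based on the pserver-trainer architecture has two roles: Parameter Server (pserver) and trainer.
+In Fluid, users only need to write the network configuration for single-machine training; the ``DistributeTranspiler`` module automatically rewrites it,
+according to the role of the current training node, into the network configurations that the pserver and the trainer need to run: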
-   def train_program():
+.. code:: python
-       y = fluid.layers.data(name='y', shape=[1], dtype='float32')
-       x = fluid.layers.data(name='x', shape=[13], dtype='float32')
-       y_predict = fluid.layers.fc(input=x, size=1, act=None)
-       loss = fluid.layers.square_error_cost(input=y_predict, label=y)
-       avg_loss = fluid.layers.mean(loss)
-       return avg_loss
-   def optimizer_func():
+    t = fluid.DistributeTranspiler()
-       return fluid.optimizer.SGD(learning_rate=0.001)
+    t.transpile(
+        trainer_id = trainer_id,                   
+        pservers = pserver_endpoints,    
+        trainers = trainers)
+    if PADDLE_TRAINING_ROLE == "TRAINER":
+        # fetch the trainer program and execute it
+        trainer_prog = t.get_trainer_program()
+        ...
-   def train(use_cuda, train_program):
+    elif PADDLE_TRAINING_ROLE == "PSERVER":
-       place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+        # fetch the pserver program and execute it
+        pserver_prog = t.get_pserver_program(current_endpoint) 
+        ...
-       trainer = fluid.Trainer(
+exe.close()
-           train_func=train_program, place=place, optimizer_func=optimizer_func)
+~~~~~~~~~~~~~~
-       def event_handler(event):
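+The pserver nodes keep state information for all trainer nodes. When a trainer finishes training, it needs to call ``exe.close()``
-           if isinstance(event, fluid.EndStepEvent):
+to notify all pserver nodes to release the resources of the current trainer node: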
-               if event.step == 10:
-                   test_metrics = trainer.test(
-                       reader=test_reader, feed_order=['x', 'y'])
-                   print("step {0}, loss: {1}".format(event.step, test_metrics))
-                   trainer.stop()
-       trainer.train(
+.. code:: python
-           reader=train_reader,
-           num_epochs=100,
-           event_handler=event_handler,
-           feed_order=['x', 'y'])
-   train(False, train_program)
+    exe = fluid.Executor(fluid.CPUPlace())
+    # training process ...
+    exe.close() # notify the pserver nodes to destroy the resources
-Launching the trainer and pserver Nodes
+Launching the Distributed Training Job
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+--------------------------------------
 .. list-table::
   :header-rows: 1
@@ -132,12 +192,3 @@
   * - trainer1.paddlepaddle.com
     - :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=1 PADDLE_PSERVER_PORT=6174 python fluid_dist.py`
     - Launch trainer node No. 1
-**Note**
-  Start the pserver nodes before the trainer nodes.
-  The following log output from a trainer node indicates that the training job is running correctly:
-   .. code:: bash
-      step 10, loss: [258.2326202392578]
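The way the sample script in ``cluster_quick_start.rst`` turns environment variables into transpiler arguments can be isolated into a small pure-Python helper. The sketch below is hypothetical: ``parse_cluster_env`` is not part of Paddle, it takes any mapping (such as ``os.environ``) so it can be exercised without a cluster, and the range and empty-list checks are additions that the original script does not perform.

```python
def parse_cluster_env(env):
    """Derive transpiler arguments from a mapping of environment variables."""
    role = env.get("PADDLE_TRAINING_ROLE")
    if role not in ("PSERVER", "TRAINER"):
        raise ValueError(
            "PADDLE_TRAINING_ROLE should be one of [TRAINER, PSERVER]")

    # Same default port as the sample script.
    port = env.get("PADDLE_PSERVER_PORT", "6174")
    ips = [ip for ip in env.get("PADDLE_PSERVER_IPS", "").split(",") if ip]
    if not ips:
        # Extra check, not in the original script.
        raise ValueError("PADDLE_PSERVER_IPS must list at least one pserver")

    trainers = int(env["PADDLE_TRAINERS"])
    trainer_id = int(env.get("PADDLE_TRAINER_ID", "0"))
    if role == "TRAINER" and not 0 <= trainer_id < trainers:
        # Extra check enforcing the documented range [0, PADDLE_TRAINERS).
        raise ValueError("PADDLE_TRAINER_ID must be in [0, PADDLE_TRAINERS)")

    config = {
        "role": role,
        # Every pserver address is paired with the shared port.
        "pserver_endpoints": ",".join(
            "{}:{}".format(ip, port) for ip in ips),
        "trainers": trainers,
        "trainer_id": trainer_id,
        "current_endpoint": None,
    }
    if role == "PSERVER":
        # Only pserver nodes need their own endpoint.
        config["current_endpoint"] = "{}:{}".format(
            env.get("PADDLE_CURRENT_IP", ""), port)
    return config
```

In a real job one would call ``parse_cluster_env(os.environ)`` and pass ``trainer_id``, ``pserver_endpoints`` and ``trainers`` on to ``t.transpile(...)``. Keeping the parsing separate from the training code makes it easy to swap environment variables for command-line arguments.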
--- a/doc/fluid/user_guides/howto/training/multi_node.rst
+++ b/doc/fluid/user_guides/howto/training/multi_node.rst
@@ -7,3 +7,4 @@
   cluster_quick_start.rst
   cluster_howto.rst
+   train_on_baidu_cloud_cn.rst
--- a/doc/fluid/user_guides/howto/training/src/create_gpu_machine.png
+++ b/doc/fluid/user_guides/howto/training/src/create_gpu_machine.png
--- a/doc/fluid/user_guides/howto/training/src/create_image.png
+++ b/doc/fluid/user_guides/howto/training/src/create_image.png
--- a/doc/fluid/user_guides/howto/training/src/create_more_nodes.png
+++ b/doc/fluid/user_guides/howto/training/src/create_more_nodes.png
--- a/doc/fluid/user_guides/howto/training/src/dist_train_demo.py
+++ b/doc/fluid/user_guides/howto/training/src/dist_train_demo.py
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import print_function
+import paddle.fluid.core as core
+import math
+import os
+import sys
+import numpy
+import paddle
+import paddle.fluid as fluid
+BATCH_SIZE = 64
+PASS_NUM = 1
+def loss_net(hidden, label):
+    prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
+    loss = fluid.layers.cross_entropy(input=prediction, label=label)
+    avg_loss = fluid.layers.mean(loss)
+    acc = fluid.layers.accuracy(input=prediction, label=label)
+    return prediction, avg_loss, acc
+def conv_net(img, label):
+    conv_pool_1 = fluid.nets.simple_img_conv_pool(
+        input=img,
+        filter_size=5,
+        num_filters=20,
+        pool_size=2,
+        pool_stride=2,
+        act="relu")
+    conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
+    conv_pool_2 = fluid.nets.simple_img_conv_pool(
+        input=conv_pool_1,
+        filter_size=5,
+        num_filters=50,
+        pool_size=2,
+        pool_stride=2,
+        act="relu")
+    return loss_net(conv_pool_2, label)
+def train(use_cuda, role, endpoints, current_endpoint, trainer_id, trainers):
+    if use_cuda and not fluid.core.is_compiled_with_cuda():
+        return
+    img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
+    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+    prediction, avg_loss, acc = conv_net(img, label)
+    test_program = fluid.default_main_program().clone(for_test=True)
+    optimizer = fluid.optimizer.Adam(learning_rate=0.001)
+    optimizer.minimize(avg_loss)
+    t = fluid.DistributeTranspiler()
+    t.transpile(trainer_id, pservers=endpoints, trainers=trainers)
+    if role == "pserver":
+        prog = t.get_pserver_program(current_endpoint)
+        startup = t.get_startup_program(current_endpoint, pserver_program=prog)
+        exe = fluid.Executor(fluid.CPUPlace())
+        exe.run(startup)
+        exe.run(prog)
+    elif role == "trainer":
+        prog = t.get_trainer_program()
+        place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+        exe = fluid.Executor(place)
+        train_reader = paddle.batch(
+            paddle.reader.shuffle(
+                paddle.dataset.mnist.train(), buf_size=500),
+            batch_size=BATCH_SIZE)
+        test_reader = paddle.batch(
+            paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
+        feeder = fluid.DataFeeder(feed_list=[img, label], place=place)
+        exe.run(fluid.default_startup_program())
+        for pass_id in range(PASS_NUM):
+            for batch_id, data in enumerate(train_reader()):
+                acc_np, avg_loss_np = exe.run(prog,
+                                            feed=feeder.feed(data),
+                                            fetch_list=[acc, avg_loss])
+                if (batch_id + 1) % 10 == 0:
+                    print(
+                        'PassID {0:1}, BatchID {1:04}, Loss {2:2.2}, Acc {3:2.2}'.
+                        format(pass_id, batch_id + 1,
+                                float(avg_loss_np.mean()), float(acc_np.mean())))
+if __name__ == '__main__':
+    if len(sys.argv) != 6:
+        print("Usage: python %s role endpoints current_endpoint trainer_id trainers" % sys.argv[0])
+        sys.exit(1)
+    role, endpoints, current_endpoint, trainer_id, trainers = \
+        sys.argv[1:]
+    train(True, role, endpoints, current_endpoint, int(trainer_id), int(trainers))
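The entry point above takes five positional arguments and silently does nothing when `role` is neither `pserver` nor `trainer`. A minimal sketch of stricter argument checking follows; `parse_args` is a hypothetical helper for illustration and is not part of `dist_train_demo.py`.

```python
def parse_args(argv):
    """Hypothetical helper: validate the five positional arguments of the demo script."""
    if len(argv) != 6:
        raise ValueError(
            "expected: role endpoints current_endpoint trainer_id trainers")
    role, endpoints, current_endpoint, trainer_id, trainers = argv[1:]
    if role not in ("pserver", "trainer"):
        raise ValueError("role must be 'pserver' or 'trainer', got %r" % role)
    # A parameter server must listen on one of the advertised endpoints.
    if role == "pserver" and current_endpoint not in endpoints.split(","):
        raise ValueError("%s is not listed in %s" % (current_endpoint, endpoints))
    # trainer_id and trainers arrive as strings and are converted to int here.
    return role, endpoints, current_endpoint, int(trainer_id), int(trainers)


print(parse_args(["dist_train_demo.py", "trainer",
                  "172.16.0.5:6173,172.16.0.6:6173",
                  "172.16.0.5:6173", "0", "2"]))
```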
--- a/doc/fluid/user_guides/howto/training/src/parallelism.png
+++ b/doc/fluid/user_guides/howto/training/src/parallelism.png
--- a/doc/fluid/user_guides/howto/training/src/release.png
+++ b/doc/fluid/user_guides/howto/training/src/release.png
--- a/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst
+++ b/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst
+.. _train_on_baidu_cloud_cn:
+Launching Fluid Distributed Training on Baidu Cloud
+===================================================
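+PaddlePaddle Fluid distributed training can be launched without relying on a cluster system such as MPI or Kubernetes.
+This chapter uses `Baidu Cloud <https://cloud.baidu.com/>`_ as an example to show how to launch
+large-scale distributed jobs in a cloud environment, including a cloud GPU environment.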
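+Create a cluster template
+-------------------------
+Log in to the Baidu Cloud console, select the BCC service, and click "Create Instance". Choose a region; note that GPU servers are available only in some regions.
+After choosing a suitable region, select the machine type and create an empty server, as shown below: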
+.. image:: src/create_gpu_machine.png
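+* In the operating system options, choose the version you need and pick the CUDA version that matches your setup. Here we choose CUDA-9.2.
+* The example uses postpaid billing, which means charges stop once the machine is released. This is more economical for one-off jobs.
+After the machine is created, run the following commands to install the GPU version of paddlepaddle and its dependencies.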
+.. code-block:: bash
+  apt-get update && apt-get install -y python python-pip python-opencv
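+  # Note: the Baidu Cloud cuda-9.2 image does not ship with cudnn and nccl2, so they must be installed manually. To install them yourself, download them from the official website.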
+  wget -q "http://paddle-train-on-cloud.cdn.bcebos.com/libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb"
+  wget -q "http://paddle-train-on-cloud.cdn.bcebos.com/nccl_2.2.13-1+cuda9.0_x86_64.txz"
+  dpkg -i libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb
+  ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/lib/libcudnn.so
+  unxz nccl_2.2.13-1+cuda9.0_x86_64.txz
+  tar xf nccl_2.2.13-1+cuda9.0_x86_64.tar
+  cp -r nccl_2.2.13-1+cuda9.0_x86_64/lib/* /usr/lib
+  # Note: the pip mirror below is optional and only speeds up downloads
+  pip install -i https://pypi.tuna.tsinghua.edu.cn/simple matplotlib==2.2.3
+  pip install -i https://pypi.tuna.tsinghua.edu.cn/simple paddlepaddle-gpu==0.15.0.post97
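+After installation, use the test program below to check whether the machine can run a GPU training program correctly. If an error occurs, fix the
+runtime environment according to the error message. To make it easier to start a GPU cluster, once the test program succeeds, select the current server and choose "Create Custom Image".
+The configured image can then be selected when creating the GPU cluster.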
+.. image:: src/create_image.png
+* Test program:
+.. code-block:: python
+  from __future__ import print_function
+  import paddle.fluid.core as core
+  import math
+  import os
+  import sys
+  import numpy
+  import paddle
+  import paddle.fluid as fluid
+  BATCH_SIZE = 64
+  PASS_NUM = 1
+  def loss_net(hidden, label):
+      prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
+      loss = fluid.layers.cross_entropy(input=prediction, label=label)
+      avg_loss = fluid.layers.mean(loss)
+      acc = fluid.layers.accuracy(input=prediction, label=label)
+      return prediction, avg_loss, acc
+  def conv_net(img, label):
+      conv_pool_1 = fluid.nets.simple_img_conv_pool(
+          input=img,
+          filter_size=5,
+          num_filters=20,
+          pool_size=2,
+          pool_stride=2,
+          act="relu")
+      conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
+      conv_pool_2 = fluid.nets.simple_img_conv_pool(
+          input=conv_pool_1,
+          filter_size=5,
+          num_filters=50,
+          pool_size=2,
+          pool_stride=2,
+          act="relu")
+      return loss_net(conv_pool_2, label)
+  def train(use_cuda):
+      if use_cuda and not fluid.core.is_compiled_with_cuda():
+          return
+      img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
+      label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+      prediction, avg_loss, acc = conv_net(img, label)
+      test_program = fluid.default_main_program().clone(for_test=True)
+      optimizer = fluid.optimizer.Adam(learning_rate=0.001)
+      optimizer.minimize(avg_loss)
+      place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+      exe = fluid.Executor(place)
+      train_reader = paddle.batch(
+          paddle.reader.shuffle(
+              paddle.dataset.mnist.train(), buf_size=500),
+          batch_size=BATCH_SIZE)
+      test_reader = paddle.batch(
+          paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
+      feeder = fluid.DataFeeder(feed_list=[img, label], place=place)
+      exe.run(fluid.default_startup_program())
+      for pass_id in range(PASS_NUM):
+          for batch_id, data in enumerate(train_reader()):
+              acc_np, avg_loss_np = exe.run(fluid.default_main_program(),
+                                            feed=feeder.feed(data),
+                                            fetch_list=[acc, avg_loss])
+              if (batch_id + 1) % 10 == 0:
+                  print(
+                      'PassID {0:1}, BatchID {1:04}, Loss {2:2.2}, Acc {3:2.2}'.
+                      format(pass_id, batch_id + 1,
+                              float(avg_loss_np.mean()), float(acc_np.mean())))
+  if __name__ == '__main__':
+      train(True)
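+Create the cluster
+------------------
+Once the image is created, use it to create a GPU cluster with as many GPU servers as you need.
+As an example, we run 2 GPU servers here. One of them is the server created in the previous step, so only one new server needs to be started.
+Click "Create Instance", select a GPU server with the same configuration in the same region, and be sure to select the image you just created as the operating system.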
+.. image:: src/create_more_nodes.png
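+Write the cluster job launch script
+-----------------------------------
+To make it easier to launch distributed training jobs on more GPU servers, we use
+`fabric <http://www.fabfile.org/>`_
+as the cluster job launcher. You may choose another cluster framework you are familiar with, such as MPI or Kubernetes. The method shown here
+only targets simple cluster environments in which the servers can ssh into one another.
+To install fabric, run: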
+.. code-block:: bash
+  pip install fabric
+Suppose we have created 2 GPU servers with the IPs :code:`172.16.0.5,172.16.0.6` . On the first server,
+first create the training program file :code:`dist_train_demo.py` using the code downloaded from
+`here <https://raw.githubusercontent.com/PaddlePaddle/FluidDoc/develop/doc/fluid/user_guides/howto/training/src/dist_train_demo.py>`_.
+Then write the :code:`fabfile.py` script, which starts the parameter server and the trainer of the training job on each server:
+.. code-block:: python
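+  from fabric import task
+  # Comma-separated parameter server endpoints, one per host
+  endpoints = "172.16.0.5:6173,172.16.0.6:6173"
+  port = "6173"
+  pservers = 2
+  trainers = 2
+  # Host IPs derived from the endpoints; a host's index in this list is its trainer id
+  hosts = []
+  eps = []
+  for ep in endpoints.split(","):
+      eps.append(ep)
+      hosts.append(ep.split(":")[0])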
+  def start_server(c):
+      current_endpoint = "%s:%s" % (c.host, port)
+      trainer_id = hosts.index(c.host)
+      cmd = "python /root/work/dist_train_demo.py pserver %s %s %d %d &> /root/work/server.log.%s &" % (
+          endpoints, current_endpoint, trainer_id, trainers, c.host)
+      c.run(cmd)
+  def start_trainer(c):
+      current_endpoint = "%s:%s" % (c.host, port)
+      trainer_id = hosts.index(c.host)
+      cmd = "python /root/work/dist_train_demo.py trainer %s %s %d %d &> /root/work/trainer.log.%s &" % (
+          endpoints, current_endpoint, trainer_id, trainers, c.host)
+      c.run(cmd)
+  @task
+  def start(c):
+      c.connect_kwargs.password = "work@paddle123"
+      c.run("mkdir -p /root/work")
+      c.put("dist_train_demo.py", "/root/work")
+      start_server(c)
+      start_trainer(c)
+  @task
+  def tail_log(c):
+      c.connect_kwargs.password = "work@paddle123"
+      c.run("tail /root/work/trainer.log.%s" % c.host)
+After saving the code above to :code:`fabfile.py` , run
+.. code-block:: bash
+  fab -H 172.16.0.5,172.16.0.6 start
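+to start a distributed training job. The job starts one pserver process and one trainer process on each of the two GPU servers, giving 2 pservers and 2 trainers in total.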
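+Get the distributed training results
+------------------------------------
+The example job writes its logs under :code:`/root/work` , named
+:code:`server.log.[IP]` and :code:`trainer.log.[IP]` . You can inspect these log files manually on
+the servers, or use fabric to fetch the log output of all nodes, for example: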
+.. code-block:: bash
+  fab -H 172.16.0.5,172.16.0.6 tail-log
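+Shut down the cluster
+---------------------
+After the job finishes, do not forget to release the GPU cluster resources. Tick the servers to release and choose "Release" to shut the machines down and free the resources.
+To run a new job, start a new cluster from the previously saved image and follow the steps above to begin training.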
+.. image:: src/release.png
\ No newline at end of file
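The `fabfile.py` in the guide above derives each host's trainer id and endpoint from the shared endpoint list and then builds two shell commands per host. The sketch below reproduces only that string-building step in plain Python so it can be checked without fabric or a cluster; `launch_commands` is a hypothetical name, and the constants are copied from the example.

```python
ENDPOINTS = "172.16.0.5:6173,172.16.0.6:6173"  # same value as in fabfile.py
PORT = "6173"
TRAINERS = 2


def launch_commands(host, endpoints=ENDPOINTS, port=PORT, trainers=TRAINERS):
    """Return the pserver and trainer commands that would be run on `host`."""
    hosts = [ep.split(":")[0] for ep in endpoints.split(",")]
    trainer_id = hosts.index(host)  # position in the endpoint list
    current_endpoint = "%s:%s" % (host, port)
    template = ("python /root/work/dist_train_demo.py %s %s %s %d %d "
                "&> /root/work/%s.log.%s &")
    # The pserver log is named server.log.<IP>, as in start_server above.
    return [
        template % (role, endpoints, current_endpoint, trainer_id, trainers,
                    log_name, host)
        for role, log_name in (("pserver", "server"), ("trainer", "trainer"))
    ]


for host in ("172.16.0.5", "172.16.0.6"):
    for cmd in launch_commands(host):
        print(cmd)
```

Printing the commands before running `fab` is a cheap way to confirm that every host gets a distinct trainer id and its own endpoint.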
--- a/doc/fluid/user_guides/image/executor_design.png
+++ b/doc/fluid/user_guides/image/executor_design.png
--- a/doc/fluid/user_guides/image/fluid_process.png
+++ b/doc/fluid/user_guides/image/fluid_process.png
--- a/doc/fluid/user_guides/index.rst
+++ b/doc/fluid/user_guides/index.rst
@@ -16,4 +16,4 @@
    howto/debug/index
    howto/evaluation/index
    howto/inference/index
-    models/index.rst
+    models/index.md
--- a/doc/fluid/user_guides/models/index.md
+++ b/doc/fluid/user_guides/models/index.md
+../../../../external/models/fluid/README.md
\ No newline at end of file
--- a/doc/fluid/user_guides/models/index.rst
+++ b/doc/fluid/user_guides/models/index.rst
-../../../../external/models/fluid/README.cn.rst
\ No newline at end of file
--- a/doc/survey/dynamic_graph.md
+++ b/doc/survey/dynamic_graph.md
@@ -2,42 +2,47 @@
 ## Automatic Differentiation
-A key challenge in the field of deep learning is to automatically derive the backward pass from the forward pass described algorithmically by researchers.  Such a derivation, or a transformation of the forward pass program, has been long studied before the recent prosperity of deep learning in the field known as [automatic differentiation](https://arxiv.org/pdf/1502.05767.pdf).
+A key challenge in deep learning is to automatically derive the backward pass from a forward pass given as a program. This problem was studied in the field of [automatic differentiation](https://arxiv.org/pdf/1502.05767.pdf), or autodiff, long before the recent rise of deep learning.
-## The Tape
+## Program Transformation vs. Backtracking
-Given the forward pass program (usually in Python in practices), there are two strategies to derive the backward pass:
+Given the forward pass program, there are two strategies to derive the backward pass:
-1. from the forward pass program itself, or
+1. by transforming the forward pass program without executing it, or
-1. from the execution trace of the forward pass program, which is often known as the *tape*.
+1. by backtracking the execution process of the forward pass program.
-This article surveys systems that follow the latter strategy.
+This article is about the latter strategy. 
-## Dynamic Network
+## The Tape and Dynamic Networks
-When we train a deep learning model, the tape changes every iteration as the input data change, so we have to re-derive the backward pass every iteration.  This is known as *dynamic network*.
+We refer to the trace of the execution of the forward pass program as a *tape* [[1]](http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf).  When we train a deep learning model, the tape changes every iteration as the input data change, so we have to re-derive the backward pass every iteration. This is time-consuming, but it also makes it easy to handle forward programs that include control flow such as if-else and for/while, where the execution trace might change between iterations.  Networks whose trace changes in this way are known as *dynamic networks* in the field of deep learning.
-Deep learning systems that utilize the idea of dynamic network gained their popularities in recent years.  This article surveys two representative systems: [PyTorch](https://pytorch.org/) and [DyNet](https://dynet.readthedocs.io/en/latest/).
+## Typical Systems
-## An Overview
+Deep learning systems that use the idea of dynamic networks have gained popularity in recent years.  This article surveys the following typical systems:
-Both frameworks record a ‘tape’ of the computation and interpreting (or run-time compiling) a transformation of the tape played back in reverse. This tape is a different kind of entity than the original program.[[link]](http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf)
+- [DyNet](https://dynet.readthedocs.io/en/latest/)
+- [PyTorch](https://pytorch.org/)
+- Chainer
+- Autograd from HIPS
-Consider the following code feedforward model.
+Before diving into these systems, let us pose an example forward pass program:
 ```python
 x = Variable(randn(20, 1)))
 label = Variable(randint(1))
 W_1, W_2 = Variable(randn(20, 20)), Variable(randn(10, 20))
 h = matmul(W_1, x)
-pred = matmul(W_2, x)
+pred = matmul(W_2, h)
 loss = softmax(pred, label)
 loss.backward()
 ```
-### 1) Dynet uses List to encode the Tape
+## The Representation of Tapes
-During the forward execution, a list of operators, in this case `matmul`, `matmul` and `softmax`, are recorded in the tape, along with the necessary information needed to do the backward such as pointers to the inputs and outputs. Then the tape is played in reverse order at `loss.backward()`.
+### DyNet: the Tape as a List
+DyNet uses a linear data structure, a list, to represent the tape. During the execution of the above example, the tape is a list of operators: `matmul`, `matmul`, and `softmax`.  The list also includes the information needed to do the backward pass, such as pointers to the inputs and outputs. The tape is then played in reverse order at `loss.backward()`.
 <details> 
 <summary></summary>
@@ -69,9 +74,9 @@ digraph g {
 ![Alt text](https://g.gravizo.com/svg?digraph%20g%20{%20graph%20[%20rankdir%20=%20%22LR%22%20];%20node%20[%20fontsize%20=%20%2216%22%20shape%20=%20%22ellipse%22%20];%20edge%20[];%20%22node0%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20matmul%20|%20%3Cf1%3E%20input:%20W_1,%20x%20|%20%3Cf2%3E%20output:%20h%22%20shape%20=%20%22record%22%20];%20%22node1%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20matmul%20|%20%3Cf1%3E%20input:%20W_2,%20h%20|%20%3Cf2%3E%20output:%20pred%22%20shape%20=%20%22record%22%20];%20%22node2%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20softmax%20|%20%3Cf1%3E%20input:%20pred,%20label%20|%20%3Cf2%3E%20output:%20loss%22%20shape%20=%20%22record%22%20];%20%22node0%22:f0%20-%3E%20%22node1%22:f0%20[%20id%20=%200%20];%20%22node1%22:f0%20-%3E%20%22node2%22:f0%20[%20id%20=%201%20];%20})
-### 2) Pytorch uses Node Graph to encode the Tape
+### PyTorch: the Tape as a Graph
-The graph is composed of `Variable`s and `Function`s. During the forward execution, a `Variable` records its creator function, e.g. `h.creator = matmul`. And a Function records its inputs' previous/dependent functions `prev_func` through `creator`, e.g. `matmul.prev_func = matmul1`. At `loss.backward()`, a topological sort is performed on all `prev_func`s. Then the grad op is performed by the sorted order.
+The graph is composed of `Variable`s and `Function`s. During the forward execution, a `Variable` records its creator function, e.g. `h.creator = matmul`. A `Function` records its inputs' previous/dependent functions `prev_func` through `creator`, e.g. `matmul.prev_func = matmul1`. At `loss.backward()`, a topological sort is performed on all `prev_func`s. The grad ops are then performed in the sorted order.  Note that a `Function` might have more than one `prev_func`.
 <details> 
 <summary></summary>
@@ -132,27 +137,22 @@ digraph g {
 ![Alt text](https://g.gravizo.com/svg?digraph%20g%20{%20graph%20[%20rankdir%20=%20%22LR%22%20];%20subgraph%20function%20{%20node%20[%20fontsize%20=%20%2216%22%20style%20=%20filled%20shape%20=%20%22record%22%20];%20%22matmul0%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20matmul%20|%20prev_func:%20None%22%20];%20%22matmul1%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20matmul%20|%20prev_func:%20matmul%22%20];%20%22softmax%22%20[%20label%20=%20%22%3Cf0%3E%20type:%20softmax%20|%20prev_func:%20matmul%22%20];%20}%20subgraph%20variable%20{%20node%20[%20fontsize%20=%20%2216%22%20shape%20=%20%22Mrecord%22%20style%20=%20filled%20fillcolor%20=%20white%20];%20%22x%22%20[%20label%20=%20%22%3Cf0%3E%20x%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22label%22%20[%20label%20=%20%22%3Cf0%3E%20label%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22W_1%22%20[%20label%20=%20%22%3Cf0%3E%20W_1%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22W_2%22%20[%20label%20=%20%22%3Cf0%3E%20W_2%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22h%22%20[%20label%20=%20%22%3Cf0%3E%20h%20|%20%3Cf1%3E%20creator:%20None%22%20];%20%22pred%22%20[%20label%20=%20%22%3Cf0%3E%20pred%20|%20%3Cf1%3E%20creator:%20matmul%22%20];%20%22loss%22%20[%20label%20=%20%22%3Cf0%3E%20loss%20|%20%3Cf1%3E%20creator:%20softmax%22%20];%20}%20subgraph%20data_flow%20{%20%22x%22:f0%20-%3E%20%22matmul0%22:f0;%20%22W_1%22:f0%20-%3E%20%22matmul0%22:f0;%20%22matmul0%22:f0%20-%3E%20%22h%22:f0;%20%22h%22:f0%20-%3E%20%22matmul1%22:f0;%20%22W_2%22:f0%20-%3E%20%22matmul1%22:f0;%20%22matmul1%22:f0%20-%3E%20%22pred%22:f0;%20%22pred%22:f0%20-%3E%20%22softmax%22:f0;%20%22label%22:f0%20-%3E%20%22softmax%22:f0;%20%22softmax%22:f0%20-%3E%20%22loss%22:f0;%20}%20subgraph%20prev_func%20{%20edge%20[color=%22red%22,%20arrowsize=%220.6%22,%20penwidth=%221%22,%20constraint=false];%20%22matmul1%22:f1%20-%3E%20%22matmul0%22:f0;%20%22softmax%22:f1%20-%3E%20%22matmul1%22:f0;%20label%20=%20%22prev_func%22;%20}%20})
-Chainer and Autograd uses the similar techniques to record the forward pass. For details please refer to the appendix.
+Chainer and Autograd use similar techniques to record the forward pass. For details, please refer to the appendix.
-## Design choices
-### 1) Dynet's List vs Pytorch's Node Graph
+## Comparison: List vs. Graph
-What's good about List:
+DyNet's list can be considered the result of a topological sort of PyTorch's graph. Put the other way, the graph is the raw representation of the tape, which gives us the chance to *prune* the parts of the graph that are irrelevant to the backward pass before the topological sort [[2]](https://openreview.net/pdf?id=BJJsrmfCZ). In the following example, PyTorch only does the backward pass on `SmallNet`, while DyNet does it on both `SmallNet` and `BigNet`:
-1. It avoids a topological sort. One only needs to traverse the list of operators in reverse and calling the corresponding backward operator.
-1. It promises effient data parallelism implementations. One could count the time of usage of a certain variable during the construction list. Then in the play back, one knows the calculation of a variable has completed. This enables communication and computation overlapping.
-What's good about Node Graph:
-1. More flexibility. PyTorch users can mix and match independent graphs however they like, in whatever threads they like (without explicit synchronization). An added benefit of structuring graphs this way is that when a portion of the graph becomes dead, it is automatically freed. [[2]](https://openreview.net/pdf?id=BJJsrmfCZ) Consider the following example, Pytorch only does backward on SmallNet while Dynet does both BigNet and SmallNet.
 ```python
 result = BigNet(data)
 loss = SmallNet(data)
 loss.backward()
 ```
-### 2) Dynet's Lazy evaluation vs Pytorch's Immediate evaluation
+## Lazy vs. Immediate Evaluation
+Another difference between DyNet and PyTorch is that DyNet lazily evaluates the forward pass, whereas PyTorch executes it immediately. Consider the following example:
-Dynet builds the list in a symbolic matter. Consider the following example
 ```python
 for epoch in range(num_epochs):
    for in_words, out_label in training_data:
@@ -164,16 +164,17 @@ for epoch in range(num_epochs):
        loss_val = loss_sym.value()
        loss_sym.backward()
 ```
 The computation of `lookup`, `concat`, `matmul` and `softmax` didn't happen until the call of `loss_sym.value()`. This defered execution is useful because it allows some graph-like optimization possible, e.g. kernel fusion.
-Pytorch chooses immediate evaluation. It avoids ever materializing a "forward graph"/"tape" (no need to explicitly call `dy.renew_cg()` to reset the list), recording only what is necessary to differentiate the computation, i.e. `creator` and `prev_func`.
+PyTorch chooses immediate evaluation. It avoids ever materializing a "forward graph"/"tape" (no need to explicitly call `dy.renew_cg()` to reset the list), recording only what is necessary to differentiate the computation, i.e. `creator` and `prev_func`.
-## What can fluid learn from them?
+## Fluid: Learning the Lessons
 Please refer to `paddle/contrib/dynamic/`.
-# Appendix
+## Appendix
 ### Overview
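To make the list versus graph comparison in this survey concrete, here is a minimal sketch of the two tape representations. It is a toy model, not DyNet or PyTorch code: the `Function` class, `backward_order_list`, and `backward_order_graph` are hypothetical names, and the loss is modeled as a separate operator fed by `SmallNet` for simplicity.

```python
class Function:
    """A recorded operator; prev_funcs are the creators of its inputs."""

    def __init__(self, name, prev_funcs=()):
        self.name = name
        self.prev_funcs = list(prev_funcs)


def backward_order_list(tape):
    """List-style tape: replay every recorded operator in reverse."""
    return [f.name for f in reversed(tape)]


def backward_order_graph(root):
    """Graph-style tape: visit only operators reachable from root."""
    order, seen = [], set()

    def visit(f):
        if id(f) in seen:
            return
        seen.add(id(f))
        for prev in f.prev_funcs:
            visit(prev)
        order.append(f)  # post-order: dependencies come first

    visit(root)
    # Reverse the topological order so the backward pass starts at the root.
    return [f.name for f in reversed(order)]


# Both networks run in the forward pass, but only SmallNet feeds the loss.
big = Function("BigNet")
small = Function("SmallNet")
loss = Function("loss", [small])
tape = [big, small, loss]

print(backward_order_list(tape))   # ['loss', 'SmallNet', 'BigNet']
print(backward_order_graph(loss))  # ['loss', 'SmallNet']
```

The graph traversal never reaches `BigNet` because nothing on the path from the loss refers to it, which is the pruning effect described in the comparison.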

--- a/Paddle @ 42359797
+++ b/Paddle @ 42359797
-Subproject commit d4d71dce04dcd2a41964e6067be18f544cff3767
+Subproject commit 423597974bb17f996d589b0e70e1f1584a6af9da
--- a/book @ da161514
+++ b/book @ da161514
-Subproject commit fe1df41a7f45a02f5c36a5dc55053b6ba04bafa0
+Subproject commit da1615146f1c460b2602c69a9512d656d4a85baf
--- a/models @ 5d172a5f
+++ b/models @ 5d172a5f
-Subproject commit d6024059de7ba447ab2859c23ef86e8519c127ae
+Subproject commit 5d172a5f7ab247abf0ffe1faab96a20867dfbb98