##############################################
Basic Concepts of PaddlePaddle Framework 2.0
##############################################
Let's start by learning the basic concepts of PaddlePaddle:

- `Introduction to Tensor <./tensor_introduction_cn.html>`_ : how data is represented in PaddlePaddle, with an introduction to the Tensor concept.
- `Broadcasting in PaddlePaddle <./broadcasting_cn.html>`_ : an introduction to broadcasting in PaddlePaddle.

.. toctree::
    :hidden:

    tensor_introduction_cn.md
    broadcasting_cn.rst
#########################
Paddle 2.0 Basic Concepts
#########################
Let's start by studying the basic concepts of PaddlePaddle:

- `Introduction to Tensor <tensor_introduction_en.html>`_ : an introduction to Tensor, the representation of data in Paddle.
- `Broadcasting <./broadcasting_en.html>`_ : an introduction to broadcasting.

.. toctree::
    :hidden:

    tensor_introduction_en.md
    broadcasting_en.md
############################################
Introduction to PaddlePaddle Framework 2.0
############################################
A brief introduction to PaddlePaddle Framework 2.0.
You can learn more about PaddlePaddle Framework 2.0 through the following pages:

- `Basic concepts of PaddlePaddle Framework 2.0 <./basic_concept/index_cn.html>`_ : an introduction to the basic concepts of PaddlePaddle Framework 2.0.
- `Upgrade guide for PaddlePaddle Framework 2.0 beta <./upgrade_guide_cn.html>`_ : the main changes in the open-source framework 2.0 beta and how to upgrade.
- `Version migration tool <./migration_cn.html>`_ : how to use the paddle1to2 conversion tool.

.. toctree::
    :hidden:

    basic_concept/index_cn.rst
    upgrade_guide_cn.md
    migration_cn.rst
#####################
Paddle 2 Introduction
#####################
A brief introduction to Paddle 2.
For more information, see the following pages:

- `Paddle 2 basic concepts <./basic_concept/index_en.html>`_ : an introduction to the basic concepts of Paddle 2.
- `Migration tools <./migration_en.html>`_ : how to use the migration tools to upgrade your code.

.. toctree::
    :hidden:

    migration_en.rst
##################################################
Model Development with PaddlePaddle Framework 2.0
##################################################
Content related to model development with PaddlePaddle Framework 2.0.
..
    TODO: content to be added
    Get started with Paddle in 10 minutes
    Data preprocessing (vision + text)
    Data loading (Dataset + DataLoader, built-in datasets)
    Model construction (paddle.nn + paddle.nn.functional, the Model API, built-in models)
    Training and prediction (model.fit/evaluate/predict, with a step-by-step breakdown of fit, evaluate and predict)
    Single-machine multi-card (training + prediction)
    Debugging dynamic graph code
.. PaddlePaddle Fluid documentation master file, created by
sphinx-quickstart on Thu Jun 7 17:04:53 2018.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
##############
VisualDL Tools
##############
.. toctree::
    :maxdepth: 1

    visualdl.md
    visualdl_usage.md
VisualDL Tools
==========================
.. toctree::
:maxdepth: 1
visualdl_en.md
visualdl_usage_en.md
# Introduction to the VisualDL Toolkit
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/vdl-logo.png" width="70%"/>
</p>
VisualDL is PaddlePaddle's visualization and analysis tool. It presents the trends of training parameters, model structures, data samples, histograms, PR curves, and high-dimensional data distributions with rich charts, helping users understand the training process and model structure more clearly and intuitively, and therefore optimize models more efficiently.
For details on how to use each feature, please refer to the **VisualDL User Guide**. The project is iterating rapidly, so stay tuned for new components.
Supported browsers: Chrome (81 and 83), Safari 13, Firefox (77 and 78), and Edge (Chromium-based).
VisualDL natively supports Python: by adding a few lines of code to the Python configuration of your model, you get rich visualizations of the training process.
## Contents
* [Key Highlights](#Key-Highlights)
* [Installation](#Installation)
* [Usage](#Usage)
* [Overview of Visualization Features](#Overview-of-Visualization-Features)
* [Contribution](#Contribution)
* [More Details](#More-Details)
* [Technical Communication](#Technical-Communication)
## Key Highlights
### Easy to Use
The API is simple and intuitive, and visualizing a model structure takes only one click.
### Rich Features
Covers visualization of scalars, data samples, graph structures, histograms, PR curves, and dimensionality reduction.
### Highly Compatible
Fully supports visualizing mainstream model formats such as Paddle, ONNX, and Caffe, serving a wide range of users.
### Fully Integrated
Deeply integrated with PaddlePaddle's service platforms and tool components, providing the best experience within the PaddlePaddle ecosystem.
## Installation
### Install with pip
```shell
pip install --upgrade --pre visualdl
```
### Install from source
```
git clone https://github.com/PaddlePaddle/VisualDL.git
cd VisualDL
python setup.py bdist_wheel
pip install --upgrade dist/visualdl-*.whl
```
Note: Python 2 has not been officially maintained since January 1, 2020. To keep the code reliable, VisualDL now supports Python 3 only.
## Usage
VisualDL stores the data and parameters generated during training in log files; you can then start the panel to view the visualization results.
### 1. Record Logs
The VisualDL backend provides a Python SDK. A log writer can be customized through `LogWriter`, whose interface is as follows:
```python
class LogWriter(logdir=None,
comment='',
max_queue=10,
flush_secs=120,
filename_suffix='',
write_to_disk=True,
**kwargs)
```
#### Parameters
| Parameter | Type | Description |
| --------------- | ------- | ------------------------------------------------------------ |
| logdir | string | Path of the log files. VisualDL creates and writes log files under this path; defaults to `runs/${CURRENT_TIME}` if not set. |
| comment | string | Suffix appended to the default log folder name; ignored if `logdir` is specified. |
| max_queue | int | Maximum capacity of the logging message queue; the queue is flushed to the log file once this capacity is reached. |
| flush_secs | int | Maximum caching time of the logging message queue; the queue is flushed to the log file once this time is reached. |
| filename_suffix | string | Suffix appended to the default log file name. |
| write_to_disk | boolean | Whether to write to disk. |
#### Example
Create a log file and record scalar data:
```python
from visualdl import LogWriter

# create a log file under `./log/scalar_test/train`
with LogWriter(logdir="./log/scalar_test/train") as writer:
    # use the scalar component to record scalar data
    writer.add_scalar(tag="acc", step=1, value=0.5678)
    writer.add_scalar(tag="acc", step=2, value=0.6878)
    writer.add_scalar(tag="acc", step=3, value=0.9878)
```
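If you need to control how often records are flushed, the `max_queue` and `flush_secs` parameters from the table above can be passed as well. A minimal sketch (the values below are illustrative placeholders, not tuned recommendations):

```python
from visualdl import LogWriter

# flush to disk after 20 queued records or 30 seconds, whichever comes first
# (placeholder values, see the parameter table above)
with LogWriter(logdir="./log/scalar_test/train",
               max_queue=20,
               flush_secs=30) as writer:
    for step in range(100):
        writer.add_scalar(tag="acc", step=step, value=step / 100.0)
```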
### 2. Launch the Panel
In the example above, three scalar records have been written to the log. You can now launch the VisualDL panel to view the visualization. There are two ways to launch it:
#### Launch from the command line
Use the following command format to launch the VisualDL panel from the command line:
```shell
visualdl --logdir <dir_1, dir_2, ... , dir_n> --host <host> --port <port> --cache-timeout <cache_timeout> --language <language> --public-path <public_path> --api-only
```
Parameter details:
| Parameter | Description |
| --------------- | ------------------------------------------------------------ |
| --logdir | Directory (or directories) of the logs. Multiple directories can be specified; VisualDL walks through the subdirectories of each specified directory and visualizes all experiment results found. |
| --model | Path to a model file (not a folder). VisualDL visualizes the model file at this path. PaddlePaddle, ONNX, Keras, Core ML, Caffe and other model formats are currently supported; see [model formats supported by Graph](https://github.com/PaddlePaddle/VisualDL/blob/develop/docs/components/README.md#Graph--网络结构组件). |
| --host | IP address to bind, defaults to `127.0.0.1`. |
| --port | Port to listen on, defaults to `8040`. |
| --cache-timeout | Backend cache time in seconds. Repeated frontend requests for the same URL within this period are served from the cache; defaults to 20 seconds. |
| --language | Language of the VisualDL panel, either 'EN' or 'ZH'; defaults to the browser language. |
| --public-path | URL path of the VisualDL panel, defaults to '/app', i.e. the panel is served at 'http://&lt;host&gt;:&lt;port&gt;/app'. |
| --api-only | Serve the API only. If set, VisualDL provides no web pages, only the API service, available at 'http://&lt;host&gt;:&lt;port&gt;/&lt;public_path&gt;/api'; if `public_path` is not set, the default is 'http://&lt;host&gt;:&lt;port&gt;/api'. |
For the logs generated in the previous step, the launch command is:
```
visualdl --logdir ./log
```
#### Launch from a Python script
The VisualDL panel can also be launched from a Python script. The interface is as follows:
```python
visualdl.server.app.run(logdir,
host="127.0.0.1",
port=8080,
cache_timeout=20,
language=None,
public_path=None,
api_only=False,
open_browser=False)
```
Note: all parameters except `logdir` are keyword arguments; please pass them by name.
The parameters are as follows:
| Parameter | Type | Description |
| ------------- | ------------------------------------------------ | ------------------------------------------------------------ |
| logdir | string or list[string_1, string_2, ... , string_n] | Path(s) of the log files. VisualDL searches these paths recursively for log files and visualizes them; one or more paths can be given. |
| model | string | Path to a model file (not a folder); VisualDL visualizes the model file at this path. |
| host | string | IP address to bind, defaults to `127.0.0.1`. |
| port | int | Port to listen on, defaults to `8040`. |
| cache_timeout | int | Backend cache time in seconds. Repeated frontend requests for the same URL within this period are served from the cache; defaults to 20 seconds. |
| language | string | Language of the VisualDL panel, either 'en' or 'zh'; defaults to the browser language. |
| public_path | string | URL path of the VisualDL panel, defaults to '/app', i.e. the panel is served at 'http://<host>:<port>/app'. |
| api_only | boolean | Serve the API only. If set, VisualDL provides no web pages, only the API service, available at 'http://<host>:<port>/<public_path>/api'; if `public_path` is not set, the default is 'http://<host>:<port>/api'. |
| open_browser | boolean | Whether to open the browser automatically after startup and visit the VisualDL panel; ignored if `api_only` is set. |
For the logs generated in the previous step, the launch script is:
```python
from visualdl.server import app
app.run(logdir="./log")
```
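For completeness, here is a sketch passing several of the keyword arguments listed in the table above; the directory names and option values are placeholders:

```python
from visualdl.server import app

# everything except the log directories must be passed by keyword
app.run(logdir=["./log/scalar_test", "./log/scalar_test2"],
        host="127.0.0.1",
        port=8040,
        cache_timeout=20,
        language="en",
        open_browser=True)
```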
After launching the VisualDL panel by either method, open the panel in your browser to view the visualization results, as shown below:
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/82786044-67ae9880-9e96-11ea-8a2b-3a0951a6ec19.png" width="60%"/>
</p>
## Overview of Visualization Features
### Scalar
Displays training metrics such as loss and accuracy as real-time charts. By watching how single or multiple training parameters change, users can understand the training process and speed up model tuning. Two notable features:
#### Real-time display
After VisualDL is launched, the LogReader keeps reading new data from the logs incrementally for the frontend to display, so you can watch metrics change during training, as shown below:
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/dynamic_display.gif" width="60%"/>
</p>
#### Multi-experiment comparison
Simply pass the log paths of all experiments when launching VisualDL; metrics with the same tag from different experiments are drawn in the same chart for side-by-side comparison, as shown below. A code sketch follows the figure.
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/multi_experiments.gif" width="100%"/>
</p>
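A minimal sketch of how such a comparison could be set up (the experiment names are illustrative; a fuller example appears in the VisualDL User Guide later in this document):

```python
from visualdl import LogWriter

# one sub-directory per experiment, same tag in both, so the curves share a chart
for exp in ["exp_a", "exp_b"]:
    with LogWriter(logdir="./log/%s" % exp) as writer:
        for step in range(100):
            # illustrative values, scaled differently per experiment
            writer.add_scalar(tag="train/acc", step=step,
                              value=step / (100.0 if exp == "exp_a" else 50.0))
```

Launching `visualdl --logdir ./log` then draws both runs in a single chart.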
### Image
Displays image data in real time during training, so you can check how images change at different training stages and better understand the training process and its effect.
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/image-eye.gif" width="60%"/>
</p>
### Audio
Plays audio data recorded during training in real time, useful for monitoring speech recognition and synthesis tasks.
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/89017647-38605000-d34d-11ea-9d75-7d10b9854c36.gif" width="100%"/>
</p>
### Graph
Visualizes the network structure of a model with one click. You can inspect model attributes, node information, node inputs and outputs, and search for nodes, which helps you analyze the model structure and understand the data flow quickly.
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84483052-5acdd980-accb-11ea-8519-1608da7ee698.png" width="100%"/>
</p>
### Histogram
Shows how tensors (weights, biases, gradients, etc.) are distributed over the course of training as histograms, helping developers understand the effect of each layer and adjust the model structure precisely.
- Offset mode
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86551031-86647c80-bf76-11ea-8ec2-8c86826c8137.png" width="100%"/>
</p>
- Overlay mode
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86551033-882e4000-bf76-11ea-8e6a-af954c662ced.png" width="100%"/>
</p>
### PR Curve
Precision-recall curves help developers trade off precision against recall and choose the best threshold.
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86738774-ee46c000-c067-11ea-90d2-a98aac445cca.png" width="100%"/>
</p>
### High Dimensional
Projects high-dimensional data into a lower-dimensional space for display; t-SNE and PCA are currently supported. This helps analyze the relationships between high-dimensional data points and optimize algorithms based on the data characteristics.
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/high_dimensional_test.png" width="100%"/>
</p>
## Contribution
VisualDL is an open-source project jointly launched by [PaddlePaddle](https://www.paddlepaddle.org/) and [ECharts](https://echarts.apache.org/).
The Graph features are powered by [Netron](https://github.com/lutzroeder/netron).
Everyone is welcome to use VisualDL, give feedback, and contribute code.
## More Details
For more details on how to use the visualization features of VisualDL, please see the **VisualDL User Guide**.
## Technical Communication
You are welcome to join the official VisualDL QQ group (1045783368) to discuss VisualDL with the PaddlePaddle team and other users.
# Introduction to VisualDL Toolset
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/docs/images/vs-logo.png" width="60%" />
</p>
## Introduction
VisualDL is a deep learning visualization tool that helps in designing deep learning jobs.
It includes features such as scalar, parameter distribution, model structure and image visualization.
It is currently being developed at a fast pace, and new features will be added continuously.
At present, most DNN frameworks use Python as their primary language. VisualDL supports Python natively:
users can get rich visualization results by simply adding a few lines of Python code to their model before training.
Besides the Python SDK, VisualDL is written in C++ at the lower level and also provides a C++ SDK that
can be integrated into other platforms.
## Component
VisualDL provides the following components:
- scalar
- histogram
- image
- audio
- graph
- high dimensional
### Scalar
Scalar can be used to show the trends of error during training.
<p align="center">
<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/loss_scalar.gif" width="60%"/>
</p>
### Histogram
Histogram can be used to visualize parameter distribution and trends for any tensor.
<p align="center">
<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/histogram.gif" width="60%"/>
</p>
### Image
Image can be used to visualize any tensor or intermediate generated image.
<p align="center">
<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/loss_image.gif" width="60%"/>
</p>
### Audio
Audio can be used to play input audio samples or generated audio samples.
### Graph
The VisualDL graph component supports displaying Paddle models and is also compatible with ONNX ([Open Neural Network Exchange](https://github.com/onnx/onnx)).
Combined with the Python SDK, VisualDL is therefore compatible with most major DNN frameworks, including
PaddlePaddle, PyTorch and MXNet.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/docs/images/graph_demo.gif" width="60%" />
</p>
To display a Paddle model, all you have to do is the following (a short sketch follows the list):
1. Call the `fluid.io.save_inference_model()` interface to save the Paddle model.
2. Use `visualdl --model_pb [paddle_model_dir]` to load the Paddle model from the command line.
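A minimal sketch of these two steps (the single-layer network below is only a placeholder; a complete Lenet-5 example is given in the user guide later in this document):

```python
# coding=utf-8
import paddle.fluid as fluid

# placeholder network: one fully-connected layer over a flat input
image = fluid.layers.data(name="img", shape=[784], dtype="float32")
prediction = fluid.layers.fc(input=image, size=10, act="softmax")

place = fluid.CPUPlace()
exe = fluid.Executor(place=place)
exe.run(fluid.default_startup_program())

# step 1: save the inference model to ./toy_model
fluid.io.save_inference_model("./toy_model",
                              feeded_var_names=[image.name],
                              target_vars=[prediction],
                              executor=exe)
# step 2 is then run from the shell: visualdl --model_pb ./toy_model
```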
### High Dimensional
High Dimensional can be used to visualize data embeddings by projecting high-dimensional data into 2D / 3D.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/docs/getting_started/high_dimensional_3d.png" width="60%"/>
</p>
## Quick Start
To give the VisualDL a quick test, please use the following commands.
```
# Install the VisualDL. Preferably under a virtual environment or anaconda.
pip install --upgrade visualdl
# run a demo, vdl_create_scratch_log will create logs for testing.
vdl_create_scratch_log
visualdl --logdir=scratch_log --port=8080
# visit http://127.0.0.1:8080
```
If you encounter the error `TypeError: __init__() got an unexpected keyword argument 'file'`, your protobuf version is older than 3.5; simply running `pip install --upgrade protobuf` will fix the issue.
If you run into any other issues in the above steps, they are likely caused by environmental problems with different Python or pip versions.
The following installation methods might fix the issues.
## Install with Virtualenv
[Virtualenv](https://virtualenv.pypa.io/en/stable/) creates an isolated Python environment that prevents interference
from other Python programs on the same machine and makes sure Python and pip are located properly.
On macOS, install pip and virtualenv by:
```
sudo easy_install pip
pip install --upgrade virtualenv
```
On Linux, install pip and virtualenv by:
```
sudo apt-get install python3-pip python3-dev python-virtualenv
```
Then create a Virtualenv environment by one of the following commands:
```
virtualenv ~/vdl              # for Python 2.7
virtualenv -p python3 ~/vdl   # for Python 3.x
```
```~/vdl``` will be your Virtualenv directory; you may choose to install it anywhere.
Activate your Virtualenv environment by:
```
source ~/vdl/bin/activate
```
Now you should be able to install VisualDL and run our demo:
```
pip install --upgrade visualdl
# run a demo, vdl_create_scratch_log will create logs for testing.
vdl_create_scratch_log
visualdl --logdir=scratch_log --port=8080
# visit http://127.0.0.1:8080
```
If you still have issues installing VisualDL from Virtualenv, try following installation method.
## Install with Anaconda
Anaconda is a Python distribution with installation and package management tools. It is also an environment manager
that provides the facility to create different Python environments, each with its own settings.
Follow the instructions on the [Anaconda download site](https://www.anaconda.com/download) to download and install Anaconda,
choosing the Python 3.6 command-line installer.
Create a conda environment named ```vdl``` or anything you want by:
```
conda create -n vdl pip python=2.7 # or python=3.3, etc.
```
Activate the conda environment by:
```
source activate vdl
```
Now you should be able to install VisualDL and run our demo:
```
pip install --upgrade visualdl
# run a demo, vdl_create_scratch_log will create logs for testing.
vdl_create_scratch_log
visualdl --logdir=scratch_log --port=8080
# visit http://127.0.0.1:8080
```
If you still have issues installing VisualDL, try installing from sources as in following section.
### Install from source
```
#Preferably under a virtualenv or anaconda.
git clone https://github.com/PaddlePaddle/VisualDL.git
cd VisualDL
python setup.py bdist_wheel
pip install --upgrade dist/visualdl-*.whl
```
If there are still issues regarding ```pip install```, you can still start VisualDL by starting the dev server as described
[here](https://github.com/PaddlePaddle/VisualDL/blob/develop/docs/develop/how_to_dev_frontend_en.md).
## SDK
VisualDL provides both Python SDK and C++ SDK in order to fit more use cases.
### Python SDK
VisualDL now supports both Python 2 and Python 3.
Below is an example of creating a simple Scalar component and inserting data from different timestamps:
```python
import random
from visualdl import LogWriter
logdir = "./tmp"
logger = LogWriter(logdir, sync_cycle=10000)
# mark the components with 'train' label.
with logger.mode("train"):
# create a scalar component called 'scalars/scalar0'
scalar0 = logger.scalar("scalars/scalar0")
# add some records during DL model running.
for step in range(100):
scalar0.add_record(step, random.random())
```
### C++ SDK
Here is a C++ SDK example equivalent to the Python SDK example above:
```c++
#include <cstdlib>
#include <string>
#include "visualdl/logic/sdk.h"
namespace vs = visualdl;
namespace cp = visualdl::components;
int main() {
const std::string dir = "./tmp";
vs::LogWriter logger(dir, 10000);
logger.SetMode("train");
auto tablet = logger.AddTablet("scalars/scalar0");
cp::Scalar<float> scalar0(tablet);
for (int step = 0; step < 1000; step++) {
float v = (float)std::rand() / RAND_MAX;
scalar0.AddRecord(step, v);
}
return 0;
}
```
## Launch Visual DL
After some logs have been generated during training, users can launch the Visual DL application to see real-time data visualizations by:
```
visualdl --logdir <some log dir>
```
VisualDL also supports the following optional parameters:
- `--host` set the IP address
- `--port` set the port
- `-m / --model_pb` specify the path of an ONNX-format model file to view its graph
### Contribute
VisualDL was initially created by [PaddlePaddle](http://www.paddlepaddle.org/) and
[ECharts](http://echarts.baidu.com/).
We welcome everyone to use, comment on and contribute to VisualDL :)
## More details
For more details about how to use VisualDL, please take a look at the [documents](https://github.com/PaddlePaddle/VisualDL/tree/develop/demo).
# VisualDL User Guide
### Overview
VisualDL is a visualization tool designed for deep learning tasks. It uses rich charts to present data, so users can view the characteristics and trends of the data more intuitively and clearly, which helps to analyze data, spot errors in time, and improve the design of neural network models.
VisualDL currently supports seven components: scalar, image, audio, graph, histogram, pr curve and high dimensional. The project is iterating rapidly, so stay tuned for new components. (A small combined sketch follows the table below.)
| Component | Chart | Purpose |
| :-------------------------------------------------: | :--------: | :----------------------------------------------------------- |
| [Scalar](#Scalar--Line-Chart-Component) | Line chart | Dynamically displays scalar data such as loss and accuracy |
| [Image](#Image--Image-Visualization-Component) | Image | Displays images, both inputs and processed results, making it easy to inspect intermediate changes |
| [Audio](#Audio--Audio-Playback-Component) | Audio | Plays audio recorded during training, for monitoring speech recognition and synthesis tasks |
| [Graph](#Graph--Network-Structure-Component) | Network structure | Shows the network structure, node attributes and data flow, helping to study and optimize the network |
| [Histogram](#Histogram--Histogram-Component) | Histogram | Shows the distribution of tensors such as weights and gradients during training |
| [PR Curve](#PR-Curve--PR-Curve-Component) | Line chart | Shows the trade-off between precision and recall, making it easy to pick the best threshold |
| [High Dimensional](#High-Dimensional--Dimensionality-Reduction-Component) | Dimensionality reduction | Maps high-dimensional data into 2D/3D space to visualize embeddings and observe correlations between data points |
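Since the recording interfaces introduced below all hang off the same `LogWriter`, several components can share one log directory. A small combined sketch (tags and values are illustrative), with each call detailed in its own section below:

```python
import numpy as np
from visualdl import LogWriter

# write scalar and histogram records into the same log directory
with LogWriter(logdir="./log/combined_test/train") as writer:
    for step in range(100):
        writer.add_scalar(tag="train/loss", step=step, value=1.0 / (step + 1))
        writer.add_histogram(tag="fc/weight",
                             values=np.random.randn(1000),
                             step=step,
                             buckets=10)
```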
## Scalar -- Line Chart Component
### Introduction
The input to the Scalar component is scalar data, and the component renders training parameters as line charts. Passing scalars such as loss values and accuracy to the scalar component draws line charts, making it easy to observe trends.
### Recording interface
The recording interface of the Scalar component is as follows:
```python
add_scalar(tag, value, step, walltime=None)
```
The parameters are described below:
| Parameter | Type | Description |
| -------- | ------ | ------------------------------------------- |
| tag | string | Tag of the recorded metric, e.g. `train/loss`; must not contain `%` |
| value | float | Value to record |
| step | int | Step number of the record |
| walltime | int | Timestamp of the record, defaults to the current timestamp |
### Demo
- Basic usage
The following example records data with the Scalar component; the code file is available at [Scalar component](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/components/scalar_test.py)
```python
from visualdl import LogWriter

if __name__ == '__main__':
    value = [i/1000.0 for i in range(1000)]
    # initialize a log writer
    with LogWriter(logdir="./log/scalar_test/train") as writer:
        for step in range(1000):
            # add a record tagged `acc` to the writer
            writer.add_scalar(tag="acc", step=step, value=value[step])
            # add a record tagged `loss` to the writer
            writer.add_scalar(tag="loss", step=step, value=1/(value[step] + 1))
```
After running the program above, run the following in the command line
```shell
visualdl --logdir ./log --port 8080
```
Then open `http://127.0.0.1:8080` in your browser to view the line charts below.
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/82397559-478c6d00-9a83-11ea-80db-a0844dcaca35.png" width="100%"/>
</p>
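The `walltime` parameter in the table above can also be supplied explicitly, for example when replaying historical data. A small sketch (using `time.time()` here is just one way to produce an integer timestamp):

```python
import time
from visualdl import LogWriter

with LogWriter(logdir="./log/scalar_test/train") as writer:
    # record a scalar with an explicit timestamp instead of the default "now"
    writer.add_scalar(tag="acc", step=4, value=0.9878,
                      walltime=int(time.time()))
```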
- Multi-experiment comparison
The following example shows how to compare multiple experiments with the Scalar component.
Comparing multiple experiments takes two steps:
1. Create a sub-log-directory to store the parameter data of each experiment
2. When writing data to the scalar component, **use the same tag**, so that the **same type of parameter** can be compared across **different experiments**
```python
from visualdl import LogWriter

if __name__ == '__main__':
    value = [i/1000.0 for i in range(1000)]
    # step 1: create the parent folder `log` and the sub-folder `scalar_test`
    with LogWriter(logdir="./log/scalar_test") as writer:
        for step in range(1000):
            # step 2: add a record tagged `train/acc` to the writer
            writer.add_scalar(tag="train/acc", step=step, value=value[step])
            # step 2: add a record tagged `train/loss` to the writer
            writer.add_scalar(tag="train/loss", step=step, value=1/(value[step] + 1))
    # step 1: create the second sub-folder `scalar_test2`
    value = [i/500.0 for i in range(1000)]
    with LogWriter(logdir="./log/scalar_test2") as writer:
        for step in range(1000):
            # step 2: add the accuracy of scalar_test2 under the same tag `train/acc`
            writer.add_scalar(tag="train/acc", step=step, value=value[step])
            # step 2: add the loss of scalar_test2 under the same tag `train/loss`
            writer.add_scalar(tag="train/loss", step=step, value=1/(value[step] + 1))
```
After running the program above, run the following in the command line
```shell
visualdl --logdir ./log --port 8080
```
Then open `http://127.0.0.1:8080` in your browser to view the line charts below and compare the accuracy and loss of "scalar_test" and "scalar_test2".
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84644158-5efb3080-af31-11ea-8e64-bbe4078425f4.png" width="100%"/>
</p>
*For a real-world example of multi-experiment comparison, see the AI Studio project: [VisualDL 2.0 -- visualizing the training of an eye disease recognition model](https://aistudio.baidu.com/aistudio/projectdetail/502834)
### Usage notes
* Data cards support "maximize", "restore", "axis transformation" (logarithmic y axis) and "download" of the line chart
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/scalar-icon.png" width="55%"/>
</p>
* Hovering over a data point shows its details
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/scalar-tooltip.png" width="60%"/>
</p>
* Card tags can be searched to show the target charts
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/scalar-searchlabel.png" width="90%"/>
</p>
* Data tags can be searched to show specific data streams
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/scalar-searchstream.png" width="40%"/>
</p>
* The X axis has three scales
  1. Step: iteration count
  2. Walltime: absolute training time
  3. Relative: elapsed training time
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/x-axis.png" width="40%"/>
</p>
* The smoothness of the curves can be adjusted to better show the overall trend of the parameters
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/scalar-smooth.png" width="37%"/>
</p>
## Image -- Image Visualization Component
### Introduction
The Image component shows how image data changes during training. Pass image data to the Image component during model training and the corresponding images can be viewed on the VisualDL web page.
### Recording interface
The recording interface of the Image component is as follows:
```python
add_image(tag, img, step, walltime=None)
```
The parameters are described below:
| Parameter | Type | Description |
| -------- | ------------- | ------------------------------------------- |
| tag | string | Tag of the recorded metric, e.g. `train/loss`; must not contain `%` |
| img | numpy.ndarray | Image represented as an ndarray |
| step | int | Step number of the record |
| walltime | int | Timestamp of the record, defaults to the current timestamp |
### Demo
The following example records data with the Image component; the code file is available at [Image component](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/components/image_test.py)
```python
import numpy as np
from PIL import Image
from visualdl import LogWriter


def random_crop(img):
    """Return a random 100x100 crop of the image.
    """
    img = Image.open(img)
    w, h = img.size
    random_w = np.random.randint(0, w - 100)
    random_h = np.random.randint(0, h - 100)
    r = img.crop((random_w, random_h, random_w + 100, random_h + 100))
    return np.asarray(r)


if __name__ == '__main__':
    # initialize a log writer
    with LogWriter(logdir="./log/image_test/train") as writer:
        for step in range(6):
            # add an image record
            writer.add_image(tag="eye",
                             img=random_crop("../../docs/images/eye.jpg"),
                             step=step)
```
After running the program above, run the following in the command line
```shell
visualdl --logdir ./log --port 8080
```
Open `http://127.0.0.1:8080` in your browser to view the image data.
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/image-static.png" width="100%"/>
</p>
### Usage notes
Image tags can be searched to show the corresponding image data
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/image-search.png" width="90%"/>
</p>
The Step slider can be dragged to view image data at different iterations
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/image-eye.gif" width="60%"/>
</p>
## Audio -- Audio Playback Component
### Introduction
The Audio component lets you listen to audio data recorded during training in real time, for monitoring speech recognition and synthesis tasks.
### Recording interface
The recording interface of the Audio component is as follows:
```python
add_audio(tag, audio_array, step, sample_rate)
```
The parameters are described below:
| Parameter | Type | Description |
| ----------- | ------------- | ------------------------------------------ |
| tag | string | Tag of the recorded metric, e.g. `audio_tag`; must not contain `%` |
| audio_array | numpy.ndarray | Audio represented as an ndarray |
| step | int | Step number of the record |
| sample_rate | int | Sample rate; **make sure to pass the original sample rate of the audio** |
### Demo
The following example records data with the Audio component; the code file is available at [Audio component](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/components/audio_test.py)
```python
from visualdl import LogWriter
import numpy as np
import wave
def read_audio_data(audio_path):
"""
Get audio data.
"""
CHUNK = 4096
f = wave.open(audio_path, "rb")
wavdata = []
chunk = f.readframes(CHUNK)
while chunk:
data = np.frombuffer(chunk, dtype='uint8')
wavdata.extend(data)
chunk = f.readframes(CHUNK)
# 8k sample rate, 16bit frame, 1 channel
shape = [8000, 2, 1]
return shape, wavdata
if __name__ == '__main__':
with LogWriter(logdir="./log") as writer:
audio_shape, audio_data = read_audio_data("./testing.wav")
audio_data = np.array(audio_data)
writer.add_audio(tag="audio_tag",
audio_array=audio_data,
step=0,
sample_rate=8000)
```
After running the program above, run the following in the command line
```shell
visualdl --logdir ./log --port 8080
```
Open `http://127.0.0.1:8080` in your browser to view the audio data.
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/87659138-b4746880-c78f-11ea-965b-c33804e7c296.png" width="100%"/>
</p>
### Usage notes
- Audio tags can be searched to show the corresponding audio data
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/87661431-29956d00-c793-11ea-833b-172d8fc1b221.png" width="100%"/>
</p>
- The Step slider can be dragged to listen to audio recorded at different iterations
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/87661089-a07e3600-c792-11ea-8740-cbe99a64d830.png" width="60%"/>
</p>
- Audio can be played and paused
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/87661130-b3910600-c792-11ea-9f9f-2ae66132e9de.png" width="60%"/>
</p>
- The volume can be adjusted
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/87661497-49c52c00-c793-11ea-9eeb-471543cd2a0b.png" width="60%"/>
</p>
- Audio can be downloaded
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/87661166-c277b880-c792-11ea-8ad7-5c60bb08379b.png" width="60%"/>
</p>
## Graph -- Network Structure Component
### Introduction
The Graph component visualizes the network structure of a model with one click. It is used to inspect model attributes, node information, node inputs and outputs, and to search for nodes, helping developers analyze the model structure and understand the data flow quickly.
### Demo
There are two ways to launch it:
- Drag and drop a model file in the frontend:
  - If you only need the Graph component, run `visualdl` in the command line without any arguments, then upload the model in the panel.
  - If you also need other features, specify the log path on the command line (here `./log`), then upload the model in the panel:
```shell
visualdl --logdir ./log --port 8080
```
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487396-44c31780-acd1-11ea-831a-1632e636613d.png" width="80%"/>
</p>
- Launch Graph from the backend:
  - Add the `--model` argument on the command line with the path of the **model file** (not a folder) to launch the panel with the network structure visualization:
```shell
visualdl --model ./log/model --port 8080
```
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84490149-51e20580-acd5-11ea-9663-1f156892c0e0.png" width="100%"/>
</p>
### Usage notes
- One-click model upload
  - Supported model formats: PaddlePaddle, ONNX, Keras, Core ML, Caffe, Caffe2, Darknet, MXNet, ncnn, TensorFlow Lite
  - Experimentally supported formats: TorchScript, PyTorch, Torch, ArmNN, BigDL, Chainer, CNTK, Deeplearning4j, MediaPipe, ML.NET, MNN, OpenVINO, Scikit-learn, Tengine, TensorFlow.js, TensorFlow
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487396-44c31780-acd1-11ea-831a-1632e636613d.png" width="80%"/>
</p>
- The model can be dragged in any direction and zoomed in and out
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/89163601-6ab9b980-d5a8-11ea-9c6d-2dc5eaed0d41.gif" width="100%"/>
</p>
- Search for a node and jump to it
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487694-b9965180-acd1-11ea-8214-34f3febc1828.png" width="30%"/>
</p>
- Click to view model attributes
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487751-cadf5e00-acd1-11ea-9ce2-4fdfeeea9c5a.png" width="30%"/>
</p>
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487759-d03ca880-acd1-11ea-9294-520ef7f9e0b1.png" width="30%"/>
</p>
- Select which information the model displays
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487829-ee0a0d80-acd1-11ea-8563-6682a15483d9.png" width="23%"/>
</p>
- Export the model structure diagram as PNG or SVG
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487884-ff531a00-acd1-11ea-8b12-5221db78683e.png" width="30%"/>
</p>
- Click a node to show its attributes
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487941-13971700-acd2-11ea-937d-42fb524b9ee1.png" width="30%"/>
</p>
- Switch to another model with one click
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487998-27db1400-acd2-11ea-83d7-5d75832ef41d.png" width="25%"/>
</p>
## Histogram -- Histogram Component
### Introduction
The Histogram component shows how tensors (weights, biases, gradients, etc.) are distributed over the course of training as histograms, helping developers understand the effect of each layer and adjust the model structure precisely.
### Recording interface
The recording interface of the Histogram component is as follows:
```python
add_histogram(tag, values, step, walltime=None, buckets=10)
```
The parameters are described below:
| Parameter | Type | Description |
| -------- | --------------------- | ------------------------------------------- |
| tag | string | Tag of the recorded metric, e.g. `train/loss`; must not contain `%` |
| values | numpy.ndarray or list | Data represented as an ndarray or a list |
| step | int | Step number of the record |
| walltime | int | Timestamp of the record, defaults to the current timestamp |
| buckets | int | Number of bins of the generated histogram, defaults to 10 |
### Demo
The following example records data with the Histogram component; the code file is available at [Histogram component](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/components/histogram_test.py)
```python
from visualdl import LogWriter
import numpy as np
if __name__ == '__main__':
values = np.arange(0, 1000)
with LogWriter(logdir="./log/histogram_test/train") as writer:
for index in range(1, 101):
interval_start = 1 + 2 * index / 100.0
interval_end = 6 - 2 * index / 100.0
data = np.random.uniform(interval_start, interval_end, size=(10000))
writer.add_histogram(tag='default tag',
values=data,
step=index,
buckets=10)
```
After running the program above, run the following in the command line
```shell
visualdl --logdir ./log --port 8080
```
Open `http://127.0.0.1:8080` in your browser to view the histograms of the training parameters.
### Usage notes
- Data cards support "maximize" and histogram "download"
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86535351-42d82700-bf12-11ea-89f0-171280e7c526.png" width="60%"/>
</p>
- Offset or Overlay mode can be selected
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86535413-c134c900-bf12-11ea-9ad6-f0ad8eafa76f.png" width="30%"/>
</p>
  - Offset mode
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86536435-2b9d3780-bf1a-11ea-9981-92f837d22ae5.png" width="60%"/>
</p>
  - Overlay mode
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86536458-5ab3a900-bf1a-11ea-985e-05f06c1b762b.png" width="60%"/>
</p>
- Hovering over a data point shows the parameter value, training step and frequency
  - At training step 240, the weight is -0.0031 and it occurred 2734 times
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86536482-80d94900-bf1a-11ea-9e12-5bea9f382b34.png" width="60%"/>
</p>
- Card tags can be searched to show the target histograms
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86536503-baaa4f80-bf1a-11ea-80ab-cd988617d018.png" width="30%"/>
</p>
- Data tags can be searched to show specific data streams
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86536639-b894c080-bf1b-11ea-9ee5-cf815dd4bbd7.png" width="30%"/>
</p>
## PR Curve -- PR Curve Component
### Introduction
PR Curve presents the trade-off between precision and recall as line charts, giving a clear and intuitive view of how well the model is trained and whether it reaches the desired standard.
### Recording interface
The recording interface of the PR Curve component is as follows:
```python
add_pr_curve(tag, labels, predictions, step=None, num_thresholds=10)
```
The parameters are described below:
| Parameter | Type | Description |
| -------------- | --------------------- | ------------------------------------------- |
| tag | string | Tag of the recorded metric, e.g. `train/loss`; must not contain `%` |
| labels | numpy.ndarray or list | Ground-truth labels represented as an ndarray or a list |
| predictions | numpy.ndarray or list | Predicted scores represented as an ndarray or a list |
| step | int | Step number of the record |
| num_thresholds | int | Number of thresholds, defaults to 10, with a maximum of 127 |
### Demo
The following example records data with the PR Curve component; the code file is available at [PR Curve component](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/components/pr_curve_test.py)
```python
from visualdl import LogWriter
import numpy as np
with LogWriter("./log/pr_curve_test/train") as writer:
for step in range(3):
labels = np.random.randint(2, size=100)
predictions = np.random.rand(100)
writer.add_pr_curve(tag='pr_curve',
labels=labels,
predictions=predictions,
step=step,
num_thresholds=5)
```
After running the program above, run the following in the command line
```shell
visualdl --logdir ./log --port 8080
```
Then open `http://127.0.0.1:8080` in your browser to view the PR curves
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86738774-ee46c000-c067-11ea-90d2-a98aac445cca.png" width="100%"/>
</p>
### Usage notes
- Data cards support "maximize", "restore" and "download" of the PR curves
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86740067-f18e7b80-c068-11ea-96bf-52cb7da1f799.png" width="60%"/>
</p>
- Hovering over a data point shows its details: the TP, TN, FP and FN for that threshold
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86740477-43370600-c069-11ea-93f0-f4d05445fbab.png" width="70%"/>
</p>
- Card tags can be searched to show the target charts
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86740670-66fa4c00-c069-11ea-9ee3-0a22e2d0dbec.png" width="50%"/>
</p>
- Data tags can be searched to show specific data
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86740817-809b9380-c069-11ea-9453-6531e3ff5f43.png" width="50%"/>
</p>
- PR curves at different training steps can be viewed
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86741057-b04a9b80-c069-11ea-9fef-2dcc16f9cd46.png" width="50%"/>
</p>
- The X axis (time display) has three scales
  - Step: iteration count
  - Walltime: absolute training time
  - Relative: elapsed training time
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86741304-db34ef80-c069-11ea-86eb-787b49ed3705.png" width="50%"/>
</p>
## High Dimensional -- Dimensionality Reduction Component
### Introduction
The High Dimensional component reduces the dimensionality of high-dimensional data for display, so the relationships between high-dimensional data points can be analyzed in depth. Two dimensionality-reduction algorithms are currently supported:
- PCA: Principal Component Analysis
- t-SNE: t-distributed stochastic neighbor embedding
### Recording interface
The recording interface of the High Dimensional component is as follows:
```python
add_embeddings(tag, labels, hot_vectors, walltime=None)
```
The parameters are described below:
| Parameter | Type | Description |
| ----------- | ------------------- | ---------------------------------------------------- |
| tag | string | Tag of the recorded metric, e.g. `default`; must not contain `%` |
| labels | numpy.array or list | One-dimensional array of labels; each element is a string |
| hot_vectors | numpy.array or list | One feature vector per label, in one-to-one correspondence with labels |
| walltime | int | Timestamp of the record, defaults to the current timestamp |
### Demo
The following example records data with the High Dimensional component; the code file is available at [High Dimensional component](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/components/high_dimensional_test.py)
```python
from visualdl import LogWriter

if __name__ == '__main__':
    hot_vectors = [
        [1.3561076367500755, 1.3116267195134017, 1.6785401875616097],
        [1.1039614644440658, 1.8891609992484688, 1.32030488587171],
        [1.9924524852447711, 1.9358920727142739, 1.2124401279391606],
        [1.4129542689796446, 1.7372166387197474, 1.7317806077076527],
        [1.3913371800587777, 1.4684674577930312, 1.5214136352476377]]
    labels = ["label_1", "label_2", "label_3", "label_4", "label_5"]
    # initialize a log writer
    with LogWriter(logdir="./log/high_dimensional_test/train") as writer:
        # record a set of labels and the corresponding hot_vectors
        writer.add_embeddings(tag='default',
                              labels=labels,
                              hot_vectors=hot_vectors)
```
After running the program above, run the following in the command line
```shell
visualdl --logdir ./log --port 8080
```
Then open `http://127.0.0.1:8080` in your browser to view the visualization of the reduced data.
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/dynamic_high_dimensional.gif" width="100%"/>
</p>
# VisualDL user guide
## Overview
VisualDL is a toolkit to visualize data generated in deep learning tasks. VisualDL makes use of [ECharts](https://echarts.apache.org/en/feature.html) to display the distribution and change tendency of data, so that users can view data more clearly and intuitively.
To help users analyze the characteristics of data, detect errors, and optimize the neural network model, VisualDL provides seven functional components: scalar, histogram, image, text, audio, high dimensional and graph.
| Component name | Display chart | Function of component |
|:----:|:----:|:---|
|<a href="#1">scalar</a>| Line Chart | Dynamically display scalar data, such as loss, accuracy, etc.|
|<a href="#2">histogram</a>| Histogram | Dynamically display the numerical distribution and change tendency of parameters (such as weight matrices, biases, gradients, etc.)|
|<a href="#3">image</a>| Image | Dynamically display images, including input images and convolution results, making it convenient to view the change tendency of intermediate results|
|<a href="#4">text</a>| Text | Dynamically display text |
|<a href="#5">audio</a>| Audio | Dynamically display audio, users can play directly or choose to download|
|<a href="#6">high dimensional</a>| Coordinate | Map high dimensional data into 2D/3D space, for making it easy to observe the correlation of different data|
|<a href="#7">graph</a>| Directed Graph | Display the neural networks |
## Toolkits for adding data
The six components (scalar, histogram, image, text, audio and high dimensional) are used to add data while the program is running. The LogWriter class must be initialized before adding data, in order to set the storage path and synchronization cycle. The data passed to each component is saved as log files on disk, and these log files are then loaded into the frontend for display.
### LogWriter
LogWriter is a Python wrapper to write data to log file with the data format defined as in protobuf file [storage.proto](https://github.com/PaddlePaddle/VisualDL/blob/develop/visualdl/storage/storage.proto).
The definition of LogWriter :
```python
class LogWriter(dir, sync_cycle)
```
> :param dir : the directory path to the saved log files.
> :param sync_cycle : specify how often should the system store data into the file system, that is, system will save the data into the file system once operations count reaches sync_cycle.
> :return: a new LogWriter instance.
Demo 1. Create a LogWriter instance
```python
# Create a LogWriter instance named log_writer
log_writer = LogWriter("./log", sync_cycle=10)
```
The LogWriter class includes the following member functions:
* `mode()`
* `scalar()`, `histogram()`, `image()`, `text()`, `audio()`, `embedding()`
The member function mode() is used to specify the phase of the program. The input string is user defined, such as `train`, `validation`, `test`, `conv_layer1`. Components with the same mode are grouped together, so users can choose different modes to display on the frontend webpage.
The member functions scalar(), histogram(), image(), text(), audio() and embedding() are used to create component instances.
Demo 2. Use LogWriter instance to create component instance
```python
# Set the name of mode to "train", and create a scalar component instance
with log_writer.mode("train") as logger:
train_scalar = logger.scalar("acc")
# Set the name of mode to "test", and create an image component instance
with log_writer.mode("test") as shower:
test_image = shower.image("conv_image", 10, 1)
```
### scalar -- component to draw line charts
The <a name="1">scalar</a> component is used to draw line charts. By passing scalar data such as loss value, accuracy as input parameters into the scalar() function, the frontend webpage will display the data in the form of line charts. It can facilitate users to grasp the changing tendency of training process.
The first step of using scalar component is initializing the member function scalar() of LogWriter instance, then you can add data through the member function add_record() of ScalarWriter instance.
* The member function `scalar()` of LogWriter instance :
```python
def scalar(tag, type)
```
> :param tag : The scalar writer will label the data with tag.
> :param type : Data type, limited to "float", "double", "int"; the default setting is "float".
> :return : A ScalarWriter instance to handle step and value records.
* The member function `add_record()` of ScalarWriter instance :
```python
def add_record(step, value)
```
> :param step : Step number.
> :param value : Input data.
Demo 3. scalar demo program[Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/scalar-demo.py)
```python
# coding=utf-8
from visualdl import LogWriter
# Create a LogWriter instance
log_writer = LogWriter("./log", sync_cycle=20)
# Create two ScalarWriter instances, whose mode is set to be "train"
with log_writer.mode("train") as logger:
train_acc = logger.scalar("acc")
train_loss = logger.scalar("loss")
# Create a ScalarWriter instance, whose mode is set to be "test"
with log_writer.mode("test") as logger:
test_acc = logger.scalar("acc")
value = [i/1000.0 for i in range(1000)]
for step in range(1000):
# Add data
train_acc.add_record(step, value[step])
train_loss.add_record(step, 1 / (value[step] + 1))
test_acc.add_record(step, 1 - value[step])
```
After running the demo program above, you can start the flask server with command ``visualdl`` :
```shell
visualdl --logdir ./log --host 0.0.0.0 --port 8080
```
By opening the URL [http://0.0.0.0:8080](http://0.0.0.0:8080) in your browser, you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/scalar-interface.png" width=800><br/>
Figure 1. scalar component displays line charts <br/>
</p>
The right sidebar of VisualDL has adjustment options for each component, take scalar component as example:
* Smoothing : To adjust the smoothness of the line charts.
* X-axis : The horizontal ordinate of line charts, optional choice : Step, Relative, Wall Time.
* Tooltip sorting : Sorting method of tag, optional choice : default, descending, ascending, nearest.
There is also a ``RUNNING`` button at the bottom of the right sidebar, the frontend webpage will send request to the flask server for data synchronization. Switching to ``Stopped``, it will pause the data update.
### histogram -- component to display data distribution
The <a name="2">histogram</a> component is used to draw histogram for displaying the distribution of input data. By passing some parameters of model training, such as weight matrices, biases, gradient, as input parameters into the `histogram()` function, the frontend webpage will display the data in the form of histogram. It can facilitate users to view the change tendency of parameters distribution.
The first step of using histogram component is initializing the member function `histogram()` of LogWriter instance, then you can add data through the member function `add_record()` of HistogramWriter instance.
* The member function histogram() of LogWriter instance :
```python
def histogram(tag, num_buckets, type)
```
> :param tag : The histogram writer will label the data with tag.
> :param num_buckets : The number of bins in the histogram.
> :param type : Data type, limited to "float", "double", "int"; the default setting is "float".
> :return : A HistogramWriter instance to record distribution.
* The member function add_record() of HistogramWriter instance :
```python
def add_record(step, value)
```
> :param step : Step number.
> :param value : Input data, type is list[].
Demo 4. histogram demo program [Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/histogram-demo.py)
```python
# coding=utf-8
import numpy as np
from visualdl import LogWriter
# Create a LogWriter instance
log_writer = LogWriter('./log', sync_cycle=10)
# Create a HistogramWriter instance, whose mode is set to be "train"
with log_writer.mode("train") as logger:
param1_histogram = logger.histogram("param1", num_buckets=100)
# Loop
for step in range(1, 101):
# Create input data
interval_start = 1 + 2 * step/100.0
interval_end = 6 - 2 * step/100.0
data = np.random.uniform(interval_start, interval_end, size=(10000))
# Use member function add_record() to add data
param1_histogram.add_record(step, data)
```
After running the demo program above, you can start the flask server with command ``visualdl`` :
```shell
visualdl --logdir ./log --host 0.0.0.0 --port 8080
```
By opening the URL [http://0.0.0.0:8080](http://0.0.0.0:8080) in your browser, you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/histogram-interface.png" width=800><br/>
Figure 2. histogram component displays histograms <br/>
</p>
### image -- component to display image
The <a name="3">image</a> component is used to visualize the image data. By passing the image data (type numpy.ndarray) into the image() function, the frontend webpage will display the image directly.
The first step of using image component is initializing the member function image() of LogWriter instance. Then you can add data through the member functions start_sampling(), is_sample_taken(), set_sample(), and finish_sample() of ImageWriter instance.
* The member function image() of LogWriter instance :
```python
def image(tag, num_samples, step_cycle)
```
> :param tag : The image writer will label the image with tag.
> :param num_samples : Appoint the number of samples to take in a step.
> :param step_cycle : Store every `step_cycle` as a record, the default value is 1.
> :return: A ImageWriter instance to sample images.
* Start a new sampling cycle, allocate memory space for the sampled data
```python
def start_sampling()
```
* Determine whether the picture should be sampled or not. If the return value is -1, it means no sampling, otherwise it should be sampled :
```python
def is_sample_taken()
```
* Add image data :
```python
def set_sample(index, image_shape, image_data)
```
> :param index : Combined with tag, used to determine the sub-frame of the image display.
> :param image_shape : The shape of the image, [width, height, channel (RGB is 3, grayscale is 1)].
> :param image_data : Image data with type numpy.ndarray, member function flatten() can turn the shape to row vector.
* End the current sampling period, load the sampled data into disk, and release the memory space :
```python
def finish_sample()
```
Demo 5. image demo program [Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/image-demo.py)
```python
# coding=utf-8
import numpy as np
from visualdl import LogWriter
from PIL import Image
def random_crop(img):
'''
This function is used to get a random block (100*100 pixels) of data img.
'''
img = Image.open(img)
w, h = img.size
random_w = np.random.randint(0, w - 100)
random_h = np.random.randint(0, h - 100)
return img.crop((random_w, random_h, random_w + 100, random_h + 100))
# Create a LogWriter instance
log_writer = LogWriter("./log", sync_cycle=10)
# Create a ImageWriter instance
ns = 2
with log_writer.mode("train") as logger:
input_image = logger.image(tag="test", num_samples=ns)
# The variable sample_num is used to record the number of image data that have been sampled
sample_num = 0
for step in range(6):
# Set the condition of start_sampling()
if sample_num == 0:
input_image.start_sampling()
idx = input_image.is_sample_taken()
# if idx != -1,sample this data, otherwise skip
if idx != -1:
# Get image data
image_path = "test.jpg"
image_data = np.array(random_crop(image_path))
# Add data
input_image.set_sample(idx, image_data.shape, image_data.flatten())
sample_num += 1
# If sampling of the present period have been completed, call finish_sample()
if sample_num % ns == 0:
input_image.finish_sampling()
sample_num = 0
```
After running the demo program above, you can start the flask server with command ``visualdl`` :
```shell
visualdl --logdir ./log --host 0.0.0.0 --port 8080
```
By opening the URL [http://0.0.0.0:8080](http://0.0.0.0:8080) in your browser, then clicking the ``SAMPLES`` option at the top of the webpage, you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/image-interface.png" width=800><br/>
Figure 3. image component displays images <br/>
</p>
Each subgraph has a horizontal axis which can be dragged to display images of different steps.
### text -- component to display text
The <a name="4">text</a> component is used to visualize text data. By passing the text data (type string) into the text() function, the frontend webpage will display the text directly.
The first step of using text component is initializing the member function text() of LogWriter instance, then you can add data through the member function add_record() of TextWriter instance.
* The member function text() of LogWriter instance :
```python
def text(tag)
```
> :param tag : The text writer will label the data with tag, which determines the sub-frame of the display.
* The member function add_record() of TextWriter instance :
```python
def add_record(step, str)
```
> :param step : Step number.
> :param value : Input data, type is string.
Demo 6. text demo program [Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/text-demo.py)
```python
# coding=utf-8
from visualdl import LogWriter
# create a LogWriter instance
log_writter = LogWriter("./log", sync_cycle=10)
# Create a TextWriter instance
with log_writter.mode("train") as logger:
vdl_text_comp = logger.text(tag="test")
# Use member function add_record() to add data
for i in range(1, 6):
vdl_text_comp.add_record(i, "这是第 %d 个 step 的数据。" % i)
vdl_text_comp.add_record(i, "This is data %d ." % i)
```
After running the demo program above, you can start the flask server with command ``visualdl`` :
```shell
visualdl --logdir ./log --host 0.0.0.0 --port 8080
```
By opening the URL [http://0.0.0.0:8080](http://0.0.0.0:8080) in your browser, then clicking the ``SAMPLES`` option at the top of the webpage, you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/text-interface.png" width=800><br/>
Figure 4. text component displays texts <br/>
</p>
Each subgraph has a horizontal axis which can be dragged to display text of different steps.
### audio -- component to play audio
The <a name="5"> audio</a> component is used to play audio. By passing the audio data (type numpy.ndarray) into the audio() function, users can play audio directly, or choose to download.
The first step of using audio component is initializing the member function audio() of LogWriter instance. Then you can add data through the member functions start_sampling(), is_sample_taken(), set_sample(), and finish_sample() of AudioWriter instance.
* The member function audio() of LogWriter instance :
```python
def audio(tag, num_samples, step_cycle)
```
> :param tag : The audio writer will label the audio with tag.
> :param num_samples : Appoint the number of samples to take in a step.
> :param step_cycle : Store every `step_cycle` as a record, the default value is 1.
> :return: An AudioWriter instance to sample audio.
* Start a new sampling cycle, allocate memory space for the sampled data :
```python
def start_sampling()
```
* Determine whether the audio should be sampled or not. If the return value is -1, it means no sampling, otherwise it should be sampled :
```python
def is_sample_taken()
```
* Add audio data :
```python
def set_sample(index, audio_params, audio_data)
```
> :param index : Combined with tag, used to determine the sub-frame of the audio.
> :param audio_params : The parameters of audio, [sample rate, sample width, channels].
> :param audio_data : Audio data with type numpy.ndarray, member function flatten() can turn the shape to row vector.
* End the current sampling period, load the sampled data into disk, and release the memory space :
```python
def finish_sample()
```
Demo 7. audio demo program [Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/audio-demo.py)
```python
# coding=utf-8
import numpy as np
import wave
from visualdl import LogWriter
def read_audio_data(audio_path):
"""
Read audio data
"""
CHUNK = 4096
f = wave.open(audio_path, "rb")
wavdata = []
chunk = f.readframes(CHUNK)
while chunk:
data = np.frombuffer(chunk, dtype='uint8')
wavdata.extend(data)
chunk = f.readframes(CHUNK)
# 8k sample rate, 16bit frame, 1 channel
shape = [8000, 2, 1]
return shape, wavdata
# Create a LogWriter instance
log_writter = LogWriter("./log", sync_cycle=10)
# Create an AudioWriter instance
ns = 2
with log_writter.mode("train") as logger:
input_audio = logger.audio(tag="test", num_samples=ns)
# The variable sample_num is used to record the number of audio data that have been sampled
audio_sample_num = 0
for step in range(9):
# Set the condition of start_sampling()
if audio_sample_num == 0:
input_audio.start_sampling()
# Get idx
idx = input_audio.is_sample_taken()
# if idx != -1,sample this data, otherwise skip
if idx != -1:
# Read audio data
audio_path = "test.wav"
audio_shape, audio_data = read_audio_data(audio_path)
# Add data through member function set_sample()
input_audio.set_sample(idx, audio_shape, audio_data)
audio_sample_num += 1
# If sampling of the present period have been completed, call finish_sample()
if audio_sample_num % ns ==0:
input_audio.finish_sampling()
audio_sample_num = 0
```
After running the demo program above, you can start the flask server with command ``visualdl`` :
```shell
visualdl --logdir ./log --host 0.0.0.0 --port 8080
```
By opening the URL [http://0.0.0.0:8080](http://0.0.0.0:8080) in your browser, then clicking the ``SAMPLES`` option at the top of the webpage, you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/audio-interface.png" width=800><br/>
Figure 5. audio component displays audios <br/>
</p>
Each subgraph has a horizontal axis which can be dragged to play audio of different steps.
### high dimensional -- component of dimensionality reduction
The role of <a name="6">high dimensional</a> component is to map data into 2D or 3D space for embedding visualization, which is helpful for users to understand the relevance of different data.
The high dimensional component supports the following two dimensionality reduction algorithms :
* PCA : Principle Component Analysis
* [t-SNE](https://lvdmaaten.github.io/tsne/) : t-distributed stochastic neighbor embedding
The first step of using the high dimensional component is initializing the member function embedding() of a LogWriter instance. Then you can add data through the member function add_embeddings_with_word_dict() of the EmbeddingWriter instance.
* The member function embedding() of LogWriter instance
```python
def embedding()
```
* The member function add_embeddings_with_word_dict() of EmbeddingWriter instance :
```python
def add_embeddings_with_word_dict(data, Dict)
```
> :param data : input data , type List[List(float)].
> :param Dict : dictionary, type Dict[str, int].
Demo 8. high dimensional demo program [Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/embedding-demo.py)
```python
# coding=utf-8
import numpy as np
from visualdl import LogWriter
# Create a LogWriter instance
log_writer = LogWriter("./log", sync_cycle=10)
# Create an EmbeddingWriter instance
with log_writer.mode("train") as logger:
train_embedding = logger.embedding()
# Initialize data List[List(float)]
hot_vectors = np.random.uniform(1, 2, size=(10, 3))
word_dict = {
"label_1": 5,
"label_2": 4,
"label_3": 3,
"label_4": 2,
"label_5": 1,}
# Add data through member function add_embeddings_with_word_dict(data, Dict)
train_embedding.add_embeddings_with_word_dict(hot_vectors, word_dict)
```
After running the demo program above, you can start the flask server with command ``visualdl`` :
```shell
visualdl --logdir ./log --host 0.0.0.0 --port 8080
```
By opening the URL [http://0.0.0.0:8080](http://0.0.0.0:8080) in your browser, then clicking the ``HIGHDIMENSIONAL`` option at the top of the webpage, you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/embedding-2D.png" width=800><br/>
Figure 6. high dimensional component displays plane coordinates <br/>
</p>
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/embedding-3D.png" width=800><br/>
Figure 7. High dimensional component displays Cartesian coordinates <br/>
</p>
## graph -- component to visualize neural network
The role of the <a name="7">graph</a> component is to visualize neural networks. This component can display models in
Paddle format or [ONNX](https://onnx.ai) format. The graph component can help users understand the model structure of the neural network, and also helps to troubleshoot neural network configuration errors.
Unlike the other components, which need data to be recorded, the only prerequisite for using the graph component is specifying the storage path of the model file: add the option --model_pb to the ``visualdl`` command to specify the storage path of the model file, and you can see the corresponding neural network in the frontend webpage.
Demo 9. graph demo program(How to save a Lenet-5 model by Paddle)[Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/graph-demo.py)
```python
# coding=utf-8
import paddle.fluid as fluid
def lenet_5(img):
'''
Define the Lenet-5 model
'''
conv1 = fluid.nets.simple_img_conv_pool(
input=img,
filter_size=5,
num_filters=20,
pool_size=2,
pool_stride=2,
act="relu")
conv1_bn = fluid.layers.batch_norm(input=conv1)
conv2 = fluid.nets.simple_img_conv_pool(
input=conv1_bn,
filter_size=5,
num_filters=50,
pool_size=2,
pool_stride=2,
act="relu")
predition = fluid.layers.fc(input=conv2, size=10, act="softmax")
return predition
# Variable assignment
image = fluid.layers.data(name="img", shape=[1, 28, 28], dtype="float32")
predition = lenet_5(image)
place = fluid.CPUPlace()
exe = fluid.Executor(place=place)
exe.run(fluid.default_startup_program())
# save the result to "./paddle_lenet_5_model"
fluid.io.save_inference_model(
"./paddle_lenet_5_model",
feeded_var_names=[image.name],
target_vars=[predition],
executor=exe)
```
After running the demo program above, you can start the flask server with command ``visualdl`` :
```shell
visualdl --logdir ./log --host 0.0.0.0 --port 8080 --model_pb paddle_lenet_5_model
```
By opening the URL [http://0.0.0.0:8080](http://0.0.0.0:8080) in your browser, then clicking the `GRAPHS` option at the top of the webpage, you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/graph.png" width=800><br/>
Figure 8. graph component displays the model structure of Lenet-5 <br/>
</p>
########
预测部署
########
- `服务器端部署 <inference/index_cn.html>`_ :介绍了如何在服务器端将模型部署上线
- `移动端部署 <mobile/index_cn.html>`_ :介绍了 PaddlePaddle 组织下的嵌入式平台深度学习框架Paddle-Lite
- `模型压缩 <paddleslim/paddle_slim.html>`_ :简要介绍了PaddleSlim模型压缩工具库的特点以及使用说明。
.. toctree::
:hidden:
inference/index_cn.rst
mobile/index_cn.rst
paddleslim/paddle_slim.md
#######################
Deploy Inference Model
#######################
- `Server side Deployment <inference/index_en.html>`_ : This section illustrates the method how to deploy and release the trained models on the servers
- `Model Compression <paddleslim/paddle_slim_en.html>`_ : Introduce the features and usage of PaddleSlim which is a toolkit for model compression.
.. toctree::
:hidden:
inference/index_en.rst
paddleslim/paddle_slim_en.rst
.. _install_or_build_cpp_inference_lib:
安装与编译 Linux 预测库
===========================
直接下载安装
-------------
.. csv-table::
:header: "版本说明", "预测库(1.8.4版本)", "预测库(2.0.0-beta0版本)", "预测库(develop版本)"
:widths: 3, 2, 2, 2
"ubuntu14.04_cpu_avx_mkl", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-cpu-avx-mkl/fluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-cpu-avx-mkl/paddle_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-cpu-avx-mkl/paddle_inference.tgz>`_"
"ubuntu14.04_cpu_avx_openblas", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-cpu-avx-openblas/fluid_inference.tgz>`_", ,"`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-cpu-avx-openblas/paddle_inference.tgz>`_"
"ubuntu14.04_cpu_noavx_openblas", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-cpu-noavx-openblas/fluid_inference.tgz>`_", ,"`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-cpu-noavx-openblas/paddle_inference.tgz>`_"
"ubuntu14.04_cuda9.0_cudnn7_avx_mkl", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-gpu-cuda9-cudnn7-avx-mkl/fluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-gpu-cuda9-cudnn7-avx-mkl/paddle_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-gpu-cuda9-cudnn7-avx-mkl/paddle_inference.tgz>`_"
"ubuntu14.04_cuda10.0_cudnn7_avx_mkl", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-gpu-cuda10-cudnn7-avx-mkl/fluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-gpu-cuda10-cudnn7-avx-mkl/paddle_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-gpu-cuda10-cudnn7-avx-mkl/paddle_inference.tgz>`_"
"ubuntu14.04_cuda10.1_cudnn7.6_avx_mkl_trt6", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-gpu-cuda10.1-cudnn7.6-avx-mkl-trt6%2Ffluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-gpu-cuda10.1-cudnn7.6-avx-mkl-trt6%2Fpaddle_inference.tgz>`_",
"nv-jetson-cuda10-cudnn7.5-trt5", "`fluid_inference.tar.gz <https://paddle-inference-lib.bj.bcebos.com/1.7.1-nv-jetson-cuda10-cudnn7.5-trt5/fluid_inference.tar.gz>`_", "`paddle_inference.tar.gz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-nv-jetson-cuda10-cudnn7.5-trt5/paddle_inference.tgz>`_",
从源码编译
----------
用户也可以从 PaddlePaddle 核心代码编译C++预测库,只需在编译时配置下面这些编译选项:
============================ ============= ==================
选项 值 说明
============================ ============= ==================
CMAKE_BUILD_TYPE Release 编译方式,仅使用预测库设为Release即可
FLUID_INFERENCE_INSTALL_DIR 安装路径 预测库安装路径
WITH_PYTHON OFF(推荐) 编译python预测库与whl包
ON_INFER ON(推荐) 预测时使用,必须设为ON
WITH_GPU ON/OFF 编译支持GPU的预测库
WITH_MKL ON/OFF 编译支持MKL的预测库
WITH_MKLDNN ON/OFF 编译支持MKLDNN的预测库
WITH_XBYAK ON 使用XBYAK编译,在jetson硬件上编译需要设置为OFF
WITH_NV_JETSON OFF 在NV Jetson硬件上编译时需要设为ON
============================ ============= ==================
建议按照推荐值设置,以避免链接不必要的库。其它可选编译选项按需进行设定。
首先从github拉取最新代码
.. code-block:: bash
git clone https://github.com/paddlepaddle/Paddle
cd Paddle
# 建议使用git checkout切换到Paddle稳定的版本,如:
git checkout v1.8.4
**note**: 如果您是多卡机器,建议安装NCCL;如果您是单卡机器则可以在编译时显式指定WITH_NCCL=OFF来跳过这一步。注意如果WITH_NCCL=ON,且没有安装NCCL,则编译会报错。
.. code-block:: bash
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j4
make install
**Server端预测库源码编译**
下面的代码片段配置编译选项并进行编译(需要将PADDLE_ROOT替换为PaddlePaddle预测库的安装路径,WITH_NCCL根据实际情况进行修改):
.. code-block:: bash
PADDLE_ROOT=/path/of/paddle
cd Paddle
mkdir build
cd build
cmake -DFLUID_INFERENCE_INSTALL_DIR=$PADDLE_ROOT \
-DCMAKE_BUILD_TYPE=Release \
-DWITH_PYTHON=OFF \
-DWITH_MKL=OFF \
-DWITH_GPU=OFF \
-DON_INFER=ON \
-DWITH_NCCL=OFF \
..
make
make inference_lib_dist
**NVIDIA Jetson嵌入式硬件预测库源码编译**
NVIDIA Jetson是NVIDIA推出的嵌入式AI平台,Paddle Inference支持在 NVIDIA Jetson平台上编译预测库。具体步骤如下:
1. 准备环境
开启硬件性能模式
.. code-block:: bash
sudo nvpmodel -m 0 && sudo jetson_clocks
如果硬件为Nano,增加swap空间
.. code-block:: bash
#增加DDR可用空间,Xavier默认内存为16G,所以内存足够,如想在Nano上尝试,请执行如下操作。
sudo fallocate -l 5G /var/swapfile
sudo chmod 600 /var/swapfile
sudo mkswap /var/swapfile
sudo swapon /var/swapfile
sudo bash -c 'echo "/var/swapfile swap swap defaults 0 0" >> /etc/fstab'
2. 编译Paddle Inference预测库
.. code-block:: bash
cd Paddle
mkdir build
cd build
cmake .. \
-DWITH_CONTRIB=OFF \
-DWITH_MKL=OFF \
-DWITH_MKLDNN=OFF \
-DWITH_TESTING=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DON_INFER=ON \
-DWITH_PYTHON=OFF \
-DWITH_XBYAK=OFF \
-DWITH_NV_JETSON=ON
make -j4
# 生成预测lib
make inference_lib_dist -j4
3. 样例测试
请参照官网样例:https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer.html#id2
**FAQ**
1. 报错:
.. code-block:: bash
ERROR: ../aarch64-linux-gpn/crtn.o: Too many open files.
则增加系统同一时间最多可开启的文件数至2048
.. code-block:: bash
ulimit -n 2048
2. 编译卡住
可能是下载第三方库较慢的原因,耐心等待或kill掉编译进程重新编译
3. 使用TensorRT报错IPluginFactory或IGpuAllocator缺少虚析构函数
下载安装TensorRT后,在NvInfer.h文件中为class IPluginFactory和class IGpuAllocator分别添加虚析构函数:
.. code-block:: bash
virtual ~IPluginFactory() {};
virtual ~IGpuAllocator() {};
成功编译后,使用C++预测库所需的依赖(包括:(1)编译出的PaddlePaddle预测库和头文件;(2)第三方链接库和头文件;(3)版本信息与编译选项信息)
均会存放于PADDLE_ROOT目录中。目录结构如下:
.. code-block:: text
PaddleRoot/
├── CMakeCache.txt
├── paddle
│   ├── include
│   │   ├── paddle_anakin_config.h
│   │   ├── paddle_analysis_config.h
│   │   ├── paddle_api.h
│   │   ├── paddle_inference_api.h
│   │   ├── paddle_mkldnn_quantizer_config.h
│   │   └── paddle_pass_builder.h
│   └── lib
│       ├── libpaddle_fluid.a
│       └── libpaddle_fluid.so
├── third_party
│   └── install
│       ├── gflags
│       ├── glog
│       ├── mkldnn
│       ├── mklml
│       └── protobuf
└── version.txt
version.txt 中记录了该预测库的版本信息,包括Git Commit ID、使用OpenBlas或MKL数学库、CUDA/CUDNN版本号,如:
.. code-block:: text
GIT COMMIT ID: 0231f58e592ad9f673ac1832d8c495c8ed65d24f
WITH_MKL: ON
WITH_MKLDNN: ON
WITH_GPU: ON
CUDA version: 10.1
CUDNN version: v7
.. _install_or_build_cpp_inference_lib_en:
Install and Compile C++ Inference Library on Linux
===================================================
Direct Download and Installation
---------------------------------
.. csv-table:: c++ inference library list
:header: "version description", "inference library(1.8.4 version)", "inference library(2.0.0-beta0 version)", "inference library(develop version)"
:widths: 3, 2, 2, 2
"ubuntu14.04_cpu_avx_mkl", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-cpu-avx-mkl/fluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-cpu-avx-mkl/paddle_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-cpu-avx-mkl/paddle_inference.tgz>`_"
"ubuntu14.04_cpu_avx_openblas", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-cpu-avx-openblas/fluid_inference.tgz>`_", ,"`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-cpu-avx-openblas/paddle_inference.tgz>`_"
"ubuntu14.04_cpu_noavx_openblas", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-cpu-noavx-openblas/fluid_inference.tgz>`_", ,"`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-cpu-noavx-openblas/paddle_inference.tgz>`_"
"ubuntu14.04_cuda9.0_cudnn7_avx_mkl", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-gpu-cuda9-cudnn7-avx-mkl/fluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-gpu-cuda9-cudnn7-avx-mkl/paddle_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-gpu-cuda9-cudnn7-avx-mkl/paddle_inference.tgz>`_"
"ubuntu14.04_cuda10.0_cudnn7_avx_mkl", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-gpu-cuda10-cudnn7-avx-mkl/fluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-gpu-cuda10-cudnn7-avx-mkl/paddle_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-gpu-cuda10-cudnn7-avx-mkl/paddle_inference.tgz>`_"
"ubuntu14.04_cuda10.1_cudnn7.6_avx_mkl_trt6", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-gpu-cuda10.1-cudnn7.6-avx-mkl-trt6%2Ffluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-gpu-cuda10.1-cudnn7.6-avx-mkl-trt6%2Fpaddle_inference.tgz>`_",
"nv-jetson-cuda10-cudnn7.5-trt5", "`fluid_inference.tar.gz <https://paddle-inference-lib.bj.bcebos.com/1.7.1-nv-jetson-cuda10-cudnn7.5-trt5/fluid_inference.tar.gz>`_", "`paddle_inference.tar.gz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-nv-jetson-cuda10-cudnn7.5-trt5/paddle_inference.tgz>`_",
Build from Source Code
-----------------------
Users can also compile C++ inference libraries from the PaddlePaddle core code by specifying the following compile options at compile time:
============================ ================ ==================
Option                       Value            Description
============================ ================ ==================
CMAKE_BUILD_TYPE             Release          cmake build type, set to Release if debug messages are not needed
FLUID_INFERENCE_INSTALL_DIR  path             install path of inference libs
WITH_PYTHON                  OFF(recommended) build python libs and whl package
ON_INFER                     ON(recommended)  build with inference settings
WITH_GPU                     ON/OFF           build inference libs on GPU
WITH_MKL                     ON/OFF           build inference libs supporting MKL
WITH_MKLDNN                  ON/OFF           build inference libs supporting MKLDNN
WITH_XBYAK                   ON               build with XBYAK, must be OFF when building on NV Jetson platforms
WITH_NV_JETSON               OFF              build inference libs on NV Jetson platforms
============================ ================ ==================
It is recommended to configure options according to the recommended values to avoid linking unnecessary libraries. Other options can be set if it is necessary.
Firstly we pull the latest code from github.
.. code-block:: bash
git clone https://github.com/paddlepaddle/Paddle
cd Paddle
# Use git checkout to switch to stable versions such as v1.8.4
git checkout v1.8.4
**note**: If your environment is a multi-card machine, it is recommended to install nccl; otherwise, you can skip this step by specifying WITH_NCCL = OFF during compilation. Note that if WITH_NCCL = ON, and NCCL is not installed, the compiler will report an error.
.. code-block:: bash
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j4
make install
**build inference libs on server**
The following commands set the configuration and build the library (PADDLE_ROOT should be set to the actual install path of the inference libs, and WITH_NCCL should be modified according to the actual environment):
.. code-block:: bash
PADDLE_ROOT=/path/of/capi
git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle
mkdir build
cd build
cmake -DFLUID_INFERENCE_INSTALL_DIR=$PADDLE_ROOT \
-DCMAKE_BUILD_TYPE=Release \
-DWITH_PYTHON=OFF \
-DWITH_MKL=OFF \
-DWITH_GPU=OFF \
-DON_INFER=ON \
-DWITH_NCCL=OFF \
..
make
make inference_lib_dist
**build inference libs on NVIDIA Jetson platforms**
NVIDIA Jetson is an AI computing platform in embedded systems introduced by NVIDIA. Paddle Inference supports building inference libs on NVIDIA Jetson platforms. The steps are as following.
1. Prepare environments
Turn on hardware performance mode
.. code-block:: bash
sudo nvpmodel -m 0 && sudo jetson_clocks
if building on Nano hardwares, increase swap memory
.. code-block:: bash
# Increase DDR valid space. Default memory allocated is 16G, which is enough for Xavier. Following steps are for Nano hardwares.
sudo fallocate -l 5G /var/swapfile
sudo chmod 600 /var/swapfile
sudo mkswap /var/swapfile
sudo swapon /var/swapfile
sudo bash -c 'echo "/var/swapfile swap swap defaults 0 0" >> /etc/fstab'
2. Build paddle inference libs
.. code-block:: bash
cd Paddle
mkdir build
cd build
cmake .. \
-DWITH_CONTRIB=OFF \
-DWITH_MKL=OFF \
-DWITH_MKLDNN=OFF \
-DWITH_TESTING=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DON_INFER=ON \
-DWITH_PYTHON=OFF \
-DWITH_XBYAK=OFF \
-DWITH_NV_JETSON=ON
make -j4
# Generate inference libs
make inference_lib_dist -j4
3. Test with samples
Please refer to samples on https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer.html#id2
**FAQ**
1. Error:
.. code-block:: bash
ERROR: ../aarch64-linux-gpn/crtn.o: Too many open files.
Fix this by increasing the number of files the system can open at the same time to 2048.
.. code-block:: bash
ulimit -n 2048
2. The building process hangs.
Might be downloading third-party libs. Wait or kill the building process and start again.
3. Lacking virtual destructors for IPluginFactory or IGpuAllocator when using TensorRT.
After downloading and installing TensorRT, add virtual destructors for IPluginFactory and IGpuAllocator in NvInfer.h:
.. code-block:: bash
virtual ~IPluginFactory() {};
virtual ~IGpuAllocator() {};
After successful compilation, the dependencies required by the C++ inference library will be stored in the PADDLE_ROOT directory, including: (1) the compiled PaddlePaddle inference library and header files; (2) third-party link libraries and header files; (3) version information and compilation option information.
The directory structure is:
.. code-block:: text
PaddleRoot/
├── CMakeCache.txt
├── paddle
│   ├── include
│   │   ├── paddle_anakin_config.h
│   │   ├── paddle_analysis_config.h
│   │   ├── paddle_api.h
│   │   ├── paddle_inference_api.h
│   │   ├── paddle_mkldnn_quantizer_config.h
│   │   └── paddle_pass_builder.h
│   └── lib
│       ├── libpaddle_fluid.a
│       └── libpaddle_fluid.so
├── third_party
│   ├── boost
│   │   └── boost
│   ├── eigen3
│   │   ├── Eigen
│   │   └── unsupported
│   └── install
│       ├── gflags
│       ├── glog
│       ├── mkldnn
│       ├── mklml
│       ├── protobuf
│       ├── snappy
│       ├── snappystream
│       ├── xxhash
│       └── zlib
└── version.txt
The version information of the inference library is recorded in version.txt, including the Git commit ID, whether OpenBLAS or MKL is used as the math library, and the CUDA/cuDNN versions. For example:
.. code-block:: text
GIT COMMIT ID: cc9028b90ef50a825a722c55e5fda4b7cd26b0d6
WITH_MKL: ON
WITH_MKLDNN: ON
WITH_GPU: ON
CUDA version: 8.0
CUDNN version: v7
# C 预测 API介绍
Fluid提供了高度优化的[C++预测库](./native_infer.html),为了方便使用,我们也提供了封装C++预测库的C接口。使用C接口时,首先需要 `#include "paddle_c_api.h"`,头文件`paddle_c_api.h`可以在Paddle仓库中的`paddle/fluid/inference/capi/paddle_c_api.h`找到,也可以在编译Paddle后的`Paddle/build/`路径下的`build/fluid_inference_c_install_dir/paddle/include/`中找到。此外,使用 C API 还需要在编译项目时链接相应的库`libpaddle_fluid_c.so`。下面是详细的使用说明。
需要说明的是,与 C++ API 不同,C API 为了兼顾多语言封装的需要,将不会再设置默认参数,即使用时,所有的参数都需要用户显式地提供。
## C预测相关数据结构
使用C预测API与C++预测API不完全一样,C预测主要包括`PD_AnalysisConfig`、`PD_DataType`、`PD_Predictor`、`PD_Buffer`和`PD_ZeroCopyTensor`。接下来将会进一步详细地介绍这些数据结构以及使用的方法,并提供相应的示例。
### PD_AnalysisConfig
`PD_AnalysisConfig`是创建预测引擎的配置,提供了模型路径设置、预测引擎运行设备选择以及多种优化预测流程的选项,主要包括以下方法
* `PD_AnalysisConfig* PD_NewAnalysisConfig()`: 新建一个`PD_AnalysisConfig`的指针。
* `void PD_DeleteAnalysisConfig(PD_AnalysisConfig* config)`: 删除一个`PD_AnalysisConfig`的指针。
* `void PD_SetModel(PD_AnalysisConfig* config, const char* model_dir, const char* params_path)`: 设置模型的路径,输入的参数包括`PD_AnalysisConfig`、`model_dir`和`params_path`,其中`model_dir`指的是模型保存位置的路径,一般不用包括文件名,`params_path`为可选参数。<strong>注意</strong>:
- 如果不给定`params_path`,即`params_path`为`NULL`,则认为该模型的参数存储路径与`model_dir`一致,且模型文件和参数文件是按照默认的文件名存储的,此时参数文件可能有多个。此时,需要用户传入参数与模型文件所在的`model_dir`,即<strong>模型和参数保存的路径名</strong>,不需要指定文件名,同时,需要显式地将`params_path`设置为`NULL`。
- 如果提供了`params_path`,为了方便用户的自定义,则在指明`model_dir`路径时需要加上模型文件的文件名传入,即`model_dir`传入对应的<strong>模型文件的路径</strong>,`params_path`传入对应的<strong>模型参数文件的路径</strong>,需要指定文件名。
* `const char* PD_ModelDir(const PD_AnalysisConfig* config)`: 如果未指明`PD_SetModel()`的`params_path`,则可以返回模型文件夹路径。
* `const char* PD_ProgFile(const PD_AnalysisConfig* config)`: 如果指明了`PD_SetModel()`的`params_path`,则可以返回模型文件路径。
* `const char* PD_ParamsFile(const PD_AnalysisConfig* config)`: 如果指明了`PD_SetModel()`的`params_path`,则可以返回参数文件路径。
* `void PD_SwitchSpecifyInputNames(PD_AnalysisConfig* config, bool x)`: 设置为`true`表示模型运算在读取输入的时候,依据名称来确定不同的输入;否则根据输入的顺序。在使用`PD_ZeroCopyTensor`并且是多输入的情况下,建议设置为`true`。
* `void PD_SwitchUseFeedFetchOps(PD_AnalysisConfig* config, bool x)`: 设置是否使用`feed`与`fetch` op。在使用`PD_ZeroCopyTensor`时,必须将该选项设置为`false`。
* `void PD_EnableUseGpu(PD_AnalysisConfig* config, uint64_t memory_pool_init_size_mb, int device_id)`: 设置开启GPU,并且设定GPU显存(单位M)和设备的Device ID。
* `void PD_DisableGpu(PD_AnalysisConfig* config)`: 禁用GPU。
* `int PD_GpuDeviceId(const PD_AnalysisConfig* config)`: 返回使用的GPU设备的ID。
* `void PD_SwitchIrOptim(PD_AnalysisConfig* config, bool x)`: 设置预测是否开启IR优化。
* `void PD_EnableTensorRtEngine(PD_AnalysisConfig* config, int workspace_size, int max_batch_size, int min_subgraph_size, Precision precision, bool use_static, bool use_calib_mode)`: 开启TensorRT。关于参数的解释,详见[使用Paddle-TensorRT库预测](../../performance_improving/inference_improving/paddle_tensorrt_infer.html)
* `void PD_EnableMKLDNN(PD_AnalysisConfig* config)`: 开启MKLDNN。
#### 代码示例
首先,新建一个`PD_AnalysisConfig`的指针。
``` C
PD_AnalysisConfig* config = PD_NewAnalysisConfig();
```
如前文所述,设置模型和参数路径有两种形式:
* 当模型文件夹下存在一个以默认文件名保存的模型文件和多个参数文件时,传入模型文件夹路径,模型文件名默认为`__model__`,不需要指定文件名,同时需要显式地将`params_path`设置为`NULL`。
``` C
const char* model_dir = "./model/";
PD_SetModel(config, model_dir, NULL);
```
* 当模型文件夹下只有一个模型文件和一个参数文件时,传入模型文件和参数文件的路径,需要指定文件名。
``` C
const char* model_path = "./model/model";
const char* params_path = "./params/params";
PD_SetModel(config, model_path, params_path);
```
其他预测引擎配置选项示例如下
``` C
PD_EnableUseGpu(config, 100, 0); // 初始化100M显存,使用的gpu id为0
PD_GpuDeviceId(config); // 返回正在使用的gpu id
PD_DisableGpu(config); // 禁用gpu
PD_SwitchIrOptim(config, true); // 开启IR优化
PD_EnableMKLDNN(config); // 开启MKLDNN
PD_SwitchSpecifyInputNames(config, true);
PD_SwitchUseFeedFetchOps(config, false);
```
### PD_ZeroCopyTensor
`PD_ZeroCopyTensor`是设置数据传入预测运算的数据结构。包括以下成员:
* `data - (PD_Buffer)`: 设置传入数据的值。
* `shape - (PD_Buffer)`: 设置传入数据的形状(shape)。
* `lod - (PD_Buffer)`: 设置数据的`lod`,目前只支持一阶的`lod`
* `dtype - (PD_DataType)`: 设置传入数据的数据类型,用枚举`PD_DataType`表示。
* `name - (char*)`: 设置传入数据的名称。
涉及使用`PD_ZeroCopyTensor`有以下方法:
* `PD_ZeroCopyTensor* PD_NewZeroCopyTensor()`: 新创建一个`PD_ZeroCopyTensor`的指针。
* `void PD_DeleteZeroCopyTensor(PD_ZeroCopyTensor*)`: 删除一个`PD_ZeroCopyTensor`的指针。
* `void PD_InitZeroCopyTensor(PD_ZeroCopyTensor*)`: 使用默认值初始化一个`PD_ZeroCopyTensor`并为其分配内存空间。
* `void PD_DestroyZeroCopyTensor(PD_ZeroCopyTensor*)`: 释放`PD_ZeroCopyTensor`中`data`、`shape`、`lod`这些`PD_Buffer`成员所占用的内存。
### PD_DataType
`PD_DataType`是一个提供给用户的枚举,用于设定存有用户数据的`PD_ZeroCopyTensor`的数据类型。包括以下成员:
* `PD_FLOAT32`: 32位浮点型
* `PD_INT32`: 32位整型
* `PD_INT64`: 64位整型
* `PD_UINT8`: 8位无符号整型
#### 代码示例
首先可以新建一个`PD_ZeroCopyTensor`
``` C
PD_ZeroCopyTensor input;
PD_InitZeroCopyTensor(&input);
```
调用设置`PD_ZeroCopyTensor`的数据类型的方式如下:
``` C
input.dtype = PD_FLOAT32;
```
### PD_Buffer
`PD_Buffer`可以用于设置`PD_ZeroCopyTensor`数据结构中数据的`data`、`shape`与`lod`。包括以下成员:
* `data`: 输入的数据,类型是`void*`,用于存储数据开始的地址。
* `length`: 输入数据的实际的<strong>字节长度</strong>
* `capacity`: 为数据分配的内存大小,必定大于等于`length`
### 示例代码
``` C
PD_ZeroCopyTensor input;
PD_InitZeroCopyTensor(&input);
// 设置输入的名称
input.name = "data";
// 设置输入的数据大小
input.data.capacity = sizeof(float) * 1 * 3 * 300 * 300;
input.data.length = input.data.capacity;
input.data.data = malloc(input.data.capacity);
// 设置数据的输入的形状 shape
int shape[] = {1, 3, 300, 300};
input.shape.data = (int *)shape;
input.shape.capacity = sizeof(shape);
input.shape.length = sizeof(shape);
// 设置输入数据的类型
input.dtype = PD_FLOAT32;
```
### PD_Predictor
`PD_Predictor`是一个高性能预测引擎,该引擎通过对计算图的分析,可以完成对计算图的一系列优化(如OP的融合、内存/显存的优化、MKLDNN、TensorRT 等底层加速库的支持等)。主要包括以下函数:
* `PD_Predictor* PD_NewPredictor(const PD_AnalysisConfig* config)`: 创建一个新的`PD_Predictor`的指针。
* `void PD_DeletePredictor(PD_Predictor* predictor)`: 删除一个`PD_Predictor`的指针。
* `int PD_GetInputNum(const PD_Predictor* predictor)`: 获取模型输入的个数。
* `int PD_GetOutputNum(const PD_Predictor* predictor)`: 获取模型输出的个数。
* `const char* PD_GetInputName(const PD_Predictor* predictor, int n)`: 获取模型第`n`个输入的名称。
* `const char* PD_GetOutputName(const PD_Predictor* predictor, int n)`: 获取模型第`n`个输出的名称。
* `void PD_SetZeroCopyInput(PD_Predictor* predictor, const PD_ZeroCopyTensor* tensor)`: 使用`PD_ZeroCopyTensor`数据结构设置模型输入的具体值、形状、lod等信息。目前只支持一阶lod。
* `void PD_GetZeroCopyOutput(PD_Predictor* predictor, PD_ZeroCopyTensor* tensor)`: 使用`PD_ZeroCopyTensor`数据结构获取模型输出的具体值、形状、lod等信息。目前只支持一阶lod。
* `void PD_ZeroCopyRun(PD_Predictor* predictor)`: 运行预测的引擎,完成模型由输入到输出的计算。
#### 代码示例
如前文所述,当完成网络配置`PD_AnalysisConfig`以及输入`PD_ZeroCopyTensor`的设置之后,只需要简单的几行代码就可以获得模型的输出。
首先完成`PD_AnalysisConfig`的设置,设置的方式与相关的函数如前文所述,这里同样给出了示例。
``` C
PD_AnalysisConfig* config = PD_NewAnalysisConfig();
const char* model_dir = "./model/";
PD_SetModel(config, model_dir, NULL);
PD_DisableGpu(config);
PD_SwitchSpecifyInputNames(config, true); // 使用PD_ZeroCopyTensor并且是多输入建议设置。
PD_SwitchUseFeedFetchOps(config, false); // 使用PD_ZeroCopyTensor一定需要设置为false。
```
其次,完成相应的输入的设置,设置的方式如前文所述,这里同样给出了示例。
``` C
PD_ZeroCopyTensor input;
PD_InitZeroCopyTensor(&input);
// 设置输入的名称
input.name = (char *)(PD_GetInputName(predictor, 0));
// 设置输入的数据大小
input.data.capacity = sizeof(float) * 1 * 3 * 300 * 300;
input.data.length = input.data.capacity;
input.data.data = malloc(input.data.capacity);
// 设置数据的输入的形状(shape)
int shape[] = {1, 3, 300, 300};
input.shape.data = (int *)shape;
input.shape.capacity = sizeof(shape);
input.shape.length = sizeof(shape);
// 设置输入数据的类型
input.dtype = PD_FLOAT32;
```
最后,执行预测引擎,完成计算的步骤。
``` C
PD_Predictor *predictor = PD_NewPredictor(config);
int input_num = PD_GetInputNum(predictor);
printf("Input num: %d\n", input_num);
int output_num = PD_GetOutputNum(predictor);
printf("Output num: %d\n", output_num);
PD_SetZeroCopyInput(predictor, &input); // 这里只有一个输入,根据多输入情况,可以传入一个数组
PD_ZeroCopyRun(predictor); // 执行预测引擎
PD_ZeroCopyTensor output;
PD_InitZeroCopyTensor(&output);
output.name = (char *)(PD_GetOutputName(predictor, 0));
PD_GetZeroCopyOutput(predictor, &output);
```
最后,可以根据前文所述的`PD_ZeroCopyTensor`的数据结构,获得返回的数据的值等信息。
## 完整使用示例
下面是使用Fluid C API进行预测的一个完整示例(使用resnet50模型)。
下载[resnet50模型](http://paddle-inference-dist.bj.bcebos.com/resnet50_model.tar.gz)并解压,运行如下代码将会调用预测引擎。
``` C
#include "paddle_c_api.h"
#include <memory.h>
#include <malloc.h>
/*
* The main procedures to run a predictor according to c-api:
* 1. Create config to set how to process the inference.
* 2. Prepare the input PD_ZeroCopyTensor for the inference.
* 3. Set PD_Predictor.
* 4. Call PD_ZeroCopyRun() to start.
* 5. Obtain the output.
* 6. According to the size of the PD_PaddleBuf's data's size, print all the output data.
*/
int main() {
// 配置 PD_AnalysisConfig
PD_AnalysisConfig* config = PD_NewAnalysisConfig();
PD_DisableGpu(config);
const char* model_path = "./model/model";
const char* params_path = "./model/params";
PD_SetModel(config, model_path, params_path);
PD_SwitchSpecifyInputNames(config, true);
PD_SwitchUseFeedFetchOps(config, false);
// 新建一个 PD_Predictor 的指针
PD_Predictor *predictor = PD_NewPredictor(config);
// 获取输入输出的个数
int input_num = PD_GetInputNum(predictor);
printf("Input num: %d\n", input_num);
int output_num = PD_GetOutputNum(predictor);
printf("Output num: %d\n", output_num);
// 设置输入的数据结构
PD_ZeroCopyTensor input;
PD_InitZeroCopyTensor(&input);
// 设置输入的名称
input.name = (char *)(PD_GetInputName(predictor, 0));
// 设置输入的数据大小
input.data.capacity = sizeof(float) * 1 * 3 * 318 * 318;
input.data.length = input.data.capacity;
input.data.data = malloc(input.data.capacity);
memset(input.data.data, 0, (sizeof(float) * 3 * 318 * 318));
// 设置数据的输入的形状(shape)
int shape[] = {1, 3, 318, 318};
input.shape.data = (int *)shape;
input.shape.capacity = sizeof(shape);
input.shape.length = sizeof(shape);
// 设置输入数据的类型
input.dtype = PD_FLOAT32;
PD_SetZeroCopyInput(predictor, &input);
// 执行预测引擎
PD_ZeroCopyRun(predictor);
// 获取预测输出
PD_ZeroCopyTensor output;
PD_InitZeroCopyTensor(&output);
output.name = (char *)(PD_GetOutputName(predictor, 0));
// 获取 output 之后,可以通过该数据结构,读取到 data, shape 等信息
PD_GetZeroCopyOutput(predictor, &output);
float* result = (float *)(output.data.data);
int result_length = output.data.length / sizeof(float);
return 0;
}
```
运行以上代码,需要将 paddle_c_api.h 拷贝到指定位置,确保编译时可以找到这个头文件。同时,需要将 libpaddle_fluid_c.so 的路径加入环境变量。
最后可以使用 gcc 命令编译。
``` shell
gcc ${SOURCE_NAME} \
-lpaddle_fluid_c
```
############
服务器端部署
############
PaddlePaddle 提供了C++,C和Python的API来支持模型的部署上线。
.. toctree::
:titlesonly:
build_and_install_lib_cn.rst
windows_cpp_inference.md
native_infer.md
c_infer_cn.md
python_infer_cn.md
######################
Server-side Deployment
######################
PaddlePaddle provides various methods to support deployment and release of trained models.
.. toctree::
:titlesonly:
build_and_install_lib_en.rst
windows_cpp_inference_en.md
native_infer_en.md
paddle_gpu_benchmark_en.md
# C++ 预测 API介绍
为了更简单方便地预测部署,PaddlePaddle 提供了一套高层 C++ API 预测接口。下面是详细介绍。
如果您在使用2.0之前的Paddle,请参考[旧版API](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.8/advanced_guide/inference_deployment/inference/native_infer.html)文档,升级到新版API请参考[推理升级指南](#推理升级指南)
## 内容
- [使用Predictor进行高性能预测](#使用Predictor进行高性能预测)
- [使用Config管理预测配置](#使用Config管理预测配置)
- [使用Tensor管理输入/输出](#使用Tensor管理输入/输出)
- [使用PredictorPool在多线程下进行预测](#使用PredictorPool在多线程下进行预测)
- [C++预测样例编译测试](#C++预测样例编译测试)
- [性能调优](#性能调优)
- [推理升级指南](#推理升级指南)
- [C++ API](#C++_API)
## <a name="使用Predictor进行高性能预测"> 使用Predictor进行高性能预测</a>
Paddle Inference采用 Predictor 进行预测。Predictor 是一个高性能预测引擎,该引擎通过对计算图的分析,完成对计算图的一系列的优化(如OP的融合、内存/显存的优化、 MKLDNN,TensorRT 等底层加速库的支持等),能够大大提升预测性能。
为了展示完整的预测流程,下面是一个使用 Predictor 进行预测的完整示例,其中涉及到的具体概念和配置会在后续部分展开详细介绍。
#### Predictor 预测示例
``` c++
#include "paddle_inference_api.h"
namespace paddle_infer {
void CreateConfig(Config* config, const std::string& model_dirname) {
// 模型从磁盘进行加载
config->SetModel(model_dirname + "/model",
model_dirname + "/params");
// config->SetModel(model_dirname);
// 如果模型从内存中加载,可以使用SetModelBuffer接口
// config->SetModelBuffer(prog_buffer, prog_size, params_buffer, params_size);
config->EnableUseGpu(100 /*设定GPU初始显存池为100MB*/, 0 /*设定GPU ID为0*/); //开启GPU预测
/* for cpu
config->DisableGpu();
config->EnableMKLDNN(); // 开启MKLDNN加速
config->SetCpuMathLibraryNumThreads(10);
*/
config->SwitchIrDebug(true); // 可视化调试选项,若开启,则会在每个图优化过程后生成dot文件
// config->SwitchIrOptim(false); // 默认为true。如果设置为false,关闭所有优化
// config->EnableMemoryOptim(); // 开启内存/显存复用
}
void RunAnalysis(int batch_size, std::string model_dirname) {
// 1. 创建AnalysisConfig
Config config;
CreateConfig(&config, model_dirname);
// 2. 根据config 创建predictor,并准备输入数据,此处以全0数据为例
auto predictor = CreatePredictor(config);
int channels = 3;
int height = 224;
int width = 224;
std::vector<float> input(batch_size * channels * height * width, 0.f);  // 输入数据,此处以全0为例
// 3. 创建输入
// 使用了ZeroCopy接口,可以避免预测中多余的CPU copy,提升预测性能
auto input_names = predictor->GetInputNames();
auto input_t = predictor->GetInputHandle(input_names[0]);
input_t->Reshape({batch_size, channels, height, width});
input_t->CopyFromCpu(input.data());
// 4. 运行预测引擎
CHECK(predictor->Run());
// 5. 获取输出
std::vector<float> out_data;
auto output_names = predictor->GetOutputNames();
auto output_t = predictor->GetOutputHandle(output_names[0]);
std::vector<int> output_shape = output_t->shape();
int out_num = std::accumulate(output_shape.begin(), output_shape.end(), 1, std::multiplies<int>());
out_data.resize(out_num);
output_t->CopyToCpu(out_data.data());
}
} // namespace paddle_infer
int main() {
// 模型下载地址 http://paddle-inference-dist.cdn.bcebos.com/tensorrt_test/mobilenet.tar.gz
paddle_infer::RunAnalysis(1, "./mobilenet");
return 0;
}
```
## <a name="使用Config管理预测配置"> 使用Config管理预测配置</a>
Config管理Predictor的预测配置,提供了模型路径设置、预测引擎运行设备选择以及多种优化预测流程的选项。配置方法如下:
#### 通用优化配置
``` c++
config->SwitchIrOptim(true); // 开启计算图分析优化,包括OP融合等
config->EnableMemoryOptim(); // 开启内存/显存复用
```
#### 设置模型和参数路径
从磁盘加载模型时,根据模型和参数文件存储方式不同,设置Config加载模型和参数的路径有两种形式:
* 非combined形式:模型文件夹`model_dir`下存在一个模型文件和多个参数文件时,传入模型文件夹路径,模型文件名默认为`__model__`
``` c++
config->SetModel("./model_dir");
```
* combined形式:模型文件夹`model_dir`下只有一个模型文件`model`和一个参数文件`params`时,传入模型文件和参数文件路径。
``` c++
config->SetModel("./model_dir/model", "./model_dir/params");
```
#### 配置CPU预测
``` c++
config->DisableGpu(); // 禁用GPU
config->EnableMKLDNN(); // 开启MKLDNN,可加速CPU预测
config->SetCpuMathLibraryNumThreads(10); // 设置CPU Math库线程数,CPU核心数支持情况下可加速预测
```
#### 配置GPU预测
``` c++
config->EnableUseGpu(100, 0); // 初始化100M显存,使用GPU ID为0
config->GpuDeviceId(); // 返回正在使用的GPU ID
// 开启TensorRT预测,可提升GPU预测性能,需要使用带TensorRT的预测库
config->EnableTensorRtEngine(1 << 20 /*workspace_size*/,
batch_size /*max_batch_size*/,
3 /*min_subgraph_size*/,
AnalysisConfig::Precision::kFloat32 /*precision*/,
false /*use_static*/,
false /*use_calib_mode*/);
```
## <a name="使用Tensor管理输入/输出"> 使用Tensor管理输入/输出</a>
Tensor是Predictor的输入/输出数据结构。
``` c++
// 通过创建的Predictor获取输入和输出的tensor
auto input_names = predictor->GetInputNames();
auto input_t = predictor->GetInputHandle(input_names[0]);
auto output_names = predictor->GetOutputNames();
auto output_t = predictor->GetOutputHandle(output_names[0]);
// 对tensor进行reshape
input_t->Reshape({batch_size, channels, height, width});
// 通过CopyFromCpu接口,将cpu数据输入;通过CopyToCpu接口,将输出数据copy到cpu
input_t->CopyFromCpu<float>(input_data /*数据指针*/);
output_t->CopyToCpu(out_data /*数据指针*/);
// 设置LOD
std::vector<std::vector<size_t>> lod_data = {{0}, {0}};
input_t->SetLoD(lod_data);
// 获取Tensor数据指针
float *input_d = input_t->mutable_data<float>(PaddlePlace::kGPU); // CPU下使用PaddlePlace::kCPU
int output_size;
float *output_d = output_t->data<float>(PaddlePlace::kGPU, &output_size);
```
## <a name="使用PredictorPool在多线程下进行预测"> 使用PredictorPool在多线程下进行预测</a>
`PredictorPool`对`Predictor`进行管理。`PredictorPool`对`Predictor`进行了简单的封装,通过传入config和thread的数目来完成初始化,在每个线程中,根据自己的线程id直接从池中取出对应的`Predictor`来完成预测过程。
```c++
// 服务初始化时,完成PredictorPool的初始化
PredictorPool pool(config, thread_num);
// 根据线程id来获取Predictor
auto predictor = pool.Retrive(thread_id);
// 使用Predictor进行预测
...
```
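下面给出一段仅供参考的多线程使用示意(其中模型路径、线程数与输入形状均为假设值),展示如何在每个线程中通过 `Retrive` 取出对应的 `Predictor` 并完成一次预测:
```c++
#include <thread>
#include <vector>
#include "paddle_inference_api.h"

void Worker(paddle_infer::PredictorPool* pool, int thread_id) {
  // 按线程id从池中取出对应的Predictor
  auto* predictor = pool->Retrive(thread_id);
  // 构造全0输入,形状 {1, 3, 224, 224} 仅为假设值
  std::vector<float> input(1 * 3 * 224 * 224, 0.f);
  auto input_names = predictor->GetInputNames();
  auto input_t = predictor->GetInputHandle(input_names[0]);
  input_t->Reshape({1, 3, 224, 224});
  input_t->CopyFromCpu(input.data());
  predictor->Run();
}

int main() {
  paddle_infer::Config config;
  config.SetModel("./model_dir");  // 假设的模型路径
  const int thread_num = 4;
  paddle_infer::PredictorPool pool(config, thread_num);
  std::vector<std::thread> threads;
  for (int i = 0; i < thread_num; ++i) {
    threads.emplace_back(Worker, &pool, i);
  }
  for (auto& t : threads) {
    t.join();
  }
  return 0;
}
```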
## <a name="C++预测样例编译测试"> C++预测样例编译测试</a>
1. 下载或编译paddle预测库,参考[安装与编译C++预测库](./build_and_install_lib_cn.html)
2. 下载[预测样例](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz)并解压,进入`sample/inference`目录下。
`inference` 文件夹目录结构如下:
``` shell
inference
├── CMakeLists.txt
├── mobilenet_test.cc
├── thread_mobilenet_test.cc
├── mobilenetv1
│ ├── model
│ └── params
├── run.sh
└── run_impl.sh
```
- `mobilenet_test.cc` 为单线程预测的C++源文件
- `thread_mobilenet_test.cc` 为多线程预测的C++源文件
- `mobilenetv1` 为模型文件夹
- `run.sh` 为预测运行脚本文件
3. 配置编译与运行脚本
编译运行预测样例之前,需要根据运行环境配置编译与运行脚本`run.sh`。`run.sh`的选项与路径配置的部分如下:
``` shell
# 设置是否开启MKL、GPU、TensorRT,如果要使用TensorRT,必须打开GPU
WITH_MKL=ON
WITH_GPU=OFF
USE_TENSORRT=OFF
# 按照运行环境设置预测库路径、CUDA库路径、CUDNN库路径、TensorRT路径、模型路径
LIB_DIR=YOUR_LIB_DIR
CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR
CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR
TENSORRT_ROOT_DIR=YOUR_TENSORRT_ROOT_DIR
MODEL_DIR=YOUR_MODEL_DIR
```
按照实际运行环境配置`run.sh`中的选项开关和所需lib路径。
4. 编译与运行样例
``` shell
sh run.sh
```
## <a name="性能调优"> 性能调优</a>
### CPU下预测
1. 在CPU型号允许的情况下,尽量使用带AVX和MKL的版本。
2. 可以尝试使用Intel的 MKLDNN 加速。
3. 在CPU可用核心数足够时,可以将设置`config->SetCpuMathLibraryNumThreads(num);`中的num值调高一些,完整的配置示意见下。
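下面是一段仅供参考的CPU预测配置示意(模型路径与线程数均为假设值,线程数需结合机器实际可用核心数设置):
```c++
paddle_infer::Config config;
config.SetModel("./model_dir");          // 假设的模型路径
config.DisableGpu();                     // 使用CPU预测
config.EnableMKLDNN();                   // 尝试开启MKLDNN加速
config.SetCpuMathLibraryNumThreads(10);  // 线程数10为假设值,按可用核心数调整
```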
### GPU下预测
1. 可以尝试打开 TensorRT 子图加速引擎, 通过计算图分析,Paddle可以自动将计算图中部分子图融合,并调用NVIDIA的 TensorRT 来进行加速,详细内容可以参考 [使用Paddle-TensorRT库预测](../../performance_improving/inference_improving/paddle_tensorrt_infer.html)
### 多线程预测
Paddle Inference支持通过在不同线程运行多个Predictor的方式来优化预测性能,支持CPU和GPU环境。
使用多线程预测的样例详见[C++预测样例编译测试](#C++预测样例编译测试)中下载的[预测样例](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz)中的
`thread_mobilenet_test.cc`文件。可以将`run.sh`中的`mobilenet_test`替换成`thread_mobilenet_test`再执行
```
sh run.sh
```
即可运行多线程预测样例。
## <a name="推理升级指南"> 推理升级指南</a>
2.0对API做了整理,简化了写法,以及去掉了历史上冗余的概念。
新的 API 为纯增,原有 API 保持不变,在后续版本会逐步删除。
重要变化:
- 命名空间从 `paddle` 变更为 `paddle_infer`
- `PaddleTensor`, `PaddleBuf` 等被废弃,`ZeroCopyTensor` 变为默认 Tensor 类型,并更名为 `Tensor`
- 新增 `PredictorPool` 工具类简化多线程 predictor 的创建,后续也会增加更多周边工具
- `CreatePredictor` (原 `CreatePaddlePredictor`) 的返回值由 `unique_ptr` 变为 `shared_ptr` 以避免 Clone 后析构顺序出错的问题
API 变更
| 原有命名 | 现有命名 | 行为变化 |
| ---------------------------- | ---------------------------- | ----------------------------- |
| 头文件 `paddle_infer.h` | 无变化 | 包含旧接口,保持向后兼容 |
| 无 | `paddle_inference_api.h` | 新API,可以与旧接口并存 |
| `CreatePaddlePredictor` | `CreatePredictor` | 返回值变为 shared_ptr |
| `ZeroCopyTensor` | `Tensor` | 无 |
| `AnalysisConfig` | `Config` | 无 |
| `TensorRTConfig` | 废弃 | |
| `PaddleTensor` + `PaddleBuf` | 废弃 | |
| `Predictor::GetInputTensor` | `Predictor::GetInputHandle` | 无 |
| `Predictor::GetOutputTensor` | `Predictor::GetOutputHandle` | 无 |
| | `PredictorPool` | 简化创建多个 predictor 的支持 |
使用新 C++ API 的流程与之前完全一致,只有命名变化
```c++
#include "paddle_infernce_api.h"
using namespace paddle_infer;
Config config;
config.SetModel("xxx_model_dir");
auto predictor = CreatePredictor(config);
// Get the handles for the inputs and outputs of the model
auto input0 = predictor->GetInputHandle("X");
auto output0 = predictor->GetOutputHandle("Out");
for (...) {
// Assign data to input0
MyServiceSetData(input0);
predictor->Run();
// get data from the output0 handle
MyServiceGetData(output0);
}
```
## <a name="C++_API"> C++ API</a>
##### CreatePredictor
```c++
std::shared_ptr<Predictor> CreatePredictor(const Config& config);
```
`CreatePredictor`用来根据`Config`构建预测引擎。
示例:
```c++
// 设置Config
Config config;
config.SetModel(FLAGS_model_dir);
// 根据Config创建Predictor
std::shared_ptr<Predictor> predictor = CreatePredictor(config);
```
参数:
- `config(Config)` - 用于构建Predictor的配置信息
返回:`Predictor`智能指针
返回类型:`std::shared_ptr<Predictor>`
##### GetVersion()
```c++
std::string GetVersion();
```
打印Paddle Inference的版本信息。
参数:
- `None`
返回:版本信息
返回类型:`std::string`
##### PlaceType
```c++
enum class PaddlePlace { kUNK };
using PlaceType = paddle::PaddlePlace;
```
PlaceType为目标设备硬件类型,用户可以根据应用场景选择硬件平台类型。
枚举变量`PlaceType`的所有可能取值包括:
`{kUNK, kCPU, kGPU}`
##### PrecisionType
```c++
enum class Precision { kFloat32 };
using PrecisionType = paddle::AnalysisConfig::Precision;
```
`PrecisionType`设置模型的运行精度,默认值为kFloat32(float32)。
枚举变量`PrecisionType`的所有可能取值包括:
`{kFloat32, kInt8, kHalf}`
##### DataType
```c++
enum class PaddleDType { FLOAT32 };
using DataType = paddle::PaddleDType;
```
`DataType`为模型中Tensor的数据精度,默认值为FLOAT32(float32)。
枚举变量`DataType`的所有可能取值包括:
`{FLOAT32, INT64, INT32, UINT8}`
##### GetNumBytesOfDataType
```c++
int GetNumBytesOfDataType(DataType dtype);
```
获取各个`DataType`对应的字节数。
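示例(仅为示意,假设已按前文创建了 `predictor` 并获取了 `output_names`,用于估算输出缓冲区所需的字节数):
```c++
auto output_t = predictor->GetOutputHandle(output_names[0]);
std::vector<int> output_shape = output_t->shape();
// 元素个数 = 各维度大小的乘积
int numel = std::accumulate(output_shape.begin(), output_shape.end(), 1,
                            std::multiplies<int>());
// 缓冲区字节数 = 元素个数 * 单个元素的字节数
int num_bytes = numel * GetNumBytesOfDataType(output_t->type());
```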
参数:
- `dtype` - DataType枚举
返回:字节数
返回类型:`int`
##### Predictor
```c++
class Predictor;
```
`Predictor`是Paddle Inference的预测器,由`CreatePredictor`根据`Config`进行创建。用户可以根据Predictor提供的接口设置输入数据、执行模型预测、获取输出等.
示例:
```c++
using namespace paddle_infer;
Config config;
config.SetModel("xxx_model_dir");
auto predictor = CreatePredictor(config);
// Get the handles for the inputs and outputs of the model
auto input0 = predictor->GetInputHandle("X");
auto output0 = predictor->GetOutputHandle("Out");
for (...) {
// Assign data to input0
MyServiceSetData(input0);
predictor->Run();
// get data from the output0 handle
MyServiceGetData(output0);
}
```
###### GetInputNames()
获取所有输入Tensor的名称。
参数:
- `None`
返回:所有输入Tensor的名称
返回类型:`std::vector<std::string>`
###### GetOutputNames()
获取所有输出Tensor的名称。
参数:
- `None`
返回:所有输出Tensor的名称
返回类型:`std::vector<std::string>`
###### GetInputHandle(const std::string& name)
根据名称获取输入Tensor的句柄。
参数:
- `name` - Tensor的名称
返回:指向`Tensor`的指针
返回类型:`std::unique_ptr<Tensor>`
###### GetOutputHandle(const std::string& name)
根据名称获取输出Tensor的句柄。
参数:
- `name` - Tensor的名称
返回:指向`Tensor`的指针
返回类型:`std::unique_ptr<Tensor>`
###### Run()
执行模型预测,需要在***设置输入数据后***调用。
参数:
- `None`
返回:`None`
返回类型:`void`
###### ClearIntermediateTensor()
释放中间tensor。
参数:
- `None`
返回:`None`
返回类型:`void`
###### Clone()
根据该Predictor,克隆一个新的Predictor,两个Predictor之间共享权重。
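示例(仅为示意,假设已按前文创建了 `predictor`):
```c++
// 克隆出一个与原Predictor共享权重的新Predictor,可在另一个线程中使用
auto cloned_predictor = predictor->Clone();
auto input_names = cloned_predictor->GetInputNames();
```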
参数:
- `None`
返回:新的Predictor
返回类型:`std::unique_ptr<Predictor>`
##### Tensor
```c++
class Tensor;
```
Tensor是Paddle Inference的数据组织形式,用于对底层数据进行封装并提供接口对数据进行操作,包括设置Shape、数据、LoD信息等。
*注意:用户应使用`Predictor`的`GetInputHandle`和`GetOutputHandle`接口获取输入/输出的`Tensor`。*
示例:
```c++
// 通过创建的Predictor获取输入和输出的tensor
auto input_names = predictor->GetInputNames();
auto input_t = predictor->GetInputHandle(input_names[0]);
auto output_names = predictor->GetOutputNames();
auto output_t = predictor->GetOutputHandle(output_names[0]);
// 对tensor进行reshape
input_t->Reshape({batch_size, channels, height, width});
// 通过CopyFromCpu接口,将cpu数据输入;通过CopyToCpu接口,将输出数据copy到cpu
input_t->CopyFromCpu<float>(input_data /*数据指针*/);
output_t->CopyToCpu(out_data /*数据指针*/);
// 设置LOD
std::vector<std::vector<size_t>> lod_data = {{0}, {0}};
input_t->SetLoD(lod_data);
// 获取Tensor数据指针
float *input_d = input_t->mutable_data<float>(PlaceType::kGPU); // CPU下使用PlaceType::kCPU
int output_size;
float *output_d = output_t->data<float>(PlaceType::kGPU, &output_size);
```
###### Reshape(shape)
设置Tensor的维度信息。
参数:
- `shape(const std::vector<int>&)` - 维度信息
返回:`None`
返回类型:`void`
###### shape()
获取Tensor的维度信息。
参数:
- `None`
返回:Tensor的维度信息
返回类型:`std::vector<int>`
###### CopyFromCpu(data)
```c++
template <typename T>
void CopyFromCpu(const T* data);
```
从cpu获取数据,设置到tensor内部。
示例:
```c++
// float* data = ...;
auto in_tensor = predictor->GetInputHandle("in_name");
in_tensor->CopyFromCpu(data);
```
参数:
- `data(const T*)` - cpu数据指针
返回:`None`
返回类型:`void`
###### CopyToCpu(data)
```c++
template <typename T>
void CopyToCpu(T* data);
```
示例:
```c++
std::vector<float> data(100);
auto out_tensor = predictor->GetOutputHandle("out_name");
out_tensor->CopyToCpu(data.data());
```
参数:
- `data(T*)` - cpu数据指针
返回:`None`
返回类型:`void`
###### data<T>(place, size)
```c++
template <typename T>
T* data(PlaceType* place, int* size) const;
```
获取Tensor的底层数据的常量指针,用于读取Tensor数据。
示例:
```c++
PlaceType place;
int size;
auto out_tensor = predictor->GetOutputHandle("out_name");
float* data = out_tensor->data<float>(&place, &size);
```
参数:
- `place(PlaceType*)` - 获取tensor的PlaceType
- `size(int*)` - 获取tensor的size
返回:数据指针
返回类型:`T*`
###### mutable_data<T>(place)
```c++
template <typename T>
T* mutable_data(PlaceType place);
```
获取Tensor的底层数据的指针,用于设置Tensor数据。
```c++
auto in_tensor = predictor->GetInputHandle("in_name");
float* data = in_tensor->mutable_data<float>(PlaceType::kCPU);
data[0] = 1.;
```
参数:
- `place(PlaceType)` - 设备信息
返回:`Tensor`底层数据指针
返回类型:`T*`
###### SetLoD(lod)
设置Tensor的LoD信息。
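示例(仅为示意:假设一个batch中包含两条长度分别为3和5的序列,使用一级LoD的累积偏移形式表示):
```c++
// 两条序列分别占据区间 [0, 3) 和 [3, 8)
std::vector<std::vector<size_t>> lod = {{0, 3, 8}};
input_t->SetLoD(lod);
```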
参数:
- `lod(const std::vector<std::vector<size_t>>)` - Tensor的LoD信息
返回:`None`
返回类型:`void`
###### lod()
获取Tensor的LoD信息
参数:
- `None`
返回:`Tensor`的LoD信息
返回类型:`std::vector<std::vector<size_t>>`
###### type()
tensor的DataType信息。
参数:
- `None`
返回:`Tensor`的DataType信息
返回类型:`DataType`
###### name()
tensor对应的name。
参数:
- `None`
返回:`Tensor`对应的name
返回类型:`std::string`
##### Config
```c++
class Config;
```
`Config`用来配置构建`Predictor`的配置信息,如模型路径、是否开启gpu等等。
示例:
```c++
Config config;
config.SetModel(FLAGS_model_dir);
config.DisableGpu();
config.SwitchIrOptim(false); // 默认为true。如果设置为false,关闭所有优化
config.EnableMemoryOptim(); // 开启内存/显存复用
```
###### SetModel(const std::string& model_dir)
设置模型文件路径,当需要从磁盘加载非combine模式时使用。
参数:
- `model_dir` - 模型文件夹路径
返回:`None`
返回类型:`void`
###### model_dir()
获取模型文件夹路径。
参数:
- `None`
返回:模型文件夹路径
返回类型:`string`
###### SetModel(const std::string& prog, const std::string& params)
设置模型文件路径,当需要从磁盘加载combine模式时使用。
参数:
- `prog` - 模型文件路径
- `params` - 模型参数文件路径
返回:`None`
返回类型:`void`
###### SetProgFile(const std::string& prog)
设置模型文件路径。
参数:
- `prog` - 模型文件路径
返回:`None`
返回类型:`void`
###### prog_file()
获取模型文件路径。
参数:
- `None`
返回:模型文件路径
返回类型:`string`
###### SetParamsFile(const std::string& params)
设置模型参数文件路径。
参数:
- `params` - 模型文件路径
返回:`None`
返回类型:`void`
###### params_file()
获取模型参数文件路径。
参数:
- `None`
返回:模型参数文件路径
返回类型:`string`
###### SetModelBuffer(const char* prog_buffer, size_t prog_buffer_size, const char* params_buffer, size_t params_buffer_size)
从内存加载模型。
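示例(仅为示意,其中 `ReadFileToString` 为假设的辅助函数,表示用户自行将模型文件与参数文件读入内存):
```c++
// 假设已将combined形式的模型结构与参数读入内存字符串
std::string prog_data = ReadFileToString("./model_dir/model");     // 假设的辅助函数
std::string params_data = ReadFileToString("./model_dir/params");  // 假设的辅助函数
paddle_infer::Config config;
config.SetModelBuffer(prog_data.data(), prog_data.size(),
                      params_data.data(), params_data.size());
```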
参数:
- `prog_buffer` - 内存中模型结构数据
- `prog_buffer_size` - 内存中模型结构数据的大小
- `params_buffer` - 内存中模型参数数据
- `params_buffer_size` - 内存中模型参数数据的大小
返回:`None`
返回类型:`void`
###### model_from_memory()
判断是否从内存中加载模型。
参数:
- `None`
返回:是否从内存中加载模型
返回类型:`bool`
###### SetOptimCacheDir(const std::string& opt_cache_dir)
设置缓存路径。
参数:
- `opt_cache_dir` - 缓存路径
返回:`None`
返回类型:`void`
###### DisableFCPadding()
关闭fc padding。
参数:
- `None`
返回:`None`
返回类型:`void`
###### use_fc_padding()
判断是否启用fc padding。
参数:
- `None`
返回:是否启用fc padding
返回类型:`bool`
###### EnableUseGpu(uint64_t memory_pool_init_size_mb, int device_id = 0)
启用gpu。
参数:
- `memory_pool_init_size_mb` - 初始化分配的gpu显存,以MB为单位
- `device_id` - 设备id
返回:`None`
返回类型:`void`
###### DisableGpu()
禁用gpu。
参数:
- `None`
返回:`None`
返回类型:`void`
###### use_gpu()
是否启用gpu。
参数:
- `None`
返回:是否启用gpu
返回类型:`bool`
###### gpu_device_id()
获取gpu的device id。
参数:
- `None`
返回:gpu的device id
返回类型:`int`
###### memory_pool_init_size_mb()
获取gpu的初始显存大小。
参数:
- `None`
返回:初始的显存大小
返回类型:`int`
###### fraction_of_gpu_memory_for_pool()
初始化显存占总显存的百分比
参数:
- `None`
返回:初始的显存占总显存的百分比
返回类型:`float`
###### EnableCUDNN()
启用cudnn。
参数:
- `None`
返回:`None`
返回类型:`void`
###### cudnn_enabled()
是否启用cudnn。
参数:
- `None`
返回:是否启用cudnn
返回类型:`bool`
###### EnableXpu(int l3_workspace_size)
启用xpu。
参数:
- `l3_workspace_size` - l3 cache分配的显存大小
返回:`None`
返回类型:`void`
###### SwitchIrOptim(int x=true)
设置是否开启ir优化。
参数:
- `x` - 是否开启ir优化,默认打开
返回:`None`
返回类型:`void`
###### ir_optim()
是否开启ir优化。
参数:
- `None`
返回:是否开启ir优化
返回类型:`bool`
###### SwitchUseFeedFetchOps(int x = true)
设置是否使用feed,fetch op,仅内部使用。
参数:
- `x` - 是否使用feed, fetch op
返回:`None`
返回类型:`void`
###### use_feed_fetch_ops_enabled()
是否使用feed,fetch op。
参数:
- `None`
返回:是否使用feed,fetch op
返回类型:`bool`
###### SwitchSpecifyInputNames(bool x = true)
设置是否需要指定输入tensor的name。
参数:
- `x` - 是否指定输入tensor的name
返回:`None`
返回类型:`void`
###### specify_input_name()
是否需要指定输入tensor的name。
参数:
- `None`
返回:是否需要指定输入tensor的name
返回类型:`bool`
###### EnableTensorRtEngine(int workspace_size = 1 << 20, int max_batch_size = 1, int min_subgraph_size = 3, Precision precision = Precision::kFloat32, bool use_static = false, bool use_calib_mode = true)
设置是否启用TensorRT。
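示例(参数取值仅为示意,与前文GPU预测配置小节中的用法一致;使用TensorRT前需要先开启GPU,精度枚举按上文`PrecisionType`的定义书写):
```c++
config.EnableUseGpu(100, 0);  // 初始化100M显存,使用GPU ID 0
config.EnableTensorRtEngine(1 << 20 /*workspace_size*/,
                            1 /*max_batch_size*/,
                            3 /*min_subgraph_size*/,
                            PrecisionType::kFloat32 /*precision*/,
                            false /*use_static*/,
                            false /*use_calib_mode*/);
```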
参数:
- `workspace_size` - 指定TensorRT使用的工作空间大小
- `max_batch_size` - 设置最大的batch大小,运行时batch大小不得超过此限定值
- `min_subgraph_size` - Paddle-TRT是以子图的形式运行,为了避免性能损失,当子图内部节点个数大于min_subgraph_size的时候,才会使用Paddle-TRT运行
- `precision` - 指定使用TRT的精度,支持FP32(kFloat32),FP16(kHalf),Int8(kInt8)
- `use_static` - 如果指定为true,在初次运行程序的时候会将TRT的优化信息进行序列化到磁盘上,下次运行时直接加载优化的序列化信息而不需要重新生成
- `use_calib_mode` - 若要运行Paddle-TRT int8离线量化校准,需要将此选项设置为true
返回:`None`
返回类型:`void`
###### tensorrt_engine_enabled()
是否启用tensorRT。
参数:
- `None`
返回:是否启用tensorRT
返回类型:`bool`
###### SetTRTDynamicShapeInfo(std::map<std::string, std::vector<int>> min_input_shape, std::map<std::string, std::vector<int>> max_input_shape, std::map<std::string, std::vector<int>> optim_input_shape, bool disable_trt_plugin_fp16 = false)
设置tensorRT的动态shape。
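示例(仅为示意:假设模型输入名为 `image`,其batch维与空间尺寸为动态维度,具体取值需按模型实际情况设置):
```c++
std::map<std::string, std::vector<int>> min_input_shape = {{"image", {1, 3, 112, 112}}};
std::map<std::string, std::vector<int>> max_input_shape = {{"image", {8, 3, 448, 448}}};
std::map<std::string, std::vector<int>> opt_input_shape = {{"image", {1, 3, 224, 224}}};
// disable_trt_plugin_fp16 使用默认值false
config.SetTRTDynamicShapeInfo(min_input_shape, max_input_shape, opt_input_shape);
```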
参数:
- `min_input_shape` - tensorRT子图支持动态shape的最小shape
- `max_input_shape` - tensorRT子图支持动态shape的最大shape
- `optim_input_shape` - tensorRT子图支持动态shape的最优shape
- `disable_trt_plugin_fp16` - 设置tensorRT的plugin不在fp16精度下运行
返回:`None`
返回类型:`void`
###### EnableLiteEngine(AnalysisConfig::Precision precision_mode = Precision::kFloat32, bool zero_copy = false, const std::vector<std::string>& passes_filter = {}, const std::vector<std::string>& ops_filter = {})
启用lite子图。
参数:
- `precision_mode` - lite子图的运行精度
- `zero_copy` - 启用zero_copy,lite子图与paddle inference之间共享数据
- `passes_filter` - 设置lite子图的pass
- `ops_filter` - 设置不使用lite子图运行的op
返回:`None`
返回类型:`void`
###### lite_engine_enabled()
是否启用lite子图。
参数:
- `None`
返回:是否启用lite子图
返回类型:`bool`
###### SwitchIrDebug(int x = true)
设置是否在图分析阶段打印ir,启用后会在每一个pass后生成dot文件。
参数:
- `x` - 是否打印ir
返回:`None`
返回类型:`void`
###### EnableMKLDNN()
启用mkldnn。
参数:
- `None`
返回:`None`
返回类型:`void`
###### SetMkldnnCacheCapacity(int capacity)
设置mkldnn针对不同输入shape的cache容量大小,MKLDNN cache设计文档请参考[链接](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/mkldnn/caching/caching.md)
参数:
- `capacity` - cache容量大小
返回:`None`
返回类型:`void`
###### mkldnn_enabled()
是否启用mkldnn。
参数:
- `None`
返回:是否启用mkldnn
返回类型:`bool`
###### SetMKLDNNOp(std::unordered_set<std::string> op_list)
指定优先使用mkldnn加速的op列表。
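示例(op名称仅为示意):
```c++
std::unordered_set<std::string> op_list = {"conv2d", "pool2d", "fc"};
config.SetMKLDNNOp(op_list);
```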
参数:
- `op_list` - 优先使用mkldnn的op列表
返回:`None`
返回类型:`void`
###### EnableMkldnnQuantizer()
启用mkldnn量化。
参数:
- `None`
返回:`None`
返回类型:`void`
###### mkldnn_quantizer_enabled()
是否启用mkldnn量化。
参数:
- `None`
返回:是否启用mkldnn量化
返回类型:`bool`
###### EnableMkldnnBfloat16()
启用mkldnn bf16。
参数:
- `None`
返回:`None`
返回类型:`void`
###### mkldnn_bfloat16_enabled()
是否启用mkldnn bf16。
参数:
- `None`
返回:是否启用mkldnn bf16
返回类型:`bool`
###### mkldnn_quantizer_config()
返回mkldnn量化config。
参数:
- `None`
返回:mkldnn量化config
返回类型:`MkldnnQuantizerConfig`
###### SetCpuMathLibraryNumThreads(int cpu_math_library_num_threads)
设置cpu blas库计算线程数。
参数:
- `cpu_math_library_num_threads` - blas库计算线程数
返回:`None`
返回类型:`void`
###### cpu_math_library_num_threads()
cpu blas库计算线程数。
参数:
- `None`
返回:cpu blas库计算线程数。
返回类型:`int`
###### ToNativeConfig()
转化为NativeConfig,不推荐使用。
参数:
- `None`
返回:当前Config对应的NativeConfig
返回类型:`NativeConfig`
###### EnableGpuMultiStream()
开启线程流,目前的行为是为每一个线程绑定一个流,在将来该行为可能改变。
参数:
- `None`
返回:`None`
返回类型:`void`
###### thread_local_stream_enabled()
是否启用线程流。
参数:
- `None`
返回:是否启用线程流。
返回类型:`bool`
###### EnableMemoryOptim()
开启内/显存复用,具体降低内存效果取决于模型结构。
参数:
- `None`
返回:`None`
返回类型:`void`
###### enable_memory_optim()
是否开启内/显存复用。
参数:
- `None`
返回:是否开启内/显存复用。
返回类型:`bool`
###### EnableProfile()
打开profile,运行结束后会打印所有op的耗时占比。
参数:
- `None`
返回:`None`
返回类型:`void`
###### profile_enabled()
是否开启profile。
参数:
- `None`
返回:是否开启profile
返回类型:`bool`
###### DisableGlogInfo()
去除Paddle Inference运行中的log。
参数:
- `None`
返回:`None`
返回类型:`void`
###### glog_info_disabled()
是否禁用了log。
参数:
- `None`
返回:是否禁用了log
返回类型:`bool`
###### SetInValid()
设置Config为无效状态,仅内部使用,保证每一个Config仅用来初始化一次Predictor。
参数:
- `None`
返回:`None`
返回类型:`void`
###### is_valid()
当前Config是否有效。
参数:
- `None`
返回:Config是否有效
返回类型:`bool`
###### pass_builder()
返回pass_builder,用来自定义图分析阶段选择的ir。
示例:
```c++
Config config;
auto pass_builder = config.pass_builder();
pass_builder->DeletePass("fc_fuse_pass"); // 去除fc_fuse
```
参数:
- `None`
返回:pass_builder
返回类型:`PassStrategy`
##### PredictorPool
```c++
class PredictorPool;
```
`PredictorPool`对`Predictor`进行了简单的封装,通过传入config和thread的数目来完成初始化,在每个线程中,根据自己的线程id直接从池中取出对应的`Predictor`来完成预测过程。
示例:
```c++
Config config;
// init config
int thread_num = 4;
PredictorPool pool(config, thread_num);
auto predictor0 = pool.Retrive(0);
...
auto predictor3 = pool.Retrive(3);
```
###### Retrive(idx)
根据线程id取出该线程对应的Predictor。
参数:
- `idx(int)` - 线程id
返回:线程对应的Predictor
返回类型:`Predictor*`
# Introduction to C++ Inference API
To make the deployment of inference models more convenient, a set of high-level APIs is provided in Fluid to hide the diverse low-level optimization processes.
Details are as follows:
## <a name="Use AnalysisPredictor to perform high-performance inference"> Use AnalysisPredictor to perform high-performance inference</a>
Paddle Fluid uses AnalysisPredictor to perform inference. AnalysisPredictor is a high-performance inference engine. Through analysis of the computation graph, the engine completes a series of optimizations (such as OP fusion, memory / graphic memory optimization, and support for underlying acceleration libraries such as MKLDNN and TensorRT), which can greatly improve inference performance.
In order to show the complete inference process, the following is a complete example of using AnalysisPredictor. The specific concepts and configurations involved will be detailed in the following sections.
#### AnalysisPredictor sample
``` c++
#include "paddle_inference_api.h"
namespace paddle {
void CreateConfig(AnalysisConfig* config, const std::string& model_dirname) {
// load model from disk
config->SetModel(model_dirname + "/model",
model_dirname + "/params");
// config->SetModel(model_dirname);
// use SetModelBuffer if load model from memory
// config->SetModelBuffer(prog_buffer, prog_size, params_buffer, params_size);
config->EnableUseGpu(100 /*init graphic memory by 100MB*/, 0 /*set GPUID to 0*/);
/* for cpu
config->DisableGpu();
config->EnableMKLDNN(); // enable MKLDNN
config->SetCpuMathLibraryNumThreads(10);
*/
config->SwitchUseFeedFetchOps(false);
// set to true if there are multiple inputs
config->SwitchSpecifyInputNames(true);
config->SwitchIrDebug(true); // If the visual debugging option is enabled, a dot file will be generated after each graph optimization process
// config->SwitchIrOptim(false); // The default is true. Turn off all optimizations if set to false
// config->EnableMemoryOptim(); // Enable memory / graphic memory reuse
}
void RunAnalysis(int batch_size, std::string model_dirname) {
// 1. create AnalysisConfig
AnalysisConfig config;
CreateConfig(&config, model_dirname);
// 2. create predictor based on config, and prepare input data
auto predictor = CreatePaddlePredictor(config);
int channels = 3;
int height = 224;
int width = 224;
std::vector<float> input(batch_size * channels * height * width, 0.f);  // all-zero example input
// 3. build inputs
// uses ZeroCopy API here to avoid extra copying from CPU, improving performance
auto input_names = predictor->GetInputNames();
auto input_t = predictor->GetInputTensor(input_names[0]);
input_t->Reshape({batch_size, channels, height, width});
input_t->copy_from_cpu(input.data());
// 4. run inference
CHECK(predictor->ZeroCopyRun());
// 5. get outputs
std::vector<float> out_data;
auto output_names = predictor->GetOutputNames();
auto output_t = predictor->GetOutputTensor(output_names[0]);
std::vector<int> output_shape = output_t->shape();
int out_num = std::accumulate(output_shape.begin(), output_shape.end(), 1, std::multiplies<int>());
out_data.resize(out_num);
output_t->copy_to_cpu(out_data.data());
}
} // namespace paddle
int main() {
// the model can be downloaded from http://paddle-inference-dist.cdn.bcebos.com/tensorrt_test/mobilenet.tar.gz
paddle::RunAnalysis(1, "./mobilenet");
return 0;
}
```
## <a name="Use AnalysisConfig to manage inference configurations"> Use AnalysisConfig to manage inference configurations</a>
AnalysisConfig manages the inference configuration of AnalysisPredictor, providing model path setting, inference engine running device selection, and a variety of options to optimize the inference process. The configuration method is as follows:
#### General optimizing configuration
``` c++
config->SwitchIrOptim(true); // Enable analysis and optimization of calculation graph,including OP fusion, etc
config->EnableMemoryOptim(); // Enable memory / graphic memory reuse
```
**Note:** Using ZeroCopyTensor requires the following setting:
``` c++
config->SwitchUseFeedFetchOps(false); // disable feed and fetch OP
```
#### set model and param path
When loading the model from disk, there are two ways to set the path of AnalysisConfig to load the model and parameters according to the storage mode of the model and parameter file:
* Non-combined form: when there is a model file and multiple parameter files under the model folder `model_dir`, the path of the model folder is passed in. The default name of the model file is `__model__`.
``` c++
config->SetModel("./model_dir");
```
* Combined form: when there is only one model file `model` and one parameter file `params` under the model folder `model_dir`, the model file and parameter file paths are passed in.
``` c++
config->SetModel("./model_dir/model", "./model_dir/params");
```
At compile time, link against `libpaddle_fluid.a` or `libpaddle_fluid.so`.
#### Configure CPU inference
``` c++
config->DisableGpu(); // disable GPU
config->EnableMKLDNN(); // enable MKLDNN, accelerating CPU inference
config->SetCpuMathLibraryNumThreads(10); // set number of threads of CPU Math libs, accelerating CPU inference if CPU cores are adequate
```
#### Configure GPU inference
``` c++
config->EnableUseGpu(100, 0); // initialize 100M graphic memory, using GPU ID 0
config->GpuDeviceId(); // Returns the GPU ID being used
// Turn on TRT to improve GPU performance. You need to use library with tensorrt
config->EnableTensorRtEngine(1 << 20 /*workspace_size*/,
batch_size /*max_batch_size*/,
3 /*min_subgraph_size*/,
AnalysisConfig::Precision::kFloat32 /*precision*/,
false /*use_static*/,
false /*use_calib_mode*/);
```
## <a name="Use ZeroCopyTensor to manage I/O"> Use ZeroCopyTensor to manage I/O</a>
ZeroCopyTensor is the input / output data structure of AnalysisPredictor. Using ZeroCopyTensor avoids redundant data copies when preparing inputs and fetching outputs, improving inference performance.
**Note:** When using ZeroCopyTensor, be sure to set `config->SwitchUseFeedFetchOps(false);`.
``` c++
// get input/output tensor
auto input_names = predictor->GetInputNames();
auto input_t = predictor->GetInputTensor(input_names[0]);
auto output_names = predictor->GetOutputNames();
auto output_t = predictor->GetOutputTensor(output_names[0]);
// reshape tensor
input_t->Reshape({batch_size, channels, height, width});
// Through the copy_from_cpu interface, the CPU data is prepared; through the copy_to_cpu interface, the output data is copied to the CPU
input_t->copy_from_cpu<float>(input_data /*data pointer*/);
output_t->copy_to_cpu(out_data /*data pointer*/);
// set LOD
std::vector<std::vector<size_t>> lod_data = {{0}, {0}};
input_t->SetLoD(lod_data);
// get Tensor data pointer
float *input_d = input_t->mutable_data<float>(PaddlePlace::kGPU); // use PaddlePlace::kCPU when running inference on CPU
int output_size;
float *output_d = output_t->data<float>(PaddlePlace::kGPU, &output_size);
```
## <a name="C++ inference sample"> C++ inference sample</a>
1. Download or compile C++ Inference Library, refer to [Install and Compile C++ Inference Library](./build_and_install_lib_en.html).
2. Download [C++ inference sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz) and uncompress it , then enter `sample/inference` directory.
`inference` directory structure is as following:
``` shell
inference
├── CMakeLists.txt
├── mobilenet_test.cc
├── thread_mobilenet_test.cc
├── mobilenetv1
│ ├── model
│ └── params
├── run.sh
└── run_impl.sh
```
- `mobilenet_test.cc` is the source code for single-thread inference.
- `thread_mobilenet_test.cc` is the source code for multi-thread inference.
- `mobilenetv1` is the model directory.
- `run.sh` is the script for running inference.
3. Configure script:
Before running, we need to configure the script `run.sh` as follows:
``` shell
# set whether to enable MKL, GPU or TensorRT. Enabling TensorRT requires WITH_GPU being ON
WITH_MKL=ON
WITH_GPU=OFF
USE_TENSORRT=OFF
# set path to CUDA lib dir, CUDNN lib dir, TensorRT root dir and model dir
LIB_DIR=YOUR_LIB_DIR
CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR
CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR
TENSORRT_ROOT_DIR=YOUR_TENSORRT_ROOT_DIR
MODEL_DIR=YOUR_MODEL_DIR
```
Please configure `run.sh` depending on your environment.
4. Build and run the sample.
``` shell
sh run.sh
```
## <a name="Performance tuning"> Performance tuning</a>
### Tuning on CPU
1. If the CPU model allows, try to use the version with AVX and MKL.
2. You can try to use Intel's MKLDNN acceleration.
3. When the number of CPU cores available is enough, you can increase the num value in the setting `config->SetCpuMathLibraryNumThreads(num);`.
### Tuning on GPU
1. You can try to open the TensorRT subgraph acceleration engine. Through the graph analysis, Paddle can automatically fuse certain subgraphs, and call NVIDIA's TensorRT for acceleration. For details, please refer to [Use Paddle-TensorRT Library for inference](../../performance_improving/inference_improving/paddle_tensorrt_infer_en.html)
### Tuning with multi-thread
Paddle Fluid supports optimizing prediction performance by running multiple AnalysisPredictors on different threads, and supports CPU and GPU environments.
The multi-thread sample is `thread_mobilenet_test.cc`, included in the downloaded [sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz). You can change `mobilenet_test` in `run.sh` to `thread_mobilenet_test` and run:
``` shell
sh run.sh
```
# Performance Profiling for TensorRT Library
## Test Environment
- CPU: Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz; GPU: Tesla P4
- TensorRT 4.0, CUDA 8.0, cuDNN v7
- Test models: ResNet50, MobileNet, ResNet101, Inception V3.
## Test Targets
**PaddlePaddle, PyTorch, TensorFlow**
- In the test, PaddlePaddle uses subgraph optimization to integrate TensorRT ([models](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)).
- The native implementation is used in PyTorch. Models: [address 1](https://github.com/pytorch/vision/tree/master/torchvision/models), [address 2](https://github.com/marvis/pytorch-mobilenet).
- The TensorFlow test covers both native TF and TF-TRT. **The TF-TRT test has not met expectations yet and will be supplemented later.** Model [address](https://github.com/tensorflow/models).
### ResNet50
|batch_size|PaddlePaddle(ms)|Pytorch(ms)|TensorFlow(ms)|
|---|---|---|---|
|1|4.64117 |16.3|10.878|
|5|6.90622| 22.9 |20.62|
|10|7.9758 |40.6|34.36|
### MobileNet
|batch_size|PaddlePaddle(ms)|Pytorch(ms)|TensorFlow(ms)|
|---|---|---|---|
|1| 1.7541 | 7.8 |2.72|
|5| 3.04666 | 7.8 |3.19|
|10|4.19478 | 14.47 |4.25|
### ResNet101
|batch_size|PaddlePaddle(ms)|Pytorch(ms)|TensorFlow(ms)|
|---|---|---|---|
|1|8.95767| 22.48 |18.78|
|5|12.9811 | 33.88 |34.84|
|10|14.1463| 61.97 |57.94|
### Inception v3
|batch_size|PaddlePaddle(ms)|Pytorch(ms)|TensorFlow(ms)|
|---|---|---|---|
|1|15.1613 | 24.2 |19.1|
|5|18.5373 | 34.8 |27.2|
|10|19.2781| 54.8 |36.7|
# Python 预测 API介绍
Paddle提供了高度优化的[C++预测库](./native_infer.html),为了方便使用,我们也提供了C++预测库对应的Python接口,下面是详细的使用说明。
如果您在使用2.0之前的Paddle,请参考[旧版API](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.8/advanced_guide/inference_deployment/inference/python_infer_cn.html)文档。
## Python预测相关数据结构
使用Python预测API与C++预测API相似,主要包括 `Tensor`、`DataType`、`Config` 和 `Predictor`,分别对应于C++ API中同名的类型。
### DataType
class paddle.inference.DataType
`DataType`定义了`Tensor`的数据类型,由传入`Tensor`的numpy数组类型确定(见下方示例),包括以下成员
* `INT64`: 64位整型
* `INT32`: 32位整型
* `FLOAT32`: 32位浮点型
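下面给出一个简要示意(假设 `predictor` 已按后文 `Predictor` 小节的方式创建),展示传入的numpy数组 dtype 如何决定 `Tensor` 的 `DataType`:
``` python
import numpy as np
# 获取输入handle(即Tensor)
input_names = predictor.get_input_names()
input_handle = predictor.get_input_handle(input_names[0])
# 传入float32数组对应DataType.FLOAT32,传入int64数组则对应DataType.INT64
fake_input = np.zeros((1, 3, 224, 224), dtype="float32")
input_handle.copy_from_cpu(fake_input)
print(input_handle.type())
```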
### PrecisionType
class paddle.inference.PrecisionType
`PrecisionType`定义了`Predictor`运行的精度模式,包括以下成员
* `Float32`: fp32模式运行
* `Half`: fp16模式运行
* `Int8`: int8模式运行
### Tensor
class paddle.inference.Tensor
`Tensor` 是 `Predictor` 的一种输入/输出数据结构,通过 `predictor` 获取输入/输出 handle 得到,主要提供以下方法
* `copy_from_cpu`: 从cpu获取模型运行所需输入数据
* `copy_to_cpu`: 获取模型运行输出结果
* `lod`: 获取lod信息
* `set_lod`: 设置lod信息
* `shape`: 获取shape信息
* `reshape`: 设置shape信息
* `type`: 获取DataType信息
``` python
# 创建predictor
predictor = create_predictor(config)
# 获取输入的名称
input_names = predictor.get_input_names()
input_tensor = predictor.get_input_handle(input_names[0])
# 设置输入
fake_input = numpy.random.randn(1, 3, 318, 318).astype("float32")
input_tensor.copy_from_cpu(fake_input)
# 运行predictor
predictor.run()
# 获取输出
output_names = predictor.get_output_names()
output_tensor = predictor.get_output_handle(output_names[0])
output_data = output_tensor.copy_to_cpu() # numpy.ndarray类型
```
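对于变长输入(如文本序列),可以在拷贝数据前通过 `set_lod` 设置LoD信息。下面是一个简要示意(假设 `input_tensor` 已按上例获取,且模型输入为LoDTensor):
``` python
# 两个序列,长度分别为3和5,LoD以累积偏移表示
input_tensor.set_lod([[0, 3, 8]])
print(input_tensor.lod())
```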
### Config
class paddle.inference.Config
`Config`是创建预测引擎的配置,提供了模型路径设置、预测引擎运行设备选择以及多种优化预测流程的选项,主要包括以下方法
* `set_model`: 设置模型的路径
* `model_dir`: 返回模型文件夹路径
* `prog_file`: 返回模型文件路径
* `params_file`: 返回参数文件路径
* `enable_use_gpu`: 设置GPU显存(单位M)和Device ID
* `disable_gpu`: 禁用GPU
* `gpu_device_id`: 返回使用的GPU ID
* `switch_ir_optim`: IR优化(默认开启)
* `enable_tensorrt_engine`: 开启TensorRT
* `enable_mkldnn`: 开启MKLDNN
* `disable_glog_info`: 禁用预测中的glog日志
* `delete_pass`: 预测的时候删除指定的pass
#### 代码示例
设置模型和参数路径有两种形式:
* 当模型文件夹下存在一个模型文件和多个参数文件时,传入模型文件夹路径,模型文件名默认为`__model__`
``` python
config = Config("./model")
```
* 当模型文件夹下只有一个模型文件和一个参数文件时,传入模型文件和参数文件路径
``` python
config = Config("./model/model", "./model/params")
```
使用 `set_model` 方法设置模型和参数路径的方式同上。
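一个简要的示意如下(此处假设 `Config` 支持无参构造,实际请以您的模型组织形式和所用版本的接口为准):
``` python
# 非combined形式:传入模型文件夹路径
config = Config()
config.set_model("./model")
# combined形式:分别传入模型文件和参数文件路径
# config.set_model("./model/model", "./model/params")
```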
其他预测引擎配置选项示例如下
``` python
config.enable_use_gpu(100, 0) # 初始化100M显存,使用gpu id为0
config.gpu_device_id() # 返回正在使用的gpu id
config.disable_gpu() # 禁用gpu
config.switch_ir_optim(True) # 开启IR优化
config.enable_tensorrt_engine(precision_mode=PrecisionType.Float32,
use_calib_mode=True) # 开启TensorRT预测,精度为fp32,开启int8离线量化
config.enable_mkldnn() # 开启MKLDNN
```
### Predictor
class paddle.inference.Predictor
`Predictor`是运行预测的引擎,由`paddle.inference.create_predictor(config)`创建,主要提供以下方法
* `run()`: 运行预测引擎,返回预测结果
* `get_input_names()`: 获取输入的名称
* `get_input_handle(input_name: str)`: 根据输入的名称获取对应的`Tensor`
* `get_output_names()`: 获取输出的名称
* `get_output_handle(output_name: str)`: 根据输出的名称获取对应的`Tensor`
#### 代码示例
``` python
# 设置完AnalysisConfig后创建预测引擎PaddlePredictor
predictor = create_predictor(config)
# 获取输入的名称
input_names = predictor.get_input_names()
input_handle = predictor.get_input_handle(input_names[0])
# 设置输入
fake_input = numpy.random.randn(1, 3, 318, 318).astype("float32")
input_handle.reshape([1, 3, 318, 318])
input_handle.copy_from_cpu(fake_input)
# 运行predictor
predictor.run()
# 获取输出
output_names = predictor.get_output_names()
output_handle = predictor.get_output_handle(output_names[0])
```
## 完整使用示例
下面是使用Paddle Inference Python API进行预测的一个完整示例,使用resnet50模型
下载[resnet50模型](http://paddle-inference-dist.bj.bcebos.com/resnet50_model.tar.gz)并解压,运行如下命令将会调用预测引擎
``` bash
python resnet50_infer.py --model_file ./model/model --params_file ./model/params --batch_size 2
```
`resnet50_infer.py` 的内容是
``` python
import argparse
import numpy as np
from paddle.inference import Config
from paddle.inference import create_predictor
def main():
args = parse_args()
# 设置AnalysisConfig
config = set_config(args)
# 创建PaddlePredictor
predictor = create_predictor(config)
# 获取输入的名称
input_names = predictor.get_input_names()
input_handle = predictor.get_input_handle(input_names[0])
# 设置输入
fake_input = np.random.randn(1, 3, 318, 318).astype("float32")
input_handle.reshape([1, 3, 318, 318])
input_handle.copy_from_cpu(fake_input)
# 运行predictor
predictor.run()
# 获取输出
output_names = predictor.get_output_names()
output_handle = predictor.get_output_handle(output_names[0])
output_data = output_handle.copy_to_cpu() # numpy.ndarray类型
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("--model_file", type=str, help="model filename")
parser.add_argument("--params_file", type=str, help="parameter filename")
parser.add_argument("--batch_size", type=int, default=1, help="batch size")
return parser.parse_args()
def set_config(args):
config = Config(args.model_file, args.params_file)
config.disable_gpu()
config.switch_use_feed_fetch_ops(False)
config.switch_specify_input_names(True)
return config
if __name__ == "__main__":
main()
```
## 支持方法列表
* Tensor
* `copy_from_cpu(input: numpy.ndarray) -> None`
* `copy_to_cpu() -> numpy.ndarray`
* `reshape(input: numpy.ndarray|List[int]) -> None`
* `shape() -> List[int]`
* `set_lod(input: numpy.ndarray|List[List[int]]) -> None`
* `lod() -> List[List[int]]`
* `type() -> PaddleDType`
* Config
* `set_model(model_dir: str) -> None`
* `set_model(prog_file: str, params_file: str) -> None`
* `set_model_buffer(model: str, model_size: int, param: str, param_size: int) -> None`
* `model_dir() -> str`
* `prog_file() -> str`
* `params_file() -> str`
* `model_from_memory() -> bool`
* `set_cpu_math_library_num_threads(num: int) -> None`
* `enable_use_gpu(memory_pool_init_size_mb: int, device_id: int) -> None`
* `use_gpu() -> bool`
* `gpu_device_id() -> int`
* `switch_ir_optim(x: bool = True) -> None`
* `switch_ir_debug(x: int=True) -> None`
* `ir_optim() -> bool`
* `enable_tensorrt_engine(workspace_size: int = 1 << 20,
max_batch_size: int,
min_subgraph_size: int,
precision_mode: AnalysisConfig.precision,
use_static: bool,
use_calib_mode: bool) -> None`
* `set_trt_dynamic_shape_info(min_input_shape: Dict[str, List[int]]={}, max_input_shape: Dict[str, List[int]]={}, optim_input_shape: Dict[str, List[int]]={}, disable_trt_plugin_fp16: bool=False) -> None`
* `tensorrt_engine_enabled() -> bool`
* `enable_mkldnn() -> None`
* `enable_mkldnn_bfloat16() -> None`
* `mkldnn_enabled() -> bool`
* `set_mkldnn_cache_capacity(capacity: int=0) -> None`
* `set_mkldnn_op(ops: Set[str]) -> None`
* `set_optim_cache_dir(dir: str) -> None`
* `disable_glog_info() -> None`
* `pass_builder() -> paddle::PassStrategy`
* `delete_pass(pass_name: str) -> None`
* `cpu_math_library_num_threads() -> int`
* `disable_gpu() -> None`
* `enable_lite_engine(precision: PrecisionType, zero_copy: bool, passes_filter: List[str]=[], ops_filter: List[str]=[]) -> None`
* `lite_engine_enabled() -> bool`
* `enable_memory_optim() -> None`
* `enable_profile() -> None`
* `enable_quantizer() -> None`
* `quantizer_config() -> paddle::MkldnnQuantizerConfig`
* `fraction_of_gpu_memory_for_pool() -> float`
* `memory_pool_init_size_mb() -> int`
* `glog_info_disabled() -> bool`
* `gpu_device_id() -> int`
* `specify_input_name() -> bool`
* `switch_specify_input_names(x: bool=True) -> None`
* `switch_use_feed_fetch_ops(x: int=True) -> None`
* `use_feed_fetch_ops_enabled() -> bool`
* `to_native_config() -> paddle.fluid.core_avx.NativeConfig`
* `create_predictor(config: Config) -> Predictor`
* Predictor
* `run() -> None`
* `get_input_names() -> List[str]`
* `get_input_handle(input_name: str) -> Tensor`
* `get_output_names() -> List[str]`
* `get_output_handle(output_name: str) -> Tensor`
* `clear_intermediate_tensor() -> None`
* `clone() -> Predictor`
* PredictorPool
* `retrive(idx: int) -> Predictor`
可参考对应的[C++预测接口](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/pybind/inference_api.cc),其中定义了每个接口的参数和返回值。
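下面是一个使用 `PredictorPool` 的简要示意(其中 `PredictorPool(config, 2)` 的构造方式为假设写法,实际参数请以上述C++预测接口中的定义为准):
``` python
from paddle.inference import Config, PredictorPool
config = Config("./model/model", "./model/params")
config.disable_gpu()
# 假设:第二个参数为预测器个数;多线程场景下,每个线程通过下标取得各自的Predictor
pool = PredictorPool(config, 2)
predictor_0 = pool.retrive(0)
predictor_1 = pool.retrive(1)
```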
安装与编译 Windows 预测库
===========================
下载安装包与对应的测试环境
-------------
| 版本说明 | 预测库(1.8.3版本) | 编译器 | 构建工具 | cuDNN | CUDA |
|:---------|:-------------------|:-------------------|:----------------|:--------|:-------|
| cpu_avx_mkl | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/mkl/cpu/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | - | - |
| cpu_avx_openblas | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/open/cpu/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | - | - |
| cuda9.0_cudnn7_avx_mkl | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/mkl/post97/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | 7.3.1 | 9.0 |
| cuda9.0_cudnn7_avx_openblas | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/open/post97/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | 7.3.1 | 9.0 |
| cuda10.0_cudnn7_avx_mkl | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/mkl/post107/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | 7.4.1 | 10.0 |
### 硬件环境
测试环境硬件配置:
| CPU | I7-8700K |
|:---------|:-------------------|
| 内存 | 16G |
| 硬盘 | 1T hdd + 256G ssd |
| 显卡 | GTX1080 8G |
测试环境操作系统使用 win10 家庭版本
从源码编译预测库
--------------
用户也可以从 PaddlePaddle 核心代码编译C++预测库,只需在编译时配制下面这些编译选项:
|选项 |说明 | 值 |
|:-------------|:-------|:------------|
|CMAKE_BUILD_TYPE | 配置生成器上的构建类型,windows预测库目前只支持Release | Release |
|ON_INFER | 是否生成预测库,编译预测库时必须设置为ON | ON |
|WITH_GPU | 是否支持GPU | ON/OFF |
|WITH_MKL | 是否使用Intel MKL(数学核心库) | ON/OFF |
|WITH_PYTHON | 是否内嵌PYTHON解释器 | OFF(推荐) |
|MSVC_STATIC_CRT|是否使用/MT 模式进行编译,Windows默认使用 /MT 模式进行编译 |ON/OFF|
|CUDA_TOOKIT_ROOT_DIR|编译GPU预测库时,需设置CUDA的根目录|YOUR_CUDA_PATH|
请按照推荐值设置,以避免链接不必要的库。其它可选编译选项按需进行设定。
更多具体编译选项含义请参见[编译选项表](../../../beginners_guide/install/Tables.html#Compile)
Windows下安装与编译预测库步骤:(在Windows命令提示符下执行以下指令)
1. 将PaddlePaddle的源码clone在当下目录的Paddle文件夹中,并进入Paddle目录:
```bash
git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle
```
2. 执行cmake:
- 编译CPU预测
```bash
# 创建并进入build目录
mkdir build
cd build
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=OFF -DWITH_GPU=OFF -DON_INFER=ON -DWITH_PYTHON=OFF
# Windows默认使用 /MT 模式进行编译,如果想使用 /MD 模式,请使用以下命令。如不清楚两者的区别,请使用上面的命令
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=OFF -DWITH_GPU=OFF -DON_INFER=ON -DWITH_PYTHON=OFF -DMSVC_STATIC_CRT=OFF
```
- 编译GPU预测库:
```bash
# -DCUDA_TOOKIT_ROOT_DIR 为cuda根目录,例如-DCUDA_TOOKIT_ROOT_DIR="D:\\cuda"
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=ON -DWITH_GPU=ON -DON_INFER=ON -DWITH_PYTHON=OFF -DCUDA_TOOKIT_ROOT_DIR=YOUR_CUDA_PATH
```
3. 使用Blend for Visual Studio 2015 打开 `paddle.sln` 文件,选择平台为`x64`,配置为`Release`,编译inference_lib_dist项目。
操作方法:在Visual Studio中选择相应模块,右键选择"生成"(或者"build")
编译成功后,使用C++预测库所需的依赖(包括:(1)编译出的PaddlePaddle预测库和头文件;(2)第三方链接库和头文件;(3)版本信息与编译选项信息)
均会存放于`fluid_inference_install_dir`目录中。
version.txt 中记录了该预测库的版本信息,包括Git Commit ID、使用OpenBlas或MKL数学库、CUDA/CUDNN版本号,如:
GIT COMMIT ID: cc9028b90ef50a825a722c55e5fda4b7cd26b0d6
WITH_MKL: ON
WITH_MKLDNN: ON
WITH_GPU: ON
CUDA version: 8.0
CUDNN version: v7
编译预测demo
-------------
### 硬件环境
测试环境硬件配置:
| CPU | I7-8700K |
|:---------|:-------------------|
| 内存 | 16G |
| 硬盘 | 1T hdd + 256G ssd |
| 显卡 | GTX1080 8G |
测试环境操作系统使用 win10 家庭版本。
### 软件要求
**请您严格按照以下步骤进行安装,否则可能会导致安装失败!**
**安装Visual Studio 2015 update3**
安装Visual Studio 2015,安装选项中选择安装内容时勾选自定义,选择安装全部关于c,c++,vc++的功能。
### 其他要求
1. 你需要直接下载Windows预测库或者从Paddle源码编译预测库,确保windows预测库存在。
2. 你需要下载Paddle源码,确保demo文件和脚本文件存在:
```bash
git clone https://github.com/PaddlePaddle/Paddle.git
```
### 编译demo
Windows下编译预测demo步骤:(在Windows命令提示符下执行以下指令)
#### 使用脚本编译运行
进入到demo_ci目录,运行脚本`run_windows_demo.bat`,根据提示按需输入参数:
```dos
# path为下载Paddle的目录
cd path\Paddle\paddle\fluid\inference\api\demo_ci
run_windows_demo.bat
```
其中,run_windows_demo.bat 的部分选项如下:
```dos
gpu_inference=Y #是否使用GPU预测库,默认使用CPU预测库
use_mkl=Y #该预测库是否使用MKL,默认为Y
use_gpu=Y #是否使用GPU进行预测,默认为N。使用GPU预测需要下载GPU版本预测库
paddle_inference_lib=path\fluid_inference_install_dir #设置paddle预测库的路径
cuda_lib_dir=path\lib\x64 #设置cuda库的路径
vcvarsall_dir=path\vc\vcvarsall.bat #设置visual studio本机工具命令提示符路径
```
#### 手动编译运行
1. 进入demo_ci目录,创建并进入build目录
```dos
# path为下载Paddle的目录
cd path\Paddle\paddle\fluid\inference\api\demo_ci
mkdir build
cd build
```
2. 执行cmake(cmake可以在[官网进行下载](https://cmake.org/download/),并添加到环境变量中):
- 使用CPU预测库编译demo
```dos
# -DDEMO_NAME 是要编译的文件
# -DPADDLE_LIB是预测库目录,例如-DPADDLE_LIB=D:\fluid_inference_install_dir
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DWITH_GPU=OFF -DWITH_MKL=ON -DWITH_STATIC_LIB=ON ^
-DCMAKE_BUILD_TYPE=Release -DDEMO_NAME=simple_on_word2vec -DPADDLE_LIB=path_to_the_paddle_lib -DMSVC_STATIC_CRT=ON
```
- 使用GPU预测库编译demo
```dos
# -DCUDA_LIB CUDA的库目录,例如-DCUDA_LIB=D:\cuda\lib\x64
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DWITH_GPU=ON -DWITH_MKL=ON -DWITH_STATIC_LIB=ON ^
-DCMAKE_BUILD_TYPE=Release -DDEMO_NAME=simple_on_word2vec -DPADDLE_LIB=path_to_the_paddle_lib -DMSVC_STATIC_CRT=ON -DCUDA_LIB=YOUR_CUDA_LIB
```
3. 使用Blend for Visual Studio 2015 打开 `cpp_inference_demo.sln` 文件,选择平台为`x64`,配置为`Release`,编译simple_on_word2vec项目。
操作方法: 在Visual Studio中选择相应模块,右键选择"生成"(或者"build")
4. [下载模型](http://paddle-inference-dist.bj.bcebos.com/word2vec.inference.model.tar.gz)并解压到当前目录,执行命令:
```dos
# 开启GLOG
set GLOG_v=100
# 进行预测,path为模型解压后的目录
Release\simple_on_word2vec.exe --dirname=path\word2vec.inference.model
```
### 实现一个简单预测demo
[完整的代码示例](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/api/demo_ci/windows_mobilenet.cc)
本示例使用了AnalysisConfig管理AnalysisPredictor的预测配置,提供了模型路径设置、预测引擎运行设备选择以及使用ZeroCopyTensor管理输入/输出的设置。具体步骤如下:
1. 创建AnalysisConfig
```C++
AnalysisConfig config;
config.SwitchUseFeedFetchOps(false);  // 关闭feed和fetch OP使用,使用ZeroCopy接口必须设置此项
// config.EnableUseGpu(100 /*设定GPU初始显存池为100MB*/, 0 /*设定GPU ID为0*/);  // 开启GPU预测
```
2. 在config中设置模型和参数路径
从磁盘加载模型时,根据模型和参数文件存储方式不同,设置AnalysisConfig加载模型和参数的路径有两种形式,此处使用combined形式:
- 非combined形式:模型文件夹`model_dir`下存在一个模型文件和多个参数文件时,传入模型文件夹路径,模型文件名默认为`__model__`
``` c++
config.SetModel("path\\model_dir");
```
- combined形式:模型文件夹`model_dir`下只有一个模型文件`__model__`和一个参数文件`__params__`时,传入模型文件和参数文件路径。
```C++
config.SetModel("path\\model_dir\\__model__", "path\\model_dir\\__params__");
```
3. 创建predictor,准备输入数据
```C++
std::unique_ptr<PaddlePredictor> predictor = CreatePaddlePredictor(config);
int batch_size = 1;
int channels = 3; // channels,height,width三个参数必须与模型中对应输入的shape一致
int height = 300;
int width = 300;
int nums = batch_size * channels * height * width;
float* input = new float[nums];
for (int i = 0; i < nums; ++i) input[i] = 0;
```
4. 使用ZeroCopyTensor管理输入
```C++
// 通过创建的AnalysisPredictor获取输入Tensor,该Tensor为ZeroCopyTensor
auto input_names = predictor->GetInputNames();
auto input_t = predictor->GetInputTensor(input_names[0]);
// 对Tensor进行reshape,将准备好的输入数据从CPU拷贝到ZeroCopyTensor中
input_t->Reshape({batch_size, channels, height, width});
input_t->copy_from_cpu(input);
```
5. 运行预测引擎
```C++
predictor->ZeroCopyRun();
```
6. 使用ZeroCopyTensor管理输出
```C++
auto output_names = predictor->GetOutputNames();
auto output_t = predictor->GetOutputTensor(output_names[0]);
std::vector<int> output_shape = output_t->shape();
int out_num = std::accumulate(output_shape.begin(), output_shape.end(), 1,
std::multiplies<int>());
std::vector<float> out_data;
out_data.resize(out_num);
output_t->copy_to_cpu(out_data.data()); // 将ZeroCopyTensor中数据拷贝到cpu中,得到输出数据
delete[] input;
```
**Note:** 关于AnalysisPredictor的更多介绍,请参考[C++预测API介绍](./native_infer.html)
Install and Compile C++ Inference Library on Windows
========================================================
Direct Download and Install
------------------------------
| Version | Inference Libraries(v1.8.3) | Compiler | Build tools | cuDNN | CUDA |
|:---------|:-------------------|:-------------------|:----------------|:--------|:-------|
| cpu_avx_mkl | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/mkl/cpu/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | - | - |
| cpu_avx_openblas | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/open/cpu/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | - | - |
| cuda9.0_cudnn7_avx_mkl | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/mkl/post97/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | 7.3.1 | 9.0 |
| cuda9.0_cudnn7_avx_openblas | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/open/post97/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | 7.3.1 | 9.0 |
| cuda10.0_cudnn7_avx_mkl | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/mkl/post107/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | 7.4.1 | 10.0 |
### Hardware Environment
Hardware Configuration of the experimental environment:
| CPU | I7-8700K |
|:--------------|:-------------------|
| Memory | 16G |
| Hard Disk | 1T hdd + 256G ssd |
| Graphics Card | GTX1080 8G |
The operating system of the experimental environment is Windows 10 Home Edition.
Build From Source Code
------------------------
Users can also compile C++ inference libraries from the PaddlePaddle core code by specifying the following compile options at compile time:
|Option | Description | Value |
|:-------------|:-----|:--------------|
|CMAKE_BUILD_TYPE|Specifies the build type on single-configuration generators, Windows inference library currently only supports Release| Release |
|ON_INFER|Whether to generate the inference library. Must be set to ON when compiling the inference library. | ON |
|WITH_GPU|Whether to support GPU | ON/OFF |
|WITH_MKL|Whether to support MKL | ON/OFF |
|WITH_PYTHON|Whether the PYTHON interpreter is embedded | OFF |
|MSVC_STATIC_CRT|Whether to compile with the /MT runtime (the default on Windows) | ON/OFF |
|CUDA_TOOKIT_ROOT_DIR | When compiling the GPU inference library, you need to set the CUDA root directory | YOUR_CUDA_PATH |
For details on the compilation options, see [the compilation options list](../../../beginners_guide/install/Tables_en.html#Compile)
**Paddle Windows Inference Library Compilation Steps**
1. Clone Paddle source code from GitHub:
```bash
git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle
```
2. Run Cmake command
- compile CPU inference library
```bash
# create build directory
mkdir build
# change to the build directory
cd build
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=OFF -DWITH_GPU=OFF -DON_INFER=ON -DWITH_PYTHON=OFF
# use -DWITH_MKL to select math library: Intel MKL or OpenBLAS
# By default on Windows we use /MT for C Runtime Library, If you want to use /MD, please use the below command
# If you have no ideas the differences between the two, use the above one
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=OFF -DWITH_GPU=OFF -DON_INFER=ON -DWITH_PYTHON=OFF -DMSVC_STATIC_CRT=OFF
```
- compile GPU inference library
```bash
# -DCUDA_TOOKIT_ROOT_DIR is cuda root directory, such as -DCUDA_TOOKIT_ROOT_DIR="D:\\cuda"
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=ON -DWITH_GPU=ON -DON_INFER=ON -DWITH_PYTHON=OFF -DCUDA_TOOKIT_ROOT_DIR=YOUR_CUDA_PATH
```
3. Open `paddle.sln` using Visual Studio 2015, choose `x64` for Solution Platforms and `Release` for Solution Configurations, then build the `inference_lib_dist` project in the Solution Explorer (right-click the project and click Build).
The inference library will be installed in `fluid_inference_install_dir`.
version.txt contains the detailed configuration of the library, including the git commit ID, math library, and CUDA/cuDNN versions:
GIT COMMIT ID: cc9028b90ef50a825a722c55e5fda4b7cd26b0d6
WITH_MKL: ON
WITH_MKLDNN: ON
WITH_GPU: ON
CUDA version: 8.0
CUDNN version: v7
Inference Demo Compilation
----------------------------
### Hardware Environment
Hardware Configuration of the experimental environment:
| CPU | I7-8700K |
|:--------------|:-------------------|
| Memory | 16G |
| Hard Disk | 1T hdd + 256G ssd |
| Graphics Card | GTX1080 8G |
The operating system of the experimental environment is Windows 10 Home Edition.
### Steps to Configure Environment
**Please strictly follow the subsequent steps to install, otherwise the installation may fail**
**Install Visual Studio 2015 update3**
Install Visual Studio 2015. Please choose "customize" for the options of contents to be installed and choose to install all functions relevant to c, c++ and vc++.
### Other requirements
1. You need to download the Windows inference library or compile the inference library from Paddle source code.
2. You need to run the command to get the Paddle source code.
```bash
git clone https://github.com/PaddlePaddle/Paddle.git
```
### Usage of Inference demo
#### Compile with script
Open the windows command line and run the `run_windows_demo.bat`, and input parameters as required according to the prompts.
```dos
# Path is the directory of Paddle you downloaded.
cd path\Paddle\paddle\fluid\inference\api\demo_ci
run_windows_demo.bat
```
Some options of the script are as follows:
```dos
gpu_inference=Y # Use gpu_inference_lib or not(Y/N), default: N.
use_mkl=Y # Use MKL or not(Y/N), default: Y.
use_gpu=Y # Whether to use GPU for prediction, default: N.
paddle_inference_lib=path\fluid_inference_install_dir # Set the path of paddle inference library.
cuda_lib_dir=path\lib\x64 # Set the path of cuda library.
vcvarsall_dir=path\vc\vcvarsall.bat # Set the path of visual studio command prompt.
```
#### Compile manually
1. Create and change to the build directory
```dos
# path is the directory where Paddle is downloaded
cd path\Paddle\paddle\fluid\inference\api\demo_ci
mkdir build
cd build
```
2. Run Cmake command, cmake can be [downloaded at official site](https://cmake.org/download/) and added to environment variables.
- compile inference demo with CPU inference library
```dos
# Path is the directory where you downloaded paddle.
# -DDEMO_NAME is the file to be built
# -DPADDLE_LIB is the path of the inference library, for example: -DPADDLE_LIB=D:\fluid_inference_install_dir
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DWITH_GPU=OFF -DWITH_MKL=OFF -DWITH_STATIC_LIB=ON -DCMAKE_BUILD_TYPE=Release -DDEMO_NAME=simple_on_word2vec -DPADDLE_LIB=path_to_the_paddle_lib -DMSVC_STATIC_CRT=ON
```
- compile inference demo with GPU inference library
```dos
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DWITH_GPU=ON -DWITH_MKL=ON -DWITH_STATIC_LIB=ON ^
-DCMAKE_BUILD_TYPE=Release -DDEMO_NAME=simple_on_word2vec -DPADDLE_LIB=path_to_the_paddle_lib -DMSVC_STATIC_CRT=ON -DCUDA_LIB=YOUR_CUDA_LIB
```
3. Open `cpp_inference_demo.sln` using Visual Studio 2015, choose `x64` for Solution Platforms and `Release` for Solution Configurations, then build the `simple_on_word2vec` project in the Solution Explorer (right-click the project and click Build).
From the dependency packages provided, copy the openblas and model files under the Release directory into the Release directory generated by the build.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/FluidDoc/develop/doc/fluid/advanced_usage/deploy/inference/image/image8.png">
</p>
4. [Download model](http://paddle-inference-dist.bj.bcebos.com/word2vec.inference.model.tar.gz) and decompress it to the current directory. Run the command:
```dos
# Open GLOG
set GLOG_v=100
# Start inference, path is the directory where you decompressed the model
Release\simple_on_word2vec.exe --dirname=path\word2vec.inference.model
```
### Implementing a simple inference demo
[Complete code example](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/api/demo_ci/windows_mobilenet.cc)
This example uses AnalysisConfig to manage the AnalysisPredictor inference configuration, which provides model path setting, inference device selection, and ZeroCopyTensor-managed input/output. The steps are as follows:
1. Create AnalysisConfig
``` c++
AnalysisConfig config;
config.SwitchUseFeedFetchOps(false);  // Turn off the use of feed and fetch OPs; this must be set when using the ZeroCopy interface.
// config.EnableUseGpu(100 /*Set the GPU initial memory pool to 100MB*/, 0 /*Set GPU ID to 0*/);  // Turn on GPU inference
```
2. Set path of models and parameters
- When there is a model file and multiple parameter files under the model folder `model_dir`, the model folder path is passed in, and the model file name defaults to `__model__`.
``` c++
config.SetModel("path\\model_dir");
```
- When there is only one model file `__model__` and one parameter file `__params__` in the model folder `model_dir`, the model file and parameter file path are passed in.
```C++
config.SetModel("path\\model_dir\\__model__", "path\\model_dir\\__params__");
```
3. Create predictor and prepare input data
``` C++
std::unique_ptr<PaddlePredictor> predictor = CreatePaddlePredictor(config);
int batch_size = 1;
int channels = 3; // The parameters of channels, height, and width must be the same as those required by the input in the model.
int height = 300;
int width = 300;
int nums = batch_size * channels * height * width;
float* input = new float[nums];
for (int i = 0; i < nums; ++i) input[i] = 0;
```
4. Manage input with ZeroCopyTensor
```C++
auto input_names = predictor->GetInputNames();
auto input_t = predictor->GetInputTensor(input_names[0]);
// Reshape the input tensor, copy the prepared input data from the CPU to ZeroCopyTensor
input_t->Reshape({batch_size, channels, height, width});
input_t->copy_from_cpu(input);
```
5. Run prediction engine
```C++
predictor->ZeroCopyRun();
```
6. Manage output with ZeroCopyTensor
```C++
auto output_names = predictor->GetOutputNames();
auto output_t = predictor->GetOutputTensor(output_names[0]);
std::vector<int> output_shape = output_t->shape();
int out_num = std::accumulate(output_shape.begin(), output_shape.end(), 1,
std::multiplies<int>());
std::vector<float> out_data;
out_data.resize(out_num);
output_t->copy_to_cpu(out_data.data()); // Copy data from ZeroCopyTensor to cpu
delete[] input;
```
**Note:** For more introduction to AnalysisPredictor, please refer to the [introduction of C++ Prediction API](./native_infer_en.html).
# 开发者文档
## 基本概念
### Place
`Place`类确定了kernel运行时的上下文信息,其中包含了kernel运行时所在的平台,执行运算数据的精度以及数据的布局等信息,使得MIR的分析更加清晰准确。它主要的成员变量如下:
* `TargetType target`: kernel运行时所在的平台,如X86/CUDA/ARM等;
* `PrecisionType precision`: kernel执行运算的数据的精度,如Float, Int8, Fp16等;
* `DataLayoutType layout`: kernel执行运算的数据的布局,如NCHW, NHWC等;
### OpLite
`OpLite`类负责协助kernel计算,本身不具备计算功能,主要的接口功能包括:
* `CheckShape`: 用于检查op的输入/输出参数维度、类型是否合法,以及属性信息是否符合设计;
* `InferShape`: 用于设置输出Tensor的形状信息;
* `CreateKernels`: 创建相关的kernel;
* `Attach`: 用于从 `Scope` 和 `OpDesc` 中获取参数的指针,并传递给kernel;
重要方法及声明如下:
```c++
class OpLite : public Registry {
public:
OpLite() = default;
explicit OpLite(const std::string &type) : op_type_(type) {}
explicit OpLite(const std::vector<Place> &valid_places)
: valid_places_(valid_places) {}
void SetValidPlaces(const std::vector<Place> &places) {
VLOG(3) << "valid places " << valid_places_.size();
valid_places_ = places;
}
// Set supported places
const std::vector<Place> &valid_places() const { return valid_places_; }
// Check the shape.
virtual bool CheckShape() const { return true; }
// Inference the outputs' shape.
virtual bool InferShape() const { return true; }
// Run this operator.
virtual bool Run();
// Link the external execution environ to internal context.
bool Attach(const cpp::OpDesc &opdesc, lite::Scope *scope);
// Create all the kernels for the valid targets.
std::vector<std::unique_ptr<KernelBase>> CreateKernels(
const std::vector<Place> &places, const std::string &kernel_type = "");
// Assign op param to kernel.
virtual void AttachKernel(KernelBase *kernel) = 0;
};
```
### KernelLite
为了提升kernel对`Target`, `Precision`, `DataLayout`等多种执行模式的支持,引入了`KernelLite`的概念,它主要有以下特点:
* 可以通过模版特化不同`Place`和kernel的实现,加强对不同执行模式的支持;
* 轻量级,`KernelLite`类似functor,只有执行的职能,执行效率更高;
* 每个kernel有明确执行的模式,并且可以在analysis time参与分析;
* 依赖简单,便于部署到mobile执行;
* 硬件调度信息等`context`跟具体的kernel绑定,方便定制不同kernel的行为。
重要的方法及声明如下:
```c++
template <TargetType Target, PrecisionType Precision,
DataLayoutType DataLayout = DataLayoutType::kNCHW>
class KernelLite : public KernelBase {
public:
// Run the kernel.
virtual void Run() { CHECK(false) << "Not Implemented"; }
// Set target
TargetType target() const override { return Target; }
// Set precision
PrecisionType precision() const override { return Precision; }
// Set data layout
DataLayoutType layout() const override { return DataLayout; }
Place place() const override { return Place{Target, Precision, DataLayout}; }
void Touch() {}
KernelLite() = default;
virtual ~KernelLite() = default;
};
```
## 架构简介
Mobile 在这次升级为 Lite 架构后,侧重多硬件、高性能的支持,其主要设计思想如下:
- 引入 Type system,强化多硬件、量化方法、data layout 的混合调度能力
- 硬件细节隔离,通过不同编译开关,对支持的任何硬件可以自由插拔
- 引入 MIR(Machine IR) 的概念,强化带执行环境下的优化支持
- 优化期和执行期严格隔离,保证预测时轻量和高效率
架构图如下
![Paddle Inference Refactor1.0](https://github.com/Superjomn/_tmp_images/raw/master/images/lite.jpg)
## 增加新 Kernel的方法
下面主要介绍如何为 op 新增 kernel。简单总结,新增 kernel 的实现需要包含如下内容:
- kernel实现:继承自`KernelLite`类的对应op的Compute类定义与实现,根据输入的数据类型,数据布局,数据所在的设备以及运行时所调用的第三方库的不同实现不同的kernel;server端CPU kernel实现在.h文件中。
- kernel注册:server端CPU kernel注册实现在.cc文件。
## 实现C++类
以mul op的CPU Kernel实现为例,mul kernel执行运算的矩阵乘法的公式为*Out* = *X* * *Y*, 可见该计算由两个输入,一个输出组成; 输入输出参数分别从OP的param中获取,如mul op的param定义如下:
```c++
struct MulParam {
const lite::Tensor* x{};
const lite::Tensor* y{};
lite::Tensor* output{};
int x_num_col_dims{1};
int y_num_col_dims{1};
};
```
下面开始定义`MulCompute`类的实现:
```c++
template <typename T>
class MulCompute : public KernelLite<TARGET(kX86), PRECISION(kFloat)> {
public:
using param_t = operators::MulParam;
void Run() override {
auto& context = ctx_->As<X86Context>();
auto& param = *param_.get_mutable<operators::MulParam>();
CHECK(context.x86_device_context());
//1. 为output分配内存
param.output->template mutable_data<T>();
// 2. 获取计算用的输入输出
auto* x = &param.x->raw_tensor();
auto* y = &param.y->raw_tensor();
auto* z = &param.output->raw_tensor();
//3. 对输入输出数据进行需要的处理...
Tensor x_matrix, y_matrix;
if (x->dims().size() > 2) {
x_matrix = framework::ReshapeToMatrix(*x, param.x_num_col_dims);
} else {
x_matrix = *x;
}
//4. 调用数学库进行矩阵的运算...
auto blas = paddle::operators::math::GetBlas<platform::CPUDeviceContext, T>(
*context.x86_device_context());
blas.MatMul(x_matrix, y_matrix, z);
}
virtual ~MulCompute() = default;
};
```
`MulCompute`类继承自`KernelLite`,带有下面两个模板参数:
- `TARGET(kX86)`: `Target`代表的是硬件信息,如CUDA/X86/ARM/…,表示该kernel运行的硬件平台,在该示例中我们写的是kX86,表示mul这个kernel运行在X86平台上;
- `PRECISION(kFloat)`: `Precision`代表该kernel运算支持的数据精度信息,示例中写的是`kFloat`,表示mul这个kernel支持Float数据的运算;
需要为`MulCompute`类重写`Run`接口, kernel 的输入和输出分别通过`MulParam`获得,输入/输出的变量类型是`lite::Tensor`
到此,前向mul kernel的实现完成,接下来需要在.cc文件中注册该kernel。
## 注册kernel
在.cc文件中注册实现的kernel:
```c++
REGISTER_LITE_KERNEL(mul, kX86, kFloat, kNCHW,
paddle::lite::kernels::x86::MulCompute<float>, def)
.BindInput("X", {LiteType::GetTensorTy(TARGET(kX86))})
.BindInput("Y", {LiteType::GetTensorTy(TARGET(kX86))})
.BindOutput("Out", {LiteType::GetTensorTy(TARGET(kX86))})
.Finalize();
```
在上面的代码中:
- `REGISTER_LITE_KERNEL`: 注册MulCompute类,并特化模版参数为float类型, 类型名为mul, 运行的平台为X86, 数据精度为float, 数据布局为NCHW;
- 在运行时,框架系统根据输入数据所在的设备,输入数据的类型,数据布局等信息静态的选择合适的kernel执行运算。
## 开发环境
### Mobile端开发和测试
我们提供了移动端开发所需的docker镜像环境,在`paddle/fluid/lite/tools/Dockerfile.mobile`,可以直接通过
`docker build --file paddle/fluid/lite/tools/Dockerfile.mobile --tag paddle-lite-mobile:latest . `生成镜像文件。
该镜像中提供了
- Android端的交叉编译环境
- ARM Linux端的交叉编译环境
- Android端的模拟器环境
- 开发所需的格式检查工具
#### 相关的cmake选项
目前支持如下的编译配置,以生成不同目标上的程序。
- `ARM_TARGET_OS` 代表目标操作系统, 目前支持 "android" "armlinux", 默认是Android
- `ARM_TARGET_ARCH_ABI` 代表ARCH,支持输入"armv8"和"armv7",针对OS不一样选择不一样。
- `-DARM_TARGET_OS="android"`
- "armv8", 等效于 "arm64-v8a"。 default值为这个。
- "armv7", 等效于 "armeabi-v7a"。
- `-DARM_TARGET_OS="armlinux"`
- "armv8", 等效于 "arm64"。 default值为这个。
- "armv7hf", 等效于使用`eabihf``-march=armv7-a -mfloat-abi=hard -mfpu=neon-vfpv4 `
- "armv7", 等效于使用`eabi``-march=armv7-a -mfloat-abi=softfp -mfpu=neon-vfpv4`
- `ARM_TARGET_LANG` 代表目标编译的语言, 默认为gcc,支持 gcc和clang两种。
注意: ARM Linux当前仅支持在armv8上编译并测试。
#### 开发
添加新的ARM端kernel,主要分为3部分:
1. 添加具体的数学计算,在`paddle/fluid/lite/arm/math`中添加对应的数学函数,侧重点在于代码本身的优化,充分利用NEON指令发挥其优势。
2. 添加kernel声明和调用实例,在`paddle/fluid/lite/kernels/arm`中添加对应kernel的框架声明和调用,侧重点在于每种kernel严格对应输入输出的类型。
3. 添加单元测试,在`paddle/fluid/lite/kernels/arm`中添加相应的单元测试,并保持其在模拟器或者真机中可以通过。
#### 测试
我们在镜像开发环境中添加了 `arm64-v8a` 和 `armeabi-v7a` 的Android模拟环境,在没有真机环境下,可以很方便地用于测试对应平台上的单元测试。
常用步骤如下:
```shell
# 创建Android avd (armv8)
$ echo n | avdmanager create avd -f -n paddle-armv8 -k "system-images;android-24;google_apis;arm64-v8a"
# 启动Android armv8 emulator
$ ${ANDROID_HOME}/emulator/emulator -avd paddle-armv8 -noaudio -no-window -gpu off -verbose &
# 其他正常测试步骤
# 关闭所有模拟器
$ adb devices | grep emulator | cut -f1 | while read line; do adb -s $line emu kill; done
```
##########
移动端部署
##########
本模块介绍了飞桨的端侧推理引擎Paddle-Lite:
* `Paddle Lite <mobile_index.html>`_:简要介绍了 Paddle-Lite 特点以及使用说明。
.. toctree::
:hidden:
mobile_index.md
#################
Mobile Deployment
#################
# Paddle-Lite
Paddle-Lite为Paddle-Mobile的升级版,定位支持包括手机移动端在内更多场景的轻量化高效预测,支持更广泛的硬件和平台,是一个高性能、轻量级的深度学习预测引擎。在保持和PaddlePaddle无缝对接外,也兼容支持其他训练框架产出的模型。
完整使用文档位于 [Paddle-Lite 文档](https://paddle-lite.readthedocs.io/zh/latest/)
## 特性
### 轻量级
执行阶段和计算优化阶段实现良好解耦拆分,移动端可以直接部署执行阶段,无任何第三方依赖。
包含完整的80个 Op+85个 Kernel 的动态库,对于ARMV7只有800K,ARMV8下为1.3M,并可以裁剪到更低。
在应用部署时,载入模型即可直接预测,无需额外分析优化。
### 高性能
极致的 ARM CPU 性能优化,针对不同微架构特点实现kernel的定制,最大发挥计算性能,在主流模型上展现出领先的速度优势。
支持量化模型,结合[PaddleSlim 模型压缩工具](https://github.com/PaddlePaddle/models/tree/v1.5/PaddleSlim) 中量化功能,可以提供高精度高性能的预测能力。
在Huawei NPU、FPGA上也具有很好的性能表现。
最新性能数据位于 [Benchmark 文档](https://paddle-lite.readthedocs.io/zh/latest/benchmark/benchmark.html)
### 通用性
硬件方面,Paddle-Lite 的架构设计为多硬件兼容支持做了良好设计。除了支持ARM CPU、Mali GPU、Adreno GPU,还特别支持了华为 NPU,以及 FPGA 等边缘设备广泛使用的硬件。即将支持包括寒武纪、比特大陆在内的AI芯片,未来会增加对更多硬件的支持。
模型支持方面,Paddle-Lite和PaddlePaddle训练框架的Op对齐,提供更广泛的模型支持能力。目前已严格验证18个模型85个OP的精度和性能,对视觉类模型做到了较为充分的支持,覆盖分类、检测和定位,包含了特色的OCR模型的支持。未来会持续增加更多模型的支持验证。
框架兼容方面:除了PaddlePaddle外,对其他训练框架也提供兼容支持。当前,支持Caffe 和 TensorFlow 训练出来的模型,通过[X2Paddle](https://github.com/PaddlePaddle/X2Paddle) 转换工具实现。接下来将会对ONNX等格式模型提供兼容支持。
## 架构
Paddle-Lite 的架构设计着重考虑了对多硬件和平台的支持,并且强化了多个硬件在一个模型中混合执行的能力,多个层面的性能优化处理,以及对端侧应用的轻量化设计。
![](https://github.com/Superjomn/_tmp_images/raw/master/images/paddle-lite-architecture.png)
其中,Analysis Phase 包括了 MIR(Machine IR) 相关模块,能够针对具体的硬件列表,对原有模型的计算图进行包括算子融合、计算裁剪在内的多种优化。Execution Phase 只涉及到 Kernel 的执行,且可以单独部署,以支持极致的轻量级部署。
## Paddle-Mobile升级为Paddle-Lite的说明
原Paddle-Mobile作为一个致力于嵌入式平台的PaddlePaddle预测引擎,已支持多种硬件平台,包括ARM CPU、 Mali GPU、Adreno GPU,以及支持苹果设备的GPU Metal实现、ZU5、ZU9等FPGA开发板、树莓派等arm-linux开发板。在百度内已经过广泛业务场景应用验证。对应设计文档可参考: [mobile/README](https://github.com/PaddlePaddle/Paddle-Lite/blob/develop/mobile/README.md)
Paddle-Mobile 整体升级重构并更名为Paddle-Lite后,原paddle-mobile 的底层能力大部分已集成到[新架构 ](https://github.com/PaddlePaddle/Paddle-Lite/tree/develop/lite)下。作为过渡,暂时保留原Paddle-mobile代码。 主体代码位于 `mobile/` 目录中,后续一段时间会继续维护,并完成全部迁移。新功能会统一到[新架构 ](https://github.com/PaddlePaddle/Paddle-Lite/tree/develop/lite)下开发。
metal、web 模块相对独立,会继续在 `./metal` 和 `./web` 目录下开发和维护。对苹果设备的GPU Metal实现的需求及web前端预测需求,可以直接进入这两个目录。
## 致谢
Paddle-Lite 借鉴了以下开源项目:
- [ARM compute library](https://github.com/ARM-software/ComputeLibrary)
- [Anakin](https://github.com/PaddlePaddle/Anakin) ,Anakin对应底层的一些优化实现已被集成到Paddle-Lite。Anakin作为PaddlePaddle组织下的一个高性能预测项目,极具前瞻性,对Paddle-Lite有重要贡献。Anakin已和本项目实现整合。之后,Anakin不再升级。
## 交流与反馈
* 欢迎您通过Github Issues来提交问题、报告与建议
* 微信公众号:飞桨PaddlePaddle
* QQ群: 696965088
<p align="center"><img width="200" height="200" src="https://user-images.githubusercontent.com/45189361/64117959-1969de80-cdc9-11e9-84f7-e1c2849a004c.jpeg"/>&#8194;&#8194;&#8194;&#8194;&#8194;<img width="200" height="200" margin="500" src="https://user-images.githubusercontent.com/45189361/64117844-cb54db00-cdc8-11e9-8c08-24bbe594608e.jpeg"/></p>
<p align="center"> &#8194;&#8194;&#8194;微信公众号&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;官方技术交流QQ群</p>
* 论坛: 欢迎大家在[PaddlePaddle论坛](https://ai.baidu.com/forum/topic/list/168)分享在使用PaddlePaddle中遇到的问题和经验, 营造良好的论坛氛围
# 模型压缩
PaddleSlim是一个模型压缩工具库,包含模型剪裁、定点量化、知识蒸馏、超参搜索和模型结构搜索等一系列模型压缩策略。
对于业务用户,PaddleSlim提供完整的模型压缩解决方案,可用于图像分类、检测、分割等各种类型的视觉场景。
同时也在持续探索NLP领域模型的压缩方案。另外,PaddleSlim提供且在不断完善各种压缩策略在经典开源任务的benchmark,
以便业务用户参考。
对于模型压缩算法研究者或开发者,PaddleSlim提供各种压缩策略的底层辅助接口,方便用户复现、调研和使用最新论文方法。
PaddleSlim会从底层能力、技术咨询合作和业务场景等角度支持开发者进行模型压缩策略相关的创新工作。
## 功能
- 模型剪裁
- 卷积通道均匀剪裁
- 基于敏感度的卷积通道剪裁
- 基于进化算法的自动剪裁
- 定点量化
- 在线量化训练(training aware)
- 离线量化(post training)
- 知识蒸馏
- 支持单进程知识蒸馏
- 支持多进程分布式知识蒸馏
- 神经网络结构自动搜索(NAS)
- 支持基于进化算法的轻量神经网络结构自动搜索
- 支持One-Shot网络结构自动搜索
- 支持 FLOPS / 硬件延时约束
- 支持多平台模型延时评估
- 支持用户自定义搜索算法和搜索空间
## 安装
依赖:
Paddle >= 1.7.0
```bash
pip install paddleslim -i https://pypi.org/simple
```
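下面是一个极简的用法示意(静态图组网仅作演示,使用 `paddleslim.analysis.flops` 统计模型的FLOPs):
```python
import paddle.fluid as fluid
from paddleslim.analysis import flops

# 组一个简单的卷积网络,仅作演示
image = fluid.data(name="image", shape=[None, 3, 224, 224], dtype="float32")
conv = fluid.layers.conv2d(image, num_filters=32, filter_size=3)
out = fluid.layers.fc(conv, size=10)

print("FLOPs:", flops(fluid.default_main_program()))
```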
## 使用
- [快速开始](https://paddlepaddle.github.io/PaddleSlim/quick_start/index.html):通过简单示例介绍如何快速使用PaddleSlim。
- [进阶教程](https://paddlepaddle.github.io/PaddleSlim/tutorials/index.html):PaddleSlim高阶教程。
- [模型库](https://paddlepaddle.github.io/PaddleSlim/model_zoo.html):各个压缩策略在图像分类、目标检测和图像语义分割模型上的实验结论,包括模型精度、预测速度和可供下载的预训练模型。
- [API文档](https://paddlepaddle.github.io/PaddleSlim/api_cn/index.html)
- [Paddle检测库](https://github.com/PaddlePaddle/PaddleDetection/tree/master/slim):介绍如何在检测库中使用PaddleSlim。
- [Paddle分割库](https://github.com/PaddlePaddle/PaddleSeg/tree/develop/slim):介绍如何在分割库中使用PaddleSlim。
- [PaddleLite](https://paddlepaddle.github.io/Paddle-Lite/):介绍如何使用预测库PaddleLite部署PaddleSlim产出的模型。
## 部分压缩策略效果
### 分类模型
数据: ImageNet2012; 模型: MobileNetV1;
|压缩策略 |精度收益(baseline: 70.91%) |模型大小(baseline: 17.0M)|
|:---:|:---:|:---:|
| 知识蒸馏(ResNet50)| **+1.06%** |-|
| 知识蒸馏(ResNet50) + int8量化训练 |**+1.10%**| **-71.76%**|
| 剪裁(FLOPs-50%) + int8量化训练|**-1.71%**|**-86.47%**|
### 图像检测模型
#### 数据:Pascal VOC;模型:MobileNet-V1-YOLOv3
| 压缩方法 | mAP(baseline: 76.2%) | 模型大小(baseline: 94MB) |
| :---------------------: | :------------: | :------------:|
| 知识蒸馏(ResNet34-YOLOv3) | **+2.8%** | - |
| 剪裁 FLOPs -52.88% | **+1.4%** | **-67.76%** |
|知识蒸馏(ResNet34-YOLOv3)+剪裁(FLOPs-69.57%)| **+2.6%**|**-67.00%**|
#### 数据:COCO;模型:MobileNet-V1-YOLOv3
| 压缩方法 | mAP(baseline: 29.3%) | 模型大小|
| :---------------------: | :------------: | :------:|
| 知识蒸馏(ResNet34-YOLOv3) | **+2.1%** |-|
| 知识蒸馏(ResNet34-YOLOv3)+剪裁(FLOPs-67.56%) | **-0.3%** | **-66.90%**|
### 搜索
数据:ImageNet2012; 模型:MobileNetV2
|硬件环境 | 推理耗时 | Top1准确率(baseline:71.90%) |
|:---------------:|:---------:|:--------------------:|
| RK3288 | **-23%** | +0.07% |
| Android cellphone | **-20%** | +0.16% |
| iPhone 6s | **-17%** | +0.32% |
Model Compression
==================
PaddleSlim is a toolkit for model compression. It contains a collection of compression strategies, such as pruning, fixed point quantization, knowledge distillation, hyperparameter searching and neural architecture search.
PaddleSlim provides solutions of compression on computer vision models, such as image classification, object detection and semantic segmentation. Meanwhile, PaddleSlim keeps exploring advanced compression strategies for language models. Furthermore, benchmarks of compression strategies on some open tasks are available for your reference.
PaddleSlim also provides auxiliary and primitive APIs for developers and researchers to survey, implement and apply the methods in the latest papers. PaddleSlim supports developers with underlying framework capabilities and technology consulting.
Features
----------
Pruning
+++++++++
- Uniform pruning of convolution
- Sensitivity-based pruning
- Automated pruning based on an evolutionary search strategy
- Support pruning of various deep architectures such as VGG, ResNet, and MobileNet.
- Support self-defined range of pruning, i.e., layers to be pruned.
Fixed Point Quantization
++++++++++++++++++++++++
- Training aware
- Dynamic strategy: During inference, we quantize models with hyperparameters dynamically estimated from small batches of samples.
- Static strategy: During inference, we quantize models with the same hyperparameters estimated from training data.
- Support layer-wise and channel-wise quantization.
- Post training
Knowledge Distillation
+++++++++++++++++++++++
- Naive knowledge distillation: transfers dark knowledge by merging the teacher and student model into the same Program
- Paddle large-scale scalable knowledge distillation framework Pantheon: a universal solution for knowledge distillation, more flexible than the naive knowledge distillation, and easier to scale to the large-scale applications.
- Decouple the teacher and student models --- they run in different processes in the same or different nodes, and transfer knowledge via TCP/IP ports or local files;
- Friendly to assemble multiple teacher models and each of them can work in either online or offline mode independently;
- Merge knowledge from different teachers and make batch data for the student model automatically;
- Support the large-scale knowledge prediction of teacher models on multiple devices.
Neural Architecture Search
+++++++++++++++++++++++++++
- Neural architecture search based on evolution strategy.
- Support distributed search.
- One-Shot neural architecture search.
- Support FLOPs and latency constrained search.
- Support the latency estimation on different hardware and platforms.
Install
--------
Requires:
Paddle >= 1.7.0
.. code-block:: bash
pip install paddleslim -i https://pypi.org/simple
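
A minimal usage sketch (the static-graph network below is for demonstration only; it simply counts the FLOPs of a toy model with ``paddleslim.analysis.flops``):

.. code-block:: python

    import paddle.fluid as fluid
    from paddleslim.analysis import flops

    # build a toy convolutional network for demonstration
    image = fluid.data(name="image", shape=[None, 3, 224, 224], dtype="float32")
    conv = fluid.layers.conv2d(image, num_filters=32, filter_size=3)
    out = fluid.layers.fc(conv, size=10)

    print("FLOPs:", flops(fluid.default_main_program()))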
Usage
------
- `QuickStart <https://paddlepaddle.github.io/PaddleSlim/quick_start/index_en.html>`_ : Introduce how to use PaddleSlim by simple examples.
- `Advanced Tutorials <https://paddlepaddle.github.io/PaddleSlim/tutorials/index_en.html>`_ : Tutorials about advanced usage of PaddleSlim.
- `Model Zoo <https://paddlepaddle.github.io/PaddleSlim/model_zoo_en.html>`_ : Benchmark and pretrained models.
- `API Documents <https://paddlepaddle.github.io/PaddleSlim/api_en/index_en.html>`_
- `PaddleDetection <https://github.com/PaddlePaddle/PaddleDetection/tree/master/slim>`_ : Introduce how to use PaddleSlim in PaddleDetection library.
- `PaddleSeg <https://github.com/PaddlePaddle/PaddleSeg/tree/develop/slim>`_ : Introduce how to use PaddleSlim in PaddleSeg library.
- `PaddleLite <https://paddlepaddle.github.io/Paddle-Lite/>`_ : How to use PaddleLite to deploy models generated by PaddleSlim.
Performance
------------
Image Classification
+++++++++++++++++++++
Dataset: ImageNet2012; Model: MobileNetV1;
===================================================== =========================== ============================
Method Accuracy(baseline: 70.91%) Model Size(baseline: 17.0M)
===================================================== =========================== ============================
Knowledge Distillation(ResNet50) +1.06% -
Knowledge Distillation(ResNet50) + int8 quantization +1.10% -71.76%
Pruning(FLOPs-50%) + int8 quantization -1.71% -86.47%
===================================================== =========================== ============================
Object Detection
+++++++++++++++++
Dataset: Pascal VOC; Model: MobileNet-V1-YOLOv3
============================================================== ===================== ===========================
Method mAP(baseline: 76.2%) Model Size(baseline: 94MB)
============================================================== ===================== ===========================
Knowledge Distillation(ResNet34-YOLOv3) +2.8% -
Pruning(FLOPs -52.88%) +1.4% -67.76%
Knowledge Distillation(ResNet34-YOLOv3)+Pruning(FLOPs-69.57%)  +2.6%                 -67.00%
============================================================== ===================== ===========================
Dataset: COCO; Model: MobileNet-V1-YOLOv3
============================================================== ===================== ===========================
Method                                                         mAP(baseline: 29.3%)  Model Size
============================================================== ===================== ===========================
Knowledge Distillation(ResNet34-YOLOv3) +2.1% -
Knowledge Distillation(ResNet34-YOLOv3)+Pruning(FLOPs-67.56%)  -0.3%                 -66.90%
============================================================== ===================== ===========================
NAS
++++++
Dataset: ImageNet2012; Model: MobileNetV2
=================== ================ ===============================
Device Infer time cost Top1 accuracy(baseline:71.90%)
=================== ================ ===============================
RK3288 -23% +0.07%
Android cellphone -20% +0.16%
iPhone 6s -17% +0.32%
=================== ================ ===============================
.. _cluster_howto:
分布式训练使用手册
====================
分布式训练基本思想
---------------
分布式深度学习训练通常分为两种并行化方法:数据并行,模型并行,参考下图:
.. image:: src/parallelism.png
在模型并行方式下,模型的层和参数将被分布在多个节点上,模型在一个mini-batch的前向和反向训练中,将经过多次跨\
节点之间的通信。每个节点只保存整个模型的一部分;在数据并行方式下,每个节点保存有完整的模型的层和参数,每个节点\
独自完成前向和反向计算,然后完成梯度的聚合并同步的更新所有节点上的参数。Fluid目前版本仅提供数据并行方式,另外\
诸如模型并行的特例实现(超大稀疏模型训练)功能将在后续的文档中予以说明。
在数据并行模式的训练中,Fluid使用了两种通信模式,用于应对不同训练任务对分布式训练的要求,分别为RPC通信和Collective
通信。其中RPC通信方式使用 `gRPC <https://github.com/grpc/grpc/>`_ ,Collective通信方式使用
`NCCL2 <https://developer.nvidia.com/nccl>`_ 。
**RPC通信和Collective通信的横向对比如下:**
.. csv-table::
:header: "Feature", "Collective", "RPC"
"Ring-Based通信", "Yes", "No"
"异步训练", "Yes", "Yes"
"分布式模型", "No", "Yes"
"容错训练", "No", "Yes"
"性能", "Faster", "Fast"
- RPC通信方式的结构:
.. image:: src/dist_train_pserver.png
使用RPC通信方式的数据并行分布式训练,会启动多个pserver进程和多个trainer进程,每个pserver进程\
会保存一部分模型参数,并负责接收从trainer发送的梯度并更新这些模型参数;每个trainer进程会保存一份\
完整的模型,并使用一部分数据进行训练,然后向pserver发送梯度,最后从pserver拉取更新后的参数。
pserver进程可以在和trainer完全不同的计算节点上,也可以和trainer公用节点。一个分布式任务所需要的\
pserver进程个数通常需要根据实际情况调整,以达到最佳的性能,然而通常来说pserver的进程不会比trainer\
更多。
**注:** 在使用GPU训练时,pserver可以选择使用GPU或只使用CPU,如果pserver也使用GPU,则会增加一次从CPU拷贝\
接收到的梯度数据到GPU的开销,在某些情况下会导致整体训练性能降低。
**注:** 在使用GPU训练时,如果每个trainer节点有多个GPU卡,则会先在每个trainer节点的多个卡之间执行\
NCCL2通信方式的梯度聚合,然后再通过pserver聚合多个节点的梯度。
- NCCL2通信方式的结构:
.. image:: src/dist_train_nccl2.png
使用NCCL2(Collective通信方式)进行分布式训练,是不需要启动pserver进程的,每个trainer进程都保存\
一份完整的模型参数,在完成计算梯度之后通过trainer之间的相互通信,Reduce梯度数据到所有节点的所有设备\
然后每个节点在各自完成参数更新。
使用parameter server方式的训练
------------------------------
使用 :code:`transpiler` API可以把单机可以执行的程序快速转变成可以分布式执行的程序。在不同的服务器节点
上,通过传给 :code:`transpiler` 对应的参数,以获取当前节点需要执行的 :code:`Program` 。
需要配置参数包括
++++++++++++++++++
.. csv-table::
:header: "参数", "说明"
"role", "\ **必选**\ 区分作为pserver启动还是trainer启动,不传给transpile,也可以用其他的变量名或环境变量"
"trainer_id", "\ **必选**\ 如果是trainer进程,用于指定当前trainer在任务中的唯一id,从0开始,在一个任务中需保证不重复"
"pservers", "\ **必选**\ 当前任务所有pserver的ip:port列表字符串,形式比如:127.0.0.1:6170,127.0.0.1:6171"
"trainers", "\ **必选**\ trainer节点的个数"
"sync_mode", "\ **可选**\ True为同步模式,False为异步模式"
"startup_program", "\ **可选**\ 如果startup_program不是默认的fluid.default_startup_program(),需要传入此参数"
"current_endpoint", "\ **可选**\ 只有NCCL2模式需要传这个参数"
一个例子,假设有两个节点,分别是 :code:`192.168.1.1` 和 :code:`192.168.1.2` ,使用端口6170,启动4个trainer,
则代码可以写成:
.. code-block:: python
role = "PSERVER"
trainer_id = 0 # get actual trainer id from cluster
pserver_endpoints = "192.168.1.1:6170,192.168.1.2:6170"
current_endpoint = "192.168.1.1:6170" # get actual current endpoint
trainers = 4
t = fluid.DistributeTranspiler()
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers)
if role == "PSERVER":
pserver_prog = t.get_pserver_program(current_endpoint)
pserver_startup = t.get_startup_program(current_endpoint,
pserver_prog)
exe.run(pserver_startup)
exe.run(pserver_prog)
elif role == "TRAINER":
train_loop(t.get_trainer_program())
选择同步或异步训练
++++++++++++++++++
Fluid分布式任务可以支持同步训练或异步训练,在同步训练方式下,所有的trainer节点,会在每个mini-batch
同步地合并所有节点的梯度数据并发送给parameter server完成更新,在异步训练方式下,每个trainer没有相互\
同步等待的过程,可以独立地更新parameter server的参数。通常情况下,使用异步训练方式,可以在trainer节点\
更多的时候比同步训练方式有更高的总体吞吐量。
在调用 :code:`transpile` 函数时,默认会生成同步训练的分布式程序,通过指定 :code:`sync_mode=False`
参数即可生成异步训练的程序:
.. code-block:: python
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, sync_mode=False)
选择是否使用分布式embedding表进行训练
++++++++++++++++++++++++++++++++++++
embedding被广泛应用在各种网络结构中,尤其是文本处理相关的模型。在某些场景,例如推荐系统或者搜索引擎中,
embedding的feature id可能会非常多,当feature id达到一定数量时,embedding参数会变得很大,一方面可能
单机内存无法存放导致无法训练,另一方面普通的训练模式每一轮迭代都需要同步完整的参数,参数太大会让通信变得
非常慢,进而影响训练速度。
Fluid支持千亿量级超大规模稀疏特征embedding的训练,embedding参数只会保存在parameter server上,通过
参数prefetch和梯度稀疏更新的方法,大大减少通信量,提高通信速度。
该功能只对分布式训练有效,单机无法使用。
需要配合稀疏更新一起使用。
使用方法,在配置embedding的时候,加上参数 :code:`is_distributed=True` 以及 :code:`is_sparse=True` 即可。
参数 :code:`dict_size` 定义数据中总的id的数量,id可以是int64范围内的任意值,只要总id个数小于等于dict_size就可以支持。
所以配置之前需要预估一下数据中总的feature id的数量。
.. code-block:: python
emb = fluid.layers.embedding(
is_distributed=True,
input=input,
size=[dict_size, embedding_width],
is_sparse=True)
选择参数分布方法
++++++++++++++++
参数 :code:`split_method` 可以指定参数在parameter server上的分布方式。
Fluid默认使用 `RoundRobin <https://en.wikipedia.org/wiki/Round-robin_scheduling>`_
方式将参数分布在多个parameter server上。此方式在默认未关闭参数切分的情况下,参数会较平均的分布在所有的
parameter server上。如果需要使用其他,可以传入其他的方法,目前可选的方法有: :code:`RoundRobin` 和
:code:`HashName` 。也可以使用自定义的分布方式,只需要参考
`这里 <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/transpiler/ps_dispatcher.py#L44>`_
编写自定义的分布函数。
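
例如,下面的片段将参数分布方法指定为 :code:`HashName`(仅为示意,:code:`split_method` 的具体传入位置请以所用版本的 :code:`transpile` 接口为准):

.. code-block:: python

    from paddle.fluid.transpiler.ps_dispatcher import HashName

    t = fluid.DistributeTranspiler()
    t.transpile(trainer_id,
                pservers=pserver_endpoints,
                trainers=trainers,
                split_method=HashName)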
关闭切分参数
++++++++++++
参数 :code:`slice_var_up` 指定是否将较大(大于8192个元素)的参数切分到多个parameter server以均衡计算负载,默认为开启。
当模型中的可训练参数体积比较均匀,或者使用自定义的参数分布方法使参数均匀分布在多个parameter server上时,
可以选择关闭切分参数,这样可以降低切分和重组带来的计算和拷贝开销:
.. code-block:: python
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, slice_var_up=False)
开启内存优化
++++++++++++
在parameter server分布式训练模式下,要开启内存优化 :code:`memory_optimize` 和单机相比,需要注意按照下面的规则配置:
* 在pserver端,\ **不要**\ 执行 :code:`memory_optimize`
* 在trainer端,先执行 :code:`fluid.memory_optimize` 再执行 :code:`t.transpile()`
* 在trainer端,调用 :code:`memory_optimize` 需要增加 :code:`skip_grads=True` 确保发送的梯度不会被重命名: :code:`fluid.memory_optimize(input_program, skip_grads=True)`
示例:
.. code-block:: python
    if role == "TRAINER":
        fluid.memory_optimize(fluid.default_main_program(), skip_grads=True)
    t = fluid.DistributeTranspiler()
    t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers)
    if role == "PSERVER":
        pass  # start pserver here
    elif role == "TRAINER":
        pass  # start trainer here
使用NCCL2通信方式的训练
--------------------
NCCL2模式的分布式训练,由于没有parameter server角色,是trainer之间互相通信,使用时注意:
* 配置 :code:`fluid.DistributeTranspilerConfig` 中 :code:`mode="nccl2"` 。
* 调用 :code:`transpile` 时,:code:`trainers` 传入所有trainer节点的endpoint,并且传入参数 :code:`current_endpoint` 。
在此步骤中,会在 :code:`startup program` 中增加 :code:`gen_nccl_id_op` 用于在多机程序初始化时同步NCCLID信息。
* 初始化 :code:`ParallelExecutor` 时传入 :code:`num_trainers` 和 :code:`trainer_id` 。
在此步骤中,:code:`ParallelExecutor` 会使用多机方式初始化NCCL2并可以开始在多个节点对每个参数对应的梯度执行跨节点的
:code:`allreduce` 操作,执行多机同步训练
一个例子:
.. code-block:: python
trainer_id = 0 # get actual trainer id here
trainers = "192.168.1.1:6170,192.168.1.2:6170"
current_endpoint = "192.168.1.1:6170"
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
t.transpile(trainer_id, trainers=trainers, current_endpoint=current_endpoint)
exe = fluid.ParallelExecutor(use_cuda,
loss_name=loss_name, num_trainers=len(trainers.split(",")), trainer_id=trainer_id)
...
NCCL2模式必要参数说明
++++++++++++++++++++++++++++++++++++++
.. csv-table::
:header: "参数", "说明"
"trainer_id", "(int) 任务中每个trainer节点的唯一ID,从0开始,不能有重复"
"trainers", "(int) 任务中所有trainer节点的endpoint,用于在NCCL2初始化时,广播NCCL ID"
"current_endpoint", "(string) 当前节点的endpoint"
目前使用NCCL2进行分布式训练仅支持同步训练方式。使用NCCL2方式的分布式训练,更适合模型体积较大,并需要使用\
同步训练和GPU训练,如果硬件设备支持RDMA和GPU Direct,可以达到很高的分布式训练性能。
启动多进程模式 NCCL2 分布式训练作业
+++++++++++++++++++++++++++++++++
通常情况下使用多进程模式启动 NCCL2 分布式训练作业可以获得更好的训练性能,Paddle 提供了
:code:`paddle.distributed.launch` 模块可以方便地启动多进程作业,启动后每个训练进程将会使用一块独立的 GPU 设备。
使用时需要注意:
* 设置节点数:通过环境变量 :code:`PADDLE_NUM_TRAINERS` 设置作业的节点数,此环境变量也会被设置在每个训练进程中。
* 设置每个节点的设备数:通过启动参数 :code:`--gpus` 可以设置每个节点的 GPU 设备数量,每个进程的序号将会被自动设置在环境变量
:code:`PADDLE_TRAINER_ID` 中。
* 数据切分: 多进程模式是每个设备一个进程,一般来说需要每个进程处理一部分训练数据,并且保证所有进程能够处理完整的数据集。
* 入口文件:入口文件为实际启动的训练脚本。
* 日志:每个训练进程的日志默认会保存在 :code:`./mylog` 目录下,您也可以通过参数 :code:`--log_dir` 进行指定。
启动样例:
.. code-block:: bash
> PADDLE_NUM_TRAINERS=<TRAINER_COUNT> python -m paddle.distributed.launch --gpus <NUM_GPUS_ON_HOSTS> <ENTRYPOINT_SCRIPT> --arg1 --arg2 ...
NCCL2分布式训练注意事项
+++++++++++++++++++++
**注意:** 使用NCCL2模式分布式训练时,需要确保每个节点训练等量的数据,防止在最后一轮训练中任务不退出。通常有两种方式:
- 随机采样一些数据,补全分配到较少数据的节点上。(推荐使用这种方法,以训练完整的数据集)。
- 在python代码中,每个节点每个pass只训练固定的batch数,如果这个节点数据较多,则不训练这些多出来的数据。
**说明:** 使用NCCL2模式分布式训练时,如果只希望使用一个节点上的部分卡,可以通过配置环境变量 :code:`export CUDA_VISIBLE_DEVICES=0,1,2,3` 来指定。
**注意:** 如果系统中有多个网络设备,需要手动指定NCCL2使用的设备,假设需要使用 :code:`eth2` 为通信设备,需要设定如下环境变量:
.. code-block:: bash
export NCCL_SOCKET_IFNAME=eth2
另外NCCL2提供了其他的开关环境变量,比如指定是否开启GPU Direct,是否使用RDMA等,详情可以参考
`ncclknobs <https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/index.html#ncclknobs>`_ 。
.. _cluster_howto_en:
Manual for Distributed Training with Fluid
==========================================
Basic Idea Of Distributed Training
-------------------------------------
Distributed deep learning training is usually divided into two parallelization methods: data parallelism, model parallelism. Refer to the following figure:
.. image:: src/parallelism.png
In the model parallelism mode, the layers and parameters of the model will be distributed on multiple nodes. The model will go through multiple communications across nodes in the feeding forward and back propagation training of a mini-batch. Each node only saves a part of the entire model;
In data parallelism mode, each node holds the complete layers and parameters of the model, each node performs feeding forward and back propagation calculations on its own, and then conducts the aggregation of the gradients and updates the parameters on all nodes synchronously.
Current version of Fluid only provides data parallelism mode. In addition, implementations of special cases in model parallelism mode (e.g. large sparse model training ) will be explained in subsequent documents.
In the training of data parallelism mode, Fluid uses two communication modes to deal with the requirements of distributed training for different training tasks, namely RPC Communication and Collective Communication. The RPC communication method uses `gRPC <https://github.com/grpc/grpc/>`_ , Collective communication method uses `NCCL2 <https://developer.nvidia.com/nccl>`_ .
.. csv-table:: The table above is a horizontal comparison of RPC communication and Collective communication
:header: "Feature", "Collective", "RPC"
"Ring-Based Communication", "Yes", "No"
"Asynchronous Training", "Yes", "Yes"
"Distributed Model", "No", "Yes"
"Fault-tolerant Training", "No", "Yes"
"Performance", "Faster", "Fast"
- Structure of RPC Communication Method:
.. image:: src/dist_train_pserver.png
Data-parallelised distributed training in RPC communication mode will start multiple pserver processes and multiple trainer processes, each pserver process will save a part of the model parameters and be responsible for receiving the gradients sent from the trainers and updating these model parameters; Each trainer process will save a copy of the complete model, and use a part of the data to train, then send the gradients to the pservers, finally pull the updated parameters from the pserver.
The pserver process can be on a compute node that is completely different from the trainer, or it can share the same node with a trainer. The number of pserver processes required for a distributed task usually needs to be adjusted according to the actual situation to achieve the best performance. However, usually pserver processes are no more than trainer processes.
**Note:** When using GPU training, the pserver can choose to use the GPU or only use the CPU. If the pserver also uses the GPU, it will result in the extra overhead of copying the gradient data received from the CPU to the GPU. In some cases, the overall training performance will be degraded.
**Note:** When using GPU training, if there are multiple GPU cards in each trainer node, the gradients are first aggregated among the cards within a node via NCCL2, and then aggregated across nodes through the pservers.
- Structure of NCCL2 communication method:
.. image:: src/dist_train_nccl2.png
NCCL2 (Collective communication method) for distributed training avoids the need for pserver processes. Each trainer process holds a complete copy of the model parameters. After the gradients are computed, the trainers aggregate (allreduce) the gradient data across all devices of all nodes through mutual communication, and then each node completes its own parameter update.
Training in the Parameter Server Manner
----------------------------------------------
Use the :code:`transpiler` API to quickly convert a program that can be executed on a single machine into a program that can be executed in a distributed manner. On different server nodes, pass values to corresponding arguments at :code:`transpiler` to get the :code:`Program` which current node is to execute:
.. csv-table:: required configuration parameters
:header: "parameter", "description"
"role", "\ **required**\ distinguishes whether to start as pserver or trainer, this arugument is not passed into ``transpile`` , you can also use other variable names or environment variables"
"trainer_id", "\ **required**\ If it is a trainer process, it is used to specify the unique id of the current trainer in the task, starting from 0, and must be guaranteed not to be repeated in one task"
"pservers", "\ **required**\ ip:port list string of all pservers in current task, for example: 127.0.0.1:6170,127.0.0.1:6171"
"trainers", "\ **required**\ the number of trainer nodes"
"sync_mode", "\ **optional**\ True for synchronous mode, False for asynchronous mode"
"startup_program", "\ **optional**\ If startup_program is not the default fluid.default_startup_program(), this parameter needs to be passed in"
"current_endpoint", "\ **optional**\ This parameter is only required for NCCL2 mode"
For example, suppose there are two nodes, namely :code:`192.168.1.1` and :code:`192.168.1.2`, use port 6170 to start 4 trainers.
Then the code can be written as:
.. code-block:: python
role = "PSERVER"
trainer_id = 0 # get actual trainer id from cluster
pserver_endpoints = "192.168.1.1:6170,192.168.1.2:6170"
current_endpoint = "192.168.1.1:6170" # get actual current endpoint
trainers = 4
t = fluid.DistributeTranspiler()
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers)
if role == "PSERVER":
pserver_prog = t.get_pserver_program(current_endpoint)
pserver_startup = t.get_startup_program(current_endpoint, pserver_prog)
exe.run(pserver_startup)
exe.run(pserver_prog)
elif role == "TRAINER":
train_loop(t.get_trainer_program())
Choose Synchronous Or Asynchronous Training
+++++++++++++++++++++++++++++++++++++++++++++
Fluid distributed tasks support synchronous training or asynchronous training.
In the synchronous training mode, all trainer nodes will merge the gradient data of all nodes synchronously per mini-batch and send them to the parameter server to complete the update.
In the asynchronous mode, the trainers do not wait for each other and update the parameters on the parameter server independently.
In general, the asynchronous training mode can achieve higher overall throughput than the synchronous mode when there are many trainer nodes.
When the :code:`transpile` function is called, the distributed training program is generated by default. The asynchronous training program can be generated by specifying the :code:`sync_mode=False` parameter:
.. code-block:: python
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, sync_mode=False)
Whether To Use The Distributed Embedding Table For Training
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Embedding is widely used in various network structures, especially text processing related models.
In some scenarios, such as recommendation systems or search engines, the number of feature ids of embedding may be very large. When it reaches a certain number, the embedding parameter will become very large.
On the one hand, the memory of the single machine may not be competent, resulting in the inability to train.
On the other hand, the normal training mode needs to synchronize the complete set of parameters for each iteration. If the parameter is too large, the communication will become very slow, which will affect the training speed.
Fluid supports the training of very large scale sparse features embedding at hundred billion level. The embedding parameter is only saved on the parameter server. The parameter prefetch and gradient sparse update method greatly reduce the traffic and improve the communication speed.
This feature is only valid for distributed training and cannot be used on a single machine. Need to be used with sparse updates.
Usage: When configuring embedding, add the parameters :code:`is_distributed=True` and :code:`is_sparse=True`.
Parameters :code:`dict_size` Defines the total number of ids in the data. The id can be any value in the int64 range. As long as the total number of ids is less than or equal to dict_size, it can be supported.
So before you configure, you need to estimate the total number of feature ids in the data.
.. code-block:: python
emb = fluid.layers.embedding(
is_distributed=True,
input=input,
size=[dict_size, embedding_width],
is_sparse=True)
Select Parameter Distribution Method
++++++++++++++++++++++++++++++++++++++
Parameter :code:`split_method` can specify how the parameters are distributed on the parameter servers.
Fluid uses `RoundRobin <https://en.wikipedia.org/wiki/Round-robin_scheduling>`_ by default to scatter parameters to multiple parameter servers.
In this case, the parameters are evenly distributed on all parameter servers in the case where the parameter segmentation is not turned off by default.
If you need another distribution method, you can pass it in. The currently available methods are :code:`RoundRobin` and :code:`HashName` . You can also use a customized distribution method; refer to `here <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/transpiler/ps_dispatcher.py#L44>`_
for how to write a customized distribution function.
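The sketch below shows how :code:`HashName` could be passed in place of the default; it is only a sketch, assuming that, like :code:`slice_var_up` in the next section, :code:`split_method` is accepted directly by :code:`transpile` in this version, and that the dispatcher classes live in the :code:`ps_dispatcher` module linked above:

.. code-block:: python

    import paddle.fluid as fluid
    from paddle.fluid.transpiler.ps_dispatcher import HashName

    trainer_id = 0  # get actual trainer id from cluster
    pserver_endpoints = "192.168.1.1:6170,192.168.1.2:6170"

    t = fluid.DistributeTranspiler()
    # Hash each parameter name onto a fixed pserver instead of round-robin placement
    t.transpile(trainer_id,
                pservers=pserver_endpoints,
                trainers=4,
                split_method=HashName)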
Turn Off the slice-up of Parameters
++++++++++++++++++++++++++++++++++++++
Parameter :code:`slice_var_up` specifies whether to split large (more than 8192 elements) parameters into multiple parameter servers to balance the computational load. The default is on.
When the sizes of the trainable parameters in the model are relatively uniform or a customized parameter distribution method is used, which evenly distributes the parameters on multiple parameter servers, you can choose to turn off the slice-up function, which reduces the computational and copying overhead of slicing and reorganization:
.. code-block:: python
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, slice_var_up=False)
Turn On Memory Optimization
++++++++++++++++++++++++++++++
In the parameter server distributed training mode, to enable memory optimization :code:`memory_optimize` , compared with a single machine, you need to pay attention to the following rules:
- On the pserver side, **don't** execute :code:`memory_optimize`
- On the trainer side, execute :code:`fluid.memory_optimize` and then execute :code:`t.transpile()`
- On the trainer side, calling :code:`memory_optimize` needs to add :code:`skip_grads=True` to ensure the gradient sent is not renamed : :code:`fluid.memory_optimize(input_program, skip_grads=True)`
Example:
.. code-block:: python
if role == "TRAINER":
fluid.memory_optimize(fluid.default_main_program(), skip_grads=True)
t = fluid.DistributeTranspiler()
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers)
if role == "PSERVER":
# start pserver here
elif role == "TRAINER":
# start trainer here
Training Using NCCL2 Communication
------------------------------------
In NCCL2-mode distributed training there is no parameter server role, and the trainers communicate with each other directly. Pay attention to the following tips:
* Configure :code:`mode="nccl2"` in :code:`fluid.DistributeTranspilerConfig` .
* When calling :code:`transpile`, pass the endpoints of all trainer nodes through :code:`trainers` and the endpoint of the current node through :code:`current_endpoint` .
  In this step, :code:`gen_nccl_id_op` will be added to the :code:`startup program` to synchronize the NCCL ID information during multi-node program initialization.
* Initialize :code:`ParallelExecutor` with :code:`num_trainers` and :code:`trainer_id` .
  In this step, :code:`ParallelExecutor` will initialize NCCL2 in the multi-node way and perform an :code:`allreduce` operation across nodes on the gradient of every parameter to execute multi-node training.
For example:
.. code-block:: python
trainer_id = 0 # get actual trainer id here
trainers = "192.168.1.1:6170,192.168.1.2:6170"
current_endpoint = "192.168.1.1:6170"
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
t.transpile(trainer_id, trainers=trainers, current_endpoint=current_endpoint)
exe = fluid.ParallelExecutor(use_cuda,
loss_name=loss_name, num_trainers=len(trainers.split(",")), trainer_id=trainer_id)
...
.. csv-table:: Description of the necessary parameters for NCCL2 mode
:header: "parameter", "description"
"trainer_id", "(int)The unique ID of each trainer node in the task, starting at 0, there cannot be any duplication"
"trainers", "(int)endpoints of all trainer nodes in the task, used to broadcast NCCL IDs when NCCL2 is initialized"
"current_endpoint", "(string)endpoint of current node"
Currently, distributed training using NCCL2 only supports synchronous training. NCCL2 mode is more suitable for models that are relatively large and need synchronous training on GPUs. If the hardware devices support RDMA and GPU Direct, this can achieve high distributed training performance.
Start Up NCCL2 Distributed Training in Multi-Process Mode
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Usually you can get better multi-node training performance by launching the NCCL2 distributed training job in multi-process mode. Paddle provides the :code:`paddle.distributed.launch` module to start such multi-process jobs, and each training process will then use an independent GPU device.
Attention during usage:
* Set the number of nodes: set the number of nodes of a job through the environment variable :code:`PADDLE_NUM_TRAINERS` ; this variable will also be set in every training process.
* Set the number of devices of each node: the parameter :code:`--gpus` sets the number of GPU devices used on each node, and the sequence number of each process is set in the environment variable :code:`PADDLE_TRAINER_ID` automatically.
* Data segmentation: multi-process mode means one process per device. Generally, each process manages its own part of the training data, so that together all processes cover the whole data set (see the sketch after this list).
* Entrance file: the entrance file is the training script that is actually started.
* Logs: the log of each training process is saved in the :code:`./mylog` directory by default; you can change this with the parameter :code:`--log_dir` .
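For the data segmentation item above, the following is a minimal sketch of per-process file-list sharding; the file names are purely illustrative, and for brevity the environment variable :code:`PADDLE_NUM_TRAINERS` is treated here as the total number of training processes:

.. code-block:: python

    import os

    # Every process sees the same full file list (illustrative paths) ...
    all_files = ["./train_data/part-%05d" % i for i in range(100)]

    # ... but only trains on its own shard, selected through the variables
    # set by paddle.distributed.launch.
    trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
    trainer_num = int(os.getenv("PADDLE_NUM_TRAINERS", "1"))
    my_files = all_files[trainer_id::trainer_num]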
startup example:
.. code-block:: bash
> PADDLE_NUM_TRAINERS=<TRAINER_COUNT> python -m paddle.distributed.launch --gpus <NUM_GPUS_ON_HOSTS> <ENTRYPOINT_SCRIPT> --arg1 --arg2 ...
Important Notes on NCCL2 Distributed Training
++++++++++++++++++++++++++++++++++++++++++++++
**Note:** When using distributed training in NCCL2 mode, if you only want to use a part of the cards on one node, you can specify them by setting the environment variable :code:`export CUDA_VISIBLE_DEVICES=0,1,2,3` .
**Note:** Please ensure each node has the same amount of data to train in NCCL2-mode distributed training, which prevents
an unexpected exit at the final iteration. There are two common ways:
- Randomly sample some data to pad the nodes that received less data. (This method is recommended, so that the complete dataset is still trained.)
- Each node only trains a fixed number of batches per pass, controlled in the Python code. If a node has more data than this fixed amount, the extra data is simply not trained.
**Note:** If there are multiple network devices in the system, you need to manually specify the devices used by NCCL2.
Assuming you need to use :code:`eth2` as the communication device, you need to set the following environment variables:
.. code-block:: bash
export NCCL_SOCKET_IFNAME=eth2
In addition, NCCL2 provides other switch environment variables, such as whether to enable GPU Direct, whether to use RDMA, etc. For details, please refer to
`ncclknobs <https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/index.html#ncclknobs>`_ .
.. _cluster_quick_start:
分布式训练快速开始
==================
使用Fleet API进行分布式训练
---------------------------
从Paddle Fluid `Release 1.5.1 <https://github.com/PaddlePaddle/Paddle/releases/tag/v1.5.1>`_ 开始,官方推荐使用Fleet API进行分布式训练,关于Fleet API的介绍可以参考 `Fleet Design Doc <https://github.com/PaddlePaddle/Fleet>`_
准备条件
^^^^^^^^
*
[x] 成功安装Paddle Fluid,如果尚未安装,请参考 `快速开始 <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.7/beginners_guide/quick_start_cn.html>`_
*
[x] 学会最基本的单机训练方法,请参考 `单机训练 <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/training/single_node.html>`_ 中描述的单卡训练,进行学习
点击率预估任务
^^^^^^^^^^^^^^
本文使用一个简单的示例,点击率预估任务,来说明如何使用Fleet API进行分布式训练的配置方法,并利用单机环境模拟分布式环境给出运行示例。示例的源码来自 `CTR with Fleet <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/ctr>`_
为了方便学习,这里给出的示例是单机与多机混合的代码,用户可以通过不同的启动命令进行单机或多机任务的启动。获取数据的部分,以及对数据预处理的逻辑可以参考 `CTR with Fleet <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/ctr>`_ 的源码和说明,这里不做过多描述。
.. code-block:: python
from __future__ import print_function
from args import parse_args
import os
import paddle.fluid as fluid
import sys
from network_conf import ctr_dnn_model_dataset
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
dense_feature_dim = 13
sparse_feature_dim = 10000001
batch_size = 100
thread_num = 10
embedding_size = 10
args = parse_args()
def main_function(is_local):
# common code for local training and distributed training
dense_input = fluid.layers.data(
name="dense_input", shape=[dense_feature_dim], dtype='float32')
sparse_input_ids = [
fluid.layers.data(name="C" + str(i), shape=[1], lod_level=1,
dtype="int64") for i in range(1, 27)]
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_use_var([dense_input] + sparse_input_ids + [label])
pipe_command = "python criteo_reader.py %d" % sparse_feature_dim
dataset.set_pipe_command(pipe_command)
dataset.set_batch_size(batch_size)
dataset.set_thread(thread_num)
whole_filelist = ["raw_data/part-%d" % x
for x in range(len(os.listdir("raw_data")))]
dataset.set_filelist(whole_filelist)
loss, auc_var, batch_auc_var = ctr_dnn_model_dataset(
dense_input, sparse_input_ids, label, embedding_size,
sparse_feature_dim)
exe = fluid.Executor(fluid.CPUPlace())
def train_loop(epoch=20):
for i in range(epoch):
exe.train_from_dataset(program=fluid.default_main_program(),
dataset=dataset,
fetch_list=[auc_var],
fetch_info=["auc"],
debug=False)
# local training
def local_train():
optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
optimizer.minimize(loss)
exe.run(fluid.default_startup_program())
train_loop()
# distributed training
def dist_train():
role = role_maker.PaddleCloudRoleMaker()
fleet.init(role)
strategy = DistributeTranspilerConfig()
strategy.sync_mode = False
optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
optimizer = fleet.distributed_optimizer(optimizer, strategy)
optimizer.minimize(loss)
if fleet.is_server():
fleet.init_server()
fleet.run_server()
elif fleet.is_worker():
fleet.init_worker()
exe.run(fluid.default_startup_program())
train_loop()
if is_local:
local_train()
else:
dist_train()
if __name__ == '__main__':
main_function(args.is_local)
* 说明:示例中使用的IO方法是dataset,想了解具体的文档和用法请参考 `Dataset API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.7/api_cn/dataset_cn.html>`_ 。示例中使用的 ``train_from_dataset`` 接口,想了解具体的文档和使用方法请参考 `Executor API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.7/api_cn/executor_cn.html>`_ 。示例中的 ``from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet`` 表示引入参数服务器架构进行分布式训练,如果想更进一步了解Fleet API的更多选项和示例,请参考 `Fleet API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.6/user_guides/howto/training/fleet_api_howto_cn.html>`_
单机训练启动命令
~~~~~~~~~~~~~~~~
.. code-block:: bash
python train.py --is_local 1
单机模拟分布式训练的启动命令
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
下面是在单机上模拟多机训练的启动命令。这里我们用到了paddle内置的启动器launch_ps,用户可以指定worker和server的数量来启动参数服务器任务:
.. code-block:: bash
python -m paddle.distributed.launch_ps --worker_num 2 --server_num 2 train.py
任务运行的日志可以在工作目录的logs目录下查看。当您能够使用单机模拟分布式训练后,就可以进行真正的多机分布式训练了。我们建议用户直接参考 `百度云运行分布式任务的示例 <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/training/deploy_ctr_on_baidu_cloud_cn.html>`_
Quick start for distributed training
====================================
Distributed training with Fleet API
-----------------------------------
Since Paddle Fluid `Release
1.5.1 <https://github.com/PaddlePaddle/Paddle/releases/tag/v1.5.1>`__,
it is officially recommended to use the Fleet API for distributed
training. For the introduction of the Fleet API, please refer to `Fleet
Design Doc <https://github.com/PaddlePaddle/Fleet>`__.
Preparation
~~~~~~~~~~~
- [x] Install Paddle Fluid. If not already installed, please refer to
`Beginner’s
Guide <https://www.paddlepaddle.org.cn/documentation/docs/en/1.7/beginners_guide/index_en.html>`__.
- [x] Master the most basic single node training method. Please refer
to the single card training described in `Single-node
training <https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/user_guides/howto/training/single_node_en.html>`__.
Click-through rate prediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here, we will use a simple example, click-through rate prediction task,
to illustrate how to configure Fleet API for distributed training, and
gives an example by using a single node environment to simulate the
distributed environment. The source code of the example comes from `CTR
with
Fleet <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/ctr>`__.
In order to facilitate learning, the example given here is a mixed code
of single node and multi node. You can start single node or multi node
tasks through different startup commands. For the part of obtaining data
and the logic of data preprocessing, please refer to the source code and
description of `CTR with
Fleet <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/ctr>`__.
.. code:: python
from __future__ import print_function
from args import parse_args
import os
import paddle.fluid as fluid
import sys
from network_conf import ctr_dnn_model_dataset
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
dense_feature_dim = 13
sparse_feature_dim = 10000001
batch_size = 100
thread_num = 10
embedding_size = 10
args = parse_args()
def main_function(is_local):
# common code for local training and distributed training
dense_input = fluid.layers.data(
name="dense_input", shape=[dense_feature_dim], dtype='float32')
sparse_input_ids = [
fluid.layers.data(name="C" + str(i), shape=[1], lod_level=1,
dtype="int64") for i in range(1, 27)]
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_use_var([dense_input] + sparse_input_ids + [label])
pipe_command = "python criteo_reader.py %d" % sparse_feature_dim
dataset.set_pipe_command(pipe_command)
dataset.set_batch_size(batch_size)
dataset.set_thread(thread_num)
whole_filelist = ["raw_data/part-%d" % x
for x in range(len(os.listdir("raw_data")))]
dataset.set_filelist(whole_filelist)
loss, auc_var, batch_auc_var = ctr_dnn_model_dataset(
dense_input, sparse_input_ids, label, embedding_size,
sparse_feature_dim)
exe = fluid.Executor(fluid.CPUPlace())
def train_loop(epoch=20):
for i in range(epoch):
exe.train_from_dataset(program=fluid.default_main_program(),
dataset=dataset,
fetch_list=[auc_var],
fetch_info=["auc"],
debug=False)
# local training
def local_train():
optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
optimizer.minimize(loss)
exe.run(fluid.default_startup_program())
train_loop()
# distributed training
def dist_train():
role = role_maker.PaddleCloudRoleMaker()
fleet.init(role)
strategy = DistributeTranspilerConfig()
strategy.sync_mode = False
optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
optimizer = fleet.distributed_optimizer(optimizer, strategy)
optimizer.minimize(loss)
if fleet.is_server():
fleet.init_server()
fleet.run_server()
elif fleet.is_worker():
fleet.init_worker()
exe.run(fluid.default_startup_program())
train_loop()
if is_local:
local_train()
else:
dist_train()
if __name__ == '__main__':
main_function(args.is_local)
- Note: The IO method used in this example is dataset, please refer to
`Dataset
API <https://www.paddlepaddle.org.cn/documentation/docs/en/1.7/api/dataset.html>`__
for specific documents and usage. For the ``train_from_dataset``
interface, please refer to `Executor
API <https://www.paddlepaddle.org.cn/documentation/docs/en/1.7/api/executor.html>`__.
``from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet``
in this example means to introduce parameter server architecture for
distributed training, which you can refer to `Fleet
API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.6/user_guides/howto/training/fleet_api_howto_cn.html>`__
for getting more about the options and examples of Fleet API.
Start command of single node training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: bash
python train.py --is_local 1
Start command of single machine simulation distributed training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here we use launch\_ps, a built-in launcher of paddle, with which users can
specify the number of workers and servers to start the parameter server
tasks.
.. code:: bash
python -m paddle.distributed.launch_ps --worker_num 2 --server_num 2 train.py
The task running log can be viewed in the logs directory under the working
directory. Once single-machine simulation of distributed training works, you
can move on to real multi-node distributed training. We recommend that users
refer directly to the
`example of running a distributed job on Baidu Cloud <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/training/deploy_ctr_on_baidu_cloud_cn.html>`__.
使用FleetAPI进行分布式训练
==========================
FleetAPI 设计说明
-----------------
Fleet是PaddlePaddle分布式训练的高级API。Fleet的命名出自于PaddlePaddle,象征一个舰队中的多只双桨船协同工作。Fleet的设计在易用性和算法可扩展性方面做出了权衡。用户可以很容易从单机版的训练程序,通过添加几行代码切换到分布式训练程序。此外,分布式训练的算法也可以通过Fleet
API接口灵活定义。具体的设计原理可以参考\ `Fleet
API设计文档 <https://github.com/PaddlePaddle/Fleet/blob/develop/README.md>`_\ 。当前FleetAPI还处于paddle.fluid.incubate目录下,未来功能完备后会放到paddle.fluid目录中,欢迎持续关注。
Fleet API快速上手示例
---------------------
下面会针对Fleet
API最常见的两种使用场景,用一个模型做示例,目的是让用户有快速上手体验的模板。快速上手的示例源代码可以在\ `Fleet Quick Start <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/quick-start>`_ 找到。
*
假设我们定义MLP网络如下:
.. code-block:: python
import paddle.fluid as fluid
def mlp(input_x, input_y, hid_dim=128, label_dim=2):
fc_1 = fluid.layers.fc(input=input_x, size=hid_dim, act='tanh')
fc_2 = fluid.layers.fc(input=fc_1, size=hid_dim, act='tanh')
prediction = fluid.layers.fc(input=[fc_2], size=label_dim, act='softmax')
cost = fluid.layers.cross_entropy(input=prediction, label=input_y)
avg_cost = fluid.layers.mean(x=cost)
return avg_cost
*
定义一个在内存生成数据的Reader如下:
.. code-block:: python
import numpy as np
def gen_data():
return {"x": np.random.random(size=(128, 32)).astype('float32'),
"y": np.random.randint(2, size=(128, 1)).astype('int64')}
*
单机Trainer定义
.. code-block:: python
import paddle.fluid as fluid
from nets import mlp
from utils import gen_data
input_x = fluid.data(name="x", shape=[None, 32], dtype='float32')
input_y = fluid.data(name="y", shape=[None, 1], dtype='int64')
cost = mlp(input_x, input_y)
optimizer = fluid.optimizer.SGD(learning_rate=0.01)
optimizer.minimize(cost)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
step = 1001
for i in range(step):
cost_val = exe.run(feed=gen_data(), fetch_list=[cost.name])
print("step%d cost=%f" % (i, cost_val[0]))
*
Parameter Server训练方法
参数服务器方法对于大规模数据,简单模型的并行训练非常适用,我们基于单机模型的定义给出使用Parameter Server进行训练的示例如下:
.. code-block:: python
import paddle.fluid as fluid
from nets import mlp
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.incubate.fleet.base import role_maker
from utils import gen_data
input_x = fluid.data(name="x", shape=[None, 32], dtype='float32')
input_y = fluid.data(name="y", shape=[None, 1], dtype='int64')
cost = mlp(input_x, input_y)
optimizer = fluid.optimizer.SGD(learning_rate=0.01)
role = role_maker.PaddleCloudRoleMaker()
fleet.init(role)
optimizer = fleet.distributed_optimizer(optimizer)
optimizer.minimize(cost)
if fleet.is_server():
fleet.init_server()
fleet.run_server()
elif fleet.is_worker():
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
step = 1001
for i in range(step):
cost_val = exe.run(
program=fluid.default_main_program(),
feed=gen_data(),
fetch_list=[cost.name])
print("worker_index: %d, step%d cost = %f" %
(fleet.worker_index(), i, cost_val[0]))
*
Collective训练方法
Collective Training通常在GPU多机多卡训练中使用,一般在复杂模型的训练中比较常见,我们基于上面的单机模型定义给出使用Collective方法进行分布式训练的示例如下:
.. code-block:: python
import paddle.fluid as fluid
from nets import mlp
from paddle.fluid.incubate.fleet.collective import fleet
from paddle.fluid.incubate.fleet.base import role_maker
from utils import gen_data
input_x = fluid.data(name="x", shape=[None, 32], dtype='float32')
input_y = fluid.data(name="y", shape=[None, 1], dtype='int64')
cost = mlp(input_x, input_y)
optimizer = fluid.optimizer.SGD(learning_rate=0.01)
role = role_maker.PaddleCloudRoleMaker(is_collective=True)
fleet.init(role)
optimizer = fleet.distributed_optimizer(optimizer)
optimizer.minimize(cost)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
step = 1001
for i in range(step):
cost_val = exe.run(
program=fluid.default_main_program(),
feed=gen_data(),
fetch_list=[cost.name])
print("worker_index: %d, step%d cost = %f" %
(fleet.worker_index(), i, cost_val[0]))
更多使用示例
------------
`点击率预估 <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/distribute_ctr>`_
`语义匹配 <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/simnet_bow>`_
`向量学习 <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/word2vec>`_
`基于Resnet50的图像分类 <https://github.com/PaddlePaddle/Fleet/tree/develop/benchmark/collective/resnet>`_
`基于Transformer的机器翻译 <https://github.com/PaddlePaddle/Fleet/tree/develop/benchmark/collective/transformer>`_
`基于Bert的语义表示学习 <https://github.com/PaddlePaddle/Fleet/tree/develop/benchmark/collective/bert>`_
Fleet API相关的接口说明
-----------------------
Fleet API接口
^^^^^^^^^^^^^
* init(role_maker=None)
* fleet初始化,需要在使用fleet其他接口前先调用,用于定义多机的环境配置
* is_worker()
* Parameter Server训练中使用,判断当前节点是否是Worker节点,是则返回True,否则返回False
* is_server(model_dir=None)
* Parameter Server训练中使用,判断当前节点是否是Server节点,是则返回True,否则返回False
* init_server()
* Parameter Server训练中,fleet加载model_dir中保存的模型相关参数进行parameter
server的初始化
* run_server()
* Parameter Server训练中使用,用来启动server端服务
* init_worker()
* Parameter Server训练中使用,用来启动worker端服务
* stop_worker()
* 训练结束后,停止worker
* distributed_optimizer(optimizer, strategy=None)
* 分布式优化算法装饰器,用户可带入单机optimizer,并配置分布式训练策略,返回一个分布式的optimizer
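下面是把上述接口按调用顺序串起来的一个最小示意(网络与数据仅为占位,假设使用参数服务器模式,训练循环省略):

.. code-block:: python

    import paddle.fluid as fluid
    from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
    from paddle.fluid.incubate.fleet.base import role_maker

    x = fluid.data(name="x", shape=[None, 32], dtype='float32')
    y = fluid.data(name="y", shape=[None, 1], dtype='int64')
    prediction = fluid.layers.fc(input=x, size=2, act='softmax')
    cost = fluid.layers.mean(fluid.layers.cross_entropy(input=prediction, label=y))

    role = role_maker.PaddleCloudRoleMaker()
    fleet.init(role)                              # 1. 初始化多机环境
    optimizer = fleet.distributed_optimizer(      # 2. 封装单机optimizer
        fluid.optimizer.SGD(learning_rate=0.01))
    optimizer.minimize(cost)

    if fleet.is_server():                         # 3. server端启动服务
        fleet.init_server()
        fleet.run_server()
    elif fleet.is_worker():                       # 4. worker端执行训练
        fleet.init_worker()
        exe = fluid.Executor(fluid.CPUPlace())
        exe.run(fluid.default_startup_program())
        # ... 训练循环 ...
        fleet.stop_worker()                       # 5. 训练结束后停止worker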
RoleMaker
^^^^^^^^^
*
MPISymetricRoleMaker
*
描述:MPISymetricRoleMaker会假设每个节点启动两个进程,1worker+1pserver,这种RoleMaker要求用户的集群上有mpi环境。
*
示例:
.. code-block:: python
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.incubate.fleet.base import role_maker
role = role_maker.MPISymetricRoleMaker()
fleet.init(role)
*
启动方法:
.. code-block:: bash
mpirun -np 2 python trainer.py
*
PaddleCloudRoleMaker
*
描述:PaddleCloudRoleMaker是一个高级封装,支持使用paddle.distributed.launch或者paddle.distributed.launch_ps启动脚本
*
Parameter Server训练示例:
.. code-block:: python
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.incubate.fleet.base import role_maker
role = role_maker.PaddleCloudRoleMaker()
fleet.init(role)
*
启动方法:
.. code-block:: bash
python -m paddle.distributed.launch_ps --worker_num 2 --server_num 2 trainer.py
*
Collective训练示例:
.. code-block:: python
from paddle.fluid.incubate.fleet.collective import fleet
from paddle.fluid.incubate.fleet.base import role_maker
role = role_maker.PaddleCloudRoleMaker(is_collective=True)
fleet.init(role)
*
启动方法:
.. code-block:: bash
python -m paddle.distributed.launch trainer.py
*
UserDefinedRoleMaker
*
描述:用户自定义节点的角色信息,IP和端口信息
*
示例:
.. code-block:: python
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.incubate.fleet.base import role_maker
import os

# 以下pserver端点列表仅为示例,请替换为实际的 ip:port
pserver_endpoints = ["127.0.0.1:6170", "127.0.0.1:6171"]
role = role_maker.UserDefinedRoleMaker(
current_id=int(os.getenv("CURRENT_ID")),
role=role_maker.Role.WORKER if bool(int(os.getenv("IS_WORKER")))
else role_maker.Role.SERVER,
worker_num=int(os.getenv("WORKER_NUM")),
server_endpoints=pserver_endpoints)
fleet.init(role)
Strategy
^^^^^^^^
* Parameter Server Training
* Sync_mode
* Collective Training
* LocalSGD
* ReduceGrad
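以参数服务器训练为例,上面的 Sync_mode 策略可以通过 DistributeTranspilerConfig 配置后传入 distributed_optimizer(与前文快速开始示例一致;Collective 模式下的 LocalSGD、ReduceGrad 则通过 Collective 对应的 strategy 配置,此处不展开):

.. code-block:: python

    from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig

    strategy = DistributeTranspilerConfig()
    strategy.sync_mode = False  # False为异步训练,True为同步训练
    # 之后与快速开始示例相同:
    # optimizer = fleet.distributed_optimizer(optimizer, strategy)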
Fleet Mode
^^^^^^^^^^
*
Parameter Server Training
.. code-block:: python
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
*
Collective Training
.. code-block:: python
from paddle.fluid.incubate.fleet.collective import fleet
##########
分布式训练
##########
.. toctree::
:maxdepth: 1
cluster_quick_start.rst
fleet_api_howto_cn.rst
.. _user_guide_distribute_en:
######################
Distributed Training
######################
.. toctree::
:maxdepth: 1
cluster_quick_start_en.rst
cluster_howto_en.rst
########
多机训练
########
.. toctree::
:maxdepth: 1
cluster_quick_start.rst
cluster_howto.rst
fleet_api_howto_cn.rst
####################
Multi-node Training
####################
.. toctree::
:maxdepth: 1
cluster_quick_start_en.rst
cluster_howto_en.rst
train_on_baidu_cloud_en.rst
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import paddle.fluid.core as core
import math
import os
import sys
import numpy
import paddle
import paddle.fluid as fluid
BATCH_SIZE = 64
PASS_NUM = 1
def loss_net(hidden, label):
prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=prediction, label=label)
return prediction, avg_loss, acc
def conv_net(img, label):
conv_pool_1 = fluid.nets.simple_img_conv_pool(
input=img,
filter_size=5,
num_filters=20,
pool_size=2,
pool_stride=2,
act="relu")
conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
conv_pool_2 = fluid.nets.simple_img_conv_pool(
input=conv_pool_1,
filter_size=5,
num_filters=50,
pool_size=2,
pool_stride=2,
act="relu")
return loss_net(conv_pool_2, label)
def train(use_cuda, role, endpoints, current_endpoint, trainer_id, trainers):
if use_cuda and not fluid.core.is_compiled_with_cuda():
return
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
prediction, avg_loss, acc = conv_net(img, label)
test_program = fluid.default_main_program().clone(for_test=True)
optimizer = fluid.optimizer.Adam(learning_rate=0.001)
optimizer.minimize(avg_loss)
t = fluid.DistributeTranspiler()
t.transpile(trainer_id, pservers=endpoints, trainers=trainers)
if role == "pserver":
prog = t.get_pserver_program(current_endpoint)
startup = t.get_startup_program(current_endpoint, pserver_program=prog)
exe = fluid.Executor(fluid.CPUPlace())
exe.run(startup)
exe.run(prog)
elif role == "trainer":
prog = t.get_trainer_program()
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
train_reader = paddle.batch(
paddle.reader.shuffle(paddle.dataset.mnist.train(), buf_size=500),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
feeder = fluid.DataFeeder(feed_list=[img, label], place=place)
exe.run(fluid.default_startup_program())
for pass_id in range(PASS_NUM):
for batch_id, data in enumerate(train_reader()):
acc_np, avg_loss_np = exe.run(
prog, feed=feeder.feed(data), fetch_list=[acc, avg_loss])
if (batch_id + 1) % 10 == 0:
print(
'PassID {0:1}, BatchID {1:04}, Loss {2:2.2}, Acc {3:2.2}'.
format(pass_id, batch_id + 1,
float(avg_loss_np.mean()), float(
acc_np.mean())))
if __name__ == '__main__':
if len(sys.argv) != 6:
print(
"Usage: python %s role endpoints current_endpoint trainer_id trainers"
% sys.argv[0])
exit(0)
role, endpoints, current_endpoint, trainer_id, trainers = \
sys.argv[1:]
train(True, role, endpoints, current_endpoint,
int(trainer_id), int(trainers))
# 混合精度训练最佳实践
Automatic Mixed Precision (AMP) 是一种自动混合使用半精度(FP16)和单精度(FP32)来加速模型训练的技术。AMP技术可方便用户快速将使用 FP32 训练的模型修改为使用混合精度训练,并通过黑白名单和动态`loss scaling`来保证训练时的数值稳定性进而避免梯度Infinite或者NaN(Not a Number)。借力于新一代NVIDIA GPU中Tensor Cores的计算性能,PaddlePaddle AMP技术在ResNet50、Transformer等模型上训练速度相对于FP32训练加速比可达1.5~2.9。
### 半精度浮点类型FP16
如图 1 所示,半精度(Float Precision16,FP16)是一种相对较新的浮点类型,在计算机中使用2字节(16位)存储。在IEEE 754-2008标准中,它亦被称作binary16。与计算中常用的单精度(FP32)和双精度(FP64)类型相比,FP16更适于在精度要求不高的场景中使用。
<figure align="center">
<img src="https://paddleweb-static.bj.bcebos.com/images/fp16.png" width="600" alt='missing'/>
<figcaption><center>图 1. 半精度和单精度数据示意图</center></figcaption>
</figure>
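下面借助 numpy 的 float16 类型做一个简单示意(与飞桨 API 无关,仅用于直观感受 FP16 的精度与数值范围):

```python
import numpy as np

# float16 只有10位尾数,在2048附近相邻可表示数的间隔已经是2,
# 因此 2048 + 1 会被舍入回 2048
print(np.float16(2048) + np.float16(1))   # 2048.0
print(np.float32(2048) + np.float32(1))   # 2049.0

# float16 能表示的最大正数约为 65504,超出即溢出
print(np.finfo(np.float16).max)
```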
### 英伟达GPU的FP16算力
在使用相同的超参数下,混合精度训练使用半精度浮点(FP16)和单精度(FP32)浮点即可达到与使用纯单精度训练相同的准确率,并可加速模型的训练速度。这主要得益于英伟达推出的Volta及Turing架构GPU在使用FP16计算时具有如下特点:
* FP16可降低一半的内存带宽和存储需求,这使得在相同的硬件条件下研究人员可使用更大更复杂的模型以及更大的batch size大小。
* FP16可以充分利用英伟达Volta及Turing架构GPU提供的Tensor Cores技术。在相同的GPU硬件上,Tensor Cores的FP16计算吞吐量是FP32的8倍。
### PaddlePaddle AMP功能——牛刀小试
如前文所述,使用FP16数据类型可能会造成计算精度上的损失,但对深度学习领域而言,并不是所有计算都要求很高的精度,一些局部的精度损失对最终训练效果影响很微弱,却能使吞吐和训练速度带来大幅提升。因此,混合精度计算的需求应运而生。具体而言,训练过程中将一些对精度损失不敏感且能利用Tensor Cores进行加速的运算使用半精度处理,而对精度损失敏感部分依然保持FP32计算精度,用以最大限度提升访存和计算效率。
为了避免对每个具体模型人工地去设计和尝试精度混合的方法,PaddlePaddle框架提供自动混合精度训练(AMP)功能,解放"炼丹师"的双手。在PaddlePaddle中使用AMP训练是一件十分容易的事情,用户只需要增加一行代码即可将原有的FP32训练转变为AMP训练。下面以`MNIST`为例介绍PaddlePaddle AMP功能的使用示例。
**MNIST网络定义**
```python
import paddle.fluid as fluid
def MNIST(data, class_dim):
conv1 = fluid.layers.conv2d(data, 16, 5, 1, act=None, data_format='NHWC')
bn1 = fluid.layers.batch_norm(conv1, act='relu', data_layout='NHWC')
pool1 = fluid.layers.pool2d(bn1, 2, 'max', 2, data_format='NHWC')
conv2 = fluid.layers.conv2d(pool1, 64, 5, 1, act=None, data_format='NHWC')
bn2 = fluid.layers.batch_norm(conv2, act='relu', data_layout='NHWC')
pool2 = fluid.layers.pool2d(bn2, 2, 'max', 2, data_format='NHWC')
fc1 = fluid.layers.fc(pool2, size=64, act='relu')
fc2 = fluid.layers.fc(fc1, size=class_dim, act='softmax')
return fc2
```
针对CV(Computer Vision)类模型组网,为获得更高的训练性能需要注意如下三点:
* `conv2d`、`batch_norm` 以及 `pool2d` 等需要将数据布局设置为`NHWC`,这样有助于使用TensorCore技术加速计算过程<sup><a href="#fn1" id="ref1">1</a></sup>
* Tensor Cores要求在使用FP16加速卷积运算时conv2d的输入/输出通道数为8的倍数<sup><a href="#fn2" id="ref2">2</a></sup>,因此设计网络时推荐将conv2d层的输入/输出通道数设置为8的倍数。
* Tensor Cores要求在使用FP16加速矩阵乘运算时矩阵行数和列数均为8的倍数<sup><a href="#fn3" id="ref3">3</a></sup>,因此设计网络时推荐将fc层的size参数设置为8的倍数。
**FP32 训练**
为了训练 MNIST 网络,还需要定义损失函数来更新权重参数,此处使用的优化器是SGDOptimizer。为了简化说明,这里省略了迭代训练的相关代码,仅体现损失函数及优化器定义相关的内容。
```python
import paddle
import numpy as np
data = fluid.layers.data(
name='image', shape=[None, 28, 28, 1], dtype='float32')
label = fluid.layers.data(name='label', shape=[None, 1], dtype='int64')
out = MNIST(data, class_dim=10)
loss = fluid.layers.cross_entropy(input=out, label=label)
avg_loss = fluid.layers.mean(loss)
sgd = fluid.optimizer.SGDOptimizer(learning_rate=1e-3)
sgd.minimize(avg_loss)
```
**AMP训练**
与FP32训练相比,用户仅需使用PaddlePaddle提供的`fluid.contrib.mixed_precision.decorate` 函数将原来的优化器SGDOptimizer进行封装,然后使用封装后的优化器(mp_sgd)更新参数梯度即可完成向AMP训练的转换,代码如下所示:
```python
sgd = fluid.optimizer.SGDOptimizer(learning_rate=1e-3)
# 此处只需要使用fluid.contrib.mixed_precision.decorate将sgd封装成AMP训练所需的
# 优化器mp_sgd,并使用mp_sgd.minimize(avg_loss)代替原来的sgd.minimize(avg_loss)语句即可。
mp_sgd = fluid.contrib.mixed_precision.decorator.decorate(sgd)
mp_sgd.minimize(avg_loss)
```
运行上述混合精度训练python脚本时为得到更好的执行性能可配置如下环境参数,并保证cudnn版本在7.4.1及以上。
```shell
export FLAGS_conv_workspace_size_limit=1024 # MB,根据所使用的GPU显存容量及模型特点设置数值,值越大越有可能选择到更快的卷积算法
export FLAGS_cudnn_exhaustive_search=1 # 使用穷举搜索方法来选择快速卷积算法
export FLAGS_cudnn_batchnorm_spatial_persistent=1 # 用于触发batch_norm和relu的融合
```
上述即为最简单的PaddlePaddle AMP功能使用方法。ResNet50模型的AMP训练示例可[点击此处](https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/image_classification/README.md#%E6%B7%B7%E5%90%88%E7%B2%BE%E5%BA%A6%E8%AE%AD%E7%BB%83)查看,其他模型使用PaddlePaddle AMP的方法也与此类似。若AMP训练过程中出现连续的loss nan等不收敛现象,可尝试使用[check nan inf工具](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/flags/check_nan_inf_cn.html#span-id-speed-span)进行调试。
### PaddlePaddle AMP功能——进阶使用
上一小节所述均为默认AMP训练行为,用户当然也可以改变一些默认的参数设置来满足特定的模型训练场景需求。接下来的章节将介绍PaddlePaddle AMP功能使用中用户可配置的参数行为,即进阶使用技巧。
#### 自定义黑白名单
PaddlePaddle AMP功能实现中根据FP16数据类型计算稳定性和加速效果在框架内部定义了算子(Op)的黑白名单。具体来说,将对FP16计算友好且能利用Tensor Cores的Op归类于白名单,将使用FP16计算会导致数值不稳定的Op归类于黑名单,将对FP16计算没有多少影响的Op归类于灰名单。然而,框架开发人员不可能考虑到所有的网络模型情况,尤其是那些特殊场景中使用到的模型。用户可以在使用`fluid.contrib.mixed_precision.decorate` 函数时通过指定自定义的黑白名单列表来改变默认的FP16计算行为。
```python
sgd = fluid.optimizer.SGDOptimizer(learning_rate=1e-3)
# list1是白名单op列表,list2是黑名单op列表,list3是黑名单var_name列表(凡是以这些黑名单var_name为输入或输出的op均会被视为黑名单op)
amp_list = fluid.contrib.mixed_precision.AutoMixedPrecisionLists(custom_white_list=list1, custom_black_list=list2, custom_black_varnames=list3)
mp_sgd = fluid.contrib.mixed_precision.decorator.decorate(sgd, amp_list)
mp_sgd.minimize(avg_loss)
```
#### 自动loss scaling
为了避免梯度Infinite或者NAN,PaddlePaddle AMP功能支持根据训练过程中梯度的数值自动调整loss scale值。用户在使用`fluid.contrib.mixed_precision.decorate` 函数时也可以改变与loss scaling相关的参数设置,示例如下:
```python
sgd = fluid.optimizer.SGDOptimizer(learning_rate=1e-3)
mp_sgd = fluid.contrib.mixed_precision.decorator.decorate(sgd,
amp_lists=None,
init_loss_scaling=2**8,
incr_every_n_steps=500,
decr_every_n_nan_or_inf=4,
incr_ratio=2.0,
decr_ratio=0.5,
use_dynamic_loss_scaling=True)
mp_sgd.minimize(avg_loss)
```
`init_loss_scaling`、`incr_every_n_steps` 以及 `decr_every_n_nan_or_inf` 等参数控制着自动loss scaling的行为。它们仅当 `use_dynamic_loss_scaling` 设置为True时有效。下面详述这些参数的意义:
* init_loss_scaling(float):初始loss scaling值。
* incr_every_n_steps(int):每经过incr_every_n_steps个连续的正常梯度值才会增大loss scaling值。
* decr_every_n_nan_or_inf(int):每经过decr_every_n_nan_or_inf个连续的无效梯度值(nan或者inf)才会减小loss scaling值。
* incr_ratio(float):每次增大loss scaling值的扩增倍数,其为大于1的浮点数。
* decr_ratio(float):每次减小loss scaling值的比例系数,其为小于1的浮点数。
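为帮助理解上述参数的作用,下面给出动态 loss scaling 更新规则的一个纯 Python 示意(并非框架源码,仅用于说明参数含义):

```python
def update_loss_scaling(loss_scaling, grads_are_finite, good_steps, bad_steps,
                        incr_every_n_steps=500, decr_every_n_nan_or_inf=4,
                        incr_ratio=2.0, decr_ratio=0.5):
    """按上文参数说明更新loss scaling值(示意)。"""
    if grads_are_finite:
        good_steps += 1
        bad_steps = 0
        # 连续 incr_every_n_steps 个正常step后扩大loss scaling
        if good_steps >= incr_every_n_steps:
            loss_scaling *= incr_ratio
            good_steps = 0
    else:
        bad_steps += 1
        good_steps = 0
        # 连续 decr_every_n_nan_or_inf 个出现nan/inf的step后缩小loss scaling
        if bad_steps >= decr_every_n_nan_or_inf:
            loss_scaling *= decr_ratio
            bad_steps = 0
    return loss_scaling, good_steps, bad_steps
```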
### 多卡GPU训练的优化
PaddlePaddle AMP功能对多卡GPU训练进行了深度优化。如图 2 所示,优化之前的参数梯度更新特点:梯度计算时虽然使用的是FP16数据类型,但是不同GPU卡之间的梯度传输数据类型仍为FP32。
<figure align="center">
<img src="https://paddleweb-static.bj.bcebos.com/images/transfer_fp32_grad.png" width="500" alt='missing'/>
<figcaption><center>图 2. 不同GPU卡之间传输梯度使用FP32数据类型(优化前)</center></figcaption>
</figure>
为了降低GPU多卡之间的梯度传输带宽,我们将梯度传输提前至`Cast`操作之前,而每个GPU卡在得到对应的FP16梯度后再执行`Cast`操作将其转变为FP32类型,具体操作详见图 3。这一优化在训练大模型时对减少带宽占用尤其有效,如多卡训练BERT-Large模型。
<figure align="center">
<img src="https://paddleweb-static.bj.bcebos.com/images/transfer_fp16_grad.png" width="500" alt='missing'/>
<figcaption><center>图 3. 不同GPU卡之间传输梯度使用FP16数据类型(优化后)</center></figcaption>
</figure>
### 训练性能对比(AMP VS FP32)
PaddlePaddle AMP技术在ResNet50、Transformer等模型上训练速度相对于FP32训练上均有可观的加速比,下面是ResNet50和ERNIE Large模型的AMP训练相对于FP32训练的加速效果。
<table align="center">
<caption align="bottom"><center>图 4. Paddle AMP训练加速效果(横坐标为卡数,如8*8代表8机8卡)</center></caption>
<tr>
<td> <img src="https://paddleweb-static.bj.bcebos.com/images/resnet50.png" alt='missing'/> </td>
<td> <img src="https://paddleweb-static.bj.bcebos.com/images/ernie.png" alt='missing'/> </td>
</tr>
</table>
从图 4 所示的图表可以看出,ResNet50的AMP训练相对于FP32训练加速比可达$2.8 \times$以上,而ERNIE Large的AMP训练相对于FP32训练加速比亦可达 $1.7 \times -- 2.1 \times$ 。
### 参考文献
* <p> <a href="https://arxiv.org/abs/1710.03740"> Mixed Precision Training </a> </p>
* <p> <a href="https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=cn9312-%e4%bd%bf%e7%94%a8%e8%87%aa%e5%8a%a8%e6%b7%b7%e5%90%88%e7%b2%be%e5%ba%a6%e5%8a%a0%e9%80%9f+paddlepaddle+%e8%ae%ad%e7%bb%83"> 使用自动混合精度加速 PaddlePaddle 训练 </a> </p>
* <p id="fn1"> <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#tensor-layout"> Tensor Layouts In Memory: NCHW vs NHWC </a> <sup> <a href="#ref1"></a> </sub> </p>
* <p id="fn2"> <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels"> Channels In And Out Requirements </a> <sup> <a href="#ref2"></a> </sup> </p>
* <p id="fn3"> <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc"> Matrix-Matrix Multiplication Requirements </a> <sup> <a href="#ref3"></a> </sup> </p>
如何进行基准测试
===============
本文介绍如何给深度学习框架做基准测试。基准测试主要包含验证模型的精度和性能两方面,下文包含搭建测试环境,选择基准测试模型,验证测试结果等几方面内容。
验证深度学习框架,可分为训练和测试两个阶段, 验证指标略有不同,本文只介绍训练阶段的指标验证。训练阶段关注的是模型训练集上的精度,训练集是完备的,因此关注大batch\_size下的训练速度,关注吞吐量,例如图像模型常用的batch\_size=128, 多卡情况下会加大;预测阶段关注的是在测试集上的精度,线上服务测试数据不能提前收集,因此关注小batch\_size下的预测速度,关注延迟,例如预测服务常用的batch\_size=1, 4等。
[Fluid](https://github.com/PaddlePaddle/Paddle)是PaddlePaddle从0.11.0版本开始引入的设计,本文的基准测试在该版本上完成。
环境搭建
========
基准测试中模型精度和硬件、框架无关,由模型结构和数据共同决定;性能方面由测试硬件和框架性能决定。框架基准测试为了对比框架之间的差异,控制硬件环境,系统库等版本一致。下文中的对比实验都在相同的硬件条件和系统环境条件下进行.
不同架构的GPU卡性能差异巨大,在验证模型在GPU上训练性能时,可使用NVIDIA提供的命令:```nvidia-smi``` 检验当前使用的GPU型号,如果测试多卡训练性能,需确认硬件连接是 [nvlink](https://zh.wikipedia.org/zh/NVLink)[PCIe](https://zh.wikipedia.org/zh-hans/PCI_Express)。 同样地,CPU型号会极大影响模型在CPU上的训练性能。可读取`/proc/cpuinfo`中的参数,确认当前正在使用的CPU型号。
下载GPU对应的Cuda Tool Kit和 Cudnn,或者使用NVIDIA官方发布的nvidia-docker镜像 [nvidia-docker](https://github.com/NVIDIA/nvidia-docker), 镜像内包含了Cuda和Cudnn,本文采用这种方式。 Cuda Tool Kit包含了GPU代码使用到的基础库,影响在此基础上编译出的Fluid二进制运行性能。
准备好Cuda环境后,从github上下载Paddle代码并编译,会生成对应的最适合当前GPU的sm\_arch二进制[sm\_arch](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html)。另外,cudnn对卷积类任务影响巨大,在基准测试中需要小版本一致,例如Cudnn7.0.2与Cudnn7.1.4在Resnet上有5%以上差异。
选择基准模型
============
对框架做基准测试,需要覆盖不同训练任务和不同大小的模型,本文中选取了图像和NLP的最为常用的5个模型。
任务种类| 模型名称| 网络结构| 数据集
:---:|:--:|:---:|:---:
图像生成| CycleGAN| GAN| horse2zebra
图像分类| SE-ResNeXt50| Resnet-50| image-net
语义分割| DeepLab_V3+| ResNets| cityscapes
自然语言| Bert| Transformer| Wikipedia
机器翻译| Transformer| Attention| Wikipedia
CycleGAN、SE-ResNeXt50、DeepLab_V3+ 属于CNN模型,Bert 和 Transformer 是效果优于传统RNN的NLP模型。
基准模型的测试脚本见 [benchmark](https://github.com/PaddlePaddle/Paddle/tree/develop/benchmark/fluid)。
基准模型测试脚本中,均跳过了前几个batch的训练过程,原因是加载数据和分配显存受系统当前运行情况影响,会导致统计性能不准确。运行完若干个轮次后,统计对应指标。
基准模型的数据的选择方面,数据量大且验证效果多的公开数据集为首选。图像模型CycleGAN选择了horse2zebra数据集,SE-ResNeXt50选择了[image-net](http://www.image-net.org/challenges/LSVRC/2012/nnoupb)数据集,图像大小预处理为和Imagenet相同大小,因此性能可直接对比。
NLP模型的公开且影响力大数据集较少,Bert和Transformer模型都选择了[Wikipedia](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2)数据集。
注意,图像模型每条样本大小相同,图像经过变换后大小一致,因此经过的计算路径基本相同,计算速度和显存占用波动较小,可以从若干个batch的数据中采样得到当前的训练性能数据。而NLP模型由于样本长度不定,计算路径和显存占用也不相同,因此只能完整运行若干个轮次后,统计速度和显存消耗。
显存分配是特别耗时的操作,因此Fluid默认会占用所有可用显存空间形成显存池,用以加速计算过程中的显存分配。如果需要统计模型真实显存消耗,可设置环境变量`FLAGS_fraction_of_gpu_memory_to_use=0.0`,观察最大显存开销。
测试过程
========
- GPU 单机单卡测试
本教程使用了Cuda9, Cudnn7.0.1。镜像来源为:`nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04`
```
nvidia-docker run -it --name CASE_NAME --security-opt seccomp=unconfined -v $PWD/benchmark:/benchmark -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu paddlepaddle/paddle:latest-dev /bin/bash
```
在单卡上测试时,设置CUDA环境变量只使用一块GPU:`CUDA_VISIBLE_DEVICES=0`,
然后在代码中设置使用CUDAPlace;如果使用Paddle代码库中的脚本,只需要通过命令行参数传入 use_gpu=True 即可。
```
>>> import paddle.fluid as fluid
>>> place = fluid.CUDAPlace(0)  # 0 指第0块GPU
```
测试结果
========
本教程对比相同环境下的Fluid1.4, Pytorch1.1.0和TensorFlow1.12.0的性能表现。
硬件环境为 CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, GPU: Tesla v100(volta) 21729MiB x 1, Nvidia-Driver 384.66。
系统环境为Ubuntu 16.04.3 LTS, 本文中采用了docker环境,系统版本为nvidia-docker17.05.0-ce。
测试的Fluid版本为[v.1.4.1](https://github.com/PaddlePaddle/Paddle/tree/v1.4.1)
TensorFlow版本为[v.1.12.0-rc2](https://github.com/tensorflow/tensorflow/tree/v1.12.0-rc2)
Pytorch版本为[v.1.1.0](https://github.com/pytorch/pytorch/tree/v1.1.0)
使用的脚本和配置见[benchmark](https://github.com/PaddlePaddle/Paddle/tree/develop/benchmark/fluid)
SE-ResNeXt50对比的框架是Pytorch,因为tensorflow上没有对应的模型。
图表中统计单位为samples/秒。
- GPU 单机单卡测试结果
Model|Fluid GPU| TensorFlow/Pytorch GPU
:---:|:--:|:---:
CycleGAN| 7.3 samples/s| 6.1 samples/s
SE-ResNeXt50| 169.4 samples/s | 153.1 samples/s
DeepLab_V3+| 12.8 samples/s | 6.4 samples/s
Bert| 4.0 samples/s | 3.4 samples/s
Transformer| 4.9 samples/s | 4.7 samples/s
# CPU性能调优
此教程会介绍如何使用Python的cProfile包、Python库yep、Google perftools来进行性能分析 (profiling) 与调优(performance tuning)。
Profling 指发现性能瓶颈。系统中的瓶颈可能和程序员开发过程中想象的瓶颈相去甚远。Tuning 指消除瓶颈。性能优化的过程通常是不断重复地 profiling 和 tuning。
PaddlePaddle 用户一般通过调用 Python API 编写深度学习程序。大部分 Python API 调用用 C++ 写的 libpaddle.so。所以 PaddlePaddle 的性能分析与调优分为两个部分:
* Python 代码的性能分析
* Python 与 C++ 混合代码的性能分析
## Python代码的性能分析
### 生成性能分析文件
Python标准库中提供了性能分析的工具包,[cProfile](https://docs.python.org/2/library/profile.html)。生成Python性能分析的命令如下:
```bash
python -m cProfile -o profile.out main.py
```
其中 `main.py` 是我们要分析的程序,`-o`标识了一个输出的文件名,用来存储本次性能分析的结果。如果不指定这个文件,`cProfile`会打印到标准输出。
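除了命令行方式,也可以在脚本内部只对关心的代码段进行分析(示意,其中 `train_loop` 为假设的函数名):

```python
import cProfile

def train_loop():
    # 这里放置需要分析的训练代码(示意)
    return sum(i * i for i in range(1000000))

profiler = cProfile.Profile()
profiler.enable()
train_loop()
profiler.disable()
profiler.dump_stats("profile.out")  # 生成与命令行方式相同格式的结果文件
```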
### 查看性能分析文件
`cProfile` 在main.py 运行完毕后输出`profile.out`。我们可以使用[`cprofilev`](https://github.com/ymichael/cprofilev)来查看性能分析结果。`cprofilev`是一个Python的第三方库。使用它会开启一个HTTP服务,将性能分析结果以网页的形式展示出来:
```bash
cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
```
其中`-a`标识HTTP服务绑定的IP。使用`0.0.0.0`允许外网访问这个HTTP服务。`-p`标识HTTP服务的端口。`-f`标识性能分析的结果文件。`main.py`标识被性能分析的源文件。
用Web浏览器访问对应网址,即可显示性能分析的结果:
```
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.284 0.284 29.514 29.514 main.py:1(<module>)
4696 0.128 0.000 15.748 0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/executor.py:20(run)
4696 12.040 0.003 12.040 0.003 {built-in method run}
1 0.144 0.144 6.534 6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
```
每一列的含义是:
<table>
<thead>
<tr>
<th>列名</th>
<th>含义 </th>
</tr>
</thead>
<tbody>
<tr>
<td> ncalls</td>
<td> 函数的调用次数</td>
</tr>
<tr>
<td>tottime</td>
<td> 函数实际使用的总时间。该时间去除掉本函数调用其他函数的时间</td>
</tr>
<tr>
<td> percall </td>
<td> tottime的每次调用平均时间</td>
</tr>
<tr>
<td> cumtime</td>
<td> 函数总时间。包含这个函数调用其他函数的时间</td>
</tr>
<tr>
<td> percall</td>
<td> cumtime的每次调用平均时间</td>
</tr>
<tr>
<td> filename:lineno(function) </td>
<td> 文件名, 行号,函数名 </td>
</tr>
</tbody>
</table>
### 寻找性能瓶颈
通常 `tottime` 和 `cumtime` 是寻找瓶颈的关键指标。这两个指标代表了某一个函数真实的运行时间。
将性能分析结果按照tottime排序,效果如下:
```text
4696 12.040 0.003 12.040 0.003 {built-in method run}
300005 0.874 0.000 1.681 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
107991 0.676 0.000 1.519 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:219(__init__)
4697 0.626 0.000 2.291 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)
1 0.618 0.618 0.618 0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/__init__.py:1(<module>)
```
可以看到最耗时的函数是C++端的`run`函数。这需要联合我们第二节 `Python` 与 `C++` 混合代码的性能分析来进行调优。而`sync_with_cpp`函数的总耗时很长,每次调用的耗时也很长。于是我们可以点击`sync_with_cpp`的详细信息,了解其调用关系。
```text
Called By:
Ordered by: internal time
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
Function was called by...
ncalls tottime cumtime
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp) <- 4697 0.626 2.291 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp) <- 4696 0.019 2.316 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:487(clone)
1 0.000 0.001 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:534(append_backward)
Called:
Ordered by: internal time
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
```
通常观察热点函数间的调用关系,和对应行的代码,就可以了解到问题代码在哪里。当我们做出性能修正后,再次进行性能分析(profiling)即可检查我们调优后的修正是否能够改善程序的性能。
## Python与C++混合代码的性能分析
### 生成性能分析文件
C++的性能分析工具非常多。常见的包括`gprof`, `valgrind`, `google-perftools`。但是调试Python中使用的动态链接库与直接调试原始二进制相比增加了很多复杂度。幸而Python的一个第三方库`yep`提供了方便的和`google-perftools`交互的方法。于是这里使用`yep`进行Python与C++混合代码的性能分析
使用`yep`前需要安装 `google-perftools` 与 `yep` 包。Ubuntu下的安装命令为:
```bash
apt update
apt install libgoogle-perftools-dev
pip install yep
```
安装完毕后,我们可以通过
```bash
python -m yep -v main.py
```
生成性能分析文件。生成的性能分析文件为`main.py.prof`
命令行中的`-v`指定在生成性能分析文件之后,在命令行显示分析结果。我们可以在命令行中简单的看一下生成效果。因为C++与Python不同,编译时可能会去掉调试信息,运行时也可能因为多线程产生混乱不可读的性能分析结果。为了生成更可读的性能分析结果,可以采取下面几点措施:
1. 编译时指定`-g`生成调试信息。使用cmake的话,可以将CMAKE_BUILD_TYPE指定为`RelWithDebInfo`
2. 编译时一定要开启优化。单纯的`Debug`编译性能会和`-O2`或者`-O3`有非常大的差别。`Debug`模式下的性能测试是没有意义的。
3. 运行性能分析的时候,先从单线程开始,再开启多线程,进而多机。毕竟单线程调试更容易。可以设置`OMP_NUM_THREADS=1`这个环境变量关闭openmp优化。
### 查看性能分析文件
在运行完性能分析后,会生成性能分析结果文件。我们可以使用[`pprof`](https://github.com/google/pprof)来显示性能分析结果。注意,这里使用了用`Go`语言重构后的`pprof`,因为这个工具具有web服务界面,且展示效果更好。
安装`pprof`的命令和一般的`Go`程序是一样的,其命令如下:
```bash
go get github.com/google/pprof
```
进而我们可以使用如下命令开启一个HTTP服务:
```bash
pprof -http=0.0.0.0:3213 `which python` ./main.py.prof
```
这行命令中,`-http`指开启HTTP服务。`which python`会产生当前Python二进制的完整路径,进而指定了Python可执行文件的路径。`./main.py.prof`输入了性能分析结果。
访问对应的网址,我们可以查看性能分析的结果。结果如下图所示:
![result](./pprof_1.png)
### 寻找性能瓶颈
与寻找Python代码的性能瓶颈类似,寻找Python与C++混合代码的性能瓶颈也是要看 `tottime` 和 `cumtime`。而`pprof`展示的调用图也可以帮助我们发现性能中的问题。
例如下图中,
![kernel_perf](./pprof_2.png)
在一次训练中,乘法和乘法梯度的计算占用2%-4%左右的计算时间。而`MomentumOp`占用了17%左右的计算时间。显然,`MomentumOp`的性能有问题。
`pprof`中,对于性能的关键路径都做出了红色标记。先检查关键路径的性能问题,再检查其他部分的性能问题,可以更有次序的完成性能的优化。
# Tune CPU performance
This tutorial introduces techniques we use to profile and tune the
CPU performance of PaddlePaddle. We will use Python packages
`cProfile` and `yep`, and Google's `perftools`.
Profiling is the process that reveals performance bottlenecks,
which could be very different from what's in the developers' mind.
Performance tuning is done to fix these bottlenecks. Performance optimization
repeats the steps of profiling and tuning alternatively.
PaddlePaddle users program AI applications by calling the Python API, which calls
into `libpaddle.so`, which is written in C++. In this tutorial, we focus on
the profiling and tuning of
1. the Python code and
1. the mixture of Python and C++ code.
## Profiling the Python Code
### Generate the Performance Profiling File
We can use Python standard
package, [`cProfile`](https://docs.python.org/2/library/profile.html),
to generate Python profiling file. For example:
```bash
python -m cProfile -o profile.out main.py
```
where `main.py` is the program we are going to profile and `-o` specifies
the output file. Without `-o`, `cProfile` would output to the standard
output.
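Alternatively, we can profile only the code we care about from inside the script itself; in the sketch below, `train_loop` is a hypothetical placeholder for the code under test:

```python
import cProfile

def train_loop():
    # the code to be profiled goes here (placeholder)
    return sum(i * i for i in range(1000000))

profiler = cProfile.Profile()
profiler.enable()
train_loop()
profiler.disable()
profiler.dump_stats("profile.out")  # same file format as the command-line usage
```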
### Look into the Profiling File
`cProfile` generates `profile.out` after `main.py` completes. We can
use [`cprofilev`](https://github.com/ymichael/cprofilev) to look into
the details:
```bash
cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
```
where `-a` specifies the HTTP IP, `-p` specifies the port, `-f`
specifies the profiling file, and `main.py` is the source file.
Open a Web browser and point it to the specified IP and
port, and we will see output like the following:
```
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.284 0.284 29.514 29.514 main.py:1(<module>)
4696 0.128 0.000 15.748 0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/executor.py:20(run)
4696 12.040 0.003 12.040 0.003 {built-in method run}
1 0.144 0.144 6.534 6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
```
where each line corresponds to Python function, and the meaning of
each column is as follows:
<table>
<thead>
<tr>
<th>column</th>
<th>meaning </th>
</tr>
</thead>
<tbody>
<tr>
<td> ncalls</td>
<td> the number of calls into a function</td>
</tr>
<tr>
<td>tottime</td>
<td> the total execution time of the function, not including the execution time of other functions called by the function</td>
</tr>
<tr>
<td> percall </td>
<td> tottime divided by ncalls</td>
</tr>
<tr>
<td> cumtime</td>
<td> the total execution time of the function, including the execution time of other functions being called</td>
</tr>
<tr>
<td> percall</td>
<td> cumtime divided by ncalls</td>
</tr>
<tr>
<td> filename:lineno(function) </td>
<td> where the function is defined </td>
</tr>
</tbody>
</table>
### Identify Performance Bottlenecks
Usually, `tottime` and the related `percall` time are what we want to
focus on. We can sort the above profiling result by `tottime`:
```text
4696 12.040 0.003 12.040 0.003 {built-in method run}
300005 0.874 0.000 1.681 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
107991 0.676 0.000 1.519 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:219(__init__)
4697 0.626 0.000 2.291 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)
1 0.618 0.618 0.618 0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/__init__.py:1(<module>)
```
We can see that the most time-consuming function is the `built-in
method run`, which is a C++ function in `libpaddle.so`. We will
explain how to profile C++ code in the next section. At this
moment, let's look into the third function `sync_with_cpp`, which is a
Python function. We can click it to understand more about it:
```
Called By:
Ordered by: internal time
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
Function was called by...
ncalls tottime cumtime
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp) <- 4697 0.626 2.291 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp) <- 4696 0.019 2.316 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:487(clone)
1 0.000 0.001 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:534(append_backward)
Called:
Ordered by: internal time
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
```
The lists of the callers of `sync_with_cpp` might help us understand
how to improve the function definition.
## Profiling Python and C++ Code
### Generate the Profiling File
To profile a mixture of Python and C++ code, we can use a Python
package, `yep`, that can work with Google's `perftools`, which is a
commonly-used profiler for C/C++ code.
In Ubuntu systems, we can install `yep` and `perftools` by running the
following commands:
```bash
apt update
apt install libgoogle-perftools-dev
pip install yep
```
Then we can run the following command
```bash
python -m yep -v main.py
```
to generate the profiling file. The default filename is
`main.py.prof`.
Please be aware of the `-v` command line option, which prints the
analysis results after generating the profiling file. By examining the
printed result, we can tell whether debug
information was stripped from `libpaddle.so` at build time. The following hints
help make sure that the analysis results are readable:
1. Use GCC command line option `-g` when building `libpaddle.so` to
include the debug information. The standard building system of
PaddlePaddle is CMake, so you might want to set
`CMAKE_BUILD_TYPE=RelWithDebInfo`.
1. Use GCC command line option `-O2` or `-O3` to generate optimized
binary code. It doesn't make sense to profile `libpaddle.so`
without optimization, because it would anyway run slowly.
1. Profile the single-threaded binary before the multi-threaded
   version, because the latter often generates a tangled profiling
   result. You might want to set the environment
   variable `OMP_NUM_THREADS=1` to prevent OpenMP from automatically
   starting multiple threads.
### Examining the Profiling File
The tool we used to examine the profiling file generated by
`perftools` is [`pprof`](https://github.com/google/pprof), which
provides a Web-based GUI like `cprofilev`.
We can rely on the standard Go toolchain to retrieve the source code
of `pprof` and build it:
```bash
go get github.com/google/pprof
```
Then we can use it to profile `main.py.prof` generated in the previous
section:
```bash
pprof -http=0.0.0.0:3213 `which python` ./main.py.prof
```
Here `-http` specifies the IP address and port of the HTTP service.
Directing our Web browser to the service, we would see something like
the following:
![result](./pprof_1.png)
### Identifying the Performance Bottlenecks
Similar to how we work with `cprofilev`, we'd focus on `tottime` and
`cumtime`.
![kernel_perf](./pprof_2.png)
We can see that the execution time of multiplication and the computing
of the gradient of multiplication takes 2% to 4% of the total running
time, and `MomentumOp` takes about 17%. Obviously, we'd want to
optimize `MomentumOp`.
`pprof` would mark performance critical parts of the program in
red. It's a good idea to follow the hints.
# Heap Memory Profiling and Optimization
Any computer program runs the risk of memory leaks. Generally, a **memory leak** is caused by heap memory that the program allocates but never releases. As the memory occupied by the program grows, it affects the program's stability, which may slow it down or give rise to OOM (Out of Memory) errors. It may even compromise the stability of the machine in use and lead to *downtime*.
There are many memory leak analysis tools. Typical ones include [valgrind](http://valgrind.org/docs/manual/quick-start.html#quick-start.intro) and [gperftools](https://gperftools.github.io/gperftools/).
Because Fluid is a C++ core driven by Python, it is very difficult for valgrind to analyze it directly: you need a debug build and a dedicated Python build with valgrind support, and most of the output consists of Python's own symbols and call information. In addition, valgrind makes the program run very slowly, so it is not recommended.
Here we mainly introduce the use of [gperftools](https://gperftools.github.io/gperftools/) .
gperftools mainly provides four functions:
- thread-caching malloc
- heap-checking using tcmalloc
- heap-profiling using tcmalloc
- CPU profiler
Paddle also provides a [tutorial on CPU performance analysis](./cpu_profiling_en.html) based on gperftools.
For heap analysis, we mainly use thread-caching malloc and heap-profiling with tcmalloc.
## Environment
This tutorial is based on the Docker development image paddlepaddle/paddle:latest-dev provided by PaddlePaddle, which is built on Ubuntu 16.04.4 LTS.
## Manual
- Install google-perftools
```
apt-get install libunwind-dev
apt-get install google-perftools
```
- Install pprof
```
go get -u github.com/google/pprof
```
- Configure Running Environment
```
export PPROF_PATH=/root/gopath/bin/pprof
export PPROF_BINARY_PATH=/root/gopath/bin/pprof
export LD_PRELOAD=/usr/lib/libtcmalloc.so.4
```
- Run the Python program with the heap profiler enabled. In essence, this takes a snapshot of the heap allocation periodically.
```
# HEAPPROFILE sets the directory and file prefix of the generated heap analysis file
# HEAP_PROFILE_ALLOCATION_INTERVAL sets how many bytes must be allocated between two consecutive heap dumps, default 1GB
env HEAPPROFILE="./perf_log/test.log" HEAP_PROFILE_ALLOCATION_INTERVAL=209715200 python trainer.py
```
As the program runs, a lot of files will be generated in the perf_log folder as follows:
```
-rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0001.heap
-rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0002.heap
-rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0003.heap
-rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0004.heap
-rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0005.heap
-rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0006.heap
```
- Analyze the heap files with pprof. There are two modes of analysis:
- Complete mode. An analysis of the current heap is performed, showing some of the call paths for the current allocation of memory.
```
pprof --pdf python test.log.0012.heap
```
The command above generates a file named profile00x.pdf, which can be opened directly, for example [memory_cpu_allocator](https://github.com/jacquesqiao/Paddle/blob/bd2ea0e1f84bb6522a66d44a072598153634cade/doc/fluid/howto/optimization/memory_cpu_allocator.pdf). As shown in the chart below, while the CPU version of Fluid is running, the CPUAllocator module holds most of the allocated memory; other modules hold relatively little and are therefore ignored. This view is inconvenient for spotting memory leaks, because a leak is a gradual process that cannot be observed in a single snapshot.
![result](https://user-images.githubusercontent.com/3048612/40964027-a54033e4-68dc-11e8-836a-144910c4bb8c.png)
- Diff mode. You can diff the heap at two moments, which filters out the modules whose memory allocation has not changed and displays only the increment.
```
pprof --pdf --base test.log.0010.heap python test.log.1045.heap
```
The generated result: [`memory_leak_protobuf`](https://github.com/jacquesqiao/Paddle/blob/bd2ea0e1f84bb6522a66d44a072598153634cade/doc/fluid/howto/optimization/memory_leak_protobuf.pdf)
As shown in the figure, the memory held by ProgramDesc structures grew by more than 200MB between the two snapshots, so a memory leak is very likely happening there, and the final result did confirm a leak at that location.
![result](https://user-images.githubusercontent.com/3048612/40964057-b434d5e4-68dc-11e8-894b-8ab62bcf26c2.png)
![result](https://user-images.githubusercontent.com/3048612/40964063-b7dbee44-68dc-11e8-9719-da279f86477f.png)
.. _api_guide_analysis_tools:
###############
性能优化分析及工具
###############
.. toctree::
:hidden:
cpu_profiling_cn.md
host_memory_profiling_cn.md
timeline_cn.md
本模块介绍 Fluid 使用过程中的调优方法,包括:
- `CPU性能调优 <cpu_profiling_cn.html>`_:介绍如何使用 cProfile 包、yep库、Google perftools 进行性能分析与调优
- `堆内存分析和优化 <host_memory_profiling_cn.html>`_:介绍如何使用 gperftool 进行堆内存分析和优化,以解决内存泄漏的问题
- `Timeline工具简介 <timeline_cn.html>`_ :介绍如何使用 Timeline 工具进行性能分析和调优
#######################################
Performance Profiling and Optimization
#######################################
.. toctree::
:hidden:
cpu_profiling_en.md
host_memory_profiling_en.md
timeline_en.md
This section illustrates how to optimize the performance of Fluid:
- `CPU profiling <cpu_profiling_en.html>`_ : How to use cProfile, yep, and Google perftools to profile and optimize model performance
- `Heap Memory Profiling and Optimization <host_memory_profiling_en.html>`_ : How to use gperftools to perform heap memory profiling and optimization to solve memory leaks
- `How to use timeline tool to do profiling <timeline_en.html>`_ : How to use the timeline tool for profiling and optimization
# timeline工具简介
## <span id="local">本地使用</span>
1. 在训练的主循环外加上`profiler.start_profiler(...)`和`profiler.stop_profiler(...)`。运行之后,代码会生成一个profile记录文件`/tmp/profile`。
**提示:**
请不要在timeline记录信息时运行太多次迭代,因为timeline中的记录数量和迭代次数是成正比的。
```python
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid import profiler
place = fluid.CPUPlace()
def reader():
for i in range(100):
yield [np.random.random([4]).astype('float32'), np.random.random([3]).astype('float32')],
main_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(main_program, startup_program):
data_1 = fluid.layers.data(name='data_1', shape=[1, 2, 2])
data_2 = fluid.layers.data(name='data_2', shape=[1, 1, 3])
out = fluid.layers.fc(input=[data_1, data_2], size=2)
# ...
feeder = fluid.DataFeeder([data_1, data_2], place)
exe = fluid.Executor(place)
exe.run(startup_program)
pass_num = 10
for pass_id in range(pass_num):
for batch_id, data in enumerate(reader()):
if pass_id == 0 and batch_id == 5:
profiler.start_profiler("All")
elif pass_id == 0 and batch_id == 10:
profiler.stop_profiler("total", "/tmp/profile")
outs = exe.run(program=main_program,
feed=feeder.feed(data),
fetch_list=[out])
```
1. 运行`python paddle/tools/timeline.py`来处理`/tmp/profile`,这个程序默认会生成一个`/tmp/timeline`文件,你也可以用命令行参数来修改这个路径,请参考[timeline.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/timeline.py)
```bash
python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=timeline
```
1. 打开chrome浏览器,访问<chrome://tracing/>,用`load`按钮来加载生成的`timeline`文件。
1. 结果如下图所示,可以放大来查看timeline的细节信息。
![chrome timeline](./timeline.jpeg)
## 分布式使用
一般来说,分布式的训练程序都会有两种程序:pserver和trainer。我们提供了把pserver和trainer的profile日志用timeline来显示的方式。
1. trainer打开方式与[本地使用](#local)部分的第1步相同
1. pserver可以通过加两个环境变量打开profile,例如:
```
FLAGS_rpc_server_profile_period=10 FLAGS_rpc_server_profile_path=./tmp/pserver python train.py
```
3. 把pserver和trainer的profile文件生成一个timeline文件,例如:
```
python /paddle/tools/timeline.py
--profile_path trainer0=local_profile_10_pass0_0,trainer1=local_profile_10_pass0_1,pserver0=./pserver_0,pserver1=./pserver_1
--timeline_path ./dist.timeline
```
4. 在chrome中加载dist.timeline文件,方法和[本地使用](#local)第4步相同。
# How to use timeline tool to do profile
## <span id="local">Local</span>
1. Add `profiler.start_profiler(...)` and `profiler.stop_profiler(...)` to the main training loop. After running, the code will generate a profile record file `/tmp/profile`. **Warning**: Please do not run too many batches when using the profiler to record timeline information, because the profile record grows with the number of batches.
```python
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid import profiler
place = fluid.CPUPlace()
def reader():
for i in range(100):
yield [np.random.random([4]).astype('float32'), np.random.random([3]).astype('float32')],
main_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(main_program, startup_program):
data_1 = fluid.layers.data(name='data_1', shape=[1, 2, 2])
data_2 = fluid.layers.data(name='data_2', shape=[1, 1, 3])
out = fluid.layers.fc(input=[data_1, data_2], size=2)
# ...
feeder = fluid.DataFeeder([data_1, data_2], place)
exe = fluid.Executor(place)
exe.run(startup_program)
pass_num = 10
for pass_id in range(pass_num):
for batch_id, data in enumerate(reader()):
if pass_id == 0 and batch_id == 5:
profiler.start_profiler("All")
elif pass_id == 0 and batch_id == 10:
profiler.stop_profiler("total", "/tmp/profile")
outs = exe.run(program=main_program,
feed=feeder.feed(data),
fetch_list=[out])
```
2. Run `python Paddle/tools/timeline.py` to process `/tmp/profile`. It will generate another
file, `/tmp/timeline`, by default. You can change the path via command line parameters; please take a look at
[timeline.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/timeline.py) for details.
```bash
python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=timeline
```
3. Open chrome and visit <chrome://tracing/>, use `load` button to load the generated `timeline` file.
4. The resulting timeline should look like the figure below; you can zoom in to see more details.<a name="local_step_4"></a>
![chrome timeline](./timeline.jpeg)
## Distributed
This tool also supports distributed training programs (pserver and trainer).
1. Enable the trainer profiler just as described in [Local](#local).
2. Open pserver profiler: add two environment variables, e.g.:
```
FLAGS_rpc_server_profile_period=10 FLAGS_rpc_server_profile_path=./tmp/pserver python train.py
```
3. Merge the pservers' and trainers' profile files into one timeline file, e.g.:
```
python /paddle/tools/timeline.py
--profile_path trainer0=local_profile_10_pass0_0,trainer1=local_profile_10_pass0_1,pserver0=./pserver_0,pserver1=./pserver_1
--timeline_path ./dist.timeline
```
4. Load `dist.timeline` in chrome just like the [fourth step in Local](#local_step_4)
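
The merged file is ordinary Chrome trace-format JSON, so you can sanity-check it with the standard library before loading it in the browser. A small sketch (the path is the one produced in step 3; `traceEvents` is the standard Chrome trace field):

```python
import json

with open("./dist.timeline") as f:
    trace = json.load(f)

# Chrome trace files keep their records under the "traceEvents" key.
events = trace.get("traceEvents", [])
print("number of trace events:", len(events))
```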
# 运行时设备切换
Paddle提供了[fluid.CUDAPlace](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/CUDAPlace_cn.html)以及[fluid.CPUPlace](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/CPUPlace_cn.html)用于指定运行时的设备。这两个接口用于指定全局的设备,从1.8版本开始,Paddle提供了[device_guard](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/fluid_cn/device_guard_cn.html)接口,用于指定部分OP的运行设备,此教程会介绍device_guard的使用场景,以及如何使用该接口对模型进行优化。
如果使用了`fluid.CUDAPlace`设置了全局的执行设备,框架将尽可能地将OP设置在GPU上执行,因此有可能会遇到显存不够的情况。`device_guard`可以用于设置OP的执行设备,如果将部分层设置在CPU上运行,就能够充分利用CPU大内存的优势,避免显存超出。
有时尽管指定了全局的执行设备为GPU,但框架在自动分配OP执行设备时,可能会将部分OP设置在CPU上执行。另外,个别OP会将输出存储在CPU上。在以上的场景中,常常会发生不同设备间的数据传输,可能会影响模型的性能。使用`device_guard`可以避免模型运行中不必要的数据传输。在下面的内容中,将会详细介绍如何通过[profile](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/profiler_cn.html)工具分析数据传输开销,以及如何使用`device_guard`避免不必要的数据传输,从而提升模型性能。
## 如何避免显存超出
下面示例代码中的`embedding`层,其参数`size`包含两个元素,第一个元素为`vocab_size` (词表大小), 第二个为`emb_size`(`embedding`层维度)。实际场景中,词表可能会非常大。示例代码中,词表大小被设置为10000000。如果在GPU模式下运行,该层创建的权重矩阵的大小为(10000000, 150),仅这一层就需要5.59G的显存,如果词表大小继续增加,极有可能会导致显存超出。
```python
import paddle.fluid as fluid
data = fluid.layers.fill_constant(shape=[1], value=128, dtype='int64')
label = fluid.layers.fill_constant(shape=[1, 150], value=0.5, dtype='float32')
emb = fluid.embedding(input=data, size=(10000000, 150), dtype='float32')
out = fluid.layers.l2_normalize(x=emb, axis=-1)
cost = fluid.layers.square_error_cost(input=out, label=label)
avg_cost = fluid.layers.mean(cost)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_cost)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
result = exe.run(fluid.default_main_program(), fetch_list=[avg_cost])
```
`embedding`是根据`input`中的`id`信息从`embedding`矩阵中查询对应`embedding`信息,在CPU上进行计算,其速度也是可接受的。因此,可以参考如下代码,使用`device_guard`将`embedding`层设置在CPU上,以利用CPU内存资源。那么,除了`embedding`层,其他各层都会在GPU上运行。
```python
import paddle.fluid as fluid
data = fluid.layers.fill_constant(shape=[1], value=128, dtype='int64')
label = fluid.layers.fill_constant(shape=[1, 150], value=0.5, dtype='float32')
with fluid.device_guard("cpu"):
emb = fluid.embedding(input=data, size=(10000000, 150), dtype='float32')
out = fluid.layers.l2_normalize(x=emb, axis=-1)
cost = fluid.layers.square_error_cost(input=out, label=label)
avg_cost = fluid.layers.mean(cost)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_cost)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
result = exe.run(fluid.default_main_program(), fetch_list=[avg_cost])
```
在显存足够的情况下,可不必进行这样的设置。
## 如何减少数据传输
### 使用profile工具确认是否发生了数据传输
首先对模型的性能数据进行分析,找到发生数据传输的原因。如下列代码所示,可以利用[profile](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/profiler_cn.html)工具进行分析。
```python
import paddle.fluid as fluid
import paddle.fluid.compiler as compiler
import paddle.fluid.profiler as profiler
data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32')
data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32')
shape = fluid.layers.shape(data2)
shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4])
out = fluid.layers.crop_tensor(data1, shape=shape)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
compiled_prog = compiler.CompiledProgram(fluid.default_main_program())
with profiler.profiler('All', 'total') as prof:
for i in range(10):
result = exe.run(program=compiled_prog, fetch_list=[out])
```
在程序运行结束后,将会自动地打印出profile report。在下面的profile report中,可以看到 `GpuMemCpy Summary`中给出了2项数据传输的调用耗时。在OP执行过程中,如果输入Tensor所在的设备与OP执行的设备不同,就会发生`GpuMemcpySync`,通常我们可以直接优化的就是这一项。进一步分析,可以看到`slice``crop_tensor`执行中都发生了`GpuMemcpySync`。尽管我们在程序中设置了GPU模式运行,但是框架中有些OP,例如shape,会将输出结果放在CPU上。
```text
-------------------------> Profiling Report <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Total time: 26.6328
Computation time Total: 13.3133 Ratio: 49.9884%
Framework overhead Total: 13.3195 Ratio: 50.0116%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 30 Total: 1.47508 Ratio: 5.5386%
GpuMemcpyAsync Calls: 10 Total: 0.443514 Ratio: 1.66529%
GpuMemcpySync Calls: 20 Total: 1.03157 Ratio: 3.87331%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
FastThreadedSSAGraphExecutorPrepare 10 9.16493 9.152509 (0.998645) 0.012417 (0.001355) 0.025192 8.85968 0.916493 0.344122
shape 10 8.33057 8.330568 (1.000000) 0.000000 (0.000000) 0.030711 7.99849 0.833057 0.312793
fill_constant 20 4.06097 4.024522 (0.991025) 0.036449 (0.008975) 0.075087 0.888959 0.203049 0.15248
slice 10 1.78033 1.750439 (0.983212) 0.029888 (0.016788) 0.148503 0.290851 0.178033 0.0668471
GpuMemcpySync:CPU->GPU 10 0.45524 0.446312 (0.980388) 0.008928 (0.019612) 0.039089 0.060694 0.045524 0.0170932
crop_tensor 10 1.67658 1.620542 (0.966578) 0.056034 (0.033422) 0.143906 0.258776 0.167658 0.0629515
GpuMemcpySync:GPU->CPU 10 0.57633 0.552906 (0.959357) 0.023424 (0.040643) 0.050657 0.076322 0.057633 0.0216398
Fetch 10 0.919361 0.895201 (0.973721) 0.024160 (0.026279) 0.082935 0.138122 0.0919361 0.0345199
GpuMemcpyAsync:GPU->CPU 10 0.443514 0.419354 (0.945526) 0.024160 (0.054474) 0.040639 0.059673 0.0443514 0.0166529
ScopeBufferedMonitor::post_local_exec_scopes_process 10 0.341999 0.341999 (1.000000) 0.000000 (0.000000) 0.028436 0.057134 0.0341999 0.0128413
eager_deletion 30 0.287236 0.287236 (1.000000) 0.000000 (0.000000) 0.005452 0.022696 0.00957453 0.010785
ScopeBufferedMonitor::pre_local_exec_scopes_process 10 0.047864 0.047864 (1.000000) 0.000000 (0.000000) 0.003668 0.011592 0.0047864 0.00179718
InitLocalVars 1 0.022981 0.022981 (1.000000) 0.000000 (0.000000) 0.022981 0.022981 0.022981 0.000862883
```
### 通过log查看发生数据传输的具体位置
以上的示例程序比较简单,我们只用看profile report就能知道具体是哪些算子发生了数据传输。但是当模型比较复杂时,可能需要去查看更加详细的调试信息,可以打印出运行时的log去确定发生数据传输的具体位置。依然以上述程序为例,执行`GLOG_vmodule=operator=3 python test_case.py`,会得到如下log信息,会发现发生了2次数据传输:
- `shape`输出的结果在CPU上,在`slice`运行时,`shape`的输出被拷贝到GPU上
- `slice`执行完的结果在GPU上,当`crop_tensor`执行时,它会被拷贝到CPU上。
```text
I0406 14:56:23.286592 17516 operator.cc:180] CUDAPlace(0) Op(shape), inputs:{Input[fill_constant_1.tmp_0:float[1, 3, 5, 5]({})]}, outputs:{Out[shape_0.tmp_0:int[4]({})]}.
I0406 14:56:23.286628 17516 eager_deletion_op_handle.cc:107] Erase variable fill_constant_1.tmp_0 on CUDAPlace(0)
I0406 14:56:23.286725 17516 operator.cc:1210] Transform Variable shape_0.tmp_0 from data_type[int]:data_layout[NCHW]:place[CPUPlace]:library_type[PLAIN] to data_type[int]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0406 14:56:23.286763 17516 scope.cc:169] Create variable shape_0.tmp_0
I0406 14:56:23.286784 17516 data_device_transform.cc:21] DeviceTransform in, src_place CPUPlace dst_place: CUDAPlace(0)
I0406 14:56:23.286867 17516 tensor_util.cu:129] TensorCopySync 4 from CPUPlace to CUDAPlace(0)
I0406 14:56:23.287099 17516 operator.cc:180] CUDAPlace(0) Op(slice), inputs:{EndsTensor[], EndsTensorList[], Input[shape_0.tmp_0:int[4]({})], StartsTensor[], StartsTensorList[]}, outputs:{Out[slice_0.tmp_0:int[4]({})]}.
I0406 14:56:23.287140 17516 eager_deletion_op_handle.cc:107] Erase variable shape_0.tmp_0 on CUDAPlace(0)
I0406 14:56:23.287220 17516 tensor_util.cu:129] TensorCopySync 4 from CUDAPlace(0) to CPUPlace
I0406 14:56:23.287473 17516 operator.cc:180] CUDAPlace(0) Op(crop_tensor), inputs:{Offsets[], OffsetsTensor[], Shape[slice_0.tmp_0:int[4]({})], ShapeTensor[], X[fill_constant_0.tmp_0:float[1, 3, 8, 8]({})]}, outputs:{Out[crop_tensor_0.tmp_0:float[1, 3, 5, 5]({})]}.
```
### 使用device_guard避免不必要的数据传输
在上面的例子中,`shape`输出的是一个1-D的Tensor,因此对于`slice`而言计算量很小。这种情况下如果将`slice`设置在CPU上运行,就可以避免2次数据传输。修改后的程序如下:
```python
import paddle.fluid as fluid
import paddle.fluid.compiler as compiler
import paddle.fluid.profiler as profiler
data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32')
data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32')
shape = fluid.layers.shape(data2)
with fluid.device_guard("cpu"):
shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4])
out = fluid.layers.crop_tensor(data1, shape=shape)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
compiled_prog = compiler.CompiledProgram(fluid.default_main_program())
with profiler.profiler('All', 'total') as prof:
for i in range(10):
result = exe.run(program=compiled_prog, fetch_list=[out])
```
再次观察profile report中`GpuMemCpy Summary`的内容,可以看到`GpuMemCpySync`已经被消除。在实际的模型中,若`GpuMemCpySync` 调用耗时占比较大,并且可以通过设置`device_guard`避免,那么就能够带来一定的性能提升。
```text
-------------------------> Profiling Report <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Total time: 14.5345
Computation time Total: 4.47587 Ratio: 30.7948%
Framework overhead Total: 10.0586 Ratio: 69.2052%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 10 Total: 0.457033 Ratio: 3.14447%
GpuMemcpyAsync Calls: 10 Total: 0.457033 Ratio: 3.14447%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
FastThreadedSSAGraphExecutorPrepare 10 7.70113 7.689066 (0.998433) 0.012064 (0.001567) 0.032657 7.39363 0.770113 0.529852
fill_constant 20 2.62299 2.587022 (0.986287) 0.035968 (0.013713) 0.071097 0.342082 0.13115 0.180466
shape 10 1.93504 1.935040 (1.000000) 0.000000 (0.000000) 0.026774 1.6016 0.193504 0.133134
Fetch 10 0.880496 0.858512 (0.975032) 0.021984 (0.024968) 0.07392 0.140896 0.0880496 0.0605797
GpuMemcpyAsync:GPU->CPU 10 0.457033 0.435049 (0.951898) 0.021984 (0.048102) 0.037836 0.071424 0.0457033 0.0314447
crop_tensor 10 0.705426 0.671506 (0.951916) 0.033920 (0.048084) 0.05841 0.123901 0.0705426 0.0485346
slice 10 0.324241 0.324241 (1.000000) 0.000000 (0.000000) 0.024299 0.07213 0.0324241 0.0223084
eager_deletion 30 0.250524 0.250524 (1.000000) 0.000000 (0.000000) 0.004171 0.016235 0.0083508 0.0172365
ScopeBufferedMonitor::post_local_exec_scopes_process 10 0.047794 0.047794 (1.000000) 0.000000 (0.000000) 0.003344 0.014131 0.0047794 0.00328831
InitLocalVars 1 0.034629 0.034629 (1.000000) 0.000000 (0.000000) 0.034629 0.034629 0.034629 0.00238254
ScopeBufferedMonitor::pre_local_exec_scopes_process 10 0.032231 0.032231 (1.000000) 0.000000 (0.000000) 0.002952 0.004076 0.0032231 0.00221755
```
### 总结
- 使用profile工具对模型进行分析,看是否存在GpuMemcpySync的调用耗时。若存在,则进一步分析发生数据传输的原因。
- 可以通过profile report找到发生GpuMemcpySync的OP。如果需要,可以通过打印log,找到GpuMemcpySync发生的具体位置。
- 尝试使用`device_guard`设置部分OP的运行设备,来减少GpuMemcpySync的调用。
- 最后可以通过比较修改前后模型的profile report,或者其他用来衡量性能的指标,确认修改后是否带来了性能提升。
########
性能调优
########
.. toctree::
:maxdepth: 1
singlenode_training_improving/training_best_practice.rst
singlenode_training_improving/memory_optimize.rst
device_switching/device_switching.md
amp/amp.md
multinode_training_improving/cpu_train_best_practice.rst
multinode_training_improving/dist_training_gpu.rst
multinode_training_improving/gpu_training_with_recompute.rst
inference_improving/paddle_tensorrt_infer.md
analysis_tools/index_cn.rst
##################
Practice Improving
##################
.. toctree::
:maxdepth: 1
singlenode_training_improving/memory_optimize_en.rst
multinode_training_improving/cpu_train_best_practice_en.rst
multinode_training_improving/gpu_training_with_recompute_en.rst
inference_improving/paddle_tensorrt_infer_en.md
analysis_tools/index_en.rst
# 使用Paddle-TensorRT库预测
NVIDIA TensorRT 是一个高性能的深度学习预测库,可为深度学习推理应用程序提供低延迟和高吞吐量。PaddlePaddle 采用子图的形式对TensorRT进行了集成,即我们可以使用该模块来提升Paddle模型的预测性能。该模块依旧在持续开发中,目前支持的模型如下表所示:
|分类模型|检测模型|分割模型|
|---|---|---|
|mobilenetv1|yolov3|ICNET|
|resnet50|SSD||
|vgg16|mask-rcnn||
|resnext|faster-rcnn||
|AlexNet|cascade-rcnn||
|Se-ResNext|retinanet||
|GoogLeNet|mobilenet-SSD||
|DPN|||
在这篇文档中,我们将会对Paddle-TensorRT库的获取、使用和原理进行介绍。
**Note:**
1. 从源码编译时,TensorRT预测库目前仅支持使用GPU编译,且需要设置编译选项TENSORRT_ROOT为TensorRT所在的路径。
2. Windows支持需要TensorRT 版本5.0以上。
3. Paddle-TRT目前仅支持固定输入shape。
4. 下载安装TensorRT后,需要手动在`NvInfer.h`文件中为`class IPluginFactory`和`class IGpuAllocator`分别添加虚析构函数:
``` c++
virtual ~IPluginFactory() {};
virtual ~IGpuAllocator() {};
```
## 内容
- [Paddle-TRT使用介绍](#Paddle-TRT使用介绍)
- [Paddle-TRT样例编译测试](#Paddle-TRT样例编译测试)
- [Paddle-TRT INT8使用](#Paddle-TRT_INT8使用)
- [Paddle-TRT子图运行原理](#Paddle-TRT子图运行原理)
- [Paddle-TRT性能测试](#Paddle-TRT性能测试)
## <a name="Paddle-TRT使用介绍">Paddle-TRT使用介绍</a>
在使用AnalysisPredictor时,我们通过配置AnalysisConfig中的接口
``` c++
config->EnableTensorRtEngine(1 << 20 /* workspace_size*/,
batch_size /* max_batch_size*/,
3 /* min_subgraph_size*/,
AnalysisConfig::Precision::kFloat32 /* precision*/,
false /* use_static*/,
false /* use_calib_mode*/);
```
的方式来指定使用Paddle-TRT子图方式来运行。
该接口中的参数的详细介绍如下:
- **`workspace_size`**,类型:int,默认值为1 << 20。指定TensorRT使用的工作空间大小,TensorRT会在该大小限制下筛选合适的kernel执行预测运算。
- **`max_batch_size`**,类型:int,默认值为1。需要提前设置最大的batch大小,运行时batch大小不得超过此限定值。
- **`min_subgraph_size`**,类型:int,默认值为3。Paddle-TRT是以子图的形式运行,为了避免性能损失,当子图内部节点个数大于`min_subgraph_size`的时候,才会使用Paddle-TRT运行。
- **`precision`**,类型:`enum class Precision {kFloat32 = 0, kHalf, kInt8,};`, 默认值为`AnalysisConfig::Precision::kFloat32`。指定使用TRT的精度,支持FP32(kFloat32),FP16(kHalf),Int8(kInt8)。若需要使用Paddle-TRT int8离线量化校准,需设定`precision``AnalysisConfig::Precision::kInt8`, 且设置`use_calib_mode` 为true。
- **`use_static`**,类型:bool, 默认值为false。如果指定为true,在初次运行程序的时候会将TRT的优化信息进行序列化到磁盘上,下次运行时直接加载优化的序列化信息而不需要重新生成。
- **`use_calib_mode`**,类型:bool, 默认值为false。若要运行Paddle-TRT int8离线量化校准,需要将此选项设置为true。
**Note:** Paddle-TRT目前只支持固定shape的输入,不支持变化shape的输入。
## <a name="Paddle-TRT样例编译测试">Paddle-TRT样例编译测试</a>
1. 下载或编译带有 TensorRT 的paddle预测库,参考[安装与编译C++预测库](../../inference_deployment/inference/build_and_install_lib_cn.html)
2. 从[NVIDIA官网](https://developer.nvidia.com/nvidia-tensorrt-download)下载对应本地环境中cuda和cudnn版本的TensorRT,需要登录NVIDIA开发者账号。
3. 下载[预测样例](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz)并解压,进入`sample/paddle-TRT`目录下。
`paddle-TRT` 文件夹目录结构如下:
```
paddle-TRT
├── CMakeLists.txt
├── mobilenet_test.cc
├── fluid_generate_calib_test.cc
├── fluid_int8_test.cc
├── mobilenetv1
│ ├── model
│ └── params
├── run.sh
└── run_impl.sh
```
- `mobilenet_test.cc` 为使用paddle-TRT预测的C++源文件
- `fluid_generate_calib_test.cc` 为使用TRT int8离线量化校准的C++源文件
- `fluid_int8_test.cc` 为使用TRT执行int8预测的C++源文件
- `mobilenetv1` 为模型文件夹
- `run.sh` 为预测运行脚本文件
在这里假设样例所在的目录为 `SAMPLE_BASE_DIR/sample/paddle-TRT`
4. 配置编译与运行脚本
编译运行预测样例之前,需要根据运行环境配置编译与运行脚本`run.sh`。`run.sh`的选项与路径配置的部分如下:
```shell
# 设置是否开启MKL、GPU、TensorRT,如果要使用TensorRT,必须打开GPU
WITH_MKL=ON
WITH_GPU=ON
USE_TENSORRT=ON
# 按照运行环境设置预测库路径、CUDA库路径、CUDNN库路径、TensorRT路径、模型路径
LIB_DIR=YOUR_LIB_DIR
CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR
CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR
TENSORRT_ROOT_DIR=YOUR_TENSORRT_ROOT_DIR
MODEL_DIR=YOUR_MODEL_DIR
```
按照实际运行环境配置`run.sh`中的选项开关和所需lib路径。
5. 编译与运行样例。按上一步配置好`run.sh`后,运行 `sh run.sh` 即可编译并执行预测样例。
## <a name="Paddle-TRT_INT8使用">Paddle-TRT INT8使用</a>
1. Paddle-TRT INT8 简介
神经网络的参数在一定程度上是冗余的,在很多任务上,我们可以在保证模型精度的前提下,将Float32的模型转换成Int8的模型。目前,Paddle-TRT支持离线将预训练好的Float32模型转换成Int8的模型,具体的流程如下:
1) **生成校准表**(Calibration table):我们准备500张左右的真实输入数据,并将数据输入到模型中去,Paddle-TRT会统计模型中每个op输入和输出值的范围信息,并将其记录到校准表中,这些信息有效减少了模型转换时的信息损失。
2) 生成校准表后,再次运行模型,**Paddle-TRT会自动加载校准表**,并进行INT8模式下的预测。
2. 编译测试INT8样例
将`run.sh`文件中的`mobilenet_test`改为`fluid_generate_calib_test`,运行
``` shell
sh run.sh
```
即可执行生成校准表样例,在该样例中,我们随机生成了500个输入来模拟这一过程,在实际业务中,建议大家使用真实样例。运行结束后,在 `SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/_opt_cache` 模型目录下会多出一个名字为trt_calib_*的文件,即校准表。
生成校准表后,将带校准表的模型文件拷贝到特定地址
``` shell
cp -rf SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/ SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib
```
将`run.sh`文件中的`fluid_generate_calib_test`改为`fluid_int8_test`,将模型路径改为`SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib`,运行
``` shell
sh run.sh
```
即可执行int8预测样例。
## <a name="Paddle-TRT子图运行原理">Paddle-TRT子图运行原理</a>
PaddlePaddle采用子图的形式对TensorRT进行集成,当模型加载后,神经网络可以表示为由变量和运算节点组成的计算图。Paddle TensorRT实现的功能是对整个图进行扫描,发现图中可以使用TensorRT优化的子图,并使用TensorRT节点替换它们。在模型的推断期间,如果遇到TensorRT节点,Paddle会调用TensorRT库对该节点进行优化,其他的节点调用Paddle的原生实现。TensorRT在推断期间能够进行Op的横向和纵向融合,过滤掉冗余的Op,并对特定平台下的特定的Op选择合适的kernel等进行优化,能够加快模型的预测速度。
下图使用一个简单的模型展示了这个过程:
**原始网络**
<p align="center">
<img src="https://raw.githubusercontent.com/NHZlX/FluidDoc/add_trt_doc/doc/fluid/user_guides/howto/inference/image/model_graph_original.png" width="600">
</p>
**转换的网络**
<p align="center">
<img src="https://raw.githubusercontent.com/NHZlX/FluidDoc/add_trt_doc/doc/fluid/user_guides/howto/inference/image/model_graph_trt.png" width="600">
</p>
我们可以在原始模型网络中看到,绿色节点表示可以被TensorRT支持的节点,红色节点表示网络中的变量,黄色节点表示只能被Paddle原生实现执行的节点。那些在原始网络中的绿色节点被提取出来汇集成子图,并由一个TensorRT节点代替,成为转换后网络中的`block-25` 节点。在网络运行过程中,如果遇到该节点,Paddle将调用TensorRT库来执行该节点。
## <a name="Paddle-TRT性能测试">Paddle-TRT性能测试</a>
### 测试环境
- CPU:Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz GPU:Tesla P4
- TensorRT4.0, CUDA8.0, CUDNNV7
- 测试模型 ResNet50,MobileNet,ResNet101, Inception V3.
### 测试对象
**PaddlePaddle, Pytorch, Tensorflow**
- 在测试中,PaddlePaddle使用子图优化的方式集成了TensorRT, 模型[地址](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models)
- Pytorch使用了原生的实现, 模型[地址1](https://github.com/pytorch/vision/tree/master/torchvision/models)[地址2](https://github.com/marvis/pytorch-mobilenet)
- 对TensorFlow测试包括了对TF的原生的测试,和对TF—TRT的测试,**对TF—TRT的测试并没有达到预期的效果,后期会对其进行补充**, 模型[地址](https://github.com/tensorflow/models)
#### ResNet50
|batch_size|PaddlePaddle(ms)|Pytorch(ms)|TensorFlow(ms)|
|---|---|---|---|
|1|4.64117 |16.3|10.878|
|5|6.90622| 22.9 |20.62|
|10|7.9758 |40.6|34.36|
#### MobileNet
|batch_size|PaddlePaddle(ms)|Pytorch(ms)|TensorFlow(ms)|
|---|---|---|---|
|1| 1.7541 | 7.8 |2.72|
|5| 3.04666 | 7.8 |3.19|
|10|4.19478 | 14.47 |4.25|
#### ResNet101
|batch_size|PaddlePaddle(ms)|Pytorch(ms)|TensorFlow(ms)|
|---|---|---|---|
|1|8.95767| 22.48 |18.78|
|5|12.9811 | 33.88 |34.84|
|10|14.1463| 61.97 |57.94|
#### Inception v3
|batch_size|PaddlePaddle(ms)|Pytorch(ms)|TensorFlow(ms)|
|---|---|---|---|
|1|15.1613 | 24.2 |19.1|
|5|18.5373 | 34.8 |27.2|
|10|19.2781| 54.8 |36.7|
# Use Paddle-TensorRT Library for inference
NVIDIA TensorRT is a platform for high-performance deep learning inference. It delivers low latency and high throughput for deep learning inference applications.
PaddlePaddle integrates TensorRT in the form of subgraphs, which enables the TensorRT module to enhance the inference performance of Paddle models. The module is still under development. Currently supported models are listed below:
|classification|detection|segmentation|
|---|---|---|
|mobilenetv1|yolov3|ICNET|
|resnet50|SSD||
|vgg16|mask-rcnn||
|resnext|faster-rcnn||
|AlexNet|cascade-rcnn||
|Se-ResNext|retinanet||
|GoogLeNet|mobilenet-SSD||
|DPN|||
This document introduces how to obtain and use the Paddle-TensorRT library and explains how it works.
**Note:**
1. When compiling from source, the TensorRT inference library currently only supports GPU builds, and you need to set the compilation option TENSORRT_ROOT to the path where TensorRT is installed.
2. Windows support requires TensorRT version 5.0 or higher.
3. Paddle-TRT currently only supports fixed input shape.
4. After downloading and installing TensorRT, you need to manually add virtual destructors for `class IPluginFactory` and `class IGpuAllocator` in the `NvInfer.h` file:
``` c++
virtual ~IPluginFactory() {};
virtual ~IGpuAllocator() {};
```
## <a name="Paddle-TRT interface usage">Paddle-TRT interface usage</a>
When using AnalysisPredictor, we enable Paddle-TRT by setting
``` c++
config->EnableTensorRtEngine(1 << 20 /* workspace_size*/,
batch_size /* max_batch_size*/,
3 /* min_subgraph_size*/,
AnalysisConfig::Precision::kFloat32 /* precision*/,
false /* use_static*/,
false /* use_calib_mode*/);
```
The parameters of this interface are described below:
- **`workspace_size`**: type:int, default is 1 << 20. Sets the max workspace size of TRT. TensorRT will choose kernels under this constraint.
- **`max_batch_size`**: type:int, default is 1. Sets the max batch size. Batch sizes during runtime cannot exceed this value.
- **`min_subgraph_size`**: type:int, default is 3. Subgraph is used to integrate TensorRT in PaddlePaddle. To avoid low performance, Paddle-TRT is only enabled when the number of nodes in the subgraph is more than `min_subgraph_size`.
- **`precision`**: type:`enum class Precision {kFloat32 = 0, kHalf, kInt8,};`, default is `AnalysisConfig::Precision::kFloat32`. Sets the precision of TRT, supporting FP32(kFloat32), FP16(kHalf), Int8(kInt8). Using Paddle-TRT int8 calibration requires setting `precision` to `AnalysisConfig::Precision::kInt8`, and `use_calib_mode` to true.
- **`use_static`**: type:bool, default is false. If set to true, Paddle-TRT will serialize optimization information to disk, to deserialize next time without optimizing again.
- **`use_calib_mode`**: type:bool, default is false. Using Paddle-TRT int8 calibration requires setting this option to true.
**Note:** Paddle-TRT currently only supports fixed input shape.
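
The same switches are also exposed through the Python inference API. Below is a minimal sketch, assuming the `AnalysisConfig` bindings in `paddle.fluid.core` available in releases of this period; the model/params paths are those of the sample in the next section, and exact argument names may differ slightly across versions, so check the API reference of your release:

``` python
from paddle.fluid.core import AnalysisConfig, create_paddle_predictor

# The sample's mobilenetv1 directory contains a combined `model`/`params` pair.
config = AnalysisConfig("mobilenetv1/model", "mobilenetv1/params")
config.enable_use_gpu(100, 0)  # 100 MB initial GPU memory pool, GPU card 0
config.enable_tensorrt_engine(
    workspace_size=1 << 20,
    max_batch_size=1,
    min_subgraph_size=3,
    precision_mode=AnalysisConfig.Precision.Float32,
    use_static=False,
    use_calib_mode=False)

predictor = create_paddle_predictor(config)
```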
## <a name="Paddle-TRT example compiling test">Paddle-TRT example compiling test</a>
1. Download or compile Paddle Inference with TensorRT support, refer to [Install and Compile C++ Inference Library](../../inference_deployment/inference/build_and_install_lib_en.html).
2. Download NVIDIA TensorRT(with consistent version of cuda and cudnn in local environment) from [NVIDIA TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) with an NVIDIA developer account.
3. Download [Paddle Inference sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz) and uncompress, and enter `sample/paddle-TRT` directory.
`paddle-TRT` directory structure is as following:
```
paddle-TRT
├── CMakeLists.txt
├── mobilenet_test.cc
├── fluid_generate_calib_test.cc
├── fluid_int8_test.cc
├── mobilenetv1
│ ├── model
│ └── params
├── run.sh
└── run_impl.sh
```
- `mobilenet_test.cc` is the c++ source code of inference using Paddle-TRT
- `fluid_generate_calib_test.cc` is the c++ source code of inference using Paddle-TRT int8 calibration to generate calibration table
- `fluid_int8_test.cc` is the c++ source code of inference using Paddle-TRT int8
- `mobilenetv1` is the model dir
- `run.sh` is the script for running inference
Here we assume that the current directory is `SAMPLE_BASE_DIR/sample/paddle-TRT`.
``` shell
# set whether to enable MKL, GPU or TensorRT. Enabling TensorRT requires WITH_GPU being ON
WITH_MKL=ON
WITH_GPU=ON
USE_TENSORRT=ON
# set path to CUDA lib dir, CUDNN lib dir, TensorRT root dir and model dir
LIB_DIR=YOUR_LIB_DIR
CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR
CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR
TENSORRT_ROOT_DIR=YOUR_TENSORRT_ROOT_DIR
MODEL_DIR=YOUR_MODEL_DIR
```
Please configure `run.sh` depending on your environment.
4. Build and run the sample.
``` shell
sh run.sh
```
## <a name="Paddle-TRT INT8 usage">Paddle-TRT INT8 usage</a>
1. Paddle-TRT INT8 introduction
The parameters of a neural network are redundant to some extent. In many tasks, we can convert a Float32 model into an Int8 model while preserving accuracy. At present, Paddle-TRT supports converting a trained Float32 model into an Int8 model offline. The specific process is as follows:
1) **Create the calibration table**. We prepare about 500 real input samples and feed them into the model. Paddle-TRT collects the range information of each op's input and output values and records it in the calibration table. This information reduces the precision loss during model conversion.
2) After creating the calibration table, run the model again; **Paddle-TRT will load the calibration table automatically** and run inference in INT8 mode.
2. Compile and test the INT8 example
Change `mobilenet_test` in `run.sh` to `fluid_generate_calib_test` and run
``` shell
sh run.sh
```
In this sample, 500 random inputs are generated to simulate the calibration process; for real workloads it is recommended to use real samples. After the run finishes, a new file named trt_calib_* will appear under the `SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/_opt_cache` model directory, which is the calibration table.
Then copy the model directory with the calibration information to a new path:
``` shell
cp -rf SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/ SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib
```
Change `fluid_generate_calib_test` in `run.sh` to `fluid_int8_test`, change the model dir path to `SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib`, and run
``` shell
sh run.sh
```
## <a name="Paddle-TRT subgraph operation principle">Paddle-TRT subgraph operation principle</a>
Subgraph is used to integrate TensorRT in PaddlePaddle. After the model is loaded, the neural network is represented as a computing graph composed of variables and computing nodes. Paddle-TensorRT scans the whole graph, discovers subgraphs that can be optimized with TensorRT, and replaces them with TensorRT nodes. During inference, when a TensorRT node is encountered, Paddle calls the TensorRT library to execute it, while other nodes are executed by Paddle's native implementation. TensorRT can fuse Ops horizontally and vertically, filter out redundant Ops, and choose appropriate kernels for specific Ops on the target platform, which speeds up model inference.
A simple model expresses the process :
**Original Network**
<p align="center">
<img src="https://raw.githubusercontent.com/NHZlX/FluidDoc/add_trt_doc/doc/fluid/user_guides/howto/inference/image/model_graph_original.png" width="600">
</p>
**Transformed Network**
<p align="center">
<img src="https://raw.githubusercontent.com/NHZlX/FluidDoc/add_trt_doc/doc/fluid/user_guides/howto/inference/image/model_graph_trt.png" width="600">
</p>
We can see in the Original Network that green nodes represent nodes supported by TensorRT, red nodes represent variables in the network, and yellow nodes represent nodes that can only be executed by Paddle's native implementation. The green nodes in the original network are extracted into a subgraph, which is replaced by a single TensorRT node, becoming the `block-25` node in the transformed network. When this node is encountered at runtime, the TensorRT library is called to execute it.
## <a name="Paddle-TRT benchmark">Paddle-TRT benchmark</a>
### Test Environment
- CPU: Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz, GPU: Tesla P4
- TensorRT 4.0, CUDA 8.0, CUDNN V7
- models: ResNet50, MobileNet, ResNet101, Inception V3.
### Tested frameworks
**PaddlePaddle, Pytorch, Tensorflow**
- PaddlePaddle integrates TensorRT with subgraph, model[link](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models)
- Pytorch uses original kernels, model[link1](https://github.com/pytorch/vision/tree/master/torchvision/models), [link2](https://github.com/marvis/pytorch-mobilenet)
- We tested both native TensorFlow and TF-TRT. **The TF-TRT tests did not meet expectations and will be supplemented later.** Model [link](https://github.com/tensorflow/models)
#### ResNet50
|batch_size|PaddlePaddle(ms)|Pytorch(ms)|TensorFlow(ms)|
|---|---|---|---|
|1|4.64117 |16.3|10.878|
|5|6.90622| 22.9 |20.62|
|10|7.9758 |40.6|34.36|
#### MobileNet
|batch_size|PaddlePaddle(ms)|Pytorch(ms)|TensorFlow(ms)|
|---|---|---|---|
|1| 1.7541 | 7.8 |2.72|
|5| 3.04666 | 7.8 |3.19|
|10|4.19478 | 14.47 |4.25|
#### ResNet101
|batch_size|PaddlePaddle(ms)|Pytorch(ms)|TensorFlow(ms)|
|---|---|---|---|
|1|8.95767| 22.48 |18.78|
|5|12.9811 | 33.88 |34.84|
|10|14.1463| 61.97 |57.94|
#### Inception v3
|batch_size|PaddlePaddle(ms)|Pytorch(ms)|TensorFlow(ms)|
|---|---|---|---|
|1|15.1613 | 24.2 |19.1|
|5|18.5373 | 34.8 |27.2|
|10|19.2781| 54.8 |36.7|
.. _api_guide_cpu_training_best_practice:
####################
分布式CPU训练优秀实践
####################
提高CPU分布式训练的训练速度,主要要从四个方面来考虑:
1)提高训练速度,主要是提高CPU的使用率;2)提高通信速度,主要是减少通信传输的数据量;3)提高数据IO速度;4)更换分布式训练策略,提高分布式训练速度。
提高CPU的使用率
=============
提高CPU使用率主要依赖 :code:`ParallelExecutor`,可以充分利用多个CPU的计算能力来加速计算。
API详细使用方法参考 :ref:`cn_api_fluid_ParallelExecutor` ,简单实例用法:
.. code-block:: python
# 配置执行策略,主要是设置线程数
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = 8
# 配置构图策略,对于CPU训练而言,应该使用Reduce模式进行训练
build_strategy = fluid.BuildStrategy()
if int(os.getenv("CPU_NUM")) > 1:
build_strategy.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.Reduce
pe = fluid.ParallelExecutor(
use_cuda=False,
loss_name=avg_cost.name,
main_program=main_program,
build_strategy=build_strategy,
exec_strategy=exec_strategy)
以上参数中:
- :code:`num_threads` : 模型训练使用的线程数,最好和训练所在机器的物理CPU核数接近
- :code:`reduce_strategy` : 对于CPU训练而言,应该选择 fluid.BuildStrategy.ReduceStrategy.Reduce
通用环境变量配置:
- :code:`CPU_NUM` :模型副本replica的个数,最好和num_threads一致
提高通信速度
==========
要减少通信数据量,提高通信速度,主要是使用稀疏更新 ,目前支持 :ref:`api_guide_sparse_update` 的主要是 :ref:`cn_api_fluid_layers_embedding` 。
.. code-block:: python
data = fluid.layers.data(name='ids', shape=[1], dtype='int64')
fc = fluid.layers.embedding(input=data, size=[dict_size, 16], is_sparse=True)
以上参数中:
- :code:`is_sparse` : 配置embedding使用稀疏更新,如果embedding的dict_size很大,而每次数据data很少,建议使用sparse更新方式。
提高数据IO速度
==========
要提高CPU分布式的数据IO速度,可以首先考虑使用dataset API进行数据读取。 dataset是一种多生产者多消费者模式的数据读取方法,默认情况下耦合数据读取线程与训练线程,在多线程的训练中,dataset表现出极高的性能优势。
API接口介绍可以参考:https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dataset_cn/QueueDataset_cn.html
结合实际的网络,比如CTR-DNN模型,引入的方法可以参考:https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/ctr/dnn
最后使用 :code:`train_from_dataset` 接口来进行网络的训练:
.. code-block:: python
dataset = fluid.DatasetFactory().create_dataset()
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
exe.train_from_dataset(program=fluid.default_main_program(),dataset=dataset)
更换分布式训练策略
==========
CPU分布式训练速度进一步提高的核心在于选择合适的分布式训练策略,比如定义通信策略、编译策略、执行策略等等。PaddlePaddle于v1.7版本发布了 :code:`DistributedStrategy` 功能,可以十分灵活且方便地指定分布式运行策略。
首先需要在代码中引入相关库:
.. code-block:: python
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory
然后指定CPU分布式运行的训练策略,目前可选配置有四种:同步训练(Sync)、异步训练(Async)、半异步训练(Half-Async)以及GEO训练。不同策略的细节,可以查看设计文档:
https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler/transpiler_cpu.md
通过如下代码引入上述策略的默认配置,并进行CPU分布式训练:
.. code-block:: python
# step1: 引入CPU分布式训练策略
# 同步训练策略
strategy = DistributedStrategyFactory.create_sync_strategy()
# 半异步训练策略
strategy = DistributedStrategyFactory.create_half_async_strategy()
# 异步训练策略
strategy = DistributedStrategyFactory.create_async_strategy()
# GEO训练策略
strategy = DistributedStrategyFactory.create_geo_strategy(update_frequency=400)
# step2: 定义节点角色
role = role_maker.PaddleCloudRoleMaker()
fleet.init(role)
# step3: 分布式训练program构建
optimizer = fluid.optimizer.SGD(learning_rate) # 以SGD优化器为例
optimizer = fleet.distributed_optimizer(optimizer, strategy)
optimizer.minimize(loss)
# step4.1: 启动参数服务器节点(Server)
if fleet.is_server():
fleet.init_server()
fleet.run_server()
# step4.2: 启动训练节点(Trainer)
elif fleet.is_worker():
fleet.init_worker()
exe.run(fleet.startup_program)
# Do training
exe.run(fleet.main_program)
fleet.stop_worker()
PaddlePaddle支持对训练策略中的细节进行调整:
- 创建compiled_program所需的build_strategy及exec_strategy可以直接基于strategy获得
.. code-block:: python
compiled_program = fluid.compiler.CompiledProgram(fleet.main_program).with_data_parallel(
loss_name=loss.name,
build_strategy=strategy.get_build_strategy(),
exec_strategy=strategy.get_execute_strategy())
- 自定义训练策略细节,支持对DistributeTranspilerConfig、TrainerRuntimeConfig、ServerRuntimeConfig、fluid.ExecutionStrategy、fluid.BuildStrategy进行自定义配置。以DistributeTranspilerConfig为例,修改方式如下所示:
.. code-block:: python
strategy = DistributedStrategyFactory.create_sync_strategy()
# 方式一(推荐):
config = strategy.get_program_config()
config.min_block_size = 81920
# 方式二:调用set_program_config修改组网相关配置,支持DistributeTranspilerConfig和dict两种数据类型
config = DistributeTranspilerConfig()
config.min_block_size = 81920
# config = dict()
# config['min_block_size'] = 81920
strategy.set_program_config(config)
.. _api_guide_cpu_training_best_practice_en:
######################################################
Best practices of distributed training on CPU
######################################################
To improve the training speed of CPU distributed training, we must consider four aspects:
1. Improve the training speed mainly by improving utilization rate of CPU;
2. Improve the communication speed mainly by reducing the amount of data transmitted in the communication;
3. Improve the data IO speed by dataset API;
4. Improve the distributed training speed by changing distributed training strategy.
Improve CPU utilization
=============================
The CPU utilization mainly depends on :code:`ParallelExecutor`, which can make full use of the computing power of multiple CPUs to speed up the calculation.
For detailed API usage, please refer to :ref:`api_fluid_ParallelExecutor` . A simple example:
.. code-block:: python
# Configure the execution strategy, mainly to set the number of threads
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = 8
# Configure the composition strategy, for CPU training, you should use the Reduce mode for training.
build_strategy = fluid.BuildStrategy()
if int(os.getenv("CPU_NUM")) > 1:
build_strategy.reduce_strategy=fluid.BuildStrategy.ReduceStrategy.Reduce
pe = fluid.ParallelExecutor(
use_cuda=False,
loss_name=avg_cost.name,
main_program=main_program,
build_strategy=build_strategy,
exec_strategy=exec_strategy)
Among the parameters above:
- :code:`num_threads` : the number of threads used by the model training. It is preferably close to the number of the physical CPU cores of the machine where the training is performed.
- :code:`reduce_strategy` : For CPU training, you should choose fluid.BuildStrategy.ReduceStrategy.Reduce
Configuration of general environment variables:
- :code:`CPU_NUM`: The number of replicas of the model, preferably the same as num_threads
Improve communication speed
==============================
Reducing the amount of communication data to improve communication speed is mainly achieved by using sparse updates; currently, `sparse update <../layers/sparse_update_en.html>`_ is mainly supported by :ref:`api_fluid_layers_embedding`.
.. code-block:: python
data = fluid.layers.data(name='ids', shape=[1], dtype='int64')
fc = fluid.layers.embedding(input=data, size=[dict_size, 16], is_sparse=True)
Among the parameters above:
- :code:`is_sparse`: configures the embedding to use sparse updates. If the embedding's dict_size is large but only a small amount of data is used in each pass, the sparse update method is recommended.
Improve data IO speed
==============================
To improve the data IO speed of CPU distributed training, first consider using the dataset API as the data reader. Dataset is a multi-producer, multi-consumer data reading method that by default couples the data reading threads with the training threads. In multi-threaded training, dataset shows a significant performance advantage.
Refer to this page for API introduction: https://www.paddlepaddle.org.cn/documentation/docs/en/api/dataset/QueueDataset.html
Combined with the actual model CTR-DNN, you can learn more about how to use dataset: https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/ctr/dnn
Finally, use :code:`train_from_dataset` to train the network:
.. code-block:: python
dataset = fluid.DatasetFactory().create_dataset()
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
exe.train_from_dataset(program=fluid.default_main_program(),dataset=dataset)
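
The snippet above assumes the dataset has already been configured. In practice, a QueueDataset also needs its feed variables, reader command and file list set before training starts. A sketch under those assumptions (the variables :code:`data` and :code:`label`, the reader script and the file names are placeholders):

.. code-block:: python

    import paddle.fluid as fluid

    data = fluid.layers.data(name='ids', shape=[1], dtype='int64', lod_level=1)
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
    # ... build the network and loss on top of data/label ...

    dataset = fluid.DatasetFactory().create_dataset("QueueDataset")
    dataset.set_use_var([data, label])                # feed variables, in feeding order
    dataset.set_pipe_command("python my_reader.py")   # hypothetical per-line parsing script
    dataset.set_batch_size(32)
    dataset.set_thread(8)
    dataset.set_filelist(["train_data/part-000", "train_data/part-001"])

    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(fluid.default_startup_program())
    exe.train_from_dataset(program=fluid.default_main_program(), dataset=dataset)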
Change distributed training strategy
==============================
The core of improving CPU distributed training speed is to choose an appropriate distributed training strategy, which covers the communication strategy, the compilation strategy, the execution strategy and so on. PaddlePaddle released the :code:`DistributedStrategy` API in v1.7, which makes it very flexible and convenient to specify the distributed training strategy.
First, we need to introduce relevant libraries into the code:
.. code-block:: python
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory
At present, there are four kinds of training strategies: synchronous (Sync), asynchronous (Async), half-asynchronous (Half-Async) and GEO training. For details of the different strategies, you can view the design document:
https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler/transpiler_cpu.md
The default configuration of the above strategies can be applied with the following code:
.. code-block:: python
# step1: get distributed strategy
# Sync
strategy = DistributedStrategyFactory.create_sync_strategy()
# Half-Async
strategy = DistributedStrategyFactory.create_half_async_strategy()
# Async
strategy = DistributedStrategyFactory.create_async_strategy()
# GEO
strategy = DistributedStrategyFactory.create_geo_strategy(update_frequency=400)
# step2: define role of node
role = role_maker.PaddleCloudRoleMaker()
fleet.init(role)
# step3: get distributed training program
    optimizer = fluid.optimizer.SGD(learning_rate) # take the SGD optimizer as an example
optimizer = fleet.distributed_optimizer(optimizer, strategy)
optimizer.minimize(loss)
# step4.1: run parameter server node
if fleet.is_server():
fleet.init_server()
fleet.run_server()
# step4.2: run worker node
elif fleet.is_worker():
fleet.init_worker()
exe.run(fleet.startup_program)
# Do training
exe.run(fleet.main_program)
fleet.stop_worker()
PaddlePaddle supports adjusting the details of the training strategy:
- The build_strategy and exec_strategy used to create the compiled_program can be generated directly from the strategy:
.. code-block:: python
compiled_program = fluid.compiler.CompiledProgram(fleet.main_program).with_data_parallel(
loss_name=loss.name,
build_strategy=strategy.get_build_strategy(),
exec_strategy=strategy.get_execute_strategy())
- The details of the training strategy can be customized. PaddlePaddle supports customized configuration of DistributeTranspilerConfig, TrainerRuntimeConfig, ServerRuntimeConfig, fluid.ExecutionStrategy and fluid.BuildStrategy. Take DistributeTranspilerConfig as an example; the modification is as follows:
.. code-block:: python
strategy = DistributedStrategyFactory.create_sync_strategy()
# Mode 1 (recommended):
config = strategy.get_program_config()
config.min_block_size = 81920
    # Mode 2: call set_program_config, which accepts either a DistributeTranspilerConfig or a dict
config = DistributeTranspilerConfig()
config.min_block_size = 81920
# config = dict()
# config['min_block_size'] = 81920
strategy.set_program_config(config)
.. _best_practice_dist_training_gpu:
#####################
分布式GPU训练优秀实践
#####################
开始优化您的GPU分布式训练任务
---------------------------
PaddlePaddle Fluid支持在现代GPU [#]_ 服务器集群上完成高性能分布式训练。通常可以通过以下方法优化在多机多卡环境训练性能,建议在进行性能优化时,检查每项优化点并验证对应提升,从而提升最终的性能。
一个简单的验证当前的训练程序是否需要进一步优化性能的方法,是查看GPU的计算利用率 [#]_ ,通常用 :code:`nvidia-smi` 命令查看。如果GPU利用率较低,则可能存在较大的优化空间。下面主要从数据准备、训练策略设置和训练方式三个方面介绍GPU分布式训练中常用的优化方法。
1、数据准备
===========
数据读取的优化在GPU训练中至关重要,尤其在不断增加batch_size提升吞吐时,计算对reader性能会有更高的要求,优化reader性能需要考虑的点包括:
- 使用 :code:`DataLoader` 。参考 `这里 <https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/io_cn/DataLoader_cn.html#dataloader>`_ 使用DataLoader,并建议开启 :code:`use_double_buffer` 。
- reader返回uint8类型数据。图片在解码后一般会以uint8类型存储,如果在reader中转换成float类型数据,会将数据体积扩大4倍。直接返回uint8数据,然后在GPU上转化成float类型进行训练可以提升数据读取效率。
- 减少reader初始化时间 (infinite read)。在训练任务开始执行第一轮训练时,reader开始不断异步地从磁盘或其他存储中读取数据并执行预处理,然后将处理好的数据填充到队列中供计算使用。从0开始填充这个队列直到数据可以源源不断供给计算,需要一定时间的预热。因此,如果每轮训练都重新填充队列,会产生额外的时间开销。在使用DataLoader时,可以让reader函数不断地产生数据,直到训练循环结束:
.. code-block:: python
:linenos:
    def infinite_reader(file_path):
        while True:
            with open(file_path) as fn:
                for line in fn:
                    yield process(line)

    def train():
        ...
        for pass_id in range(NUM_PASSES):
            if pass_id == 0:
                data_loader.start()
            for batch_id in range(iters_per_pass):
                exe.run()
        data_loader.reset()
另外,可以使用DALI库提升数据处理性能。DALI是NVIDIA开发的数据加载库,更多内容请参考 `官网文档 <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html>`_ 。飞桨中如何结合使用DALI库请参考 `使用示例 <https://github.com/PaddlePaddle/Fleet/tree/develop/benchmark/collective/resnet>`_ 。
2、训练策略设置
===========
训练参数设置表
.. csv-table::
:header: "选项", "类型", "默认值", "说明"
:widths: 3, 3, 3, 5
":code:`num_threads`", "int", "1", "CPU线程数"
":code:`nccl_comm_num`", "int", "1", "nccl通信器数量"
":code:`fuse_all_reduce_ops`", "bool", "False", "多卡训练时,将AllReduce操纵进行融合"
":code:`use_hierarchical_allreduce` ", "bool", "False", "分级式reduce"
":code:`num_iteration_per_drop_scope`", "int", "1", "scope drop频率,设置每隔几个batch的迭代之后执行一次清理scope"
":code:`fetch_frequency`", "int", "1", "fetch的刷新频率"
":code:`fuse_bn_act_ops`", "bool", "False", "是否开启batch normalization和激活函数的融合"
":code:`fuse_elewise_add_act_ops`", "bool", "False", "是否开启elementwise add函数和激活函数的融合"
说明:
- 关于设置合适的CPU线程数 :code:`num_threads` 和nccl通信器数量 :code:`nccl_comm_num` 。PaddlePaddle Fluid使用“线程池” [#]_ 模型调度并执行Op,Op在启动GPU计算之前,通常需要CPU的协助,然而如果Op本身占用时间很小,“线程池”模型下又会带来额外的调度开销。使用多进程模式时,如果神经网络的计算图 [#]_ 节点间有较高的并发度,即使每个进程只在一个GPU上运行,使用多个线程可以更大限度的提升GPU利用率。nccl通信器数量 :code:`nccl_comm_num` 可以加快GPU之间的通信效率,建议单机设置为1,多机设置为2。针对CPU线程数 :code:`num_threads` ,建议单机设置为1,多机设置为 :code:`nccl_comm_num` +1。
- 关于AllReduce融合 :code:`fuse_all_reduce_ops` ,默认情况下会将同一layer中参数的梯度的AllReduce操作合并成一个,比如对于 :code:`fluid.layers.fc` 中有Weight和Bias两个参数,打开该选项之后,原本需要两次AllReduce操作,现在只用一次AllReduce 操作。此外,为支持更大粒度的参数梯度融合,Paddle提供了 :code:`FLAGS_fuse_parameter_memory_size` 和 :code:`FLAGS_fuse_parameter_groups_size` 两个环境变量选项。用户可以指定融合AllReduce操作之后,每个AllReduce操作的梯度字节数,比如希望每次AllReduce调用传输16MB的梯度,:code:`export FLAGS_fuse_parameter_memory_size=16` ,经验值为总通信量的十分之一。可以指定每次AllReduce操作的最大层数,即到达该层数就进行AllReduce,如指定50层 :code:`export FLAGS_fuse_parameter_groups_size=50` 。注意:目前不支持sparse参数梯度。
- 关于使用分级式reduce :code:`use_hierarchical_allreduce` 。对于多机模式,针对小数据量的通信,Ring AllReduce通信效率低,采用Hierarchical AllReduce可以解决该问题。
- 关于降低scope drop频率 :code:`num_iteration_per_drop_scope` 和fetch频率 :code:`fetch_frequency` 。减少scope drop和fetch频率,可以减少频繁的变量内存申请、释放和拷贝,从而提升性能。
- 关于操作融合:通过参数融合可以提升训练性能。
设置这些参数可以参考:
.. code-block:: python
:linenos:
dist_strategy = DistributedStrategy()
dist_strategy.nccl_comm_num = 2 #建议多机设置为2,单机设置为1
exec_strategy = fluid.ExecutionStrategy()
    exec_strategy.num_threads = 3 #建议多机设置为nccl_comm_num+1,单机设置为1
exec_strategy.num_iteration_per_drop_scope = 30 #scope drop频率
dist_strategy.exec_strategy = exec_strategy
dist_strategy.fuse_all_reduce_ops = True #AllReduce是否融合
...
with fluid.program_guard(main_prog, startup_prog): #组网
params = model.params
optimizer = optimizer_setting(params)
dist_optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
dist_optimizer.minimize(avg_cost)
...
for pass_id in range(PASS_NUM):
batch_id = 0
while True:
if batch_id % fetch_frequency == 0: #fetch频率
fetched = exe.run(main_prog, fetch_list)
else:
exe.run([])
3、训练方式
===========
1、Local SGD
GPU多机多卡同步训练过程中存在慢trainer现象,即每步中训练快的trainer的同步通信需要等待训练慢的trainer。由于每步中慢trainer的rank具有随机性,因此我们使用局部异步训练的方式——LocalSGD,通过多步异步训练(无通信阻塞)实现慢trainer时间均摊,从而提升同步训练性能。Local SGD训练方式主要有三个参数,分别是:
.. csv-table::
:header: "选项", "类型", "可选值", "说明"
:widths: 3, 3, 3, 5
":code:`use_local_sgd`", "bool", "False/True", "是否开启Local SGD,默认不开启"
":code:`local_sgd_is_warm_steps`", "int", "大于0", "训练多少轮之后才使用Local SGD方式训练"
":code:`local_sgd_steps`", "int", "大于0", "Local SGD的步长"
说明:
- Local SGD的warmup步长 :code:`local_sgd_is_warm_steps` 影响最终模型的泛化能力,一般需要等到模型参数稳定之后再进行Local SGD训练,经验值可以将学习率第一次下降时的epoch作为warmup步长,之后再进行Local SGD训练。
- Local SGD步长 :code:`local_sgd_steps` ,一般该值越大,通信次数越少,训练速度越快,但随之而来的是模型精度下降。经验值设置为2或者4。
具体的Local SGD的训练代码可以参考:https://github.com/PaddlePaddle/Fleet/tree/develop/examples/local_sgd/resnet
2、使用混合精度训练
V100 GPU提供了 `Tensor Core <https://www.nvidia.com/en-us/data-center/tensorcore/>`_ 可以在混合精度计算场景极大的提升性能。使用混合精度计算的例子可以参考:https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification#using-mixed-precision-training
目前Paddle只提供了两个模型(ResNet, BERT)的混合精度计算实现,并支持static loss scaling,其他模型使用混合精度也可以参考以上实现完成验证。
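
若想在自定义组网中尝试混合精度训练,可参考如下示意(这里假设使用 :code:`fluid.contrib.mixed_precision.decorate` 接口包装优化器,组网仅为演示用的简单全连接网络,具体参数请以所用版本的API文档为准):

.. code-block:: python

    import paddle.fluid as fluid
    from paddle.fluid.contrib.mixed_precision import decorate

    # 演示用的简单组网
    image = fluid.layers.data(name='image', shape=[784], dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
    fc = fluid.layers.fc(input=image, size=10, act='softmax')
    avg_cost = fluid.layers.mean(fluid.layers.cross_entropy(input=fc, label=label))

    # 用AMP装饰器包装普通的Momentum优化器(示意)
    optimizer = fluid.optimizer.Momentum(learning_rate=0.001, momentum=0.9)
    mp_optimizer = decorate(optimizer,
                            init_loss_scaling=128.0,
                            use_dynamic_loss_scaling=True)
    mp_optimizer.minimize(avg_cost)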
附录
----
.. [#] 现代GPU:指至少支持运行 `CUDA <https://developer.nvidia.com/cuda-downloads>`_ 版本7.5以上的GPU
.. [#] GPU利用率:这里指GPU计算能力被使用部分所占的百分比
.. [#] https://en.wikipedia.org/wiki/Thread_pool
.. [#] https://en.wikipedia.org/wiki/Data-flow_diagram
# 低配网络的分布式GPU训练
## 1. 背景
大规模分布式训练需要较高的网络带宽以便进行梯度的聚合更新,这限制了多节点训练的可扩展性,同时也需要昂贵的高带宽设备。在低带宽的云网络等环境下,这一问题会更加严重。现有[Deep Gradient Compression](https://arxiv.org/abs/1712.01887)研究表明,分布式SGD中有99.9%的梯度交换都是冗余的,可以使用深度梯度压缩选择重要梯度进行通信来减少通信量,降低对通信带宽的依赖。Paddle目前实现了DGC的稀疏通信方式,可有效地在低配网络下进行GPU分布式训练。下面将介绍DGC稀疏通信方式的使用方法、适用场景及基本原理。
## 2. 使用方法
`注意:使用DGC请使用1.6.2及其之后的版本,之前版本存在若干bug。`
DGC稀疏通信算法以DGCMomentumOptimizer接口的形式提供,目前只支持GPU多卡及GPU多机分布式,由于现有fuse策略会造成DGC失效,所以使用DGC时需设置`strategy.fuse_all_reduce_ops=False`关闭fuse。DGC只支持Momentum优化器,使用时把当前代码中的Momentum优化器替换为DGCMomentumOptimizer,并添加DGC所需参数即可。如下代码所示,其中rampup_begin_step表示从第几步开始使用DGC,更详细参数可见[api文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/optimizer_cn/DGCMomentumOptimizer_cn.html#dgcmomentumoptimizer)
``` python
import paddle.fluid as fluid
# optimizer = fluid.optimizer.Momentum(learning_rate=0.001, momentum=0.9)
# 替换Momentum优化器,添加DGC所需参数
optimizer = fluid.optimizer.DGCMomentumOptimizer(
learning_rate=0.001, momentum=0.9, rampup_begin_step=0)
optimizer.minimize(cost)
```
在fleet中我们提供了[DGC的示例](https://github.com/PaddlePaddle/Fleet/tree/develop/examples/dgc_example)。示例中以数字手写体识别为例,将程序移植为分布式版本(注:DGC亦支持单机多卡),再加上DGC优化器。可参照此示例将单机单卡程序迁移到DGC。在单机单卡迁移到DGC过程中,一般需要先对齐多机Momentum的精度,再对齐DGC的精度。
## 3. 调参&适用场景
### 3.1 预热调参
对于正常的训练,使用DGC一般需进行预热训练,否则可能会有精度损失。如下图是ResNet50模型Imagenet数据集的训练结果,未进行预热训练的DGC最终损失了约0.3%的精度。
<div align=center>
![DGC Resnet50 acc1](images/dgc_resnet50_acc1.png)
</div>
预热训练调参可参照论文的设置。对图像分类,论文在Cifar10和ImageNet数据集上共164和90个epochs的训练中都采用了4个epochs的预热训练。在语言模型PTB数据集上,在共40个epochs的训练中选择了1个epoch进行预热训练。在语音识别AN4数据集上,80个epochs中选择1个epoch进行预热训练。
论文中使用了75%, 93.75%, 98.4375%, 99.6%, 99.9%稀疏度逐渐提升的策略。由于paddle稀疏梯度聚合通信使用了AllGather,通信量会随卡数增加而增长,所以在卡数较多时不推荐较低稀疏度的预热训练。如75%稀疏度时每张卡会选择25%的梯度进行通信,卡数为32时通信量是正常dense通信的32\*(1-0.75)=8倍,所以前几个epoch使用正常的dense通信为佳。可参照如下写法
``` python
# 1. 以1252个step为一个epoch,前2个epochs使用正常dense通信,后3个epochs逐步提升稀疏度为99.9%
optimizer = fluid.optimizer.DGCMomentumOptimizer(
learning_rate=0.001, momentum=0.9, rampup_begin_step=1252*2,
rampup_step=1252*3, sparsity=[0.984375, 0.996, 0.999])
# 2. 前面4个epochs都使用dense通信,之后默认0.999稀疏度运行
optimizer = fluid.optimizer.DGCMomentumOptimizer(
learning_rate=0.001, momentum=0.9, rampup_begin_step=1252*4)
```
对于Fine-tuning训练,现测试可无需预热训练,从第0个epoch直接使用DGC即可。
``` python
# 从第0步开始DGC稀疏通信
optimizer = fluid.optimizer.DGCMomentumOptimizer(
learning_rate=0.001, momentum=0.9, rampup_begin_step=0)
```
### 3.2 适用场景
DGC稀疏通信在低带宽、通信成为瓶颈时会有较大的性能提升;但在单机多卡及RDMA网络等通信并非瓶颈的情况下,并不会带来性能上的提升。同时由于AllGather的通信量会随卡数的增多而增大,所以DGC的多机训练规模也不宜过大(例如不建议超过128张卡)。综上,DGC适用于低配网络且节点规模不过大的场景;在云网络环境或高带宽网络设备昂贵时,DGC可有效降低训练成本。
## 4. 原理
本节原理部分基本来自[Deep Gradient Compression](https://arxiv.org/abs/1712.01887)论文,本文进行了部分理解翻译,英文较好者建议直接阅读论文。
### 4.1 梯度稀疏
DGC的基本思路是只传送重要梯度,即只发送大于给定阈值的梯度,以减少通信带宽的使用。为避免信息丢失,DGC会将其余梯度在本地累加起来,这些梯度最终会累积到足够大,从而参与传输。
换个角度,从理论依据上来看,局部梯度累加等同于随时间推移增加batch size,(DGC相当于每一个梯度有自己的batch size)。设定 $F(w)$ 为需要优化的loss函数,则有着N个训练节点的同步分布式SGD更新公式如下
$$
F(w)=\\frac{1}{|\\chi|}\\sum\_{x\\in\\chi}f(x, w), \\qquad w\_{t+1}=w\_{t}-\\eta\\frac{1}{N b}\\sum\_{k=1}^{N}\\sum\_{x\\in\\mathcal{B}\_{k,t}}\\nabla f\\left(x, w\_{t}\\right) \\tag{1}
$$
其中$\chi$是训练集,$w$是网络权值,$f(x, w)$是每个样本$x \in \chi$的loss,$\eta$是学习率,N是训练节点个数,$\mathcal{B}_{k, t}$代表第$k$个节点在第$t$个迭代时的minibatch,大小为b。
考虑权重的第i个值,在T次迭代后,可获得
$$
w\_{t+T}^{(i)}=w\_{t}^{(i)}-\\eta T \\cdot \\frac{1}{N b T} \\sum\_{k=1}^{N}\\left(\\sum\_{\\tau=0}^{T-1} \\sum\_{x \\in \\mathcal{B}\_{k, t+\\tau}} \\nabla^{(i)} f\\left(x, w\_{t+\\tau}\\right)\\right) \\tag{2}
$$
等式2表明局部梯度累加可以被认为batch size从$Nb$增大为$NbT$,其中T是$w^{(i)}$两次更新的稀疏通信间隔。
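为帮助理解“只发送大于阈值的梯度、其余梯度在本地累加”的过程,下面给出一个与框架无关的numpy示意(仅用于说明思路,并非Paddle中DGC的实际实现):

``` python
import numpy as np

def dgc_step(grad, residual, sparsity=0.999):
    """对单个参数的梯度做一次示意性的稀疏化:返回本步参与通信的稀疏梯度,并更新局部累加项。"""
    acc = residual + grad                              # 局部梯度累加
    k = max(1, int(acc.size * (1 - sparsity)))         # 只选出幅值最大的 (1 - sparsity) 比例的梯度
    thr = np.sort(np.abs(acc).ravel())[-k]             # 阈值:幅值第k大的梯度
    mask = np.abs(acc) >= thr
    send = acc * mask                                  # 大于阈值的梯度参与本步通信
    residual = acc * ~mask                             # 其余梯度留在本地继续累加
    return send, residual

residual = np.zeros(1000)
for step in range(5):
    grad = np.random.randn(1000) * 0.01
    send, residual = dgc_step(grad, residual)
    print(step, "本步通信的非零梯度个数:", np.count_nonzero(send))
```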
### 4.2 局部梯度累加改进
正常情况,稀疏更新会严重影响收敛性。DGC中采用动量修正(Momentum Correction)和局部梯度裁减(local gradient clipping)来解决这个问题。
#### 4.2.1 动量修正
有着N个节点分布式训练中vanilla momentum SGD公式,
$$
u\_{t}=m u\_{t-1}+\\sum\_{k=1}^{N}\\left(\\nabla\_{k, t}\\right), \\quad w\_{t+1}=w\_{t}-\\eta u\_{t} \\tag{3}
$$
其中$m$是动量因子,$N$是节点数,$\nabla_{k, t}=\frac{1}{N b} \sum_{x \in \mathcal{B}_{k, t}} \nabla f\left(x, w_{t}\right)$。
考虑第i个权重$w^{(i)}$,在T次迭代后,权重更新公式如下,
$$
w\_{t+T}^{(i)}=w\_{t}^{(i)}-\\eta\\left[\\cdots+\\left(\\sum\_{\\tau=0}^{T-2} m^{\\tau}\\right) \\nabla\_{k, t+1}^{(i)}+\\left(\\sum\_{\\tau=0}^{T-1} m^{\\tau}\\right) \\nabla\_{k, t}^{(i)}\\right] \\tag{4}
$$
如果直接应用动量SGD到稀疏梯度更新中,则有公式,
$$
v_{k, t}=v_{k, t-1}+\\nabla_{k, t}, \\quad u_{t}=m u_{t-1}+\\sum_{k=1}^{N} \\operatorname{sparse}\\left(v_{k, t}\\right), \\quad w_{t+1}=w_{t}-\\eta u_{t} \\tag{5}
$$
其中$v_k$是训练节点k上的局部梯度累加项。一旦$v_k$中的某些梯度超过给定阈值,就会被sparse()函数选出,在第二项中参与动量更新与通信,同时这些已发送的梯度会通过mask从局部累加项中清零。
$w^{(i)}$在T次稀疏更新后的权重为,
$$
w_{t+T}^{(i)}=w_{t}^{(i)}-\\eta\\left(\\cdots+\\nabla_{k, t+1}^{(i)}+\\nabla_{k, t}^{(i)}\\right) \\tag{6}
$$
相比传统动量SGD,方程6缺失了累积衰减因子$\sum_{\tau=0}^{T-1} m^{\tau}$,会导致收敛精度的损失。如下图(a)所示,正常的梯度更新从A点到B点,而方程6则从A点到C点。当稀疏度很高时,这会显著降低模型性能,所以需要在方程5的基础上对梯度进行修正。
<div align=center>
<img src=./images/dgc_without_momentum_correction.png width=400>
<img src=./images/dgc_with_momentum_correction.png width=400>
</div>
若将方程3中速度项$u_t$当作“梯度”,则方程3第二项可认为是在“梯度”$u_t$上应用传统SGD,前面已经证明了局部梯度累加在传统SGD上是有效的。因此,可以使用方程3局部累加速度项$u_t$而非累加真实的梯度$\nabla_{k, t}$来修正方程5,
$$
u_{k, t}=m u_{k, t-1}+\\nabla_{k, t}, \\quad v_{k, t}=v_{k, t-1}+u_{k, t}, \\quad w_{t+1}=w_{t}-\\eta \\sum_{k=1}^{N} \\operatorname{sparse}\\left(v_{k, t}\\right) \\tag{7}
$$
修正后,如上图(b),方程可正常从A点到B点。除了传统动量方程修正,论文还给出了Nesterov动量SGD的修正方程。
#### 4.2.2 局部梯度修剪
梯度修剪(gradient clipping)是防止梯度爆炸的常用方法,由Pascanu等人在2013年提出:当梯度的L2范数之和超过给定阈值时,按比例缩小(rescale)梯度。通常梯度修剪在梯度聚合之后使用,而DGC中每个节点独立地进行局部梯度累加,因此需要在将梯度累加到$G_t$之前先进行局部梯度修剪,并将阈值缩放为原来的$N^{-1/2}$:
$$
thr_{G^{k}}=N^{-1 / 2} \\cdot thr_{G} \\tag{8}
$$
### 4.3 克服迟滞效应
因为推迟了较小梯度更新权重的时间,所以会有权重陈旧性问题。稀疏度为99.9%时大部分参数需600到1000步更新一次。迟滞效应会减缓收敛并降低模型精度。DGC中采用动量因子掩藏和预热训练来解决这问题。
#### 4.3.1 动量因子掩藏
DGC中使用下面方程来掩藏动量因子减缓陈旧性问题。
$$
Mask \\leftarrow\\left|v_{k, t}\\right|>t h r, \\quad v_{k, t} \\leftarrow v_{k, t} \\odot \\neg Mask, \\quad u_{k, t} \\leftarrow u_{k, t} \\odot \\neg Mask \\tag{9}
$$
此掩码可以停止延迟梯度产生的动量,防止陈旧梯度把权重引入错误的方向。
#### 4.3.2 预热训练
在训练初期,梯度变动剧烈,需要及时更新权重,此时迟滞效应影响会很大。为此DGC采用预热训练的方法,在预热期间使用更小的学习率来减缓网络的变化速度,并使用较小的稀疏度来减少需推迟更新的梯度数量。预热期间会线性增大学习率,指数型增加稀疏度到最终值。
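以论文中图像分类任务的设置为例,4个epoch的预热期内稀疏度按指数方式提升,可用如下小段代码得到对应的稀疏度序列(仅为数值示意):

``` python
warmup_epochs = 4
sparsity = [1 - 0.25 ** (e + 1) for e in range(warmup_epochs)]
# [0.75, 0.9375, 0.984375, 0.99609375],即75%、93.75%、98.4375%、约99.6%;预热结束后使用最终稀疏度99.9%
print(sparsity)
```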
### 4.4 正则化(Weight Decay)项修正
Paddle框架以Weight Decay的形式实现正则化。以L2Decay为例,公式(3)中传统momentum添加weight decay后公式为
$$
G_{t}=\\sum_{k=1}^{N}\\left(\\nabla_{k, t}\\right)+\\lambda w_{t}, \\quad u_{t}=m u_{t-1}+G_{t}, \\quad w_{t+1}=w_{t}-\\eta u_{t} \\tag{10}
$$
其中$\lambda$为Weight Decay系数,$G_{t}$为添加L2Decay项之后的聚合梯度。由于在公式7中进行了局部动量修正,所以按照相同思路在局部梯度上运用修正的Weight Decay项。如下公式在局部梯度上添加局部Weight Decay项即可。
$$
\\nabla_{k, t}=\\nabla_{k, t}+\\frac{\\lambda}{N} w_{t} \\tag{11}
$$
在模型实际训练中,通常会设置weight decay的系数$\lambda=10^{-4}$,在卡数较多如4机32卡的情况下局部weight decay系数为$\frac{\lambda}{N}=\frac{10^{-4}}{32}=3.125*10^{-6}$,在数值精度上偏低,测试训练时会损失一定精度。为此还需对局部weight decay项进行数值修正。如下公式,
$$
\\nabla_{k, t}^{'}=N \\nabla_{k, t}+\\lambda w_{t}, \\quad
G_{t}^{'}=\\sum_{k=1}^{N}\\left(\\nabla_{k, t}^{'}\\right)=N\\sum_{k=1}^{N}\\left(\\nabla_{k, t}\\right)+N\\lambda w_{t}, \\quad
G_{t}=\\frac{G_{t}^{'}}{N}=\\sum_{k=1}^{N}\\left(\\nabla_{k, t}\\right)+\\lambda w_{t} \\tag{12}
$$
具体做法为对局部梯度乘以卡数求得$\nabla_{k, t}^{'}$,此时$\lambda$项则无需除以卡数,聚合梯度求得$G_{t}^{'}$再对聚合梯度除以卡数得到$G_{t}$即可。
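公式12的数值修正可以用一个简单的numpy例子验证:先对各卡局部梯度乘以卡数N再加上$\lambda w$,聚合后除以N,结果与公式10中的$G_t$一致(示意代码):

``` python
import numpy as np

N, lam = 32, 1e-4                                   # 卡数与weight decay系数
w = np.random.randn(10)
grads = [np.random.randn(10) for _ in range(N)]     # 各卡的局部梯度

G_expected = sum(grads) + lam * w                   # 公式10:聚合梯度加上λw
grads_corrected = [N * g + lam * w for g in grads]  # 公式12:局部梯度乘N后加λw
G_actual = sum(grads_corrected) / N                 # 聚合后再除以N

print(np.allclose(G_expected, G_actual))            # True
```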
重计算:大Batch训练特性
=============
背景
---------
随着训练数据规模的逐渐增加,训练更大、更深的深度学习模型成为一个主流趋势。目前的深度学习模型训练,通常要求保留前向计算的隐层结果,并且需要保存结果的数量会随着模型层数的增加线性增加,这对于目前能够使用的AI芯片的内存大小是个挑战。Forward Recomputation Backpropagation(FRB)可以在额外增加少量计算的情况下,显著增加模型的层数和宽度,同时也可以显著提升模型训练的batch大小。
原理
---------
我们知道,深度学习网络的一次训练迭代包含三个步骤:
- **前向计算**:运行前向算子(Operator) 来计算中间隐层(Variable)的值
- **反向计算**:运行反向算子来计算参数(Parameter)的梯度
- **优化**:应用优化算法以更新参数值
在前向计算过程中,前向算子会输出大量的中间计算结果,在Paddle中,使用
Variable来存储这些隐层的中间结果。当模型层数加深时,其数量可达成千上万个,
占据大量的内存。Paddle的 `显存回收机制 <https://paddlepaddle.org.cn/documentation/docs/zh/advanced_usage/best_practice/memory_optimize.html>`_
会及时清除无用的中间结果,以节省存储。
然而,有些中间结果是反向算子的输入,这些Variable必须存储在内存中,直到相应的反向算子计算完毕。
举个简单的例子, 我们定义一个由mul算子构成的网络,其前向计算为:
.. math::
y = W_1 * x
z = W_2 * y
其中 :math:`x, y, z` 为向量, :math:`W_1, W_2` 为矩阵。容易知道,求 :math:`W_2` 梯度的反向计算为:
.. math::
    W_{2}^{'} = z^{'} \cdot y^{T}
可以看到反向计算中用到了前向计算生成的变量 :math:`y` ,因此变量 :math:`y` 必须存储在内存中,直到这个反向算子计算完毕。当模型加深时,我们会有大量的“ :math:`y` ”,占据了大量的内存。
Forward Recomputation Backpropagation(FRB)的思想是将深度学习网络切分为k个部分(segments)。对每个segment而言:前向计算时,除了小部分必须存储在内存中的Variable外(我们后续会讨论这些特殊Variable),其他中间结果都将被删除;在反向计算中,首先重新计算一遍前向算子,以获得中间结果,再运行反向算子。简而言之,FRB和普通的网络迭代相比,多计算了一遍前向算子。
我们把切分网络的变量叫做checkpoints。
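为便于理解,下面用numpy给出一个与框架无关的两段(segment)网络示意:前向只保留输入和checkpoint,反向时由checkpoint重算出segment内部的中间结果(仅用于说明思路,并非Paddle的实现):

.. code-block:: python

    import numpy as np

    W1 = np.random.randn(4, 4)
    W2 = np.random.randn(4, 4)
    x = np.random.randn(4)

    # 前向:只保存输入x和checkpoint h,segment内部的中间结果(如W1 @ x)不保留
    h = np.maximum(W1 @ x, 0.)     # segment 1 的输出,作为checkpoint
    z = W2 @ h                     # segment 2 的输出
    loss = 0.5 * np.sum(z ** 2)

    # 反向:计算W2的梯度时直接使用checkpoint h
    dz = z                         # dloss/dz
    dW2 = np.outer(dz, h)
    dh = W2.T @ dz

    # 计算W1的梯度时需要segment 1内部的中间结果,由保存的x重算一遍前向得到
    pre = W1 @ x                   # 重计算:FRB比普通反向传播多算了一遍前向
    dW1 = np.outer(dh * (pre > 0), x)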
那么问题来了,如何选择checkpoints呢?自从FRB方法提出以来 \ :sup:`[1], [2]`,大量学者在研究这一关键问题。
我们知道深度学习网络通常是由一个个模块串联得到的,比如ResNet-50由16个block串联而成,
Bert-Large由24个transformer串联而成,以两个子模块中间的变量作为切分点就是一个很好的选择。
对于非串联的网络(比如含有大量shortcut结构的网络),FRB也支持对其做切分,
只是可能多耗费一点内存(用于存储shortcut的Variable)。
Mitsuru Kusumoto \ :sup:`[3]` 等提出了一种基于动态规划的算法,
可以根据指定的内存自动搜索合适的checkpoints,支持各种各样的网络结构。
下图是由4个fc Layer、3个relu Layer、1个sigmoid Layer和1个log-loss Layer串联而成的一个网络:最左侧为其前向计算流程、中间是普通的前向计算和反向计算流程、最右侧为添加FRB后的前向计算和反向计算流程。其中方框代表算子(Operator),红点代表前向计算的中间结果、蓝点代表checkpoints。
.. image:: images/recompute.png
注:该例子完整代码位于 `source <https://github.com/PaddlePaddle/examples/blob/master/community_examples/recompute/demo.py>`_
添加FRB后,前向计算中需要存储的中间Variable从4个(红点)变为2个(蓝点),
从而节省了这部分内存。当然了,重计算的部分也产生了新的中间变量,
这就需要根据实际情况来做权衡了。这个例子里的网络比较浅,通常来讲,
对层数较深的网络,FRB节省的内存要远多于新增加的内存。
使用方法
---------
我们实现了基于Paddle的FRB算法,叫做RecomputeOptimizer,
您可以根据其 `源码 <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/optimizer.py>`_
`文档 <https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/optimizer_cn/RecomputeOptimizer_cn.html>`_
更深入地了解这一算法。我们为用户提供了两个使用RecomputeOptimizer的方法:
直接调用与Fleet API中使用。在单机单卡或者CPU训练中建议您直接调用RecomputeOptimizer,
在多卡训练或者多机训练任务上建议您在Fleet API中使用Recompute。
**1. 直接调用**
直接调用RecomputeOptimizer非常简单,首先要定义一个经典的Optimizer,比如Adam;
然后在外面包一层RecomputeOptimizer;最后设置checkpoints即可。
.. code-block:: python
import paddle.fluid as fluid
# 定义网络
def mlp(input_x, input_y, hid_dim=128, label_dim=2):
print(input_x)
fc_1 = fluid.layers.fc(input=input_x, size=hid_dim)
prediction = fluid.layers.fc(input=[fc_1], size=label_dim, act='softmax')
cost = fluid.layers.cross_entropy(input=prediction, label=input_y)
sum_cost = fluid.layers.reduce_mean(cost)
return sum_cost, fc_1, prediction
input_x = fluid.layers.data(name="x", shape=[32], dtype='float32')
input_y = fluid.layers.data(name="y", shape=[1], dtype='int64')
cost, fc_1, pred = mlp(input_x, input_y)
# 定义RecomputeOptimizer
sgd = fluid.optimizer.Adam(learning_rate=0.01)
sgd = fluid.optimizer.RecomputeOptimizer(sgd)
# 设置checkpoints
sgd._set_checkpoints([fc_1, pred])
# 运行优化算法
sgd.minimize(cost)
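上面的网络可以按如下方式执行(示意代码,假设单机CPU环境,变量名沿用上例):

.. code-block:: python

    import numpy as np

    place = fluid.CPUPlace()
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

    x_np = np.random.random((8, 32)).astype('float32')
    y_np = np.random.randint(0, 2, size=(8, 1)).astype('int64')
    loss_np, = exe.run(fluid.default_main_program(),
                       feed={'x': x_np, 'y': y_np},
                       fetch_list=[cost])
    print(loss_np)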
Recompute原则上适用于所有Optimizer。
**2. 在Fleet API中使用Recompute**
`Fleet API <https://github.com/PaddlePaddle/Fleet>`_
是基于Fluid的分布式计算高层API。在Fleet API中添加RecomputeOptimizer
仅需要2步:
- 设置dist_strategy.forward_recompute为True;
- 设置dist_strategy.recompute_checkpoints。
.. code-block:: python
from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy
dist_strategy = DistributedStrategy()
dist_strategy.forward_recompute = True
dist_strategy.recompute_checkpoints=checkpoints
optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
optimizer.minimize(loss)
为了帮助您快速地用Fleet API使用Recompute任务,我们提供了一些例子,
并且给出了这些例子的计算速度、效果和显存节省情况:
- 用Recompute做Bert Fine-tuning: `source <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/recompute/bert>`_
- 用Recompute做目标检测:开发中.
Q&A
-------
- **是否支持带有随机性的Op?**
目前Paddle中带随机性的Op有:dropout,Recompute支持
dropout Operator,可以保证重计算与初次计算结果保持一致。
- **有没有更多Recompute的官方例子?**
更多Recompute的例子将更新在 `examples <https://github.com/PaddlePaddle/examples/tree/master/community_examples/recompute>`_
和 `Fleet <https://github.com/PaddlePaddle/Fleet>`_ 库下,欢迎关注。
- **有没有添加checkpoints的建议?**
我们建议将子网络连接部分的变量添加为checkpoints,即:
如果一个变量能将网络完全分为前后两部分,那么建议将其加入checkpoints。
checkpoints的数目会影响内存的消耗:如果checkpoints很少,
那么Recompute起的作用有限;如果checkpoints数量过多,
那么checkpoints本身占用的内存量就较大,内存消耗可能不降反升。
我们后续会添加一个估算内存用量的工具,
可以对每个Operator运算前后的显存用量做可视化,
帮助用户定位问题。
[1] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost.
arXiv preprint, arXiv:1604.06174, 2016.
[2] Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory efficient
backpropagation through time. In Advances in Neural Information Processing Systems (NIPS), pages 4125-4133,
2016.
[3] Kusumoto, Mitsuru, et al. "A Graph Theoretic Framework of Recomputation Algorithms for Memory-Efficient Backpropagation." arXiv preprint arXiv:1905.11722 (2019).
Recompute: Training with bigger batch size
=============
Context
---------
As the amount of training data increases, training deeper neural network models becomes more and more popular. Current deep-learning training usually keeps the hidden layer outputs in memory during the forward propagation,
and the number of outputs increases linearly with
the increase of the number of model layers,
which becomes a challenge of the memory size
for common devices.
Theory
---------
As we know, a training process of a deep-learning network contains 3 steps:
- **Forward Propagation**: running forward operators and generating temporary variables as outputs
- **Backward Propagation**: running backward operators to compute gradients of parameters
- **Optimization**: applying the optimization algorithm to update parameters
When the model becomes deeper, the number of temporary variables
generated in the forward propagation process can reach tens
of thousands, occupying a large amount of memory.
The `Garbage Collection mechanism <https://paddlepaddle.org.cn/documentation/docs/zh/advanced_usage/best_practice/memory_optimize.html>`_
in Paddle can delete useless variables for the sake of saving memory.
However, some variables serve as inputs of backward operators,
so they must be kept in memory until the corresponding backward operators finish.
Take a simple example: define a network that contains two `mul` operators;
the forward propagation works as follows:
.. math::
y = W_1 * x
z = W_2 * y
where :math:`x, y, z` are vectors and :math:`W_1, W_2` are matrices. It is easy to see that the gradient of :math:`W_2` is:
.. math::
    W_{2}^{'} = z^{'} \cdot y^{T}
We can see that :math:`y` is used in the backward propagation process,
thus it must be kept in the memory during the whole forward propagation.
When network grows deeper, more 'y's need to be stored,
adding more requirements to the memory.
Forward Recomputation Backpropagation (FRB) splits a deep network into k segments.
For each segment, in forward propagation,
most of the temporary variables are erased in time,
except for some special variables (we will talk about that later);
in backward propagation, the forward operators will be recomputed
to get these temporary variables before running backward operators.
In short, compared with normal training, FRB runs the forward operators one extra time.
But how to split the network? A deep learning network usually consists
of connecting modules in series:
ResNet-50 contains 16 blocks and Bert-Large contains 24 transformers.
It is a good choice to treat such modules as segments.
The variables between segments are
called checkpoints.
The following picture is a network with 4 fc layers, 3 relu layers,
1 sigmoid layer and 1 log-loss layer in series.
The left column is the forward propagation,
the middle column is the normal backward propagation,
and the right column is the FRB.
Rectangular boxes represent the operators, red dots represent
the intermediate variables in forward computation, blue dots
represent checkpoints and arrows represent the dependencies between operators.
.. image:: images/recompute.png
Note: the complete source code of this example: `source <https://github.com/PaddlePaddle/examples/blob/master/community_examples/recompute/demo.py>`_
After applying FRB, the forward computation only needs to store
2 variables (the blue dots) instead of 4 variables (the red
dots), saving the corresponding memories. It is notable that
recomputing operators generate new intermediate variables at the same time,
so a trade-off needs to be considered in this situation.
According to our experiments, for deep networks
FRB usually reduces rather than increases the overall memory load.
Usage
---------
We have implemented the FRB algorithm named "RecomputeOptimizer"
based on Paddle. More information about this algorithm can
be learned by the `source code <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/optimizer.py>`_
and the
`document <https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/optimizer_cn/RecomputeOptimizer_cn.html>`_
of RecomputeOptimizer.
There are 2 methods to apply RecomputeOptimizer in your Paddle
program: call RecomputeOptimizer directly or use it with Fleet
API. For single-GPU or CPU training, we recommend
calling it directly; for multi-GPU training, we
recommend using it with the Fleet API.
**1. Directly calling**
Calling RecomputeOptimizer is very easy: first, define a classic
optimizer, such as Adam; second, wrap it with RecomputeOptimizer;
third, set the checkpoints.
.. code-block:: python
import paddle.fluid as fluid
# Define the network
def mlp(input_x, input_y, hid_dim=128, label_dim=2):
print(input_x)
fc_1 = fluid.layers.fc(input=input_x, size=hid_dim)
prediction = fluid.layers.fc(input=[fc_1], size=label_dim, act='softmax')
cost = fluid.layers.cross_entropy(input=prediction, label=input_y)
sum_cost = fluid.layers.reduce_mean(cost)
return sum_cost, fc_1, prediction
input_x = fluid.layers.data(name="x", shape=[32], dtype='float32')
input_y = fluid.layers.data(name="y", shape=[1], dtype='int64')
cost, fc_1, pred = mlp(input_x, input_y)
# define RecomputeOptimizer
sgd = fluid.optimizer.Adam(learning_rate=0.01)
sgd = fluid.optimizer.RecomputeOptimizer(sgd)
# set checkpoints
sgd._set_checkpoints([fc_1, pred])
# apply optimization
sgd.minimize(cost)
In principle, Recompute works with all kinds of optimizers in Paddle.
**2. Using Recompute in Fleet API**
`Fleet API <https://github.com/PaddlePaddle/Fleet>`_
is a high-level API for distributed training in Fluid. Adding
RecomputeOptimizer in the Fleet API takes two steps:
- set dist_strategy.forward_recompute to True
- set dist_strategy.recompute_checkpoints
.. code-block:: python
from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy
dist_strategy = DistributedStrategy()
dist_strategy.forward_recompute = True
dist_strategy.recompute_checkpoints=checkpoints
optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
optimizer.minimize(loss)
We supply some examples of using recompute in Fleet API for users.
We also post corresponding training speed,
test results and memory usages of these examples for reference.
- Fine-tuning Bert Large model with recomputing: `source <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/recompute/bert>`_
- Training object detection models with recomputing: under development.
Q&A
-------
- **Does RecomputeOptimizer support operators with random outputs?**
We currently found that the dropout operator has random results
and RecomputeOptimizer is able to keep the outputs of
first-computation and recomputation consistent.
- **Are there more official examples of Recompute?**
More examples will be updated at `examples <https://github.com/PaddlePaddle/examples/tree/master/community_examples/recompute>`_
and `Fleet <https://github.com/PaddlePaddle/Fleet>`_ . Feel free to
raise issues if you get any problem with these examples.
- **How should I set checkpoints?**
The position of checkpoints is important:
we suggest setting the variable between the sub-model as checkpoints,
that is, set a variable as a checkpoint if it
can separate the network into two parts without short-cut connections.
The number of checkpoints is also important:
too few checkpoints will reduce the memory saved by recomputing while
too many checkpoints will occupy a lot of memory themselves.
We will add a tool to estimate the memory usage with specific checkpoints,
helping users to choose checkpointing variables.
[1] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost.
arXiv preprint, arXiv:1604.06174, 2016.
[2] Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory efficient
backpropagation through time. In Advances in Neural Information Processing Systems (NIPS), pages 4125-4133,
2016.
[3] Kusumoto, Mitsuru, et al. "A Graph Theoretic Framework of Recomputation Algorithms for Memory-Efficient Backpropagation." arXiv preprint arXiv:1905.11722 (2019).
.. _api_guide_memory_optimize:
###########
存储分配与优化
###########
1. PaddlePaddle的显存分配策略
===========================
1.1. 显存自增长AutoGrowth策略
--------------------------
自1.6+的版本起,PaddlePaddle支持显存自增长AutoGrowth策略,按需分配显存,且已于1.7+版本中默认开启,方便用户在同一张GPU卡上同时运行多个任务。
由于原生的CUDA系统调用 :code:`cudaMalloc` 和 :code:`cudaFree` 均是同步操作,非常耗时。
因此显存自增长AutoGrowth策略会缓存已分配到的显存,供后续分配使用,具体方式为:
- 在前几次显存分配时,框架会调用 :code:`cudaMalloc` 按需分配,但释放时不会调用 :code:`cudaFree` 返回给GPU,而是在框架内部缓存起来。
- 在随后的显存分配时,框架会首先检查缓存的显存中是否有合适的块,若有则从中分割出所需的显存空间返回,否则才调用 :code:`cudaMalloc` 直接从GPU中分配。随后的显存释放亦会缓存起来供后续分配使用。
因此,显存自增长AutoGrowth策略会在前几个batch训练时分配较慢(因为频繁调用 :code:`cudaMalloc` ),在随后训练过程中基本不会影响模型训练速度。
1.2. 显存预分配策略
----------------
除了显存自增长AutoGrowth策略以外,PaddlePaddle还提供了显存预分配策略。显存预分配策略是PaddlePaddle 1.7版本前的默认显存分配策略。
显存预分配策略会在第一次分配时分配很大chunk_size的显存块,随后的显存分配大多从预分配的显存块中切分获得。
其中,chunk_size由环境变量 :code:`FLAGS_fraction_of_gpu_memory_to_use` 确定,chunk_size的计算公式为:
.. code-block:: python
chunk_size = FLAGS_fraction_of_gpu_memory_to_use * 单张GPU卡的当前可用显存值
:code:`FLAGS_fraction_of_gpu_memory_to_use` 的默认值为0.92,即框架预先分配显卡92%的当前可用显存值。
显存预分配策略分配显存的具体方式为:
- 在分配requested_size大小的显存时,
- 若requested_size <= chunk_size,则框架会预先分配chunk_size大小的显存池chunk,并从chunk中分出requested_size大小的块返回。之后每次申请显存都会从chunk中分配。
- 若requested_size > chunk_size,则框架会直接调用 :code:`cudaMalloc` 分配requested_size大小的显存返回。
- 在释放free_size大小的显存时,
- 若free_size <= chunk_size,则框架会将该显存放回预分配的chunk中,而不是直接返回给CUDA。
- 若free_size > chunk_size,则框架会直接调用 :code:`cudaFree` 将显存返回给CUDA。
若你的GPU卡上有其他任务占用显存,你可以适当将 :code:`FLAGS_fraction_of_gpu_memory_to_use` 减少,保证框架能预分配到合适的显存块,例如:
.. code-block:: shell
export FLAGS_fraction_of_gpu_memory_to_use=0.4 # 预分配单卡当前可用显存的40%
若 :code:`FLAGS_fraction_of_gpu_memory_to_use` 设为0,则每次显存分配和释放均会调用 :code:`cudaMalloc` 和 :code:`cudaFree` ,会严重影响性能,不建议你使用。
只有当你想测量网络的实际显存占用量时,你可以设置 :code:`FLAGS_fraction_of_gpu_memory_to_use` 为0,观察nvidia-smi显示的显存占用情况。
1.3. 显存分配策略的选择方式
-----------------------
自1.6+版本起,PaddlePaddle同时支持显存自增长AutoGrowth策略和显存预分配策略,并通过环境变量 :code:`FLAGS_allocator_strategy` 控制。
选择显存自增长AutoGrowth的方式为:
.. code-block:: shell
export FLAGS_allocator_strategy=auto_growth # 选择显存自增长AutoGrowth策略
选择显存预分配策略的方式为:
.. code-block:: shell
export FLAGS_allocator_strategy=naive_best_fit # 选择显存预分配策略
此外,自1.7.2+版本起,PaddlePaddle提供了环境变量 :code:`FLAGS_gpu_memory_limit_mb` ,用于控制单个任务进程可分配的最大显存,单位是MB。默认值是0,表示没有限制,可分配全部显存。如果设置为大于0的值,则会在分配的显存超过限制时报错,即使此时系统还存在空闲的显存空间。
2. PaddlePaddle的存储优化策略
===========================
PaddlePaddle提供了多种通用存储优化方法,优化你的网络的存储占用(包括显存和内存)。
2.1. GC策略: 存储垃圾及时回收
-------------------------
GC(Garbage Collection)的原理是在网络运行阶段及时释放无用变量的存储空间,达到节省存储空间的目的。GC适用于使用Executor,ParallelExecutor做模型训练/预测的场合,但不适用于C++预测库接口。
**GC策略已于1.6+版本中默认开启。**
GC策略由三个环境变量控制:
- :code:`FLAGS_eager_delete_tensor_gb`
GC策略的使能开关,double类型,在<1.6的版本中默认值为-1,在1.6+版本中默认值为0。GC策略会积攒一定大小的存储垃圾后再统一释放,:code:`FLAGS_eager_delete_tensor_gb` 控制的是存储垃圾的阈值,单位是GB。**建议用户设置** :code:`FLAGS_eager_delete_tensor_gb=0` 。
若 :code:`FLAGS_eager_delete_tensor_gb=0` ,则一旦有存储垃圾则马上回收,最为节省存储空间。
若 :code:`FLAGS_eager_delete_tensor_gb=1` ,则存储垃圾积攒到1G后才触发回收。
若 :code:`FLAGS_eager_delete_tensor_gb<0` ,则GC策略关闭。
- :code:`FLAGS_memory_fraction_of_eager_deletion`
GC策略的调节flag,double类型,默认值为1,范围为[0,1],仅适用于使用ParallelExecutor或CompiledProgram+with_data_parallel的场合。
GC内部会根据变量占用的存储空间大小,对变量进行降序排列,且仅回收前 :code:`FLAGS_memory_fraction_of_eager_deletion` 大的变量的存储空间。**建议用户维持默认值**,即 :code:`FLAGS_memory_fraction_of_eager_deletion=1` 。
若 :code:`FLAGS_memory_fraction_of_eager_deletion=0.6` ,则表示仅回收存储占用60%大的变量的存储空间。
若 :code:`FLAGS_memory_fraction_of_eager_deletion=0` ,则表示不回收任何变量的存储空间,GC策略关闭。
若 :code:`FLAGS_memory_fraction_of_eager_deletion=1` ,则表示回收所有变量的存储空间。
- :code:`FLAGS_fast_eager_deletion_mode`
快速GC策略的开关,bool类型,默认值为True,表示使用快速GC策略。快速GC策略会不等待CUDA Kernel结束直接释放显存。**建议用户维持默认值**,即 :code:`FLAGS_fast_eager_deletion_mode=True` 。
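除了在shell中export之外,上述FLAGS也可以在Python程序中通过环境变量设置(示意代码;一般需要在 import paddle.fluid 之前设置才会生效):

.. code-block:: python

    import os

    os.environ['FLAGS_eager_delete_tensor_gb'] = '0'             # 开启GC,存储垃圾立即回收
    os.environ['FLAGS_memory_fraction_of_eager_deletion'] = '1'  # 维持默认值

    import paddle.fluid as fluid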
2.2. Inplace策略: Op内部的输出复用输入
----------------------------------
Inplace策略的原理是Op的输出复用Op输入的存储空间。例如,reshape操作的输出和输入可复用同一片存储空间。
Inplace策略适用于使用ParallelExecutor或CompiledProgram+with_data_parallel的场合,通过 :code:`BuildStrategy` 设置。此策略不支持使用Executor+Program做单卡训练、使用C++预测库接口等场合。
**Inplace策略已于1.6+版本中默认开启。**
具体方式为:
.. code-block:: python
    build_strategy = fluid.BuildStrategy()
    build_strategy.enable_inplace = True  # 开启Inplace策略

    compiled_program = fluid.CompiledProgram(train_program).with_data_parallel(
        loss_name=loss.name, build_strategy=build_strategy)
在<1.6的版本中,由于设计上的一些问题,在开启Inplace策略后,必须保证后续exe.run中fetch_list的变量是persistable的,即假如你后续需要fetch的变量为loss和acc,则必须设置:
.. code-block:: python
loss.persistable = True
acc.persistable = True
**在1.6+的版本中,无需设置fetch变量为persistable。**
3. 存储优化Best Practice
=======================
我们推荐你的最佳存储优化策略为:
- 开启GC策略:设置 :code:`FLAGS_eager_delete_tensor_gb=0` 。
- 开启Inplace策略:设置 :code:`build_strategy.enable_inplace = True` ,并在<1.6版本中设置fetch_list中的 :code:`var.persistable = True` 。
**在1.6+的版本中,上述最佳策略均已默认打开,无需手动配置,亦无需设置fetch_list变量为persistable。**
.. _api_guide_memory_optimize_en:
##################################
Memory Allocation and Optimization
##################################
1. Memory Allocation Strategy
===========================
1.1. AutoGrowth Strategy
--------------------------
Since version 1.6+, PaddlePaddle supports the AutoGrowth strategy, which allocates memory on demand.
AutoGrowth strategy has been enabled by default in version 1.7+, making it convenient for users to
run multiple tasks on the same GPU card at the same time.
Because the native CUDA system calls :code:`cudaMalloc` and :code:`cudaFree` are synchronous operations,
which are very time-consuming, the AutoGrowth strategy will cache the allocated memory for subsequent allocation.
The specific methods are as follows:
- In the first few memory allocations, PaddlePaddle framework will call :code:`cudaMalloc` and allocate memory on demand. When releasing the allocated memory, it will not call :code:`cudaFree` to return the memory to GPU, but cache the memory inside the framework.
- In the subsequent allocations, PaddlePaddle framework will first check if there is a fit block (block size larger than the required memory size) in the cached memory. If there is, it will split the required memory from the fit block and return. Otherwise, it will call :code:`cudaMalloc` to allocate memory from GPU. The allocated memory are also cached when being released for subsequent allocation.
Therefore, the AutoGrowth strategy may slow the speed in the first few batches of model training,
but will not affect the speed in the subsequent training process.
1.2. Pre-Allocation Strategy
----------------
In addition to the AutoGrowth strategy, PaddlePaddle also provides a Pre-Allocation strategy,
which is the default memory allocation strategy before PaddlePaddle 1.7.
The Pre-Allocation strategy allocates a large chunk at the first allocation, and subsequent memory allocations are mostly obtained from the pre-allocated memory chunk.
Among them, the chunk size is determined by the environment variable :code:`FLAGS_fraction_of_gpu_memory_to_use`, and the calculation formula of chunk size is:
.. code-block:: python
chunk_size = FLAGS_fraction_of_gpu_memory_to_use * number of current available memory of a single GPU card
The default value of :code:`FLAGS_fraction_of_gpu_memory_to_use` is 0.92, that is, the framework will pre-allocate
92% of the currently available memory of the GPU card.
The specific way of Pre-Allocation strategy to allocate GPU memory is:
- When allocating memory of requested_size,
- If requested_size <= chunk_size, the framework will first allocate a memory chunk of chunk_size, then split a block of requested_size and return the block. Every subsequent memory allocation will be performed on the chunk.
- If requested_size > chunk_size, the framework will call :code:`cudaMalloc` to allocate memory block of requested_size and return.
- When freeing memory of free_size,
- If free_size <= chunk_size, the framework will put the memory block back into the pre-allocated chunk, instead of returning back to GPU.
- If free_size > chunk_size, the framework will call :code:`cudaFree` and return the memory back to GPU.
If there are other tasks on your GPU card that occupy the memory, you can appropriately decrease :code:`FLAGS_fraction_of_gpu_memory_to_use`
to ensure that the framework can pre-allocate the memory block of appropriate size, for example
.. code-block:: shell
export FLAGS_fraction_of_gpu_memory_to_use=0.4 # Pre-allocate 40% memory of a single GPU card
If :code:`FLAGS_fraction_of_gpu_memory_to_use` is set to 0, the framework will call :code:`cudaMalloc` and :code:`cudaFree` every time memory is allocated and released, which will seriously affect performance and is not recommended. Only when you want to measure the actual memory usage of the network should you set :code:`FLAGS_fraction_of_gpu_memory_to_use` to 0 and observe the memory usage reported by the nvidia-smi command.
1.3. Configuration of memory allocation strategy
-----------------------
Since version 1.6+, PaddlePaddle supports both the AutoGrowth strategy and the Pre-Allocation Strategy, and control the strategy used in framework by
the environment variable :code:`FLAGS_allocator_strategy`.
Use AutoGrowth strategy:
.. code-block:: shell
export FLAGS_allocator_strategy=auto_growth # Use AutoGrowth strategy
Use Pre-Allocation strategy:
.. code-block:: shell
export FLAGS_allocator_strategy=naive_best_fit # Use Pre-Allocation strategy
Plus, since version 1.7.2+, PaddlePaddle provides an environment variable :code:`FLAGS_gpu_memory_limit_mb`, which controls the maximum gpu memory limit that the process can allocate.
If it is equal to 0, there would be no limit and all gpu memory would be available to the process. If it is larger than 0, the process would raise out of memory error if the allocated
memory exceeds the limit even though there is available memory on the gpu card. The unit is MB and default value is 0.
2. Memory Optimization Strategy
===========================
PaddlePaddle provides several general memory optimization methods to optimize the memory usage of your network (including both host memory and GPU memory).
2.1. GC Strategy: memory garbage eager collection
-------------------------
The principle of GC(Garbage Collection)is to release the memory space of useless variables eagerly during network running,
in order to save memory space. GC is suitable for training and inference using Executor or ParallelExecutor, but it is not suitable for C++ inference library.
**Since version 1.6+, GC Strategy is enabled by default.**
GC Strategy is controlled by 3 environment variable:
- :code:`FLAGS_eager_delete_tensor_gb`
Variable to enable GC, its data type is double. The default value is -1 in PaddlePaddle with version < 1.6,
and is 0 in PaddlePaddle with version >= 1.6. GC Strategy will cache a certain amount of memory garbage and release it all at once when the threshold is reached.
:code:`FLAGS_eager_delete_tensor_gb` means the threshold of cached memory garbage, the unit of which is GB. **It is recommended to set** :code:`FLAGS_eager_delete_tensor_gb=0`.
If :code:`FLAGS_eager_delete_tensor_gb=0`, once there is memory garbage, it will be collected immediately to save memory.
If :code:`FLAGS_eager_delete_tensor_gb=1`, the memory garbage is collected when the cached amount of garbage reaches 1GB.
If :code:`FLAGS_eager_delete_tensor_gb<0`, GC Strategy is disabled.
- :code:`FLAGS_memory_fraction_of_eager_deletion`
Variable to control GC Strategy, its data type is double. The default value is 1, range [0,1]. It is only suitable for ParallelExecutor or CompiledProgram+with_data_parallel.
GC will sort the variables in descending order according to the memory space occupied by the variables,
and only collect the memory space of top :code:`FLAGS_memory_fraction_of_eager_deletion` variables.
**It is recommended to remain default value**, that is :code:`FLAGS_memory_fraction_of_eager_deletion=1`.
If :code:`FLAGS_memory_fraction_of_eager_deletion=0.6`, top 60% variables will be collected.
If :code:`FLAGS_memory_fraction_of_eager_deletion=0`, no variable will be collected, GC Strategy is disabled.
If :code:`FLAGS_memory_fraction_of_eager_deletion=1`, all variables will be collected.
- :code:`FLAGS_fast_eager_deletion_mode`
Variable to enable fast GC Strategy, its type is bool. The default value is True, which means use fast GC Strategy.
Fast GC Strategy will collect the memory garbage immediately instead of waiting for CUDA Kernel finish. **It is recommended to remain default value**, that is :code:`FLAGS_fast_eager_deletion_mode=True`.
2.2. Inplace Strategy: output reuses input inside operator
----------------------------------
The principle of the Inplace strategy is that the output of some operators can reuse the memory space of their input.
For example, the output and input of operator :code:`reshape` can reuse the same memory space.
Inplace Strategy is suitable for ParallelExecutor or CompiledProgram+with_data_parallel, which can be set through :code:`BuildStrategy`.
The Strategy is not suitable for Executor+Program or C++ inference library.
**Since version 1.6+, Inplace Strategy is enabled by default.**
The specific way of Inplace strategy is:
.. code-block:: python
    build_strategy = fluid.BuildStrategy()
    build_strategy.enable_inplace = True  # Enable Inplace Strategy

    compiled_program = fluid.CompiledProgram(train_program).with_data_parallel(
        loss_name=loss.name, build_strategy=build_strategy)
In PaddlePaddle versions earlier than 1.6, due to some design problems, when the Inplace Strategy is enabled,
the variables in fetch_list of the subsequent :code:`exe.run` must be persistable.
That is, if the variables you want to fetch are loss and acc, you must set:
.. code-block:: python
loss.persistable = True
acc.persistable = True
**Since version 1.6+, setting variables in fetch_list to persistable is not needed.**
3. Memory Optimization Best Practice
=======================
We recommend the best memory optimization strategy as:
- Enable GC strategy:set :code:`FLAGS_eager_delete_tensor_gb=0`.
- Enable Inplace strategy:set :code:`build_strategy.enable_inplace = True`, and set variables in fetch_list to persistable using :code:`var.persistable = True` when the version of PaddlePaddle < 1.6.
**Since version 1.6+, the above optimal strategies have been enabled by default and setting variables in fetch_list to persistable is not needed.**
.. _api_guide_singlenode_training_best_practice:
#####################
单机训练优秀实践
#####################
开始优化您的单机训练任务
-------------------------
PaddlePaddle Fluid可以支持在现代CPU、GPU平台上进行训练。如果您发现Fluid进行单机训练的速度较慢,您可以根据这篇文档的建议对您的Fluid程序进行优化。
神经网络训练代码通常由三个步骤组成:网络构建、数据准备、模型训练。这篇文档将分别从这三个方向介绍Fluid训练中常用的优化方法。
1. 网络构建过程中的配置优化
==================
这部分优化与具体的模型有关,在这里,我们列举出一些优化过程中遇到过的一些示例。
1.1 cuDNN操作的选择
^^^^^^^^^^^^^^^^
cuDNN是NVIDIA提供的深度神经网络计算库,其中包含了很多神经网络中常用算子,Paddle中的部分Op底层调用的是cuDNN库,例如 :code:`conv2d` :
.. code-block:: python
paddle.fluid.layers.conv2d(input,
num_filters,
filter_size,
stride=1,
padding=0,
dilation=1,
groups=None,
param_attr=None,
bias_attr=None,
use_cudnn=True,
act=None,
name=None,
data_format="NCHW")
在 :code:`use_cudnn=True` 时,框架底层调用的是cuDNN中的卷积操作。
通常cuDNN库提供的操作具有很好的性能表现,其性能明显优于Paddle原生的CUDA实现,比如 :code:`conv2d` 。但是cuDNN中有些操作的性能较差,比如: :code:`conv2d_transpose` 在 :code:`batch_size=1` 时、:code:`pool2d` 在 :code:`global_pooling=True` 时等,这些情况下,cuDNN实现的性能差于Paddle的CUDA实现,建议手动设置 :code:`use_cudnn=False` 。
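例如,当模型中出现上述cuDNN实现较慢的情况时,可以参考如下方式显式关闭cuDNN(示意代码,其中 :code:`x` 为已有的输入Variable):

.. code-block:: python

    # batch_size=1 时的 conv2d_transpose:关闭cuDNN,改用Paddle的CUDA实现
    out = fluid.layers.conv2d_transpose(
        input=x, num_filters=8, filter_size=4, use_cudnn=False)
    # global_pooling=True 时的 pool2d:同样关闭cuDNN
    pooled = fluid.layers.pool2d(
        input=out, pool_type='avg', global_pooling=True, use_cudnn=False)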
1.2 减少模型中Layer的个数
^^^^^^^^^^^^^^^^^^
为方便用户使用,飞桨提供一些不同粒度的Layer,其中有些Layer的组合可以通过单个Layer完成。比如:
(1) :code:`fluid.layers.softmax_with_cross_entropy` ,该操作其实是 :code:`fluid.layers.softmax` 和 :code:`fluid.layers.cross_entropy` 的组合,因此如果模型中有出现
.. code-block:: python
logits = fluid.layers.softmax(logits)
loss = fluid.layers.cross_entropy(logits, label, ignore_index=255)
可以直接替换成
.. code-block:: python
loss = fluid.layers.softmax_with_cross_entropy(logits, label, ignore_index=255, numeric_stable_mode=True)
(2) 如果模型中需要对数据进行标准化,可以直接使用 :code:`fluid.layers.data_norm` ,而不用通过一系列layer组合出数据的标准化操作。
因此,建议在构建模型时优先使用飞桨提供的单个Layer完成所需操作,这样减少模型中Layer的个数,并因此加速模型训练。
2. 数据准备优化
=============
数据准备通常分为两部分:第一部分是数据加载,即程序从磁盘中加载训练/预测数据;第二部分是数据预处理,程序对加载的数据进行预处理,比如图像任务通常需要进行数据增强、Shuffle等。
这两部分需要用户根据自己的模型需要进行设置,只需要最后得到Data Reader接口即可。Data Reader返回iterable对象,可以每次返回一条样本或者一组样本。代码示例如下:
.. code-block:: python
def data_reader(width, height):
def reader():
while True:
yield np.random.uniform(-1, 1,size=width*height), np.random.randint(0,10)
return reader
train_data_reader = data_reader(32, 32)
Paddle提供了两种方式从Data Reader中读取数据: :ref:`user_guide_use_numpy_array_as_train_data` 和 :ref:`user_guides_use_py_reader` ,详情请参考文档 :ref:`user_guide_prepare_data` 。
2.1 同步数据读取
^^^^^^^^^^^^^^^^
同步数据读取是一种简单并且直观的数据准备方式,代码示例如下:
.. code-block:: python
image = fluid.data(name="image", shape=[None, 1, 28, 28], dtype="float32")
label = fluid.data(name="label", shape=[None, 1], dtype="int64")
# 模型定义
# ……
prediction = fluid.layers.fc(input=image, size=10)
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_loss = fluid.layers.mean(loss)
# ……
# 读取数据
# paddle.dataset.mnist.train()返回数据读取的Reader,每次可以从Reader中读取一条样本,batch_size为128
train_reader = paddle.batch(paddle.dataset.mnist.train(), 128)
# 读取数据
end = time.time()
for batch_id, batch in enumerate(train_reader):
data_time = time.time() - end
# 训练网络
executor.run(feed={...}, fetch_list=[...])
batch_time = time.time() - end
end = time.time()
用户首先需要通过 :code:`fluid.data` 定义模型的输入,然后根据输入构建模型,最后从事先自定义的Reader函数中获取一个batch的数据,并将数据传递给执行器。
采用同步数据读取方式时,用户可通过加入Python计时函数 :code:`time.time()` 来统计数据准备部分和执行部分所占用的时间。
由于数据准备和执行是顺序进行的,所以程序的执行速度可能较慢。如果用户想进行模型调试的话,同步数据读取是一个不错的选择。
2.2 异步数据读取
^^^^^^^^^^^^^^^^
Paddle里面使用 paddle.fluid.io. :ref:`cn_api_fluid_io_DataLoader` 接口来实现异步数据读取,代码示例如下:
.. code-block:: python
image = fluid.data(name="image", shape=[None, 1, 28, 28], dtype="float32")
label = fluid.data(name="label", shape=[None, 1], dtype="int64")
data_loader = fluid.io.DataLoader.from_generator(
feed_list=[image, label],
capacity=64,
iterable=False,
use_double_buffer=True)
# 模型定义
# ……
prediction = fluid.layers.fc(input=image, size=10)
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_loss = fluid.layers.mean(loss)
# ……
# 读取数据
train_reader = paddle.batch(paddle.dataset.mnist.train(), 128)
places = fluid.cuda_places()  # 若使用CPU训练,可改为 fluid.cpu_places()
data_loader.set_batch_generator(train_reader, places=places)
# 启动data_loader
data_loader.start()
batch_id = 0
try:
end = time.time()
while True:
print("queue size: ", data_loader.queue.size())
loss, = executor.run(fetch_list=[...])
# ...
batch_time = time.time() - end
end = time.time()
batch_id += 1
except fluid.core.EOFException:
data_loader.reset()
用户首先需要通过 :code:`fluid.io.DataLoader.from_generator` 定义DataLoader对象,并使用 :code:`set_batch_generator` 方法将自定义的Reader与DataLoader绑定。
若DataLoader被定义成不可迭代的( :code:`iterable=False` ),在训练开始之前,通过调用 :code:`start()` 方法来启动数据读取。
在数据读取结束之后, :code:`executor.run` 会抛出 :code:`fluid.core.EOFException` ,表示训练已经遍历完Reader中的所有数据。
采用异步数据读取时,Python端和C++端共同维护一个数据队列,Python端启动一个线程,负责向队列中插入数据,C++端在训练/预测过程中,从数据队列中获取数据,并将该数据从队列中移除。
用户可以在程序运行过程中,监测数据队列是否为空,如果队列始终不为空,表明数据准备的速度比模型执行的速度快,这种情况下数据读取可能不是瓶颈。
另外,Paddle提供的一些FLAGS也能很好地帮助分析性能。如果用户希望评估在完全没有数据读取开销情况下模型的性能,可以设置环境变量 :code:`FLAGS_reader_queue_speed_test_mode` :在该变量为True的情况下,C++端从数据队列中获取数据之后,不会将其从数据队列中移除,这样能够保证数据队列始终不为空,从而避免了C++端读取数据时的等待开销。
**需要特别注意的是,** :code:`FLAGS_reader_queue_speed_test_mode` **只能在性能分析的时候打开,正常训练模型时需要关闭。**
为降低训练的整体时间,建议用户使用异步数据读取的方式,并开启 :code:`use_double_buffer=True` 。用户可根据模型的实际情况设置数据队列的大小。
如果数据准备的时间大于模型执行的时间,或者出现了数据队列为空的情况,就需要考虑对数据读取Reader进行加速。
常用的方法是 **使用Python多进程准备数据** ,一个简单的使用多进程准备数据的示例,可以参考 `YOLOv3 <https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/yolov3/reader.py>`_ 。
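下面给出一个用Python multiprocessing 预取并预处理数据的极简示意(并非上述YOLOv3中的完整实现;实际使用时还需按进程对数据分片、处理随机性与异常退出等问题):

.. code-block:: python

    import numpy as np
    from multiprocessing import Process, Queue

    def raw_sample_reader():
        # 模拟原始的单进程Reader,每次yield一条 (image, label) 样本
        for _ in range(1000):
            yield (np.random.uniform(-1, 1, size=(3, 224, 224)).astype('float32'),
                   np.random.randint(0, 10))

    def _worker(q):
        for img, label in raw_sample_reader():
            # 数据增强等预处理放在子进程中完成
            q.put((img, label))
        q.put(None)  # 结束标记

    def multiprocess_reader(num_workers=4):
        def reader():
            q = Queue(maxsize=64)
            workers = [Process(target=_worker, args=(q,)) for _ in range(num_workers)]
            for w in workers:
                w.daemon = True
                w.start()
            finished = 0
            while finished < num_workers:
                sample = q.get()
                if sample is None:
                    finished += 1
                else:
                    yield sample
        return reader

    train_reader = paddle.batch(multiprocess_reader(), batch_size=128)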
Python端的数据预处理,都是使用CPU完成。如果Paddle提供了相应功能的API,可将这部分预处理功能写到模型配置中,如此Paddle就可以使用GPU来完成该预处理功能,这样也可以减轻CPU预处理数据的负担,提升总体训练速度。
3. 模型训练相关优化
=============
3.1 执行器介绍
^^^^^^^^^^^^^^^^
目前Paddle的Python API中提供了 :code:`fluid.compiler.CompiledProgram` 的概念,用户可以通过 :code:`CompiledProgram` 将传入的program进行编译。
如果希望采用数据并行模式训练,只需要将 :code:`CompiledProgram` 返回的对象调用一下 :code:`with_data_parallel` 即可,最后统一通过 :code:`executor.run(…)` 执行compiled_program。
虽然统一通过 :code:`executor.run(…)` 接口来执行,实际底层的执行策略有两种,对应C++部分的两个执行器,即 :code:`Executor` 和 :code:`ParallelExecutor` ,如果用户采用数据并行模式,C++部分使用的是 :code:`ParallelExecutor` ,除此之外都是使用 :code:`Executor` 。
这两个执行器的差别:
.. csv-table::
:header: "执行器 ", "执行对象", "执行策略"
:widths: 3, 3, 5
":code:`Executor`", ":code:`Program`", "根据 :code:`Program` 中Operator定义的先后顺序依次运行。"
":code:`ParallelExecutor`", "SSA Graph", "根据Graph中各个节点之间的依赖关系,通过多线程运行。"
可以看出, :code:`Executor` 的内部逻辑非常简单,但性能可能会弱一些,因为 :code:`Executor` 对于program中的操作是串行执行的。
而 :code:`ParallelExecutor` 首先会将program转变为计算图,并分析计算图中节点间的连接关系,对图中没有相互依赖的节点(OP),通过多线程并行执行。
因此, :code:`Executor` 是一个轻量级的执行器,目前主要用于参数初始化、模型保存、模型加载。
:code:`ParallelExecutor` 是 :code:`Executor` 的升级版本,目前 :code:`ParallelExecutor` 主要用于模型训练,包括单机单卡、单机多卡以及多机多卡训练。
:code:`ParallelExecutor` 执行计算图之前,可以对计算图进行一些优化,比如使计算图中的一些操作是In-place的、将计算图中的参数更新操作进行融合等。
用户还可以调整 :code:`ParallelExecutor` 执行过程中的一些配置,比如执行计算图的线程数等。这些配置分别是构建策略(BuildStrategy)和执行策略(ExecutionStrategy)参数来设置的。
一个简单的使用示例如下:
.. code-block:: python
build_strategy = fluid.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.fuse_all_optimizer_ops=True
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = 4
train_program = fluid.compiler.CompiledProgram(main_program).with_data_parallel(
loss_name=loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy)
place = fluid.CUDAPlace(0)
exe = Executor(place)
# 使用DataLoader读取数据,因此执行时不需要设置feed
fetch_outs = exe.run(train_program, fetch_list=[loss.name])
3.2 构建策略(BuildStrategy)配置参数介绍
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
BuildStrategy中提供了一些关于计算图优化的策略,这些策略可以在不同程度上提升模型的训练速度,但是其中一些策略与模型的结构有关,比如 :code:`fuse_all_optimizer_ops` 不支持sparse梯度,我们正在积极的完善这些策略,并在下一个版本将这些策略默认打开。
构建策略的详细介绍如下:
.. csv-table::
:header: "选项", "类型", "默认值", "说明"
:widths: 3, 3, 3, 5
":code:`reduce_strategy`", ":code:`fluid.BuildStrategy.ReduceStrategy`", ":code:`fluid.BuildStrategy.ReduceStrategy.AllReduce`", "使用数据并行训练模型时选用 :code:`AllReduce` 模式训练还是 :code:`Reduce` 模式训练。"
":code:`enable_backward_optimizer_op_deps`", "bool", "True", "在反向操作和参数更新操作之间添加依赖,保证在所有的反向操作都运行结束之后才开始运行参数更新操作。"
":code:`fuse_all_optimizer_ops`", "bool", "False", "对模型中的参数更新算法进行融合。"
":code:`fuse_all_reduce_ops`", "bool", "False", "多卡训练时,将all_reduce操作进行融合。"
":code:`fuse_relu_depthwise_conv`", "bool", "False", "如果模型中存在relu和depthwise_conv,并且是连接的,即relu->depthwise_conv,该选项可以将这两个操作合并为一个。"
":code:`fuse_broadcast_ops`", "bool", "False", "在 :code:`Reduce` 模式下,将最后的多个Broadcast操作融合为一个。"
":code:`mkldnn_enabled_op_types`", "list", "{}", "如果是CPU训练,可以用 :code:`mkldnn_enabled_op_types` 指明模型中的哪些操作可以使用MKLDNN库。默认情况下,模型中用到的操作如果在Paddle目前支持的可以使用mkldnn库计算的列表中,这些操作都会调用mkldnn库的接口进行计算。"
":code:`debug_graphviz_path`", "str", "{}", "将Graph以graphviz格式输出到debug_graphviz_path所指定的文件中。"
参数说明:
(1) 关于 :code:`reduce_strategy` , :code:`ParallelExecutor` 对于数据并行支持两种参数更新模式: :code:`AllReduce` 和 :code:`Reduce` 。在 :code:`AllReduce` 模式下,各个节点上计算得到梯度之后,调用 :code:`AllReduce` 操作,梯度在各个节点上聚合,然后各个节点分别进行参数更新。在 :code:`Reduce` 模式下,参数的更新操作被均匀地分配到各个节点上,即各个节点计算得到梯度之后,将梯度在指定的节点上进行 :code:`Reduce` ,并在该节点上完成参数更新,最后将更新之后的参数Broadcast到其他节点。即:如果模型中有100个参数需要更新,训练时使用的是4个节点,在 :code:`AllReduce` 模式下,各个节点需要分别对这100个参数进行更新;在 :code:`Reduce` 模式下,每个节点只需要对其中的25个参数进行更新,最后将更新后的参数Broadcast到其他节点上。注意:如果是使用CPU进行数据并行训练,在Reduce模式下,不同CPUPlace上的参数是共享的,所以在各个CPUPlace上完成参数更新之后,不用将更新后的参数Broadcast到其他CPUPlace。
(2) 关于 :code:`enable_backward_optimizer_op_deps` ,在多卡训练时,打开该选项可能会提升训练速度。
(3) 关于 :code:`fuse_all_optimizer_ops` ,目前只支持SGD、Adam和Momentum算法。 **注意:目前不支持sparse参数梯度** 。
(4) 关于 :code:`fuse_all_reduce_ops` ,多GPU训练时,可以对 :code:`AllReduce` 操作进行融合,以减少 :code:`AllReduce` 的调用次数。默认情况下会将同一layer中参数的梯度的 :code:`AllReduce` 操作合并成一个,比如对于 :code:`fluid.layers.fc` 中有Weight和Bias两个参数,打开该选项之后,原本需要两次 :code:`AllReduce` 操作,现在只用一次 :code:`AllReduce` 操作。此外,为支持更大粒度的参数梯度融合,Paddle提供了 :code:`FLAGS_fuse_parameter_memory_size` 选项,用户可以指定融合AllReduce操作之后,每个 :code:`AllReduce` 操作的梯度字节数,比如希望每次 :code:`AllReduce` 调用传输64MB的梯度,:code:`export FLAGS_fuse_parameter_memory_size=64` 。 **注意:目前不支持sparse参数梯度** 。
(5) 关于 :code:`mkldnn_enabled_op_types` ,目前Paddle的Op中可以使用mkldnn库计算的操作包括:transpose、sum、softmax、requantize、quantize、pool2d、lrn、gaussian_random、fc、dequantize、conv2d_transpose、conv2d、conv3d、concat、batch_norm、relu、tanh、sqrt、abs。
3.3 执行策略(ExecutionStrategy)配置参数介绍
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ExecutionStrategy中提供了关于计算图执行时的一些配置,这些配置可能会影响模型的训练速度。同时,这些配置与模型的结构有关,如果用户希望模型训练速度更快,可以调整一下这些配置。在后续的优化中,我们会对这部分进行优化,根据输入模型结构动态调整这些设置。
ExecutionStrategy配置选项说明:
.. csv-table::
:header: "选项", "类型", "默认值", "说明"
:widths: 3, 3, 5, 5
":code:`num_iteration_per_drop_scope`", "INT", "100", "经过多少次迭代之后清理一次local execution scope"
":code:`num_threads`", "INT", "对于CPU:2*dev_count;对于GPU:4*dev_count. (这是一个经验值)", ":code:`ParallelExecutor` 中执行所有Op使用的线程池大小"
说明:
(1) 关于 :code:`num_iteration_per_drop_scope` ,框架在运行过程中会产生一些临时变量,默认每经过一个batch就要清理一下临时变量。由于GPU是异步设备,在清理之前需要对所有的GPU调用一次同步操作,因此耗费的时间较长。为此我们在execution_strategy中添加了 :code:`num_iteration_per_drop_scope` 选项。用户可以指定经过多少次迭代之后清理一次。
(2) 关于 :code:`num_threads` ,:code:`ParallelExecutor` 根据Op之间的依赖关系确定Op的执行顺序,即:当Op的输入都已经变为ready状态之后,该Op会被放到一个队列中,等待被执行。 :code:`ParallelExecutor` 内部有一个任务调度线程和一个线程池,任务调度线程从队列中取出所有Ready的Op,并将其放到线程队列中。 :code:`num_threads` 表示线程池的大小。根据以往的经验,对于CPU任务,:code:`num_threads=2*dev_count` 时性能较好,对于GPU任务,:code:`num_threads=4*dev_count` 时性能较好。 **注意:线程池不是越大越好** 。
4. 运行时FLAGS设置优化
=================
Paddle中有一些FLAGS可以有助于性能优化:
(1) :code:`FLAGS_cudnn_exhaustive_search` 表示在调用cuDNN中的卷积操作时,根据输入数据的shape等信息,采取穷举搜索的策略从算法库中选取到更快的卷积算法,进而实现对模型中卷积操作的加速。需要注意的是:
- 在搜索算法过程中需要使用较多的显存,如果用户的模型中卷积操作较多,或者GPU卡显存较小,可能会出现显存不足问题。
- 通过穷举搜索选择好算法之后,该算法会进入Cache,以便下次运行时,如果输入数据的shape等信息不变,直接使用Cache中算法。
(2) :code:`FLAGS_enable_cublas_tensor_op_math` 表示是否使用TensorCore加速cuBLAS等NV提供的库中的操作。需要注意的是,这个环境变量只在Tesla V100以及更新的GPU上适用,且可能会带来一定的精度损失,通常该损失不会影响模型的收敛性。
5. 优秀实践
=================
(1) 尽可能的使用飞桨提供的单个layer实现所需操作。
(2) 采用异步数据读取。
(3) 模型训练相关优化:
- 使用ParallelExecutor作为底层执行器。单卡训练,也可以调用with_data_parallel方法。代码示例:
.. code-block:: python
compiled_prog = compiler.CompiledProgram(
fluid.default_main_program()).with_data_parallel(
loss_name=loss.name)
- 如果模型中参数的梯度都是非sparse的,可以打开fuse_all_optimizer_ops选项,将多个参数更新操作融合为一个。
- 如果是多卡训练,可以打开enable_backward_optimizer_op_deps、fuse_all_reduce_ops选项。如果想指定每次AllReduce操作的数据大小,可以设置 :code:`FLAGS_fuse_parameter_memory_size`,比如 :code:`export FLAGS_fuse_parameter_memory_size=1` ,表示每次AllReduce调用传输1MB的梯度。
- 使用CPU做数据并行训练时,推荐使用Reduce模式。因为在Reduce模式下,不同CPUPlace上的参数是共享的,所以在各个CPUPlace上完成参数更新之后,不用将更新后的参数Broadcast到其他CPUPlace上,这对提升速度也有很大帮助。
- 如果是Reduce模式,可打开fuse_broadcast_ops选项。
- 如果用户的模型较小,比如mnist、language_model等,可以将num_threads设为1。
- 在显存足够的前提下,建议将 :code:`exec_strategy.num_iteration_per_drop_scope` 设置成一个较大的值,比如设置为100,这样可以避免反复地申请和释放内存。
目前我们正在推进这些配置自动化的工作:即根据输入的模型结构自动配置这些选项,争取在下一个版本中实现,敬请期待。
(4) FLAGS设置
.. code-block:: bash
export FLAGS_cudnn_exhaustive_search=True
export FLAGS_enable_cublas_tensor_op_math=True
6. 使用Profile工具进行性能分析
======================
为方便用户更好的发现程序中的性能瓶颈,Paddle提供了多种Profile工具,这些工具的详细介绍和使用说明请参考 :ref:`api_guide_analysis_tools` 。
# 如何在框架外部自定义C++ OP
通常,如果PaddlePaddle的Operator(OP)库中没有您所需要的操作,建议先尝试使用已有的OP进行组合;如果无法组合出您需要的操作,可以尝试使用`fluid.layers.py_func`实现,或者按照这篇教程自定义C++ OP。当然,如果用若干OP组合出来的OP性能无法满足您的要求,也可以直接自定义C++ OP。
自定义OP需要以下几个步骤:
1. 实现OP和注册OP,和在框架内部写OP完全相同,遵守"如何写新的C++ OP"的规范和步骤。当然,实现Gradient OP是可选的。
2. 编译出动态库。
3. 封装该OP的Python接口。
4. 写OP的单测。
下面以实现relu OP为例,一步一步地详细介绍如何自定义OP。
## 自定义OP的实现
OP的实现与"如何写新的C++ OP"的教程相同,简单地说需要:1) 定义OP的ProtoMaker,即描述OP的输入、输出、属性信息;2) 实现OP的定义和InferShape,以及OP的kernel函数,反向OP类似;3) 注册OP,以及OP的计算函数。
ReLU OP的CPU实现, ``relu_op.cc`` 文件:
```cpp
// relu_op.cc
#include "paddle/fluid/framework/op_registry.h"
namespace paddle {
namespace operators {
// 前向OP的输入X、输出Y、属性
class Relu2OpMaker : public framework::OpProtoAndCheckerMaker {
public:
void Make() override {
AddInput("X", "The input tensor.");
AddOutput("Y", "Output of relu_op");
AddComment(R"DOC(
Relu Operator.
Y = max(X, 0)
)DOC");
}
};
// 前向OP的定义和InferShape实现,设置输出Y的shape
class Relu2Op : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
auto in_dims = ctx->GetInputDim("X");
ctx->SetOutputDim("Y", in_dims);
}
};
// 实现前向OP的Kernel计算函数: Y = max(0, X)
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class Relu2Kernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* in_t = ctx.Input<Tensor>("X");
auto* out_t = ctx.Output<Tensor>("Y");
auto x = in_t->data<T>();
// mutable_data分配内存、获取指针
auto y = out_t->mutable_data<T>(ctx.GetPlace());
for (int i = 0; i < in_t->numel(); ++i) {
y[i] = std::max(static_cast<T>(0.), x[i]);
}
}
};
// 定义反向OP的输入Y和dY、输出dX、属性:
template <typename T>
class Relu2GradMaker : public framework::SingleGradOpMaker<T> {
public:
using framework::SingleGradOpMaker<T>::SingleGradOpMaker;
void Apply(GradOpPtr<T> op) const override {
op->SetType("relu2_grad");
op->SetInput("Y", this->Output("Y"));
op->SetInput(framework::GradVarName("Y"), this->OutputGrad("Y"));
op->SetAttrMap(this->Attrs());
op->SetOutput(framework::GradVarName("X"), this->InputGrad("X"));
}
};
// 定义反向OP和InferShape实现,设置dX的shape
class Relu2GradOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
auto in_dims = ctx->GetInputDim(framework::GradVarName("Y"));
ctx->SetOutputDim(framework::GradVarName("X"), in_dims);
}
};
// 实现反向OP的kernel函数 dx = dy * ( y > 0. ? 1. : 0)
template <typename DeviceContext, typename T>
class Relu2GradKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* dy_t = ctx.Input<Tensor>(framework::GradVarName("Y"));
auto* y_t = ctx.Input<Tensor>("Y");
auto* dx_t = ctx.Output<Tensor>(framework::GradVarName("X"));
auto dy = dy_t->data<T>();
auto y = y_t->data<T>();
auto dx = dx_t->mutable_data<T>(ctx.GetPlace());
for (int i = 0; i < y_t->numel(); ++i) {
dx[i] = dy[i] * (y[i] > static_cast<T>(0) ? 1. : 0.);
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
using CPU = paddle::platform::CPUDeviceContext;
// 注册前向和反向op
// 为了和框架内部的relu区分,这里注册的OP type为relu2
REGISTER_OPERATOR(relu2,
ops::Relu2Op,
ops::Relu2OpMaker,
ops::Relu2GradMaker<paddle::framework::OpDesc>,
ops::Relu2GradMaker<paddle::imperative::OpBase>);
REGISTER_OPERATOR(relu2_grad, ops::Relu2GradOp);
// 注册CPU的Kernel
REGISTER_OP_CPU_KERNEL(relu2,
ops::Relu2Kernel<CPU, float>,
ops::Relu2Kernel<CPU, double>);
REGISTER_OP_CPU_KERNEL(relu2_grad,
ops::Relu2GradKernel<CPU, float>,
ops::Relu2GradKernel<CPU, double>);
```
ReLU OP的GPU实现, ``relu_op.cu`` 文件:
```cpp
// relu_op.cu
#include "paddle/fluid/framework/op_registry.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename T>
__global__ void KeRelu2(const T* x, const int num, T* y) {
int gid = blockIdx.x * blockDim.x + threadIdx.x;
for (int i = gid; i < num; i += blockDim.x * gridDim.x) {
y[i] = max(x[i], static_cast<T>(0.));
}
}
// 前向OP的kernel的GPU实现
template <typename DeviceContext, typename T>
class Relu2CUDAKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* in_t = ctx.Input<Tensor>("X");
auto* out_t = ctx.Output<Tensor>("Y");
auto x = in_t->data<T>();
auto y = out_t->mutable_data<T>(ctx.GetPlace());
auto& dev_ctx = ctx.template device_context<DeviceContext>();
int num = in_t->numel();
int block = 512;
int grid = (num + block - 1) / block;
KeRelu2<T><<<grid, block, 0, dev_ctx.stream()>>>(x, num, y);
}
};
template <typename T>
__global__ void KeRelu2Grad(const T* y, const T* dy, const int num, T* dx) {
int gid = blockIdx.x * blockDim.x + threadIdx.x;
for (int i = gid; i < num; i += blockDim.x * gridDim.x) {
dx[i] = dy[i] * (y[i] > 0 ? 1. : 0.);
}
}
// 反向OP的kernel的GPU实现
template <typename DeviceContext, typename T>
class Relu2GradCUDAKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* dy_t = ctx.Input<Tensor>(framework::GradVarName("Y"));
auto* y_t = ctx.Input<Tensor>("Y");
auto* dx_t = ctx.Output<Tensor>(framework::GradVarName("X"));
auto dy = dy_t->data<T>();
auto y = y_t->data<T>();
auto dx = dx_t->mutable_data<T>(ctx.GetPlace());
auto& dev_ctx = ctx.template device_context<DeviceContext>();
int num = dy_t->numel();
int block = 512;
int grid = (num + block - 1) / block;
KeRelu2Grad<T><<<grid, block, 0, dev_ctx.stream()>>>(y, dy, num, dx);
}
};
} // namespace operators
} // namespace paddle
using CUDA = paddle::platform::CUDADeviceContext;
// 注册前向的GPU Kernel
REGISTER_OP_CUDA_KERNEL(relu2,
paddle::operators::Relu2CUDAKernel<CUDA, float>,
paddle::operators::Relu2CUDAKernel<CUDA, double>);
// 注册反向的GPU Kernel
REGISTER_OP_CUDA_KERNEL(relu2_grad,
paddle::operators::Relu2GradCUDAKernel<CUDA, float>,
paddle::operators::Relu2GradCUDAKernel<CUDA, double>);
```
注意点:
1. OP的type不能和PaddlePaddle已有的OP type相同,否则在Python中使用时会报错。
## 自定义OP的编译
需要将实现的C++、CUDA代码编译成动态库,下面通过g++/nvcc编译,当然您也可以写Makefile或者CMake。
编译需要include PaddlePaddle的相关头文件,如上面代码 `paddle/fluid/framework/op_registry.h` ,需要链接PaddlePaddle的lib库。 可通过下面命令获取到:
``` python
# python
>>> import paddle
>>> print(paddle.sysconfig.get_include())
/paddle/pyenv/local/lib/python2.7/site-packages/paddle/include
>>> print(paddle.sysconfig.get_lib())
/paddle/pyenv/local/lib/python2.7/site-packages/paddle/libs
```
下面命令可编译出动态库:
```shell
include_dir=$( python -c 'import paddle; print(paddle.sysconfig.get_include())' )
lib_dir=$( python -c 'import paddle; print(paddle.sysconfig.get_lib())' )
echo $include_dir
echo $lib_dir
# PaddlePaddle >= 1.6.1 时,仅需要 include ${include_dir} 和 ${include_dir}/third_party
nvcc relu_op.cu -c -o relu_op.cu.o -ccbin cc -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO -DPADDLE_WITH_MKLDNN -Xcompiler -fPIC -std=c++11 -Xcompiler -fPIC -w --expt-relaxed-constexpr -O3 -DNVCC \
-I ${include_dir} \
-I ${include_dir}/third_party

g++ relu_op.cc relu_op.cu.o -o relu2_op.so -shared -fPIC -std=c++11 -O3 -DPADDLE_WITH_MKLDNN \
-I ${include_dir} \
-I ${include_dir}/third_party \
-L /usr/local/cuda/lib64 \
-L ${lib_dir} -lpaddle_framework -lcudart
```
注意点:
1. 通过NVCC编译CUDA源文件时,需要加编译选项 `-DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO`,在框架源码中会使用这些宏定义进行条件编译。用户自定义的C++ OP实现编译时,选项的开启状态需要和核心框架编译行为一致。如`EIGEN_USE_GPU`是使用Eigen数学库的GPU实现时需要增加的编译选项。
2. 如果飞桨安装包中不包含MKLDNN库,则需要去掉编译选项`-DPADDLE_WITH_MKLDNN`。核心框架源码中(比如tensor.h)有使用此宏定义进行条件编译,该选项是否打开同样需要和核心框架编译行为保持一致。默认的飞桨安装包中含有MKLDNN库。
3. 可多个OP编译到同一个动态库中。
4. 通过pip方式安装的PaddlePaddle由GCC 4.8编译得到,由于GCC 4.8和GCC 5以上**C++11 ABI不兼容**,您编写的自定义OP,需要通过GCC 4.8编译。若是在GCC 5及以上的环境上使用自定义OP,推荐使用[Docker安装PaddlePaddle](https://www.paddlepaddle.org.cn/install/doc/docker),使得编译Paddle和编译自定义OP的GCC版本相同。
## 封装Python Layer接口
需要使用 `fluid.load_op_library` 接口调用加载动态库,使得PaddlePaddle的主进程中可以使用用户自定义的OP。
``` python
# custom_op.py
import paddle.fluid as fluid
# 调用load_op_library加载动态库
fluid.load_op_library('relu2_op.so')
from paddle.fluid.layer_helper import LayerHelper
def relu2(x, name=None):
# relu2的type和在OP中定义的type相同
helper = LayerHelper("relu2", **locals())
# 创建输出Variable
out = helper.create_variable_for_type_inference(dtype=x.dtype)
helper.append_op(type="relu2", inputs={"X": x}, outputs={"Y": out})
return out
```
注意点:
1. 一个动态库只需在 `paddle.fluid` import 之后,使用 `fluid.load_op_library` 加载一次即可。
2. Python接口的封装和PaddlePaddle框架内部的封装相同,更多的示例也可以阅读源码中 `python/paddle/fluid/layers/nn.py`的代码示例。
## 单测测试
可以写个简单的Python程序测试计算的正确性:
``` python
import numpy as np
import paddle.fluid as fluid
from custom_op import relu2
data = fluid.layers.data(name='data', shape=[32], dtype='float32')
relu = relu2(data)
use_gpu = True # or False
place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
x = np.random.uniform(-1, 1, [4, 32]).astype('float32')
out, = exe.run(feed={'data': x}, fetch_list=[relu])
assert np.allclose(out, np.maximum(x, 0.))  # 验证自定义OP的输出与numpy结果一致
```
接下来可以在模型中使用您自定义的OP了!
## 如何在C++预测库中使用
暂时不支持在C++预测库中使用,后续会补充在C++预测库中的使用示例。
## FAQ
1. Q: 如果出现类似错误: `relu2_op.so: cannot open shared object file: No such file or directory` 以及 `libpaddle_framework.so: cannot open shared object file: No such file or directory`
A: 需要将`relu2_op.so`所在路径以及`libpaddle_framework.so`路径(即`paddle.sysconfig.get_lib()`得到路径)设置到环境变量LD_LIBRARY_PATH中:
```shell
# 假如relu2_op.so路径是:`paddle/test`,对于Linux环境设置:
export LD_LIBRARY_PATH=paddle/test:$( python -c 'import paddle; print(paddle.sysconfig.get_lib())'):$LD_LIBRARY_PATH
```
#############
新增OP
#############
本部分将指导您如何新增Operator,也包括一些必要的注意事项
- `如何写新的C++ op <./new_op.html>`_
- `C++ op相关注意事项 <./op_notes.html>`_
- `如何写新的Python op <./new_python_op.html>`_
- `如何在框架外部自定义C++ op <./custom_op.html>`_
.. toctree::
:hidden:
new_op.md
op_notes.md
new_python_op.md
custom_op.md
###################
Write New Operators
###################
This section will show you how to add an operator. It also includes some necessary notes.
- `How to write new operator <new_op_en.html>`_ : a guide to writing new operators
- `op notes <op_notes_en.html>`_ : notes on developing new operators
.. toctree::
:hidden:
new_op_en.md
op_notes_en.md
# 如何写新的C++ OP
## 概念简介
简单介绍需要用到的基类,详细介绍请参考[设计文档](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/motivation/refactorization.md#operatoropwithkernelopkernel)。
- `framework::OperatorBase`: Operator(简写,Op)基类。
- `framework::OpKernel`: Op计算函数的基类,称作Kernel。
- `framework::OperatorWithKernel`:继承自OperatorBase,Op有计算函数,称作有Kernel。
- `framework::OpProtoAndCheckerMaker`:描述该Op的输入、输出、属性、注释,主要用于Python API接口生成。
根据是否包含Kernel,可以将Op分为两种:包含Kernel的Op和不包含kernel的Op:
- 包含Kernel的Op继承自`OperatorWithKernel`,这类Op的功能实现与输入的数据类型、数据布局、数据所在的设备以及Op实现所调用第三方库等有关。比如ConvOp,如果使用CPU计算,一般通过调用mkl库中的矩阵乘操作实现,如果使用GPU计算,一般通过调用cublas库中的矩阵乘操作实现,或者直接调用cudnn库中的卷积操作。
- 不包含Kernel的Op继承自`OperatorBase`,因为这类Op的功能实现与设备以及输入的数据不相关。比如WhileOp、IfElseOp等。
本教程主要介绍带Kernel的Op如何写,简单总结Op需要包含的内容如下:
<table>
<thead>
<tr>
<th>内容</th>
<th>定义位置</th>
</tr>
</thead>
<tbody>
<tr>
<td>OpProtoMake定义 </td>
<td>.cc 文件 </td>
</tr>
<tr>
<td>Op定义 </td>
<td> .cc 文件</td>
</tr>
<tr>
<td>Kernel实现 </td>
<td> CPU、CUDA共享Kernel实现在.h 文件中,否则,CPU 实现在.cc 文件中,CUDA 实现在.cu 文件中。</td>
</tr>
<tr>
<td>注册Op </td>
<td> Op注册实现在.cc 文件;Kernel注册CPU实现在.cc 文件中,CUDA实现在.cu 文件中</td>
</tr>
</tbody>
</table>
实现新的op都添加至目录[paddle/fluid/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators)下,文件命名以`*_op.h`(如有)、`*_op.cc`和`*_op.cu`(如有)结尾。**系统会根据文件名自动构建op和其对应的Python扩展。**
下面以矩阵乘操作,即[MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc)为例来介绍如何写带Kernel的Operator。
## 实现C++类
### 定义ProtoMaker类
矩阵乘法的公式:$Out = X * Y$, 可见该计算由两个输入,一个输出组成。
首先定义`ProtoMaker`来描述该Op的输入、输出,并添加注释:
```cpp
class MulOpMaker : public framework::OpProtoAndCheckerMaker {
public:
void Make() override {
AddInput("X", "(Tensor), The first input tensor of mul op.");
AddInput("Y", "(Tensor), The second input tensor of mul op.");
AddOutput("Out", "(Tensor), The output tensor of mul op.");
AddAttr<bool>("use_mkldnn",
"(bool, default false) Only used in mkldnn kernel")
.SetDefault(false);
AddAttr<int>(
"x_num_col_dims",
R"DOC((int, default 1), The mul_op can take tensors with more than two
dimensions as its inputs. If the input $X$ is a tensor with more
than two dimensions, $X$ will be flattened into a two-dimensional
matrix first. The flattening rule is: the first `num_col_dims`
will be flattened to form the first dimension of the final matrix
(the height of the matrix), and the rest `rank(X) - num_col_dims`
dimensions are flattened to form the second dimension of the final
matrix (the width of the matrix). As a result, height of the
flattened matrix is equal to the product of $X$'s first
`x_num_col_dims` dimensions' sizes, and width of the flattened
matrix is equal to the product of $X$'s last `rank(x) - num_col_dims`
dimensions' size. For example, suppose $X$ is a 6-dimensional
tensor with the shape [2, 3, 4, 5, 6], and `x_num_col_dims` = 3.
Thus, the flattened matrix will have a shape [2 x 3 x 4, 5 x 6] =
[24, 30].
)DOC")
.SetDefault(1)
.EqualGreaterThan(1);
AddAttr<int>(
"y_num_col_dims",
R"DOC((int, default 1), The mul_op can take tensors with more than two,
dimensions as its inputs. If the input $Y$ is a tensor with more
than two dimensions, $Y$ will be flattened into a two-dimensional
matrix first. The attribute `y_num_col_dims` determines how $Y$ is
flattened. See comments of `x_num_col_dims` for more details.
)DOC")
.SetDefault(1)
.EqualGreaterThan(1);
AddAttr<float>(
"scale_x",
"scale_x to be used for int8 mul input data x. scale_x has the"
"same purpose as scale_in in OPs that support quantization."
"Only to be used with MKL-DNN INT8")
.SetDefault(1.0f);
AddAttr<std::vector<float>>(
"scale_y",
"scale_y to be used for int8 mul input data y. scale_y has the"
"same purpose as scale_weights in OPs that support quantization."
"Only to be used with MKL-DNN INT8")
.SetDefault({1.0f});
AddAttr<float>("scale_out",
"scale_out to be used for int8 output data."
"Only used with MKL-DNN INT8")
.SetDefault(1.0f);
AddAttr<bool>(
"force_fp32_output",
"(bool, default false) Force quantize kernel output FP32, only "
"used in quantized MKL-DNN.")
.SetDefault(false);
AddComment(R"DOC(
Mul Operator.
This operator is used to perform matrix multiplication for input $X$ and $Y$.
The equation is:
$$Out = X * Y$$
Both the input $X$ and $Y$ can carry the LoD (Level of Details) information,
or not. But the output only shares the LoD information with input $X$.
)DOC");
}
};
```
[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc)继承自`framework::OpProtoAndCheckerMaker`
开发者通过覆盖`framework::OpProtoAndCheckerMaker`中的`Make`函数来定义Op所对应的Proto,通过`AddInput`添加输入参数,通过`AddOutput`添加输出参数,通过`AddAttr`添加属性参数,通过`AddComment`添加Op的注释。这些函数会将对应内容添加到`OpProto`中。
上面的代码在`MulOp`中添加了两个输入`X`和`Y`,添加了一个输出`Out`,以及`use_mkldnn`等属性,并解释了各自含义,命名请遵守[命名规范](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/name_convention.md)。
### 定义GradOpMaker类
通常情况下,大部分Op只有一个对应的反向Op,每个Op都会有一个对应的`GradOpMaker`。为方便代码编写,fluid为这种情况提供了一个模板类[`SingleGradOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/grad_op_desc_maker.h#L188)。`MulOp`的`GradOpMaker`需要继承这个模板类,并在`Apply()`方法中设置反向Op的输入、输出和属性。此外,fluid还提供了一个默认的`GradOpMaker`,
[`DefaultGradOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/grad_op_desc_maker.h#L227),该模板类会使用前向Op的全部输入(`Input`)、输出(`Output`)以及输出变量所对应的梯度(`Output@Grad`)作为反向Op的输入,将前向Op的输入变量所对应的梯度(`Input@Grad`)作为输出。
**注意:**
不要将反向Op不会用到的变量放到反向Op的输入列表中,这样会导致这些不会被反向Op用到的变量的空间不能够及时回收,进而有可能导致用到该Op的模型可以设置的batch_size较低。
比如`relu`操作的前向操作为:`out.device(d) = x.cwiseMax(static_cast<T>(0));`反向操作为:`dx.device(d) = dout * (out > static_cast<T>(0)).template cast<T>();`。显然,反向操作中只用到了`out`、`dout`和`dx`,没有用到`x`。因此,通常不建议使用默认的`DefaultGradOpMaker`。
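例如,针对上面的`relu`,可以只把反向计算真正用到的变量传给反向Op。下面是一个示意性的写法(非Paddle源码原文,假设反向Op类型名为`relu_grad`、前向输出名为`Out`),沿用上面介绍的`SingleGradOpMaker`模板:
```cpp
// 仅为示意:relu 的反向只依赖 Out 和 Out@Grad,不把前向输入 X 传入反向Op
template <typename T>
class Relu2GradMaker : public framework::SingleGradOpMaker<T> {
 public:
  using framework::SingleGradOpMaker<T>::SingleGradOpMaker;

 protected:
  void Apply(GradOpPtr<T> op) const override {
    op->SetType("relu_grad");
    op->SetInput("Out", this->Output("Out"));
    op->SetInput(framework::GradVarName("Out"), this->OutputGrad("Out"));
    op->SetOutput(framework::GradVarName("X"), this->InputGrad("X"));
    op->SetAttrMap(this->Attrs());
  }
};
```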
下面的示例定义了`MulOp`的`GradOpMaker`:
```cpp
template <typename T>
class MulOpGradMaker : public framework::SingleGradOpMaker<T> {
public:
using framework::SingleGradOpMaker<T>::SingleGradOpMaker;
protected:
void Apply(GradOpPtr<T> retv) const override {
retv->SetType("mul_grad");
retv->SetInput("X", this->Input("X"));
retv->SetInput("Y", this->Input("Y"));
retv->SetInput(framework::GradVarName("Out"), this->OutputGrad("Out"));
retv->SetOutput(framework::GradVarName("X"), this->InputGrad("X"));
retv->SetOutput(framework::GradVarName("Y"), this->InputGrad("Y"));
retv->SetAttrMap(this->Attrs());
}
};
```
**注意:**
- 有些Op的前向逻辑和反向逻辑是一样的,比如[`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/scale_op.cc)。这种情况下,前向Op和反向Op的Kernel可以为同一个。
- 有些前向Op所对应的反向Op可能有多个,比如[`SumOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/sum_op.cc),这种情况下,`GradMaker`需要继承`framework::GradOpDescMakerBase`
- 有些Op的反向对应另一个Op的前向,比如[`SplitOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/split_op.h),这种情况下,[`SplitGradMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/split_op.h#L157)中定义的`SplitOp`反向Op的Type就是`concat`
- 为高效地同时支持命令式编程模式(动态图)和声明式编程模式(静态图),`SingleGradOpMaker`是一个模板类,在注册Operator时需要同时注册`MulOpGradMaker<OpDesc>`(声明式编程模式使用)和`MulOpGradMaker<OpBase>`(命令式编程模式使用)。
### 定义Operator类
下面实现了MulOp的定义:
```cpp
class MulOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
PADDLE_ENFORCE_EQ(
ctx->HasInput("X"), true,
platform::errors::NotFound("Input(X) of MulOp should not be null."));
PADDLE_ENFORCE_EQ(
ctx->HasInput("Y"), true,
platform::errors::NotFound("Input(Y) of MulOp should not be null."));
PADDLE_ENFORCE_EQ(
ctx->HasOutput("Out"), true,
platform::errors::NotFound("Output(Out) of MulOp should not be null."));
auto x_dims = ctx->GetInputDim("X");
auto y_dims = ctx->GetInputDim("Y");
int x_num_col_dims = ctx->Attrs().Get<int>("x_num_col_dims");
int y_num_col_dims = ctx->Attrs().Get<int>("y_num_col_dims");
VLOG(3) << "mul operator x.shape=" << x_dims << " y.shape=" << y_dims
<< " x_num_col_dims=" << x_num_col_dims
<< " y_num_col_dims=" << y_num_col_dims;
PADDLE_ENFORCE_NE(framework::product(y_dims), 0,
platform::errors::PreconditionNotMet(
"The Input variable Y(%s) has not "
"been initialized. You may need to confirm "
"if you put exe.run(startup_program) "
"after optimizer.minimize function.",
ctx->Inputs("Y").front()));
PADDLE_ENFORCE_GT(
x_dims.size(), x_num_col_dims,
platform::errors::InvalidArgument(
"The input tensor X's dimensions of MulOp "
"should be larger than x_num_col_dims. But received X's "
"dimensions = %d, X's shape = [%s], x_num_col_dims = %d.",
x_dims.size(), x_dims, x_num_col_dims));
PADDLE_ENFORCE_GT(
y_dims.size(), y_num_col_dims,
platform::errors::InvalidArgument(
"The input tensor Y's dimensions of MulOp "
"should be larger than y_num_col_dims. But received Y's "
"dimensions = %d, Y's shape = [%s], y_num_col_dims = %d.",
y_dims.size(), y_dims, y_num_col_dims));
auto x_mat_dims = framework::flatten_to_2d(x_dims, x_num_col_dims);
auto y_mat_dims = framework::flatten_to_2d(y_dims, y_num_col_dims);
PADDLE_ENFORCE_EQ(
x_mat_dims[1], y_mat_dims[0],
platform::errors::InvalidArgument(
"After flatten the input tensor X and Y to 2-D dimensions "
"matrix X1 and Y1, the matrix X1's width must be equal with matrix "
"Y1's height. But received X's shape = [%s], X1's shape = [%s], "
"X1's "
"width = %s; Y's shape = [%s], Y1's shape = [%s], Y1's height = "
"%s.",
x_dims, x_mat_dims, x_mat_dims[1], y_dims, y_mat_dims,
y_mat_dims[0]));
std::vector<int64_t> output_dims;
output_dims.reserve(
static_cast<size_t>(x_num_col_dims + y_dims.size() - y_num_col_dims));
for (int i = 0; i < x_num_col_dims; ++i) {
output_dims.push_back(x_dims[i]);
}
for (int i = y_num_col_dims; i < y_dims.size(); ++i) {
output_dims.push_back(y_dims[i]);
}
ctx->SetOutputDim("Out", framework::make_ddim(output_dims));
ctx->ShareLoD("X", /*->*/ "Out");
}
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const {
framework::LibraryType library = framework::LibraryType::kPlain;
framework::DataLayout layout = framework::DataLayout::kAnyLayout;
int customized_type_value =
framework::OpKernelType::kDefaultCustomizedTypeValue;
auto input_data_type = OperatorWithKernel::IndicateVarDataType(ctx, "X");
#ifdef PADDLE_WITH_MKLDNN
if (library == framework::LibraryType::kPlain &&
platform::CanMKLDNNBeUsed(ctx)) {
library = framework::LibraryType::kMKLDNN;
layout = framework::DataLayout::kMKLDNN;
if (input_data_type == framework::DataTypeTrait<int8_t>::DataType() ||
input_data_type == framework::DataTypeTrait<uint8_t>::DataType()) {
customized_type_value = kMULMKLDNNINT8;
}
}
#endif
return framework::OpKernelType(input_data_type, ctx.GetPlace(), layout,
library, customized_type_value);
}
};
```
[`MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L30)继承自`OperatorWithKernel`,其`public`成员如下:
```cpp
using framework::OperatorWithKernel::OperatorWithKernel;
```
这句表示使用基类`OperatorWithKernel`的构造函数,也可写成:
```cpp
MulOp(const std::string &type, const framework::VariableNameMap &inputs,
const framework::VariableNameMap &outputs,
const framework::AttributeMap &attrs)
: OperatorWithKernel(type, inputs, outputs, attrs) {}
```
此外,Operator类通常需要重写`InferShape`接口,并在有必要时重写`GetExpectedKernelType`接口。`InferShape`为const函数,不能修改Op的成员变量,参数为`framework::InferShapeContext* ctx`,通过该参数可获取到输入输出以及属性。它的功能是:
- 做检查, 尽早报错:检查输入数据维度、类型等是否合法。
- 设置输出Tensor的形状以及LoD信息。
`GetExpectedKernelType`接口是`OperatorWithKernel`类中用于获取指定设备(例如CPU,GPU)上指定数据类型(例如double,float)的OpKernel的方法。该方法重写的注意事项请参考[写C++ OP相关注意事项](op_notes.html#getexpectedkerneltype)。
通常`OpProtoMaker`和`Op`类的定义写在`.cc`文件中,和下面将要介绍的注册函数一起放在`.cc`文件中。
### InferShape区分 compile time 和 run time
在我们的声明式编程模式网络中,`InferShape`操作在[编译时(compile time)和运行时(run time)](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/getstarted/Developer's_Guide_to_Paddle_Fluid.md#%E8%AE%A9%E6%88%91%E4%BB%AC%E5%9C%A8fluid%E7%A8%8B%E5%BA%8F%E5%AE%9E%E4%BE%8B%E4%B8%AD%E5%8C%BA%E5%88%86%E7%BC%96%E8%AF%91%E6%97%B6%E5%92%8C%E8%BF%90%E8%A1%8C%E6%97%B6)都会被调用,在compile time时,由于真实的维度未知,框架内部用-1来表示,在run time时,用实际的维度表示,因此维度的值在compile time和 run time时可能不一致,如果存在维度的判断和运算操作,InferShape就需要区分compile time 和 run time。
以下两种情况需要区分compile time和 run time。
**1.检查**
如以下代码:
```cpp
auto x_dim = ctx->GetInputDim("X");
int i = xxx;
PADDLE_ENFORCE_GT( x_dim[i] , 10)
```
在compile time的时候,x_dim[i]可能等于-1,导致这个PADDLE_ENFORCE_GT报错退出。
如果用了以下paddle中定义的宏进行判断:
```cpp
PADDLE_ENFORCE_EQ ( x_dim[i] , 10)
PADDLE_ENFORCE_NE ( x_dim[i] , 10)
PADDLE_ENFORCE_GT ( x_dim[i] , 10)
PADDLE_ENFORCE_GE ( x_dim[i] , 10)
PADDLE_ENFORCE_LT ( x_dim[i] , 10)
PADDLE_ENFORCE_LE ( x_dim[i] , 10)
```
都需要区分compile time和run time
**2. 运算**
如以下代码:
```cpp
auto x_dim = ctx->GetInputDim("X");
int i = xxx;
y_dim[0] = x_dim[i] + 10
```
在compile time的时候,x_dim[i]可能等于-1,得到的 y_dim[0] 等于 9,是不符合逻辑的
如果用到了类似以下的运算操作
```cpp
y_dim[i] = x_dim[i] + 10
y_dim[i] = x_dim[i] - 10
y_dim[i] = x_dim[i] * 10
y_dim[i] = x_dim[i] / 10
y_dim[i] = x_dim[i] + z_dim[i]
```
都需要区分compile time和run time
**处理的标准**
- 检查: compile time的时候不判断维度等于-1的情况,但在runtime的时候检查
- 运算: -1和其他数做任何运算都要等于-1
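把上面两条标准落到代码上,大致是下面这种写法(仅为示意,`x_dim`、`y_dim`、`i`等变量名为假设):
```cpp
// 仅为示意:检查只在 run time 或维度已知(非 -1)时进行;运算时让 -1 向输出传播
if (ctx->IsRuntime() || (x_dim[i] > 0 && y_dim[i] > 0)) {
  PADDLE_ENFORCE_EQ(x_dim[i], y_dim[i],
                    "The %d-th dimension of X and Y must be equal.", i);
}
y_dim[i] = (x_dim[i] < 0) ? -1 : x_dim[i] + 10;
```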
**参考代码**
1. 判断的实现方法可以参考[cross_entropy_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.cc#L39),cross_entropy_op 要求 X 和 Label 这两个输入除最后一维以外的其他维度完全一致
```cpp
bool contain_unknown_dim = framework::contain_unknown_dim(x_dims) ||
framework::contain_unknown_dim(label_dims);
bool check = ctx->IsRuntime() || !contain_unknown_dim;
if (check) {
PADDLE_ENFORCE_EQ(framework::slice_ddim(x_dims, 0, rank - 1),
framework::slice_ddim(label_dims, 0, rank - 1),
"Input(X) and Input(Label) shall have the same shape "
"except the last dimension.");
}
```
2. 运算的实现可以参考[concat_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/concat_op.cc#L59),concat在InferShape判断时,调用`ComputeAndCheckShape`检查除concat轴之外的其他维度是否完全一致;在生成output的维度时,把concat轴的维度求和,其他的维度和输入保持一致。
```cpp
const size_t n = inputs_dims.size();
auto out_dims = inputs_dims[0];
size_t in_zero_dims_size = out_dims.size();
for (size_t i = 1; i < n; i++) {
for (size_t j = 0; j < in_zero_dims_size; j++) {
if (j == axis) {
if (is_runtime) {
out_dims[axis] += inputs_dims[i][j];
} else {
if (inputs_dims[i][j] == -1) {
out_dims[axis] = -1;
} else {
out_dims[axis] += inputs_dims[i][j];
}
}
} else {
bool check_shape =
is_runtime || (out_dims[j] > 0 && inputs_dims[i][j] > 0);
if (check_shape) {
// check all shape in run time
PADDLE_ENFORCE_EQ(
inputs_dims[0][j], inputs_dims[i][j],
"ShapeError: Dimension %d in inputs' shapes must be equal. "
"But recevied input[0]'s shape = "
"[%s], input[%d]'s shape = [%s].",
j, inputs_dims[0], i, inputs_dims[i]);
}
}
}
}
```
### 定义OpKernel类
`MulKernel`继承自`framework::OpKernel`,带有下面两个模板参数:
- `typename DeviceContext`: 表示设备类型。不同设备(CPU、CUDA)共享同一个Kernel时,需加该模板参数;不共享则不加,一个不共享的例子是[`SGDOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/optimizers/sgd_op.h)
- `typename T` : 表示数据类型,如`float`, `double`, `int16`等。
需要为`MulKernel`类重写`Compute`接口。
- `Compute`接受一个输入参数:`const framework::ExecutionContext& context`
- 与`InferShapeContext`相比,`ExecutionContext`增加了设备类型,同样可获取到输入输出和属性参数。
- `Compute`函数里实现`OpKernel`的具体计算逻辑。
Op的输入和输出可分别通过`ExecutionContext::Input<T>()``ExecutionContext::Output<T>()`获得。
**注意:** 若op的输入/输出的变量类型是`LoDTensor`(fluid默认所有的`Tensor`都是`LoDTensor`类型),请写成`ExecutionContext::Input<LoDTensor>()`和`ExecutionContext::Output<LoDTensor>()`,不要写`ExecutionContext::Input<Tensor>()`和`ExecutionContext::Output<Tensor>()`。因为若实际的变量类型为`SelectedRows`,`Input<Tensor>()`和`Output<Tensor>()`方法会将`SelectedRows`类型特化为`Tensor`,导致潜在的错误。
下面是 `MulKernel` 中 `Compute` 的实现:
```cpp
template <typename DeviceContext, typename T>
class MulKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
const Tensor* x = context.Input<Tensor>("X");
const Tensor* y = context.Input<Tensor>("Y");
Tensor* z = context.Output<Tensor>("Out");
const Tensor x_matrix =
x->dims().size() > 2
? framework::ReshapeToMatrix(
*x, context.template Attr<int>("x_num_col_dims"))
: *x;
const Tensor y_matrix =
y->dims().size() > 2
? framework::ReshapeToMatrix(
*y, context.template Attr<int>("y_num_col_dims"))
: *y;
z->mutable_data<T>(context.GetPlace());
auto z_dim = z->dims();
if (z_dim.size() != 2) {
z->Resize({x_matrix.dims()[0], y_matrix.dims()[1]});
}
auto blas = math::GetBlas<DeviceContext, T>(context);
blas.MatMul(x_matrix, y_matrix, z);
if (z_dim.size() != 2) {
z->Resize(z_dim);
}
}
};
```
需要注意:**不同设备(CPU、CUDA)共享一个Op定义,是否共享同一个`OpKernel`,取决于`Compute`调用的函数是否支持不同设备。**
`MulOp`的CPU、CUDA实现共享同一个`Kernel`。`OpKernel`不共享的例子可以参考:[`SGDOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/optimizers/sgd_op.h)。
为了使`OpKernel`的计算过程书写更加简单,并且CPU、CUDA的代码可以复用,我们通常借助 Eigen unsupported Tensor模块来实现`Compute`接口。关于在PaddlePaddle中如何使用Eigen库,请参考[使用文档](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/use_eigen_cn.md)
到此,前向Op实现完成。接下来,需要在`.cc`文件中注册该op和kernel。
反向Op类的定义,反向OpKernel的定义与前向Op类似,这里不再赘述。
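作为补充说明,反向Op的`InferShape`通常只需把各输入梯度的维度设置为对应前向输入的维度。下面是一个示意性的写法(非Paddle源码原文):
```cpp
// 仅为示意:反向Op将 X@Grad、Y@Grad 的维度设置为前向输入 X、Y 的维度
class MulGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;

  void InferShape(framework::InferShapeContext* ctx) const override {
    auto x_dims = ctx->GetInputDim("X");
    auto y_dims = ctx->GetInputDim("Y");
    if (ctx->HasOutput(framework::GradVarName("X"))) {
      ctx->SetOutputDim(framework::GradVarName("X"), x_dims);
    }
    if (ctx->HasOutput(framework::GradVarName("Y"))) {
      ctx->SetOutputDim(framework::GradVarName("Y"), y_dims);
    }
  }
};
```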
### 注册Operator
- 在`.cc`文件中注册前向、反向Op类,并注册CPU Kernel。
```cpp
namespace ops = paddle::operators;
REGISTER_OPERATOR(mul, ops::MulOp, ops::MulOpMaker, ops::MulOpInferVarType,
ops::MulOpGradMaker<paddle::framework::OpDesc>,
ops::MulOpGradMaker<paddle::imperative::OpBase>);
REGISTER_OPERATOR(mul_grad, ops::MulGradOp);
REGISTER_OP_CPU_KERNEL(mul,
ops::MulKernel<paddle::platform::CPUDeviceContext, float>,
ops::MulKernel<paddle::platform::CPUDeviceContext, double>);
REGISTER_OP_CPU_KERNEL(mul_grad,
ops::MulGradKernel<paddle::platform::CPUDeviceContext, float>,
ops::MulGradKernel<paddle::platform::CPUDeviceContext, double>);
```
在上面的代码中,使用`REGISTER_OPERATOR`注册了`ops::MulOp`类,类型名为`mul`,该类的`ProtoMaker`为`ops::MulOpMaker`,其`GradOpMaker`分别是`ops::MulOpGradMaker<paddle::framework::OpDesc>`(声明式编程模式使用)和`ops::MulOpGradMaker<paddle::imperative::OpBase>`(命令式编程模式使用),并使用`REGISTER_OPERATOR`注册`ops::MulGradOp`,类型名为`mul_grad`。然后,使用`REGISTER_OP_CPU_KERNEL`注册了`ops::MulKernel`类,并特化模板参数:设备为`paddle::platform::CPUPlace`,数据类型为`float`和`double`;同理,注册`ops::MulGradKernel`类。
- 在`.cu`文件中注册CUDA Kernel。
- 请注意,如果CUDA Kernel的实现基于Eigen unsupported模块,那么在 `.cu`的开始请加上宏定义 `#define EIGEN_USE_GPU`,代码示例如下:
```cpp
// if use Eigen unsupported module before include head files
#define EIGEN_USE_GPU
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(mul,
ops::MulKernel<paddle::platform::CUDADeviceContext, float>,
ops::MulKernel<paddle::platform::CUDADeviceContext, double>);
REGISTER_OP_CUDA_KERNEL(mul_grad,
ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>,
ops::MulGradKernel<paddle::platform::CUDADeviceContext, double>);
```
**注意:**
在运行Op时,框架系统会根据输入数据所在的设备、输入数据的类型等信息自动地选择合适的OpKernel,比如输入的数据是在GPU上,并且为`float`类型,框架系统会选择由`REGISTER_OP_CUDA_KERNEL`注册的`ops::MulKernel<paddle::platform::CUDADeviceContext, float>`。如果用户希望指定运行时可被调用的OpKernel,用户需要覆盖`framework::OperatorWithKernel`中的`GetExpectedKernelType`函数,比如`MulOp`会根据属性`use_mkldnn`为`false`还是为`true`决定是否调用mkldnn库来完成计算。
### 编译
在`build/paddle/fluid/operators`目录下,运行下面命令可以进行编译:
```
make mul_op
```
## 绑定Python
系统会对新增的op自动绑定Python,并链接到生成的lib库中。
### 使用mul操作在Python端构建Layer
在Python端,`mul`操作用于构建FC层,即:
$$Out = Act({X*W + b})$$
具体实现方式可参考[FC层的实现代码](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/layers/nn.py#L205)
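作为示意,下面用`LayerHelper`将`mul`与`elementwise_add`组合成一个简化的FC层(函数名`simple_fc`为假设,未包含激活函数和参数创建逻辑,`x`、`w`、`b`均为已创建好的Variable):
```python
# 仅为示意:用 mul + elementwise_add 组建简化的 FC 层
from paddle.fluid.layer_helper import LayerHelper

def simple_fc(x, w, b):
    helper = LayerHelper("mul", **locals())
    # Out = X * W
    tmp = helper.create_variable_for_type_inference(dtype=x.dtype)
    helper.append_op(type="mul", inputs={"X": x, "Y": w}, outputs={"Out": tmp})
    # Out = Out + b
    out = helper.create_variable_for_type_inference(dtype=x.dtype)
    helper.append_op(
        type="elementwise_add", inputs={"X": tmp, "Y": b}, outputs={"Out": out})
    return out
```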
## 实现单元测试
单测包括对比前向Op不同设备(CPU、CUDA)的实现、对比反向OP不同设备(CPU、CUDA)的实现、反向Op的梯度测试。下面介绍[`MulOp`的单元测试](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_mul_op.py)。
**注意:**
单测中的测试用例需要尽可能的覆盖Op中的所有分支。
### 前向Operator单测
Op单元测试继承自`OpTest`。各项具体的单元测试在`TestMulOp`里完成。测试Operator,需要:
1. 在`setUp`函数中定义输入、输出,以及相关的属性参数。
> 注意:请以`ndarray`的类型配置输入/输出,如果需要配置一个带`LOD`的输入/输出,请以`tuple`的形式传入,`tuple`中应该有两个类型为`ndarray`的元素,第一个是实际的数据,第二个是`LOD`。
2. 生成随机的输入数据。
3. 在Python脚本中实现与前向operator相同的计算逻辑,得到输出值,与operator前向计算的输出进行对比。
4. 反向计算已经自动集成进测试框架,直接调用相应接口即可。
```python
import unittest
import numpy as np
from op_test import OpTest
class TestMulOp(OpTest):
def setUp(self):
self.op_type = "mul"
self.inputs = {
'X': np.random.random((32, 84)).astype("float32"),
'Y': np.random.random((84, 100)).astype("float32")
}
self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])}
def test_check_output(self):
self.check_output()
def test_check_grad_normal(self):
self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
def test_check_grad_ingore_x(self):
self.check_grad(
['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
def test_check_grad_ingore_y(self):
self.check_grad(
['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
```
上面的代码首先导入依赖的包,下面是对`setUp`函数中操作的重要变量的详细解释:
- `self.op_type = "mul" ` : 定义类型,与operator注册时注册的类型一致。
- `self.inputs` : 定义输入,类型为`numpy.array`,并初始化。
- `self.outputs` : 定义输出,并在Python脚本中完成与operator同样的计算逻辑,返回Python端的计算结果。
### 反向operator单测
而反向测试中:
- `test_check_grad_normal`中调用`check_grad`使用数值法检测梯度正确性和稳定性。
- 第一个参数`["X", "Y"]` : 指定对输入变量`X``Y`做梯度检测。
- 第二个参数`"Out"` : 指定前向网络最终的输出目标变量`Out`
- 第三个参数`max_relative_error`:指定检测梯度时能容忍的最大错误值。
- `test_check_grad_ingore_x`和`test_check_grad_ingore_y`分支用来测试只需要计算一个输入梯度的情况。
### 编译和执行
`python/paddle/fluid/tests/unittests/` 目录下新增的 `test_*.py` 单元测试会被自动加入工程进行编译。
请注意,**运行单元测试时需要编译整个工程**,并且编译时需要打开`WITH_TESTING`, 即`cmake -DWITH_TESTING=ON ..`。编译成功后,执行下面的命令来运行单元测试:
```bash
make test ARGS="-R test_mul_op -V"
```
或者:
```bash
ctest -R test_mul_op
```
## 注意事项
- 注册Op时的类型名,需要和该Op的名字一样。即不允许在`A_op.cc`里面,注册`REGISTER_OPERATOR(B, ...)`等,这将会导致单元测试出错。
- 如果Op没有实现CUDA Kernel,请不要创建空的`*_op.cu`,这将会导致单元测试出错。
- 如果多个Op依赖一些共用的函数,可以创建非`*_op.*`格式的文件来存放,如`gather.h`文件。
### PADDLE_ENFORCE使用注意
实现Op时检查数据的合法性需要使用PADDLE_ENFORCE以及PADDLE_ENFORCE_EQ等宏定义,基本格式如下:
```
PADDLE_ENFORCE(表达式, 错误提示信息)
PADDLE_ENFORCE_EQ(比较对象A, 比较对象B, 错误提示信息)
```
如果表达式为真,或者比较对象A=B,则检查通过,否则会终止程序运行,向用户反馈相应的错误提示信息。
为了确保提示友好易懂,开发者需要注意其使用方法。
#### 总体原则
任何使用了PADDLE_ENFORCE与PADDLE_ENFORCE_XX检查的地方,必须有详略得当的备注解释!<font color="#FF0000">**错误提示信息不能为空!**</font>
#### 提示信息书写标准
1. [required] 哪里错了?为什么错了?
- 例如:`ValueError: Mismatched label shape`
2. [optional] 期望的输入是什么样的?实际的输入是怎样的?
- 例如:`Expected labels dimension=1. Received 4.`
3. [optional] 能否给出修改意见?
    - 例如:`Suggested Fix: If your classifier expects one-hot encoding label, check your n_classes argument to the estimator and/or the shape of your label. Otherwise, check the shape of your label.`
如果并非必要或者简洁的描述即可表达清楚以上要点,根据情况书写亦可。
#### FAQ 典型问题
1. 无报错信息或报错信息过于简单,不能给用户提供有效的提示!
问题示例1 :未写提示信息
```
PADDLE_ENFORCE(ctx->HasInput("X"), "");
```
问题示例2 :提示信息过于简单
```
PADDLE_ENFORCE(i != nullptr, "i must be set"); // i是什么?
```
2. 在报错信息中使用开发人员定义的变量缩写,不易理解!
问题示例:
```
PADDLE_ENFORCE(forward_pd != nullptr,
"Fail to find eltwise_fwd_pd in device context"); //eltwise_fwd_pd用户可能看不懂
```
3. OP内部调用非法接口:Op内部如果出现Output = ShareDataWith(Input)
问题示例:
```cpp
auto *out = ctx.Output<framework::LoDTensor>("Out");
auto *in = ctx.Input<framework::LoDTensor>("X");
out->ShareDataWith(*in);
```
Op内部如果出现Output = ShareDataWith(Input),相当于operator图中有一条隐藏边,连接了Input和Output,这条边无法在图分析中表达,会引发基于图优化的错误。
4. OP实现的性能实践
如果调用了Eigen的broadcast、chop等操作,性能可能比手写CUDA kernel差几倍以上。此时CPU的实现可以复用Eigen,GPU实现则建议手写CUDA kernel。
#### OP InferShape检查提示信息特别说明
- 检查输入输出变量,请统一遵循以下格式
`Input(变量名) of OP名 operator should not be null.`
正确示例:
```
PADDLE_ENFORCE(ctx->HasInput("Input"),
"Input(Input) of LSTMP operator should not be null.");
```
- 反向Op的输入输出检查,要写明反向Op的名字
正确示例:
```
PADDLE_ENFORCE(ctx->HasInput("X"),
"Input(X) of LoDResetGrad operator should not be null.");
```
# How to write a new operator
<a name="Background"></a>
## Background
Here are the base types needed. For details, please refer to the design docs.
- `class OpProtoAndCheckerMaker`: Describes an Operator's input, output, attributes and description, mainly used to interface with Python API.
- `framework::OperatorBase`: Operator (Op) base class.
- `framework::OpKernel`: Base class for Op computation kernel.
- `framework::OperatorWithKernel`: Inherited from OperatorBase, describing an operator with computation kernels.
Operators can be categorized into two groups: operator with kernel(s) and operator without kernel(s). An operator with kernel(s) inherits from `OperatorWithKernel` while the one without kernel(s) inherits from `OperatorBase`. This tutorial focuses on implementing operators with kernels. In short, an operator includes the following information:
<table>
<thead>
<tr>
<th>Information</th>
<th> Where is it defined</th>
</tr>
</thead>
<tbody>
<tr>
<td>OpProtoMake definition </td>
<td> `.cc` files. A backward Op does not need an OpProtoMake interface. </td>
</tr>
<tr>
<td>Op definition </td>
<td> `.cc` files</td>
</tr>
<tr>
<td>Kernel implementation </td>
<td> The kernel methods shared between CPU and CUDA are defined in `.h` files. CPU-specific kernels live in `.cc` files, while CUDA-specific kernels are implemented in `.cu` files.</td>
</tr>
<tr>
<td>Registering the Op </td>
<td> Ops are registered in `.cc` files; For Kernel registration, `.cc` files contain the CPU implementation, while `.cu` files contain the CUDA implementation.</td>
</tr>
</tbody>
</table>
New Operator implementations are added to the list [paddle/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators), with file names in the format `*_op.h` (if applicable), `*_op.cc`, `*_op.cu` (if applicable). **The system will use the naming scheme to automatically build operators and their corresponding Python extensions.**
Let's take matrix multiplication operator, [MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc), as an example to introduce the writing of an Operator with Kernel.
<a name="Implementing C++ Types"></a>
## Implementing C++ Types
<a name="Defining ProtoMaker"></a>
### Defining ProtoMaker
Matrix Multiplication can be written as $Out = X * Y$, meaning that the operation consists of two inputs and one output.
First, define `ProtoMaker` to describe the Operator's input, output, and additional comments:
```cpp
class MulOpMaker : public framework::OpProtoAndCheckerMaker {
public:
MulOpMaker(OpProto *proto, OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "(Tensor), 2D tensor of size (M x K)");
AddInput("Y", "(Tensor), 2D tensor of size (K x N)");
AddOutput("Out", "(Tensor), 2D tensor of size (M x N)");
AddComment(R"DOC(
Two Element Mul Operator.
The equation is: Out = X * Y
)DOC");
}
};
```
[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L76-L127) is inherited from `framework::OpProtoAndCheckerMaker`, consisting of 2 variables in the constructor:
- `framework::OpProto` stores Operator input and variable attribute, used for generating Python API interfaces.
- `framework::OpAttrChecker` is used to validate variable attributes.
The constructor utilizes `AddInput` to add input parameter, `AddOutput` to add output parameter, and `AddComment` to add comments for the Op, so that the corresponding information will be added to `OpProto`.
The code above adds two inputs `X` and `Y` to `MulOp`, an output `Out`, and their corresponding descriptions. Names are given in accordance to Paddle's [naming convention](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/name_convention.md).
An additional example [`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/scale_op.cc#L38-L55) is implemented as follows:
```cpp
template <typename AttrType>
class ScaleOpMaker : public framework::OpProtoAndCheckerMaker {
public:
ScaleOpMaker(OpProto *proto, OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "(Tensor) Input tensor of scale operator.");
AddOutput("Out", "(Tensor) Output tensor of scale operator.");
AddComment(R"DOC(
Scale operator
$$Out = scale*X$$
)DOC");
AddAttr<AttrType>("scale",
"(float, default 1.0)"
"The scaling factor of the scale operator.")
.SetDefault(1.0);
}
};
```
Note that `AddAttr<AttrType>("scale", "...").SetDefault(1.0);` adds the `scale` constant as an attribute, and sets its default value to 1.0.
<a name="Defining the GradProtoMaker class"></a>
### Defining the GradProtoMaker class
Each Op must have a corresponding GradProtoMaker. If GradProtoMaker corresponding to the forward Op is not customized, Fluid provides DefaultGradProtoMaker. The default registration will use all input and output, including Input, Output, Output@Grad and so on. Using unnecessary variables will cause waste of memory.
The following example defines ScaleOp's GradProtoMaker.
```cpp
class ScaleGradMaker : public framework::SingleGradOpDescMaker {
public:
using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
std::unique_ptr<framework::OpDesc> Apply() const override {
auto *grad_op = new framework::OpDesc();
grad_op->SetType("scale");
grad_op->SetInput("X", OutputGrad("Out"));
grad_op->SetOutput("Out", InputGrad("X"));
grad_op->SetAttr("scale", GetAttr("scale"));
return std::unique_ptr<framework::OpDesc>(grad_op);
}
};
```
<a name="Defining Operator"></a>
### Defining Operator
The following code defines the interface for MulOp:
```cpp
class MulOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
protected:
void InferShape(const framework::InferShapeContext &ctx) const override {
    // never use Input<Tensor> or Output<Tensor> if you want to get a LoDTensor.
auto dim0 = ctx.Input<LoDTensor>("X")->dims();
auto dim1 = ctx.Input<LoDTensor>("Y")->dims();
PADDLE_ENFORCE_EQ(dim0.size(), 2,
"input X(%s) should be a tensor with 2 dims, a matrix",
ctx.op_.Input("X"));
PADDLE_ENFORCE_EQ(dim1.size(), 2,
"input Y(%s) should be a tensor with 2 dims, a matrix",
ctx.op_.Input("Y"));
PADDLE_ENFORCE_EQ(
dim0[1], dim1[0],
"First matrix's width must be equal with second matrix's height.");
ctx.Output<LoDTensor>("Out")->Resize({dim0[0], dim1[1]});
}
};
```
[`MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L24) is inherited from `OperatorWithKernel`. Its `public` member
```cpp
using framework::OperatorWithKernel::OperatorWithKernel;
```
expresses an operator constructor using base class `OperatorWithKernel`, alternatively written as
```cpp
MulOp(const std::string &type, const framework::VariableNameMap &inputs,
const framework::VariableNameMap &outputs,
const framework::AttributeMap &attrs)
: OperatorWithKernel(type, inputs, outputs, attrs) {}
```
The `InferShape` interface needs to be re-written. `InferShape` is a const method and cannot modify the Op's member variables. Its constant parameter `const framework::InferShapeContext &ctx` can be used to extract input, output, and attributes. Its functions are
- 1). validate and error out early: it checks input data dimensions and types.
- 2). configures the tensor shape in the output.
Usually `OpProtoMaker` and `Op` definitions are written in `.cc` files, which also include the registration methods introduced later.
<a name="Defining OpKernel"></a>
### Defining OpKernel
`MulKernel` is derived from `framework::OpKernel`, which includes the following templates:
- `typename DeviceContext` denotes device context type. When different devices, namely the CPU and the CUDA, share the same kernel, this template needs to be added. If they don't share kernels, this must not be added. An example of a non-sharing kernel is [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.h#L43).
- `typename T` denotes data type, such as `float` or `double`.
`MulKernel` types need to rewrite the interface for `Compute`.
- `Compute` takes one input parameter: `const framework::ExecutionContext& context`.
- Compared with `InferShapeContext`, `ExecutionContext` includes device types, and can similarly extract input, output, and attribute variables.
- `Compute` function implements the computation logics of an `OpKernel`.
The input and output of Op can be obtained by `ExecutionContext::Input<T>()` and `ExecutionContext::Output<T>()` respectively.
**Note:** If the input/output variable type of op is `LoDTensor` (In Fluid, all Tensors are LoDTensor type by default), please write `ExecutionContext::Input<LoDTensor>()` and `ExecutionContext::Output<LoDTensor>()`; do not write `ExecutionContext::Input<Tensor>()` and `ExecutionContext::Output<Tensor>()`. Because if the actual variable type is `SelectedRows`, the `Input<Tensor>()` and `Output<Tensor>()` methods will specialize the `SelectedRows` type to `Tensor`, causing a potential error.
`MulKernel`'s implementation of `Compute` is as follows:
```cpp
template <typename DeviceContext, typename T>
class MulKernel : public framework::OpKernel {
public:
void Compute(const framework::ExecutionContext& context) const override {
auto* X = context.Input<LoDTensor>("X");
auto* Y = context.Input<LoDTensor>("Y");
auto* Z = context.Output<LoDTensor>("Out");
Z->mutable_data<T>(context.GetPlace());
auto& device_context = context.template device_context<DeviceContext>();
math::matmul<DeviceContext, T>(*X, false, *Y, false, 1, Z, 0, device_context);
}
};
```
Note that **different devices (CPU, CUDA) share one Op definition; whether or not they share the same `OpKernel` depends on whether the functions called by `Compute` can support both devices.**
`MulOp`'s CPU and CUDA share the same `Kernel`. A non-sharing `OpKernel` example can be seen in [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.cc).
To ease the writing of `OpKernel` compute, and for reusing code cross-device, [`Eigen-unsupported Tensor`](https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md?fileviewer=file-view-default) module is used to implement `Compute` interface. To learn about how the Eigen library is used in PaddlePaddle, please see [usage document](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/use_eigen_cn.md).
This concludes the forward implementation of an operator. Next its operation and kernel need to be registered in a `.cc` file.
The definition of its corresponding backward operator, if applicable, is similar to that of a forward operator. **Note that a backward operator does not include a `ProtoMaker`**.
<a name="Registering Operator and OpKernel"></a>
### Registering Operator and OpKernel
- In `.cc` files, register forward and backward operator classes and the CPU kernel.
```cpp
namespace ops = paddle::operators;
REGISTER_OPERATOR(mul, ops::MulOp, ops::MulOpMaker,
paddle::framework::DefaultGradOpDescMaker<true>)
REGISTER_OPERATOR(mul_grad, ops::MulGradOp)
REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel<paddle::platform::CPUDeviceContext, float>);
REGISTER_OP_CPU_KERNEL(mul_grad,
ops::MulGradKernel<paddle::platform::CPUDeviceContext, float>);
```
In that code block,
- `REGISTER_OPERATOR` registers the `ops::MulOp` class, with the type named `mul`. Its `ProtoMaker` is `ops::MulOpMaker`. Register `ops::MulOpGrad` as type named `mul_grad`.
- `REGISTER_OP_CPU_KERNEL` registers `ops::MulKernel` class and specializes template parameters as type `paddle::platform::CPUPlace` and `float`, and also registers `ops::MulGradKernel`.
- Registering CUDA Kernel in `.cu` files
- Note that if CUDA Kernel is implemented using the `Eigen unsupported` module, then on top of `.cu`, a macro definition `#define EIGEN_USE_GPU` is needed, such as
```cpp
// if use Eigen unsupported module before include head files
#define EIGEN_USE_GPU
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(mul, ops::MulKernel<paddle::platform::CUDADeviceContext, float>);
REGISTER_OP_CUDA_KERNEL(mul_grad,
ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>);
```
<a name="Compilation"></a>
### Compilation
In folder `build/paddle/fluid/operators`, run the following commands to compile.
```
make mul_op
```
<a name="Python Binding"></a>
## Python Binding
The system will automatically bind the new op to Python and link it to a generated library.
<a name="Unit Tests"></a>
## Unit Tests
Unit tests for an operator include
1. comparing a forward operator's implementations on different devices (CPU, CUDA)
2. comparing a backward operator's implementation on different devices (CPU, CUDA)
3. a gradient test for the backward operator.
Here, we introduce the [unit tests for `MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_mul_op.py).
<a name="Unit Test for Forward Operators"></a>
### Unit Test for Forward Operators
The Op unit test is inherited from `OpTest`. More specific unit tests are done in `TestMulOp`. To test the Operator, you need to:
1. Define input, output, and related property parameters in the `setUp` function.
2. Generate random input data.
3. Implement the same calculation logic as the forward operator in the Python script to get the output, which is to be compared with the output of the forward operator calculation.
4. The backward calculation has been automatically integrated into the test framework and the corresponding interface can be called directly.
```python
import unittest
import numpy as np
from op_test import OpTest
class TestMulOp(OpTest):
def setUp(self):
self.op_type = "mul"
self.inputs = {
'X': np.random.random((32, 84)).astype("float32"),
'Y': np.random.random((84, 100)).astype("float32")
}
self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])}
def test_check_output(self):
self.check_output()
def test_check_grad_normal(self):
self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
def test_check_grad_ingore_x(self):
self.check_grad(
['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
def test_check_grad_ingore_y(self):
self.check_grad(
['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
```
The code above first loads required packages. In addition, we have
- `self.op_type = "mul"` defines the type, which must be identical to the operator's registered type.
- `self.inputs` defines input, with type `numpy.array` and initializes it.
- `self.outputs` defines output and completes the same operator computation in the Python script, and returns its result from the Python script.
<a name="Unit test for backward operators"></a>
### Unit Test for Backward Operators
In the backward operator test:
- `check_grad` is called in `test_check_grad_normal` to use numerical methods to detect gradient correctness and stability.
- The first parameter `["X", "Y"]` : specifies gradient check for the input variables `X`, `Y`.
- The second parameter `"Out"` : specifies the final output target variable `Out` of the forward network.
- The third parameter `max_relative_error`: specifies the maximum error value that can be tolerated when checking gradients.
- The `test_check_grad_ingore_x` and `test_check_grad_ingore_y` branches are used to test cases where only one input gradient needs to be calculated.
<a name="Compiling and Running"></a>
### Compiling and Running
Any new unit testing file of the format `test_*.py` added to the directory `python/paddle/fluid/tests/unittests/` is automatically added to the project to compile.
Note that **running unit tests requires compiling the entire project** and requires compiling with the flag `WITH_TESTING` on, i.e. `cmake paddle_dir -DWITH_TESTING=ON`.
After successfully compiling the project, run the following command to run unit tests:
```bash
make test ARGS="-R test_mul_op -V"
```
Or,
```bash
ctest -R test_mul_op
```
<a name="Remarks"></a>
## Remarks
- The type with which an operator is registered needs to be identical to the Op's name. Registering `REGISTER_OPERATOR(B, ...)` in `A_op.cc` will cause unit testing failures.
- If the operator does not implement a CUDA kernel, please refrain from creating an empty `*_op.cu` file, or else unit tests will fail.
- If multiple operators rely on some shared methods, a file NOT named `*_op.*` can be created to store them, such as `gather.h`.
<a name="PADDLE_ENFORCE Usage Note"></a>
### PADDLE_ENFORCE Usage Note
To check the validity of data when implementing Op, you need to use macro definitions such as PADDLE_ENFORCE and PADDLE_ENFORCE_EQ. The basic format is as follows:
```
PADDLE_ENFORCE (expression, error message)
PADDLE_ENFORCE_EQ (comparison object A, comparison object B, error message)
```
If the expression is true, or the comparison object A=B, the check will be passed, otherwise the program will be terminated and the corresponding error message will be fed back to the user.
In order to ensure that the feedbacks are user-friendly and easy to understand, developers need to pay attention to how to use them.
<a name="General Principles"></a>
#### General Principles
Any place where PADDLE_ENFORCE and PADDLE_ENFORCE_EQ are used must come with a properly detailed explanation! **The error message** can't be empty!
<a name="Error Message Standard"></a>
#### Error Message Standard
1. [required] Where does it go wrong? Why is it wrong?
- For example: `ValueError: Mismatched label shape`
2. [optional] What is the expected input? What is the actual input?
- For example: `Expected labels dimension=1. Received 4.`
3. [optional] Can you come up with a suggestion?
    - For example: `Suggested Fix: If your classifier expects one-hot encoding label, check your n_classes argument to the estimator and/or the shape of your label. Otherwise, check the shape of your label.`
If it is not necessary or concise description is enough to clearly express the above points, just write based on actual needs.
<a name="Typical Problems"></a>
#### Typical Problems
1. No error message exists, or the error message is too short to give the user an effective hint.
Problem example 1: Absent message
```
PADDLE_ENFORCE(ctx->HasInput("X"), "");
```
Problem example 2: The prompt message is too short
```
PADDLE_ENFORCE(i != nullptr, "i must be set"); // What is i?
```
2. Using developer-defined variable abbreviations in error messages makes them hard to understand.
Example of the problem:
```
PADDLE_ENFORCE(forward_pd != nullptr,
"Fail to find eltwise_fwd_pd in device context"); // users may not understand eltwise_fwd_pd
```
3. The OP internally calls an illegal interface: Output = ShareDataWith(Input) appears inside the Op.
Example of the problem:
```cpp
auto *out = ctx.Output<framework::LoDTensor>("Out");
auto *in = ctx.Input<framework::LoDTensor>("X");
out->ShareDataWith(*in);
```
If Output = ShareDataWith(Input) appears inside an Op, it effectively introduces a hidden edge in the operator graph connecting Input and Output. This edge cannot be expressed in graph analysis, causing errors in graph-based optimization.
4. Performance of the OP implementation. If an Op calls Eigen's broadcast, chop and similar operations, its performance can be several times worse than a handwritten CUDA kernel. In that case, the CPU implementation can reuse Eigen, while the GPU implementation should provide its own CUDA kernel.
<a name="Special instructions for OP InferShape check message"></a>
#### Special Instructions for OP InferShape Check Message
- Check input and output variables, please follow the following format
`Input(variable name) of OP name operator should not be null.`
The correct example:
```
PADDLE_ENFORCE(ctx->HasInput("Input"),
"Input(Input) of LSTMP operator should not be null.");
```
- Backward Op input and output check, to write the name of the backward Op
The correct example:
```
PADDLE_ENFORCE(ctx->HasInput("X"),
"Input(X) of LoDResetGrad operator should not be null.");
```
# 如何写新的Python OP
PaddlePaddle Fluid通过 `py_func` 接口支持在Python端自定义OP。py_func的设计原理在于Paddle中的LoDTensor可以与numpy数组方便地互相转换,从而可以使用Python中的numpy API来自定义一个Python OP。
## py_func接口概述
`py_func` 具体接口为:
```Python
def py_func(func, x, out, backward_func=None, skip_vars_in_backward_input=None):
pass
```
其中,
- `x` 是Python Op的输入变量,可以是单个 `Variable` | `tuple[Variable]` | `list[Variable]` 。多个Variable以tuple[Variable]或list[Variable]的形式传入,其中Variable为LoDTensor或Tensor。
- `out` 是Python Op的输出变量,可以是单个 `Variable` | `tuple[Variable]` | `list[Variable]` 。其中Variable既可以为LoDTensor或Tensor,也可以为numpy数组。
- `func` 是Python Op的前向函数。在运行网络前向时,框架会调用 `out = func(*x)` ,根据前向输入 `x` 和前向函数 `func` 计算前向输出 `out`。在 `func` 中建议先主动将LoDTensor转换为numpy数组,方便灵活地使用numpy相关的操作,如果未转换成numpy,则某些操作可能无法兼容。
- `backward_func` 是Python Op的反向函数。若 `backward_func` 为 `None` ,则该Python Op没有反向计算逻辑;
  若 `backward_func` 不为 `None`,则框架会在运行网络反向时调用 `backward_func` 计算前向输入 `x` 的梯度。
- `skip_vars_in_backward_input` 为反向函数 `backward_func` 中不需要的输入,可以是单个 `Variable` | `tuple[Variable]` | `list[Variable]`
## 如何使用py_func编写Python Op
以下以tanh为例,介绍如何利用 `py_func` 编写Python Op。
- 第一步:定义前向函数和反向函数
前向函数和反向函数均由Python编写,可以方便地使用Python与numpy中的相关API来实现一个自定义的OP。
若前向函数的输入为 `x_1`, `x_2`, ..., `x_n` ,输出为`y_1`, `y_2`, ..., `y_m`,则前向函数的定义格式为:
```Python
def forward_func(x_1, x_2, ..., x_n):
...
return y_1, y_2, ..., y_m
```
默认情况下,反向函数的输入参数顺序为:所有前向输入变量 + 所有前向输出变量 + 所有前向输出变量的梯度,因此对应的反向函数的定义格式为:
```Python
def backward_func(x_1, x_2, ..., x_n, y_1, y_2, ..., y_m, dy_1, dy_2, ..., dy_m):
...
return dx_1, dx_2, ..., dx_n
```
若反向函数不需要某些前向输入变量或前向输出变量,可设置 `skip_vars_in_backward_input` 进行排除(步骤三中会叙述具体的排除方法)。
注:x_1, ..., x_n为输入的多个LoDTensor,请以tuple(Variable)或list[Variable]的形式在py_func中传入。建议先主动将LoDTensor通过numpy.array转换为numpy数组,否则Python与numpy中的某些操作可能无法直接作用在LoDTensor上。
此处我们利用numpy的相关API完成tanh的前向函数和反向函数编写。下面给出多个前向与反向函数定义的示例:
```Python
import numpy as np
# 前向函数1:模拟tanh激活函数
def tanh(x):
# 可以直接将LodTensor作为np.tanh的输入参数
return np.tanh(x)
# 前向函数2:将两个2-D LoDTensor相加,多个LoDTensor需以list[Variable]或tuple(Variable)形式传入
def element_wise_add(x, y):
# 必须先手动将LodTensor转换为numpy数组,否则无法支持numpy的shape操作
x = np.array(x)
y = np.array(y)
if x.shape != y.shape:
raise AssertionError("the shape of inputs must be the same!")
result = np.zeros(x.shape, dtype='int32')
for i in range(len(x)):
for j in range(len(x[0])):
result[i][j] = x[i][j] + y[i][j]
return result
# 前向函数3:可用于调试正在运行的网络(打印值)
def debug_func(x):
# 可以直接将LodTensor作为print的输入参数
print(x)
# 前向函数1对应的反向函数,默认的输入顺序为:x、out、out的梯度
def tanh_grad(x, y, dy):
# 必须先手动将LodTensor转换为numpy数组,否则"+/-"等操作无法使用
return np.array(dy) * (1 - np.square(np.array(y)))
```
注意,前向函数和反向函数的输入均是 `LoDTensor` 类型,输出可以是Numpy Array或 `LoDTensor`
由于 `LoDTensor` 实现了Python的buffer protocol协议,因此既可通过 `numpy.array` 直接将 `LoDTensor` 转换为numpy Array来进行操作,也可直接将 `LoDTensor` 作为numpy函数的输入参数。但建议先主动转换为numpy Array,这样便可以任意地使用Python与numpy中的所有操作(例如"numpy array的+/-/shape")。
tanh的反向函数不需要前向输入x,因此我们可定义一个不需要前向输入x的反向函数,并在后续通过 `skip_vars_in_backward_input` 进行排除 :
```Python
def tanh_grad_without_x(y, dy):
return np.array(dy) * (1 - np.square(np.array(y)))
```
- 第二步:创建前向输出变量
我们需调用 `Program.current_block().create_var` 创建前向输出变量。在创建前向输出变量时,必须指明变量的名称name、数据类型dtype和维度shape。
```Python
import paddle.fluid as fluid
def create_tmp_var(program, name, dtype, shape):
return program.current_block().create_var(name=name, dtype=dtype, shape=shape)
in_var = fluid.layers.data(name='input', dtype='float32', shape=[-1, 28, 28])
# 手动创建前向输出变量
out_var = create_tmp_var(fluid.default_main_program(), name='output', dtype='float32', shape=[-1, 28, 28])
```
- 第三步:调用 `py_func` 组建网络
`py_func` 的调用方式为:
```Python
fluid.layers.py_func(func=tanh, x=in_var, out=out_var, backward_func=tanh_grad)
```
若我们不希望在反向函数输入参数中出现前向输入,则可使用 `skip_vars_in_backward_input` 进行排除,简化反向函数的参数列表。
```Python
fluid.layers.py_func(func=tanh, x=in_var, out=out_var, backward_func=tanh_grad_without_x,
skip_vars_in_backward_input=in_var)
```
至此,使用 `py_func` 编写Python Op的步骤结束。我们可以像使用其他Op一样进行网络训练/预测。
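把上述三个步骤串起来,一个最小的可运行示意如下(基于声明式编程模式,使用 `Executor` 执行;数据形状等均为假设):
```python
# 仅为示意:定义前向/反向函数 -> 创建输出变量 -> 调用 py_func -> 用 Executor 运行
import numpy as np
import paddle.fluid as fluid

def tanh(x):
    return np.tanh(np.array(x))

def tanh_grad(x, y, dy):
    return np.array(dy) * (1 - np.square(np.array(y)))

def create_tmp_var(program, name, dtype, shape):
    return program.current_block().create_var(name=name, dtype=dtype, shape=shape)

in_var = fluid.layers.data(name='input', dtype='float32', shape=[-1, 28, 28])
out_var = create_tmp_var(fluid.default_main_program(), 'output', 'float32', [-1, 28, 28])
fluid.layers.py_func(func=tanh, x=in_var, out=out_var, backward_func=tanh_grad)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
x = np.random.random((4, 28, 28)).astype('float32')
res, = exe.run(feed={'input': x}, fetch_list=[out_var])
print(res.shape)  # (4, 28, 28)
```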
## 注意事项
- `py_func` 的前向函数和反向函数内部不应调用 `fluid.layers.xxx` ,因为前向函数和反向函数是在网络运行时调用的,且输入参数均为C++端的 `LoDTensor`
`fluid.layers.xxx` 是在组建网络的阶段调用的,且输入参数为Python端的 `Variable`
- `skip_vars_in_backward_input` 只能跳过前向输入变量和前向输出变量,不能跳过前向输出的梯度。
- 若某个前向输出变量没有梯度,则 `backward_func` 将接收到 `None` 的输入。若某个前向输入变量没有梯度,则我们应在 `backward_func` 中主动返回
`None`
# C++ OP相关注意事项
## Fluid中Op的构建逻辑
### 1.Fluid中Op的构建逻辑
Fluid中所有的Op都继承自`OperatorBase`,且所有的Op都是无状态的,每个Op包含的成员变量只有四个:type、inputs、outputs、attribute。
Op的核心方法是Run,Run方法需要两方面的资源:数据资源和计算资源,这两个资源分别通过`Scope`和`Place`获取。框架内部有一个全局的`DeviceContextPool`,用来记录`Place`和`DeviceContext`之间的对应关系,即每个`Place`有且仅有一个`DeviceContext`与之对应,`DeviceContext`中存放了当前设备的计算资源。比如对于GPU,这些资源包括`cudnn_handle`、`cublas_handle`、`stream`等,**Op内部所有的计算(数据拷贝和CUDA Kernel等)都必须在`DeviceContext`中进行**。
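例如,框架内部大致通过如下方式从全局`DeviceContextPool`中取出某个`Place`对应的`DeviceContext`(仅为示意,`place`为假设已创建好的`Place`对象):
```cpp
// 仅为示意:每个 Place 唯一对应一个 DeviceContext,计算资源都挂在 DeviceContext 上
platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance();
platform::DeviceContext* dev_ctx = pool.Get(place);  // place 可以是 CPUPlace 或 CUDAPlace
```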
Fluid框架的设计理念是可以在多种设备及第三方库上运行,有些Op的实现可能会因为设备或者第三方库的不同而不同。为此,Fluid引入了OpKernel的方式,即一个Op可以有多个OpKernel,这类Op继承自`OperatorWithKernel`,这类Op的代表是conv_op,conv_op的OpKernel有:`GemmConvKernel`、`CUDNNConvOpKernel`、`ConvMKLDNNOpKernel`,且每个OpKernel都有double和float两种数据类型。不需要OpKernel的代表有`WhileOp`等。
Operator继承关系图:
![op_inheritance_relation_diagram](./op_inheritance_relation_diagram.png)
进一步了解可参考:[multi_devices](https://github.com/PaddlePaddle/FluidDoc/tree/develop/doc/fluid/design/multi_devices)、[scope](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/concepts/scope.md)、[Developer's_Guide_to_Paddle_Fluid](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/getstarted/Developer's_Guide_to_Paddle_Fluid.md)
### 2.Op的注册逻辑
每个Operator的注册项包括:
```C++
OpCreator creator_;
GradOpMakerFN grad_op_maker_;
proto::OpProto* proto_{nullptr};
OpAttrChecker* checker_{nullptr};
InferVarTypeFN infer_var_type_;
InferShapeFN infer_shape_;
```
<table>
<thead>
<tr>
<th>注册项</th>
<th>类型</th>
<th>说明</th>
<th>调用</th>
</tr>
</thead>
<tbody>
<tr>
<td>proto::OpProto </td>
<td>Class </td>
<td>存放Op的输入/输出/属性/Op类型 </td>
<td>编译时调用 </td>
</tr>
<tr>
<td>GradOpMakerFN </td>
<td>Functor </td>
<td>返回当前Op对应的反向Op的一组OpDesc,因为正向Op的反向可能由多个Op构成 </td>
<td>编译时调用 </td>
</tr>
<tr>
<td>OpAttrChecker </td>
<td>Class </td>
<td>对Op的attr进行check </td>
<td>编译时调用</td>
</tr>
<tr>
<td>InferVarTypeFN </td>
<td>Functor </td>
<td>用于推断输出Var的Type,比如是LoDTensor还是SelectedRows,或者其他 </td>
<td>编译时调用 </td>
</tr>
<tr>
<td>InferShapeFN </td>
<td>Functor </td>
<td>用于推断Output的Shape </td>
<td>分为编译时和运行时,编译时是在Python端调用;如果Op继承自OperatorWithKernel,运行时是在op.run中调用 </td>
</tr>
<tr>
<td>OpCreator </td>
<td>Functor </td>
<td>每次调用都会创建一个新的OperatorBase </td>
<td>运行时调用 </td>
</tr>
</tbody>
</table>
通常Op注册时需要调用REGISTER_OPERATOR,即:
```
REGISTER_OPERATOR(op_type,
OperatorBase,
op_maker_and_checker_maker,
op_grad_opmaker,
op_infer_var_shape,
op_infer_var_type)
```
**注意:**
1. 对于所有Op,前三个参数是必须的,op_type指明op的名字,OperatorBase是该Op的对象,op_maker_and_checker_maker是op的maker以及Op中attr的checker。
2. 如果该Op有反向,则必须要有op_grad_opmaker,因为backward时会根据正向的Op获取反向Op的Maker。
3. 框架提供了一个默认的op_grad_opmaker:`DefaultGradOpDescMaker`,这个Maker会将前向Op的输入和输出都作为反向Op的输入,将前向Op的输入的梯度作为反向Op的输出,并将前向Op的属性拷贝过来。**注意:DefaultGradOpDescMaker会将前向Op的所有输入输出都做反向Op的输入,即使这个输入是没有必要的,这将会导致无法对没有用到的变量做内存优化**
4. 框架没有提供默认的op_infer_var_shape方法。如果该Op是无OpKernel的,通常需要用户添加对应的op_infer_var_shape方法;如果该Op是有OpKernel的,需要实现`OperatorWithKernel`中的`InferShape`方法,此时不需要提供op_infer_var_shape方法。具体实现可参考[while_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/controlflow/while_op.cc)[conv_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/conv_op.cc)
5. 框架没有提供默认的op_infer_var_type方法,用户需要根据实际情况添加op_infer_var_type。严格来说每个Op都应该注册一个InferVarType,op_infer_var_type根据输入的Var的type和dtype推断输出Var的type和dtype。**注意:在Python端的LayerHelper中create_variable_for_type_inference操作返回的Variable里面是LoDTensor,C++端的InferVarType可以修改`Variable`的type和dtype**
更多内容请参考: [如何写新的Op](new_op.html)
## 写Op注意事项
### 1.Op可以支持输入输出类型
Fluid的Op的输入输出都是`Variable`,从设计上讲,`Variable`中可以存放任意类型,Op的输入输出`Variable`可能是任意类型,通常情况下`Variable`中存放的是`LoDTensor`或`SelectedRows`。
**注意:**
- 代码中经常出现`context.Input<Tensor>("Input")`,并不表示"Input"的`Variable`是`Tensor`,而是从"Input"的`Variable`的`LoDTensor`中获取`Tensor`。如果"Input"的`Variable`是`SelectedRows`,则会报错。
- 如果”Input”是`SelectedRows`,`context->GetInputDim("Input")`返回的是`var->Get<SelectedRows>().GetCompleteDims()`,而不是`SelectedRows`中`Tensor`的Dim。
### 2.在Op内部不能对输入的数据做任何的改写
在Op内部绝不允许对输入数据做任何改写,因为可能存在其他Op需要读这个数据。
### 3.OpKernel需要注册的数据类型
目前要求所有OpKernel都要注册double和float数据类型。
### 4.GetExpectedKernelType方法重写
GetExpectedKernelType方法是OperatorWithKernel类中用于获取指定设备(例如CPU,GPU)上指定数据类型(例如double,float)的OpKernel的方法。该方法通过获取输入变量内部的Tensor数据类型得知需要的Kernel数据类型,但是由于Tensor在此处可能尚未被初始化,所以在该方法内使用输入变量时需要进行必要的初始化检查。在新增含Kernel的Op的时候,关于该方法的重写需要注意以下两点。
#### 4.1 仅在必要时重写此方法
基类OperatorWithKernel中的GetExpectedKernelType方法对于派生类Op的所有输入变量进行了完备的初始化检查,建议在新增的Op中直接使用基类的此方法,例如:
- [MeanOp](https://github.com/PaddlePaddle/Paddle/blob/3556514e971bdbb98fdf0f556371c527f4dfa98c/paddle/fluid/operators/mean_op.cc#L39):该Op的所有输入变量在Run之前应该全部被初始化,初始化检查是必要且合理的
但是在一些情况下,直接使用基类的GetExpectedKernelType方法无法满足需求,则需要对该方法进行重写,具体情况及示例如下:
1. OP的输入有多个,且数据类型不同,例如 [AccuracyOp](https://github.com/PaddlePaddle/Paddle/blob/370f0345b6d35a513c8e64d519a0edfc96b9276c/paddle/fluid/operators/metrics/accuracy_op.cc#L80),需要重写GetExpectedKernelType方法,指定用某一输入变量获取kernel类型
2. Op包含Dispensable的输入变量,该类输入变量是可选的,当用户未输入时,该类变量未被初始化属于合理情况,例如 [ConvOp](https://github.com/PaddlePaddle/Paddle/blob/250e72d254ccbe3521c29aa2801a1cb15b75ea73/paddle/fluid/operators/conv_op.cc#L206),存在Bias等可选的输入变量,需要重写GetExpectedKernelType方法,指定用必须提供的输入变量获取kernel类型
3. Op的部分输入变量即使未被初始化也属于合理情况,例如 [ConcatOp](https://github.com/PaddlePaddle/Paddle/blob/250e72d254ccbe3521c29aa2801a1cb15b75ea73/paddle/fluid/operators/concat_op.cc#L90),输入变量X中有个Tensor需要连接,其中可能包含未被初始化的Tensor,需要重写GetExpectedKernelType方法,使用输入变量X获取kernel的过程中,合理忽略掉部分Tensor为空的情况
4. OP的Kernel类型与输入变量无关(可能由其他参数指定),例如 [FillOp](https://github.com/PaddlePaddle/Paddle/blob/efbdad059634bef022d4a3f5b00aef6ef8e88ed6/paddle/fluid/operators/one_hot_op.cc#L72),该Op没有输入,Kernel类型通过Op的dtype参数指定,因此需要重写GetExpectedKernelType方法,用参数指定的数据类型获取kernel类型
5. Op Kernel的部分参数在使用某些库时,需要指定为相应的值,因此需要重写GetExpectedKernelType方法,覆盖默认参数
- 使用CUDNN库:需要指定OpKernel的LibraryType为kCUDNN,例如 [AffineGridOp](https://github.com/PaddlePaddle/Paddle/blob/370f0345b6d35a513c8e64d519a0edfc96b9276c/paddle/fluid/operators/affine_grid_op.cc#L78)
- 使用MKLDNN库:需要指定OpKernel的LibraryType和DataLayout为kMKLDNN [MulOp](https://github.com/PaddlePaddle/Paddle/blob/250e72d254ccbe3521c29aa2801a1cb15b75ea73/paddle/fluid/operators/mul_op.cc#L89)
#### 4.2 重写此方法时需要对输入变量进行初始化检查
在需要重写GetExpectedKernelType方法时,一般会根据某一输入变量获取Kernel的数据类型,此时请使用`OperatorWithKernel::IndicateVarDataType`接口获取变量的dtype,该方法对指定的输入变量进行了必要的初始化检查,详见[Paddle PR #20044](https://github.com/PaddlePaddle/Paddle/pull/20044),实现示例如下:
```
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override {
return framework::OpKernelType(
OperatorWithKernel::IndicateVarDataType(ctx, "X"), ctx.GetPlace());
}
```
如果未使用带有初始化检查的方法,直接使用了`Tensor->type()`,可能会导致报出`holder_ should not be null. Tensor not initialized yet when Tensor::type()`的错误,例如[Paddle issue #19522](https://github.com/PaddlePaddle/Paddle/issues/19522) ,用户仅凭该错误信息将无法得知具体出错的Op,不利于调试。
### 5.Op兼容性问题
对Op的修改需要考虑兼容性问题,要保证Op修改之后,之前的模型都能够正常加载及运行,即新版本的Paddle预测库能成功加载运行旧版本训练的模型。<font color="#FF0000">**所以,需要保证Op的Input、Output和Attribute不能被修改(文档除外)或删除,可以新增Input、Output和Attribute,但是新增的Input,Output必须设置AsDispensable,新增的Attribute必须设置默认值。更多详细内容请参考[OP修改规范:Input/Output/Attribute只能做兼容修改](https://github.com/PaddlePaddle/Paddle/wiki/OP-Input-Output-Attribute-Compatibility-Modification)**</font>
### 6.ShareDataWith的调用
ShareDataWith的功能是使两个Tensor共享底层buffer,在调用这个操作的时候需要特别注意,在Op内部不能将ShareDataWith作用在Op的输出上,即Op输出的Tensor必须是Malloc出来的。
### 7.稀疏梯度参数更新方法
目前稀疏梯度在做更新的时候会先对梯度做merge,即对相同参数的梯度做累加,然后做参数以及附加参数(如velocity)的更新。
### 8.显存优化
#### 8.1 为可原位计算的Op注册Inplace
有些Op的计算逻辑中,输出可以复用输入的显存空间,也可称为原位计算。例如[`reshape_op`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/reshape_op.cc)中,输出`Out`可以复用输入`X`的显存空间,因为该Op的计算逻辑不会改变`X`的实际数据,只是修改它的shape,输出和输入复用同一块显存空间不影响结果。对于这类OP,可以注册`Inplace`,从而让框架在运行时自动地进行显存优化。
fluid提供了`DECLARE_INPLACE_OP_INFERER`宏用于注册`Inplace`,该宏第一个参数是一个类名,如`ReshapeOpInplaceInToOut`;第二个参数是一对复用的输入输出,以`{"X", "Out"}`的形式给出。在`REGISTER_OPERATOR`时,
可以将类名传入,从而为该Op注册`Inplace`。
```
DECLARE_INPLACE_OP_INFERER(ReshapeOpInplaceInToOut, {"X", "Out"});
REGISTER_OPERATOR(
reshape, ops::ReshapeOp, ops::ReshapeOpMaker,
paddle::framework::DefaultGradOpMaker<paddle::framework::OpDesc, true>,
paddle::framework::DefaultGradOpMaker<paddle::imperative::OpBase, true>,
ops::ReshapeOpInplaceInToOut);
```
#### 8.2 减少OP中的无关变量
通常反向Op会依赖于前向Op的某些输入(Input)、输出(Output),以供反向Op计算使用。但有些情况下,反向Op不需要前向Op的所有输入和输出;有些情况下,反向Op只需要前向Op的部分输入和输出;有些情况下,反向Op只需要使用前向Op中输入和输出变量的Shape和LoD信息。若Op开发者在注册反向Op时,将不必要的前向Op输入和输出作为反向Op的输入,会导致这部分显存无法被框架现有的显存优化策略优化,从而导致模型显存占用过高。
所以在写注册反向Op时需要注意以下几点:
- Fluid提供的`DefaultGradOpMaker`,默认会将前向op的所有输入(`Input`)、输出(`Output`)以及输出变量所对应的梯度(`Output@Grad`)作为反向Op的输入,将前向Op输入所对应的梯度(`Input@Grad`)作为反向Op的输出。所以在使用`DefaultGradOpMaker`时需要考虑是否有些变量在计算中不被用到。
- 如果`DefaultGradOpMaker`不能够满足需求,需要用户自己手动构建`GradOpMaker`,具体实现请参考[相关文档](new_op.html#gradopmaker);
- 如果有些反向Op需要依赖前向Op的输入或输出变量的Shape或LoD,但不依赖于变量中Tensor的Buffer,且不能根据其他变量推断出该Shape和LoD,则可以通过`DECLARE_NO_NEED_BUFFER_VARS_INFERER`接口对该变量(以下称该变量为`X`)在反向Op中注册`NoNeedBufferVars`。**一旦注册了`NoNeedBufferVars`,反向op中就不能读写该变量对应的Tensor中的buffer,只能调用Tensor的dims()和lod()方法,同时,反向Op中的`GetExpectedKernelType()`必须要重写,并且`GetExpectedKernelType()`中不能访问`X`变量中Tensor的type()方法**。比如在`SliceOpGrad`中只会用到`Input`中变量的Shape信息,所以需要为`Input`在`SliceOpGrad`上进行注册:
```
namespace paddle {
namespace operators {
// ...
class SliceOpGrad : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
// ...
}
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override {
// Note: don't get data type from ctx.Input<framework::Tensor>("Input");
auto dtype = ctx.Input<framework::Tensor>(framework::GradVarName("Out"))->type();
return framework::OpKernelType( dtype, ctx.GetPlace());
}
};
template <typename T>
class SliceOpGradMaker : public framework::SingleGradOpMaker<T> {
public:
using framework::SingleGradOpMaker<T>::SingleGradOpMaker;
protected:
void Apply(GradOpPtr<T> bind) const override {
bind->SetInput("Input", this->Input("Input"));
if (this->HasInput("StartsTensor")) {
bind->SetInput("StartsTensor", this->Input("StartsTensor"));
}
if (this->HasInput("EndsTensor")) {
bind->SetInput("EndsTensor", this->Input("EndsTensor"));
}
if (this->HasInput("StartsTensorList")) {
bind->SetInput("StartsTensorList", this->Input("StartsTensorList"));
}
if (this->HasInput("EndsTensorList")) {
bind->SetInput("EndsTensorList", this->Input("EndsTensorList"));
}
bind->SetInput(framework::GradVarName("Out"), this->OutputGrad("Out"));
bind->SetOutput(framework::GradVarName("Input"), this->InputGrad("Input"));
bind->SetAttrMap(this->Attrs());
bind->SetType("slice_grad");
}
};
DECLARE_NO_NEED_BUFFER_VARS_INFERER(SliceOpGradNoNeedBufferVarsInference,
"Input");
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OPERATOR(slice, ops::SliceOp, ops::SliceOpMaker,
ops::SliceOpGradMaker<paddle::framework::OpDesc>,
ops::SliceOpGradMaker<paddle::imperative::OpBase>);
REGISTER_OPERATOR(slice_grad, ops::SliceOpGrad,
ops::SliceDoubleOpGradMaker<paddle::framework::OpDesc>,
ops::SliceDoubleOpGradMaker<paddle::imperative::OpBase>,
ops::SliceOpGradNoNeedBufferVarsInference);
```
### 9.混合设备调用
由于GPU是异步执行的,当CPU调用返回之后,GPU端可能还没有真正的执行,所以如果在Op中创建了GPU运行时需要用到的临时变量,当GPU开始运行的时候,该临时变量可能在CPU端已经被释放,这样可能会导致GPU计算出错。
关于GPU中的一些同步和异步操作:
```
The following device operations are asynchronous with respect to the host:
Kernel launches;
Memory copies within a single device's memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
```
关于cudaMemcpy和cudaMemcpyAsync的注意事项:
- 如果数据传输是从GPU端到非页锁定的CPU端,数据传输将是同步的,即使调用的是异步拷贝操作。
- 如果数据传输是从CPU端到CPU端,数据传输将是同步的,即使调用的是异步拷贝操作。
更多内容可参考:[Asynchronous Concurrent Execution](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#asynchronous-concurrent-execution)、[API synchronization behavior](https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior)
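下面是一段只依赖CUDA Runtime的小示例(与Paddle框架无关,函数名与数据规模均为假设):只有页锁定(pinned)内存上的GPU到CPU拷贝才可能真正异步,且在读取数据前必须对相应stream做同步。
```cpp
#include <cuda_runtime.h>

void CopyToHostAsync(const float* dev_ptr, size_t n, cudaStream_t stream) {
  float* host_pinned = nullptr;
  // 分配页锁定内存:只有这种内存上的 Device->Host 拷贝才不会退化为同步
  cudaMallocHost(reinterpret_cast<void**>(&host_pinned), n * sizeof(float));
  cudaMemcpyAsync(host_pinned, dev_ptr, n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);
  // 读取 host_pinned 之前必须同步对应的 stream,否则数据可能尚未就绪
  cudaStreamSynchronize(stream);
  // ... 使用 host_pinned 中的数据 ...
  cudaFreeHost(host_pinned);
}
```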
### 10. LoD 在 Op 内部的传导规范
[LoD](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/concepts/lod_tensor.md) 是 Paddle Fluid 框架用来表示变长序列数据的属性,除了仅支持输入是 padding data 的 Op 外,所有 Op 的实现都要考虑 LoD 的传导问题。
根据 OP 的计算过程中是否用到 LoD,我们可以将涉及到 LoD 传导问题的 OP 分为两类: LoD-Transparent 与 LoD-Based。
<table>
<thead>
<tr>
<th>类型</th>
<th>特点</th>
<th>示例</th>
</tr>
</thead>
<tbody>
<tr>
<td>LoD-Transparent </td>
<td>计算过程不依赖 LoD,输入是否有 LoD 不会影响计算的结果,通常是 position-wise 的计算 </td>
<td>conv2d_op、batch_norm_op、dropout_op 等 </td>
</tr>
<tr>
<td>LoD-Based </td>
<td>计算以序列为单位, 计算过程依赖 LoD </td>
<td> lstm_op、gru_op、sequence_ops 等 </td>
</tr>
</tbody>
</table>
这两类 OP 的 LoD 传导需要考虑前向和反向两个过程。
#### 前向传导
在前向传导过程,与输入的 LoD 相比较,Op 输出的 LoD 可能出现不变、改变和消失这三种情况:
- 不变:适用于所有的 LoD-Transparent OP 与部分的 LoD-Based OP。可以在`InferShape` 中调用 `ShareLoD()` 直接将输入 Var 的 LoD 共享给输出 Var, 可参考 [lstm_op](https://github.com/PaddlePaddle/Paddle/blob/a88a1faa48a42a8c3737deb0f05da968d200a7d3/paddle/fluid/operators/lstm_op.cc#L92); 如果有多个输入且都可能存在 LoD 的情况,通常默认共享第一个输入, 例如 [elementwise_ops forward](https://github.com/PaddlePaddle/Paddle/blob/5d6a1fcf16bcb48d2e66306b27d9994d9b07433c/paddle/fluid/operators/elementwise/elementwise_op.h#L69)
- 改变:适用于部分 LoD-Based OP。在实现 OpKernel 时需考虑输出 LoD 的正确计算,真实的 LoD 在前向计算结束后才能确定,此时仍需要在`InferShape` 中调用 `ShareLoD()`,以确保CompileTime 时对 LoD Level 做了正确的传导,可参考 [sequence_expand_op](https://github.com/PaddlePaddle/Paddle/blob/565d30950138b9f831caa33904d9016cf53c6c2e/paddle/fluid/operators/sequence_ops/sequence_expand_op.cc)
- 消失:适用于输出不再是序列数据的 LoD-Based OP。此时不用再考虑前向的 LoD 传导问题,可参考 [sequence_pool_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/sequence_ops/sequence_pool_op.cc)
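针对上面“不变”这种最常见的情况,下面给出一个简化示意(类名与输入输出名均为假设):
```cpp
// 仅为示意:LoD-Transparent OP 在 InferShape 中直接共享 LoD
class MyEltwiseOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;

  void InferShape(framework::InferShapeContext* ctx) const override {
    ctx->SetOutputDim("Out", ctx->GetInputDim("X"));
    // 计算不依赖 LoD,把输入 X 的 LoD 原样传导给输出 Out
    ctx->ShareLoD("X", /*->*/ "Out");
  }
};
```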
其它重要的注意事项:
- 实现 LoD-Based OP 时,需要处理好 LoD 传导的边界情况,例如对长度为零的输入的支持,并完善相应的单测,单测 case 覆盖空序列出现在 batch 开头、中间和末尾等位置的情况,可参考 [test_lstm_op.py](https://github.com/PaddlePaddle/Paddle/blob/4292bd8687ababc7737cffbddc0d38ead2138c00/python/paddle/fluid/tests/unittests/test_lstm_op.py#L203-L216)
- 对 LoD Level 有明确要求的 OP,推荐的做法是在 `InferShape` 中即完成 LoD Level的检查,例如 [sequence_pad_op](https://github.com/PaddlePaddle/Paddle/blob/4292bd8687ababc7737cffbddc0d38ead2138c00/paddle/fluid/operators/sequence_ops/sequence_pad_op.cc#L79)
#### 反向传导
通常来讲,OP 的某个输入 Var 所对应的梯度 GradVar 的 LoD 应该与 Var 自身相同,所以应直接将 Var 的 LoD 共享给 GradVar,可以参考 [elementwise ops 的 backward](https://github.com/PaddlePaddle/Paddle/blob/a88a1faa48a42a8c3737deb0f05da968d200a7d3/paddle/fluid/operators/elementwise/elementwise_op.h#L189-L196)
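一个简化示意如下(输入名仅为举例,节选自反向Op的`InferShape`):
```cpp
// 仅为示意:在反向 Op 的 InferShape 中把前向输入 X 的维度和 LoD 共享给其梯度变量
void InferShape(framework::InferShapeContext* ctx) const override {
  auto x_grad_name = framework::GradVarName("X");
  if (ctx->HasOutput(x_grad_name)) {
    ctx->ShareDim("X", /*->*/ x_grad_name);
    ctx->ShareLoD("X", /*->*/ x_grad_name);
  }
}
```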
## Op性能优化
### 1.第三方库的选择
在写Op过程中优先使用高性能库(如cudnn、mkldnn、mklml、eigen等)提供的操作,但是一定要做benchmark,有些库中的操作在深度学习任务中可能会比较慢。因为高性能库(如eigen等)中提供的操作为了更加通用,在性能上可能并不理想;而通常深度学习模型中数据量较小,所以某些情况下这些库提供的操作反而速度较慢。比如Elementwise系列的所有Op(前向和反向):Elementwise操作在模型中调用次数比较多,尤其是Elementwise_add,在很多操作之后都需要添加偏置项。在之前的实现中,Elementwise_op直接调用Eigen库,由于Elementwise操作在很多情况下需要对数据做Broadcast,而实验发现Eigen库做Broadcast的速度比较慢,慢的原因在这个PR[#6229](https://github.com/PaddlePaddle/Paddle/pull/6229)中有描述。
### 2.Op性能优化
Op的计算速度与输入的数据量有关,对于某些Op可以根据输入数据的Shape和Op的属性参数来选择不同的计算方式。比如concat_op,当axis>=1时,在对多个tensor做拼接过程中需要对每个tensor做很多次拷贝,如果是在GPU上,需要调用cudaMemcpy。相对CPU而言,GPU属于外部设备,所以每次调用GPU的操作都会有一定的额外开销,并且当需要拷贝的次数较多时,这种开销就更为明显。目前concat_op的实现会根据输入数据的Shape以及axis值来选择不同的调用方式:如果输入的tensor较多,且axis不等于0,则将多次拷贝操作转换成一个CUDA Kernel来完成;如果输入tensor较少,且axis等于0,则直接进行拷贝。相关实验过程在该PR([#8669](https://github.com/PaddlePaddle/Paddle/pull/8669))中有介绍。
由于CUDA Kernel的调用有一定的额外开销,所以如果Op中出现多次调用CUDA Kernel,可能会影响Op的执行速度。比如之前的sequence_expand_op中包含很多CUDA Kernel,通常这些CUDA Kernel处理的数据量较小,所以频繁调用这样的Kernel会影响Op的计算速度,这种情况下最好将这些小的CUDA Kernel合并成一个。在优化sequence_expand_op的过程(相关PR[#9289](https://github.com/PaddlePaddle/Paddle/pull/9289))中就是采用这种思路,优化后的sequence_expand_op比之前的实现平均快出约1倍,相关实验细节在该PR([#9289](https://github.com/PaddlePaddle/Paddle/pull/9289))中有介绍。
减少CPU与GPU之间的拷贝和同步操作的次数。比如fetch操作,在每个迭代之后都会对模型参数进行更新并得到一个loss,并且数据从GPU端到没有页锁定的CPU端的拷贝是同步的,所以频繁的fetch多个参数会导致模型训练速度变慢。
## Op数值稳定性问题
### 1.有些Op存在数值稳定性问题
出现数值稳定性问题的主要原因是:程序在多次运行时,对浮点型数据施加操作的顺序可能不同,进而导致最终计算结果不同。而GPU是通过多线程并行计算的方式来加速计算的,所以很容易出现对浮点数施加操作的顺序不固定的现象。
目前发现cudnn中的卷积操作、cudnn中的MaxPooling、CUDA中CudaAtomicXX、ParallelExecutor的Reduce模式下参数梯度的聚合等操作运行结果是非确定的。
为此Fluid中添加了一些FLAGS,比如使用FLAGS_cudnn_deterministic来强制cudnn使用确定性算法、FLAGS_cpu_deterministic强制CPU端的计算使用确定性方法。
### 2.WITH_FAST_MATH的开与关
如果WITH_FAST_MATH是ON,NVCC在编译Paddle和Eigen的时候会使用--use_fast_math,这样可能会使CUDA中的一些操作在损失一定精度的情况下变快,比如log、exp、tanh等,但也会使一些操作的计算结果是错的,比如pow操作,具体原因请查看[torch/DEPRECEATED-torch7-distro#132](https://github.com/torch/DEPRECEATED-torch7-distro/issues/132)。
## 其他
### 1.报错信息
Enforce的提示信息不能为空,并且需要写明出错原因及相关上下文,因为清晰的报错信息可以让使用者更快更方便地定位错误。
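例如,下面这段检查(shape要求与报错文案仅为示意)把期望值、实际值和出错的输入都写进了报错信息,便于定位问题:
```cpp
// 仅为示意:在 InferShape 中写明期望值与实际值
auto x_dims = ctx->GetInputDim("X");
PADDLE_ENFORCE_EQ(
    x_dims.size(), 2,
    platform::errors::InvalidArgument(
        "The rank of Input(X) of my_op should be 2, but received %d.",
        x_dims.size()));
```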
### 2.Op的数学公式
如果Op有数学公式,一定要在代码中将数学公式写明,并在Python API的Doc中显示,因为用户在对比不同框架的计算结果时可能需要了解Paddle对Op是怎么实现的。
**注意:**在merge到develop分支之前一定要进行公式预览。可参考[dynamic_lstmp](../../../api_cn/layers_cn/nn_cn.html#dynamic-lstmp)。
### 3.Op变量名的命名要规范
在定义Op时,Op的输入输出以及属性的命名需要符合规范,具体命名规则请参考:[`name_convention`](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/name_convention.md)
### 4.Python端Op接口中参数的顺序
Python API中参数的顺序一般按照重要性来排,以fc为例:
```
def fc(input,
size,
num_flatten_dims=1,
param_attr=None,
bias_attr=None,
act=None,
is_test=False,
name=None)
```
# Notes on operator development
## Building logic of Fluid's op
### 1.Building logic of Fluid's op
All Ops in Fluid are derived from `OperatorBase`, and all Ops are stateless. Each Op contains only four member variables: type, inputs, outputs, and attributes.
The core method of Op is Run. The Run method requires two resources: data resources and computing resources. These two resources are obtained respectively from `Scope` and `Place`. Inside the framework, there is a global `DeviceContextPool`, which is used to record the mapping relationship between `Place` and `DeviceContext`, which means each `Place` has only one `DeviceContext` corresponding to it, and `DeviceContext` stores the computing resources of the current device. For example, for GPU, these resources include `cudnn_handle`, `cublas_handle`, `stream`, and so on. All the internal calculations (data copy and CUDA Kernel, etc.) of Op must be done in `DeviceContext`.
The Fluid framework is designed to run on a variety of devices and third-party libraries, and some Op implementations may vary on different devices or third-party libraries. Therefore, Fluid introduced the OpKernel approach, which means an Op can have multiple OpKernels. Such Ops are derived from `OperatorWithKernel`, and a representative of such Ops is conv: the OpKernels of conv_op are `GemmConvKernel`, `CUDNNConvOpKernel` and `ConvMKLDNNOpKernel`, and each OpKernel has two data types, double and float. Ops that do not need an OpKernel include `WhileOp` and so on.
Operator inheritance diagram:
![op_inheritance_relation_diagram](./op_inheritance_relation_diagram.png)
For further information, please refer to: [multi_devices](https://github.com/PaddlePaddle/FluidDoc/tree/develop/doc/fluid/design/multi_devices) , [scope](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/concepts/scope.md) , [Developer's_Guide_to_Paddle_Fluid](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/getstarted/Developer's_Guide_to_Paddle_Fluid.md)
### 2.Op's registration logic
The registration entries for each Operator include:
```C++
OpCreator creator_;
GradOpMakerFN grad_op_maker_;
proto::OpProto* proto_{nullptr};
OpAttrChecker* checker_{nullptr};
InferVarTypeFN infer_var_type_;
InferShapeFN infer_shape_;
```
<table>
<thead>
<tr>
<th>Registration Entry</th>
<th>Type</th>
<th>Description</th>
<th>Usage</th>
</tr>
</thead>
<tbody>
<tr>
<td>proto::OpProto </td>
<td>Class </td>
<td>Store the input/output/properties/Op type of Op </td>
<td>Call at compile time </td>
</tr>
<tr>
<td>GradOpMakerFN </td>
<td>Functor </td>
<td> Return a set of OpDescs of the reverse Op corresponding to the current Op, because the reverse ones of the forward Op may consist of multiple Ops </td>
<td>Call at compile time </td>
</tr>
<tr>
<td>OpAttrChecker </td>
<td>Class </td>
<td>Check the Op's attr </td>
<td>Call at compile time </td>
</tr>
<tr>
<td>InferVarTypeFN </td>
<td>Functor </td>
<td> Used to infer the type of the output Var, such as LoDTensor, SelectedRows, or others </td>
<td>Call at compile time </td>
</tr>
<tr>
<td>InferShapeFN </td>
<td>Functor </td>
<td> Used to infer the Shape of the Output </td>
<td> The usage is different at compile time and runtime. At compile time, it is called in Python side; If the Op is derived from OperatorWithKernel, at the runtime it will be called at op.run </td>
</tr>
<tr>
<td>OpCreator </td>
<td>Functor </td>
<td>Create a new OperatorBase for each call </td>
<td>Call at runtime </td>
</tr>
</tbody>
</table>
Usually you need to call REGISTER_OPERATOR when you register an Op, which is:
```
REGISTER_OPERATOR(op_type,
OperatorBase,
Op_maker_and_checker_maker,
Op_grad_opmaker,
Op_infer_var_shape,
Op_infer_var_type)
```
**Note:**
1. For all Op, the first three parameters are required, op_type specifies the name of op, OperatorBase is the object instance of this Op, op_maker_and_checker_maker is the maker of op and the checker of attr in op.
2. If the Op has a reverse, it must have op_grad_opmaker, because in backward, the reverse Op's Maker will be obtained from the forward Op.
3. The framework provides a default op_grad_opmaker: `DefaultGradOpDescMaker`, which takes all the inputs and outputs of the forward Op as the inputs of the reverse Op, takes the gradients of the forward Op's inputs as the outputs of the reverse Op, and copies the attributes of the forward Op. **Note:** DefaultGradOpDescMaker takes all the inputs and outputs of the forward Op as reverse Op inputs even when they are not needed, which prevents the framework from performing memory optimization on those unused variables.
4. The framework does not provide a default op_infer_var_shape method. If the Op has no OpKernel, you usually need to add the corresponding op_infer_var_shape method. If the Op has OpKernel, you need to implement the `InferShape` method of `OperatorWithKernel`. You don't need to provide the op_infer_var_shape method. For details, refer to [while_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/controlflow/while_op.cc), [conv_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/conv_op.cc).
5. The framework does not provide a default op_infer_var_type method, the user needs to add op_infer_var_type according to the actual situation. Strictly speaking, every Op should register an InferVarType, and op_infer_var_type infers the type and dtype of the output Var according to the type and dtype of the input Var. **Note:** In the Python-side LayerHelper, the create_variable_for_type_inference operation returns a Variable which is a LoDTensor. The C++-side InferVarType can modify the type and dtype of the `Variable`.
For more details, please refer to: [How to write a new Op](new_op_en.html)
## Notes on Writing an Op
### 1. input and output types supported by Op
The input and output of Fluid's Ops are `Variable`. In design, `Variable` can store any type. Op's input and output `Variable` may be of any type, and usually the `Variable` stores `LoDTensor` and `SelectedRows` .
**Note:**
- `context.Input<Tensor>("Input")` often appears in the code. It does not mean that the `Variable` of "Input" is `Tensor`, but indicates that the `Tensor` is obtained from `LoDTensor` in the `Variable` of the "Input". If the `Variable` of "Input" is `SelectedRows`, an error will be reported.
- If "Input" is `SelectedRows`, `context->GetInputDim("Input")` will return `var->Get<SelectedRows>().GetCompleteDims()` instead of Dim of `Tensor` in `SelectedRows` .
### 2. Do not modify the input data inside Op.
Never make any modification of the input data inside Op, as there may be other Ops that need to read this input.
### 3. The data type needs to be registered for OpKernel
Currently all OpKernel are required to register double and float data types.
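For example, a CPU kernel registration covering both data types might look like this (the op name and kernel class are placeholders, following the sketch above):
```cpp
// Sketch only: register float and double CPU kernels for a hypothetical my_op
namespace ops = paddle::operators;
REGISTER_OP_CPU_KERNEL(
    my_op, ops::MyOpKernel<paddle::platform::CPUDeviceContext, float>,
    ops::MyOpKernel<paddle::platform::CPUDeviceContext, double>);
```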
### 4.Op compatibility issue
The modification of Op needs to consider the compatibility problem. Please ensure that the previous model can be loaded and run normally after the modification of Op which means that the model trained by the old version can be loaded and run with Paddle inference library of new version. <font color="#FF0000">**So developers should ensure that the Input, Output and Attribute of OPs cannot be modified (except for documents) or deleted. And developers can add Input, Output and Attribute, but the added Input and Output must be set to be dispensable, and the default value of added Attribute must be set. For more details, please refer to [OP Input/Output/Attribute Compatibility Modification](https://github.com/PaddlePaddle/Paddle/wiki/OP-Input-Output-Attribute-Compatibility-Modification(English-Version))**</font>.
### 5.Call ShareDataWith
The function of ShareDataWith is to make the two Tensors share the underlying buffer. When calling this operation, special attention should be paid. In the Op, the ShareDataWith cannot be applied to the output of Op. In other words, the Tensor of the Op output must be from Malloc.
### 6. Sparse gradient parameter's update method
At present, the sparse gradient will first merge the gradient when updating, which is to add up the gradients of the same parameter, and then update the parameters and additional parameters (such as velocity).
### 7. (Video) Memory optimization
If the reverse Op does not require all of the inputs and outputs of the forward Op as its inputs, please do not use `DefaultGradOpDescMaker`, because it takes all of them and thus prevents memory / GPU memory optimization for the unused variables.
### 8. Calls made on Hybrid device
Since the GPU is executed asynchronously, the GPU side may not be actually executed after the CPU call returns. Therefore, if you create a temporary variable in Op that you need to use at the GPU runtime, when the GPU starts running, the temporary variable may have been released on the CPU side, which may cause GPU calculation errors.
Some of the synchronous and asynchronous operations in the GPU:
```
The following device operations are asynchronous with respect to the host:
Kernel launches;
Memory copies within a single device's memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
```
Note on cudaMemcpy and cudaMemcpyAsync:
- If the data transfer is from the GPU side to non-pinned CPU memory, the data transfer will be synchronous, even if an asynchronous copy operation is called.
- If the data is transferred from the CPU side to the CPU side, the data transfer will be synchronous, even if an asynchronous copy operation is called.
For more information, please refer to: [Asynchronous Concurrent Execution](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#asynchronous-concurrent-execution) , [API synchronization behavior](https://Docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior)
## Op Performance Optimization
### 1. Selection of third-party libraries
In the process of writing an Op, prefer the operations provided by high-performance libraries (such as cudnn, mkldnn, mklml, eigen, etc.), but always benchmark them, because some operations in these libraries may be slow in deep learning tasks. The operations provided by high-performance libraries (such as eigen) are designed to be general, so their performance is not always ideal; the amount of data in a deep learning model is usually small, so in some cases operations from these libraries may actually run slower. Take all Ops of the Elementwise family (forward and reverse) as an example: Elementwise operations are called frequently in models, especially Elementwise_add, which is used to add a bias after many operations. In the previous implementation, Elementwise_op called the Eigen library directly; since the Elementwise operation needs to broadcast the data in many cases, and experiments found that Eigen broadcasting is slow, the reason is described in PR [#6229](https://github.com/PaddlePaddle/Paddle/pull/6229).
### 2.Op performance optimization
The calculation speed of an Op is related to the amount of input data. For some Ops, different calculation methods can be selected according to the attribute parameters of the Op and the Shape of the input data. For example, for concat_op with axis>=1, concatenating multiple tensors requires many copies of each tensor; on GPU this means calling cudaMemcpy. Relative to the CPU, the GPU is an external device, so each call to the GPU carries a certain overhead, and when more copies are required, the overhead becomes more prominent. At present, the implementation of concat_op selects different calling methods according to the Shape and axis values of the input data: if there are relatively many input tensors and the axis is not equal to 0, the multiple copy operations are converted into a single CUDA Kernel; if there are fewer input tensors and the axis is equal to 0, a direct copy is used. The relevant experiment is described in PR [#8669](https://github.com/PaddlePaddle/Paddle/pull/8669).
Since the call of CUDA Kernel has a certain overhead, multiple calls of the CUDA Kernel in Op may affect the execution speed of Op. For example, the previous sequence_expand_op contains many CUDA Kernels. Usually, these CUDA Kernels process a small amount of data, so frequent calls to such Kernels will affect the calculation speed of Op. In this case, it is better to combine these small CUDA Kernels into one. This idea is used in the optimization of the sequence_expand_op procedure (related PR[#9289](https://github.com/PaddlePaddle/Paddle/pull/9289)). The optimized sequence_expand_op is about twice as fast as the previous implementation, the relevant experiments are introduced in the PR ([#9289](https://github.com/PaddlePaddle/Paddle/pull/9289)).
Reduce the number of copy and sync operations between the CPU and the GPU. For example, the fetch operation will update the model parameters and get a loss after each iteration, and the copy of the data from the GPU to the Non-Pinned-Memory CPU is synchronous, so frequent fetching for multiple parameters will reduce the model training speed.
## Op numerical stability
### 1. Some Ops have numerical stability problems
The main reason for numerical stability problems is that when the program is run multiple times, the order in which floating-point data is processed may differ, resulting in different final calculation results. The GPU accelerates computation through multi-threaded parallelism, so it is common for the order of operations on floating-point numbers not to be fixed.
At present, it is found that the convolution operation in cudnn, MaxPooling in cudnn, CudaAtomicXX in CUDA, and the aggregation of parameter gradients in the Reduce mode of ParallelExecutor are non-deterministic.
For this purpose, some FLAGS have been added to Fluid. For example, FLAGS_cudnn_deterministic forces cudnn to use deterministic algorithms, and FLAGS_cpu_deterministic forces CPU-side calculations to use deterministic methods.
### 2.On/Off of WITH_FAST_MATH
If WITH_FAST_MATH is ON, NVCC will use --use_fast_math when compiling Paddle and Eigen. This may make some operations in CUDA faster at the cost of some precision, such as log, exp and tanh, but it may also make the results of some operations wrong, such as the pow operation; please read [torch/DEPRECEATED-torch7-distro#132](https://github.com/torch/DEPRECEATED-torch7-distro/issues/132) for the specific reasons.
## Other
### 1. Error message
The Enforce message must not be empty and should clearly state the cause of the error and the relevant context, because a clear error message helps analyze the cause of the error more quickly and conveniently.
### 2.Op's mathematical formula
If Op has a mathematical formula, be sure to write the mathematical formula in the code and display it in the Doc of the Python API, because the user may need to understand how Paddle implements Op when comparing the calculation results among different frameworks.
**Note:** The formula preview must be done before the merge to the develop branch. Example: [dynamic_lstmp](../../../api/layers/nn.html#dynamic-lstmp).
### 3. The order of parameters in the Python-side Op interface
The order of the parameters in the Python API is generally ranked by importance, taking fc as an example:
```
def fc(input,
size,
num_flatten_dims=1,
param_attr=None,
bias_attr=None,
act=None,
is_test=False,
name=None)
```
.. _contribute_to_paddle_faq:
###################
FAQ
###################
.. contents::
1. CLA签署不成功,怎么办?
---------------------------
由于 `CLA <https://github.com/cla-assistant/cla-assistant>`_ 是第三方开源库,有时候会不稳定。如果确定自己已签署CLA,但CLA没触发成功,可尝试:
* 关闭并重新开启本PR,来重新触发CLA。点击 :code:`Close pull request` ,再点击 :code:`Reopen pull request` ,并等待几分钟。
* 如果上述操作重复2次仍未生效,请重新提一个PR或评论区留言。
2. CI没有触发,怎么办?
------------------------
* 请在commit信息中添加正确的CI触发规则:
* develop分支请添加 :code:`test=develop`
* release分支请添加如 :code:`test=release/1.4` 来触发release/1.4分支
* 文档预览请添加 :code:`test=document_preview`
* 该CI触发规则以commit为单位,即对同一个PR来说,不管前面的commit是否已经添加,如果新commit想继续触发CI,那么仍然需要添加。
* 添加CI触发规则后,仍有部分CI没有触发:请关闭并重新开启本PR,来重新触发CI。
3. CI随机挂,即错误信息与本PR无关,怎么办?
--------------------------------------
由于develop分支代码的不稳定性,CI可能会随机挂。
如果确定CI错误和本PR无关,请在评论区贴上错误截图和错误链接。
4. 如何修改API.spec?
-----------------------
为了保证API接口/文档的稳定性,我们对API进行了监控,即API.spec文件。
修改方法请参考 `diff_api.py <https://github.com/PaddlePaddle/Paddle/blob/ddfc823c73934d483df36fa9a8b96e67b19b67b4/tools/diff_api.py#L29-L34>`_ 。
**注意**:提交PR后请查看下diff,不要改到非本PR修改的API上。
############
如何贡献代码
############
.. toctree::
:maxdepth: 1
local_dev_guide.md
submit_pr_guide.md
faq.rst
#################################
How to contribute codes to Paddle
#################################
.. toctree::
:maxdepth: 1
local_dev_guide_en.md
submit_pr_guide_en.md
# 本地开发指南
本文将指导您如何在本地进行代码开发
## 代码要求
- 代码注释请遵守 [Doxygen](http://www.doxygen.nl/) 的样式。
- 确保编译器选项 `WITH_STYLE_CHECK` 已打开,并且编译能通过代码样式检查。
- 所有代码必须具有单元测试。
- 通过所有单元测试。
- 请遵守[提交代码的一些约定](#提交代码的一些约定)
以下教程将指导您提交代码。
## [Fork](https://help.github.com/articles/fork-a-repo/)
跳转到[PaddlePaddle](https://github.com/PaddlePaddle/Paddle) GitHub首页,然后单击 `Fork` 按钮,生成自己目录下的仓库,比如 <https://github.com/USERNAME/Paddle>
## 克隆(Clone)
将远程仓库 clone 到本地:
```bash
➜ git clone https://github.com/USERNAME/Paddle
cd Paddle
```
## 创建本地分支
Paddle 目前使用[Git流分支模型](http://nvie.com/posts/a-successful-git-branching-model/)进行开发,测试,发行和维护,具体请参考 [Paddle 分支规范](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/others/releasing_process.md)
所有的 feature 和 bug fix 的开发工作都应该在一个新的分支上完成,一般从 `develop` 分支上创建新分支。
使用 `git checkout -b` 创建并切换到新分支。
```bash
➜ git checkout -b my-cool-stuff
```
值得注意的是,在 checkout 之前,需要保持当前分支目录 clean,否则会把 untracked 的文件也带到新分支上,这可以通过 `git status` 查看。
## 使用 `pre-commit` 钩子
Paddle 开发人员使用 [pre-commit](http://pre-commit.com/) 工具来管理 Git 预提交钩子。 它可以帮助我们格式化源代码(C++,Python),在提交(commit)前自动检查一些基本事宜(如每个文件只有一个 EOL,Git 中不要添加大文件等)。
`pre-commit`测试是 Travis-CI 中单元测试的一部分,不满足钩子的 PR 不能被提交到 Paddle,首先安装并在当前目录运行它:
```bash
➜ pip install pre-commit
➜ pre-commit install
```
Paddle 使用 `clang-format` 来调整 C/C++ 源代码格式,请确保 `clang-format` 版本在 3.8 以上。
注:通过`pip install pre-commit`和`conda install -c conda-forge pre-commit`安装的`yapf`稍有不同,Paddle 开发人员使用的是`pip install pre-commit`。
## 开始开发
在本例中,我删除了 README.md 中的一行,并创建了一个新文件。
通过 `git status` 查看当前状态,这会提示当前目录的一些变化,同时也可以通过 `git diff` 查看文件具体被修改的内容。
```bash
➜ git status
On branch test
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: README.md
Untracked files:
(use "git add <file>..." to include in what will be committed)
test
no changes added to commit (use "git add" and/or "git commit -a")
```
## 编译和单元测试
关于编译 PaddlePaddle 的源码,请参见[从源码编译](../../../install/compile/fromsource.html) 选择对应的操作系统。
关于单元测试,可参考[Op单元测试](../new_op/new_op.html#id7) 的运行方法。
## 提交(commit)
接下来我们取消对 README.md 文件的改变,然后提交新添加的 test 文件。
```bash
➜ git checkout -- README.md
➜ git status
On branch test
Untracked files:
(use "git add <file>..." to include in what will be committed)
test
nothing added to commit but untracked files present (use "git add" to track)
➜ git add test
```
Git 每次提交代码,都需要写提交说明,这可以让其他人知道这次提交做了哪些改变,这可以通过`git commit` 完成。
```bash
➜ git commit
CRLF end-lines remover...............................(no files to check)Skipped
yapf.................................................(no files to check)Skipped
Check for added large files..............................................Passed
Check for merge conflicts................................................Passed
Check for broken symlinks................................................Passed
Detect Private Key...................................(no files to check)Skipped
Fix End of Files.....................................(no files to check)Skipped
clang-formater.......................................(no files to check)Skipped
[my-cool-stuff c703c041] add test file
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 233
```
## 保持本地仓库最新
在准备发起 Pull Request 之前,需要同步原仓库(<https://github.com/PaddlePaddle/Paddle>)最新的代码。
首先通过 `git remote` 查看当前远程仓库的名字。
```bash
➜ git remote
origin
➜ git remote -v
origin https://github.com/USERNAME/Paddle (fetch)
origin https://github.com/USERNAME/Paddle (push)
```
这里 origin 是我们 clone 的远程仓库的名字,也就是自己用户名下的 Paddle,接下来我们创建一个原始 Paddle 仓库的远程主机,命名为 upstream。
```bash
➜ git remote add upstream https://github.com/PaddlePaddle/Paddle
➜ git remote
origin
upstream
```
获取 upstream 的最新代码并更新当前分支。
```bash
➜ git fetch upstream
➜ git pull upstream develop
```
## Push 到远程仓库
将本地的修改推送到 GitHub 上,也就是 https://github.com/USERNAME/Paddle。
```bash
# 推送到远程仓库 origin 的 my-cool-stuff 分支上
➜ git push origin my-cool-stuff
```
# Guide of local development
You will learn how to develop programs in local environment under the guidelines of this document.
## Requirements of coding
- Please refer to the coding comment format of [Doxygen](http://www.doxygen.nl/)
- Make sure that option of builder `WITH_STYLE_CHECK` is on and the build could pass through the code style check.
- Unit test is needed for all codes.
- Pass through all unit tests.
- Please follow the [regulations of submitting codes](#certain-regulations-about-submitting-code).
The following guidance tells you how to submit code.
## [Fork](https://help.github.com/articles/fork-a-repo/)
Go to the GitHub home page of [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), then click the `Fork` button to create a repository under your own account, such as <https://github.com/USERNAME/Paddle>
## Clone
Clone the remote repository to local:
```bash
➜ git clone https://github.com/USERNAME/Paddle
cd Paddle
```
## Create local branch
At present, Paddle applies the [Git branching model](http://nvie.com/posts/a-successful-git-branching-model/) for development, testing, release and maintenance. Please refer to the [branch regulation of Paddle](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/others/releasing_process.md) for details.
All development work for features and bug fixes should be done on a new branch, which is usually created from the `develop` branch.
Create and switch to a new branch with command `git checkout -b`.
```bash
➜ git checkout -b my-cool-stuff
```
It is worth noting that before the checkout, you need to keep the current branch directory clean, otherwise the untracked file will be brought to the new branch, which can be viewed by `git status` .
## Use `pre-commit` hook
Paddle developers use the [pre-commit](http://pre-commit.com/) tool to manage Git pre-commit hooks. It helps us format the source code (C++, Python) and automatically check some basic things before committing (such as having only one EOL per file, not adding large files in Git, etc.).
The `pre-commit` test is part of the unit test in Travis-CI. A PR that does not satisfy the hook cannot be submitted to Paddle. Install `pre-commit` first and then run it in current directory:
```bash
➜ pip install pre-commit
➜ pre-commit install
```
Paddle uses `clang-format` to format its C/C++ source code. Make sure the version of `clang-format` is 3.8 or above.
Note: the `yapf` installed via `pip install pre-commit` is slightly different from the one installed via `conda install -c conda-forge pre-commit`. Paddle developers use `pip install pre-commit`.
## Start development
In this example, I delete a line from README.md and create a new file.
View the current state via `git status` , which will prompt some changes to the current directory, and you can also view the file's specific changes via `git diff` .
```bash
➜ git status
On branch test
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: README.md
Untracked files:
(use "git add <file>..." to include in what will be committed)
test
no changes added to commit (use "git add" and/or "git commit -a")
```
## Build and test
Please refer to [Compile From Source Code](../../../install/compile/fromsource_en.html) about more information of building PaddlePaddle source codes.
Please refer to [Op Unit Tests](../new_op/new_op_en.html#unit-tests) about more information of running unit tests.
## Commit
Next we discard the modification of README.md and add the newly created test file.
```bash
➜ git checkout -- README.md
➜ git status
On branch test
Untracked files:
(use "git add <file>..." to include in what will be committed)
test
nothing added to commit but untracked files present (use "git add" to track)
➜ git add test
```
It's required that the commit message is also given on every Git commit, through which other developers will be notified of what changes have been made. Type `git commit` to realize it.
```bash
➜ git commit
CRLF end-lines remover...............................(no files to check)Skipped
yapf.................................................(no files to check)Skipped
Check for added large files..............................................Passed
Check for merge conflicts................................................Passed
Check for broken symlinks................................................Passed
Detect Private Key...................................(no files to check)Skipped
Fix End of Files.....................................(no files to check)Skipped
clang-formater.......................................(no files to check)Skipped
[my-cool-stuff c703c041] add test file
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 233
```
## Keep the latest local repository
Before creating a Pull Request, you need to synchronize with the latest code of the original repository (<https://github.com/PaddlePaddle/Paddle>).
Check the name of current remote repository with `git remote`.
```bash
➜ git remote
origin
➜ git remote -v
origin https://github.com/USERNAME/Paddle (fetch)
origin https://github.com/USERNAME/Paddle (push)
```
Here origin is the name of the remote repository that we cloned, i.e. the Paddle under your own account. Next we add the original Paddle repository as a remote host and name it upstream.
```bash
➜ git remote add upstream https://github.com/PaddlePaddle/Paddle
➜ git remote
origin
upstream
```
Get the latest code of upstream and update current branch.
```bash
➜ git fetch upstream
➜ git pull upstream develop
```
## Push to remote repository
Push local modification to GitHub(https://github.com/USERNAME/Paddle).
```bash
# push to the my-cool-stuff branch of the remote repository origin
➜ git push origin my-cool-stuff
```
# 提交PR注意事项
## 建立 Issue 并完成 Pull Request
建立一个 Issue 描述问题,并记录它的编号。
切换到所建分支,然后点击 `New pull request`
<img width="295" alt="screen shot 2017-04-26 at 9 09 28 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436054/a6d98c66-2ac4-11e7-9cb1-18dd13150230.png">
选择目标分支:
<img width="750" alt="screen shot 2017-04-26 at 9 11 52 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436139/f83b1e6c-2ac4-11e7-8c0e-add499023c46.png">
在 PR 的描述说明中,填写 `resolve #Issue编号` 可以在这个 PR 被 merge 后,自动关闭对应的 Issue,具体请见[这里](https://help.github.com/articles/closing-issues-via-commit-messages/)
接下来等待 review,如果有需要修改的地方,参照上述步骤更新 origin 中的对应分支即可。
## 签署CLA协议和通过单元测试
### 签署CLA
在首次向PaddlePaddle提交Pull Request时,您需要签署一次CLA(Contributor License Agreement)协议,以保证您的代码可以被合入,具体签署方式如下:
- 请您查看PR中的Check部分,找到license/cla,并点击右侧detail,进入CLA网站
<div align="center">
<img src="https://github.com/PaddlePaddle/FluidDoc/blob/release/1.1/doc/fluid/advanced_usage/development/contribute_to_paddle/img/cla_unsigned.png?raw=true" height="40" width="500">
</div>
- 请您点击CLA网站中的“Sign in with GitHub to agree”,点击完成后将会跳转回您的Pull Request页面
<div align="center">
<img src="https://github.com/PaddlePaddle/FluidDoc/blob/release/1.1/doc/fluid/advanced_usage/development/contribute_to_paddle/img/sign_cla.png?raw=true" height="330" width="400">
</div>
### 通过单元测试
您在Pull Request中每提交一次新的commit后,会触发CI单元测试,请确认您的commit message中已加入必要的说明,请见[提交(commit)](local_dev_guide.html#permalink-8--commit-)
请您关注您Pull Request中的CI单元测试进程,它将会在几个小时内完成
您仅需要关注和自己提交的分支相关的CI项目,例如您向develop分支提交代码,则无需关注release/1.1一栏是否通过测试
当所需的测试后都出现了绿色的对勾,表示您本次commit通过了各项单元测试
如果所需的测试后出现了红色叉号,代表您本次的commit未通过某项单元测试,在这种情况下,请您点击detail查看报错详情,并将报错原因截图,以评论的方式添加在您的Pull Request中,我们的工作人员将帮您查看
## 删除远程分支
在 PR 被 merge 进主仓库后,我们可以在 PR 的页面删除远程仓库的分支。
<img width="775" alt="screen shot 2017-04-26 at 9 18 24 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436457/e4cdd472-2ac5-11e7-9272-badc76c4a23e.png">
也可以使用 `git push origin :分支名` 删除远程分支,如:
```bash
➜ git push origin :my-cool-stuff
```
## 删除本地分支
最后,删除本地分支。
```bash
# 切换到 develop 分支
➜ git checkout develop
# 删除 my-cool-stuff 分支
➜ git branch -D my-cool-stuff
```
至此,我们就完成了一次代码贡献的过程。
## 提交代码的一些约定
为了使评审人在评审代码时更好地专注于代码本身,请您每次提交代码时,遵守以下约定:
1)请保证Travis-CI 中单元测试能顺利通过。如果没过,说明提交的代码存在问题,评审人一般不做评审。
2)提交Pull Request前:
- 请注意commit的数量:
原因:如果仅仅修改一个文件但提交了十几个commit,每个commit只做了少量的修改,这会给评审人带来很大困扰。评审人需要逐一查看每个commit才能知道做了哪些修改,且不排除commit之间的修改存在相互覆盖的情况。
建议:每次提交时,保持尽量少的commit,可以通过`git commit --amend`补充上次的commit。对已经Push到远程仓库的多个commit,可以参考[squash commits after push](http://stackoverflow.com/questions/5667884/how-to-squash-commits-in-git-after-they-have-been-pushed)
- 请注意每个commit的名称:应能反映当前commit的内容,不能太随意。
3)如果解决了某个Issue的问题,请在该Pull Request的**第一个**评论框中加上:`fix #issue_number`,这样当该Pull Request被合并后,会自动关闭对应的Issue。关键词包括:close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved,请选择合适的词汇。详细可参考[Closing issues via commit messages](https://help.github.com/articles/closing-issues-via-commit-messages)。
此外,在回复评审人意见时,请您遵守以下约定:
1)评审人的每个意见都必须回复(这是开源社区的基本礼貌,别人帮了忙,应该说谢谢):
- 对评审意见同意且按其修改完的,给个简单的`Done`即可;
- 对评审意见不同意的,请给出您自己的反驳理由。
2)如果评审意见比较多:
- 请给出总体的修改情况。
- 请采用[start a review](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/)进行回复,而非直接回复的方式。原因是每个回复都会发送一封邮件,会造成邮件灾难。
# Guide of submitting PR to Github
## Create an Issue and finish Pull Request
Create an Issue to describe your problem and keep its number.
Switch to the branch you have created and click `New pull request`
<img width="295" alt="screen shot 2017-04-26 at 9 09 28 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436054/a6d98c66-2ac4-11e7-9cb1-18dd13150230.png">
Switch to targeted branch:
<img width="750" alt="screen shot 2017-04-26 at 9 11 52 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436139/f83b1e6c-2ac4-11e7-8c0e-add499023c46.png">
Writing `resolve #Issue number` in the PR description will automatically close the corresponding Issue after the PR is merged. More details can be found [here](https://help.github.com/articles/closing-issues-via-commit-messages/).
Then please wait for review. If any modification is needed, you can update the corresponding branch in origin following the steps above.
## Sign CLA and pass unit tests
### Sign CLA
When you submit a Pull Request for the first time, you need to sign the CLA (Contributor License Agreement) so that your code can be merged. The specific steps are listed as follows:
- Please check the Check part of the PR to find license/cla, and click detail on the right to go to the CLA website.
<div align="center">
<img src="https://github.com/PaddlePaddle/FluidDoc/blob/release/1.1/doc/fluid/advanced_usage/development/contribute_to_paddle/img/cla_unsigned.png?raw=true" height="40" width="500">
</div>
- Please click “Sign in with GitHub to agree” on the CLA website. After clicking, you will be redirected back to your Pull Request page.
<div align="center">
<img src="https://github.com/PaddlePaddle/FluidDoc/blob/release/1.1/doc/fluid/advanced_usage/development/contribute_to_paddle/img/sign_cla.png?raw=true" height="330" width="400">
</div>
### Pass unit tests
Every new commit in your Pull Request will trigger CI unit tests, so please make sure that the necessary notes have been included in your commit message. Please refer to [commit](local_dev_guide.html#permalink-8--commit-).
Please pay attention to the progress of the CI unit tests in your Pull Request, which will be finished within several hours.
You only need to focus on the CI jobs associated with the branch you submitted to. For example, if you submit code to the develop branch, there is no need to check whether release/1.1 passes or not.
Green ticks after all required tests mean that your commit has passed all unit tests.
A red cross after a test means your commit has not passed that unit test. In this case, please click detail to view the error details, take a screenshot of the error, and add it as a comment in your Pull Request. Our staff will help you check it.
## Delete remote branch
We can delete branches of remote repository in PR page after your PR is successfully merged into master repository.
<img width="775" alt="screen shot 2017-04-26 at 9 18 24 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436457/e4cdd472-2ac5-11e7-9272-badc76c4a23e.png">
We can also delete the branch of remote repository with `git push origin :the_branch_name`,such as:
```bash
➜ git push origin :my-cool-stuff
```
## Delete local branch
Finally, we delete the local branch.
```bash
# Switch to develop branch
➜ git checkout develop
# delete my-cool-stuff branch
➜ git branch -D my-cool-stuff
```
Now we have finished a full process of code contribution.
## Certain regulations about submitting code
In order that reviewers can focus on the code itself during the code review, please follow these rules every time you submit your code:
1) Make sure that the unit tests in Travis-CI pass. If they fail, it means there are problems in the submitted code, and reviewers generally will not review it.
2) Before submitting a Pull Request:
- Please pay attention to the number of commits:
Reason: it will bother reviewers a lot if a dozen commits are submitted after modifying only one file and each commit only contains a few changes. Reviewers have to check the commits one by one to figure out the modifications, and the modifications in different commits may even overwrite each other.
Suggestion: keep as few commits as possible at every submission. You can supplement the previous commit with `git commit --amend`. For several commits that have already been pushed to the remote repository, you can refer to [squash commits after push](http://stackoverflow.com/questions/5667884/how-to-squash-commits-in-git-after-they-have-been-pushed).
- Pay attention to the name of every commit: it should reflect the content of the current commit and not be too arbitrary.
3) If you have solved the problem of an Issue, please add `fix #issue_number` to the **first** comment area of the Pull Request, so that the corresponding Issue will be closed automatically after the Pull Request is merged. Keywords include: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved. Please choose the appropriate word. Please refer to [Closing issues via commit messages](https://help.github.com/articles/closing-issues-via-commit-messages) for more details.
In addition, please follow these conventions when responding to reviewers' comments:
1) Reply to every comment from the reviewers (this is basic courtesy in the open source community; when others have helped, say thank you):
- If you agree with a comment and have modified the code accordingly, a simple `Done` is enough.
- If you disagree with a comment, please explain your reasons.
2) If there are many review comments:
- Please give an overall summary of your modifications.
- Please reply using [start a review](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/) instead of replying to each comment directly, because every direct reply sends an email, which would cause an email flood.
让我们从学习飞桨的基本概念这里开始:

- `飞桨框架2.0整体介绍 <./01_paddle2.0_introduction/index_cn.html>`_ : 飞桨框架2.0新特性的介绍与飞桨框架2.0升级指南的说明。
- `飞桨框架2.0模型开发 <./02_paddle2.0_develop/index_cn.html>`_ : 飞桨框架2.0模型开发全流程说明。
- `模型可视化 <./03_VisualDL/index_cn.html>`_ : 介绍如何用VisualDL实现飞桨框架模型的可视化。
- `动态图转静态图 <./04_dygraph_to_static/index_cn.html>`_ : 介绍飞桨框架动态图转静态图的方法。
- `预测部署 <./05_inference_deployment/index_cn.html>`_ : 介绍如何使用训练好的模型进行预测。
- `分布式训练 <./06_distributed_training/index_cn.html>`_ : 介绍如何使用分布式进行训练。
- `性能优化 <./07_performance_improving/index_cn.html>`_ : 介绍飞桨框架使用过程中的调优方法。
- `自定义OP <./08_new_op/index_cn.html>`_ : 介绍飞桨框架自定义OP的方法。
- `参与开发 <./09_contribution/index_cn.html>`_ : 介绍如何参与飞桨框架的开发。

.. toctree::
    :hidden:

    01_paddle2.0_introduction/index_cn.rst
    02_paddle2.0_develop/index_cn.rst
    03_VisualDL/index_cn.rst
    04_dygraph_to_static/index_cn.rst
    05_inference_deployment/index_cn.rst
    06_distributed_training/index_cn.rst
    07_performance_improving/index_cn.rst
    08_new_op/index_cn.rst
    09_contribution/index_cn.rst