cherry-pick to release/1.4 (#833)

* update new_op.cm
* update op note
* dygraph doc
* refine doc
* refine doc
* refine doc
* refine language
* refine code
* hotfix deadlink (#811)
* Update native_infer_en.md (#787)
* Update install_Windows_en.md (#790)
* Update install_Windows_en.md
* Update install_Windows_en.md
* Update cluster_howto_en.rst (#791)
* Update cluster_howto_en.rst
* Update cluster_howto_en.rst
* Update doc/fluid/user_guides/howto/training/cluster_howto_en.rst Co-Authored-By: N acosta123 <42226556+acosta123@users.noreply.github.com>
* Update doc/fluid/user_guides/howto/training/cluster_howto_en.rst Co-Authored-By: N acosta123 <42226556+acosta123@users.noreply.github.com>
* Update cluster_howto_en.rst
* Update index_cn.rst (#813)
* rm_win_install_table (#815)
* Update paddle_tensorrt_infer_en.md (#788)
* Update paddle_tensorrt_infer_en.md
* Update paddle_tensorrt_infer_en.md
* Update doc/fluid/advanced_usage/deploy/inference/paddle_tensorrt_infer_en.md Co-Authored-By: N acosta123 <42226556+acosta123@users.noreply.github.com>
* Update paddle_tensorrt_infer_en.md
* Update paddle_tensorrt_infer_en.md
* Update use_py_reader_en.rst (#793)
* Update use_py_reader_en.rst
* Update use_py_reader_en.rst
* Update use_py_reader_en.rst
* Update doc/fluid/user_guides/howto/prepare_data/use_py_reader_en.rst Co-Authored-By: N acosta123 <42226556+acosta123@users.noreply.github.com>
* Update use_py_reader_en.rst
* Update learning_rate_scheduler_en.rst (#804)
* Update learning_rate_scheduler_en.rst
* Update doc/fluid/api_guides/low_level/layers/learning_rate_scheduler_en.rst Co-Authored-By: N acosta123 <42226556+acosta123@users.noreply.github.com>
* Update doc/fluid/api_guides/low_level/layers/learning_rate_scheduler_en.rst Co-Authored-By: N acosta123 <42226556+acosta123@users.noreply.github.com>
* Update cluster_howto_en.rst (#807)
* fix dygraph load_persistables bug (#817)
* update_install_check command (#818)
* update op role
* update op role
* modified yolov3_loss annotation (#824)
* update_models_link (#825)
* follow comments
* follow comments
* update_install_doc_0422 (#827)
* follow comments
* update_releasenote_1.4 (#828)
* fix typo (#830)
* fix typo
* Update native_infer.md

2d19958b · Cheerego · GitHub · d27da9ce
29 changed files
--- a/doc/fluid/advanced_usage/deploy/inference/native_infer.md
+++ b/doc/fluid/advanced_usage/deploy/inference/native_infer.md
@@ -25,7 +25,8 @@ PaddleTensor 定义了预测最基本的输入输出的数据格式，常用字
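 There are two kinds of `Config`: `NativeConfig` is simpler and more stable, while `AnalysisConfig` has newer features and better performance

 - `NativeConfig`: the native engine, composed of Paddle's native forward operators,
-  so it naturally supports all models trained with Paddle,
+  so it naturally supports all models trained with Paddle
+  
 - `AnalysisConfig` 
  - Supports analysis and optimization of the computation graph
  - Supports the latest op fusions; performance is generally better than `NativeConfig`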

--- a/doc/fluid/advanced_usage/deploy/inference/paddle_tensorrt_infer_en.md
+++ b/doc/fluid/advanced_usage/deploy/inference/paddle_tensorrt_infer_en.md
-# Use TensorRT Library for inference
+# Use Paddle-TensorRT Library for inference

 NVIDIA TensorRT is a platform for high-performance deep learning inference. It delivers low latency and high throughput for deep learning inference applications.
-Subgraph is used in Paddle 1.0 to preliminarily integrate TensorRT, which enables TensorRT module to enhance inference performance of paddle models. The module is still under development. Currently supported models are AlexNet, MobileNet, ResNet50, VGG19, ResNext, Se-ReNext, GoogleNet, DPN, ICNET, MobileNet-SSD and so on. We will introduce the obtaining, usage and theory of Paddle-TensorRT library in this documentation.
-
-
-## Build inference libraries with `TensorRT`
+PaddlePaddle uses subgraphs to integrate TensorRT on a preliminary basis, so that the TensorRT module can improve the inference performance of Paddle models. The module is still under development. Currently supported models include AlexNet, MobileNet, ResNet50, VGG19, ResNext, Se-ReNext, GoogleNet, DPN, ICNET, Deeplabv3, MobileNet-SSD and so on. This document describes how to obtain and use the Paddle-TensorRT library and how it works.
+
+## Contents
+ - [compile Paddle-TRT inference libraries](#compile Paddle-TRT inference libraries)
+ - [Paddle-TRT interface usage](#Paddle-TRT interface usage)
+ - [Paddle-TRT example compiling test](#Paddle-TRT example compiling test)
+ - [Paddle-TRT INT8 usage](#Paddle-TRT_INT8 usage)
+ - [Paddle-TRT subgraph operation principle](#Paddle-TRT subgraph operation principle)
+ 
+## <a name="compile Paddle-TRT inference libraries">compile Paddle-TRT inference libraries</a>

 **Use Docker to build inference libraries**         

@@ -42,7 +48,7 @@ Subgraph is used in Paddle 1.0 to preliminarily integrate TensorRT, which enable
 	make inference_lib_dist -j
 	```

-## Usage of Paddle TensorRT
+## <a name="Paddle-TRT interface usage">Paddle-TRT interface usage</a> 

 [`paddle_inference_api.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/api/paddle_inference_api.h) defines all the TensorRT-related APIs.

@@ -58,17 +64,17 @@ A complete process is shown below:
 #include "paddle_inference_api.h"

 namespace paddle {
-using paddle::contrib::AnalysisConfig;
+using paddle::AnalysisConfig;

 void RunTensorRT(int batch_size, std::string model_dirname) {
  // 1. Create AnalysisConfig
-  AnalysisConfig config(true);
-  config.model_dir = model_dirname;
-  config->use_gpu = true;
-  config->device = 0;
-  config->fraction_of_gpu_memory = 0.15;
+  AnalysisConfig config(model_dirname);
+  // config.SetModel(model_dirname + "/model",
+  //                 model_dirname + "/params");
+  
+  config.EnableUseGpu(100 /*initial GPU memory pool size in MB*/, 0 /*gpu_id*/);
  config.EnableTensorRtEngine(1 << 20 /*work_space_size*/, batch_size /*max_batch_size*/);
-
+  
  // 2. Create predictor based on config
  auto predictor = CreatePaddlePredictor(config);
  // 3. Create input tensor 
@@ -104,10 +110,33 @@ int main() {
 ```
 The compilation process is [here](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/inference/api/demo_ci)

+## <a name="Paddle-TRT_INT8 usage">Paddle-TRT INT8 usage</a>
+
+  1. Paddle-TRT INT8 introduction    
+The parameters of a neural network are redundant to some extent. In many tasks, a Float32 model can be converted to an Int8 model while preserving accuracy. Paddle-TRT currently supports converting a trained Float32 model to an Int8 model offline. The process is as follows: 1) **Create the calibration table**. Prepare about 500 real input samples and feed them to the model. Paddle-TRT collects the value range of each op's inputs and outputs in the model and records it in the calibration table. This information reduces the information loss during conversion. 2) Once the calibration table has been created, run the model again. **Paddle-TRT loads the calibration table automatically** and runs inference in INT8 mode.
+
+  2. compile and test the INT8 example
+
+  	```shell
+ 	cd SAMPLE_BASE_DIR/sample
+ 	# sh run_impl.sh {path to the inference libraries} {name of the test script} {model directory}
+ 	# This example generates 500 input samples to simulate the process; using real samples is recommended for actual experiments.
+ 	sh run_impl.sh BASE_DIR/fluid_inference_install_dir/  fluid_generate_calib_test SAMPLE_BASE_DIR/sample/mobilenetv1
+ 	
+ 	```
+ 	
+        After the run finishes, a new file named trt_calib_* appears under the `SAMPLE_BASE_DIR/sample/build/mobilenetv1` model directory. This file is the calibration table.
+
+  	``` shell
+ 	# conduct INT8 inference
+ 	# copy the model directory that contains the calibration table to a separate directory
+ 	cp -rf SAMPLE_BASE_DIR/sample/build/mobilenetv1 SAMPLE_BASE_DIR/sample/mobilenetv1_calib
+ 	sh run_impl.sh BASE_DIR/fluid_inference_install_dir/  fluid_int8_test SAMPLE_BASE_DIR/sample/mobilenetv1_calib
+ 	```
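
The two calibration steps described above can be sketched in plain Python. This is only an illustration, not Paddle-TRT's calibration code: the helper names, the tensor name `conv1.out`, and the simple abs-max scaling rule are all assumptions made for the example, and TensorRT's real calibrator may choose ranges differently.

```python
# Illustrative sketch only; every name below is hypothetical.

def build_calibration_table(batches):
    """Step 1: record the largest absolute value seen for each named tensor."""
    table = {}
    for batch in batches:  # one dict of tensor name -> values per calibration input
        for name, values in batch.items():
            peak = max(abs(v) for v in values)
            table[name] = max(table.get(name, 0.0), peak)
    return table


def quantize(values, abs_max):
    """Step 2: map floats to symmetric int8 with scale = abs_max / 127."""
    scale = abs_max / 127.0 if abs_max > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from int8 values."""
    return [q * scale for q in quantized]


calib_batches = [
    {"conv1.out": [0.5, -2.0, 1.25]},
    {"conv1.out": [4.0, -0.25, 3.0]},
]
table = build_calibration_table(calib_batches)
q, scale = quantize([4.0, -1.0, 0.0], table["conv1.out"])
restored = dequantize(q, scale)
```

The recorded range fixes the scale, so the rounding error of any value inside the range is at most half a quantization step. This is why representative calibration inputs matter.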

-## Subgraph Theory
+## <a name="Paddle-TRT subgraph operation principle">Paddle-TRT subgraph operation principle</a>

-   Subgraph is used to integrate TensorRT in PaddlePaddle. After model is loaded, neural network can be represented as a computing graph composed of variables and computing nodes. Functions Paddle TensorRT implements are to scan the whole picture, discover subgraphs that can be optimized with TensorRT and replace them with TensorRT nodes. During the inference of model, Paddle will call TensorRT library to optimize TensorRT nodes and call native library of Paddle to optimize other nodes. During the inference, TensorRT can integrate Op horizonally and vertically to filter redundant Ops and is able to choose appropriate kernel for specific Op in specific platform to speed up the inference of model.
+PaddlePaddle integrates TensorRT in the form of subgraphs. After a model is loaded, the neural network is represented as a computation graph made up of variables and computing nodes. Paddle-TensorRT scans the whole graph, finds the subgraphs that TensorRT can optimize, and replaces each of them with a TensorRT node. During inference, Paddle calls the TensorRT library to run the TensorRT nodes and calls its own native implementation to run the other nodes. TensorRT can fuse Ops horizontally and vertically, remove redundant Ops, and choose an appropriate kernel for each Op on the given platform, which speeds up inference.
   
 A simple model expresses the process : 

@@ -121,6 +150,6 @@ A simple model expresses the process :
 <img src="https://raw.githubusercontent.com/NHZlX/FluidDoc/add_trt_doc/doc/fluid/user_guides/howto/inference/image/model_graph_trt.png" width="600">
 </p>

-  We can see in the Original Network that the green nodes represent nodes supported by TensorRT, the red nodes represent variables in network and yellow nodes represent nodes which can only be operated by native functions in Paddle. Green nodes in original network are extracted to compose subgraph which is replaced by a single TensorRT node to be transformed into `block-25` node in network. When such nodes are encountered during the runtime, TensorRT library will be called to execute them.
+In the Original Network, the green nodes are nodes supported by TensorRT, the red nodes are variables in the network, and the yellow nodes are nodes that can only be executed by Paddle's native implementation. The green nodes in the original network are extracted to form a subgraph, which is replaced by a single TensorRT node, shown as the `block-25` node in the transformed network. When such a node is encountered at run time, the TensorRT library is called to execute it.
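
The replacement step can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not Paddle's implementation: the network is treated as a linear chain of op types rather than a general graph, and the supported-op set, the `tensorrt_engine` label, and the `min_subgraph_size` threshold are chosen only for the example.

```python
# Hypothetical set of op types that the accelerated backend can run.
TRT_SUPPORTED = {"conv2d", "batch_norm", "relu", "pool2d"}


def fuse_supported_runs(op_types, supported=TRT_SUPPORTED, min_subgraph_size=2):
    """Replace each maximal run of supported ops with one fused node."""
    fused, run = [], []

    def flush():
        # A run that is long enough becomes a single engine node;
        # shorter runs stay as ordinary native ops.
        if len(run) >= min_subgraph_size:
            fused.append(("tensorrt_engine", tuple(run)))
        else:
            fused.extend((op, (op,)) for op in run)
        run.clear()

    for op in op_types:
        if op in supported:
            run.append(op)
        else:
            flush()
            fused.append((op, (op,)))
    flush()
    return fused


ops = ["conv2d", "batch_norm", "relu", "while", "conv2d", "softmax"]
plan = fuse_supported_runs(ops)
```

In this sketch the first three ops collapse into one engine node, while the lone `conv2d` after `while` stays native because its run is shorter than the threshold.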
   

--- a/doc/fluid/advanced_usage/development/new_op/new_op.md
+++ b/doc/fluid/advanced_usage/development/new_op/new_op.md
--- a/doc/fluid/advanced_usage/development/new_op/op_notes.md
+++ b/doc/fluid/advanced_usage/development/new_op/op_notes.md
@@ -4,9 +4,9 @@
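 ### 1. How Ops are built in Fluid
 All Ops in Fluid inherit from `OperatorBase`, and all Ops are stateless. Each Op has only four member variables: type, inputs, outputs, and attribute.

-The core method of an Op is Run. Run needs two kinds of resources, data resources and compute resources, which are obtained through `Scope` and `Place` respectively. The framework keeps a global `DeviceContextPool` that records the mapping between `Place` and `DeviceContext`: each `Place` has exactly one corresponding `DeviceContext`, and the `DeviceContext` holds the compute resources of the current device. For a GPU, for example, these resources include `cudnn_handle`, `cublas_handle`, `stream`, and so on. All computation inside an Op (data copies, CUDA kernels, etc.) must be performed in the `DeviceContext`.
+The core method of an Op is Run. Run needs two kinds of resources, data resources and compute resources, which are obtained through `Scope` and `Place` respectively. The framework keeps a global `DeviceContextPool` that records the mapping between `Place` and `DeviceContext`: each `Place` has exactly one corresponding `DeviceContext`, and the `DeviceContext` holds the compute resources of the current device. For a GPU, for example, these resources include `cudnn_handle`, `cublas_handle`, `stream`, and so on. **All computation inside an Op (data copies, CUDA kernels, etc.) must be performed in the `DeviceContext`**.

-Fluid is designed to run on a variety of devices and third-party libraries, and the implementation of some Ops may differ depending on the device or library. Fluid therefore introduces OpKernels: one Op can have multiple OpKernels. Such Ops inherit from `OperatorWithKernel`. A representative example is conv, whose OpKernels are `GemmConvKernel`, `CUDNNConvOpKernel`, and `ConvMKLDNNOpKernel`, each available for the double and float data types. `WhileOp` is a representative Op that needs no OpKernel.
+Fluid is designed to run on a variety of devices and third-party libraries, and the implementation of some Ops may differ depending on the device or library. Fluid therefore introduces OpKernels: one Op can have multiple OpKernels. Such Ops inherit from `OperatorWithKernel`. A representative example is conv_op, whose OpKernels are `GemmConvKernel`, `CUDNNConvOpKernel`, and `ConvMKLDNNOpKernel`, each available for the double and float data types. `WhileOp` is a representative Op that needs no OpKernel.

 Operator inheritance diagram: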
 ![op_inheritance_relation_diagram](../../pics/op_inheritance_relation_diagram.png)
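@@ -62,7 +62,7 @@ Operator inheritance diagram:
 <td>InferShapeFN </td>
 <td>Functor </td>
 <td>Infers the Shape of the Output </td>
-<td>Used at compile time and at run time. At compile time it is called from the Python side; if the Op inherits from OperatorWithKernel, at run time it is called when op.run is invoked </td>
+<td>Used at compile time and at run time. At compile time it is called from the Python side; if the Op inherits from OperatorWithKernel, at run time it is called inside op.run </td>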
 </tr>
 <tr>
 <td>OpCreator </td>
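@@ -85,11 +85,12 @@ Operator inheritance diagram:

 **Note:**

-1. For every Op, the first three arguments are required: op_type specifies the name of the op, OperatorBase is the object of the Op, and op_maker_and_checker_maker is the op's maker and the checker for the op's attrs.
+1. For every Op, the first three arguments are required: op_type specifies the name of the op, OperatorBase is the object of the Op, and op_maker_and_checker_maker is the op's maker as well as the checker for the Op's attrs.
 2. If the Op has a backward pass, op_grad_opmaker is required, because backward obtains the backward Op's Maker from the forward Op.
-3. The framework provides a default op_grad_opmaker, `DefaultGradOpDescMaker`. This Maker uses the forward Op's inputs and outputs as the backward Op's inputs, uses the gradients of the forward Op's inputs as the backward Op's outputs, and copies over the forward Op's attributes. **Note:** DefaultGradOpDescMaker uses all of the forward Op's inputs and outputs as inputs of the backward Op, even when some of them are unnecessary, which prevents memory optimization of the unused variables.
+3. The framework provides a default op_grad_opmaker, `DefaultGradOpDescMaker`. This Maker uses the forward Op's inputs and outputs as the backward Op's inputs, uses the gradients of the forward Op's inputs as the backward Op's outputs, and copies over the forward Op's attributes. **Note: DefaultGradOpDescMaker uses all of the forward Op's inputs and outputs as inputs of the backward Op, even when some of them are unnecessary, which prevents memory optimization of the unused variables**.
 4. The framework provides no default op_infer_var_shape method. If the Op has no OpKernel, the user usually needs to add the corresponding op_infer_var_shape method; if the Op has an OpKernel, the `InferShape` method of `OperatorWithKernel` must be implemented, and op_infer_var_shape is not needed. For implementations, see [while_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/controlflow/while_op.cc) and [conv_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/conv_op.cc).
-5. The framework provides no default op_infer_var_type method; the user needs to add op_infer_var_type as appropriate. Strictly speaking, every Op should register an InferVarType; op_infer_var_type infers the type and dtype of the output Var from the type and dtype of the input Var. **Note:** the Variable returned by create_variable_for_type_inference in the Python-side LayerHelper holds a LoDTensor; the C++-side InferVarType can modify the `Variable`'s type and dtype.
+5. The framework provides no default op_infer_var_type method; the user needs to add op_infer_var_type as appropriate. Strictly speaking, every Op should register an InferVarType; op_infer_var_type infers the type and dtype of the output Var from the type and dtype of the input Var. **Note: the Variable returned by create_variable_for_type_inference in the Python-side LayerHelper holds a LoDTensor; the C++-side InferVarType can modify the `Variable`'s type and dtype**.
+


 For more, see [How to write a new Op](new_op.html)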
@@ -119,7 +120,60 @@ ShareDataWith的功能是使两个Tensor共享底层buffer，在调用这个操
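 Currently, when sparse gradients are used for an update, the gradients are first merged, i.e. gradients for the same parameter are accumulated, and then the parameter and its auxiliary parameters (such as velocity) are updated.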
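
The merge-then-update order can be sketched as follows. This is a toy model under stated assumptions, not the framework's kernel: each parameter row is a single float, the optimizer is plain momentum, and all function names and hyperparameters are hypothetical.

```python
from collections import defaultdict


def merge_sparse_grad(rows, values):
    """Accumulate gradient entries that point at the same parameter row."""
    merged = defaultdict(float)
    for row, value in zip(rows, values):
        merged[row] += value
    return dict(merged)


def momentum_update(param, velocity, merged_grad, lr=0.1, mu=0.9):
    """Update only the rows that received a gradient, in place."""
    for row, grad in merged_grad.items():
        velocity[row] = mu * velocity[row] + grad
        param[row] -= lr * velocity[row]


param = [1.0, 1.0, 1.0, 1.0]
velocity = [0.0, 0.0, 0.0, 0.0]
# Row 2 appears twice, so its two gradients are summed before the update.
merged = merge_sparse_grad(rows=[2, 0, 2], values=[0.5, 1.0, 0.25])
momentum_update(param, velocity, merged)
```

Merging first means each touched row, together with its velocity entry, is updated exactly once per step.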

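 ### 7. GPU memory optimization
-If the backward Op does not need all inputs and outputs of the forward op as its inputs, do not use `DefaultGradOpDescMaker`, because it prevents memory/GPU memory optimization of the unused variables.
+A backward Op usually depends on some inputs (Input) and outputs (Output) of the forward Op for its computation. In some cases, however, the backward Op does not need all of the forward Op's inputs and outputs; in some cases it needs only part of them; and in some cases it needs only the Shape and LoD information of the forward Op's input and output variables. If the Op developer registers unnecessary forward inputs and outputs as inputs of the backward Op, that GPU memory cannot be reclaimed by the framework's existing memory optimization strategies, and the model ends up using too much GPU memory.
+
+Keep the following points in mind when registering a backward Op:
+
+- Fluid's `DefaultGradOpDescMaker` by default uses all inputs (`Input`) and outputs (`Output`) of the forward op, plus the gradients of the output variables (`Output@Grad`), as inputs of the backward Op, and uses the gradients of the forward Op's inputs (`Input@Grad`) as the backward Op's outputs. When using `DefaultGradOpDescMaker`, consider whether some of these variables are never used in the computation.
+- If `DefaultGradOpDescMaker` does not meet your needs, build a `GradOpDescMaker` by hand; see the [related documentation](new_op.html#permalink-4--gradprotomaker-) for details.
+- If a backward Op depends on the Shape or LoD of a forward Op's input or output variable, but not on the Buffer of the Tensor in that variable, and the Shape and LoD cannot be inferred from other variables, register `NoNeedBufferVarsInference` for that variable (called `X` below) on the backward Op. **Once `NoNeedBufferVarsInference` is registered, the backward op can no longer read or write the buffer of the corresponding Tensor and may only call the Tensor's dims() and lod() methods. In addition, `GetExpectedKernelType()` must be overridden in the backward Op, and it must not call the type() method of the Tensor in variable `X`**. For example, `SliceOpGrad` uses only the Shape of the variable in `Input`, so `Input` needs to be registered on `SliceOpGrad`: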
+```
+namespace paddle {
+namespace operators {
+// ...
+class SliceOpGrad : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+  void InferShape(framework::InferShapeContext* ctx) const override {
+    // ... 
+  }
+
+  framework::OpKernelType GetExpectedKernelType(
+      const framework::ExecutionContext& ctx) const override {
+    // Note: don't get data type from ctx.Input<framework::Tensor>("Input");   
+    auto dtype = ctx.Input<framework::Tensor>(framework::GradVarName("Out"))->type();    
+    return framework::OpKernelType( dtype, ctx.GetPlace());
+  }
+};
+
+
+class SliceOpGradMaker : public framework::SingleGradOpDescMaker {
+ public:
+  using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
+
+ protected:
+  std::unique_ptr<framework::OpDesc> Apply() const override {
+    auto* bind = new framework::OpDesc();
+    bind->SetInput("Input", Input("Input"));
+    bind->SetInput(framework::GradVarName("Out"), OutputGrad("Out"));
+    bind->SetOutput(framework::GradVarName("Input"), InputGrad("Input"));
+    bind->SetAttrMap(Attrs());
+    bind->SetType("slice_grad");
+    return std::unique_ptr<framework::OpDesc>(bind);
+  }
+};
+
+DECLARE_NO_NEED_BUFFER_VARS_INFERENCE(SliceOpGradNoNeedBufferVarsInference,
+                                      "Input");
+}  // namespace operators
+}  // namespace paddle
+namespace ops = paddle::operators;
+REGISTER_OPERATOR(slice, ops::SliceOp, ops::SliceOpMaker,
+                  ops::SliceOpGradMaker);
+REGISTER_OPERATOR(slice_grad, ops::SliceOpGrad,
+                  ops::SliceOpGradNoNeedBufferVarsInference);
+```
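
To make the memory effect concrete, here is a toy Python model of which forward variables a backward Op keeps alive. It is only a sketch under stated assumptions: the function names are hypothetical, variables are plain strings, and "releasable" simply means that the backward Op never reads the buffer. The framework's real garbage collection is more involved.

```python
def default_grad_op_inputs(forward_inputs, forward_outputs):
    """Mimic the default maker: all inputs, all outputs, and output gradients."""
    grads = [name + "@GRAD" for name in forward_outputs]
    return list(forward_inputs) + list(forward_outputs) + grads


def releasable_after_forward(forward_inputs, forward_outputs, grad_inputs,
                             no_need_buffer=()):
    """Forward variables whose buffers the backward Op never reads."""
    held = set(grad_inputs) - set(no_need_buffer)
    forward_vars = list(forward_inputs) + list(forward_outputs)
    return [name for name in forward_vars if name not in held]


slice_inputs, slice_outputs = ["Input"], ["Out"]
default_inputs = default_grad_op_inputs(slice_inputs, slice_outputs)
custom_inputs = ["Input", "Out@GRAD"]  # the inputs set in SliceOpGradMaker above

with_default = releasable_after_forward(slice_inputs, slice_outputs, default_inputs)
with_custom = releasable_after_forward(slice_inputs, slice_outputs, custom_inputs)
with_no_buffer = releasable_after_forward(
    slice_inputs, slice_outputs, custom_inputs, no_need_buffer=("Input",))
```

In this model the default maker keeps every forward buffer alive, the hand-written maker frees `Out`, and marking `Input` as not needing its buffer frees `Input` as well.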

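 ### 8. Mixed-device calls
 Because the GPU executes asynchronously, the GPU may not actually have run yet when the CPU call returns. If an Op creates a temporary variable that the GPU needs at run time, that variable may already have been freed on the CPU side by the time the GPU starts running, which can lead to incorrect GPU results.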
@@ -137,7 +191,7 @@ The following device operations are asynchronous with respect to the host:
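 Notes on cudaMemcpy and cudaMemcpyAsync:

 - If data is transferred from the GPU to non-page-locked CPU memory, the transfer is synchronous even if an asynchronous copy is called.
- When data is transferred from the CPU to the CPU, the transfer is synchronous even if an asynchronous copy is called.
+- If data is transferred from the CPU to the CPU, the transfer is synchronous even if an asynchronous copy is called.

 For more, see [Asynchronous Concurrent Execution](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#asynchronous-concurrent-execution) and [API synchronization behavior](https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior)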

@@ -172,7 +226,10 @@ Enforce提示信息不能为空，并且需要写明，因为报错信息可以

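 **Note:** Always preview the formulas before merging into the develop branch. See [dynamic_lstmp](http://paddlepaddle.org/documentation/docs/zh/1.1/api/layers.html#dynamic-lstmp) for reference.

-### 3. Order of arguments in the Python-side Op interface
+### 3. Op variable names must follow the naming convention
+When defining an Op, the names of its inputs, outputs, and attributes must follow the convention. For the naming rules, see [`name_convention`](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/name_convention.md).
+
+### 4. Order of arguments in the Python-side Op interface
 Arguments of a Python API are generally ordered by importance. Take fc as an example: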
 ```
 def fc(input,

--- a/doc/fluid/api_cn/layers_cn.rst
+++ b/doc/fluid/api_cn/layers_cn.rst
@@ -12939,10 +12939,10 @@ yolov3_loss
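 As for the confidence score, it is the logistic regression value of the IoU between an anchor box and the ground-truth box. The score of an anchor box is at most 1, which is reached when the anchor box has the largest IoU.
 If the IoU of an anchor box is greater than the ignore threshold ignore_thresh, the confidence score loss of that anchor box is ignored.
          
-Therefore, the yolov3 loss consists of three major parts: box location loss, confidence score loss, and classification loss. The L1 loss is used for
-box coordinates (w, h), while sigmoid cross-entropy loss is used for box coordinates (x, y), confidence score loss, and classification loss.
+Therefore, the yolov3 loss consists of three major parts: box location loss, objectness loss, and classification loss. The L1 loss is used for
+box coordinates (w, h), while sigmoid cross-entropy loss is used for box coordinates (x, y), objectness loss, and classification loss.
          
-Each ground-truth box finds the best-matching anchor among all anchors. The prediction for that anchor incurs all three losses, but predictions for anchors that match no GT box (ground truth box) incur only the object loss.
+Each ground-truth box finds the best-matching anchor among all anchors. The prediction for that anchor incurs all three losses, but predictions for anchors that match no GT box (ground truth box) incur only the objectness loss.

 To balance the box coordinate loss between large and small boxes, the box coordinate loss is multiplied by a scale weight, namely: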

@@ -12965,7 +12965,7 @@ yolov3_loss

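 Parameters:
    - **x**  (Variable) – The input tensor of the YOLOv3 loss operator, a 4-D tensor of shape [N, C, H, W]. H and W should be equal. The second dimension (C) stores the box locations, the confidence score, and the one-hot classification of each anchor box
-    - **gt_box**  (Variable) – Ground-truth boxes, of shape [N, B, 4]. The third dimension holds x, y, w, h, which should be relative to the input image. N is the batch size and B is the maximum number of boxes in an image
+    - **gt_box**  (Variable) – Ground-truth boxes, of shape [N, B, 4]. The third dimension holds x, y, w, h, where x and y are the center coordinates of the ground-truth box and w and h are its width and height; x, y, w, h are divided by the input image size and scaled into the [0,1] range. N is the batch size and B is the maximum number of boxes in an image
    - **gt_label**  (Variable) – Class ids of the ground-truth boxes, of shape [N, B].
    - **anchors**  (list|tuple) – The widths and heights of the anchor boxes, parsed pair by pair
    - **anchor_mask**  (list|tuple) – The mask indices of the anchors used in the current YOLOv3 loss computation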
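
The `gt_box` layout described above (center coordinates plus width and height, each divided by the image size) can be produced from a pixel-space corner box as in the sketch below. The helper name and its corner-format input are assumptions for illustration; they are not part of the yolov3_loss API.

```python
def to_yolo_gt_box(x_min, y_min, x_max, y_max, image_width, image_height):
    """Convert a pixel corner box to normalized (x, y, w, h) with a center origin."""
    w = x_max - x_min
    h = y_max - y_min
    cx = x_min + w / 2.0
    cy = y_min + h / 2.0
    # Dividing by the image size scales every component into [0, 1].
    return (cx / image_width, cy / image_height, w / image_width, h / image_height)


box = to_yolo_gt_box(100, 50, 300, 250, image_width=400, image_height=500)
```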

--- a/doc/fluid/api_guides/low_level/layers/learning_rate_scheduler_en.rst
+++ b/doc/fluid/api_guides/low_level/layers/learning_rate_scheduler_en.rst
@@ -29,4 +29,8 @@ The following content describes the APIs related to the learning rate scheduler:

 * :code:`piecewise_decay`: Piecewise decay. That is, the stair-like decay for a given number of steps, the learning rate stays the same within each step. For related API Reference please refer to :ref:`api_fluid_layers_piecewise_decay`

-* :code:`append_LARS`: The learning rate is obtained by the Layer-wise Adaptive Rate Scaling algorithm. For related algorithms, please refer to `Train Feed forward Neural Network with Layerwise Adaptive Rate via Approximating Back-matching Propagation <https://arxiv. Org/abs/1802.09750>`_ . For related API Reference please refer to :ref:`api_fluid_layers_append_LARS`
\ No newline at end of file
+* :code:`append_LARS`: The learning rate is obtained by the Layer-wise Adaptive Rate Scaling algorithm. For related algorithms, please refer to `Train Feed forward Neural Network with Layerwise Adaptive Rate via Approximating Back-matching Propagation <https://arxiv.org/abs/1802.09750>`_ . For related API Reference please refer to :ref:`api_fluid_layers_append_LARS`
+
+* :code:`cosine_decay`: Cosine decay. The learning rate changes with the number of steps following a cosine function. For related API Reference please refer to :ref:`api_fluid_layers_cosine_decay`
+
+* :code:`linear_lr_warmup`: The learning rate increases linearly with the number of steps until it reaches a specified rate. For related API Reference please refer to :ref:`api_fluid_layers_linear_lr_warmup`
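
The two schedules added above can be written out in plain Python to show their shape. This is a sketch based only on the descriptions in this list; the function signatures and the per-epoch cosine form are assumptions and may not match the library's exact formula.

```python
import math


def cosine_decay(base_lr, global_step, step_each_epoch, epochs):
    """Learning rate follows half a cosine wave from base_lr down to 0."""
    epoch = global_step // step_each_epoch
    return base_lr * 0.5 * (math.cos(epoch * math.pi / epochs) + 1.0)


def linear_lr_warmup(target_lr, global_step, warmup_steps, start_lr, end_lr):
    """Ramp linearly from start_lr to end_lr, then hand over to target_lr."""
    if global_step < warmup_steps:
        return start_lr + (end_lr - start_lr) * global_step / warmup_steps
    return target_lr


lr_first = cosine_decay(0.1, global_step=0, step_each_epoch=100, epochs=10)
lr_middle = cosine_decay(0.1, global_step=500, step_each_epoch=100, epochs=10)
lr_warm = linear_lr_warmup(0.1, global_step=50, warmup_steps=100,
                           start_lr=0.0, end_lr=0.1)
```

In practice `target_lr` would itself come from another schedule, such as the cosine decay above, once warmup ends.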
--- a/doc/fluid/beginners_guide/install/Tables.md
+++ b/doc/fluid/beginners_guide/install/Tables.md
--- a/doc/fluid/beginners_guide/install/Tables_en.md
+++ b/doc/fluid/beginners_guide/install/Tables_en.md
--- a/doc/fluid/beginners_guide/install/compile/compile_CentOS.md
+++ b/doc/fluid/beginners_guide/install/compile/compile_CentOS.md
@@ -200,7 +200,10 @@
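 Congratulations, you have now finished compiling and installing PaddlePaddle

 ## ***Verify the installation***
-After installation, you can enter the Python interpreter with `python` or `python3` and then use `import paddle.fluid` to verify that the installation succeeded.
+After installation, enter the Python interpreter with `python` or `python3`, type `import paddle.fluid as fluid`, and then type
+ `fluid.install_check.run_check()`
+
+If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.

 ## ***How to uninstall***
 Use the following command to uninstall PaddlePaddle: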

--- a/doc/fluid/beginners_guide/install/compile/compile_MacOS.md
+++ b/doc/fluid/beginners_guide/install/compile/compile_MacOS.md
@@ -204,7 +204,10 @@
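 Congratulations, you have now finished compiling and installing PaddlePaddle

 ## ***Verify the installation***
-After installation, you can enter the Python interpreter with `python` or `python3` and then use `import paddle.fluid` to verify that the installation succeeded.
+After installation, enter the Python interpreter with `python` or `python3`, type `import paddle.fluid as fluid`, and then type
+ `fluid.install_check.run_check()`
+
+If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.

 ## ***How to uninstall***
 Use the following command to uninstall PaddlePaddle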

--- a/doc/fluid/beginners_guide/install/compile/compile_Ubuntu.md
+++ b/doc/fluid/beginners_guide/install/compile/compile_Ubuntu.md
@@ -209,7 +209,10 @@
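 Congratulations, you have now finished compiling and installing PaddlePaddle

 ## ***Verify the installation***
-After installation, you can enter the Python interpreter with `python` or `python3` and then use `import paddle.fluid` to verify that the installation succeeded.
+After installation, enter the Python interpreter with `python` or `python3`, type `import paddle.fluid as fluid`, and then type
+ `fluid.install_check.run_check()`
+
+If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.

 ## ***How to uninstall***
 Use the following command to uninstall PaddlePaddle: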

--- a/doc/fluid/beginners_guide/install/compile/compile_Windows.md
+++ b/doc/fluid/beginners_guide/install/compile/compile_Windows.md
@@ -102,7 +102,10 @@
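 Congratulations, you have now finished compiling and installing PaddlePaddle

 ## ***Verify the installation***
-After installation, you can enter the Python interpreter with `python` or `python3` and then run `import paddle.fluid`. If no error is reported, the installation succeeded.
+After installation, enter the Python interpreter with `python` or `python3`, type `import paddle.fluid as fluid`, and then type
+ `fluid.install_check.run_check()`
+
+If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.

 ## ***How to uninstall***
 Use the following command to uninstall PaddlePaddle: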

--- a/doc/fluid/beginners_guide/install/install_CentOS.md
+++ b/doc/fluid/beginners_guide/install/install_CentOS.md
@@ -51,7 +51,8 @@ There are 4 ways to install on CentOS:

 <a name="check"></a>
 ## ***Verify the installation***
-After installation, enter the Python interpreter with `python` or `python3`, type `import paddle.fluid`, and then type `fluid.install_check.run_check()`
+After installation, enter the Python interpreter with `python` or `python3`, type `import paddle.fluid as fluid`, and then type
+ `fluid.install_check.run_check()`

 If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.


--- a/doc/fluid/beginners_guide/install/install_Docker.md
+++ b/doc/fluid/beginners_guide/install/install_Docker.md
@@ -4,6 +4,8 @@

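 ## Environment preparation

+- For the currently supported system types, see the [installation guide](./index_cn.html). Note that using Docker on CentOS 6 is not currently supported

 - [Install Docker](https://hub.docker.com/search/?type=edition&offering=community) on the local host

 - To enable GPU support on Linux, [install nvidia-docker](https://github.com/NVIDIA/nvidia-docker)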

--- a/doc/fluid/beginners_guide/install/install_MacOS.md
+++ b/doc/fluid/beginners_guide/install/install_MacOS.md
@@ -41,7 +41,8 @@ There are 4 ways to install on MacOS:

 <a name="check"></a>
 ## Verify the installation
-After installation, enter the Python interpreter with `python` or `python3`, type `import paddle.fluid`, and then type `fluid.install_check.run_check()`
+After installation, enter the Python interpreter with `python` or `python3`, type `import paddle.fluid as fluid`, and then type
+ `fluid.install_check.run_check()`

 If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.


--- a/doc/fluid/beginners_guide/install/install_Ubuntu.md
+++ b/doc/fluid/beginners_guide/install/install_Ubuntu.md
@@ -51,7 +51,8 @@ There are 4 ways to install on Ubuntu:

 <a name="check"></a>
 ## Verify the installation
-After installation, enter the Python interpreter with `python` or `python3`, type `import paddle.fluid`, and then type `fluid.install_check.run_check()`
+After installation, enter the Python interpreter with `python` or `python3`, type `import paddle.fluid as fluid`, and then type
+ `fluid.install_check.run_check()`

 If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.


--- a/doc/fluid/beginners_guide/install/install_Windows.md
+++ b/doc/fluid/beginners_guide/install/install_Windows.md
@@ -46,7 +46,8 @@ There are 3 ways to install on Windows:

 <a name="check"></a>
 ## Verify the installation
-After installation, enter the Python interpreter with `python` or `python3`, type `import paddle.fluid`, and then type `fluid.install_check.run_check()`
+After installation, enter the Python interpreter with `python` or `python3`, type `import paddle.fluid as fluid`, and then type
+ `fluid.install_check.run_check()`

 If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.


--- a/doc/fluid/index_cn.rst
+++ b/doc/fluid/index_cn.rst
@@ -15,4 +15,4 @@
    user_guides/index_cn.rst
    advanced_usage/index_cn.rst
    api_cn/index_cn.rst
-    release_note_cn.rst
+    release_note_cn.md
--- a/doc/fluid/index_en.rst
+++ b/doc/fluid/index_en.rst
@@ -8,5 +8,6 @@
  user_guides/index_en.rst
  advanced_usage/index_en.rst
  api/index_en.rst
+  release_note_en.md


--- a/doc/fluid/release_note_cn.md
+++ b/doc/fluid/release_note_cn.md
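+# Release Notes
+
+## Contents
+
+* Highlights
+* Basic Framework
+    * Installation
+    * Optimization of the intermediate representation (IR) and Passes
+    * IO optimization
+    * Execution optimization
+    * GPU memory optimization
+    * Improved CPU JITKernel
+    * Low-level computation optimization on Intel CPUs
+    * Integration of the Intel nGraph graph compilation engine
+    * Enhancements to basic framework functions
+    * Improved basic functions of the dynamic graph preview
+* Inference Engine
+    * Server-side inference engine
+    * Mobile inference engine
+    * Deployment tools
+* Distributed Training
+* Model Construction
+    * PaddleCV intelligent vision
+    * PaddleNLP intelligent text processing
+    * PaddleRec intelligent recommendation
+* Tools and Components
+* Bug Fixes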
+
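+## Highlights
+
+* The basic framework has been comprehensively optimized for training speed and GPU memory usage, and fully supports quantization training. Intel nGraph is preliminarily integrated, and the basic single-machine, single-card functions of the dynamic graph preview are complete.
+* Officially released the model compression toolkit [PaddleSlim](https://github.com/PaddlePaddle/models/tree/develop/PaddleSlim) and the model inference service [Paddle Serving](https://github.com/PaddlePaddle/Serving), broadly improving PaddlePaddle's deployment capabilities.
+* Optimized distributed IO and added streaming reads from remote file systems. Multi-machine, multi-card synchronous GPU training gains sparse communication, which makes training less sensitive to bandwidth; on low-bandwidth networks, for example a 10G network, synchronous training can be sped up 10 times.
+* Better support for the K8S ecosystem: Paddle-K8S-Operator support for industrial production environments; Kubeflow supports paddle-job.
+* Officially released the [video recognition toolkit](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/video), covering mainstream video classification models, including Nonlocal, TSM, Attention Cluster, NeXtVLAD, LSTM, StNet, and TSN.
+* Added the Chinese semantic representation model [ERNIE](https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE), which improves accuracy over BERT by 1-2 percentage points (absolute) on several Chinese tasks. Added DGU, a model for general dialogue understanding, which supports 5 types of dialogue tasks and reaches SOTA results on 3 public datasets.
+* Added a recommendation model based on graph neural networks [(Graph Neural Network)](https://github.com/PaddlePaddle/models/tree/develop/PaddleRec/gnn), with benchmark results on public datasets.
+* Officially released [PaddleHub](https://github.com/PaddlePaddle/PaddleHub), a management tool for pre-trained models that provides three main functions: pre-trained model management, one-command use from the command line, and transfer learning. It aims to help users manage models and carry out transfer learning more efficiently.
+* Officially released the [X2Paddle model conversion tool](https://github.com/PaddlePaddle/X2Paddle), which lets users migrate inference models from other deep learning frameworks to PaddlePaddle without loss.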
+
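+## Basic Framework
+
+* Installation
+    * Added the install\_check.run\_check() interface, which provides a more thorough check of whether the installation succeeded.
+* Optimization of the intermediate representation (IR) and Passes
+    * Wrapped IrGraph, IrNode, IrVarNode, and IrOpNode, so that IR Passes can be written in Python.
+* IO optimization
+    * PyReader interface optimization: the new interface reader = fluid.io.PyReader (..., iterable=True, ...) creates a reader that can be iterated in a for loop, and the data is sent to the network for training through feed.
+* Execution optimization
+    * Users can set the places argument of with\_data\_parallel to run on specific GPU cards, which allows multiple training tasks to run in a single process.
+    * Optimized the scheduling strategy of the multi-card executor; speedups of 8%~19% were verified on the ResNet50 and Transformer models.
+    * With multiple cards, AllReduce can be fused by groups. Multi-card speed improves by 8%~30% on the ResNet model (the gain varies with the number of cards) and by about 4% on the Transformer model.
+* GPU memory optimization
+    * GC strategy optimization: the Eager Deletion strategy supports timely deletion of variables inside while\_op. A partial Eager Deletion strategy is also supported: users can set FLAGS\_memory\_fraction\_of\_eager\_deletion=0.xx to control the percentage of memory/GPU memory that is deleted immediately.
+    * Op optimization: optimized the backward registration of ops such as cross entropy, expand, layer\_norm, and dropout, removing dependencies on irrelevant variables and improving the framework's GPU memory performance.
+    * Added two FLAGS (FLAGS\_initial\_gpu\_memory\_in\_mb and FLAGS\_reallocate\_gpu\_memory\_in\_mb) that let users specify the initial and the reallocated GPU memory pool sizes.
+    * Adjusted the inplace\_op\_pass strategy to increase the coverage of the inplace strategy.
+    * Removed the logic that performed activation op inplace optimization on the Python side; it is now unified in inplace\_op\_pass.
+    * Added a Memory Profile feature.
+* Improved CPU JITKernel
+    * Optimized how JITKernel is called, adding a Cache mechanism and an interface that returns all functions of the same type, so that developers can choose which to call depending on the situation.
+    * Used JITKernel to optimize the SGD algorithm: on the PyramidDNN model the corresponding OP is 44% faster and overall training is 12% faster. Used JITKernel to optimize fused\_embedding\_seq\_pool: on the PyramidDNN model the backward operator of the corresponding op is 18% faster and overall training is 6% faster.
+* Low-level computation optimization on Intel CPUs
+    * Upgraded MKLDNN to 18, which includes several performance enhancements (such as GEMM-based convolution and INT8 convolution).
+    * Used MKL to optimize the GELU OP; the OP is now 3 times as fast as before.
+    * Strengthened the unit tests of the MKLDNN-related Kernels.
+* Integrated the Intel nGraph graph compilation engine, which makes it easier for PaddlePaddle to support more hardware backends
+    * The ngraph\_engine OP hands subgraphs to the nGraph core, which optimizes the graph and schedules it for execution on the CPU. Setting the environment variable FLAGS\_use\_ngraph=true enables nGraph at run time.
+    * Supports training and inference of the ResNet50 model on CPU. Compared with direct MKLDNN-based optimization, both inference and training performance of ResNet50 on CPU improve significantly.
+* Enhancements to basic framework functions
+    * Supports synchronized Batch Norm; supports setting axis for softmax; added the spectral norm, rang, acos, asin, and atanh operations; added Npair Loss for feature learning.
+    * Added the cosine\_decay learning rate schedule to the framework.
+    * Added sampled\_softmax\_with\_cross\_entropy to improve training efficiency with large vocabularies.
+    * Supports fusing the SGD and Adam optimization algorithms, which speeds up the Transformer model by 2% and the Cycle GAN model by 6%.
+    * Enhanced lstmp to support clipping inside the cell and initializing the cell state and hidden state.
+    * Enhanced adagrad to support initializing the accumulated moment.
+    * Tensors support \_\_getitem\_\_-style operations.
+    * Added QuantizationFreezePass, ConvertToInt8Pass, and TransformForMobilePass. Both dynamic and static quantization training, and saving of the corresponding models, are fully supported.
+* Improved basic functions of the dynamic graph preview:
+    * Basic functions: supports LRDecay, and overall supports model training and evaluation on a single GPU card and on a single CPU machine.
+    * API: exposed the basic dynamic graph interfaces, refactored the existing Layers, and added support for Layers such as GRU, LayerNorm, NCE, and PRelu.
+    * Performance: verified to be roughly on par with the static graph on the Resnet and Mnist models.
+    * Added dynamic graph implementations of models such as Transformer, MNIST, and Se-Resnext.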
+
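The cosine\_decay learning-rate strategy listed above can be sketched in plain Python. This is a minimal illustration, not Paddle's implementation: it assumes the usual half-cosine schedule, and it takes `global_step` as an explicit argument, whereas a framework would normally track the step counter internally. The parameter names `step_each_epoch` and `epochs` are assumptions chosen for readability.

```python
import math


def cosine_decay(learning_rate, step_each_epoch, epochs, global_step):
    """Return the decayed learning rate for a given global step.

    Assumed formula: lr * 0.5 * (cos(epoch * pi / epochs) + 1),
    where epoch is the number of completed epochs.
    """
    current_epoch = global_step // step_each_epoch
    return learning_rate * 0.5 * (math.cos(current_epoch * math.pi / epochs) + 1)


# Learning rate at the start, middle, and end of a 10-epoch run.
schedule = [cosine_decay(0.1, 100, 10, step) for step in (0, 500, 1000)]
```

The rate starts at the base value, falls to half of it at the midpoint, and reaches zero at the final epoch.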
+## 预测引擎
+
+### 服务器预测
+
+* 预测库整合PaddlePaddle/Anakin，统一接口提供高效预测能力。
+    * 支持Anakin GPU子图和CPU子图。
+    * Python预测接口支持Anakin子图。
+    * Resnet、VGG、Googlenet、Mobilenet、ShuffleNet、Faster RCNN、Yolo、SSD等模型实现显著预测加速。
+* 预测框架优化，小模型预测速度提升明显
+    * 增加runtime\_context\_cache\_pass，重点模型提升17%。
+    * 优化5个OP的infershape，重点模型提升13%。
+    * 完善ZeroCopy接口，避免使用AnalysisPredictor 时存在多余CPU拷贝。
+* INT8 量化预测持续加强
+    * 进一步完善通过TensorRT 支持INT8 量化，支持Alexnet、Googlenet、Vgg、Mobilenet、ShuffleNet等模型。优化调用TensorRT下的信息序列化反序列化，加快模型初始化速度。
+    * 实现基于C++ Pass的INT8量化框架。增加若干INT8 OP Kernel：Transpose、Concat、Requantize。通过微调MkldnnQuantizerConfig中的量化策略，用户可快速得到符合精度要求的INT8量化模型。INT8量化后的ResNet-50 / MobileNet v1模型，相比原始FP32模型，性能分别提升至7倍 / 3.0倍（在支持AVX512-DL Boost指令集的至强 6271服务器上）。
+
+### 移动端预测
+
+* ARM CPU
+    * Paddle-mobile完成矩阵运算库sgemm和sgemv的重构和效率优化，在大部分模型上能获得10%〜100%以上的性能加速。
+    * 新增while、sequence\_expand、sequence\_pool、sequence\_softmax、gru\_unit、beam\_search和beam\_search\_decode等19个算子，以及对应大量的优化工作，支持attention-based端到端模型的预测。
+    * 新增winograd 的arm v8实现，在iOS上的v8的硬件上能取得更高的预测性能；winograd支持算子融合，保证算子融合后的效率更高。
+    * 新增kernel为3x3的滑窗直接卷积实现，在channel数较少时会比winograd和gemm效率更高。
+    * 完成kernel为3x3的depthwise convolution重构和优化，相比之前版本支持任意的padding、性能更优且计算结果更可靠。
+    * 完成kernel为5x5的depthwise convolution armv8版本的实现，NAS模型的预测效率提升30%以上。
+    * 完成反卷积conv2d\_transpose的效率优化。
+    * 新增基于图优化的精简内存复用策略，大部分模型能降低近50%的内存占用。对于ARM CPU已自动开启（FPGA和GPU暂不支持）。
+* ARM GPU
+    * Paddle-mobile完成kernel为1x1的卷积优化，MobileNet v1在高通Adreno GPU上平均预测性能提升35%。
+* Paddle Inference初步完成和Paddle-mobile、Anakin的接口统一，待进一步深度融合。
+
+### 部署工具
+
+* 模型压缩工具包PaddleSlim
+    * 剪切模型压缩策略：支持敏感度和uniform两种方式，支持vgg、resnet、mobilenet等多种类型的网络，支持用户自定义剪切范围。
+    * 量化训练模型压缩策略：支持动态和静态两种量化训练方式，支持对参数进行分channel量化或整体量化，支持以float类型模拟int8值域保存模型，支持以int8类型保存模型，支持以兼容paddle-mobile的格式保存模型。
+    * 蒸馏模型压缩策略：支持在teacher网络和student网络任意层添加组合loss，支持FSP Loss, L2 Loss, Softmax with Cross-entropy Loss。
+    * 其它功能：支持配置文件管理压缩任务超参数，支持多种压缩策略组合使用，蒸馏和剪切压缩过程支持checkpoints功能。
+
+* Paddle Serving
+
+    * 支持paddle inference远程部署。
+    * 服务端支持用户新增数据处理Operator，支持用户自定义预估逻辑，支持模型热加载功能。
+    * 客户端提供C++ SDK，供业务逻辑进行调用，支持自定义protobuf定制网络数据传输协议，A/B测试能力。
+    * 提供经典任务使用paddle serving的示例模板，包括文本分类，图像分类任务。
+    * 针对文本分类任务，给出延迟和吞吐的benchmark。
+
+## 分布式训练
+
+* 分布式IO优化
+    * Pipe Reader接口优化：在保持数据预处理灵活性的前提下，提供高效IO的方法。支持企业级Linux系统用户定制化，实现高性能IO组件，在离线数据预处理处进行统一维护。增强远程文件系统流式读取能力，支持数据载入内存模式、分布式打乱功能。
+* Executor与分布式IO的整合
+    * AsyncExecutor整合进入Executor，增加train\_from\_dataset/infer\_from\_dataset接口，支持基于Pipe Reader的训练，在保持多队列IO功能的前提下，支持用户自定义PipeLine程序，提供python端灵活处理数据的能力。
+* GPU多机多卡同步训练增加带宽不敏感训练能力
+    * GPU同步训练增加稀疏通信能力，支持sparse all reduce。
+    * 通过通信稀疏度的控制，在算法层面保障模型收敛，并增加DGCOptimizer。
+    * 通过在resnet50 on imagenet上进行实验证明：模型收敛性方面，resnet50 90轮收敛效果不变；在高速互联网络环境下，稀疏通信不会降低训练速度；低配网络带宽网络环境下（例如10G网络），稀疏通信在训练速度上有明显优势，相比稠密通信的同步训练提速10倍。
+* Collective Operator模式
+    * Collective Operator模式的支持，增加GPU下多个all reduce的操作。通过Python API向Program中增加collective op，使得分布式优化算法开发的灵活性显著提升。
+* Resnet50 on Imagenet收敛速度优化
+    * 支持动态BatchSize、动态ImageSize以及矩形crop等方法；FP32精度下，在v100单机8卡验证，收敛速度提升68%(acc1\>=75.9%, acc5=93.0%)。
+* K8S生态支持
+    * Kubeflow支持paddle-job，并贡献到kubeflow社区。
+    * 支持工业生产环境下的Paddle-K8S-Operator，可与kubeflow配合使用。
+    * K8S环境适合新手提交任务的脚本，提供百度云可复现教程。
+
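The sparse communication item above (sparse all reduce, DGCOptimizer) can be illustrated with a toy top-k gradient sparsification step. This sketch assumes that DGC refers to Deep Gradient Compression, where each worker sends only the largest-magnitude gradient entries and accumulates the rest locally. The function name `sparsify` and its arguments are hypothetical and are not Paddle APIs.

```python
def sparsify(gradient, residual, k):
    """Keep the k largest-magnitude entries; carry the rest to the next step.

    Returns (sparse, new_residual): `sparse` maps index -> value to be
    communicated, and `new_residual` holds the locally accumulated remainder.
    """
    accumulated = [g + r for g, r in zip(gradient, residual)]
    order = sorted(range(len(accumulated)),
                   key=lambda i: abs(accumulated[i]), reverse=True)
    keep = set(order[:k])
    sparse = {i: accumulated[i] for i in sorted(keep)}
    new_residual = [0.0 if i in keep else v for i, v in enumerate(accumulated)]
    return sparse, new_residual


# Two consecutive steps: entries skipped in step 1 are sent in step 2.
residual = [0.0, 0.0, 0.0, 0.0]
sent_1, residual = sparsify([0.25, -2.0, 0.5, 1.5], residual, k=2)
sent_2, residual = sparsify([0.25, 0.125, 0.5, 0.0], residual, k=2)
```

Small gradients are delayed rather than dropped, which is what lets the communication volume shrink without discarding updates.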
+## 模型建设
+
+* PaddleCV 智能视觉
+    * 正式发布视频识别工具集，覆盖主流视频分类模型，包括Nonlocal、TSM、Attention Cluster、NeXtVLAD、LSTM、StNet、TSN，效果和主流实现打平。
+    * 新增基于ImageNet的预训练模型: GoogleNet, ShuffleNetV2, ResNet18,ResNet34。
+    * 新增支持目标检测YOLOv3模型，效果与最好公开实现打平（mAP比原作者提高7绝对百分点）。
+    * 发布基于COCO和MPII数据的Simple Baselines人体姿态估计模型，效果和主流实现打平。
+    * 特征学习模型新增npair loss， 在预训练模型（arcmargin loss）的基础上将recall@1提升至03%（+0.78%)。
+* PaddleNLP智能文本处理
+    * 新增支持中文语义表示ELMO模型，支持多卡训练，训练速度比主流实现快1倍。验证在中文词法分析任务上F1值绝对提升1.1%，在中文阅读理解任务上Rouge-L值提升1%。
+    * 新增中文语义表示模型ERNIE，在自然语言推断、语义相似度、命名实体识别、情感分析、问答匹配等中文任务上相对 BERT 中文模型绝对提升了 1% ~ 2% 的精度。
+    * 阅读理解模型升级，优化数据预处理和文档选取，在DuReader验证数据集上Rouge-L提升至65（baseline 39.29)。
+    * 新增基于知识感知的对话模型，对比基线生成对话模型，在F1, BLEU1, BLEU2的指标上平均提升1个百分点。
+    * 发布对话模型工具集，包含Deep Attention Matching Net, 新增对话自动评估工具和基于BERT的对话通用理解相关模型DGU(Dialogue General Understanding)，支持对话语义匹配、DA、DST、槽位解析和意图识别五种对话任务，3个公开数据集达到SOTA 的效果。
+    * 发布PaddleNLP工具包，统一文本分类、文本匹配、序列标注、阅读理解、智能对话等NLP任务的建模，并开放对应的工业级预训练模型。
+* PaddleRec智能推荐
+    * Deep Interest Network（DIN）：新增DIN模型，并在公开数据复现效果，支持cpu和gpu模式下的单机单/多卡训练。DIN适用于推荐中的排序场景（如ctr预估），主要特点为对历史序列建模的过程中结合了预估目标的信息。
+    * Graph Neural Network（GNN）：新增基于session的图神经网络推荐模型，并在公开数据复现效果，支持cpu和gpu模式下的单机单卡训练。该模型适用于推荐中的召回场景，使用GNN对用户的历史信息进行建模，可以捕捉到item序列之间蕴含的更复杂的转换关系。
+    * Word2vec：word2vec采样策略调优，并在公开数据复现效果，添加多机训练支持。
+
+## 工具组件
+
+* 正式发布PaddleHub预训练模型管理工具，旨在帮助用户更高效的管理模型并开展迁移学习的工作
+    * **预训练模型管理** ：通过hub命令行可完成PaddlePaddle生态的预训练模型下载、搜索、版本管理等功能。
+    * **命令行一键使用：** 无需代码，通过命令行即可直接使用预训练模型进行预测，快速调研训练模型效果。目前版本支持以下模型：词法分析LAC；情感分析Senta；目标检测SSD；图像分类ResNet, MobileNet。
+    *  **迁移学习：** 提供了基于预训练模型的Finetune API，用户通过少量代码即可完成迁移学习，包括BERT/ERNIE文本分类、序列标注、图像分类迁移等。
+* 正式发布X2Paddle模型转换工具，可以无损地将其他深度学习框架预测模型迁移至PaddlePaddle。工具还附带TensorFlow, Caffe框架的API详细对比文档，旨在帮助用户更便捷的从其他框架迁移PaddlePaddle。
+
+## BUG修复
+
+* 修复backward时BFS带来的精度不一致的问题
+* 修复optimizer minimize创建多余反向输入
+* 修复Paddle-TRT运行显存占用大的问题
+* 修复AllReduceDepPass中的Bug
+* 修复FastThreadedExecutor中的Bug
+* 修复Reshape、cross\_entropy、arg\_min\_max、recurrent等Op中的bug
+* 修复VarBase构造的问题
+* 修复了若干memory\_optimizer\_pass中的问题与bug：将复用逻辑由\>= 调整为 =，减少了因Variable复用造成的碎片，去掉了memory\_optimize\_pass对BlockDesc的依赖，修复了不同类型的Variable会相互复用的bug
+* 修复python3下使用util.plot报错问题
+* 提升Profiler的稳定性并新增Memory Profile功能
+* 修复C++预测必须在线程内clone，才能使多线程生效的问题
+* 修复一些op在InferShape时对变长shape检查的错误
+* 增加一些op对长度为零的LoD序列输入的支持
+* 修复用recurrent op实现StaticRNN的一些bug
+* 修复动态图dygraph模型checkpoint存储和读取的bug
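
The memory reuse fix in the list above (reuse condition changed from `>=` to `=`) can be illustrated with a toy planner. This is only a sketch of the idea as the release note describes it, since the pass's actual logic is not shown here. `plan_reuse` and its arguments are hypothetical names.

```python
def plan_reuse(free_blocks, requests, exact_only):
    """Assign each requested size to a free block, if one qualifies.

    With exact_only=False a block qualifies when block >= size (old rule);
    with exact_only=True it must match exactly (new rule).
    Returns (assignments, wasted_bytes).
    """
    pool = sorted(free_blocks)
    assignments = []
    wasted = 0
    for size in requests:
        if exact_only:
            candidates = [b for b in pool if b == size]
        else:
            candidates = [b for b in pool if b >= size]
        if candidates:
            block = candidates[0]  # smallest qualifying block
            pool.remove(block)
            wasted += block - size
            assignments.append((size, block))
        else:
            assignments.append((size, None))  # needs a fresh allocation
    return assignments, wasted


loose_plan, loose_waste = plan_reuse([4096], [512, 4096], exact_only=False)
exact_plan, exact_waste = plan_reuse([4096], [512, 4096], exact_only=True)
```

Under the old rule the 512-byte variable occupies the 4096-byte block and leaves most of it unused; under the exact-match rule the block stays available for the variable that fits it.
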
--- a/doc/fluid/release_note_cn.rst
+++ b/doc/fluid/release_note_cn.rst
-==============
-版本说明
-==============
-
-Paddle Fluid v1.3
-##########################
-
-重要更新
-=========
-* 统一Executor和ParallelExecutor接口，用户只需通过CompiledProgram将单卡模型转化多卡模型，并利用Executor进行训练或者预测。
-* 正式发布AnalysisConfig 预测接口，支持计算图分析、算子融合等优化，并支持利用 Intel MKLDNN、Nvidia TensorRT 子图引擎等第三方库的加速.
-* 模型库新增发布PaddlePaddle视频模型库，提供5个视频分类经典模型以及适合视频分类任务的通用骨架代码，用户可一键式高效配置模型完成训练和评测。
-* 新增支持NLP语义表示BERT模型，支持多机多卡训练，支持混合精度训练，训练速度对比主流实现提升50%+，提供完整部署示例。
-* 大规模稀疏参数服务器Benchmark发布， CPU多机异步训练发布显著提升点击率预估任务IO吞吐的built-in reader，多机多卡训练性能多方面提升。
-
-基础框架
-==========
-* 安装
-	* 新增Linux和MacOS下的中文版本辅助安装脚本，提供交互式安装方式，协助用户在复杂环境下快速完成PaddlePaddle安装。
-	* Windows支持优化：新增cuda8，cudnn7的GPU支持，新增AVX指令集、MKLDNN、mnist数据集支持。修复Windows加载Linux/Mac下同版本paddle训练模型的问题。
-* 增加动态图基础功能
-	* 动态图tracer、 autograd、python Layer/PyLayer，动态图支持MLP、GAN、ptbRNN、Resnet模型，动态图支持Optimizer、GPU训练。
-* Executor和ParallelExecutor接口优化
-	* 对Executor和ParallelExecutor接口进行统一，用户只需通过CompiledProgram将单卡模型转化多卡模型，并利用Executor进行训练或者预测。
-	* ParallelExecutor优化
-		对MultiDevSSAGraphBuilder进行重构，使得MultiDevSSAGraphBuilder更易扩展。
-		去除ParallelExecutor中的设备锁，提升ParallelExecutor多卡调度性能。
-* 中间表达IR和Pass方面的优化
-	* 完善C++ IR graph的python接口以及C++ IR pass的python接口。
-	* 在framework.py中新增IRGraph类，为在Python层编写IR Pass做准备。
-	* 新增支持网络无锁更新的Pass。
-	* 新增QuantizationTransformPass，此为Quantization Aware Training量化模式训练前的图修改操作部分。
-* 内存和显存方面的优化
-	* 新增支持在编译时加入 Jemalloc 作为动态链接库，提升内存管理的性能，降低基础框架内存管理开销
-	* 新增memory optimize，inplace pass, memory pool early deletion等显存优化策略。
-	* 新增支持网络无锁更新的Pass。
-	* 新增QuantizationTransformPass，此为Quantization Aware Training量化模式训练前的图修改操作部分。
-* Operator整体层面的优化
-	* 每个op在执行前只做一次scope查询，减少读写锁操作（原来需要做1~5次scope查询）
-	* 新增Temporary Allocator，减少op中的同步操作
-	* 新增py_func operator，支持python op接入，用户可以借助py_func Operator快速实现所需要的特有操作
-* 重构DDim，Variable Type等，降低基础框架调度开销。
-* INTEL FP32计算相关优化
-	* 优化density_prior_box operator，单op四线程提速3倍。
-	* 优化Stack operator，单op提速16倍。
-	* 开发Transpose，Concat和Conv3d三个基于MKLDNN的kernel。
-	* 修复lrn operator中MKLDNN kernel精度bug，同时单op提速1.3倍。
-	* 修复MKLDNN初始化占用5G内存的问题，目前初始化占用500MB。
-	* 减少从MKLDNN OP kernel到非MKLDNN OP kernel时不必要的reorder。
-* 完善CPU JitKernel
-	* sequence pooling 的jitkernel，纯op提升2倍。
-	* softmax 的jitkernel，纯op提升2倍，同时使得Bert模型CPU预测提升26%。
-	* 常见的基本逻辑：向量的每个元素求平方kVSquare、矩阵乘法kMatMul、向量的最大值kHMax、向量所有元素的和kHSum。
-
-预测引擎
-==========
-
-服务器预测
-+++++++++++
-* 正式发布AnalysisConfig 预测接口，支持计算图分析、算子融合等优化，并支持利用 Intel MKLDNN、Nvidia TensorRT 子图引擎等第三方库的加速。
-* 预发布 intel CPU上的 预测 INT8 离线量化方案
-	* 开发Conv2D，Pool2D，Quantize，Dequantize四个基于MKL-DNN的INT8 kernel。
-	* 预发布Calibration的3个核心Python API（paddle.fluid.contrib.Calibrator）。
-	* 开发Calibration工具，保证FP32和INT8的精度在ResNet-50和MobileNet-V1在ImageNet验证数据集上相差在1%内。
-	* 支持Intel Xeon CascadeLake Server（VNNI指令）及Intel Xeon SkyLake Server，性能提升约为1.33倍。
-* CPU预测速度提升
-	* fuse sequence pooling concatop，支持N (<200)个sequence_pooling op concat起来组成一个新op，整体使得seqpool模型 CPU预测提升56%。
-	* fuse 连续重复的fc op为一个大op，使得seqpool模型CPU预测速度提升15%。
-	* fuse 逻辑为 $$((X * Y).^2 - (X.^2 * Y.^2) ) .* scalar$$ 的op组合 , 使得seqpool模型CPU预测速度提升8.2%。
-	* 针对输入tensor元素个数为1的情况，优化compare_op的CPU Kernel。
-* 新增Paddle-TRT 对Calibration INT8的支持，GPU预测速度提升
-	* 模型VGG，Resnet50上预测速度达到了Paddle-TRT float32的两倍性能。
-	* 模型VGG，Resnet50在imagenet数据集上测试，精度下降0.3%以内。
-* 算子融合
-	* 增加 fc和 con 相关两个 fuse，作用于 conv_op CUDNN kernel。
-	* 新增Conv+Affine Channel的融合pass，Faster RCNN运行的性能提升26.8%。
-	* 新增Transpose+Flatten+Concat 融合pass，MobilenetSSD模型性能提升15%。
-	* 实现beam_search operator的CUDA Kernel，并且将相应的top-k、elementwise_add、reshape、log计算融合到beam_search operator中。
-* 功能完善及易用性提升
-	* 新增C++ IR graph的Python接口。
-	* 新增预测库的Python接口。
-	* 服务端预测支持从内存加载模型。
-* 其他
-	* 删除legacy V2代码。从1.3版本起，不再支持V1&V2老版本功能。
-	* 修复Paddle-TRT elementwise-mul模型运行出现问题的bug。
-	* 修复Paddle-TRT  trt_engine stream多个连续输入情况下模型输出结果异常的bug。
-
-移动端预测
-+++++++++++
-* 效率优化，常见模型预测速度提升
-	* int8预测支持dequantize和其他op（batch normalization/relu/elementwise add）进行自动kernel融合。
-	* transpose2 operator对于shuffle channel操作进行优化。
-	* gru operator使用neon指令进行优化，并针对batch size为1时进行优化。
-	* 优化和实现pooling，支持任意的padding。
-	* 优化和实现batch normalization、softmax、elementwise add。
-* 新增支持多个输入和多个输出的模型预测。
-* 新增实现prelu6 operator、cast operator、top_k operator。
-* 修复int8 offline量化溢出结果不对的问题。
-* 修复winograd实现在输入feature map的height和width不相等时结果可能为0的bug。
-
-模型建设
-==========
-* PaddleCV 智能视觉
-	* 新增发布PaddlePaddle视频模型库，包括五个视频分类模型：Attention Cluster、NeXtVLAD、LSTM,、stNet、TSN。提供适合视频分类任务的通用骨架代码，包括数据读取和预处理、训练和预测、网络模型以及指标计算等多个模块。用户根据需要添加自己的网络模型，直接复用其他模块的代码，快速部署模型。
-	* 新增支持目标检测Mask R-CNN模型，效果与主流实现打平。
-	* 语义分割DeepLabV3+模型，depthwise_conv op融合，显存优化，显存占用对比上一版本减少50%。
-* PaddleNLP 智能文本处理
-	* 新增支持NLP语义表示BERT模型，支持多机多卡训练，支持混合精度训练，训练速度对比主流实现提升50%+，提供完整部署示例。
-	* 机器翻译Transformer模型优化解码计算，decoder中加入对encoder output计算结果的cache，预测速度提升一倍。
-* PaddleRec 智能推荐
-	* Sequence Semantic Retrieval 新增单机多线程、单机多卡运行示例，添加预测功能、数据预处理优化，完善部署示例。
-	* GRU4Rec新增负采样功能，使用bpr loss和cross entropy loss的效果与原作打平。
-
-分布式训练
-===========
-* 大规模稀疏参数服务器Benchmark发布
-	* 测试真实业务场景下，特征规模百亿、样本平均特征数1k的点击率预估任务，在batch=512情况下，100worker加速比95.0，吞吐量1.56M/s 。
-* CPU多机异步训练
-	* 发布面向点击率预估任务的built-in reader，Criteo数据集下IO总吞吐提升1300%。
-* GPU多机多卡水平扩展性能提升
-	* 新增并行模式：PG（ParallelGraph）、MP（Multi-Process），独立GPU卡之间的计算，提升性能同时，不影响模型精度。
-	* 在ResNet50模型，单机8卡V100下，PG, MP模式提升训练性能30%以上；4机32卡，PG模式提速46%，MP模式提速60%。
-	* 在BERT模型，8卡V100下，PG, MP模式提升训练性能26%。
-	* Multi-Process模式相比Parallel-Graph模式对Reader速度敏感度不高。
-* GPU多机多卡垂直扩展性能提升
-	* 新增功能：fp16和混合精度训练
-	* Fp16单机单卡加速情况：ResNet50提速约87%，BERT提速约70%。
-	* BERT同时开启PG和混合精度，单机8卡下单位时间吞吐提升120%。
-	* ResNet50同时开启混合精度训练和MP模式，在V100单机8卡、4机32卡下，单位时间吞吐提升100%。
-* 典型模型收敛速度优化
-	* 新增功能：动态Batch Size，动态Image Resize方法。
-	* Resnet50 on Imagenet数据集：训练收敛轮数下降为标准训练方法的1/3左右。
-
-VisualDL
-==========
-* VisualDL graph支持Paddle fluid保存的模型可视化展示。
-
-
-
-Paddle Fluid v1.2
-##########################
-
-Paddle Fluid v1.2在基础框架、预测引擎、模型建设、分布式训练各个方向上完成多项更新。基础框架支持python3.5及以上全版本。预测引擎优化，预测性能大幅提升。增强了对RL相关的支持能力。模型库新增图像分类任任务的预训练模型、语言模型任务新增基于cudnn的LSTM实现、分布式word2vec模型。CPU多机异步训练升级了包括worker异步并发和IO、通信优化在内多项功能，整体吞吐大幅提升。
-
-基础框架
-==========
-* 安装
-	* 提供新pip安装包，支持Windows下CPU执行。
-* 编程语言
-	* 新增对python3.6、python3.7的支持。
-* 重构内存分配模块Allocator，提升CPU下内存分配策略，提升显存利用率(默认关闭，需要使用FLAGS_allocator_strategy)。
-* 限制SelectedRows的使用。修复了稀疏正则和稀疏优化器的bug。
-* Tensor支持DLPack，方便被其他框架集成和集成其他训练框架。
-* OP
-	* 修复 expand op shape 推理错误的bug
-	* 支持 Selu 激活函数
-
-预测引擎
-==========
-* 服务器预测
-	* GPU 支持图融合，且支持和 TensorRT引擎混合改图，在Resnet50和Googlenet等图像通用模型上bs=1下性能提升 50%~100%。
-	* GPU支持DDPG Deep Explore预测。
-	* Paddle-TRT对更多模型的支持，其中包括Resnet， SE-Resnet， DPN，GoogleNet。
-	* CPU, GPU, TensorRT 等加速引擎合并入 AnalysisPredictor，统一由 AnalysisConfig 控制。
-	* 增加调用多线程数学库的接口。
-	* 新增TensorRT plugin的支持，包括 :code:`split operator` ， :code:`prelu operator` ，  :code:`avg_pool operator` ,  :code:`elementwise_mul operator` 。
-	* 增加了JIT CPU Kernel，支持基本的向量操作，以及常见的算法包括ReLU，LSTM和GRU的部分实现，可以实现在AVX和AVX2指令集之间自动runtime切换。
-	* 优化CRF decoding和LayerNorm在AVX以及AVX2指令集上的实现。
-	* 修复了 AnalysisPredictor 在GPU，在CPU 到 GPU 的 transfer data 不删除的问题。
-	* 修复了 Variable 中包含 container 内存持续增长的问题。
-	* 修复 :code:`fc_op` 不支持3-D Tensor的问题。
-	* 修复了Analysis predictor 在GPU下执行pass时的问题。
-	* 修复了TensorRT下运行GoogleNet的问题。
-	* 预测性能提升
-		* Max Sequence pool optimization，单op提高10%。
-		*  :code:`Softmax operator` 优化，单op提升14%。
-		*  :code:`Layer Norm operator` 优化，支持avx2指令集，单op提升5倍。
-		*  :code:`Stack operator` 优化，单op提升3.6倍。
-		* 增加depthwise_conv_mkldnn_pass，加速MobileNet预测。
-		* 加速analysis模式的图分析时间，提升70倍。
-		* DAM开源模型，提升118.8%。
-* 移动端预测
-	* 实现winograd算法， GoogleNet v1性能大幅提升35%。
-	* GoogleNet 8bit优化，相比float加速14%。
-	* MobileNet v1 8bit支持，相比float加速20%。
-	* MobileNet v2 8bit支持，相比float加速19%。
-	* FPGA V1 开发了Deconv算子。
-	* android gpu支持MobileNet、MobileNetSSD、GoogleNet、SqueezeNet、YOLO、ResNet等主流的网络模型。
-
-
-模型建设
-===========
-* CV图像分类任务发布MobileNet V1, ResNet101, ResNet152，VGG11预训练模型。
-* CV Metric Learning模型新增arcmargin损失，并调整训练方式，采用element-wise作为预训练模型，pair-wise继续微调的训练方式提升精度。
-* NLP语言模型任务新增基于cudnn的LSTM实现，对比PaddingRNN的实现方式，在不同参数配置下速度提升3~5倍。
-* 增加分布式word2vec模型，包括新增的tree-based softmax operator，negative sampling等，与经典word2vec算法对齐。
-* 新增GRU4Rec、Tag-Space算法的分布式配置。
-* 完善Multi-view Simnet模型，并增加inference配置。
-* 支持强化学习算法 DQN。
-* 现已支持python3.x的模型：语义匹配DAM，阅读理解BiDAF，机器翻译Transformer，语言模型，强化学习DQN、DoubleDQN模型、DuelingDQN模型，视频分类TSN，度量学习Metric Learning，场景文字识别CRNN-CTC 、OCR Attention，生成式对抗网络ConditionalGAN、DCGAN、CycleGAN，语义分割ICNET、DeepLab v3+，目标检测Faster-RCNN、MobileNet-SSD 、PyramidBox ，图像分类SE-ResNeXt、ResNet等，个性化推荐TagSpace、GRU4Rec、SequenceSemanticRetrieval、DeepCTR、Multiview-Simnet。
-
-分布式训练
-=============
-* CPU多机异步训练
-	* worker异步并发：增加 :code:`AsyncExecutor` ，以训练文件作为执行粒度，支持分布式训练中的worker端计算异步无锁计算，同时支持单机训练。以CTR任务为例，单机训练速度，在充分利用单机线程的情况下，整体吞吐提升14倍。
-	* IO优化：增加支持 :code:`AsyncExecutor` 的DataFeed，支持可定制化的通用分类任务格式。面向CTR任务，增加CTRReader，使数据读取速度线性提升，在PaddleRec/ctr任务中，整体吞吐提升1倍。
-	* 通信优化：针对稀疏访问的Dense参数例如Embedding，增加稀疏通信机制，以语义匹配任务为例，获取参数的总量可以压缩到1%以下，在搜索真实场景的数据下，整体训练吞吐可以提升50倍。
-* GPU多机同步训练
-	* 修复Transformer、Bert模型下P2P训练模式会Hang住的问题。
-
-文档
-=========
-* API
-	* 新增13篇API使用指南。
-	* 新增300个API Reference中文文档。
-	* 优化77个API Reference英文文档：包括代码示例、参数说明等。
-* 安装文档
-	* 新增python3.6、python3.7安装说明。
-	* 新增windows pip install安装说明。
-* Book文档
-	* Book文档中的代码示例更改为Low level API。
-* 使用文档
-	* 新增《Operator相关注意事项》，更新《保存与载入模型变量》、《C++预测API介绍》、《使用TensorRT库预测》、《如何贡献代码》等多篇使用文档。
--- a/doc/fluid/release_note_en.md
+++ b/doc/fluid/release_note_en.md
--- a/doc/fluid/user_guides/howto/dygraph/DyGraph.md
+++ b/doc/fluid/user_guides/howto/dygraph/DyGraph.md
 # 动态图机制-DyGraph

+
+
 PaddlePaddle的DyGraph模式是一种动态的图执行机制，可以立即执行结果，无需构建整个图。同时，和以往静态的执行计算图不同，DyGraph模式下您的所有操作可以立即获得执行结果，而不必等待所构建的计算图全部执行完成，这样可以让您更加直观地构建PaddlePaddle下的深度学习任务，以及进行模型的调试，同时还减少了大量用于构建静态计算图的代码，使得您编写、调试网络的过程变得更加便捷。


@@ -313,7 +315,6 @@ PaddlePaddle DyGraph是一个更加灵活易用的模式，可提供：
 		        train_reader = paddle.batch(
 		            paddle.dataset.mnist.train(), batch_size= BATCH_SIZE, drop_last=True)
 		
-		        dy_param_init_value = {}
 		        np.set_printoptions(precision=3, suppress=True)
 		        for epoch in range(epoch_num):
 		            for batch_id, data in enumerate(train_reader()):
@@ -333,10 +334,6 @@ PaddlePaddle DyGraph是一个更加灵活易用的模式，可提供：
 		
 		                dy_out = avg_loss.numpy()
 		
-		                if epoch == 0 and batch_id == 0:
-		                    for param in mnist.parameters():
-		                        dy_param_init_value[param.name] = param.numpy()
-		
 		                avg_loss.backward()
 		                sgd.minimize(avg_loss)
 		                mnist.clear_gradients()
@@ -390,11 +387,15 @@ PaddlePaddle DyGraph是一个更加灵活易用的模式，可提供：

  在模型训练中可以使用`                    fluid.dygraph.save_persistables(your_model_object.state_dict(), "save_dir")`来保存`your_model_object`中所有的模型参数。也可以自定义需要保存的“参数名” - “参数对象”的Python Dictionary传入。

-同样可以使用`your_modle_object.load_dict(
-                        fluid.dygraph.load_persistables(your_model_object.state_dict(), "save_dir"))`接口来恢复保存的模型参数从而达到继续训练的目的。
+同样可以使用`your_model_object.load_dict(fluid.dygraph.load_persistables("save_dir"))`接口来恢复保存的模型参数从而达到继续训练的目的。
+
+
+

 下面的代码展示了如何在“手写数字识别”任务中保存参数并且读取已经保存的参数来继续训练。
-	
+
+
+	dy_param_init_value = {}
 	for epoch in range(epoch_num):
 	    for batch_id, data in enumerate(train_reader()):
 	        dy_x_data = np.array(
@@ -420,9 +421,8 @@ PaddlePaddle DyGraph是一个更加灵活易用的模式，可提供：
 	
 	        for param in mnist.parameters():
 	            dy_param_init_value[param.name] = param.numpy()
-	
-	        mnist.load_dict(fluid.dygraph.load_persistables(mnist.state_dict(), "save_dir"))
-	        restore = mnist.parameters()
+	        mnist.load_dict(fluid.dygraph.load_persistables("save_dir"))
+	restore = mnist.parameters()
 	# check save and load
 	success = True
 	for value in restore:
@@ -490,7 +490,7 @@ PaddlePaddle DyGraph是一个更加灵活易用的模式，可提供：
        mnist_infer = MNIST("mnist")
        # load checkpoint
        mnist_infer.load_dict(
-            fluid.dygraph.load_persistables(mnist.state_dict(), "save_dir"))
+            fluid.dygraph.load_persistables("save_dir"))
        print("checkpoint loaded")

        # start evaluate mode
@@ -536,56 +536,41 @@ PaddlePaddle DyGraph是一个更加灵活易用的模式，可提供：
 ## 编写兼容的模型

 以上一步中手写数字识别的例子为例，相同的模型代码可以直接在PaddlePaddle的`Executor`中执行：
-
+	
 	exe = fluid.Executor(fluid.CPUPlace(
-        ) if not core.is_compiled_with_cuda() else fluid.CUDAPlace(0))
-
-        mnist = MNIST("mnist")
-        sgd = SGDOptimizer(learning_rate=1e-3)
-        train_reader = paddle.batch(
-            paddle.dataset.mnist.train(), batch_size= BATCH_SIZE, drop_last=True)
-
-        img = fluid.layers.data(
-            name='pixel', shape=[1, 28, 28], dtype='float32')
-        label = fluid.layers.data(name='label', shape=[1], dtype='int64')
-        cost = mnist(img)
-        loss = fluid.layers.cross_entropy(cost, label)
-        avg_loss = fluid.layers.mean(loss)
-        sgd.minimize(avg_loss)
-
-        # initialize params and fetch them
-        static_param_init_value = {}
-        static_param_name_list = []
-        for param in mnist.parameters():
-            static_param_name_list.append(param.name)
-
-        out = exe.run(fluid.default_startup_program(),
-                      fetch_list=static_param_name_list)
-
-        for i in range(len(static_param_name_list)):
-            static_param_init_value[static_param_name_list[i]] = out[i]
-
-        for epoch in range(epoch_num):
-            for batch_id, data in enumerate(train_reader()):
-                static_x_data = np.array(
-                    [x[0].reshape(1, 28, 28)
-                     for x in data]).astype('float32')
-                y_data = np.array(
-                    [x[1] for x in data]).astype('int64').reshape([BATCH_SIZE, 1])
-
-                fetch_list = [avg_loss.name]
-                fetch_list.extend(static_param_name_list)
-                out = exe.run(
-                    fluid.default_main_program(),
-                    feed={"pixel": static_x_data,
-                          "label": y_data},
-                    fetch_list=fetch_list)
-
-                static_param_value = {}
-                static_out = out[0]
-                for i in range(1, len(out)):
-                    static_param_value[static_param_name_list[i - 1]] = out[
-                        i]
+	) if not core.is_compiled_with_cuda() else fluid.CUDAPlace(0))
+	
+	mnist = MNIST("mnist")
+	sgd = SGDOptimizer(learning_rate=1e-3)
+	train_reader = paddle.batch(
+	    paddle.dataset.mnist.train(), batch_size= BATCH_SIZE, drop_last=True)
+	
+	img = fluid.layers.data(
+	    name='pixel', shape=[1, 28, 28], dtype='float32')
+	label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+	cost = mnist(img)
+	loss = fluid.layers.cross_entropy(cost, label)
+	avg_loss = fluid.layers.mean(loss)
+	sgd.minimize(avg_loss)
+	
+	out = exe.run(fluid.default_startup_program())
+	
+	for epoch in range(epoch_num):
+	    for batch_id, data in enumerate(train_reader()):
+	        static_x_data = np.array(
+	            [x[0].reshape(1, 28, 28)
+	             for x in data]).astype('float32')
+	        y_data = np.array(
+	            [x[1] for x in data]).astype('int64').reshape([BATCH_SIZE, 1])
+	
+	        fetch_list = [avg_loss.name]
+	        out = exe.run(
+	            fluid.default_main_program(),
+	            feed={"pixel": static_x_data,
+	                  "label": y_data},
+	            fetch_list=fetch_list)
+	
+	        static_out = out[0]

 			
 			
--- a/doc/fluid/user_guides/howto/prepare_data/reader_cn.md
+++ b/doc/fluid/user_guides/howto/prepare_data/reader_cn.md
@@ -159,7 +159,7 @@ reader = paddle.reader.shuffle(paddle.dataset.mnist.train(), 512)

 我们提供一个函数来将一个单项reader转换成一个batch reader。

-### 为什么需要一个batch raeder，在训练过程中给出reader和batch_size参数这样不足够吗？
+### 为什么需要一个batch reader，在训练过程中给出reader和batch_size参数这样不足够吗？

 在大多数情况下，在训练方法中给出reader和batch_size参数是足够的。但有时用户想自定义mini batch里数据项的顺序，或者动态改变batch_size。在这些情况下用batch reader会非常高效有用。
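
A batch reader can be sketched in plain Python, since a reader here is just a function that returns a generator. This is an illustrative re-implementation, not the code of `paddle.batch`. The names `sample_reader`, `batch`, and `growing_batch` are hypothetical.

```python
import itertools


def sample_reader():
    """A single-item reader: yields one (feature, label) pair at a time."""
    for i in range(10):
        yield (i, i % 2)


def batch(reader, batch_size, drop_last=False):
    """Wrap a single-item reader into a reader that yields fixed-size lists."""
    def batch_reader():
        buf = []
        for item in reader():
            buf.append(item)
            if len(buf) == batch_size:
                yield buf
                buf = []
        if buf and not drop_last:
            yield buf
    return batch_reader


def growing_batch(reader, sizes):
    """Batch reader whose batch size changes from one batch to the next."""
    def batch_reader():
        items = reader()
        for size in sizes:
            chunk = list(itertools.islice(items, size))
            if not chunk:
                return
            yield chunk
    return batch_reader


fixed_sizes = [len(b) for b in batch(sample_reader, 4)()]
dropped_sizes = [len(b) for b in batch(sample_reader, 4, drop_last=True)()]
dynamic_batches = list(growing_batch(sample_reader, [1, 2, 3])())
```

The second wrapper shows the case a plain `batch_size` parameter cannot express: the batch reader itself decides how many items go into each mini batch.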


--- a/doc/fluid/user_guides/howto/prepare_data/use_py_reader_en.rst
+++ b/doc/fluid/user_guides/howto/prepare_data/use_py_reader_en.rst
@@ -4,8 +4,7 @@
 Use PyReader to read training and test data
 ############################################

-Paddle Fluid supports PyReader, which implements feeding data from Python to C++. Different from :ref:`user_guide_use_numpy_array_as_train_data_en` , the process of loading data to Python is asynchronous with the process of :code:`Executor::Run()` reading data when PyReader is in use.
-Moreover, PyReader is able to work with :code:`double_buffer_reader` to upgrade the performance of reading data.
+In addition to the Python reader, Paddle Fluid provides PyReader. PyReader performs better than :ref:`user_guide_use_numpy_array_as_train_data` because data loading runs asynchronously with model training. PyReader can also work with :code:`double_buffer_reader` to further speed up data reading, and :code:`double_buffer_reader` converts CPU Tensors to GPU Tensors, which improves reading efficiency to some extent.

 Create PyReader Object
 ################################
@@ -17,7 +16,7 @@ You can create PyReader object as follows:
    import paddle.fluid as fluid

    py_reader = fluid.layers.py_reader(capacity=64,
-                                       shapes=[(-1,3,224,224), (-1,1)],
+                                       shapes=[(-1,784), (-1,1)],
                                       dtypes=['float32', 'int64'],
                                       name='py_reader',
                                       use_double_buffer=True)
@@ -28,14 +27,14 @@ In the code, ``capacity`` is buffer size of PyReader;
 ``name`` is name of PyReader instance; 
 ``use_double_buffer`` is True by default, which means :code:`double_buffer_reader` is used.

-To create some different PyReader objects (Usually, you have to create two different PyReader objects for training and testing phase), the names of objects must be different. For example, In the same task, PyReader objects in training and testing period are created as follows:
+Attention: To create multiple PyReader objects (for example, two different PyReader objects for the training and inference phases), you must give each object a different name, because PaddlePaddle uses names to distinguish variables and :code:`Program.clone()` (see :ref:`api_fluid_Program_clone` ) cannot copy PyReader objects.

 .. code-block:: python

    import paddle.fluid as fluid

    train_py_reader = fluid.layers.py_reader(capacity=64,
-                                             shapes=[(-1,3,224,224), (-1,1)],
+                                             shapes=[(-1,784), (-1,1)],
                                             dtypes=['float32', 'int64'],
                                             name='train',
                                             use_double_buffer=True)
@@ -46,11 +45,10 @@ To create some different PyReader objects (Usually, you have to create two diffe
                                            name='test',
                                            use_double_buffer=True)

-Note: You could not copy PyReader object with :code:`Program.clone()` so you have to create PyReader objects in training and testing phase with the method mentioned above
+When using PyReader, you can share model parameters between the training and test phases with :code:`fluid.unique_name.guard()` .
+Note: Paddle distinguishes variables by name, and the names are generated by a counter in the :code:`unique_name` module that increases by one each time a variable name is generated. :code:`fluid.unique_name.guard()` resets this counter, so networks built under repeated :code:`fluid.unique_name.guard()` calls get the same variable names and can therefore share parameters.
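
The counter-reset behaviour described above can be modelled with a small stand-alone sketch. It is a simplified toy, not Paddle's `unique_name` module: `NameGenerator`, `generate`, `guard`, and `build_network` are hypothetical names, and the generated name format is an assumption.

```python
import collections
import contextlib


class NameGenerator:
    """Toy name generator with one counter per key."""

    def __init__(self):
        self.ids = collections.defaultdict(int)

    def __call__(self, key):
        index = self.ids[key]
        self.ids[key] += 1
        return "{}_{}".format(key, index)


# Holder for the generator currently in use.
_state = {"generator": NameGenerator()}


def generate(key):
    """Return a fresh name for `key` from the active generator."""
    return _state["generator"](key)


@contextlib.contextmanager
def guard():
    """Switch to a fresh generator inside the block, then restore the old one."""
    previous = _state["generator"]
    _state["generator"] = NameGenerator()
    try:
        yield
    finally:
        _state["generator"] = previous


def build_network():
    """Stand-in for a network definition that creates three named variables."""
    return [generate("fc"), generate("fc"), generate("conv")]


with guard():
    train_names = build_network()
with guard():
    test_names = build_network()
```

Both networks receive identical variable names because each `guard()` block starts counting from zero, which is the property that lets the training and test programs refer to the same parameters.
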

-Because you could not copy PyReader with :code:`Program.clone()` so you have to share the parameters of training phase with testing phase through :code:`fluid.unique_name.guard()` .
-
-Details are as follows:
+The following example configures the training and test networks with PyReader:

 .. code-block:: python

@@ -61,41 +59,97 @@ Details are as follows:
    import numpy

    def network(is_train):
+        # Create py_reader object and give different names
+        # when is_train = True and is_train = False
        reader = fluid.layers.py_reader(
            capacity=10,
            shapes=((-1, 784), (-1, 1)),
            dtypes=('float32', 'int64'),
            name="train_reader" if is_train else "test_reader",
            use_double_buffer=True)
+        
+        # Use read_file() method to read out the data from py_reader
        img, label = fluid.layers.read_file(reader)
        ...
        # Here, we omitted the definition of loss of the model
        return loss , reader

+    # Create main program and startup program for training
    train_prog = fluid.Program()
    train_startup = fluid.Program()

    with fluid.program_guard(train_prog, train_startup):
+        # Use fluid.unique_name.guard() to share parameters with test network
        with fluid.unique_name.guard():
            train_loss, train_reader = network(True)
            adam = fluid.optimizer.Adam(learning_rate=0.01)
            adam.minimize(train_loss)

+    # Create main program and startup program for testing
    test_prog = fluid.Program()
    test_startup = fluid.Program()
    with fluid.program_guard(test_prog, test_startup):
+        # Use fluid.unique_name.guard() to share parameters with train network
        with fluid.unique_name.guard():
            test_loss, test_reader = network(False)

 Configure data source of PyReader objects
 ##########################################
-PyReader provides :code:`decorate_tensor_provider` and :code:`decorate_paddle_reader` , both of which receieve Python :code:`generator` as data source.The difference is:
+A PyReader object sets its data source with :code:`decorate_paddle_reader()` or :code:`decorate_tensor_provider()` . Both methods take a Python generator :code:`generator` as a parameter, and :code:`generator` yields one batch of data at a time.
+
+  The differences between :code:`decorate_paddle_reader()` and :code:`decorate_tensor_provider()` are:
+
+  - The :code:`generator` of :code:`decorate_paddle_reader()` should return Numpy arrays, whereas the :code:`generator` of :code:`decorate_tensor_provider()` should return LoDTensors.
+
+  - :code:`decorate_tensor_provider()` requires the data type and size of each LoDTensor returned by :code:`generator` to match the dtypes and shapes specified when configuring py_reader. :code:`decorate_paddle_reader()` has no such requirement, because it converts the data type and size internally.
+
+  The usage is as follows:
+
+  .. code-block:: python
+
+     import paddle.fluid as fluid
+     import numpy as np

-1. :code:`decorate_tensor_provider` :  :code:`generator` generates a  :code:`list` or :code:`tuple` each time, with each element of :code:`list` or :code:`tuple` being :code:`LoDTensor` or Numpy array, and :code:`LoDTensor` or :code:`shape` of Numpy array must be the same as :code:`shapes` stated while PyReader is created.
+     BATCH_SIZE = 32

+     # Case 1: Use decorate_paddle_reader() method to set the data source of py_reader
+     # The generator yields Numpy-typed batched data
+     def fake_random_numpy_reader():
+         image = np.random.random(size=(BATCH_SIZE, 784))
+         label = np.random.randint(low=0, high=10, size=(BATCH_SIZE, 1))
+         yield image, label

-2. :code:`decorate_paddle_reader` :  :code:`generator` generates a :code:`list` or :code:`tuple` each time, with each element of :code:`list` or :code:`tuple` being Numpy array,but the :code:`shape` of Numpy array doesn't have to be the same as :code:`shape` stated while PyReader is created. :code:`decorate_paddle_reader` will :code:`reshape` Numpy array internally.
+     py_reader1 = fluid.layers.py_reader(
+         capacity=10,
+         shapes=((-1, 784), (-1, 1)),
+         dtypes=('float32', 'int64'),
+         name='py_reader1',
+         use_double_buffer=True)

+     py_reader1.decorate_paddle_reader(fake_random_numpy_reader)
+
+
+     # Case 2: Use decorate_tensor_provider() method to set the data source of py_reader
+     # The generator yields Tensor-typed batched data
+     def fake_random_tensor_provider():
+         image = np.random.random(size=(BATCH_SIZE, 784)).astype('float32')
+         label = np.random.randint(low=0, high=10, size=(BATCH_SIZE, 1)).astype('int64')
+
+         image_tensor = fluid.LoDTensor()
+         image_tensor.set(image, fluid.CPUPlace())
+
+         label_tensor = fluid.LoDTensor()
+         label_tensor.set(label, fluid.CPUPlace())
+         yield image_tensor, label_tensor
+
+     py_reader2 = fluid.layers.py_reader(
+         capacity=10,
+         shapes=((-1, 784), (-1, 1)),
+         dtypes=('float32', 'int64'),
+         name='py_reader2',
+         use_double_buffer=True)
+
+     py_reader2.decorate_tensor_provider(fake_random_tensor_provider)
 example usage：

 .. code-block:: python
@@ -142,32 +196,75 @@ example usage：
 Train and test model with PyReader
 ##################################

-Details are as follows（the remaining part of the code above）:
+The following example trains and tests a model with PyReader:

 .. code-block:: python

+    import paddle
+    import paddle.fluid as fluid
+    import paddle.dataset.mnist as mnist
+    import six
+
+    def network(is_train):
+        # Create py_reader object and give different names
+        # when is_train = True and is_train = False
+        reader = fluid.layers.py_reader(
+            capacity=10,
+            shapes=((-1, 784), (-1, 1)),
+            dtypes=('float32', 'int64'),
+            name="train_reader" if is_train else "test_reader",
+            use_double_buffer=True)
+        img, label = fluid.layers.read_file(reader)
+        ...
+        # Here, we omitted the definition of loss of the model
+        return loss, reader
+
+    # Create main program and startup program for training
+    train_prog = fluid.Program()
+    train_startup = fluid.Program()
+
+    # Define train network
+    with fluid.program_guard(train_prog, train_startup):
+        # Use fluid.unique_name.guard() to share parameters with test network
+        with fluid.unique_name.guard():
+            train_loss, train_reader = network(True)
+            adam = fluid.optimizer.Adam(learning_rate=0.01)
+            adam.minimize(train_loss)
+
+    # Create main program and startup program for testing
+    test_prog = fluid.Program()
+    test_startup = fluid.Program()
+
+    # Define test network
+    with fluid.program_guard(test_prog, test_startup):
+        # Use fluid.unique_name.guard() to share parameters with train network
+        with fluid.unique_name.guard():
+            test_loss, test_reader = network(False)
+
+
    place = fluid.CUDAPlace(0)
-    startup_exe = fluid.Executor(place)
-    startup_exe.run(train_startup)
-    startup_exe.run(test_startup)
+    exe = fluid.Executor(place)

-    trainer = fluid.ParallelExecutor(
-        use_cuda=True, loss_name=train_loss.name, main_program=train_prog)
+    # Run startup program
+    exe.run(train_startup)
+    exe.run(test_startup)

-    tester = fluid.ParallelExecutor(
-        use_cuda=True, share_vars_from=trainer, main_program=test_prog)
+    # Compile programs
+    train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(loss_name=train_loss.name)
+    test_prog = fluid.CompiledProgram(test_prog).with_data_parallel(share_vars_from=train_prog)

+    # Set the data source of py_reader using decorate_paddle_reader() method
    train_reader.decorate_paddle_reader(
        paddle.reader.shuffle(paddle.batch(mnist.train(), 512), buf_size=8192))

    test_reader.decorate_paddle_reader(paddle.batch(mnist.test(), 512))

-    for epoch_id in xrange(10):
+    for epoch_id in six.moves.range(10):
        train_reader.start()
        try:
            while True:
-                print 'train_loss', numpy.array(
-                    trainer.run(fetch_list=[train_loss.name]))
+                loss = exe.run(program=train_prog, fetch_list=[train_loss])
+                print('train_loss', loss)
        except fluid.core.EOFException:
            print('End of epoch', epoch_id)
            train_reader.reset()
@@ -175,8 +272,8 @@ Details are as follows（the remaining part of the code above）:
        test_reader.start()
        try:
            while True:
-                print 'test loss', numpy.array(
-                    tester.run(fetch_list=[test_loss.name]))
+                loss = exe.run(program=test_prog, fetch_list=[test_loss])
+                print('test loss', loss)
        except fluid.core.EOFException:
            print('End of testing')
            test_reader.reset()
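
The epoch loops in the hunk above follow a start / run-until-EOF / reset cycle. The sketch below reproduces only that control flow in plain Python so it can be run without PaddlePaddle. `FakeReader`, `next_batch` and this `EOFException` are hypothetical stand-ins introduced for illustration, not part of the `fluid` API.

```python
class EOFException(Exception):
    """Hypothetical stand-in for fluid.core.EOFException (illustration only)."""


class FakeReader(object):
    """Hypothetical stand-in that mimics the start()/reset() cycle of a py_reader."""

    def __init__(self, batches):
        self._batches = list(batches)
        self._iter = None

    def start(self):
        # Begin one pass over the data.
        self._iter = iter(self._batches)

    def next_batch(self):
        # Return the next batch, or raise EOFException when the pass is exhausted.
        if self._iter is None:
            raise RuntimeError("call start() before reading")
        try:
            return next(self._iter)
        except StopIteration:
            raise EOFException()

    def reset(self):
        # Drop the finished pass so that start() can be called again.
        self._iter = None


def run_epochs(reader, num_epochs):
    """Return the list of batches consumed in each epoch."""
    history = []
    for _ in range(num_epochs):
        seen = []
        reader.start()
        try:
            while True:
                seen.append(reader.next_batch())
        except EOFException:
            # End of the pass: reset before the next start().
            reader.reset()
        history.append(seen)
    return history
```

Calling `run_epochs(FakeReader([1, 2, 3]), 2)` consumes the same three batches in each of the two epochs, mirroring how `reset()` allows the reader in the documentation example to be restarted.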

--- a/doc/fluid/user_guides/howto/training/cluster_howto_en.rst
+++ b/doc/fluid/user_guides/howto/training/cluster_howto_en.rst
@@ -35,7 +35,9 @@ In the training of data parallelism mode, Fluid uses two communication modes to

  The pserver process can be on a compute node that is completely different from the trainer, or it can share the same node with a trainer. The number of pserver processes required for a distributed task usually needs to be adjusted according to the actual situation to achieve the best performance. However, usually pserver processes are no more than trainer processes.

-  When using GPU training, the pserver can choose to use the GPU or only use the CPU. If the pserver also uses the GPU, it will result in the extra overhead of copying the gradient data received from the CPU to the GPU. In some cases, the overall training performance will be degraded.
+  **Note:** When using GPU training, the pserver can choose to use the GPU or only use the CPU. If the pserver also uses the GPU, it will result in the extra overhead of copying the gradient data received from the CPU to the GPU. In some cases, the overall training performance will be degraded.
+
+  **Note:** When using GPU training, if each trainer node has multiple GPU cards, gradients are first aggregated among the cards within a node through NCCL2, and then aggregated across nodes through the pserver.

 - Structure of NCCL2 communication method:

@@ -178,8 +180,10 @@ Distributed training in NCCL2 mode, because there is no parameter server role, t

 * Configure :code:`mode="nccl2"` in :code:`fluid.DistributeTranspilerConfig` .
 * When calling :code:`transpile`, :code:`trainers` is fed with the endpoints of all trainer nodes, and passed with the argument :code:`current_endpoint` .
+  In this step, :code:`gen_nccl_id_op` is added to the :code:`startup program` to synchronize NCCLID information during multi-node program initialization.
 * Initialize :code:`ParallelExecutor` with :code:`num_trainers` and :code:`trainer_id` .
-
+  In this step, :code:`ParallelExecutor` initializes NCCL2 in multi-node mode and performs :code:`allreduce` across the nodes on the gradient of every parameter to carry out multi-node training.
+   
 For example:

 .. code-block:: python
@@ -198,9 +202,9 @@ For example:
 .. csv-table:: Description of the necessary parameters for NCCL2 mode
 	:header: "parameter", "description"

-	"trainer_id", "The unique ID of each trainer node in the task, starting at 0, there cannot be any duplication"
-	"trainers", "endpoints of all trainer nodes in the task, used to broadcast NCCL IDs when NCCL2 is initialized"
-	"current_endpoint", "endpoint of current node"
+	"trainer_id", "(int) The unique ID of each trainer node in the task, starting at 0; IDs must not be duplicated"
+	"trainers", "(string) Endpoints of all trainer nodes in the task, used to broadcast NCCL IDs when NCCL2 is initialized"
+	"current_endpoint", "(string) Endpoint of the current node"

 Currently, distributed training using NCCL2 only supports synchronous training. The distributed training using NCCL2 mode is more suitable for the model which is relatively large and needs \
 synchronous training and GPU training. If the hardware device supports RDMA and GPU Direct, this can achieve high distributed training performance.
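
As a minimal sketch of how the three parameters in the table above might be gathered and sanity-checked before they are passed on, the helper below reads them from a mapping such as `os.environ`. The function name `nccl2_params`, the environment variable names, and the comma-separated endpoint format are assumptions made for illustration; the source does not prescribe them.

```python
import os


def nccl2_params(env=None):
    """Collect trainer_id, trainers and current_endpoint from a mapping.

    The variable names used here are assumptions for illustration only.
    """
    env = os.environ if env is None else env
    trainer_id = int(env["PADDLE_TRAINER_ID"])
    trainers = env["PADDLE_TRAINER_ENDPOINTS"]  # assumed comma-separated
    current_endpoint = env["PADDLE_CURRENT_ENDPOINT"]

    endpoints = trainers.split(",")
    if len(set(endpoints)) != len(endpoints):
        raise ValueError("trainer endpoints must not be duplicated")
    # trainer_id starts at 0 and must be unique, so it has to index a trainer.
    if not 0 <= trainer_id < len(endpoints):
        raise ValueError("trainer_id must be in [0, number of trainers)")
    if current_endpoint not in endpoints:
        raise ValueError("current_endpoint must be one of the trainer endpoints")
    return trainer_id, trainers, current_endpoint
```

Failing early on a duplicated endpoint or an out-of-range ID is a design choice of this sketch; it keeps misconfiguration from surfacing only later, during NCCL2 initialization.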
@@ -227,7 +231,9 @@ Attention during usage:
 Important Notes on NCCL2 Distributed Training
 ++++++++++++++++++++++++++++++++++++++++++++++

-**Note** : Please ensure each node has the same amount of data to train in NCCL2 mode distributed training, which prevents
+**Note:** When using distributed training in NCCL2 mode, if you want to use only some of the cards on a node, specify them by setting the environment variable :code:`export CUDA_VISIBLE_DEVICES=0,1,2,3` .
+
+**Note:** Please ensure each node has the same amount of data to train in NCCL2 mode distributed training, which prevents
 exit at the final iteration. There are two common ways:

 - Randomly sample some data to complement nodes where less data are distributed. (We recommend this method for sake of a complete dataset to be trained)
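
The random-sampling approach in the bullet above can be sketched as follows. This is one possible reading, assuming each node pads its own shard by resampling from the data it already holds; the function name `pad_shards` and the `seed` parameter are hypothetical.

```python
import random


def pad_shards(shards, seed=0):
    """Pad every shard to the size of the largest one.

    Extra items are drawn at random (with replacement) from the shard's own
    data, so no original sample is dropped.
    """
    rng = random.Random(seed)
    target = max(len(shard) for shard in shards)
    padded = []
    for shard in shards:
        if not shard:
            raise ValueError("cannot pad an empty shard")
        extra = [rng.choice(shard) for _ in range(target - len(shard))]
        padded.append(list(shard) + extra)
    return padded
```

After padding, every node runs the same number of iterations per pass, which is the condition the note asks for.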

--- a/doc/fluid/user_guides/index_cn.rst
+++ b/doc/fluid/user_guides/index_cn.rst
@@ -14,7 +14,6 @@

    - `训练神经网络 <../user_guides/howto/training/index_cn.html>`_：介绍如何使用 Fluid 进行单机训练、多机训练、以及保存和载入模型变量

-
    - `DyGraph模式 <../user_guides/howto/dygraph/DyGraph.html>`_：介绍在 Fluid 下使用DyGraph

    - `模型评估与调试 <../user_guides/howto/evaluation_and_debugging/index_cn.html>`_：介绍在 Fluid 下进行模型评估和调试的方法，包括：

--- a/doc/fluid/user_guides/models/index_cn.rst
+++ b/doc/fluid/user_guides/models/index_cn.rst
@@ -14,8 +14,7 @@ Fluid模型配置和参数文件的工具。
 -  `AlexNet <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
 -  `VGG <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
 -  `GoogleNet <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
-  `Residual
-   Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
+-  `Residual Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
 -  `Inception-v4 <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
 -  `MobileNet <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
 -  `Dual Path
@@ -122,7 +121,7 @@ RNN 结构的 NMT 得以应运而生，例如基于卷积神经网络 CNN
 Attention 学习语言中的上下文依赖。相较于RNN/CNN,
 这种结构在单层内计算复杂度更低、易于并行化、对长程依赖更易建模，最终在多种语言之间取得了最好的翻译效果。

-  `Transformer <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README_cn.md>`__
+-  `Transformer <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README.md>`__

 强化学习
 --------
@@ -163,7 +162,7 @@ DQN 及其变体，并测试了它们在 Atari 游戏中的表现。

 本例所开放的DAM (Deep Attention Matching Network)为百度自然语言处理部发表于ACL-2018的工作，用于检索式聊天机器人多轮对话中应答的选择。DAM受Transformer的启发，其网络结构完全基于注意力(attention)机制，利用栈式的self-attention结构分别学习不同粒度下应答和语境的语义表示，然后利用cross-attention获取应答与语境之间的相关性，在两个大规模多轮对话数据集上的表现均好于其它模型。

- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/deep_attention_matching_net>`__
+- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/deep_attention_matching>`__

 AnyQ
 ----
@@ -174,8 +173,7 @@ AnyQ

 SimNet是百度自然语言处理部于2013年自主研发的语义匹配框架，该框架在百度各产品上广泛应用，主要包括BOW、CNN、RNN、MM-DNN等核心网络结构形式，同时基于该框架也集成了学术界主流的语义匹配模型，如MatchPyramid、MV-LSTM、K-NRM等模型。使用SimNet构建出的模型可以便捷的加入AnyQ系统中，增强AnyQ系统的语义匹配能力。

-  `SimNet in PaddlePaddle
-   Fluid <https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md>`__
+-  `SimNet in PaddlePaddle Fluid <https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md>`_

 机器阅读理解
 ----
@@ -184,7 +182,7 @@ SimNet是百度自然语言处理部于2013年自主研发的语义匹配框架

 百度阅读理解数据集是由百度自然语言处理部开源的一个真实世界数据集，所有的问题、原文都来源于实际数据(百度搜索引擎数据和百度知道问答社区)，答案是由人类回答的。每个问题都对应多个答案，数据集包含200k问题、1000k原文和420k答案，是目前最大的中文MRC数据集。百度同时开源了对应的阅读理解模型，称为DuReader，采用当前通用的网络分层结构，通过双向attention机制捕捉问题和原文之间的交互关系，生成query-aware的原文表示，最终基于query-aware的原文表示通过point network预测答案范围。

-  `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/machine_reading_comprehension/README.md>`__
+-  `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/reading_comprehension>`_


 个性化推荐

--- a/doc/fluid/user_guides/models/index_en.rst
+++ b/doc/fluid/user_guides/models/index_en.rst
@@ -97,7 +97,7 @@ Machine Translation transforms a natural language (source language) into another
 The Transformer implemented in this example is a machine translation model based on the self-attention mechanism, in which there is no more RNN or CNN structure, but fully utilizes Attention to learn the context dependency. Compared with RNN/CNN, in a single layer, this structure has lower computational complexity, easier parallelization, and easier modeling for long-range dependencies, and finally achieves the best translation effect among multiple languages.


- `Transformer <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README_cn.md>`__
+- `Transformer <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README.md>`__

 Reinforcement learning
 -------------------------
@@ -131,7 +131,7 @@ In many scenarios of natural language processing, it is necessary to measure the

 The DAM (Deep Attention Matching Network) introduced in this example is the work of Baidu Natural Language Processing Department published in ACL-2018, which is used for the selection of responses in multi-round dialogue of retrieval chat robots. Inspired by Transformer, DAM is based entirely on the attention mechanism. It uses the stack-type self-attention structure to learn the semantic representations of responses and contexts at different granularities, and then uses cross-attention to obtain relativity between responses and contexts. The performance on the two large-scale multi-round dialogue datasets is better than other models.

- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/deep_attention_matching_net>`__
+- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/deep_attention_matching>`__

 AnyQ
 ----
@@ -151,7 +151,7 @@ Machine Reading Comprehension (MRC) is one of the core tasks in Natural Language

 Baidu reading comprehension dataset is an open-source real-world dataset publicized by Baidu Natural Language Processing Department. All the questions and original texts are derived from actual data (Baidu search engine data and Baidu know Q&A community), and the answer is given by humans. Each question corresponds to multiple answers. The dataset contains 200k questions, 1000k original text and 420k answers. It is currently the largest Chinese MRC dataset. Baidu also publicized the corresponding open-source reading comprehension model, called DuReader. DuReader adopts the current common network hierarchical structure, and captures the interaction between the problems and the original texts through the double attention mechanism to generate the original representation of the query-aware. Finally, based on the original text of query-aware, the answer scope is predicted by point network.

- `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/machine_reading_comprehension/README.md>`__
+- `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/reading_comprehension>`__


 Personalized recommendation