refine dygraph doc

d55f1fe4 · JiabinYang · 165a0756 · af0c7d7f · d55f1fe4 · d55f1fe4
28 changed files
--- a/doc/fluid/advanced_usage/deploy/inference/native_infer_en.md
+++ b/doc/fluid/advanced_usage/deploy/inference/native_infer_en.md
@@ -96,7 +96,7 @@ There are two modes in term of memory management in `PaddleBuf` :

 In the two modes, the first is more convenient while the second strictly controls memory management to facilitate integration with `tcmalloc` and other libraries.
 
-### Upgrade performance based on contrib::AnalysisConfig (Prerelease)
+### Upgrade performance based on contrib::AnalysisConfig

 `AnalysisConfig` is at the pre-release stage and is protected by `namespace contrib`; it may be adjusted in the future.

@@ -106,9 +106,11 @@ The usage of `AnalysisConfig` is similiar with that of `NativeConfig` but the fo

 ```c++
 AnalysisConfig config;
-config.model_dir = xxx;
-config.use_gpu = false;  // GPU optimization is not supported at present
-config.specify_input_name = true; // it needs to set name of input
+config.SetModel(dirname);                // set the directory of the model
+config.EnableUseGpu(100, 0 /*gpu id*/);  // use GPU, or
+config.DisableGpu();                     // use CPU
+config.SwitchSpecifyInputNames(true);    // input names must be specified
+config.SwitchIrOptim();                  // enable IR optimization; a series of optimizations is then applied
 ```

 Note that input PaddleTensor needs to be allocated. Previous examples need to be revised as follows:
@@ -125,11 +127,29 @@ tensor.dtype = paddle::PaddleDType::INT64;
 tensor.name = "input0"; // name need to be set here
 ```

+The subsequent execution process is exactly the same as with `NativeConfig`.
+	
+### Variable-length sequence input
+When dealing with variable-length sequence input, you need to set LoD for `PaddleTensor` .
+	
+``` c++
+// Suppose the sequence lengths are [3, 2, 4, 1, 2, 3] in order.
+tensor.lod = {{0,
+	         /*0 + 3=*/3,
+	         /*3 + 2=*/5,
+	         /*5 + 4=*/9,
+	         /*9 + 1=*/10,
+	         /*10 + 2=*/12,
+	         /*12 + 3=*/15}};
+```
+	
+For more specific examples, please refer to [LoD-Tensor Instructions](../../../user_guides/howto/basic_concept/lod_tensor_en.html)
+	
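The LoD offsets in the hunk above are the running sum of the sequence lengths, starting from 0. The following is a minimal sketch of that arithmetic in plain Python; `lengths_to_lod_offsets` is a hypothetical helper for illustration and is not part of the Paddle API.

```python
from itertools import accumulate


def lengths_to_lod_offsets(lengths):
    """Turn per-sequence lengths into one level of LoD offsets.

    Hypothetical helper: the offsets start at 0 and each subsequent
    entry adds the length of the next sequence.
    """
    return [0] + list(accumulate(lengths))


# Sequence lengths taken from the documentation example above.
offsets = lengths_to_lod_offsets([3, 2, 4, 1, 2, 3])
print(offsets)
```

The resulting list corresponds to the inner braces of `tensor.lod` in the C++ snippet; the last offset equals the total number of rows in the tensor.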
 ### Suggestion for Performance

 1. If the CPU type permits, it's best to use the versions with support for AVX and MKL.
 2. Reuse input and output `PaddleTensor` to avoid frequent memory allocation resulting in low performance
-3. Try to replace `NativeConfig` with `AnalysisConfig` to perform optimization for CPU inference 
+3. Try to replace `NativeConfig` with `AnalysisConfig` to perform optimization for CPU or GPU inference 

 ## Code Demo


--- a/doc/fluid/advanced_usage/deploy/inference/paddle_tensorrt_infer_en.md
+++ b/doc/fluid/advanced_usage/deploy/inference/paddle_tensorrt_infer_en.md
-# Use TensorRT Library for inference
+# Use Paddle-TensorRT Library for inference

 NVIDIA TensorRT is a platform for high-performance deep learning inference. It delivers low latency and high throughput for deep learning inference applications.
-Subgraph is used in Paddle 1.0 to preliminarily integrate TensorRT, which enables TensorRT module to enhance inference performance of paddle models. The module is still under development. Currently supported models are AlexNet, MobileNet, ResNet50, VGG19, ResNext, Se-ReNext, GoogleNet, DPN, ICNET, MobileNet-SSD and so on. We will introduce the obtaining, usage and theory of Paddle-TensorRT library in this documentation.
-
-
-## Build inference libraries with `TensorRT`
+PaddlePaddle uses subgraphs to integrate TensorRT in a preliminary form, so that the TensorRT module can improve the inference performance of Paddle models. The module is still under development. Currently supported models include AlexNet, MobileNet, ResNet50, VGG19, ResNeXt, SE-ResNeXt, GoogleNet, DPN, ICNET, Deeplabv3, MobileNet-SSD and so on. This document describes how to obtain and use the Paddle-TensorRT library and how it works.
+
+## Contents
+ - [Compile Paddle-TRT inference libraries](#compile Paddle-TRT inference libraries)
+ - [Paddle-TRT interface usage](#Paddle-TRT interface usage)
+ - [Paddle-TRT example compiling test](#Paddle-TRT example compiling test)
+ - [Paddle-TRT INT8 usage](#Paddle-TRT_INT8 usage)
+ - [Paddle-TRT subgraph operation principle](#Paddle-TRT subgraph operation principle)
+ 
+## <a name="compile Paddle-TRT inference libraries">Compile Paddle-TRT inference libraries</a>

 **Use Docker to build inference libraries**         

@@ -42,7 +48,7 @@ Subgraph is used in Paddle 1.0 to preliminarily integrate TensorRT, which enable
 	make inference_lib_dist -j
 	```

-## Usage of Paddle TensorRT
+## <a name="Paddle-TRT interface usage">Paddle-TRT interface usage</a> 

 [`paddle_inference_api.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/api/paddle_inference_api.h) defines all APIs of TensorRT.

@@ -58,17 +64,17 @@ A complete process is shown below:
 #include "paddle_inference_api.h"

 namespace paddle {
-using paddle::contrib::AnalysisConfig;
+using paddle::AnalysisConfig;

 void RunTensorRT(int batch_size, std::string model_dirname) {
  // 1. Create AnalysisConfig
-  AnalysisConfig config(true);
-  config.model_dir = model_dirname;
-  config->use_gpu = true;
-  config->device = 0;
-  config->fraction_of_gpu_memory = 0.15;
+  AnalysisConfig config(model_dirname);
+  // config.SetModel(model_dirname + "/model",
+  //                 model_dirname + "/params");
+  
+  config.EnableUseGpu(100, 0 /*gpu_id*/);
  config.EnableTensorRtEngine(1 << 20 /*work_space_size*/, batch_size /*max_batch_size*/);
-
+  
  // 2. Create predictor based on config
  auto predictor = CreatePaddlePredictor(config);
  // 3. Create input tensor 
@@ -104,10 +110,33 @@ int main() {
 ```
 The compilation process is [here](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/inference/api/demo_ci)

+## <a name="Paddle-TRT_INT8 usage">Paddle-TRT INT8 usage</a>
+
+  1. Paddle-TRT INT8 introduction
+Neural network parameters are redundant to some extent. In many tasks a Float32 model can be converted to an Int8 model while preserving precision. Paddle-TRT currently supports converting a trained Float32 model to Int8 offline. The process is as follows: 1) **Create the calibration table**. Prepare about 500 real input samples and feed them to the model. Paddle-TRT collects the value ranges of the inputs and outputs of each op in the model and records them in the calibration table. This information reduces the information loss during the conversion. 2) After the calibration table has been created, run the model again. **Paddle-TRT loads the calibration table automatically** and runs inference in INT8 mode.
+
+  2. Compile and test the INT8 example
+
+  	```shell
+ 	cd SAMPLE_BASE_DIR/sample
+ 	# sh run_impl.sh {path to the inference libraries} {name of the test script} {model directory}
+ 	# We generate 500 input samples to simulate the process; using real data is recommended for your own experiments.
+ 	sh run_impl.sh BASE_DIR/fluid_inference_install_dir/  fluid_generate_calib_test SAMPLE_BASE_DIR/sample/mobilenetv1
+ 	
+ 	```
+ 	
+        After the run finishes, a new file named `trt_calib_*` appears under the `SAMPLE_BASE_DIR/sample/build/mobilenetv1` model directory. This file is the calibration table.
+
+  	``` shell
+ 	# run INT8 inference
+ 	# copy the model directory together with the calibration table to a separate location
+ 	cp -rf SAMPLE_BASE_DIR/sample/build/mobilenetv1 SAMPLE_BASE_DIR/sample/mobilenetv1_calib
+ 	sh run_impl.sh BASE_DIR/fluid_inference_install_dir/  fluid_int8_test SAMPLE_BASE_DIR/sample/mobilenetv1_calib
+ 	```

-## Subgraph Theory
+## <a name="Paddle-TRT subgraph operation principle">Paddle-TRT subgraph operation principle</a>

-   Subgraph is used to integrate TensorRT in PaddlePaddle. After model is loaded, neural network can be represented as a computing graph composed of variables and computing nodes. Functions Paddle TensorRT implements are to scan the whole picture, discover subgraphs that can be optimized with TensorRT and replace them with TensorRT nodes. During the inference of model, Paddle will call TensorRT library to optimize TensorRT nodes and call native library of Paddle to optimize other nodes. During the inference, TensorRT can integrate Op horizonally and vertically to filter redundant Ops and is able to choose appropriate kernel for specific Op in specific platform to speed up the inference of model.
+PaddlePaddle integrates TensorRT by means of subgraphs. After a model is loaded, the neural network can be represented as a computation graph composed of variables and computing nodes. Paddle TensorRT scans the whole graph, finds subgraphs that can be optimized with TensorRT, and replaces them with TensorRT nodes. During inference, Paddle calls the TensorRT library to optimize the TensorRT nodes and uses its native implementation for the other nodes. TensorRT can fuse Ops horizontally and vertically, remove redundant Ops, and choose an appropriate kernel for each Op on the target platform, which speeds up inference.
   
 A simple model expresses the process : 

@@ -121,6 +150,6 @@ A simple model expresses the process :
 <img src="https://raw.githubusercontent.com/NHZlX/FluidDoc/add_trt_doc/doc/fluid/user_guides/howto/inference/image/model_graph_trt.png" width="600">
 </p>

-  We can see in the Original Network that the green nodes represent nodes supported by TensorRT, the red nodes represent variables in network and yellow nodes represent nodes which can only be operated by native functions in Paddle. Green nodes in original network are extracted to compose subgraph which is replaced by a single TensorRT node to be transformed into `block-25` node in network. When such nodes are encountered during the runtime, TensorRT library will be called to execute them.
+In the original network, green nodes are nodes supported by TensorRT, red nodes are variables in the network, and yellow nodes are nodes that can only be executed by Paddle's native implementation. The green nodes in the original network are extracted into a subgraph, which is replaced by a single TensorRT node and appears as the `block-25` node in the transformed network. When such a node is encountered at runtime, the TensorRT library is called to execute it.
   

--- a/doc/fluid/advanced_usage/development/new_op/new_op.md
+++ b/doc/fluid/advanced_usage/development/new_op/new_op.md
--- a/doc/fluid/advanced_usage/development/new_op/op_notes.md
+++ b/doc/fluid/advanced_usage/development/new_op/op_notes.md
@@ -4,9 +4,9 @@
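 ### 1. How Ops are constructed in Fluid
 Every Op in Fluid inherits from `OperatorBase`, and every Op is stateless. Each Op has only four member variables: type, inputs, outputs, and attribute.

-The core method of an Op is Run. Run needs two kinds of resources, data resources and compute resources, which are obtained through `Scope` and `Place` respectively. The framework keeps a global `DeviceContextPool` that records the mapping between `Place` and `DeviceContext`: each `Place` has exactly one corresponding `DeviceContext`, which holds the compute resources of that device. For a GPU, these resources include `cudnn_handle`, `cublas_handle`, `stream`, and so on. All computation inside an Op (data copies, CUDA kernels, etc.) must be performed in the `DeviceContext`.
+The core method of an Op is Run. Run needs two kinds of resources, data resources and compute resources, which are obtained through `Scope` and `Place` respectively. The framework keeps a global `DeviceContextPool` that records the mapping between `Place` and `DeviceContext`: each `Place` has exactly one corresponding `DeviceContext`, which holds the compute resources of that device. For a GPU, these resources include `cudnn_handle`, `cublas_handle`, `stream`, and so on. **All computation inside an Op (data copies, CUDA kernels, etc.) must be performed in the `DeviceContext`**.

-Fluid is designed to run on multiple devices and third-party libraries, and the implementation of some Ops may differ across devices or libraries. Fluid therefore introduces OpKernels: one Op can have several OpKernels. Such Ops inherit from `OperatorWithKernel`. A typical example is conv; the OpKernels of conv_op are `GemmConvKernel`, `CUDNNConvOpKernel`, and `ConvMKLDNNOpKernel`, and each OpKernel supports both the double and float data types. `WhileOp` is a typical Op that needs no OpKernel.
+Fluid is designed to run on multiple devices and third-party libraries, and the implementation of some Ops may differ across devices or libraries. Fluid therefore introduces OpKernels: one Op can have several OpKernels. Such Ops inherit from `OperatorWithKernel`. A typical example is conv_op; the OpKernels of conv_op are `GemmConvKernel`, `CUDNNConvOpKernel`, and `ConvMKLDNNOpKernel`, and each OpKernel supports both the double and float data types. `WhileOp` is a typical Op that needs no OpKernel.

 Operator inheritance diagram: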
 ![op_inheritance_relation_diagram](../../pics/op_inheritance_relation_diagram.png)
@@ -62,7 +62,7 @@ Operator继承关系图：
 <td>InferShapeFN </td>
 <td>Functor </td>
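 <td>Infers the Shape of the Output </td>
-<td>Runs at both compile time and run time. At compile time it is called from the Python side; if the Op inherits from OperatorWithKernel, at run time it is called when op.run executes </td>
+<td>Runs at both compile time and run time. At compile time it is called from the Python side; if the Op inherits from OperatorWithKernel, at run time it is called inside op.run </td>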
 </tr>
 <tr>
 <td>OpCreator </td>
@@ -85,11 +85,12 @@ Operator继承关系图：

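 **Note:**

-1. For every Op, the first three arguments are required: op_type is the name of the op, OperatorBase is the object of the Op, and op_maker_and_checker_maker is the op's maker and the checker for the op's attrs.
+1. For every Op, the first three arguments are required: op_type is the name of the op, OperatorBase is the object of the Op, and op_maker_and_checker_maker is the op's maker as well as the checker for the Op's attrs.
 2. If the Op has a backward pass, op_grad_opmaker is required, because backward obtains the backward Op's maker from the forward Op.
-3. The framework provides a default op_grad_opmaker, `DefaultGradOpDescMaker`. This maker uses the inputs and outputs of the forward Op as inputs of the backward Op, uses the gradients of the forward Op's inputs as outputs of the backward Op, and copies the forward Op's attributes. **Note:** DefaultGradOpDescMaker passes all inputs and outputs of the forward Op to the backward Op as inputs, even when an input is unnecessary, which prevents memory optimization of the unused variables.
+3. The framework provides a default op_grad_opmaker, `DefaultGradOpDescMaker`. This maker uses the inputs and outputs of the forward Op as inputs of the backward Op, uses the gradients of the forward Op's inputs as outputs of the backward Op, and copies the forward Op's attributes. **Note: DefaultGradOpDescMaker passes all inputs and outputs of the forward Op to the backward Op as inputs, even when an input is unnecessary, which prevents memory optimization of the unused variables**.
 4. The framework provides no default op_infer_var_shape method. If the Op has no OpKernel, the user usually needs to add a corresponding op_infer_var_shape method; if the Op has an OpKernel, implement the `InferShape` method of `OperatorWithKernel`, in which case op_infer_var_shape is not needed. For concrete implementations see [while_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/controlflow/while_op.cc) and [conv_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/conv_op.cc).
-5. The framework provides no default op_infer_var_type method; the user needs to add op_infer_var_shape as appropriate. Strictly speaking, every Op should register an InferVarType; op_infer_var_type infers the type and dtype of the output Var from the type and dtype of the input Var. **Note:** On the Python side, the Variable returned by create_variable_for_type_inference in LayerHelper holds a LoDTensor; InferVarType on the C++ side can modify the type and dtype of the `Variable`.
+5. The framework provides no default op_infer_var_type method; the user needs to add op_infer_var_shape as appropriate. Strictly speaking, every Op should register an InferVarType; op_infer_var_type infers the type and dtype of the output Var from the type and dtype of the input Var. **Note: On the Python side, the Variable returned by create_variable_for_type_inference in LayerHelper holds a LoDTensor; InferVarType on the C++ side can modify the type and dtype of the `Variable`**.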
+


 For more information, see [How to write a new Op](new_op.html)
@@ -119,7 +120,60 @@ ShareDataWith的功能是使两个Tensor共享底层buffer，在调用这个操
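 Currently, when sparse gradients are applied, the gradients are first merged, that is, gradients of the same parameter are accumulated, and then the parameter and its auxiliary parameters (such as velocity) are updated.

 ### 7. GPU memory optimization
-If the backward pass of an Op does not need all inputs and outputs of the forward op as its inputs, do not use `DefaultGradOpDescMaker`, because it prevents memory/GPU memory optimization of the unused variables.
+A backward Op usually depends on some inputs (Input) and outputs (Output) of the forward Op for its computation. In some cases, however, the backward Op does not need all inputs and outputs of the forward Op; in some cases it needs only part of them; and in some cases it needs only the Shape and LoD information of the forward Op's input and output variables. If the Op developer registers unnecessary forward inputs and outputs as inputs of the backward Op, that GPU memory cannot be optimized by the framework's existing memory optimization strategies, and the model's GPU memory usage becomes too high.
+
+Keep the following points in mind when registering a backward Op:
+
+- The `DefaultGradOpDescMaker` provided by Fluid by default uses all inputs (`Input`) and outputs (`Output`) of the forward op, together with the gradients of the output variables (`Output@Grad`), as inputs of the backward Op, and uses the gradients of the forward Op's inputs (`Input@Grad`) as outputs of the backward Op. When using `DefaultGradOpDescMaker`, consider whether some of these variables are never used in the computation.
+- If `DefaultGradOpDescMaker` does not meet your needs, build a `GradOpDescMaker` by hand; for details see the [related documentation](new_op.html#permalink-4--gradprotomaker-).
+- If a backward Op depends on the Shape or LoD of a forward Op's input or output variable, but not on the buffer of the Tensor in that variable, and the Shape and LoD cannot be inferred from other variables, register `NoNeedBufferVarsInference` for that variable (called `X` below) on the backward Op. **Once `NoNeedBufferVarsInference` is registered, the backward op must not read or write the buffer of the corresponding Tensor and may only call the Tensor's dims() and lod() methods. In addition, `GetExpectedKernelType()` of the backward Op must be overridden, and `GetExpectedKernelType()` must not call the type() method of the Tensor in variable `X`**. For example, `SliceOpGrad` uses only the Shape of the variable in `Input`, so `Input` needs to be registered on `SliceOpGrad`: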
+```
+namespace paddle {
+namespace operators {
+// ...
+class SliceOpGrad : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+  void InferShape(framework::InferShapeContext* ctx) const override {
+    // ... 
+  }
+
+  framework::OpKernelType GetExpectedKernelType(
+      const framework::ExecutionContext& ctx) const override {
+    // Note: don't get data type from ctx.Input<framework::Tensor>("Input");   
+    auto dtype = ctx.Input<framework::Tensor>(framework::GradVarName("Out"))->type();    
+    return framework::OpKernelType( dtype, ctx.GetPlace());
+  }
+};
+
+
+class SliceOpGradMaker : public framework::SingleGradOpDescMaker {
+ public:
+  using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
+
+ protected:
+  std::unique_ptr<framework::OpDesc> Apply() const override {
+    auto* bind = new framework::OpDesc();
+    bind->SetInput("Input", Input("Input"));
+    bind->SetInput(framework::GradVarName("Out"), OutputGrad("Out"));
+    bind->SetOutput(framework::GradVarName("Input"), InputGrad("Input"));
+    bind->SetAttrMap(Attrs());
+    bind->SetType("slice_grad");
+    return std::unique_ptr<framework::OpDesc>(bind);
+  }
+};
+
+DECLARE_NO_NEED_BUFFER_VARS_INFERENCE(SliceOpGradNoNeedBufferVarsInference,
+                                      "Input");
+}  // namespace operators
+}  // namespace paddle
+namespace ops = paddle::operators;
+REGISTER_OPERATOR(slice, ops::SliceOp, ops::SliceOpMaker,
+                  ops::SliceOpGradMaker);
+REGISTER_OPERATOR(slice_grad, ops::SliceOpGrad,
+                  ops::SliceOpGradNoNeedBufferVarsInference);
+```

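 ### 8. Mixed-device invocation
 Because the GPU executes asynchronously, the GPU side may not have actually run yet when the CPU call returns. If an Op creates a temporary variable that the GPU needs at run time, that variable may already have been freed on the CPU side when the GPU starts running, which can cause incorrect GPU results.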
@@ -137,7 +191,7 @@ The following device operations are asynchronous with respect to the host:
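 Notes on cudaMemCpy and cudaMemCpyAsync:

 - If data is transferred from the GPU to non-page-locked CPU memory, the transfer is synchronous even if an asynchronous copy operation is called.
- If a data transfer goes from CPU to CPU, the transfer is synchronous even if an asynchronous copy operation is called.
+- If data is transferred from CPU to CPU, the transfer is synchronous even if an asynchronous copy operation is called.

 For more information, see [Asynchronous Concurrent Execution](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#asynchronous-concurrent-execution) and [API synchronization behavior](https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior)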

@@ -172,7 +226,10 @@ Enforce提示信息不能为空，并且需要写明，因为报错信息可以

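 **Note:** Always preview the formulas before merging into the develop branch. See [dynamic_lstmp](http://paddlepaddle.org/documentation/docs/zh/1.1/api/layers.html#dynamic-lstmp) for reference.

-### 3. Parameter order in the Python-side Op interface
+### 3. Op variable names must follow the naming convention
+When defining an Op, the names of its inputs, outputs, and attributes must follow the convention. For the naming rules, see [`name_convention`](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/name_convention.md).
+
+### 4. Parameter order in the Python-side Op interface
 Parameters in the Python API are generally ordered by importance. Taking fc as an example: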
 ```
 def fc(input,

--- a/doc/fluid/api_cn/data/data_reader_cn.rst
+++ b/doc/fluid/api_cn/data/data_reader_cn.rst
@@ -249,7 +249,18 @@ Data Reader Interface

 .. py:function:: paddle.reader.xmap_readers(mapper, reader, process_num, buffer_size, order=False)

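+Uses multiple threads to map the samples returned by reader (into an output queue) with the user-defined mapper.

+Parameters:
+    - **mapper** (callable) - a function that maps the data from reader.
+    - **reader** (callable) - the reader that produces the data.
+    - **process_num** (int) - the number of threads used to process samples.
+    - **buffer_size** (int) - the size of the queue that holds data waiting to be read.
+    - **order** (bool) - whether to keep the data order of the original reader. Defaults to False.
+
+Returns: a decorated reader that yields the mapped data.
+
+Return type: callable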
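
As a rough illustration of the decorated-reader idea described for `xmap_readers`, the sketch below maps samples with a thread pool from the Python standard library. It is not Paddle's implementation: `map_reader_sketch` is a hypothetical helper, it always preserves the input order, and it has no bounded queue, so the `buffer_size` and `order` parameters are not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def map_reader_sketch(mapper, reader, process_num):
    """Return a decorated reader that applies `mapper` to each sample.

    Hypothetical helper for illustration only. `reader` is a callable
    that returns an iterable of samples, following the reader convention
    described in the documentation above.
    """
    def decorated_reader():
        # Executor.map submits every sample up front and yields the
        # results in input order.
        with ThreadPoolExecutor(max_workers=process_num) as pool:
            for mapped in pool.map(mapper, reader()):
                yield mapped

    return decorated_reader


def toy_reader():
    # Stand-in for a real data reader: yields five integer samples.
    for i in range(5):
        yield i


doubled = map_reader_sketch(lambda sample: sample * 2, toy_reader, process_num=2)
print(list(doubled()))
```

Calling the returned object produces a fresh generator each time, so the decorated reader can be iterated once per epoch.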

 .. py:class:: paddle.reader.PipeReader(command, bufsize=8192, file_type='plain')


--- a/doc/fluid/api_cn/layers_cn.rst
+++ b/doc/fluid/api_cn/layers_cn.rst
@@ -12939,10 +12939,10 @@ yolov3_loss
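 As for the confidence score, it is the logistic regression value of the IoU between the anchor box and the ground-truth box. The score of an anchor box is at most 1, in which case the anchor box corresponds to the maximum IoU.
 If the IoU of an anchor box is greater than the ignore threshold ignore_thresh, the confidence score loss of that anchor box is ignored.

-Therefore, the yolov3 loss consists of three main parts: box location loss, confidence score loss, and classification loss. L1 loss is used for
-box coordinates (w, h), while sigmoid cross-entropy loss is used for box coordinates (x, y), the confidence score loss, and the classification loss.
+Therefore, the yolov3 loss consists of three main parts: box location loss, objectness loss, and classification loss. L1 loss is used for
+box coordinates (w, h), while sigmoid cross-entropy loss is used for box coordinates (x, y), the objectness loss, and the classification loss.

-Each ground-truth box finds the best-matching anchor among all anchors, and the prediction of each such anchor box incurs all three losses, but predictions of anchors that match no GT box (ground truth box) incur only the object loss.
+Each ground-truth box finds the best-matching anchor among all anchors, and the prediction of each such anchor box incurs all three losses, but predictions of anchors that match no GT box (ground truth box) incur only the objectness loss.

 To balance the box coordinate loss between large and small boxes, the box coordinate loss is multiplied by a scale weight, namely: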

@@ -12965,7 +12965,7 @@ yolov3_loss

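 Parameters:
    - **x**  (Variable) – Input tensor of the YOLOv3 loss operator, a 4-D tensor of shape [N, C, H, W]. H and W should be equal. The second dimension (C) stores the box locations, the confidence score, and the one-hot classification of each anchor box
-    - **gt_box**  (Variable) – Ground-truth boxes, of shape [N, B, 4]. The third dimension holds x, y, w, h, which should be relative to the input image. N is the batch size and B is the maximum number of boxes in an image
+    - **gt_box**  (Variable) – Ground-truth boxes, of shape [N, B, 4]. The third dimension holds x, y, w, h, where x, y are the center coordinates of the ground-truth box and w, h are its width and height. x, y, w, h are divided by the size of the input image so that they are scaled into the range [0, 1]. N is the batch size and B is the maximum number of boxes in an image
    - **gt_label**  (Variable) – Class ids of the ground-truth boxes, of shape [N, B].
    - **anchors**  (list|tuple) – Widths and heights of the anchor boxes, which are parsed pair by pair
    - **anchor_mask**  (list|tuple) – Mask indices of the anchors used in the current YOLOv3 loss computation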
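
The `gt_box` layout described above (center coordinates plus width and height, each divided by the image size) can be sketched in plain Python. `normalize_box` is a hypothetical helper, and the corner-style pixel input is an assumption made for this illustration, not something the API requires.

```python
def normalize_box(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-space corner box into normalized (x, y, w, h).

    Hypothetical helper: x, y are the box center and w, h its size,
    each divided by the image width or height so they fall in [0, 1].
    """
    x = (xmin + xmax) / 2.0 / img_w
    y = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / float(img_w)
    h = (ymax - ymin) / float(img_h)
    return x, y, w, h


# A 200x200 pixel box inside a 400x400 image.
box = normalize_box(100, 50, 300, 250, 400, 400)
print(box)
```

One such 4-tuple per box, padded up to B boxes per image, gives the [N, B, 4] shape listed for `gt_box`.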

--- a/doc/fluid/api_guides/low_level/layers/learning_rate_scheduler_en.rst
+++ b/doc/fluid/api_guides/low_level/layers/learning_rate_scheduler_en.rst
@@ -29,4 +29,8 @@ The following content describes the APIs related to the learning rate scheduler:

 * :code:`piecewise_decay`: Piecewise decay. That is, the stair-like decay for a given number of steps, the learning rate stays the same within each step. For related API Reference please refer to :ref:`api_fluid_layers_piecewise_decay`

-* :code:`append_LARS`: The learning rate is obtained by the Layer-wise Adaptive Rate Scaling algorithm. For related algorithms, please refer to `Train Feed forward Neural Network with Layerwise Adaptive Rate via Approximating Back-matching Propagation <https://arxiv. Org/abs/1802.09750>`_ . For related API Reference please refer to :ref:`api_fluid_layers_append_LARS`
\ No newline at end of file
+* :code:`append_LARS`: The learning rate is obtained by the Layer-wise Adaptive Rate Scaling algorithm. For related algorithms, please refer to `Train Feed forward Neural Network with Layerwise Adaptive Rate via Approximating Back-matching Propagation <https://arxiv.org/abs/1802.09750>`_ . For related API Reference please refer to :ref:`api_fluid_layers_append_LARS`
+
+* :code:`cosine_decay`: Cosine decay. The learning rate changes with the number of steps in the form of a cosine function. For related API Reference please refer to :ref:`api_fluid_layers_cosine_decay`
+
+* :code:`linear_lr_warmup`: The learning rate increases linearly to a specified rate with the number of steps. For related API Reference please refer to :ref:`api_fluid_layers_linear_lr_warmup`
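
To make the last two schedules concrete, here is a minimal sketch in plain Python. The formulas are commonly used forms of cosine decay and linear warmup and are an assumption here; the exact expressions and parameter names used by `cosine_decay` and `linear_lr_warmup` should be checked against the API reference. Both function names below are hypothetical.

```python
import math


def cosine_decay_sketch(base_lr, step, step_each_epoch, epochs):
    """Cosine-shaped decay from base_lr toward 0 over `epochs` epochs.

    Assumed form: the rate is updated once per epoch.
    """
    epoch = step // step_each_epoch
    return base_lr * 0.5 * (math.cos(epoch * math.pi / epochs) + 1.0)


def linear_warmup_sketch(target_lr, step, warmup_steps, start_lr, end_lr):
    """Rise linearly from start_lr to end_lr, then use target_lr."""
    if step < warmup_steps:
        return start_lr + (end_lr - start_lr) * step / warmup_steps
    return target_lr


# The rate halfway through a 10-step warmup from 0.0 to 0.1.
print(linear_warmup_sketch(0.1, 5, 10, 0.0, 0.1))
# The rate at the start of cosine decay equals the base rate.
print(cosine_decay_sketch(0.1, 0, 100, 10))
```

In this sketch the warmup function hands over to `target_lr` once `warmup_steps` is reached; `target_lr` could itself come from a decay schedule such as the cosine one.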
--- a/doc/fluid/beginners_guide/index_cn.rst
+++ b/doc/fluid/beginners_guide/index_cn.rst
@@ -8,7 +8,7 @@ PaddlePaddle (PArallel Distributed Deep LEarning)是一个易用、高效、灵

 Let's start here:

-    - `Quick Start <../beginners_guide/quick_start.html>`_
+    - `Quick Start <../beginners_guide/quick_start_cn.html>`_

 If this is your first time using PaddlePaddle, please read the following document first to learn how to install it:

@@ -16,7 +16,7 @@ PaddlePaddle (PArallel Distributed Deep LEarning)是一个易用、高效、灵

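 More learning materials are available here:

-    - `Deep Learning Basics <../beginners_guide/basics/index.html>`_: covers the basics of several deep learning fields, including image classification, personalized recommendation, and machine translation, with Fluid implementation examples
+    - `Deep Learning Basics <../beginners_guide/basics/index_cn.html>`_: covers the basics of several deep learning fields, including image classification, personalized recommendation, and machine translation, with Fluid implementation examples

    - `Fluid Programming Guide <../beginners_guide/programming_guide/programming_guide.html>`_: introduces the basic concepts and usage of Fluid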


--- a/doc/fluid/beginners_guide/install/Tables.md
+++ b/doc/fluid/beginners_guide/install/Tables.md
--- a/doc/fluid/beginners_guide/install/Tables_en.md
+++ b/doc/fluid/beginners_guide/install/Tables_en.md
--- a/doc/fluid/beginners_guide/install/compile/compile_CentOS.md
+++ b/doc/fluid/beginners_guide/install/compile/compile_CentOS.md
@@ -200,7 +200,10 @@
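 Congratulations, you have now finished compiling and installing PaddlePaddle

 ## ***Verify installation***
-After installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify that the installation succeeded.
+After installation, you can use `python` or `python3` to enter the Python interpreter, enter `import paddle.fluid as fluid`, and then enter
+ `fluid.install_check.run_check()`
+
+If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.

 ## ***How to uninstall***
 Use the following command to uninstall PaddlePaddle: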

--- a/doc/fluid/beginners_guide/install/compile/compile_MacOS.md
+++ b/doc/fluid/beginners_guide/install/compile/compile_MacOS.md
@@ -204,7 +204,10 @@
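 Congratulations, you have now finished compiling and installing PaddlePaddle

 ## ***Verify installation***
-After installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify that the installation succeeded.
+After installation, you can use `python` or `python3` to enter the Python interpreter, enter `import paddle.fluid as fluid`, and then enter
+ `fluid.install_check.run_check()`
+
+If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.

 ## ***How to uninstall***
 Use the following command to uninstall PaddlePaddle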

--- a/doc/fluid/beginners_guide/install/compile/compile_Ubuntu.md
+++ b/doc/fluid/beginners_guide/install/compile/compile_Ubuntu.md
@@ -209,7 +209,10 @@
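 Congratulations, you have now finished compiling and installing PaddlePaddle

 ## ***Verify installation***
-After installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify that the installation succeeded.
+After installation, you can use `python` or `python3` to enter the Python interpreter, enter `import paddle.fluid as fluid`, and then enter
+ `fluid.install_check.run_check()`
+
+If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.

 ## ***How to uninstall***
 Use the following command to uninstall PaddlePaddle: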

--- a/doc/fluid/beginners_guide/install/compile/compile_Windows.md
+++ b/doc/fluid/beginners_guide/install/compile/compile_Windows.md
@@ -102,7 +102,10 @@
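 Congratulations, you have now finished compiling and installing PaddlePaddle

 ## ***Verify installation***
-After installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid`; if no error is reported, the installation succeeded.
+After installation, you can use `python` or `python3` to enter the Python interpreter, enter `import paddle.fluid as fluid`, and then enter
+ `fluid.install_check.run_check()`
+
+If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.

 ## ***How to uninstall***
 Use the following command to uninstall PaddlePaddle: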

--- a/doc/fluid/beginners_guide/install/install_CentOS.md
+++ b/doc/fluid/beginners_guide/install/install_CentOS.md
@@ -51,7 +51,10 @@ CentOS系统下有4种安装方式：

 <a name="check"></a>
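 ## ***Verify installation***
-After installation, you can use the command `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify that the installation succeeded.
+After installation, you can use `python` or `python3` to enter the Python interpreter, enter `import paddle.fluid as fluid`, and then enter
+ `fluid.install_check.run_check()`
+
+If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.

 ## ***How to uninstall***
 Use the following command to uninstall PaddlePaddle: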

--- a/doc/fluid/beginners_guide/install/install_Docker.md
+++ b/doc/fluid/beginners_guide/install/install_Docker.md
@@ -4,6 +4,8 @@

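 ## Environment preparation

+- For the currently supported system types, see the [installation guide](./index_cn.html). Note that using Docker on CentOS 6 is not supported at present

 - [Install Docker](https://hub.docker.com/search/?type=edition&offering=community) on the local host

 - To enable GPU support on Linux, [install nvidia-docker](https://github.com/NVIDIA/nvidia-docker)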

--- a/doc/fluid/beginners_guide/install/install_MacOS.md
+++ b/doc/fluid/beginners_guide/install/install_MacOS.md
@@ -41,7 +41,10 @@ MacOS系统下有4种安装方式：

 <a name="check"></a>
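 ## Verify installation
-After installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify that the installation succeeded.
+After installation, you can use `python` or `python3` to enter the Python interpreter, enter `import paddle.fluid as fluid`, and then enter
+ `fluid.install_check.run_check()`
+
+If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.

 ## How to uninstall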


--- a/doc/fluid/beginners_guide/install/install_Ubuntu.md
+++ b/doc/fluid/beginners_guide/install/install_Ubuntu.md
@@ -51,7 +51,10 @@ Ubuntu系统下有4种安装方式：

 <a name="check"></a>
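 ## Verify installation
-After installation, you can use the command `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify that the installation succeeded.
+After installation, you can use `python` or `python3` to enter the Python interpreter, enter `import paddle.fluid as fluid`, and then enter
+ `fluid.install_check.run_check()`
+
+If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.

 ## How to uninstall
 Use the following command to uninstall PaddlePaddle: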

--- a/doc/fluid/beginners_guide/install/install_Windows.md
+++ b/doc/fluid/beginners_guide/install/install_Windows.md
@@ -46,7 +46,10 @@ Windows系统下有3种安装方式：

 <a name="check"></a>
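 ## Verify installation
-After installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify that the installation succeeded.
+After installation, you can use `python` or `python3` to enter the Python interpreter, enter `import paddle.fluid as fluid`, and then enter
+ `fluid.install_check.run_check()`
+
+If `Your Paddle Fluid is installed succesfully!` appears, the installation succeeded.

 ## How to uninstall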


--- a/doc/fluid/beginners_guide/install/install_Windows_en.md
+++ b/doc/fluid/beginners_guide/install/install_Windows_en.md
@@ -9,9 +9,8 @@ This instruction will show you how to install PaddlePaddle on Windows.  The foll

 **Note** : 

-* The current version does not support NCCL, distributed training, AVX, warpctc and MKL related functions.
+* The current version does not support NCCL or distributed training related functions.

-* Currently, only PaddlePaddle for CPU is supported on Windows.



@@ -30,14 +29,20 @@ Version of pip or pip3 should be equal to or above 9.0.1 .

 * Install PaddlePaddle

+* ***CPU version of PaddlePaddle***:
 Execute `pip install paddlepaddle` or `pip3 install paddlepaddle` to download and install PaddlePaddle.

-
+* ***GPU version of PaddlePaddle***:
+Execute `pip install paddlepaddle-gpu` (python2.7) or `pip3 install paddlepaddle-gpu` (python3.x) to download and install PaddlePaddle.
+ 
 ## ***Verify installation***

 After completing the installation, you can use `python` or `python3` to enter the python interpreter and then use `import paddle.fluid` to verify that the installation was successful.

 ## ***How to uninstall***

+* ***CPU version of PaddlePaddle***:
 Use the following command to uninstall PaddlePaddle : `pip uninstall paddlepaddle` or `pip3 uninstall paddlepaddle`

+* ***GPU version of PaddlePaddle***:
+Use the following command to uninstall PaddlePaddle : `pip uninstall paddlepaddle-gpu` or `pip3 uninstall paddlepaddle-gpu`
--- a/doc/fluid/user_guides/howto/dygraph/DyGraph.md
+++ b/doc/fluid/user_guides/howto/dygraph/DyGraph.md
-# DyGraph
+# Dynamic graph mechanism - DyGraph
+
+

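 PaddlePaddle's DyGraph mode is a dynamic graph execution mechanism that executes operations immediately, without building the whole graph. Unlike the earlier static computation graph, in DyGraph mode every operation returns its result immediately instead of waiting for the whole constructed graph to finish executing. This makes building deep learning tasks and debugging models in PaddlePaddle more intuitive, and it removes a large amount of code for building static computation graphs, so writing and debugging networks becomes more convenient.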

@@ -313,7 +315,6 @@ PaddlePaddle DyGraph是一个更加灵活易用的模式，可提供：
 		        train_reader = paddle.batch(
 		            paddle.dataset.mnist.train(), batch_size= BATCH_SIZE, drop_last=True)
 		
-		        dy_param_init_value = {}
 		        np.set_printoptions(precision=3, suppress=True)
 		        for epoch in range(epoch_num):
 		            for batch_id, data in enumerate(train_reader()):
@@ -333,10 +334,6 @@ PaddlePaddle DyGraph是一个更加灵活易用的模式，可提供：
 		
 		                dy_out = avg_loss.numpy()
 		
-		                if epoch == 0 and batch_id == 0:
-		                    for param in mnist.parameters():
-		                        dy_param_init_value[param.name] = param.numpy()
-		
 		                avg_loss.backward()
 		                sgd.minimize(avg_loss)
 		                mnist.clear_gradients()
@@ -390,11 +387,15 @@ PaddlePaddle DyGraph是一个更加灵活易用的模式，可提供：

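  During model training you can use `fluid.dygraph.save_persistables(your_model_object.state_dict(), "save_dir")` to save all model parameters in `your_model_object`. You can also pass in a custom Python dictionary that maps "parameter name" to "parameter object" for the parameters to be saved.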

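-Similarly, you can use the `your_modle_object.load_dict(
-                        fluid.dygraph.load_persistables(your_model_object.state_dict(), "save_dir"))` interface to restore the saved model parameters and continue training.
+Similarly, you can use the `your_model_object.load_dict(fluid.dygraph.load_persistables("save_dir"))` interface to restore the saved model parameters and continue training.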
+
+
+

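 The following code shows how to save parameters in the "handwritten digit recognition" task and load the saved parameters to continue training.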
-	
+
+
+	dy_param_init_value={}
 	for epoch in range(epoch_num):
 	    for batch_id, data in enumerate(train_reader()):
 	        dy_x_data = np.array(
@@ -420,9 +421,8 @@ PaddlePaddle DyGraph是一个更加灵活易用的模式，可提供：
 	
 	        for param in mnist.parameters():
 	            dy_param_init_value[param.name] = param.numpy()
-	
-	        mnist.load_dict(fluid.dygraph.load_persistables(mnist.state_dict(), "save_dir"))
-	        restore = mnist.parameters()
+	        mnist.load_dict(fluid.dygraph.load_persistables("save_dir"))
+	restore = mnist.parameters()
 	# check save and load
 	success = True
 	for value in restore:
@@ -490,7 +490,7 @@ PaddlePaddle DyGraph是一个更加灵活易用的模式，可提供：
        mnist_infer = MNIST("mnist")
        # load checkpoint
        mnist_infer.load_dict(
-            fluid.dygraph.load_persistables(mnist.state_dict(), "save_dir"))
+            fluid.dygraph.load_persistables("save_dir"))
        print("checkpoint loaded")

        # start evaluate mode
@@ -536,56 +536,41 @@ PaddlePaddle DyGraph是一个更加灵活易用的模式，可提供：
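 ## Writing compatible models

 Taking the handwritten digit recognition example from the previous step, the same model code can be executed directly in PaddlePaddle's `Executor`: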
-
+	
 	exe = fluid.Executor(fluid.CPUPlace(
-        ) if not core.is_compiled_with_cuda() else fluid.CUDAPlace(0))
-
-        mnist = MNIST("mnist")
-        sgd = SGDOptimizer(learning_rate=1e-3)
-        train_reader = paddle.batch(
-            paddle.dataset.mnist.train(), batch_size= BATCH_SIZE, drop_last=True)
-
-        img = fluid.layers.data(
-            name='pixel', shape=[1, 28, 28], dtype='float32')
-        label = fluid.layers.data(name='label', shape=[1], dtype='int64')
-        cost = mnist(img)
-        loss = fluid.layers.cross_entropy(cost, label)
-        avg_loss = fluid.layers.mean(loss)
-        sgd.minimize(avg_loss)
-
-        # initialize params and fetch them
-        static_param_init_value = {}
-        static_param_name_list = []
-        for param in mnist.parameters():
-            static_param_name_list.append(param.name)
-
-        out = exe.run(fluid.default_startup_program(),
-                      fetch_list=static_param_name_list)
-
-        for i in range(len(static_param_name_list)):
-            static_param_init_value[static_param_name_list[i]] = out[i]
-
-        for epoch in range(epoch_num):
-            for batch_id, data in enumerate(train_reader()):
-                static_x_data = np.array(
-                    [x[0].reshape(1, 28, 28)
-                     for x in data]).astype('float32')
-                y_data = np.array(
-                    [x[1] for x in data]).astype('int64').reshape([BATCH_SIZE, 1])
-
-                fetch_list = [avg_loss.name]
-                fetch_list.extend(static_param_name_list)
-                out = exe.run(
-                    fluid.default_main_program(),
-                    feed={"pixel": static_x_data,
-                          "label": y_data},
-                    fetch_list=fetch_list)
-
-                static_param_value = {}
-                static_out = out[0]
-                for i in range(1, len(out)):
-                    static_param_value[static_param_name_list[i - 1]] = out[
-                        i]
+	) if not core.is_compiled_with_cuda() else fluid.CUDAPlace(0))
+	
+	mnist = MNIST("mnist")
+	sgd = SGDOptimizer(learning_rate=1e-3)
+	train_reader = paddle.batch(
+	    paddle.dataset.mnist.train(), batch_size= BATCH_SIZE, drop_last=True)
+	
+	img = fluid.layers.data(
+	    name='pixel', shape=[1, 28, 28], dtype='float32')
+	label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+	cost = mnist(img)
+	loss = fluid.layers.cross_entropy(cost, label)
+	avg_loss = fluid.layers.mean(loss)
+	sgd.minimize(avg_loss)
+	
+	out = exe.run(fluid.default_startup_program())
+	
+	for epoch in range(epoch_num):
+	    for batch_id, data in enumerate(train_reader()):
+	        static_x_data = np.array(
+	            [x[0].reshape(1, 28, 28)
+	             for x in data]).astype('float32')
+	        y_data = np.array(
+	            [x[1] for x in data]).astype('int64').reshape([BATCH_SIZE, 1])
+	
+	        fetch_list = [avg_loss.name]
+	        out = exe.run(
+	            fluid.default_main_program(),
+	            feed={"pixel": static_x_data,
+	                  "label": y_data},
+	            fetch_list=fetch_list)
+	
+	        static_out = out[0]

 			
 			
--- a/doc/fluid/user_guides/howto/prepare_data/use_py_reader_en.rst
+++ b/doc/fluid/user_guides/howto/prepare_data/use_py_reader_en.rst
@@ -4,8 +4,7 @@
 Use PyReader to read training and test data
 ############################################

-Paddle Fluid supports PyReader, which implements feeding data from Python to C++. Different from :ref:`user_guide_use_numpy_array_as_train_data_en` , the process of loading data to Python is asynchronous with the process of :code:`Executor::Run()` reading data when PyReader is in use.
-Moreover, PyReader is able to work with :code:`double_buffer_reader` to upgrade the performance of reading data.
+PyReader is provided in addition to the ordinary Python reader. PyReader performs better than :ref:`user_guide_use_numpy_array_as_train_data` because, when PyReader is in use, data loading runs asynchronously with model training. PyReader can also work with :code:`double_buffer_reader` to further improve data-reading performance. In addition, :code:`double_buffer_reader` converts CPU Tensors to GPU Tensors, which improves the efficiency of reading data to some extent.

 Create PyReader Object
 ################################
@@ -17,7 +16,7 @@ You can create PyReader object as follows:
    import paddle.fluid as fluid

    py_reader = fluid.layers.py_reader(capacity=64,
-                                       shapes=[(-1,3,224,224), (-1,1)],
+                                       shapes=[(-1,784), (-1,1)],
                                       dtypes=['float32', 'int64'],
                                       name='py_reader',
                                       use_double_buffer=True)
@@ -28,14 +27,14 @@ In the code, ``capacity`` is buffer size of PyReader;
 ``name`` is name of PyReader instance; 
 ``use_double_buffer`` is True by default, which means :code:`double_buffer_reader` is used.

-To create some different PyReader objects (Usually, you have to create two different PyReader objects for training and testing phase), the names of objects must be different. For example, In the same task, PyReader objects in training and testing period are created as follows:
+Attention: If you want to create multiple PyReader objects (for example, two different PyReader objects for the training and inference phases), you have to give each PyReader object a different name, since PaddlePaddle uses names to distinguish variables and :code:`Program.clone()` (see :ref:`api_fluid_Program_clone` ) can't copy PyReader objects.

 .. code-block:: python

    import paddle.fluid as fluid

    train_py_reader = fluid.layers.py_reader(capacity=64,
-                                             shapes=[(-1,3,224,224), (-1,1)],
+                                             shapes=[(-1,784), (-1,1)],
                                             dtypes=['float32', 'int64'],
                                             name='train',
                                             use_double_buffer=True)
@@ -46,11 +45,10 @@ To create some different PyReader objects (Usually, you have to create two diffe
                                            name='test',
                                            use_double_buffer=True)

-Note: You could not copy PyReader object with :code:`Program.clone()` so you have to create PyReader objects in training and testing phase with the method mentioned above
+When using PyReader, if you need to share model parameters between the training and test phases, you can use :code:`fluid.unique_name.guard()` .
+Notes: Paddle uses names to distinguish variables, and the names are generated by a counter in the :code:`unique_name` module that increases by one every time a variable name is generated. :code:`fluid.unique_name.guard()` resets the counter in the :code:`unique_name` module, so that the same variable names are generated each time the network is built under :code:`fluid.unique_name.guard()` and the parameters can therefore be shared.

-Because you could not copy PyReader with :code:`Program.clone()` so you have to share the parameters of training phase with testing phase through :code:`fluid.unique_name.guard()` .
-
-Details are as follows:
+An example of configuring the training and test networks with PyReader is as follows:

 .. code-block:: python

@@ -61,41 +59,97 @@ Details are as follows:
    import numpy

    def network(is_train):
+        # Create py_reader object and give different names
+        # when is_train = True and is_train = False
        reader = fluid.layers.py_reader(
            capacity=10,
            shapes=((-1, 784), (-1, 1)),
            dtypes=('float32', 'int64'),
            name="train_reader" if is_train else "test_reader",
            use_double_buffer=True)
+        
+        # Use read_file() method to read out the data from py_reader
        img, label = fluid.layers.read_file(reader)
        ...
        # Here, we omitted the definition of loss of the model
        return loss , reader

+    # Create main program and startup program for training
    train_prog = fluid.Program()
    train_startup = fluid.Program()

    with fluid.program_guard(train_prog, train_startup):
+        # Use fluid.unique_name.guard() to share parameters with test network
        with fluid.unique_name.guard():
            train_loss, train_reader = network(True)
            adam = fluid.optimizer.Adam(learning_rate=0.01)
            adam.minimize(train_loss)

+    # Create main program and startup program for testing
    test_prog = fluid.Program()
    test_startup = fluid.Program()
    with fluid.program_guard(test_prog, test_startup):
+        # Use fluid.unique_name.guard() to share parameters with train network
        with fluid.unique_name.guard():
            test_loss, test_reader = network(False)

 Configure data source of PyReader objects
 ##########################################
-PyReader provides :code:`decorate_tensor_provider` and :code:`decorate_paddle_reader` , both of which receieve Python :code:`generator` as data source.The difference is:
+A PyReader object sets its data source through :code:`decorate_paddle_reader()` or :code:`decorate_tensor_provider()` . Both :code:`decorate_paddle_reader()` and :code:`decorate_tensor_provider()` receive a Python generator :code:`generator` as the parameter. :code:`generator` yields one batch of data at a time.
+
+  The differences between :code:`decorate_paddle_reader()` and :code:`decorate_tensor_provider()` are:
+
+  - The :code:`generator` of :code:`decorate_paddle_reader()` should return data of Numpy Array type, while the :code:`generator` of :code:`decorate_tensor_provider()` should return data of LoDTensor type.
+
+  - :code:`decorate_tensor_provider()` requires the data type and size of the LoDTensor returned by :code:`generator` to match the dtypes and shapes parameters specified when py_reader was configured, while :code:`decorate_paddle_reader()` has no such requirement because it converts the data type and size internally.
+
+  Specific usage is as follows:
+
+  .. code-block:: python
+
+     import paddle.fluid as fluid
+     import numpy as np

-1. :code:`decorate_tensor_provider` :  :code:`generator` generates a  :code:`list` or :code:`tuple` each time, with each element of :code:`list` or :code:`tuple` being :code:`LoDTensor` or Numpy array, and :code:`LoDTensor` or :code:`shape` of Numpy array must be the same as :code:`shapes` stated while PyReader is created.
+     BATCH_SIZE = 32

+     # Case 1: Use decorate_paddle_reader() method to set the data source of py_reader
+     # The generator yields Numpy-typed batched data
+     def fake_random_numpy_reader():
+         image = np.random.random(size=(BATCH_SIZE, 784))
+         label = np.random.randint(low=0, high=10, size=(BATCH_SIZE, 1))
+         yield image, label

-2. :code:`decorate_paddle_reader` :  :code:`generator` generates a :code:`list` or :code:`tuple` each time, with each element of :code:`list` or :code:`tuple` being Numpy array,but the :code:`shape` of Numpy array doesn't have to be the same as :code:`shape` stated while PyReader is created. :code:`decorate_paddle_reader` will :code:`reshape` Numpy array internally.
+     py_reader1 = fluid.layers.py_reader(
+         capacity=10,
+         shapes=((-1, 784), (-1, 1)),
+         dtypes=('float32', 'int64'),
+         name='py_reader1',
+         use_double_buffer=True)

+     py_reader1.decorate_paddle_reader(fake_random_numpy_reader)
+
+
+     # Case 2: Use decorate_tensor_provider() method to set the data source of py_reader
+     # The generator yields Tensor-typed batched data
+     def fake_random_tensor_provider():
+         image = np.random.random(size=(BATCH_SIZE, 784)).astype('float32')
+         label = np.random.randint(low=0, high=10, size=(BATCH_SIZE, 1)).astype('int64')
+
+         image_tensor = fluid.LoDTensor()
+         image_tensor.set(image, fluid.CPUPlace())
+
+         label_tensor = fluid.LoDTensor()
+         label_tensor.set(label, fluid.CPUPlace())
+         yield image_tensor, label_tensor
+
+     py_reader2 = fluid.layers.py_reader(
+         capacity=10,
+         shapes=((-1, 784), (-1, 1)),
+         dtypes=('float32', 'int64'),
+         name='py_reader2',
+         use_double_buffer=True)
+
+     py_reader2.decorate_tensor_provider(fake_random_tensor_provider)
 example usage：

 .. code-block:: python
@@ -142,32 +196,75 @@ example usage：
 Train and test model with PyReader
 ##################################

-Details are as follows（the remaining part of the code above）:
+An example of using PyReader to train and test a model is as follows:

 .. code-block:: python

+    import paddle
+    import paddle.fluid as fluid
+    import paddle.dataset.mnist as mnist
+    import six
+
+    def network(is_train):
+        # Create py_reader object and give different names
+        # when is_train = True and is_train = False
+        reader = fluid.layers.py_reader(
+            capacity=10,
+            shapes=((-1, 784), (-1, 1)),
+            dtypes=('float32', 'int64'),
+            name="train_reader" if is_train else "test_reader",
+            use_double_buffer=True)
+        img, label = fluid.layers.read_file(reader)
+        ...
+        # Here, we omitted the definition of loss of the model
+        return loss, reader
+
+    # Create main program and startup program for training
+    train_prog = fluid.Program()
+    train_startup = fluid.Program()
+
+    # Define train network
+    with fluid.program_guard(train_prog, train_startup):
+        # Use fluid.unique_name.guard() to share parameters with test network
+        with fluid.unique_name.guard():
+            train_loss, train_reader = network(True)
+            adam = fluid.optimizer.Adam(learning_rate=0.01)
+            adam.minimize(train_loss)
+
+    # Create main program and startup program for testing
+    test_prog = fluid.Program()
+    test_startup = fluid.Program()
+
+    # Define test network
+    with fluid.program_guard(test_prog, test_startup):
+        # Use fluid.unique_name.guard() to share parameters with train network
+        with fluid.unique_name.guard():
+            test_loss, test_reader = network(False)
+
+
    place = fluid.CUDAPlace(0)
-    startup_exe = fluid.Executor(place)
-    startup_exe.run(train_startup)
-    startup_exe.run(test_startup)
+    exe = fluid.Executor(place)

-    trainer = fluid.ParallelExecutor(
-        use_cuda=True, loss_name=train_loss.name, main_program=train_prog)
+    # Run startup program
+    exe.run(train_startup)
+    exe.run(test_startup)

-    tester = fluid.ParallelExecutor(
-        use_cuda=True, share_vars_from=trainer, main_program=test_prog)
+    # Compile programs
+    train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(loss_name=train_loss.name)
+    test_prog = fluid.CompiledProgram(test_prog).with_data_parallel(share_vars_from=train_prog)

+    # Set the data source of py_reader using decorate_paddle_reader() method
    train_reader.decorate_paddle_reader(
        paddle.reader.shuffle(paddle.batch(mnist.train(), 512), buf_size=8192))

    test_reader.decorate_paddle_reader(paddle.batch(mnist.test(), 512))

-    for epoch_id in xrange(10):
+    for epoch_id in six.moves.range(10):
        train_reader.start()
        try:
            while True:
-                print 'train_loss', numpy.array(
-                    trainer.run(fetch_list=[train_loss.name]))
+                loss = exe.run(program=train_prog, fetch_list=[train_loss])
+                print('train_loss', loss)
        except fluid.core.EOFException:
            print('End of epoch', epoch_id)
            train_reader.reset()
@@ -175,8 +272,8 @@ Details are as follows（the remaining part of the code above）:
        test_reader.start()
        try:
            while True:
-                print 'test loss', numpy.array(
-                    tester.run(fetch_list=[test_loss.name]))
+                loss = exe.run(program=test_prog, fetch_list=[test_loss])
+                print('test loss', loss)
        except fluid.core.EOFException:
            print('End of testing')
            test_reader.reset()
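
The :code:`fluid.unique_name.guard()` note in this file explains parameter sharing through identical variable names. A minimal sketch of that idea in plain Python is given here; it is a toy model under stated assumptions and not Paddle's implementation. ``NameGenerator``, ``generate``, ``guard`` and ``build_network`` are hypothetical names introduced only for illustration.

```python
import contextlib


class NameGenerator:
    """Toy counter-based name generator (hypothetical, not Paddle's code)."""

    def __init__(self):
        self.counters = {}

    def generate(self, prefix):
        # Each prefix has its own counter that rises by one per generated name.
        index = self.counters.get(prefix, 0)
        self.counters[prefix] = index + 1
        return "{}_{}".format(prefix, index)


_generator = NameGenerator()


def generate(prefix):
    return _generator.generate(prefix)


@contextlib.contextmanager
def guard():
    """Use a fresh counter inside the block, then restore the previous one."""
    global _generator
    previous = _generator
    _generator = NameGenerator()
    try:
        yield
    finally:
        _generator = previous


def build_network():
    # Hypothetical network: two layers, each creating a weight and a bias name.
    return [generate("fc.w"), generate("fc.b"), generate("fc.w"), generate("fc.b")]


# Built under separate guards, both networks receive the same names.
with guard():
    train_params = build_network()
with guard():
    test_params = build_network()

# Built without a guard, the counter keeps rising and the names differ.
no_guard_first = build_network()
no_guard_second = build_network()

print(train_params == test_params)
print(no_guard_first == no_guard_second)
```

Because the two guarded builds start from a fresh counter, a weight created in the training network and the corresponding weight in the test network end up with the same name, which is what allows them to refer to the same parameter.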

--- a/doc/fluid/user_guides/howto/training/cluster_howto_en.rst
+++ b/doc/fluid/user_guides/howto/training/cluster_howto_en.rst
@@ -35,7 +35,9 @@ In the training of data parallelism mode, Fluid uses two communication modes to

  The pserver process can be on a compute node that is completely different from the trainer, or it can share the same node with a trainer. The number of pserver processes required for a distributed task usually needs to be adjusted according to the actual situation to achieve the best performance. However, usually pserver processes are no more than trainer processes.

-  When using GPU training, the pserver can choose to use the GPU or only use the CPU. If the pserver also uses the GPU, it will result in the extra overhead of copying the gradient data received from the CPU to the GPU. In some cases, the overall training performance will be degraded.
+  **Note:** When using GPU training, the pserver can choose to use the GPU or only use the CPU. If the pserver also uses the GPU, it will result in the extra overhead of copying the gradient data received from the CPU to the GPU. In some cases, the overall training performance will be degraded.
+
+  **Note:** When using GPU training, if each trainer node has multiple GPU cards, gradients are first aggregated among the cards within a node through NCCL2, and then aggregated across nodes through the pserver.

 - Structure of NCCL2 communication method:

@@ -178,8 +180,10 @@ Distributed training in NCCL2 mode, because there is no parameter server role, t

 * Configure :code:`mode="nccl2"` in :code:`fluid.DistributeTranspilerConfig` .
 * When calling :code:`transpile`, :code:`trainers` is fed with the endpoints of all trainer nodes, and passed with the argument :code:`current_endpoint` .
+  In this step, :code:`gen_nccl_id_op` is added to the :code:`startup program` to synchronize NCCLID information during multi-machine program initialization.
 * Initialize :code:`ParallelExecutor` with :code:`num_trainers` and :code:`trainer_id` .
-
+  In this step, :code:`ParallelExecutor` initializes NCCL2 in multi-machine mode and performs :code:`allreduce` across nodes on the gradient of every parameter to carry out multi-machine training.
+   
 For example:

 .. code-block:: python
@@ -198,24 +202,45 @@ For example:
 .. csv-table:: Description of the necessary parameters for NCCL2 mode
 	:header: "parameter", "description"

-	"trainer_id", "The unique ID of each trainer node in the task, starting at 0, there cannot be any duplication"
-	"trainers", "endpoints of all trainer nodes in the task, used to broadcast NCCL IDs when NCCL2 is initialized"
-	"current_endpoint", "endpoint of current node"
+	"trainer_id", "(int) The unique ID of each trainer node in the task, starting at 0; there cannot be any duplication"
+	"trainers", "(string) Endpoints of all trainer nodes in the task, used to broadcast NCCL IDs when NCCL2 is initialized"
+	"current_endpoint", "(string) Endpoint of the current node"

 Currently, distributed training using NCCL2 only supports synchronous training. The distributed training using NCCL2 mode is more suitable for the model which is relatively large and needs \
 synchronous training and GPU training. If the hardware device supports RDMA and GPU Direct, this can achieve high distributed training performance.

+Start Up NCCL2 Distributed Training in Multi-Process Mode
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+ Launching an NCCL2 distributed training job in multi-process mode usually gives better multi-node training performance. Paddle provides the :code:`paddle.distributed.launch` module to start a multi-process job, in which each training process uses an independent GPU device.
+
+Points to note when using it:
+
+ * Number of nodes: set the number of nodes of a job through the environment variable :code:`PADDLE_NUM_TRAINERS` ; this variable is also set in every training process.
+ * Number of devices per node: set the number of GPU devices on each node through the :code:`--gpus` parameter; the sequence number of each process is set automatically in the environment variable :code:`PADDLE_TRAINER_ID` .
+ * Data split: multi-process mode runs one process per device. Generally, each process handles one part of the training data, so that all processes together cover the whole data set.
+ * Entry file: the entry file is the training script that is actually launched.
+ * Log: the log of each training process is saved in the :code:`./mylog` directory by default; you can specify another directory with the :code:`--log_dir` parameter.
+
+  startup example:
+
+  .. code-block:: bash
+
+     > PADDLE_NUM_TRAINERS=<TRAINER_COUNT> python -m paddle.distributed.launch --gpus <NUM_GPUS_ON_HOSTS> <ENTRYPOINT_SCRIPT> --arg1 --arg2 ...
+
 Important Notes on NCCL2 Distributed Training
 ++++++++++++++++++++++++++++++++++++++++++++++

-**Note** : Please ensure each node has the same amount of data to train in NCCL2 mode distributed training, which prevents
+**Note:** When using distributed training in NCCL2 mode, if you only want to use some of the cards on a node, you can specify them through an environment variable, for example :code:`export CUDA_VISIBLE_DEVICES=0,1,2,3` .
+
+**Note:** Please ensure each node has the same amount of data to train in NCCL2 mode distributed training, which prevents
 exit at the final iteration. There are two common ways:

 - Randomly sample some data to complement nodes where less data are distributed. (We recommend this method for sake of a complete dataset to be trained)
 - Each node only trains fixed number of batches per pass, which is controlled by python codes. If a node has more data than this fixed amount, then these 
  marginal data will not be trained.

-**Note** :  If there are multiple network devices in the system, you need to manually specify the devices used by NCCL2.
+**Note** : If there are multiple network devices in the system, you need to manually specify the devices used by NCCL2.

 Assuming you need to use :code:`eth2` as the communication device, you need to set the following environment variables:
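
The data-split item and the equal-data note in this file's changes can be sketched in plain Python. This is an illustration under assumptions, not Paddle API: ``rank`` and ``world_size`` stand for the index of a training process and the total number of training processes, and ``shard_for_rank`` and ``pad_to_size`` are hypothetical helper names.

```python
import random


def shard_for_rank(samples, rank, world_size):
    """Give each process every world_size-th sample, starting at its own rank."""
    return samples[rank::world_size]


def pad_to_size(shard, target_size, rng):
    """Top up a short shard with items re-sampled at random from the same shard."""
    padded = list(shard)
    while len(padded) < target_size:
        padded.append(rng.choice(shard))
    return padded


samples = list(range(10))  # stand-in for a dataset of 10 samples
world_size = 4
shards = [shard_for_rank(samples, rank, world_size) for rank in range(world_size)]

# Way 1: complement the smaller shards by random sampling, so no data is dropped.
rng = random.Random(0)
target_size = max(len(s) for s in shards)
padded = [pad_to_size(s, target_size, rng) for s in shards]

# Way 2: every process trains a fixed amount; the surplus data is not trained.
fixed_size = min(len(s) for s in shards)
truncated = [s[:fixed_size] for s in shards]

print([len(s) for s in shards])
print([len(s) for s in padded])
print([len(s) for s in truncated])
```

Padding keeps the whole data set in play at the cost of a few repeated samples, while truncation is simpler but leaves the surplus samples untrained, which matches the trade-off described in the two ways listed in the note.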


--- a/doc/fluid/user_guides/howto/training/save_load_variables.rst
+++ b/doc/fluid/user_guides/howto/training/save_load_variables.rst
 .. _user_guide_save_load_vars:

-#############################
-模型/变量的保存/载入与增量训练
-#############################
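+############################################################
+Saving, Loading and Incremental Training of Models/Variables
+############################################################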

 Categories of Model Variables
 #############################
@@ -37,6 +37,21 @@
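 then we should save all kinds of persistable variables, and may even need to record the current epoch and step id.
 This is because some model variables, although not parameters, are still indispensable for training the model.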

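+Differences between save_vars, save_params, save_persistables and save_inference_model
+######################################################################################
+1. :code:`save_inference_model` prunes the network according to the user-configured :code:`feeded_var_names` and :code:`target_vars` , and saves the ``__model__`` of the pruned network structure together with the persistable variables in the pruned network.
+
+2. :code:`save_persistables` does not save the network structure; it saves all persistable variables in the network to the specified location.
+
+3. :code:`save_params` does not save the network structure; it saves all model parameters in the network to the specified location.
+
+4. :code:`save_vars` does not save the network structure; it saves the variables in the :code:`fluid.framework.Parameter` list specified by the user.
+
+ :code:`save_persistables` saves the most complete set of network variables. For incremental training or resuming training, please use :code:`save_persistables` to save variables.
+ :code:`save_inference_model` saves the network parameters and the pruned model. If you intend to do inference-related work later, please use :code:`save_inference_model` to save the variables and the network.
+ :code:`save_vars` and :code:`save_params` should only be used when you clearly understand their purpose and have a special need; they are generally not recommended.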
+
+
 Saving a Model for Inference on New Samples
 ===========================================

@@ -97,8 +112,6 @@
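 In addition, note in particular that :code:`fluid.default_startup_program()` must be run before calling :code:`fluid.io.load_params` .
 If it is run afterwards, it may overwrite the loaded model parameters and cause errors.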

-
-
 Saving and Loading Inference Models
 ###################################

@@ -162,6 +175,7 @@

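 1. When calling :code:`fluid.io.save_persistables` at the end of training to save persistable variables, not every trainer needs to call it; generally it is enough for trainer 0 to do the saving.
 2. For multi-machine incremental training, parameters are loaded on the PServer side and trainers do not need to load them. After all PServers have started, the trainers synchronize parameters from the PServer side.
+3. When incremental training is confirmed to be needed, the ``current_endpoint`` parameter must be specified when calling :code:`fluid.DistributeTranspiler.transpile` in multi-machine training.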

 The general steps of multi-machine incremental training (without distributed large-scale sparse matrices) are:

@@ -211,7 +225,7 @@
    training_role = "PSERVER"
    config = fluid.DistributeTranspilerConfig()
    t = fluid.DistributeTranspiler(config=config)
-    t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, sync_mode=True)
+    t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, sync_mode=True, current_endpoint=current_endpoint)

    if training_role == "PSERVER":
        current_endpoint = "127.0.0.1:1001"

--- a/doc/fluid/user_guides/index_cn.rst
+++ b/doc/fluid/user_guides/index_cn.rst
@@ -14,8 +14,7 @@

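    - `Training Neural Networks <../user_guides/howto/training/index_cn.html>`_: introduces how to use Fluid for single-machine training, multi-machine training, and saving and loading model variables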

-
-	- `DyGraph模式 <../user_guides/howto/dygraph/DyGraph.md>`_：介绍在 Fluid 下使用DyGraph       
+    - `DyGraph Mode <../user_guides/howto/dygraph/DyGraph.html>`_: introduces how to use DyGraph in Fluid

    - `Model Evaluation and Debugging <../user_guides/howto/evaluation_and_debugging/index_cn.html>`_: introduces methods for model evaluation and debugging in Fluid, including:


--- a/doc/fluid/user_guides/models/index_cn.rst
+++ b/doc/fluid/user_guides/models/index_cn.rst
@@ -14,8 +14,7 @@ Fluid模型配置和参数文件的工具。
 -  `AlexNet <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
 -  `VGG <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
 -  `GoogleNet <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
-  `Residual
-   Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
+-  `Residual Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
 -  `Inception-v4 <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
 -  `MobileNet <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
 -  `Dual Path
@@ -122,7 +121,7 @@ RNN 结构的 NMT 得以应运而生，例如基于卷积神经网络 CNN
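 Attention to learn the context dependency in language. Compared with RNN/CNN,
 this structure has lower computational complexity within a single layer, is easier to parallelize, and models long-range dependencies more easily, and finally achieves the best translation results among multiple languages.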

-  `Transformer <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README_cn.md>`__
+-  `Transformer <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README.md>`__

 Reinforcement Learning
 ----------------------
@@ -163,7 +162,7 @@ DQN 及其变体，并测试了它们在 Atari 游戏中的表现。

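 The DAM (Deep Attention Matching Network) released in this example is the work of Baidu's Natural Language Processing Department published at ACL-2018, and is used to select responses in multi-turn dialogue for retrieval-based chatbots. Inspired by Transformer, DAM's network structure is based entirely on the attention mechanism. It uses stacked self-attention to learn semantic representations of responses and contexts at different granularities, and then uses cross-attention to capture the relevance between responses and contexts. It outperforms other models on two large-scale multi-turn dialogue datasets.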

- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/deep_attention_matching_net>`__
+- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/deep_attention_matching>`__

 AnyQ
 ----
@@ -174,8 +173,7 @@ AnyQ

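 SimNet is a semantic matching framework independently developed by Baidu's Natural Language Processing Department in 2013. The framework is widely used across Baidu products and mainly includes core network structures such as BOW, CNN, RNN and MM-DNN. Mainstream semantic matching models from academia, such as MatchPyramid, MV-LSTM and K-NRM, have also been integrated on top of the framework. Models built with SimNet can be easily added to the AnyQ system to enhance its semantic matching capability.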

-  `SimNet in PaddlePaddle
-   Fluid <https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md>`__
+-  `SimNet in PaddlePaddle Fluid <https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md>`_

 Machine Reading Comprehension
 -----------------------------
@@ -184,7 +182,7 @@ SimNet是百度自然语言处理部于2013年自主研发的语义匹配框架

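 The Baidu reading comprehension dataset is a real-world dataset open-sourced by Baidu's Natural Language Processing Department. All questions and source documents come from real data (Baidu search engine data and the Baidu Zhidao Q&A community), and the answers are written by humans. Each question corresponds to multiple answers. The dataset contains 200k questions, 1000k source documents and 420k answers, and is currently the largest Chinese MRC dataset. Baidu has also open-sourced the corresponding reading comprehension model, called DuReader, which adopts the currently common layered network structure, captures the interaction between questions and source documents through a bidirectional attention mechanism to generate query-aware document representations, and finally predicts the answer span with a pointer network based on these representations.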

-  `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/machine_reading_comprehension/README.md>`__
+-  `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/reading_comprehension>`_


 个性化推荐

--- a/doc/fluid/user_guides/models/index_en.rst
+++ b/doc/fluid/user_guides/models/index_en.rst
@@ -97,7 +97,7 @@ Machine Translation transforms a natural language (source language) into another
 The Transformer implemented in this example is a machine translation model based on the self-attention mechanism, in which there is no more RNN or CNN structure, but fully utilizes Attention to learn the context dependency. Compared with RNN/CNN, in a single layer, this structure has lower computational complexity, easier parallelization, and easier modeling for long-range dependencies, and finally achieves the best translation effect among multiple languages.


- `Transformer <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README_cn.md>`__
+- `Transformer <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README.md>`__

 Reinforcement learning
 -------------------------
@@ -131,7 +131,7 @@ In many scenarios of natural language processing, it is necessary to measure the

 The DAM (Deep Attention Matching Network) introduced in this example is the work of Baidu Natural Language Processing Department published in ACL-2018, which is used for the selection of responses in multi-round dialogue of retrieval chat robots. Inspired by Transformer, DAM is based entirely on the attention mechanism. It uses the stack-type self-attention structure to learn the semantic representations of responses and contexts at different granularities, and then uses cross-attention to obtain relativity between responses and contexts. The performance on the two large-scale multi-round dialogue datasets is better than other models.

- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/deep_attention_matching_net>`__
+- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/deep_attention_matching>`__

 AnyQ
 ----
@@ -151,7 +151,7 @@ Machine Reading Comprehension (MRC) is one of the core tasks in Natural Language

 Baidu reading comprehension dataset is an open-source real-world dataset publicized by Baidu Natural Language Processing Department. All the questions and original texts are derived from actual data (Baidu search engine data and Baidu know Q&A community), and the answer is given by humans. Each question corresponds to multiple answers. The dataset contains 200k questions, 1000k original text and 420k answers. It is currently the largest Chinese MRC dataset. Baidu also publicized the corresponding open-source reading comprehension model, called DuReader. DuReader adopts the current common network hierarchical structure, and captures the interaction between the problems and the original texts through the double attention mechanism to generate the original representation of the query-aware. Finally, based on the original text of query-aware, the answer scope is predicted by point network.

- `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/machine_reading_comprehension/README.md>`__
+- `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/reading_comprehension>`__


 Personalized recommendation

--- a/Paddle @ 2ff867f8
+++ b/Paddle @ 2ff867f8
-Subproject commit 7b453631fc29e4fe2f68757fc458634d55690b3d
+Subproject commit 2ff867f88628e9cb8b76eaf79045eca0f52e5b85