Commit d55f1fe4 authored by: J JiabinYang

refine dygraph doc

......@@ -96,7 +96,7 @@ There are two modes in term of memory management in `PaddleBuf` :
Of the two modes, the first is more convenient, while the second gives strict control over memory management to facilitate integration with `tcmalloc` and other libraries.
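As a hedged illustration of the two modes (a minimal sketch; the `PaddleBuf` constructors used here follow the public inference API header, and the sizes are made up):

```c++
#include "paddle_inference_api.h"

void PaddleBufModesDemo() {
  // Mode 1: PaddleBuf allocates and manages the memory (1 MB) by itself.
  paddle::PaddleBuf owned(1 << 20);

  // Mode 2: PaddleBuf only wraps memory that the caller allocates and frees,
  // for example a buffer obtained from tcmalloc or another allocator.
  static float external[256];
  paddle::PaddleBuf borrowed(static_cast<void*>(external), sizeof(external));
}
```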
### Upgrade performance based on contrib::AnalysisConfig
AnalysisConfig is at the pre-release stage and protected by `namespace contrib`, and may be adjusted in the future.
......@@ -106,9 +106,11 @@ The usage of `AnalysisConfig` is similiar with that of `NativeConfig` but the fo
```c++
AnalysisConfig config;
config.SetModel(dirname);  // set the directory of the model
config.EnableUseGpu(100, 0 /*gpu id*/);  // use the GPU, or
config.DisableGpu();  // use the CPU
config.SwitchSpecifyInputNames(true);  // the names of the inputs need to be specified
config.SwitchIrOptim();  // turn on the IR optimization switch so that a sequence of optimizations is applied at runtime
```
Note that the memory of the input `PaddleTensor` needs to be allocated. The previous examples need to be revised as follows:
......@@ -125,11 +127,29 @@ tensor.dtype = paddle::PaddleDType::INT64;
tensor.name = "input0"; // the name needs to be set here
```
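For reference, a hedged sketch of preparing one input tensor end to end might look as follows (the shape and fill values are invented for illustration; the field names follow the inference API header):

```c++
paddle::PaddleTensor tensor;
tensor.name = "input0";                                // must match the model's input name
tensor.shape = std::vector<int>({1, 4});               // example shape
tensor.data = paddle::PaddleBuf(4 * sizeof(int64_t));  // allocate the buffer explicitly
int64_t* input_data = static_cast<int64_t*>(tensor.data.data());
for (int i = 0; i < 4; ++i) input_data[i] = i;         // fill in example data
tensor.dtype = paddle::PaddleDType::INT64;
```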
The subsequent execution process is exactly the same as with `NativeConfig`.
### Variable-length sequence input
When dealing with variable-length sequence inputs, you need to set the LoD of the `PaddleTensor`.
``` c++
// Suppose the sequence lengths are [3, 2, 4, 1, 2, 3] in order.
tensor.lod = {{0,
/*0 + 3=*/3,
/*3 + 2=*/5,
/*5 + 4=*/9,
/*9 + 1=*/10,
/*10 + 2=*/12,
/*12 + 3=*/15}};
```
For more specific examples, please refer to [LoD-Tensor Instructions](../../../user_guides/howto/basic_concept/lod_tensor_en.html)
### Suggestions on Performance
1. If the CPU type permits, it is best to use a version built with AVX and MKL support.
2. Reuse the input and output `PaddleTensor` objects to avoid the performance cost of frequent memory allocation (see the sketch after this list).
3. Try to replace `NativeConfig` with `AnalysisConfig` to enable graph optimizations for CPU or GPU inference.
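A hedged sketch of point 2: keep the tensor vectors alive across calls instead of rebuilding them each time (predictor creation details and data filling are elided; `config` is assumed to be set up as above):

```c++
auto predictor = paddle::CreatePaddlePredictor(config);
std::vector<paddle::PaddleTensor> inputs(1), outputs;  // created once, outside the loop
for (int step = 0; step < 100; ++step) {
  // Refill inputs[0].data in place; the underlying buffer is kept and only
  // grows when a larger batch arrives, so per-step allocation is avoided.
  predictor->Run(inputs, &outputs);
}
```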
## Code Demo
......
# Use Paddle-TensorRT Library for inference
NVIDIA TensorRT is a platform for high-performance deep learning inference. It delivers low latency and high throughput for deep learning inference applications.
Subgraph is used in PaddlePaddle to preliminarily integrate TensorRT, which enables the TensorRT module to enhance the inference performance of Paddle models. The module is still under development. Currently supported models include AlexNet, MobileNet, ResNet50, VGG19, ResNext, Se-ReNext, GoogleNet, DPN, ICNET, Deeplabv3, MobileNet-SSD, and so on. This document introduces how to obtain the Paddle-TensorRT library, how to use it, and how it works.
## Contents
- [compile Paddle-TRT inference libraries](#compile Paddle-TRT inference libraries)
- [Paddle-TRT interface usage](#Paddle-TRT interface usage)
- [Paddle-TRT example compiling test](#Paddle-TRT example compiling test)
- [Paddle-TRT INT8 usage](#Paddle-TRT_INT8 usage)
- [Paddle-TRT subgraph operation principle](#Paddle-TRT subgraph operation principle)
## <a name="compile Paddle-TRT inference libraries">compile Paddle-TRT inference libraries</a>
**Use Docker to build inference libraries**
......@@ -42,7 +48,7 @@ Subgraph is used in Paddle 1.0 to preliminarily integrate TensorRT, which enable
make inference_lib_dist -j
```
## <a name="Paddle-TRT interface usage">Paddle-TRT interface usage</a>
[`paddle_inference_api.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/api/paddle_inference_api.h) defines all APIs of Paddle-TRT.
......@@ -58,17 +64,17 @@ A complete process is shown below:
#include "paddle_inference_api.h"
namespace paddle {
using paddle::AnalysisConfig;
void RunTensorRT(int batch_size, std::string model_dirname) {
// 1. Create AnalysisConfig
AnalysisConfig config(model_dirname);
// config.SetModel(model_dirname + "/model",
//                 model_dirname + "/params");
config.EnableUseGpu(100, 0 /*gpu_id*/);
config.EnableTensorRtEngine(1 << 20 /*work_space_size*/, batch_size /*max_batch_size*/);
// 2. Create predictor based on config
auto predictor = CreatePaddlePredictor(config);
// 3. Create input tensor
......@@ -104,10 +110,33 @@ int main() {
```
The compilation process is [here](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/inference/api/demo_ci)
## <a name="Paddle-TRT_INT8 usage">Paddle-TRT INT8 usage</a>
1. Paddle-TRT INT8 introduction
The parameters of a neural network are redundant to some extent. For many tasks, we can convert a Float32 model into an Int8 model while preserving accuracy. At present, Paddle-TRT supports converting a trained Float32 model into an Int8 model offline. The specific steps are as follows: 1) **Create the calibration table**. We prepare about 500 real input samples and feed them to the model. Paddle-TRT collects the range information of each op's input and output values in the model and records it in a calibration table; this information reduces the information loss during model conversion. 2) After the calibration table has been created, run the model again: **Paddle-TRT loads the calibration table automatically** and performs inference in INT8 mode.
2. Compile and test the INT8 example
```shell
cd SAMPLE_BASE_DIR/sample
# sh run_impl.sh {the address of inference libraries} {the name of test script} {model directories}
# We generate 500 input samples to simulate the process; it is suggested that you use real data for your experiments.
sh run_impl.sh BASE_DIR/fluid_inference_install_dir/ fluid_generate_calib_test SAMPLE_BASE_DIR/sample/mobilenetv1
```
After the run finishes, a new file named trt_calib_* appears under the `SAMPLE_BASE_DIR/sample/build/mobilenetv1` model directory; this file is the calibration table.
``` shell
# conduct INT8 inference
# copy the model file with calibration tables to a specific address
cp -rf SAMPLE_BASE_DIR/sample/build/mobilenetv1 SAMPLE_BASE_DIR/sample/mobilenetv1_calib
sh run_impl.sh BASE_DIR/fluid_inference_install_dir/ fluid_int8_test SAMPLE_BASE_DIR/sample/mobilenetv1_calib
```
## <a name="Paddle-TRT subgraph operation principle">Paddle-TRT subgraph operation principle</a>
Subgraph is used to integrate TensorRT in PaddlePaddle. After a model is loaded, the neural network is represented as a computing graph composed of variables and computing nodes. What Paddle-TensorRT does is scan the whole graph, discover subgraphs that can be optimized with TensorRT, and replace them with TensorRT nodes. During inference, Paddle calls the TensorRT library to execute the TensorRT nodes and calls Paddle's native library for the other nodes. During inference, TensorRT can fuse Ops horizontally and vertically to filter out redundant Ops, and it chooses an appropriate kernel for a specific Op on a specific platform to speed up inference.
A simple model illustrates the process:
......@@ -121,6 +150,6 @@ A simple model expresses the process :
<img src="https://raw.githubusercontent.com/NHZlX/FluidDoc/add_trt_doc/doc/fluid/user_guides/howto/inference/image/model_graph_trt.png" width="600">
</p>
In the original network, the green nodes represent nodes supported by TensorRT, the red nodes represent variables in the network, and the yellow nodes represent nodes that can only be executed by Paddle's native functions. The green nodes in the original network are extracted into a subgraph, which is replaced by a single TensorRT node, shown as the `block-25` node in the network. When such a node is encountered at runtime, the TensorRT library is called to execute it.
......@@ -2,14 +2,19 @@
## Overview of Concepts
This section briefly introduces the base classes you will need; for a detailed introduction please refer to the [design document](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/motivation/refactorization.md#operatoropwithkernelopkernel).
- `framework::OperatorBase`: the base class of operators (Op for short).
- `framework::OpKernel`: the base class of an Op's computation functions, called Kernels.
- `framework::OperatorWithKernel`: inherits from OperatorBase; an Op derived from it has computation functions and is said to have Kernels.
- `framework::OpProtoAndCheckerMaker`: describes the Op's inputs, outputs, attributes, and comments; mainly used to generate the Python API.
Depending on whether they contain a Kernel, Ops fall into two kinds: Ops with Kernels and Ops without Kernels:
- Ops with Kernels inherit from `OperatorWithKernel`. Their implementation depends on the input data type, the data layout, the device the data lives on, and the third-party libraries the Op calls. For example, ConvOp is usually implemented with the matrix multiplication from the MKL library when computed on CPU, and with the matrix multiplication from the cuBLAS library or the convolution from the cuDNN library when computed on GPU.
- Ops without Kernels inherit from `OperatorBase`, because their behavior does not depend on the device or on the input data, e.g. WhileOp and IfElseOp.
This tutorial focuses on how to write an Op with Kernels. In brief, such an Op consists of the following parts:
<table>
<thead>
......@@ -21,7 +26,7 @@
<tbody>
<tr>
<td>OpProtoMake definition </td>
<td>.cc file </td>
</tr>
<tr>
<td>Op definition </td>
......@@ -38,16 +43,11 @@
</tbody>
</table>
New Ops are added under the directory [paddle/fluid/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators); the files are named `*_op.h` (if any), `*_op.cc` and `*_op.cu` (if any). **The system automatically builds the op and its corresponding Python extension based on the file name.**
Below we use the matrix multiplication operator, [MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc), as an example to show how to write an Operator with Kernels.
## Implementing the C++ Classes
### Defining the ProtoMaker Class
The formula for matrix multiplication is $Out = X * Y$, so the computation has two inputs and one output.
......@@ -57,73 +57,99 @@
```cpp
class MulOpMaker : public framework::OpProtoAndCheckerMaker {
public:
void Make() override {
AddInput("X", "(Tensor), The first input tensor of mul op.");
AddInput("Y", "(Tensor), The second input tensor of mul op.");
AddOutput("Out", "(Tensor), The output tensor of mul op.");
AddAttr<int>(
"x_num_col_dims",
R"DOC((int, default 1), The mul_op can take tensors with more than two
dimensions as its inputs. If the input $X$ is a tensor with more
than two dimensions, $X$ will be flattened into a two-dimensional
matrix first. The flattening rule is: the first `num_col_dims`
will be flattened to form the first dimension of the final matrix
(the height of the matrix), and the rest `rank(X) - num_col_dims`
dimensions are flattened to form the second dimension of the final
matrix (the width of the matrix). As a result, height of the
flattened matrix is equal to the product of $X$'s first
`x_num_col_dims` dimensions' sizes, and width of the flattened
matrix is equal to the product of $X$'s last `rank(x) - num_col_dims`
dimensions' size. For example, suppose $X$ is a 6-dimensional
tensor with the shape [2, 3, 4, 5, 6], and `x_num_col_dims` = 3.
Thus, the flattened matrix will have a shape [2 x 3 x 4, 5 x 6] =
[24, 30].
)DOC")
.SetDefault(1)
.EqualGreaterThan(1);
AddAttr<int>(
"y_num_col_dims",
R"DOC((int, default 1), The mul_op can take tensors with more than two,
dimensions as its inputs. If the input $Y$ is a tensor with more
than two dimensions, $Y$ will be flattened into a two-dimensional
matrix first. The attribute `y_num_col_dims` determines how $Y$ is
flattened. See comments of `x_num_col_dims` for more details.
)DOC")
.SetDefault(1)
.EqualGreaterThan(1);
AddComment(R"DOC(
Two Element Mul Operator.
The equation is: Out = X * Y
)DOC");
}
};
```
Mul Operator.
[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L76-L127)继承自`framework::OpProtoAndCheckerMaker`,构造函数含有2个参数:
This operator is used to perform matrix multiplication for input $X$ and $Y$.
- `framework::OpProto` : 前者存储Op的输入输出和参数属性,将用于Python API接口的生成。
- `framework::OpAttrChecker` :后者用于检查参数属性的合法性。
The equation is:
构造函数里通过`AddInput`添加输入参数,通过`AddOutput`添加输出参数,通过`AddComment`添加Op的注释。这些函数会将对应内容添加到`OpProto`中。
$$Out = X * Y$$
上面的代码在`MulOp`中添加两个输入`X``Y`,添加了一个输出`Out`,并解释了各自含义,命名请遵守[命名规范](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/name_convention.md)
再以[`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/scale_op.cc#L38-L55)为例:
Both the input $X$ and $Y$ can carry the LoD (Level of Details) information,
or not. But the output only shares the LoD information with input $X$.
[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc) inherits from `framework::OpProtoAndCheckerMaker`.
Developers define the Op's proto by overriding the `Make` function of `framework::OpProtoAndCheckerMaker`: inputs are added with `AddInput`, outputs with `AddOutput`, attributes with `AddAttr`, and the Op's comment with `AddComment`. These functions add the corresponding content to `OpProto`.
The code above adds two inputs, `X` and `Y`, and one output, `Out`, to `MulOp`, and explains the meaning of each. Please follow the [naming convention](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/name_convention.md) when naming them.
### Defining the GradProtoMaker Class
Normally every Op has a corresponding `GradProtoMaker`. To make coding easier, fluid provides a default one, `DefaultGradProtoMaker`. `DefaultGradProtoMaker` uses all inputs (`Input`) and outputs (`Output`) of the forward Op, together with the gradients of the output variables (`Output@Grad`), as inputs of the backward Op, and uses the gradients of the forward Op's input variables (`Input@Grad`) as its outputs.
**Note:**
Do not put variables that the backward Op never uses into the backward Op's input list; otherwise the memory of those unused variables cannot be reclaimed in time, which may force models that use this Op to run with a smaller batch_size.
For example, the forward computation of `relu` is `out.device(d) = x.cwiseMax(static_cast<T>(0));` and its backward computation is `dx.device(d) = dout * (out > static_cast<T>(0)).template cast<T>();`. Clearly, the backward pass only uses `out`, `dout`, and `dx`, and never uses `x`.
The example below defines the GradProtoMaker of `MulOp`.
```cpp
class MulOpGradMaker : public framework::SingleGradOpDescMaker {
public:
using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
protected:
std::unique_ptr<framework::OpDesc> Apply() const override {
std::unique_ptr<framework::OpDesc> retv(new framework::OpDesc());
retv->SetType("mul_grad");
retv->SetInput("X", Input("X"));
retv->SetInput("Y", Input("Y"));
retv->SetInput(framework::GradVarName("Out"), OutputGrad("Out"));
retv->SetOutput(framework::GradVarName("X"), InputGrad("X"));
retv->SetOutput(framework::GradVarName("Y"), InputGrad("Y"));
retv->SetAttrMap(Attrs());
return retv;
}
};
```
**Note:**
- Some Ops have identical forward and backward logic, for example [`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/scale_op.cc). In that case the forward and backward Ops can share the same Kernel.
- A forward Op may correspond to multiple backward Ops, for example [`SumOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/sum_op.cc). In that case the `GradMaker` needs to inherit from `framework::GradOpDescMakerBase`.
- The backward of some Ops corresponds to the forward of another Op, for example [`SplitOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/split_op.h). In that case the Type of the backward Op of `SplitOp` defined in [`SplitGradMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/split_op.h#L52) is simply `concat`.
### Defining the Operator Class
The definition of MulOp is implemented as follows:
......@@ -134,20 +160,53 @@ class MulOp : public framework::OperatorWithKernel {
using framework::OperatorWithKernel::OperatorWithKernel;
protected:
void InferShape(framework::InferShapeContext* ctx) const override {
PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) of MulOp should not be null.");
PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) of MulOp should not be null.");
PADDLE_ENFORCE(ctx->HasOutput("Out"),
"Output(Out) of MulOp should not be null.");
auto x_dims = ctx->GetInputDim("X");
auto y_dims = ctx->GetInputDim("Y");
int x_num_col_dims = ctx->Attrs().Get<int>("x_num_col_dims");
int y_num_col_dims = ctx->Attrs().Get<int>("y_num_col_dims");
VLOG(3) << "mul operator x.shape=" << x_dims << " y.shape=" << y_dims
<< " x_num_col_dims=" << x_num_col_dims
<< " y_num_col_dims=" << y_num_col_dims;
PADDLE_ENFORCE_GT(
x_dims.size(), x_num_col_dims,
"The input tensor X's rank of MulOp should be larger than "
"x_num_col_dims.");
PADDLE_ENFORCE_GT(
y_dims.size(), y_num_col_dims,
"The input tensor Y's rank of MulOp should be larger than "
"y_num_col_dims: %ld vs %ld",
y_dims.size(), y_num_col_dims);
auto x_mat_dims = framework::flatten_to_2d(x_dims, x_num_col_dims);
auto y_mat_dims = framework::flatten_to_2d(y_dims, y_num_col_dims);
PADDLE_ENFORCE_EQ(x_mat_dims[1], y_mat_dims[0],
"First matrix's width must be equal with second matrix's "
"height. %s, %s",
x_mat_dims[1], y_mat_dims[0]);
std::vector<int64_t> output_dims;
output_dims.reserve(
static_cast<size_t>(x_num_col_dims + y_dims.size() - y_num_col_dims));
for (int i = 0; i < x_num_col_dims; ++i) {
output_dims.push_back(x_dims[i]);
}
for (int i = y_num_col_dims; i < y_dims.size(); ++i) {
output_dims.push_back(y_dims[i]);
}
ctx->SetOutputDim("Out", framework::make_ddim(output_dims));
ctx->ShareLoD("X", /*->*/ "Out");
}
};
```
......@@ -167,20 +226,22 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
: OperatorWithKernel(type, inputs, outputs, attrs) {}
```
You also need to override the `InferShape` interface. `InferShape` is a const function that must not modify the Op's member variables; its parameter is `framework::InferShapeContext* ctx`, through which the inputs, outputs, and attributes can be accessed. Its job is to:
- Run checks and report errors as early as possible: verify that the dimensions, types, etc. of the inputs are valid.
- Set the shape and the LoD information of the output Tensor.
The definitions of `OpProtoMaker` and the `Op` class are usually written in a `.cc` file, together with the registration functions introduced below.
**Note:** `InferShape` is usually called both at [compile time and at runtime](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/getstarted/Developer's_Guide_to_Paddle_Fluid.md#%E8%AE%A9%E6%88%91%E4%BB%AC%E5%9C%A8fluid%E7%A8%8B%E5%BA%8F%E5%AE%9E%E4%BE%8B%E4%B8%AD%E5%8C%BA%E5%88%86%E7%BC%96%E8%AF%91%E6%97%B6%E5%92%8C%E8%BF%90%E8%A1%8C%E6%97%B6). Some NLP tasks involve many variable-length operations; at compile time Paddle represents a variable-length dimension as -1, so when checking inputs or inferring the shapes of the output variables, the Op must take into account that a dimension of an input variable may be -1.
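To make that note concrete, a hedged sketch of guarding a dimension check against compile-time unknown (-1) dimensions might look like this (illustrative only, not the actual mul_op code):

```cpp
auto x_dims = ctx->GetInputDim("X");
auto y_dims = ctx->GetInputDim("Y");
// A value of -1 means the dimension is unknown until runtime, so only compare
// the two dimensions when both are already known at compile time.
if (x_dims[1] >= 0 && y_dims[0] >= 0) {
  PADDLE_ENFORCE_EQ(x_dims[1], y_dims[0],
                    "The width of X must be equal to the height of Y.");
}
```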
### Defining the OpKernel Class
`MulKernel` inherits from `framework::OpKernel` and takes the following two template parameters:
- `typename DeviceContext`: the device type. When different devices (CPU, CUDA) share the same Kernel, this template parameter is needed; when they do not share it, it is omitted. An example of a Kernel that is not shared is [`SGDOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/optimizers/sgd_op.h).
- `typename T`: the data type, such as `float`, `double`, or `int16`.
The `Compute` interface needs to be overridden for the `MulKernel` class.
......@@ -192,33 +253,53 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
The Op's inputs and outputs can be obtained through `ExecutionContext::Input<T>()` and `ExecutionContext::Output<T>()` respectively.
**Note:** If the variable type of an op's input/output is `LoDTensor` (in fluid, every `Tensor` is a `LoDTensor` by default), write `ExecutionContext::Input<LoDTensor>()` and `ExecutionContext::Output<LoDTensor>()`; do not write `ExecutionContext::Input<Tensor>()` or `ExecutionContext::Output<Tensor>()`, because if the actual variable type is `SelectedRows`, the `Input<Tensor>()` and `Output<Tensor>()` methods will specialize `SelectedRows` into `Tensor`, which is a latent source of errors.
The implementation of `Compute` for `MulKernel` is shown below:
```cpp
template <typename DeviceContext, typename T>
class MulKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
const Tensor* x = context.Input<Tensor>("X");
const Tensor* y = context.Input<Tensor>("Y");
Tensor* z = context.Output<Tensor>("Out");
const Tensor x_matrix =
x->dims().size() > 2
? framework::ReshapeToMatrix(
*x, context.template Attr<int>("x_num_col_dims"))
: *x;
const Tensor y_matrix =
y->dims().size() > 2
? framework::ReshapeToMatrix(
*y, context.template Attr<int>("y_num_col_dims"))
: *y;
z->mutable_data<T>(context.GetPlace());
auto z_dim = z->dims();
if (z_dim.size() != 2) {
z->Resize({x_matrix.dims()[0], y_matrix.dims()[1]});
}
auto blas = math::GetBlas<DeviceContext, T>(context);
blas.MatMul(x_matrix, y_matrix, z);
if (z_dim.size() != 2) {
z->Resize(z_dim);
}
}
};
```
Note that **different devices (CPU, CUDA) share a single Op definition; whether they can share the same `OpKernel` depends on whether the functions called inside `Compute` support all of those devices.**
The CPU and CUDA implementations of `MulOp` share the same `Kernel`. For an example where the `OpKernel` is not shared, see [`SGDOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/optimizers/sgd_op.h).
To keep the `Compute` implementation simple and to let CPU and CUDA share code, we usually implement the `Compute` interface with the Eigen unsupported Tensor module. For how to use the Eigen library in PaddlePaddle, please refer to the [usage document](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/use_eigen_cn.md).
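As an example, a hedged sketch of a `Compute` written with Eigen could look like the relu-style kernel below (names such as `framework::EigenVector` follow the usage document; treat this as illustrative rather than production code):

```cpp
template <typename DeviceContext, typename T>
class ReluKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
    auto* x = context.Input<framework::LoDTensor>("X");
    auto* out = context.Output<framework::LoDTensor>("Out");
    out->mutable_data<T>(context.GetPlace());

    // Flatten both tensors into 1-D Eigen views; the same expression then runs
    // on CPU or CUDA depending on the Eigen device it is evaluated on.
    auto x_e = framework::EigenVector<T>::Flatten(*x);
    auto out_e = framework::EigenVector<T>::Flatten(*out);
    auto& place =
        *context.template device_context<DeviceContext>().eigen_device();
    out_e.device(place) = x_e.cwiseMax(static_cast<T>(0));
  }
};
```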
At this point the forward Op is complete. Next, the op and its kernels need to be registered in the `.cc` file.
The backward Op class and the backward OpKernel are defined in the same way as their forward counterparts, so they are not repeated here.
### Registering the Operator
......@@ -227,11 +308,14 @@ Op的输入和输出可分别通过`ExecutionContext::Input<T>()`和`ExecutionCo
```cpp
namespace ops = paddle::operators;
REGISTER_OPERATOR(mul, ops::MulOp, ops::MulOpMaker,
ops::MulOpGradMaker)
REGISTER_OPERATOR(mul_grad, ops::MulGradOp)
REGISTER_OP_CPU_KERNEL(mul,
ops::MulKernel<paddle::platform::CPUDeviceContext, float>,
ops::MulKernel<paddle::platform::CPUDeviceContext, double>);
REGISTER_OP_CPU_KERNEL(mul_grad,
ops::MulGradKernel<paddle::platform::CPUDeviceContext, float>,
ops::MulGradKernel<paddle::platform::CPUDeviceContext, double>);
```
In the code above:
......@@ -250,11 +334,38 @@ Op的输入和输出可分别通过`ExecutionContext::Input<T>()`和`ExecutionCo
#define EIGEN_USE_GPU
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(mul,
ops::MulKernel<paddle::platform::CUDADeviceContext, float>,
ops::MulKernel<paddle::platform::CUDADeviceContext, double>);
REGISTER_OP_CUDA_KERNEL(mul_grad,
ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>,
ops::MulGradKernel<paddle::platform::CUDADeviceContext, double>);
```
**Note:**
When an Op runs, the framework automatically selects the appropriate OpKernel based on information such as the device the input data lives on and the input data type. For example, if the input data is on a GPU and of type `float`, the framework selects `ops::MulKernel<paddle::platform::CUDADeviceContext, float>` registered by `REGISTER_OP_CUDA_KERNEL`. If you want to control which OpKernels can be chosen at runtime, override the `GetExpectedKernelType` function of `framework::OperatorWithKernel`; for example, `ConvOp` decides whether to call the conv operation provided by the cudnn library according to whether the attribute `use_cudnn` is `false` or `true`:
```
framework::OpKernelType ConvOp::GetExpectedKernelType(
const framework::ExecutionContext& ctx) const {
int customized_type_value =
framework::OpKernelType::kDefaultCustomizedTypeValue;
framework::LibraryType library{framework::LibraryType::kPlain};
auto input_data_type = ctx.Input<Tensor>("Input")->type();
std::string data_format = ctx.Attr<std::string>("data_format");
framework::DataLayout layout = framework::StringToDataLayout(data_format);
#ifdef PADDLE_WITH_CUDA
if (ctx.Attr<bool>("use_cudnn")) {
library = framework::LibraryType::kCUDNN;
}
#endif
auto type = framework::OpKernelType(input_data_type, ctx.GetPlace(), layout,
library, customized_type_value);
return type;
}
```
### Building
Run the following command to build:
......@@ -267,13 +378,25 @@ make mul_op
The system automatically generates the Python binding for the new op and links it into the generated library.
### Using the mul Operation to Build a Layer on the Python Side
On the Python side, the `mul` operation is used to build the FC layer, i.e.:
$$Out = Act({X*W + b})$$
For the concrete implementation, please refer to the [implementation of the FC layer](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/layers/nn.py#L205).
## Writing Unit Tests
Unit tests include comparing the forward Op's implementations on different devices (CPU, CUDA), comparing the backward Op's implementations on different devices (CPU, CUDA), and gradient tests of the backward Op. Below we walk through the [unit tests of `MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_mul_op.py).
**Note:**
The test cases should cover all branches of the Op as completely as possible.
### Forward Operator Unit Tests
Op unit tests inherit from `OpTest`. The concrete tests are implemented in `TestMulOp`. To test an Operator, you need to:
1. Define the inputs, outputs, and related attributes in the `setUp` function.
2. Generate random input data.
......@@ -333,7 +456,7 @@ Op单元测试继承自`OpTest`。各项更加具体的单元测试在`TestMulOp
New `test_*.py` unit tests added under the `python/paddle/fluid/tests/unittests/` directory are automatically added to the project and built.
Note that, **unlike building and testing a single Op, running the unit tests requires building the whole project**, with `WITH_TESTING` turned on, i.e. `cmake -DWITH_TESTING=ON ..`. After the build succeeds, run the following command to execute the unit tests:
```bash
make test ARGS="-R test_mul_op -V"
......@@ -365,7 +488,7 @@ PADDLE_ENFORCE_EQ(比较对象A, 比较对象B, 错误提示信息)
#### General Principles
Every place that uses a PADDLE_ENFORCE or PADDLE_ENFORCE_XX check must come with an appropriately detailed explanation. <font color="#FF0000">**The error message must not be empty!**</font>
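A hedged before/after illustration of this principle (the check itself mirrors the MulOp example earlier in this document):

```cpp
// Not acceptable: the message is empty, so a failure tells the user nothing.
PADDLE_ENFORCE(ctx->HasInput("X"), "");

// Acceptable: the message states which input of which op is missing.
PADDLE_ENFORCE(ctx->HasInput("X"),
               "Input(X) of MulOp should not be null.");
```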
#### Standards for Writing Error Messages
......
......@@ -4,9 +4,9 @@
### 1. How Ops Are Built in Fluid
All Ops in Fluid inherit from `OperatorBase`, and all Ops are stateless. Each Op has only four member variables: type, inputs, outputs, and attributes.
The core method of an Op is Run. Run needs two kinds of resources, data and compute, which are obtained through `Scope` and `Place` respectively. The framework keeps a global `DeviceContextPool` that records the mapping between `Place` and `DeviceContext`: each `Place` has exactly one `DeviceContext`, and the `DeviceContext` holds the compute resources of the current device. For a GPU these resources include the `cudnn_handle`, the `cublas_handle`, the `stream`, and so on. **All computation inside an Op (data copies, CUDA kernels, etc.) must be done through the `DeviceContext`.**
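A hedged sketch of that lookup (names are from `paddle::platform`; this is illustrative, not the framework's exact code):

```cpp
// Each Place maps to exactly one DeviceContext held by the global pool.
auto& pool = paddle::platform::DeviceContextPool::Instance();
paddle::platform::DeviceContext* dev_ctx = pool.Get(place);
// For a CUDAPlace, dev_ctx owns the cudnn_handle/cublas_handle/stream; every
// copy or kernel launched by the Op should be issued through this context.
dev_ctx->Wait();  // e.g. block until all work queued on this device finishes
```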
Fluid is designed to run on a variety of devices and third-party libraries, so the implementation of some Ops differs by device or library. To support this, Fluid introduces OpKernels: one Op can have multiple OpKernels. Such Ops inherit from `OperatorWithKernel`; a representative example is conv_op, whose OpKernels are `GemmConvKernel`, `CUDNNConvOpKernel`, and `ConvMKLDNNOpKernel`, each available with double and float data types. A representative Op that needs no OpKernel is `WhileOp`.
The Operator inheritance diagram:
![op_inheritance_relation_diagram](../../pics/op_inheritance_relation_diagram.png)
......@@ -62,7 +62,7 @@ Operator继承关系图:
<td>InferShapeFN </td>
<td>Functor </td>
<td>Used to infer the shape of the Output </td>
<td>Split into compile time and runtime: at compile time it is called on the Python side; if the Op inherits from OperatorWithKernel, at runtime it is called in op.run </td>
</tr>
<tr>
<td>OpCreator </td>
......@@ -85,11 +85,12 @@ Operator继承关系图:
**Note:**
1. For every Op, the first three parameters are required: op_type gives the op's name, OperatorBase is the class of that Op, and op_maker_and_checker_maker is the op's maker together with the checker for the op's attributes.
2. If the Op has a backward pass, op_grad_opmaker is required, because backward obtains the Maker of the backward Op from the forward Op.
3. The framework provides a default op_grad_opmaker, `DefaultGradOpDescMaker`. This Maker takes all inputs and outputs of the forward Op as inputs of the backward Op, uses the gradients of the forward Op's inputs as the backward Op's outputs, and copies the forward Op's attributes over. **Note: DefaultGradOpDescMaker makes every input and output of the forward Op an input of the backward Op, even when it is unnecessary, which makes it impossible to memory-optimize the unused variables.**
4. The framework provides no default op_infer_var_shape method. If the Op has no OpKernel, the user usually needs to add a corresponding op_infer_var_shape method; if the Op has OpKernels, the `InferShape` method of `OperatorWithKernel` must be implemented instead, and no op_infer_var_shape method is needed. For concrete implementations see [while_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/controlflow/while_op.cc) and [conv_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/conv_op.cc).
5. The framework provides no default op_infer_var_type method either; the user needs to add one according to the actual situation. Strictly speaking, every Op should register an InferVarType, which infers the type and dtype of the output Vars from the type and dtype of the input Vars. **Note: on the Python side, the Variable returned by the create_variable_for_type_inference call in LayerHelper holds a LoDTensor; the C++ InferVarType can modify the `Variable`'s type and dtype.**
For more details, please refer to [How to Write a New Op](new_op.html).
......@@ -119,7 +120,60 @@ ShareDataWith的功能是使两个Tensor共享底层buffer,在调用这个操
At present, when a sparse gradient is used for an update, the gradients are merged first, i.e. gradients for the same parameter are accumulated, and then the parameter and its auxiliary parameters (such as velocity) are updated.
### 7. GPU Memory Optimization
Normally a backward Op depends on some of the inputs (Input) and outputs (Output) of its forward Op for its own computation. In some cases, however, the backward Op does not need all of the forward Op's inputs and outputs; in some cases it needs only part of them; and in some cases it needs only the Shape and LoD information of the forward Op's input and output variables. If the Op developer registers unnecessary forward inputs and outputs as inputs of the backward Op, that memory cannot be reclaimed by the framework's existing memory optimization strategies, and the model's memory footprint becomes unnecessarily high.
So when registering a backward Op, pay attention to the following points:
- The `DefaultGradProtoMaker` provided by Fluid takes, by default, all inputs (`Input`) and outputs (`Output`) of the forward op, plus the gradients of the outputs (`Output@Grad`), as inputs of the backward Op, and uses the gradients of the forward Op's inputs (`Input@Grad`) as outputs. So when using `DefaultGradProtoMaker`, consider whether some of these variables are never used in the backward computation.
- If `DefaultGradProtoMaker` cannot meet your needs, you need to build a `GradOpDescMaker` by hand; for the concrete implementation please refer to the [related documentation](new_op.html#permalink-4--gradprotomaker-);
- If a backward Op depends on the Shape or LoD of the forward Op's input or output variables, but not on the buffer of the Tensor inside those variables, and the Shape and LoD cannot be inferred from other variables, you need to register `NoNeedBufferVarsInference` for that variable (called `X` below) on the backward Op. **Once `NoNeedBufferVarsInference` is registered, the backward op must not read or write the buffer of the Tensor in that variable; it may only call the Tensor's dims() and lod() methods. At the same time, the backward Op's `GetExpectedKernelType()` must be overridden and must not call the type() method of the Tensor in `X`.** For example, `SliceOpGrad` only uses the Shape information of the variable in `Input`, so `Input` needs to be registered on `SliceOpGrad`:
```
namespace paddle {
namespace operators {
// ...
class SliceOpGrad : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
// ...
}
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override {
// Note: don't get data type from ctx.Input<framework::Tensor>("Input");
auto dtype = ctx.Input<framework::Tensor>(framework::GradVarName("Out"))->type();
return framework::OpKernelType( dtype, ctx.GetPlace());
}
};
class SliceOpGradMaker : public framework::SingleGradOpDescMaker {
public:
using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
protected:
std::unique_ptr<framework::OpDesc> Apply() const override {
auto* bind = new framework::OpDesc();
bind->SetInput("Input", Input("Input"));
bind->SetInput(framework::GradVarName("Out"), OutputGrad("Out"));
bind->SetOutput(framework::GradVarName("Input"), InputGrad("Input"));
bind->SetAttrMap(Attrs());
bind->SetType("slice_grad");
return std::unique_ptr<framework::OpDesc>(bind);
}
};
DECLARE_NO_NEED_BUFFER_VARS_INFERENCE(SliceOpGradNoNeedBufferVarsInference,
"Input");
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OPERATOR(slice, ops::SliceOp, ops::SliceOpMaker,
ops::SliceOpGradMaker);
REGISTER_OPERATOR(slice_grad, ops::SliceOpGrad,
ops::SliceOpGradNoNeedBufferVarsInference);
```
### 8. Mixed-Device Calls
Because the GPU executes asynchronously, the GPU side may not have actually run yet when the CPU call returns. Therefore, if an Op creates a temporary variable that the GPU needs at runtime, the temporary may already have been released on the CPU side by the time the GPU starts running, which can cause the GPU computation to produce wrong results.
......@@ -137,7 +191,7 @@ The following device operations are asynchronous with respect to the host:
Notes on cudaMemcpy and cudaMemcpyAsync:
- If data is transferred from the GPU to pageable (non-page-locked) CPU memory, the transfer is synchronous even if an asynchronous copy is called.
- If data is transferred from CPU to CPU, the transfer is synchronous even if an asynchronous copy is called.
For more details, see: [Asynchronous Concurrent Execution](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#asynchronous-concurrent-execution), [API synchronization behavior](https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior)
......@@ -172,7 +226,10 @@ Enforce提示信息不能为空,并且需要写明,因为报错信息可以
**Note:** Be sure to preview the formulas before merging into the develop branch. You can refer to [dynamic_lstmp](http://paddlepaddle.org/documentation/docs/zh/1.1/api/layers.html#dynamic-lstmp).
### 3. Name Op Variables According to the Convention
When defining an Op, the names of its inputs, outputs, and attributes must follow the naming rules; for the concrete rules please refer to [`name_convention`](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/name_convention.md).
### 4. Parameter Order in the Python API of an Op
Parameters in the Python API are generally ordered by importance; take fc as an example:
```
def fc(input,
......
......@@ -249,7 +249,18 @@ Data Reader Interface
.. py:function:: paddle.reader.xmap_readers(mapper, reader, process_num, buffer_size, order=False)
Maps the samples returned by reader with a user-defined mapper, using multiple threads (into an output queue).

Parameters:
    - **mapper** (callable) - a function that maps the data produced by reader.
    - **reader** (callable) - the reader that produces the data.
    - **process_num** (int) - the number of threads used to process the samples.
    - **buffer_size** (int) - the size of the queue holding the data waiting to be read.
    - **order** (bool) - whether to keep the order of the original reader's data. Defaults to False.

Returns: a decorated reader over the mapped data.

Return type: callable
.. py:class:: paddle.reader.PipeReader(command, bufsize=8192, file_type='plain')
......
......@@ -12939,10 +12939,10 @@ yolov3_loss
As for the confidence score, it is the logistic regression value of the IoU between the anchor box and the ground-truth box; an anchor box's score is at most 1, reached when that anchor box has the largest IoU.
If the IoU of an anchor box is greater than the ignore threshold ignore_thresh, the confidence score loss of that anchor box is ignored.

Therefore, the yolov3 loss consists of three major parts: box location loss, objectness loss, and classification loss. The L1 loss is used for the box coordinates (w, h), while sigmoid cross-entropy loss is used for the box coordinates (x, y), the objectness loss, and the classification loss.

Each ground-truth box is matched with the best-fitting anchor among all anchors. Predictions made by every anchor box incur all three kinds of loss, but predictions of anchors that do not match any GT box (ground-truth box) only incur the objectness loss.
To balance the box-coordinate loss between large boxes and small boxes, the box-coordinate loss is multiplied by a scale weight, i.e.:
......@@ -12965,7 +12965,7 @@ yolov3_loss
Parameters:
    - **x** (Variable) – the input tensor of the YOLOv3 loss op, a 4-D tensor of shape [N, C, H, W]. H and W should be equal; the second dimension (C) stores the box locations, together with the confidence score and the one-hot classification of each anchor box.
    - **gt_box** (Variable) – the ground-truth boxes, of shape [N, B, 4]. The third dimension holds x, y, w, h, where x, y are the center coordinates of the ground-truth box and w, h are its width and height; x, y, w, h are divided by the input image size and scaled into the [0, 1] range. N is the batch size and B is the maximum number of boxes an image may contain.
    - **gt_label** (Variable) – the class ids of the ground-truth boxes, of shape [N, B].
    - **anchors** (list|tuple) – the widths and heights of the anchor boxes, which are parsed in pairs.
    - **anchor_mask** (list|tuple) – the mask indices of the anchors used in the current YOLOv3 loss computation.
......
......@@ -29,4 +29,8 @@ The following content describes the APIs related to the learning rate scheduler:
* :code:`piecewise_decay`: Piecewise decay, i.e. stair-like decay: within each of the given step intervals the learning rate stays the same. For the related API Reference please refer to :ref:`api_fluid_layers_piecewise_decay`
* :code:`append_LARS`: The learning rate is obtained by the Layer-wise Adaptive Rate Scaling algorithm. For the related algorithm, please refer to `Train Feed forward Neural Network with Layerwise Adaptive Rate via Approximating Back-matching Propagation <https://arxiv.org/abs/1802.09750>`_ . For the related API Reference please refer to :ref:`api_fluid_layers_append_LARS`

* :code:`cosine_decay`: Cosine decay. The learning rate changes with the number of steps following a cosine function. For the related API Reference please refer to :ref:`api_fluid_layers_cosine_decay`

* :code:`linear_lr_warmup`: The learning rate increases linearly with the number of steps until it reaches the specified rate. For the related API Reference please refer to :ref:`api_fluid_layers_linear_lr_warmup`
......@@ -8,7 +8,7 @@ PaddlePaddle (PArallel Distributed Deep LEarning)是一个易用、高效、灵
Let's start here:

- `Quick Start <../beginners_guide/quick_start_cn.html>`_

When you come to PaddlePaddle for the first time, please read the following documents first to learn how to install it:
......@@ -16,7 +16,7 @@ PaddlePaddle (PArallel Distributed Deep LEarning)是一个易用、高效、灵
More learning materials are provided here:

- `Basics of Deep Learning <../beginners_guide/basics/index_cn.html>`_ : covers basic knowledge of several deep learning areas such as image classification, personalized recommendation, and machine translation, with Fluid implementation examples
- `Fluid Programming Guide <../beginners_guide/programming_guide/programming_guide.html>`_ : introduces the basic concepts and usage of Fluid
......
......@@ -221,7 +221,7 @@ PaddePaddle通过编译时指定路径来实现引用各种BLAS/CUDA/cuDNN库。
You can find the release versions of paddlepaddle-gpu in the [Release History](https://pypi.org/project/paddlepaddle-gpu/#history).

Note that in a Windows environment, the <code> paddlepaddle-gpu </code> command installs PaddlePaddle built with CUDA 8.0 and cuDNN 7 by default.
***
......@@ -245,130 +245,178 @@ PaddePaddle通过编译时指定路径来实现引用各种BLAS/CUDA/cuDNN库。
<tbody>
<tr>
<td> cpu-noavx-mkl </td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-noavx-mkl/paddlepaddle-1.3.0-cp27-cp27mu-linux_x86_64.whl">
paddlepaddle-1.3.0-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-noavx-mkl/paddlepaddle-1.3.0-cp27-cp27m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-noavx-mkl/paddlepaddle-1.3.0-cp35-cp35m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-noavx-mkl/paddlepaddle-1.3.0-cp36-cp36m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-noavx-mkl/paddlepaddle-1.3.0-cp37-cp37m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl/paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl">
paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl/paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl/paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl/paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl/paddlepaddle-1.4.1-cp37-cp37m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> cpu_avx_mkl </td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-mkl/paddlepaddle-1.3.0-cp27-cp27mu-linux_x86_64.whl">
paddlepaddle-1.3.0-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-mkl/paddlepaddle-1.3.0-cp27-cp27m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-mkl/paddlepaddle-1.3.0-cp35-cp35m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-mkl/paddlepaddle-1.3.0-cp36-cp36m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-mkl/paddlepaddle-1.3.0-cp37-cp37m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-mkl/paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl">
paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-mkl/paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-mkl/paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-mkl/paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-mkl/paddlepaddle-1.4.1-cp37-cp37m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> cpu_avx_openblas </td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-openblas/paddlepaddle-1.3.0-cp27-cp27mu-linux_x86_64.whl">
paddlepaddle-1.3.0-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-openblas/paddlepaddle-1.3.0-cp27-cp27m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-openblas/paddlepaddle-1.3.0-cp35-cp35m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-openblas/paddlepaddle-1.3.0-cp36-cp36m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-openblas/paddlepaddle-1.3.0-cp37-cp37m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas/paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl">
paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas/paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas/paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas/paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas/paddlepaddle-1.4.1-cp37-cp37m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> cpu_noavx_openblas </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> - </td>
</tr>
<tr>
<td> cuda8.0_cudnn5_avx_mkl </td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.3.0.post85-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.3.0.post85-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.3.0.post85-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.3.0.post85-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.3.0.post85-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.4.1.post85-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.4.1.post85-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.4.1.post85-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.4.1.post85-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.4.1.post85-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> cuda8.0_cudnn7_noavx_mkl </td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.3.0-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.3.0-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.3.0-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.3.0-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.3.0-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.4.1-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.4.1-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.4.1-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.4.1-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.4.1-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> cuda8.0_cudnn7_avx_mkl </td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post87-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0.post87-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post87-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0.post87-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post87-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0.post87-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post87-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0.post87-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post87-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0.post87-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post87-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1.post87-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post87-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1.post87-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post87-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1.post87-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post87-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1.post87-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post87-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1.post87-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> cuda9.0_cudnn7_avx_mkl </td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post97-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post97-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post97-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post97-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post97-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post97-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post97-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post97-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post97-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post97-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> win_cpu_noavx_openblas </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-openblas/paddlepaddle-1.3.0-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.3.0-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-openblas/paddlepaddle-1.3.0-cp35-cp35m-win_amd64.whl">
paddlepaddle-1.3.0-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-openblas/paddlepaddle-1.3.0-cp36-cp36m-win_amd64.whl">
paddlepaddle-1.3.0-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-openblas/paddlepaddle-1.3.0-cp37-cp37m-win_amd64.whl">
paddlepaddle-1.3.0-cp37-cp37m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp37-cp37m-win_amd64.whl</a></td>
</tr>
<tr>
<td> win_cpu_noavx_mkl </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-mkl/paddlepaddle-1.3.0-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.3.0-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-mkl/paddlepaddle-1.3.0-cp35-cp35m-win_amd64.whl">
paddlepaddle-1.3.0-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-mkl/paddlepaddle-1.3.0-cp36-cp36m-win_amd64.whl">
paddlepaddle-1.3.0-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-mkl/paddlepaddle-1.3.0-cp37-cp37m-win_amd64.whl">
paddlepaddle-1.3.0-cp37-cp37m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl-win/paddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl-win/paddlepaddle-1.4.1-cp35-cp35m-win_amd64.whl">
paddlepaddle-1.4.1-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl-win/paddlepaddle-1.4.1-cp36-cp36m-win_amd64.whl">
paddlepaddle-1.4.1-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl-win/paddlepaddle-1.4.1-cp37-cp37m-win_amd64.whl">
paddlepaddle-1.4.1-cp37-cp37m-win_amd64.whl</a></td>
</tr>
<tr>
<td> win_cpu_avx_openblas </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle-1.3.0-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.3.0-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle-1.3.0-cp35-cp35m-win_amd64.whl">
paddlepaddle-1.3.0-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle-1.3.0-cp36-cp36m-win_amd64.whl">
paddlepaddle-1.3.0-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle-1.3.0-cp37-cp37m-win_amd64.whl">
paddlepaddle-1.3.0-cp37-cp37m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas-win/paddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas-win/paddlepaddle-1.4.1-cp35-cp35m-win_amd64.whl">
paddlepaddle-1.4.1-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas-win/paddlepaddle-1.4.1-cp36-cp36m-win_amd64.whl">
paddlepaddle-1.4.1-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas-win/paddlepaddle-1.4.1-cp37-cp37m-win_amd64.whl">
paddlepaddle-1.4.1-cp37-cp37m-win_amd64.whl</a></td>
</tr>
<tr>
<td> win_cuda8.0_cudnn7_gpu_avx_openblas </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-openblas-win/paddlepaddle_gpu-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-openblas-win/paddlepaddle_gpu-1.4.1-cp35-cp35m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-openblas-win/paddlepaddle_gpu-1.4.1-cp36-cp36m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-openblas-win/paddlepaddle_gpu-1.4.1-cp37-cp37m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp37-cp37m-win_amd64.whl</a></td>
</tr>
<tr>
<td> win_cuda8.0_cudnn7_gpu_noavx_mkl </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl-win/paddlepaddle_gpu-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl-win/paddlepaddle_gpu-1.4.1-cp35-cp35m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl-win/paddlepaddle_gpu-1.4.1-cp36-cp36m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl-win/paddlepaddle_gpu-1.4.1-cp37-cp37m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp37-cp37m-win_amd64.whl</a></td>
</tr>
<tr>
<td> win_cuda8.0_cudnn7_gpu_noavx_openblas </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-openblas-win/paddlepaddle_gpu-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-openblas-win/paddlepaddle_gpu-1.4.1-cp35-cp35m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-openblas-win/paddlepaddle_gpu-1.4.1-cp36-cp36m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-openblas-win/paddlepaddle_gpu-1.4.1-cp37-cp37m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp37-cp37m-win_amd64.whl</a></td>
</tr>
<tr>
<td> win_cuda8.0_cudnn7_cpu_avx_openblas </td>
<td> mac_cpu </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle_gpu-1.3.0-cp27-cp27m-win_amd64.whl">
paddlepaddle_gpu-1.3.0-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle_gpu-1.3.0-cp35-cp35m-win_amd64.whl">
paddlepaddle_gpu-1.3.0-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle_gpu-1.3.0-cp36-cp36m-win_amd64.whl">
paddlepaddle_gpu-1.3.0-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle_gpu-1.3.0-cp37-cp37m-win_amd64.whl">
paddlepaddle_gpu-1.3.0-cp37-cp37m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-mac/paddlepaddle-1.4.1-cp27-cp27m-macosx_10_6_intel.whl">
paddlepaddle-1.4.1-cp27-cp27m-macosx_10_6_intel.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-mac/paddlepaddle-1.4.1-cp35-cp35m-macosx_10_6_intel.whl">
paddlepaddle-1.4.1-cp35-cp35m-macosx_10_6_intel.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-mac/paddlepaddle-1.4.1-cp36-cp36m-macosx_10_6_intel.whl">
paddlepaddle-1.4.1-cp36-cp36m-macosx_10_6_intel.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-mac/paddlepaddle-1.4.1-cp37-cp37m-macosx_10_6_intel.whl">
paddlepaddle-1.4.1-cp37-cp37m-macosx_10_6_intel.whl</a></td>
</tr>
</tbody>
</table>
......
......@@ -262,7 +262,7 @@ PaddePaddle implements references to various BLAS/CUDA/cuDNN libraries by specif
You can find various distributions of PaddlePaddle-gpu in [the Release History](https://pypi.org/project/paddlepaddle-gpu/#history).
Please note that paddlepaddle-gpu==1.3.0 on Windows will download a package compiled with CUDA 8.0 and cuDNN 7
Please note that paddlepaddle-gpu on Windows will download a package compiled with CUDA 8.0 and cuDNN 7
***
<a name="dockers"></a>
......@@ -323,129 +323,178 @@ You can find the docker image for each release of PaddlePaddle in the [DockerHub
<tbody>
<tr>
<td> cpu-noavx-mkl </td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-noavx-mkl/paddlepaddle-1.3.0-cp27-cp27mu-linux_x86_64.whl">
paddlepaddle-1.3.0-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-noavx-mkl/paddlepaddle-1.3.0-cp27-cp27m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-noavx-mkl/paddlepaddle-1.3.0-cp35-cp35m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-noavx-mkl/paddlepaddle-1.3.0-cp36-cp36m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-noavx-mkl/paddlepaddle-1.3.0-cp37-cp37m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl/paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl">
paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl/paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl/paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl/paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl/paddlepaddle-1.4.1-cp37-cp37m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> cpu_avx_mkl </td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-mkl/paddlepaddle-1.3.0-cp27-cp27mu-linux_x86_64.whl">
paddlepaddle-1.3.0-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-mkl/paddlepaddle-1.3.0-cp27-cp27m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-mkl/paddlepaddle-1.3.0-cp35-cp35m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-mkl/paddlepaddle-1.3.0-cp36-cp36m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-mkl/paddlepaddle-1.3.0-cp37-cp37m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-mkl/paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl">
paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-mkl/paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-mkl/paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-mkl/paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-mkl/paddlepaddle-1.4.1-cp37-cp37m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> cpu_avx_openblas </td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-openblas/paddlepaddle-1.3.0-cp27-cp27mu-linux_x86_64.whl">
paddlepaddle-1.3.0-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-openblas/paddlepaddle-1.3.0-cp27-cp27m-linux_x86_64.whl"> paddlepaddle-1.3.0-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-openblas/paddlepaddle-1.3.0-cp35-cp35m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-openblas/paddlepaddle-1.3.0-cp36-cp36m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-cpu-avx-openblas/paddlepaddle-1.3.0-cp37-cp37m-linux_x86_64.whl">
paddlepaddle-1.3.0-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas/paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl">
paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas/paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas/paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas/paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas/paddlepaddle-1.4.1-cp37-cp37m-linux_x86_64.whl">
paddlepaddle-1.4.1-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> cpu_noavx_openblas </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> - </td>
</tr>
<tr>
<td> cuda8.0_cudnn5_avx_mkl </td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.3.0.post85-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.3.0.post85-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.3.0.post85-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.3.0.post85-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.3.0.post85-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.4.1.post85-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.4.1.post85-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.4.1.post85-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.4.1.post85-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn5-avx-mkl/paddlepaddle_gpu-1.4.1.post85-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> cuda8.0_cudnn7_noavx_mkl </td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.3.0-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.3.0-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.3.0-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.3.0-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.3.0-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.4.1-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.4.1-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.4.1-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.4.1-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl/paddlepaddle_gpu-1.4.1-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> cuda8.0_cudnn7_avx_mkl </td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post87-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0.post87-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post87-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0.post87-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post87-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0.post87-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post87-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0.post87-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post87-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0.post87-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post87-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1.post87-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post87-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1.post87-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post87-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1.post87-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post87-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1.post87-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post87-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1.post87-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> cuda9.0_cudnn7_avx_mkl </td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post97-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post97-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post97-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.3.0-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post97-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddlepaddle.org/download?url=http://paddle-wheel.bj.bcebos.com/1.3.0-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.3.0.post97-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.3.0-cp37-cp37m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post97-cp27-cp27mu-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp27-cp27mu-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post97-cp27-cp27m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp27-cp27m-linux_x86_64.whl</a></td>
<td><a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post97-cp35-cp35m-linux_x86_64.whl"> paddlepaddle_gpu-1.4.1-cp35-cp35m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post97-cp36-cp36m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1-cp36-cp36m-linux_x86_64.whl</a></td>
<td> <a href="http://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda9-cudnn7-avx-mkl/paddlepaddle_gpu-1.4.1.post97-cp37-cp37m-linux_x86_64.whl">
paddlepaddle_gpu-1.4.1-cp37-cp37m-linux_x86_64.whl</a></td>
</tr>
<tr>
<td> win_cpu_noavx_openblas </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-openblas/paddlepaddle-1.3.0-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.3.0-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-openblas/paddlepaddle-1.3.0-cp35-cp35m-win_amd64.whl">
paddlepaddle-1.3.0-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-openblas/paddlepaddle-1.3.0-cp36-cp36m-win_amd64.whl">
paddlepaddle-1.3.0-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-openblas/paddlepaddle-1.3.0-cp37-cp37m-win_amd64.whl">
paddlepaddle-1.3.0-cp37-cp37m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-openblas-win%2Fpaddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp37-cp37m-win_amd64.whl</a></td>
</tr>
<tr>
<td> win_cpu_noavx_mkl </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-mkl/paddlepaddle-1.3.0-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.3.0-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-mkl/paddlepaddle-1.3.0-cp35-cp35m-win_amd64.whl">
paddlepaddle-1.3.0-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-mkl/paddlepaddle-1.3.0-cp36-cp36m-win_amd64.whl">
paddlepaddle-1.3.0-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-noavx-mkl/paddlepaddle-1.3.0-cp37-cp37m-win_amd64.whl">
paddlepaddle-1.3.0-cp37-cp37m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl-win/paddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl-win/paddlepaddle-1.4.1-cp35-cp35m-win_amd64.whl">
paddlepaddle-1.4.1-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl-win/paddlepaddle-1.4.1-cp36-cp36m-win_amd64.whl">
paddlepaddle-1.4.1-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-noavx-mkl-win/paddlepaddle-1.4.1-cp37-cp37m-win_amd64.whl">
paddlepaddle-1.4.1-cp37-cp37m-win_amd64.whl</a></td>
</tr>
<tr>
<td> win_cpu_avx_openblas </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle-1.3.0-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.3.0-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle-1.3.0-cp35-cp35m-win_amd64.whl">
paddlepaddle-1.3.0-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle-1.3.0-cp36-cp36m-win_amd64.whl">
paddlepaddle-1.3.0-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle-1.3.0-cp37-cp37m-win_amd64.whl">
paddlepaddle-1.3.0-cp37-cp37m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas-win/paddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle-1.4.1-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas-win/paddlepaddle-1.4.1-cp35-cp35m-win_amd64.whl">
paddlepaddle-1.4.1-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas-win/paddlepaddle-1.4.1-cp36-cp36m-win_amd64.whl">
paddlepaddle-1.4.1-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-avx-openblas-win/paddlepaddle-1.4.1-cp37-cp37m-win_amd64.whl">
paddlepaddle-1.4.1-cp37-cp37m-win_amd64.whl</a></td>
</tr>
<tr>
<td> win_cuda8.0_cudnn7_gpu_avx_openblas </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-openblas-win/paddlepaddle_gpu-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-openblas-win/paddlepaddle_gpu-1.4.1-cp35-cp35m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-openblas-win/paddlepaddle_gpu-1.4.1-cp36-cp36m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-avx-openblas-win/paddlepaddle_gpu-1.4.1-cp37-cp37m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp37-cp37m-win_amd64.whl</a></td>
</tr>
<tr>
<td> win_cuda8.0_cudnn7_gpu_noavx_mkl </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl-win/paddlepaddle_gpu-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl-win/paddlepaddle_gpu-1.4.1-cp35-cp35m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl-win/paddlepaddle_gpu-1.4.1-cp36-cp36m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-mkl-win/paddlepaddle_gpu-1.4.1-cp37-cp37m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp37-cp37m-win_amd64.whl</a></td>
</tr>
<tr>
<td> win_cuda8.0_cudnn7_gpu_noavx_openblas </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-openblas-win/paddlepaddle_gpu-1.4.1-cp27-cp27m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-openblas-win/paddlepaddle_gpu-1.4.1-cp35-cp35m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-openblas-win/paddlepaddle_gpu-1.4.1-cp36-cp36m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-gpu-cuda8-cudnn7-noavx-openblas-win/paddlepaddle_gpu-1.4.1-cp37-cp37m-win_amd64.whl">
paddlepaddle_gpu-1.4.1-cp37-cp37m-win_amd64.whl</a></td>
</tr>
<tr>
<td> win_cuda8.0_cudnn7_cpu_avx_openblas </td>
<td> mac_cpu </td>
<td> - </td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle_gpu-1.3.0-cp27-cp27m-win_amd64.whl">
paddlepaddle_gpu-1.3.0-cp27-cp27m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle_gpu-1.3.0-cp35-cp35m-win_amd64.whl">
paddlepaddle_gpu-1.3.0-cp35-cp35m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle_gpu-1.3.0-cp36-cp36m-win_amd64.whl">
paddlepaddle_gpu-1.3.0-cp36-cp36m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.3.0-win-avx-openblas/paddlepaddle_gpu-1.3.0-cp37-cp37m-win_amd64.whl">
paddlepaddle_gpu-1.3.0-cp37-cp37m-win_amd64.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-mac/paddlepaddle-1.4.1-cp27-cp27m-macosx_10_6_intel.whl">
paddlepaddle-1.4.1-cp27-cp27m-macosx_10_6_intel.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-mac/paddlepaddle-1.4.1-cp35-cp35m-macosx_10_6_intel.whl">
paddlepaddle-1.4.1-cp35-cp35m-macosx_10_6_intel.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-mac/paddlepaddle-1.4.1-cp36-cp36m-macosx_10_6_intel.whl">
paddlepaddle-1.4.1-cp36-cp36m-macosx_10_6_intel.whl</a></td>
<td> <a href="https://paddle-wheel.bj.bcebos.com/1.4.1-cpu-mac/paddlepaddle-1.4.1-cp37-cp37m-macosx_10_6_intel.whl">
paddlepaddle-1.4.1-cp37-cp37m-macosx_10_6_intel.whl</a></td>
</tr>
</tbody>
</table>
......
......@@ -200,7 +200,10 @@
Congratulations! You have now finished compiling and installing PaddlePaddle.
## ***Verify installation***
After the installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify whether the installation was successful.
After the installation, you can use `python` or `python3` to enter the Python interpreter, type `import paddle.fluid as fluid`, and then run
`fluid.install_check.run_check()`
If `Your Paddle Fluid is installed succesfully!` appears, the installation was successful.
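For example, the whole check amounts to running the following in the interpreter:

```python
import paddle.fluid as fluid

# Prints "Your Paddle Fluid is installed succesfully!" when the installation is OK.
fluid.install_check.run_check()
```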
## ***How to uninstall***
Please use the following command to uninstall PaddlePaddle:
......
......@@ -204,7 +204,10 @@
Congratulations! You have now finished compiling and installing PaddlePaddle.
## ***Verify installation***
After the installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify whether the installation was successful.
After the installation, you can use `python` or `python3` to enter the Python interpreter, type `import paddle.fluid as fluid`, and then run
`fluid.install_check.run_check()`
If `Your Paddle Fluid is installed succesfully!` appears, the installation was successful.
## ***How to uninstall***
Please use the following command to uninstall PaddlePaddle
......
......@@ -209,7 +209,10 @@
Congratulations! You have now finished compiling and installing PaddlePaddle.
## ***Verify installation***
After the installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify whether the installation was successful.
After the installation, you can use `python` or `python3` to enter the Python interpreter, type `import paddle.fluid as fluid`, and then run
`fluid.install_check.run_check()`
If `Your Paddle Fluid is installed succesfully!` appears, the installation was successful.
## ***How to uninstall***
Please use the following command to uninstall PaddlePaddle:
......
......@@ -102,7 +102,10 @@
Congratulations! You have now finished compiling and installing PaddlePaddle.
## ***Verify installation***
After the installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid`; if no error is reported, the installation was successful.
After the installation, you can use `python` or `python3` to enter the Python interpreter, type `import paddle.fluid as fluid`, and then run
`fluid.install_check.run_check()`
If `Your Paddle Fluid is installed succesfully!` appears, the installation was successful.
## ***How to uninstall***
Please use the following command to uninstall PaddlePaddle:
......
......@@ -51,7 +51,10 @@ There are 4 installation methods on CentOS:
<a name="check"></a>
## ***Verify installation***
After the installation, you can use the command `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify whether the installation was successful.
After the installation, you can use `python` or `python3` to enter the Python interpreter, type `import paddle.fluid as fluid`, and then run
`fluid.install_check.run_check()`
If `Your Paddle Fluid is installed succesfully!` appears, the installation was successful.
## ***How to uninstall***
Please use the following command to uninstall PaddlePaddle:
......
......@@ -4,6 +4,8 @@
## Environment preparation
- For the currently supported system types, see the [installation guide](./index_cn.html); note that Docker is not yet supported on CentOS 6
- [Install Docker](https://hub.docker.com/search/?type=edition&offering=community) on the local host
- To enable GPU support on Linux, please [install nvidia-docker](https://github.com/NVIDIA/nvidia-docker)
......
......@@ -41,7 +41,10 @@ There are 4 installation methods on MacOS:
<a name="check"></a>
## Verify installation
After the installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify whether the installation was successful.
After the installation, you can use `python` or `python3` to enter the Python interpreter, type `import paddle.fluid as fluid`, and then run
`fluid.install_check.run_check()`
If `Your Paddle Fluid is installed succesfully!` appears, the installation was successful.
## How to uninstall
......
......@@ -51,7 +51,10 @@ There are 4 installation methods on Ubuntu:
<a name="check"></a>
## Verify installation
After the installation, you can use the command `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify whether the installation was successful.
After the installation, you can use `python` or `python3` to enter the Python interpreter, type `import paddle.fluid as fluid`, and then run
`fluid.install_check.run_check()`
If `Your Paddle Fluid is installed succesfully!` appears, the installation was successful.
## How to uninstall
Please use the following command to uninstall PaddlePaddle:
......
......@@ -46,7 +46,10 @@ There are 3 installation methods on Windows:
<a name="check"></a>
## Verify installation
After the installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify whether the installation was successful.
After the installation, you can use `python` or `python3` to enter the Python interpreter, type `import paddle.fluid as fluid`, and then run
`fluid.install_check.run_check()`
If `Your Paddle Fluid is installed succesfully!` appears, the installation was successful.
## How to uninstall
......
......@@ -9,9 +9,8 @@ This instruction will show you how to install PaddlePaddle on Windows. The foll
**Note** :
* The current version does not support NCCL, distributed training, AVX, warpctc and MKL related functions.
* The current version does not support NCCL or distributed-training-related functions.
* Currently, only PaddlePaddle for CPU is supported on Windows.
......@@ -30,14 +29,20 @@ Version of pip or pip3 should be equal to or above 9.0.1 .
* Install PaddlePaddle
* ***CPU version of PaddlePaddle***:
Execute `pip install paddlepaddle` or `pip3 install paddlepaddle` to download and install PaddlePaddle.
* ***GPU version of PaddlePaddle***:
Execute `pip install paddlepaddle-gpu` (python2.7) or `pip3 install paddlepaddle-gpu` (python3.x) to download and install PaddlePaddle.
## ***Verify installation***
After completing the installation, you can use `python` or `python3` to enter the Python interpreter and then use `import paddle.fluid` to verify that the installation was successful.
## ***How to uninstall***
* ***CPU version of PaddlePaddle***:
Use the following command to uninstall PaddlePaddle: `pip uninstall paddlepaddle` or `pip3 uninstall paddlepaddle`
* ***GPU version of PaddlePaddle***:
Use the following command to uninstall PaddlePaddle: `pip uninstall paddlepaddle-gpu` or `pip3 uninstall paddlepaddle-gpu`
# DyGraph
# Dynamic Graph Mechanism - DyGraph
PaddlePaddle's DyGraph mode is a dynamic graph execution mechanism that executes operations immediately, without first building the whole graph. Unlike the previous static execution of computation graphs, in DyGraph mode every operation returns its result immediately, without waiting for the whole constructed computation graph to finish executing. This lets you build deep learning tasks in PaddlePaddle more intuitively and debug models more easily, and it also removes a large amount of code otherwise needed to construct static computation graphs, making it more convenient to write and debug networks.
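A minimal sketch of this immediate-execution behavior (the tensor contents here are illustrative):

```python
import numpy as np
import paddle.fluid as fluid

with fluid.dygraph.guard():
    x = fluid.dygraph.to_variable(np.ones([2, 2], dtype='float32'))
    y = fluid.layers.reduce_sum(x)   # runs immediately, no separate graph-build step
    print(y.numpy())                 # the result is available right away
```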
......@@ -313,7 +315,6 @@ PaddlePaddle DyGraph is a more flexible and easy-to-use mode that provides:
train_reader = paddle.batch(
paddle.dataset.mnist.train(), batch_size= BATCH_SIZE, drop_last=True)
dy_param_init_value = {}
np.set_printoptions(precision=3, suppress=True)
for epoch in range(epoch_num):
for batch_id, data in enumerate(train_reader()):
......@@ -333,10 +334,6 @@ PaddlePaddle DyGraph is a more flexible and easy-to-use mode that provides:
dy_out = avg_loss.numpy()
if epoch == 0 and batch_id == 0:
for param in mnist.parameters():
dy_param_init_value[param.name] = param.numpy()
avg_loss.backward()
sgd.minimize(avg_loss)
mnist.clear_gradients()
......@@ -390,11 +387,15 @@ PaddlePaddle DyGraph is a more flexible and easy-to-use mode that provides:

During model training, you can use `fluid.dygraph.save_persistables(your_model_object.state_dict(), "save_dir")` to save all model parameters in `your_model_object`. You can also pass in your own Python dictionary of "parameter name" - "parameter object" pairs to specify what to save.
Similarly, you can use the `your_model_object.load_dict(fluid.dygraph.load_persistables(your_model_object.state_dict(), "save_dir"))` interface to restore the saved model parameters and continue training.
Similarly, you can use the `your_model_object.load_dict(fluid.dygraph.load_persistables("save_dir"))` interface to restore the saved model parameters and continue training.
The following code shows how to save parameters in the "handwritten digit recognition" task and load the saved parameters to continue training.
dy_param_init_value={}
for epoch in range(epoch_num):
for batch_id, data in enumerate(train_reader()):
dy_x_data = np.array(
......@@ -420,9 +421,8 @@ PaddlePaddle DyGraph is a more flexible and easy-to-use mode that provides:
for param in mnist.parameters():
dy_param_init_value[param.name] = param.numpy()
mnist.load_dict(fluid.dygraph.load_persistables(mnist.state_dict(), "save_dir"))
restore = mnist.parameters()
mnist.load_dict(fluid.dygraph.load_persistables("save_dir"))
restore = mnist.parameters()
# check save and load
success = True
for value in restore:
......@@ -490,7 +490,7 @@ PaddlePaddle DyGraph is a more flexible and easy-to-use mode that provides:
mnist_infer = MNIST("mnist")
# load checkpoint
mnist_infer.load_dict(
fluid.dygraph.load_persistables(mnist.state_dict(), "save_dir"))
fluid.dygraph.load_persistables("save_dir"))
print("checkpoint loaded")
# start evaluate mode
......@@ -536,56 +536,41 @@ PaddlePaddle DyGraph is a more flexible and easy-to-use mode that provides:
## Writing a compatible model
Taking the handwritten digit recognition example from the previous step, the same model code can be executed directly in PaddlePaddle's `Executor`:
exe = fluid.Executor(fluid.CPUPlace(
) if not core.is_compiled_with_cuda() else fluid.CUDAPlace(0))
mnist = MNIST("mnist")
sgd = SGDOptimizer(learning_rate=1e-3)
train_reader = paddle.batch(
paddle.dataset.mnist.train(), batch_size= BATCH_SIZE, drop_last=True)
img = fluid.layers.data(
name='pixel', shape=[1, 28, 28], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
cost = mnist(img)
loss = fluid.layers.cross_entropy(cost, label)
avg_loss = fluid.layers.mean(loss)
sgd.minimize(avg_loss)
# initialize params and fetch them
static_param_init_value = {}
static_param_name_list = []
for param in mnist.parameters():
static_param_name_list.append(param.name)
out = exe.run(fluid.default_startup_program(),
fetch_list=static_param_name_list)
for i in range(len(static_param_name_list)):
static_param_init_value[static_param_name_list[i]] = out[i]
for epoch in range(epoch_num):
for batch_id, data in enumerate(train_reader()):
static_x_data = np.array(
[x[0].reshape(1, 28, 28)
for x in data]).astype('float32')
y_data = np.array(
[x[1] for x in data]).astype('int64').reshape([BATCH_SIZE, 1])
fetch_list = [avg_loss.name]
fetch_list.extend(static_param_name_list)
out = exe.run(
fluid.default_main_program(),
feed={"pixel": static_x_data,
"label": y_data},
fetch_list=fetch_list)
static_param_value = {}
static_out = out[0]
for i in range(1, len(out)):
static_param_value[static_param_name_list[i - 1]] = out[
i]
) if not core.is_compiled_with_cuda() else fluid.CUDAPlace(0))
mnist = MNIST("mnist")
sgd = SGDOptimizer(learning_rate=1e-3)
train_reader = paddle.batch(
paddle.dataset.mnist.train(), batch_size= BATCH_SIZE, drop_last=True)
img = fluid.layers.data(
name='pixel', shape=[1, 28, 28], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
cost = mnist(img)
loss = fluid.layers.cross_entropy(cost, label)
avg_loss = fluid.layers.mean(loss)
sgd.minimize(avg_loss)
out = exe.run(fluid.default_startup_program())
for epoch in range(epoch_num):
for batch_id, data in enumerate(train_reader()):
static_x_data = np.array(
[x[0].reshape(1, 28, 28)
for x in data]).astype('float32')
y_data = np.array(
[x[1] for x in data]).astype('int64').reshape([BATCH_SIZE, 1])
fetch_list = [avg_loss.name]
out = exe.run(
fluid.default_main_program(),
feed={"pixel": static_x_data,
"label": y_data},
fetch_list=fetch_list)
static_out = out[0]
......@@ -4,8 +4,7 @@
Use PyReader to read training and test data
############################################
Paddle Fluid supports PyReader, which implements feeding data from Python to C++. Different from :ref:`user_guide_use_numpy_array_as_train_data_en` , the process of loading data to Python is asynchronous with the process of :code:`Executor::Run()` reading data when PyReader is in use.
Moreover, PyReader is able to work with :code:`double_buffer_reader` to upgrade the performance of reading data.
Besides the Python Reader, we provide PyReader. PyReader performs better than :ref:`user_guide_use_numpy_array_as_train_data` because, when PyReader is in use, data loading is asynchronous with model training. PyReader can also work with :code:`double_buffer_reader` to improve data-reading performance. Moreover, :code:`double_buffer_reader` can convert CPU Tensors to GPU Tensors, which further improves data-reading efficiency to some extent.
Create PyReader Object
################################
......@@ -17,7 +16,7 @@ You can create PyReader object as follows:
import paddle.fluid as fluid
py_reader = fluid.layers.py_reader(capacity=64,
shapes=[(-1,3,224,224), (-1,1)],
shapes=[(-1,784), (-1,1)],
dtypes=['float32', 'int64'],
name='py_reader',
use_double_buffer=True)
......@@ -28,14 +27,14 @@ In the code, ``capacity`` is buffer size of PyReader;
``name`` is name of PyReader instance;
``use_double_buffer`` is True by default, which means :code:`double_buffer_reader` is used.
To create some different PyReader objects (Usually, you have to create two different PyReader objects for training and testing phase), the names of objects must be different. For example, In the same task, PyReader objects in training and testing period are created as follows:
Attention: If you want to create multiple PyReader objects (such as two different PyReaders for the training and inference phases respectively), you have to give the PyReader objects different names, since PaddlePaddle uses names to distinguish variables, and `Program.clone()` (see :ref:`api_fluid_Program_clone` ) cannot copy PyReader objects.
.. code-block:: python
import paddle.fluid as fluid
train_py_reader = fluid.layers.py_reader(capacity=64,
shapes=[(-1,3,224,224), (-1,1)],
shapes=[(-1,784), (-1,1)],
dtypes=['float32', 'int64'],
name='train',
use_double_buffer=True)
......@@ -46,11 +45,10 @@ To create some different PyReader objects (Usually, you have to create two diffe
name='test',
use_double_buffer=True)
Note: You could not copy PyReader object with :code:`Program.clone()` so you have to create PyReader objects in training and testing phase with the method mentioned above
While using PyReader, if you need to share model parameters between the training and test phases, you can use :code:`fluid.unique_name.guard()` .
Notes: Paddle uses different names to distinguish different variables, and the names are generated by the counter in the :code:`unique_name` module. The counter increases by one every time a variable name is generated. :code:`fluid.unique_name.guard()` resets the counter in the :code:`unique_name` module, so that the generated variable names are the same across repeated calls to :code:`fluid.unique_name.guard()` and parameters can therefore be shared.
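As a small illustration of this counter reset (a sketch, not part of the original example):

.. code-block:: python

    import paddle.fluid as fluid

    with fluid.unique_name.guard():
        name_a = fluid.unique_name.generate("fc")   # "fc_0"
    with fluid.unique_name.guard():
        name_b = fluid.unique_name.generate("fc")   # "fc_0" again, because the counter was reset
    # Since the generated names match, parameters created under the two guards map to
    # the same variables in the scope, so training and test programs can share them.
    print(name_a, name_b)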
Because you could not copy PyReader with :code:`Program.clone()` so you have to share the parameters of training phase with testing phase through :code:`fluid.unique_name.guard()` .
Details are as follows:
An example of configuring the training and test networks with PyReader is as follows:
.. code-block:: python
......@@ -61,41 +59,97 @@ Details are as follows:
import numpy
def network(is_train):
# Create py_reader object and give different names
# when is_train = True and is_train = False
reader = fluid.layers.py_reader(
capacity=10,
shapes=((-1, 784), (-1, 1)),
dtypes=('float32', 'int64'),
name="train_reader" if is_train else "test_reader",
use_double_buffer=True)
# Use read_file() method to read out the data from py_reader
img, label = fluid.layers.read_file(reader)
...
# Here, we omitted the definition of loss of the model
return loss , reader
# Create main program and startup program for training
train_prog = fluid.Program()
train_startup = fluid.Program()
with fluid.program_guard(train_prog, train_startup):
# Use fluid.unique_name.guard() to share parameters with test network
with fluid.unique_name.guard():
train_loss, train_reader = network(True)
adam = fluid.optimizer.Adam(learning_rate=0.01)
adam.minimize(train_loss)
# Create main program and startup program for testing
test_prog = fluid.Program()
test_startup = fluid.Program()
with fluid.program_guard(test_prog, test_startup):
# Use fluid.unique_name.guard() to share parameters with train network
with fluid.unique_name.guard():
test_loss, test_reader = network(False)
Configure data source of PyReader objects
##########################################
PyReader provides :code:`decorate_tensor_provider` and :code:`decorate_paddle_reader` , both of which receieve Python :code:`generator` as data source.The difference is:
A PyReader object sets its data source with :code:`decorate_paddle_reader()` or :code:`decorate_tensor_provider()` . Both receive a Python generator :code:`generator` as their argument, and :code:`generator` yields one batch of data on each iteration.
The differences between :code:`decorate_paddle_reader()` and :code:`decorate_tensor_provider()` are:
- The :code:`generator` passed to :code:`decorate_paddle_reader()` should yield data as Numpy arrays, while the :code:`generator` passed to :code:`decorate_tensor_provider()` should yield LoDTensors.
- :code:`decorate_tensor_provider()` requires that the data type and shape of the LoDTensors yielded by :code:`generator` match the dtypes and shapes specified when the py_reader was configured; :code:`decorate_paddle_reader()` has no such requirement, since the data type and shape are converted internally.
Specific usage is as follows:
.. code-block:: python
import paddle.fluid as fluid
import numpy as np
1. :code:`decorate_tensor_provider` : :code:`generator` generates a :code:`list` or :code:`tuple` each time, with each element of :code:`list` or :code:`tuple` being :code:`LoDTensor` or Numpy array, and :code:`LoDTensor` or :code:`shape` of Numpy array must be the same as :code:`shapes` stated while PyReader is created.
BATCH_SIZE = 32
# Case 1: Use decorate_paddle_reader() method to set the data source of py_reader
# The generator yields Numpy-typed batched data
def fake_random_numpy_reader():
image = np.random.random(size=(BATCH_SIZE, 784))
label = np.random.random_integers(size=(BATCH_SIZE, 1), low=0, high=9)
yield image, label
2. :code:`decorate_paddle_reader` : :code:`generator` generates a :code:`list` or :code:`tuple` each time, with each element of :code:`list` or :code:`tuple` being Numpy array,but the :code:`shape` of Numpy array doesn't have to be the same as :code:`shape` stated while PyReader is created. :code:`decorate_paddle_reader` will :code:`reshape` Numpy array internally.
py_reader1 = fluid.layers.py_reader(
capacity=10,
shapes=((-1, 784), (-1, 1)),
dtypes=('float32', 'int64'),
name='py_reader1',
use_double_buffer=True)
py_reader1.decorate_paddle_reader(fake_random_numpy_reader)
# Case 2: Use decorate_tensor_provider() method to set the data source of py_reader
# The generator yields Tensor-typed batched data
def fake_random_tensor_provider():
image = np.random.random(size=(BATCH_SIZE, 784)).astype('float32')
label = np.random.random_integers(size=(BATCH_SIZE, 1), low=0, high=9).astype('int64')
image_tensor = fluid.LoDTensor()
image_tensor.set(image, fluid.CPUPlace())
label_tensor = fluid.LoDTensor()
label_tensor.set(label, fluid.CPUPlace())
yield image_tensor, label_tensor
py_reader2 = fluid.layers.py_reader(
capacity=10,
shapes=((-1, 784), (-1, 1)),
dtypes=('float32', 'int64'),
name='py_reader2',
use_double_buffer=True)
py_reader2.decorate_tensor_provider(fake_random_tensor_provider)
example usage:
.. code-block:: python
......@@ -142,32 +196,75 @@ example usage:
Train and test model with PyReader
##################################
Details are as follows(the remaining part of the code above):
An example of using PyReader to train and test a model is as follows:
.. code-block:: python
import paddle
import paddle.fluid as fluid
import paddle.dataset.mnist as mnist
import six
def network(is_train):
# Create py_reader object and give different names
# when is_train = True and is_train = False
reader = fluid.layers.py_reader(
capacity=10,
shapes=((-1, 784), (-1, 1)),
dtypes=('float32', 'int64'),
name="train_reader" if is_train else "test_reader",
use_double_buffer=True)
img, label = fluid.layers.read_file(reader)
...
# Here, we omitted the definition of loss of the model
return loss , reader
# Create main program and startup program for training
train_prog = fluid.Program()
train_startup = fluid.Program()
# Define train network
with fluid.program_guard(train_prog, train_startup):
# Use fluid.unique_name.guard() to share parameters with test network
with fluid.unique_name.guard():
train_loss, train_reader = network(True)
adam = fluid.optimizer.Adam(learning_rate=0.01)
adam.minimize(train_loss)
# Create main program and startup program for testing
test_prog = fluid.Program()
test_startup = fluid.Program()
# Define test network
with fluid.program_guard(test_prog, test_startup):
# Use fluid.unique_name.guard() to share parameters with train network
with fluid.unique_name.guard():
test_loss, test_reader = network(False)
place = fluid.CUDAPlace(0)
startup_exe = fluid.Executor(place)
startup_exe.run(train_startup)
startup_exe.run(test_startup)
exe = fluid.Executor(place)
trainer = fluid.ParallelExecutor(
use_cuda=True, loss_name=train_loss.name, main_program=train_prog)
# Run startup program
exe.run(train_startup)
exe.run(test_startup)
tester = fluid.ParallelExecutor(
use_cuda=True, share_vars_from=trainer, main_program=test_prog)
# Compile programs
train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(loss_name=train_loss.name)
test_prog = fluid.CompiledProgram(test_prog).with_data_parallel(share_vars_from=train_prog)
# Set the data source of py_reader using decorate_paddle_reader() method
train_reader.decorate_paddle_reader(
paddle.reader.shuffle(paddle.batch(mnist.train(), 512), buf_size=8192))
test_reader.decorate_paddle_reader(paddle.batch(mnist.test(), 512))
for epoch_id in xrange(10):
for epoch_id in six.moves.range(10):
train_reader.start()
try:
while True:
print 'train_loss', numpy.array(
trainer.run(fetch_list=[train_loss.name]))
loss = exe.run(program=train_prog, fetch_list=[train_loss])
print 'train_loss', loss
except fluid.core.EOFException:
print 'End of epoch', epoch_id
train_reader.reset()
......@@ -175,8 +272,8 @@ Details are as follows(the remaining part of the code above):
test_reader.start()
try:
while True:
print 'test loss', numpy.array(
tester.run(fetch_list=[test_loss.name]))
loss = exe.run(program=test_prog, fetch_list=[test_loss])
print 'test loss', loss
except fluid.core.EOFException:
print 'End of testing'
test_reader.reset()
......
......@@ -35,7 +35,9 @@ In the training of data parallelism mode, Fluid uses two communication modes to
The pserver process can run on a compute node entirely separate from the trainers, or share a node with a trainer. The number of pserver processes required for a distributed task usually needs to be tuned to the actual situation to achieve the best performance; however, there are usually no more pserver processes than trainer processes.
When using GPU training, the pserver can choose to use the GPU or only use the CPU. If the pserver also uses the GPU, it will result in the extra overhead of copying the gradient data received from the CPU to the GPU. In some cases, the overall training performance will be degraded.
**Note:** When using GPU training, the pserver can choose to use the GPU or only the CPU. If the pserver also uses the GPU, the gradient data received on the CPU must be copied to the GPU, which adds extra overhead and in some cases degrades the overall training performance.
**Note:** When using GPU training, if there are multiple GPU cards on each trainer node, gradient aggregation is first performed with NCCL2 among the cards within a node, and then across nodes through the pserver.
- Structure of NCCL2 communication method:
......@@ -178,8 +180,10 @@ Distributed training in NCCL2 mode, because there is no parameter server role, t
* Configure :code:`mode="nccl2"` in :code:`fluid.DistributeTranspilerConfig` .
* When calling :code:`transpile`, pass the endpoints of all trainer nodes as :code:`trainers` and the endpoint of the current node as :code:`current_endpoint` .
In this step, :code:`gen_nccl_id_op` is added to the :code:`startup program` to synchronize the NCCL ID among nodes during multi-node program initialization.
* Initialize :code:`ParallelExecutor` with :code:`num_trainers` and :code:`trainer_id` .
In this step, :code:`ParallelExecutor` initializes NCCL2 in multi-node mode and performs :code:`allreduce` operations across nodes on the gradient of every parameter to carry out multi-node training (a sketch follows the parameter table below).
For example:
.. code-block:: python
......@@ -198,24 +202,45 @@ For example:
.. csv-table:: Description of the necessary parameters for NCCL2 mode
:header: "parameter", "description"
"trainer_id", "The unique ID of each trainer node in the task, starting at 0, there cannot be any duplication"
"trainers", "endpoints of all trainer nodes in the task, used to broadcast NCCL IDs when NCCL2 is initialized"
"current_endpoint", "endpoint of current node"
"trainer_id", "(int)The unique ID of each trainer node in the task, starting at 0, there cannot be any duplication"
"trainers", "(int)endpoints of all trainer nodes in the task, used to broadcast NCCL IDs when NCCL2 is initialized"
"current_endpoint", "(string)endpoint of current node"
Currently, distributed training using NCCL2 only supports synchronous training. NCCL2 mode is better suited to relatively large models that require synchronous GPU training. If the hardware supports RDMA and GPU Direct, it can achieve high distributed training performance.
Start Up NCCL2 Distributed Training in Multi-Process Mode
++++++++++++++++++++++++++++++++++++++++++++++
Usually you can get better multi-GPU training performance by starting the NCCL2 distributed training job in multi-process mode. Paddle provides the :code:`paddle.distributed.launch` module to start a multi-process job, after which each training process uses an independent GPU device.
Notes on usage:
* Number of nodes: set the number of nodes of a job with the environment variable :code:`PADDLE_NUM_TRAINERS` ; this variable is also set in every training process.
* Number of devices per node: the :code:`--gpus` argument sets the number of GPU devices on each node, and the sequence number of each process is set automatically in the environment variable :code:`PADDLE_TRAINER_ID` .
* Data sharding: multi-process mode means one process per device. Each process generally handles a portion of the training data, so that together all processes cover the whole dataset.
* Entry file: the entry file is the training script that is actually launched.
* Logs: the log of each training process is saved in the :code:`./mylog` directory by default; you can change the location with the :code:`--log_dir` argument.
Startup example:
.. code-block:: bash
> PADDLE_NUM_TRAINERS=<TRAINER_COUNT> python -m paddle.distributed.launch --gpus <NUM_GPUS_ON_HOSTS> <ENTRYPOINT_SCRIPT> --arg1 --arg2 ...
Important Notes on NCCL2 Distributed Training
++++++++++++++++++++++++++++++++++++++++++++++
**Note** : Please ensure each node has the same amount of data to train in NCCL2 mode distributed training, which prevents
**Note:** When using distributed training in NCCL2 mode, if you only want to use some of the GPU cards on a node, you can select them with the environment variable, e.g. :code:`export CUDA_VISIBLE_DEVICES=0,1,2,3` .
**Note:** Please ensure each node has the same amount of data to train in NCCL2 mode distributed training, which prevents
exit at the final iteration. There are two common ways:
- Randomly sample some data to complement nodes that received less data. (We recommend this method for the sake of training on the complete dataset.)
- Have each node train only a fixed number of batches per pass, controlled in Python code. If a node has more data than this fixed amount, the extra data will not be trained.
**Note** : If there are multiple network devices in the system, you need to manually specify the devices used by NCCL2.
Assuming you need to use :code:`eth2` as the communication device, you need to set the following environment variables:
......
.. _user_guide_save_load_vars:
##########################################################
Saving/Loading Model Variables and Incremental Training
##########################################################
Types of Model Variables
########################
......@@ -37,6 +37,21 @@
then we should save all these kinds of persistable variables, and even record the id of the current epoch and step. This is because some model variables, although not parameters, are still indispensable for training the model.
Differences among save_vars, save_params, save_persistables and save_inference_model
#####################################################################################
1. :code:`save_inference_model` prunes the network according to the user-specified :code:`feeded_var_names` and :code:`target_vars`, and saves the ``__model__`` file of the pruned network structure together with the persistable variables of the pruned network.
2. :code:`save_persistables` does not save the network structure; it saves all persistable variables of the network to the specified location.
3. :code:`save_params` does not save the network structure; it saves all model parameters of the network to the specified location.
4. :code:`save_vars` does not save the network structure; it saves variables according to the list of :code:`fluid.framework.Parameter` specified by the user.
:code:`save_persistables` saves the most complete set of network parameters. For incremental training or resuming training, please use :code:`save_persistables` to save variables.
:code:`save_inference_model` saves the network parameters together with the pruned model. If you will run inference later, please use :code:`save_inference_model` to save both variables and the network, as in the sketch below.
:code:`save_vars` and :code:`save_params` should only be used when you clearly understand their purpose and have a special need; in general they are not recommended.
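A minimal sketch of the two recommended APIs; the toy network, directory names and feed/fetch variables are hypothetical and only illustrate the calling convention.

.. code-block:: python

    import paddle.fluid as fluid

    # Hypothetical toy network.
    img = fluid.layers.data(name="img", shape=[784], dtype="float32")
    pred = fluid.layers.fc(input=img, size=10, act="softmax")

    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(fluid.default_startup_program())

    # For incremental training / resuming: save all persistable variables.
    fluid.io.save_persistables(executor=exe, dirname="./checkpoint",
                               main_program=fluid.default_main_program())

    # For later inference: save the pruned network plus its persistable variables.
    fluid.io.save_inference_model(dirname="./inference_model",
                                  feeded_var_names=["img"],
                                  target_vars=[pred],
                                  executor=exe)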
Save the Model for Inference on New Samples
============================================
......@@ -97,8 +112,6 @@
In addition, note that :code:`fluid.default_startup_program()` must be run before calling :code:`fluid.io.load_params` . If it is run afterwards, it may overwrite the loaded model parameters and cause errors.
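A minimal sketch of the correct ordering; the parameter directory is hypothetical and the network definition is omitted for brevity.

.. code-block:: python

    import paddle.fluid as fluid

    # (network definition omitted for brevity)
    exe = fluid.Executor(fluid.CPUPlace())

    # Run the startup program first to create and initialize the variables ...
    exe.run(fluid.default_startup_program())

    # ... and only then load the saved parameters, so they are not overwritten.
    fluid.io.load_params(executor=exe, dirname="./saved_params",
                         main_program=fluid.default_main_program())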
Saving and Loading the Inference Model
########################################
......@@ -162,6 +175,7 @@
1. When calling :code:`fluid.io.save_persistables` at the end of training to save persistable variables, not every trainer needs to call this method; usually it is enough for trainer 0 to save them.
2. In multi-node incremental training, parameters are loaded on the PServer side; the trainer side does not need to load parameters. After all PServers have started, trainers synchronize the parameters from the PServers.
3. When incremental training is confirmed to be needed, the ``current_endpoint`` argument must be specified when the multi-node job calls :code:`fluid.DistributeTranspiler.transpile` (a sketch follows the code excerpt below).
The general steps of multi-node incremental training (without distributed large-scale sparse matrices) are:
......@@ -211,7 +225,7 @@
training_role == "PSERVER"
config = fluid.DistributeTranspilerConfig()
t = fluid.DistributeTranspiler(config=config)
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, sync_mode=True)
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, sync_mode=True, current_endpoint=current_endpoint)
if training_role == "PSERVER":
current_endpoint = "127.0.0.1:1001"
......
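As an assumption-labeled sketch of items 2 and 3 above, the PServer side typically builds its program from the transpiler, runs its startup program, loads the previously saved persistable variables, and then starts serving. The endpoint strings and the checkpoint directory below are hypothetical.

.. code-block:: python

    import paddle.fluid as fluid

    # Hypothetical endpoints / roles; in practice they come from the environment.
    pserver_endpoints = "127.0.0.1:1001,127.0.0.1:1002"
    current_endpoint = "127.0.0.1:1001"
    trainer_id, trainers = 0, 2

    config = fluid.DistributeTranspilerConfig()
    t = fluid.DistributeTranspiler(config=config)
    t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers,
                sync_mode=True, current_endpoint=current_endpoint)

    pserver_prog = t.get_pserver_program(current_endpoint)
    pserver_startup = t.get_startup_program(current_endpoint, pserver_prog)

    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(pserver_startup)
    # Load the persistable variables saved by trainer 0 in the previous run.
    fluid.io.load_persistables(executor=exe, dirname="./checkpoint",
                               main_program=pserver_prog)
    exe.run(pserver_prog)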
......@@ -14,8 +14,7 @@
- `Train Neural Networks <../user_guides/howto/training/index_cn.html>`_ : how to use Fluid for single-node training, multi-node training, and saving/loading model variables
- `DyGraph Mode <../user_guides/howto/dygraph/DyGraph.md>`_ : how to use DyGraph in Fluid
- `DyGraph Mode <../user_guides/howto/dygraph/DyGraph.html>`_ : how to use DyGraph in Fluid
- `Model Evaluation and Debugging <../user_guides/howto/evaluation_and_debugging/index_cn.html>`_ : how to evaluate and debug models in Fluid, including:
......
......@@ -14,8 +14,7 @@ Fluid模型配置和参数文件的工具。
- `AlexNet <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
- `VGG <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
- `GoogleNet <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
- `Residual
Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
- `Residual Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
- `Inception-v4 <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
- `MobileNet <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models>`__
- `Dual Path
......@@ -122,7 +121,7 @@ RNN 结构的 NMT 得以应运而生,例如基于卷积神经网络 CNN
Attention to learn the contextual dependencies in language. Compared with RNN/CNN, this structure has lower per-layer computational complexity, is easier to parallelize, and is better at modeling long-range dependencies; it finally achieved the best translation results across multiple languages.
- `Transformer <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README_cn.md>`__
- `Transformer <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README.md>`__
Reinforcement Learning
----------------------
......@@ -163,7 +162,7 @@ DQN 及其变体,并测试了它们在 Atari 游戏中的表现。
The DAM (Deep Attention Matching Network) released in this example is the work of Baidu's Natural Language Processing Department published at ACL-2018, used for response selection in multi-turn dialogue of retrieval-based chatbots. Inspired by the Transformer, DAM's network structure is based entirely on the attention mechanism: it uses stacked self-attention to learn semantic representations of responses and contexts at different granularities, and then uses cross-attention to capture the relevance between responses and contexts. Its performance on two large-scale multi-turn dialogue datasets is better than that of other models.
- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/deep_attention_matching_net>`__
- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/deep_attention_matching>`__
AnyQ
----
......@@ -174,8 +173,7 @@ AnyQ
SimNet is a semantic matching framework independently developed by Baidu's Natural Language Processing Department in 2013 and widely used across Baidu products. It mainly includes core network structures such as BOW, CNN, RNN and MM-DNN, and mainstream academic semantic matching models such as MatchPyramid, MV-LSTM and K-NRM are also integrated on top of it. Models built with SimNet can be conveniently added to the AnyQ system to enhance its semantic matching capability.
- `SimNet in PaddlePaddle
Fluid <https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md>`__
- `SimNet in PaddlePaddle Fluid <https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md>`_
Machine Reading Comprehension
-----------------------------
......@@ -184,7 +182,7 @@ SimNet是百度自然语言处理部于2013年自主研发的语义匹配框架
The Baidu reading comprehension dataset is a real-world dataset open-sourced by Baidu's Natural Language Processing Department. All questions and passages come from actual data (Baidu search engine data and the Baidu Zhidao Q&A community), and the answers are written by humans. Each question corresponds to multiple answers. The dataset contains 200k questions, 1000k passages and 420k answers, and is currently the largest Chinese MRC dataset. Baidu also open-sourced the corresponding reading comprehension model, called DuReader. It adopts a common hierarchical network structure, captures the interaction between questions and passages through a bidirectional attention mechanism to generate a query-aware passage representation, and finally predicts the answer span with a pointer network based on that representation.
- `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/machine_reading_comprehension/README.md>`__
- `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/reading_comprehension>`_
Personalized Recommendation
......
......@@ -97,7 +97,7 @@ Machine Translation transforms a natural language (source language) into another
The Transformer implemented in this example is a machine translation model based on the self-attention mechanism: there is no RNN or CNN structure; instead it relies entirely on attention to learn context dependencies. Compared with RNN/CNN, this structure has lower computational complexity within a single layer, is easier to parallelize, and models long-range dependencies more easily, and it finally achieves the best translation quality across multiple languages.
- `Transformer <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README_cn.md>`__
- `Transformer <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/neural_machine_translation/transformer/README.md>`__
Reinforcement learning
-------------------------
......@@ -131,7 +131,7 @@ In many scenarios of natural language processing, it is necessary to measure the
The DAM (Deep Attention Matching Network) introduced in this example is the work of Baidu's Natural Language Processing Department published at ACL-2018, used for response selection in multi-turn dialogue of retrieval-based chatbots. Inspired by the Transformer, DAM is based entirely on the attention mechanism: it uses a stacked self-attention structure to learn semantic representations of responses and contexts at different granularities, and then uses cross-attention to capture the relevance between responses and contexts. Its performance on two large-scale multi-turn dialogue datasets is better than that of other models.
- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/deep_attention_matching_net>`__
- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/deep_attention_matching>`__
AnyQ
----
......@@ -151,7 +151,7 @@ Machine Reading Comprehension (MRC) is one of the core tasks in Natural Language
The Baidu reading comprehension dataset is an open-source real-world dataset released by Baidu's Natural Language Processing Department. All questions and passages are derived from actual data (Baidu search engine data and the Baidu Zhidao Q&A community), and the answers are written by humans. Each question corresponds to multiple answers. The dataset contains 200k questions, 1000k passages and 420k answers, and is currently the largest Chinese MRC dataset. Baidu also released the corresponding open-source reading comprehension model, called DuReader. DuReader adopts a common hierarchical network structure, captures the interaction between questions and passages through a bidirectional attention mechanism to generate a query-aware passage representation, and finally predicts the answer span with a pointer network based on that representation.
- `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/machine_reading_comprehension/README.md>`__
- `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/reading_comprehension>`__
Personalized recommendation
......
Subproject commit 7b453631fc29e4fe2f68757fc458634d55690b3d
Subproject commit 2ff867f88628e9cb8b76eaf79045eca0f52e5b85