Fix docs 1.8 (#2150)

* test=develop, test=document_fix
* test=release/1.8, test=document_fix

4c2c4ac4 · Daniel Yang · GitHub · 3bb48daf · 4c2c4ac4 · 4c2c4ac4
5 changed files
--- a/doc/fluid/advanced_guide/distributed_training/index_cn.rst
+++ b/doc/fluid/advanced_guide/distributed_training/index_cn.rst
@@ -6,5 +6,4 @@
    :maxdepth: 1
    cluster_quick_start.rst
-    cluster_howto.rst
    fleet_api_howto_cn.rst
--- a/doc/fluid/advanced_guide/distributed_training/index_en.rst
+++ b/doc/fluid/advanced_guide/distributed_training/index_en.rst
@@ -8,5 +8,5 @@ Distributed Training
   :maxdepth: 1
   cluster_quick_start_en.rst
-   cluster_howto_en.rst
--- a/doc/fluid/beginners_guide/basic_concept/dygraph/DyGraph.md
+++ b/doc/fluid/beginners_guide/basic_concept/dygraph/DyGraph.md
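-# Imperative Programming Mode (Dynamic Graph) Mechanism: DyGraph
+# Imperative Programming Tutorial
-PaddlePaddle's DyGraph mode is a dynamic graph execution mechanism that runs operations immediately, without building the whole graph first. Unlike the earlier static computation graph, every operation in DyGraph mode returns its result right away instead of waiting for the entire graph to finish executing. This makes building and debugging deep learning tasks in PaddlePaddle more intuitive, and it removes much of the code that was needed to construct a static graph.
-PaddlePaddle DyGraph is a more flexible and easier-to-use mode that provides:
-* More flexible code organization: use Python control flow and object-oriented model design
+In terms of programming paradigm, PaddlePaddle supports both declarative and imperative programming, commonly known as static graphs and dynamic graphs. Strictly speaking, PaddlePaddle has no notion of a graph: by design, a neural network is defined as a program-like description, so the user defines the model's representation and computation while writing the program. For control flow in static graph mode, PaddlePaddle relies on its own control-flow OPs rather than Python's native if/else and for loops, so that a program defined in PaddlePaddle (that is, a network model) has an internal representation that can be globally optimized, compiled, and executed. Because developers often prefer Python's native control flow, PaddlePaddle also supports it and executes it by interpretation; this is the imperative programming mode. Overall, the two paradigms are largely compatible and unified. PaddlePaddle will continue to release more complete imperative programming features while keeping strong performance.
-* Easier debugging: print results on the fly with Python's print to inspect a running model, which makes testing and changes easier
-* Model code shared with static graph execution: the same model code can be debugged and run in DyGraph, and can also run in the original declarative programming (static graph) mode
-For more model examples that use the imperative programming mode, see [Paddle/models/dygraph](https://github.com/PaddlePaddle/models/tree/develop/dygraph)
+In PaddlePaddle, a neural network is abstracted into the computation representation **Operator** (often abbreviated OP) and the data representation **Variable**, as shown in Figure 1. Each layer of a neural network consists of one or more **Operator**s; each **Operator** takes a series of **Variable**s as input and, after computation, outputs a series of **Variable**s.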
+<center><img src="https://ai-studio-static-online.cdn.bcebos.com/15197499f49840fcb43a38d19d9c729e19f3a7bf5ae5432a8eeca083ac4e02b7" width="600" ></center>
+<br>
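-## Setup and Basic Usage
+<center>Figure 1. Relationship between Operator and Variable</center>
+Depending on how an **Operator** is parsed and executed, PaddlePaddle supports the following two programming paradigms:
+* **Declarative programming mode (static graph)**: compile first, then execute. The user must define the complete network structure in advance; the network is then compiled and optimized before it can be executed to produce results.
+* **Imperative programming mode (dynamic graph)**: interpreted execution. The user does not need to define the complete network structure in advance; each line of network code produces its result as soon as it runs.
-1. Upgrade to the latest PaddlePaddle 1.6.0:
+For example, suppose the user writes the line y=x+1. In static graph mode, running this code only inserts an **Operator** that adds 1 to a Tensor into the computation graph; the **Operator** has not actually executed, so the value of y is not available. In dynamic graph mode, every **Operator** executes immediately, so once this line has run the **Operator** has finished and the user can read the value of y directly.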
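The difference between the two modes for `y = x + 1` can be pictured with a small pure-Python sketch. It is a toy illustration only, not how the framework is implemented; the `DeferredAdd` class and both function names are hypothetical.

```python
# Toy illustration only: this does not use or mirror the framework's internals.

class DeferredAdd:
    """Records 'input + constant' as a graph node without computing it."""

    def __init__(self, constant):
        self.constant = constant

    def run(self, feed_value):
        # The addition happens only here, when the recorded node is executed.
        return feed_value + self.constant


def static_style():
    # "y = x + 1" only records an operator; y holds no value yet.
    y = DeferredAdd(1)
    # A value appears only once the node is run with concrete input.
    return y.run(feed_value=2)


def dynamic_style():
    x = 2
    # The operator executes immediately, so y already holds the result.
    y = x + 1
    return y


print(static_style())
print(dynamic_style())
```

Both functions arrive at the same number; what differs is the moment at which the value of `y` becomes available.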
-    ```
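+## Why Is the Imperative Programming Mode Becoming More Popular?
-    pip install -q --upgrade paddlepaddle==1.6.0
-    ```
-2. Use the `fluid.dygraph.guard(place=None)` context:
+Declarative programming, one of the earlier paradigms, offers a rich set of APIs for implementing a wide range of models quickly; it can use global information to optimize the graph, improving performance and memory usage; and it connects seamlessly to inference deployment. In practice, however, the declarative programming mode has the following problems:
+1. Because it compiles first and executes later, network construction and execution are separated, which makes debugging inconvenient.
+2. It is a symbolic style of programming that requires learning a new way to program, which raises the barrier to entry.
+3. The network structure is fixed, so tasks with tree-like structures are not supported well.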
-    ```python
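+The imperative programming mode addresses these problems and offers the following advantages:
-    import paddle.fluid as fluid
+1. Results are available as soon as the code runs, and IDE breakpoint debugging is supported, which makes debugging easier.
-    with fluid.dygraph.guard():
+2. It is an imperative style, similar to writing ordinary Python, so it is easier to pick up.
-        # write your executable dygraph code here
+3. The network structure can vary at different levels, which makes it more flexible to use.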
-    ```
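-You can now run networks in DyGraph mode inside the `fluid.dygraph.guard()` context. DyGraph changes how PaddlePaddle executes: operations now run immediately and return their results to Python.
-DyGraph works well with Numpy: `fluid.dygraph.to_variable(x)` converts an ndarray to a `fluid.Variable`, and `fluid.Variable.numpy()` converts a result obtained at any point into a Numpy `ndarray`:
+These advantages are making the imperative programming mode increasingly popular with developers. This chapter focuses on imperative programming in PaddlePaddle and covers the following parts:
+1. How to enable the imperative programming mode
+2. How to train a model with imperative programming
+3. How to run multi-GPU training with imperative programming
+4. How to deploy an imperative programming model
+5. Common techniques for the imperative programming mode, such as printing intermediate variable values and gradients, breakpoint debugging, blocking backpropagation, and how to rewrite code to run in static graph mode in certain scenarios.
+## 1. Enabling the Imperative Programming Mode
+Static graph is currently the default mode in PaddlePaddle. Dynamic graph mode is enabled through a context manager: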
+```
+with fluid.dygraph.guard():
+```
+Let us first use an example to observe how execution differs before and after dynamic graph mode is enabled:
 ```python
-x = np.ones([2, 2], np.float32)
+import numpy as np
+import paddle.fluid as fluid
+from paddle.fluid.dygraph.base import to_variable
+main_program = fluid.Program()
+startup_program = fluid.Program()
+with fluid.program_guard(main_program=main_program, startup_program=startup_program):
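+    # Use np.ones to build a 2x2 two-dimensional array filled with 1
+    data = np.ones([2, 2], np.float32)
+    # In static graph mode, use layers.data to create a placeholder for the input data
+    x = fluid.layers.data(name='x', shape=[2], dtype='float32')
+    print('In static mode, after calling layers.data, x = ', x)
+    # In static graph mode, apply x = x + 10 to the Variable
+    x += 10
+    # In static graph mode, the user must explicitly specify the device to run on
+    # Here fluid.CPUPlace() is called to run the program on the CPU
+    place = fluid.CPUPlace()
+    # Create the executor and use the place argument to indicate the device to run on
+    exe = fluid.Executor(place=place)
+    # Initialization, including allocating space for all variables; for example, 'x' above gets actual memory only after the next line runs
+    exe.run(fluid.default_startup_program())
+    # Use the executor to run all recorded operations, in this example layers.data and x += 10
+    # When calling the executor's run interface, fetch_list specifies which variables' results to retrieve; here we fetch 'x' after 'x += 10' has run.
+    # The feed argument passes data in; here we pass data to the 'x' declared with fluid.layers.data.
+    data_after_run = exe.run(fetch_list=[x], feed={'x': data})
+    # Printing the executor's result shows that, after execution, the Tensor has been filled and computed: every element is 11
+    print('In static mode, data after run:', data_after_run)
+# Enable dynamic graph mode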
 with fluid.dygraph.guard():
-    inputs = []
+    # In dynamic graph mode, convert the numpy ndarray into a Variable
-    for _ in range(10):
+    x = fluid.dygraph.to_variable(data)
-        inputs.append(fluid.dygraph.to_variable(x))
+    print('In DyGraph mode, after calling dygraph.to_variable, x = ', x)
-    ret = fluid.layers.sums(inputs)
+    # In dynamic graph mode, apply x = x + 10 to the Variable
-    print(ret.numpy())
+    x += 10
+    # In dynamic graph mode, call the Variable's numpy function to convert it back into a numpy ndarray
+    print('In DyGraph mode, data after run:', x.numpy())
 ```
-Output:
+    In static mode, after calling layers.data, x =  name: "x"
+    type {
+      type: LOD_TENSOR
+      lod_tensor {
+        tensor {
+          data_type: FP32
+          dims: -1
+          dims: 2
+        }
+        lod_level: 0
+      }
+    }
+    persistable: false
-```
+    In static mode, data after run: [array([[11., 11.],
-[[10. 10.]
+           [11., 11.]], dtype=float32)]
-[10. 10.]]
+    In DyGraph mode, after calling dygraph.to_variable, x =  name generated_var_0, dtype: VarType.FP32 shape: [2, 2] 	lod: {}
-```
+    	dim: 2, 2
+    	layout: NCHW
+    	dtype: float
+    	data: [1 1 1 1]
+    In DyGraph mode, data after run: [[11. 11.]
+     [11. 11.]]
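->    Here a series of `ndarray` inputs is created; after a `sum` operation, the result can be printed directly
-Then call `reduce_sum` and use `Variable.backward()` to run the backward pass; `Variable.gradient()` returns the gradient values as an `ndarray` once the backward network has finished: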
+    /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py:804: UserWarning: There are no operators in the program to be executed. If you pass Program manually, please use fluid.program_guard to ensure the current Program is being used.
+      warnings.warn(error_info)
-```python
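+The output above shows that:
-loss = fluid.layers.reduce_sum(ret)
+* In dynamic graph mode, every operation is completed as it runs, which is closer to everyday programming; the result of each operation is available at any time.
-loss.backward()
+* In static graph mode, no operation is actually executed along the way; as the example shows, only the declared type can be printed. An executor must be called at the end to run all operations together, and the results are returned through the executor.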
-print(loss.gradient())
-```
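+## 2. Training a Model with Imperative Programming
+Next we use a simple handwritten digit recognition task to show how to train a model with PaddlePaddle's dynamic graph. It involves the following steps:
-Output:
+* 2.1 Defining the data reader: data reading and preprocessing.
+* 2.2 Defining the model and optimizer: building the neural network structure.
+* 2.3 Training: configure the optimizer, learning rate, and training parameters, then run the training procedure in a loop, repeating "forward computation + loss function + backpropagation".
+* 2.4 Evaluation and testing: save the trained model, then evaluate and test it.
-```
+Finally, we introduce:
-[1.]
+* 2.5 How to save and load model parameters.
-```
-## Building a Network with DyGraph
+As covered in earlier chapters, the handwritten digit recognition task is to recognize the digit in a 28 * 28 pixel image. The MNIST dataset can be used for training.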
+![](https://ai-studio-static-online.cdn.bcebos.com/f8ffb092f6354d8c9c0219224db0e87b5490c5715cc346cf87b7098b2c3c2069)
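-1. PaddlePaddle model code written in an object-oriented design for DyGraph execution consists mainly of the following **two parts**: **Note that if the layer you design contains parameters, you must describe its behavior with an object-oriented class that inherits from `fluid.dygraph.Layer`.**
+For a detailed introduction to this task and dataset, see [Getting Started with PaddlePaddle: Handwritten Digit Recognition Model](https://aistudio.baidu.com/aistudio/projectdetail/224342)
-    1. To build an object-oriented network that can run in DyGraph mode, inherit from `fluid.dygraph.Layer` and call the base class `__init__` method. The constructor usually performs operations such as parameter initialization and sub-network initialization, none of which depend on dynamic input information:
+### 2.1 Defining the Data Reader
-        ```python
+PaddlePaddle provides several ready-made dataset APIs. For this task, calling the train and test functions of [paddle.dataset.mnist](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/data/dataset_cn.html) returns the preprocessed MNIST training and test sets directly. The [paddle.batch](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/io_cn/batch_cn.html#batch) interface then returns a reader decorator; the resulting reader packs data from the input reader into batches of the specified BATCH_SIZE.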
-        class MyLayer(fluid.dygraph.Layer):
-            def __init__(self, input_size):
-                super(MyLayer, self).__init__()
-                self.linear = fluid.dygraph.nn.Linear(input_size, 12)
-        ```
-    2. Implement a `forward(self, *inputs)` function that carries out the network's runtime logic. It is called in every training/prediction iteration; here we run a simple `linear` -> `relu` -> `elementwise mul` -> `reduce sum`:
-        ```python
+```python
-            def forward(self, inputs):
+import paddle
-                x = self.linear(inputs)
-                x = fluid.layers.relu(inputs)
-                self._x_for_debug = x
-                x = fluid.layers.elementwise_mul(x, x)
-                x = fluid.layers.reduce_sum(x)
-                return [x]
-        ```
-2. Run inside `fluid.dygraph.guard()`:
+# Define the batch size
+BATCH_SIZE = 64
-    1. Build the input with Numpy:
+# Build the readers by calling the train and test functions of paddle.dataset.mnist
+train_reader = paddle.batch(
+    paddle.dataset.mnist.train(), batch_size=BATCH_SIZE, drop_last=True)
+test_reader = paddle.batch(
+    paddle.dataset.mnist.test(), batch_size=BATCH_SIZE, drop_last=True)
+```
-        ```python
+    Cache file /home/aistudio/.cache/paddle/dataset/mnist/train-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/train-images-idx3-ubyte.gz
-        np_inp = np.array([1.0, 2.0, -1.0], dtype=np.float32)
+    Begin to download
-        ```
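-    2. Convert the input `ndarray` to a `Variable` and run the forward network to get the return value: `fluid.dygraph.to_variable(np_inp)` converts the Numpy input into an input accepted by DyGraph, `my_layer(var_inp)[0]` calls the callable object and obtains `x` as the return value, and `x.numpy()` returns the computed `x` as an `ndarray`.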
+    Download finished
+    Cache file /home/aistudio/.cache/paddle/dataset/mnist/train-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/train-labels-idx1-ubyte.gz
+    Begin to download
+    ........
+    Download finished
+    Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-images-idx3-ubyte.gz
+    Begin to download
-        ```python
+    Download finished
-        with fluid.dygraph.guard():
+    Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-labels-idx1-ubyte.gz
-            var_inp = fluid.dygraph.to_variable(np_inp)
+    Begin to download
-            my_layer = MyLayer(np_inp.shape[-1])
+    ..
-            x = my_layer(var_inp)[0]
+    Download finished
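To make the batching behaviour described in 2.1 concrete, here is a minimal pure-Python sketch of a reader decorator with a `drop_last` option. It is an assumption-laden stand-in written for illustration, not the source of `paddle.batch`; `toy_reader` and its sample format are hypothetical.

```python
def batch(reader, batch_size, drop_last=False):
    """Wrap a sample-level reader into a batch-level reader (illustrative only)."""
    def batch_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) == batch_size:
                yield buf
                buf = []
        # Emit the final short batch unless the caller asked to drop it.
        if buf and not drop_last:
            yield buf
    return batch_reader


def toy_reader():
    # Hypothetical stand-in for a dataset reader: yields (image, label) pairs.
    for i in range(10):
        yield ("image_%d" % i, i % 10)


dropped = list(batch(toy_reader, batch_size=4, drop_last=True)())
kept = list(batch(toy_reader, batch_size=4, drop_last=False)())
print([len(b) for b in dropped])
print([len(b) for b in kept])
```

With ten samples and a batch size of four, `drop_last=True` discards the trailing two samples, so every batch has the same size.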
-            dy_out = x.numpy()
-        ```
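-    3. Compute gradients: automatic differentiation is useful for implementing machine learning algorithms (such as backpropagation for training neural networks). `x.backward()` runs the backward network starting from a given `fluid.Variable`, and `my_layer._x_for_debug.gradient()` returns the gradient of `x` in the network as an `ndarray`:
-        ```python
+### 2.2 Defining the Model and Optimizer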
-            x.backward()
-            dy_grad = my_layer._x_for_debug.gradient()
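+This section uses the following network model, which handles the handwritten digit recognition task well. The model consists of convolution layer -> pooling layer -> convolution layer -> pooling layer -> fully connected layer; a pooling layer is a downsampling layer.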
-        ```
+![](https://ai-studio-static-online.cdn.bcebos.com/f9e59d727d68437aaaad8cee410e564c7a80063367bd4fcd9f710a1480ee338c)
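+Before building the network model, note the following:
+> <font size=2>In dynamic graph mode, parameters and variables are stored and managed differently from static graph mode. In dynamic graph mode, the parameters learned by the network and its intermediate variables share the lifetime of their Python objects. Put simply, when a Python object's lifetime ends, the corresponding storage is released.</font>
+During learning, a network's parameters are updated continuously, so they must persist for the whole learning cycle, and a mechanism is needed to keep all of the network's parameters from being released. PaddlePaddle's dynamic graph mode manages all parameters with an object-oriented design based on inheriting from [fluid.dygraph.Layer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#layer); this approach also makes it easier to organize code into modules.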
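The idea of a base class that keeps every parameter reachable can be sketched in plain Python. The classes below (`ToyParameter`, `ToyLayer`, `ToyConv`, `ToyNet`) are hypothetical and only illustrate one plausible mechanism, attribute registration; they are not the framework's implementation.

```python
class ToyParameter:
    """Hypothetical trainable value."""

    def __init__(self, values):
        self.values = values


class ToyLayer:
    """Hypothetical base class: remembers parameters and sublayers set as attributes."""

    def __init__(self):
        object.__setattr__(self, "_params", {})
        object.__setattr__(self, "_sublayers", {})

    def __setattr__(self, name, value):
        # Register parameters and sublayers so they stay referenced by the layer.
        if isinstance(value, ToyParameter):
            self._params[name] = value
        elif isinstance(value, ToyLayer):
            self._sublayers[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        # Collect this layer's parameters, then recurse into sublayers.
        found = list(self._params.values())
        for sub in self._sublayers.values():
            found.extend(sub.parameters())
        return found


class ToyConv(ToyLayer):
    def __init__(self):
        super().__init__()
        self.weight = ToyParameter([0.0] * 4)


class ToyNet(ToyLayer):
    def __init__(self):
        super().__init__()
        self.conv1 = ToyConv()
        self.conv2 = ToyConv()
        self.output_weight = ToyParameter([0.0] * 2)


net = ToyNet()
print(len(net.parameters()))
```

As long as `net` is alive, every registered parameter is referenced and therefore not released, and an optimizer can be handed the whole list in one call.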
+The following shows how to implement a simple ConvPool layer by inheriting from fluid.dygraph.Layer. The layer consists of a [convolution layer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Conv2D_cn.html#conv2d) and a [pooling layer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Pool2D_cn.html#pool2d).
-The complete code is as follows:
 ```python
 import paddle.fluid as fluid
-import numpy as np
+from paddle.fluid.dygraph.nn import Conv2D, Pool2D
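+# Define the SimpleImgConvPool network; it must inherit from fluid.dygraph.Layer
+# The network consists of one convolution layer and one pooling layer
+class SimpleImgConvPool(fluid.dygraph.Layer):
+    # The __init__ constructor initializes variables, parameters, and sub-networks
+    # This example initializes the Conv2D and Pool2D networks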
+    def __init__(self,
+                 num_channels,
+                 num_filters,
+                 filter_size,
+                 pool_size,
+                 pool_stride,
+                 pool_padding=0,
+                 pool_type='max',
+                 global_pooling=False,
+                 conv_stride=1,
+                 conv_padding=0,
+                 conv_dilation=1,
+                 conv_groups=1,
+                 act=None,
+                 use_cudnn=False,
+                 param_attr=None,
+                 bias_attr=None):
+        super(SimpleImgConvPool, self).__init__()
+        # Initialize the Conv2D network
+        self._conv2d = Conv2D(
+            num_channels=num_channels,
+            num_filters=num_filters,
+            filter_size=filter_size,
+            stride=conv_stride,
+            padding=conv_padding,
+            dilation=conv_dilation,
+            groups=conv_groups,
+            param_attr=param_attr,
+            bias_attr=bias_attr,
+            act=act,
+            use_cudnn=use_cudnn)
+        # Initialize the Pool2D network
+        self._pool2d = Pool2D(
+            pool_size=pool_size,
+            pool_type=pool_type,
+            pool_stride=pool_stride,
+            pool_padding=pool_padding,
+            global_pooling=global_pooling,
+            use_cudnn=use_cudnn)
+    # The forward function implements the execution logic of the SimpleImgConvPool network
+    def forward(self, inputs):
+        x = self._conv2d(inputs)
+        x = self._pool2d(x)
+        return x
+```
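+As shown, implementing a ConvPool layer (SimpleImgConvPool) takes two steps:
+1. Define the \_\_init\_\_ constructor.
-class MyLayer(fluid.dygraph.Layer):
+The \_\_init\_\_ constructor usually initializes variables, parameters, and sub-networks; these operations do not depend on dynamic input information. Here we initialize the sub-networks (the convolution layer and the pooling layer).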
-    def __init__(self, input_size):
-        super(MyLayer, self).__init__()
-        self.linear = fluid.dygraph.nn.Linear(input_size, 12)
-    def forward(self, inputs):
+2. Define the forward function.
-        x = self.linear(inputs)
-        x = fluid.layers.relu(x)
-        self._x_for_debug = x
-        x = fluid.layers.elementwise_mul(x, x)
-        x = fluid.layers.reduce_sum(x)
-        return [x]
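+This function defines the network's runtime logic and is called in every training/prediction iteration. In the example above, forward first runs a convolution and then a pooling operation.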
-if __name__ == '__main__':
-    np_inp = np.array([[1.0, 2.0, -1.0]], dtype=np.float32)
-    with fluid.dygraph.guard():
-        var_inp = fluid.dygraph.to_variable(np_inp)
-        my_layer = MyLayer(np_inp.shape[-1])
-        x = my_layer(var_inp)[0]
-        dy_out = x.numpy()
-        x.backward()
-        dy_grad = my_layer._x_for_debug.gradient()
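-        my_layer.clear_gradients()  # reset the parameter gradients so the next training iteration is correct
-```
-### About Automatic Pruning
+Next we show how to compose the MNIST network from sub-networks. It consists of two SimpleImgConvPool sub-networks and one fully connected layer.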
+```python
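+# Define the MNIST network; it must inherit from fluid.dygraph.Layer
+# The network consists of two SimpleImgConvPool sub-networks plus reshape, matmul, softmax, and accuracy layers
+class MNIST(fluid.dygraph.Layer):
+    # The __init__ constructor initializes variables, parameters, and sub-networks
+    # This example initializes the self.pool_2_shape variable, the matmul parameter self.output_weight, and the SimpleImgConvPool sub-networks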
+    def __init__(self):
+        super(MNIST, self).__init__()
+        self._simple_img_conv_pool_1 = SimpleImgConvPool(
+            1, 20, 5, 2, 2, act="relu")
+        self._simple_img_conv_pool_2 = SimpleImgConvPool(
+            20, 50, 5, 2, 2, act="relu")
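+        # self.pool_2_shape is the product of all dimensions except batch_size
+        # of the data after it passes through self._simple_img_conv_pool_2
+        self.pool_2_shape = 50 * 4 * 4
+        # self.pool_2_shape and SIZE define the shape of the self.output_weight parameter
+        SIZE = 10
+        # Define the parameter of the fully connected layer
+        self.output_weight = self.create_parameter(
+            [self.pool_2_shape, 10])
+    # The forward function implements the execution logic of the MNIST network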
+    def forward(self, inputs, label=None):
+        x = self._simple_img_conv_pool_1(inputs)
+        x = self._simple_img_conv_pool_2(x)
+        x = fluid.layers.reshape(x, shape=[-1, self.pool_2_shape])
+        x = fluid.layers.matmul(x, self.output_weight)
+        x = fluid.layers.softmax(x)
+        if label is not None:
+            acc = fluid.layers.accuracy(input=x, label=label)
+            return x, acc
+        else:
+            return x
+```
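-Every ``Variable`` has a ``stop_gradient`` attribute that can be used to exclude parts of the subgraph from backward gradient computation at a fine granularity, which improves efficiency.
+The \_\_init\_\_ constructor of this more complex Layer contains more basic operations:
+1. Variable initialization: self.pool_2_shape = 50 * 4 * 4
+2. Creating the fully connected layer's parameter by calling the [create_parameter](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#create_parameter) interface of [Layer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#layer): self.output_weight = self.create_parameter( [ self.pool_2_shape, 10])
+3. Constructing sub-Layers: self._simple_img_conv_pool_1, self._simple_img_conv_pool_2
-If any input of an OP requires a gradient, the OP's output also requires a gradient.
+The forward function is implemented in the same way as in the SimpleImgConvPool class above.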
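The value `50 * 4 * 4` used for `pool_2_shape` can be checked by hand. The sketch below applies the standard output-size formulas for convolution and pooling to a 28 x 28 input, assuming no padding, stride 1 for the convolutions, and floor rounding in the pooling layers; `conv_out` and `pool_out` are helper names introduced here for illustration.

```python
def conv_out(size, filter_size, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    effective = dilation * (filter_size - 1) + 1
    return (size + 2 * padding - effective) // stride + 1


def pool_out(size, pool_size, stride, padding=0):
    """Spatial output size of a pooling layer along one dimension (floor rounding assumed)."""
    return (size + 2 * padding - pool_size) // stride + 1


size = 28                      # MNIST images are 28 x 28
size = conv_out(size, 5)       # first 5x5 convolution
size = pool_out(size, 2, 2)    # first 2x2 pooling, stride 2
size = conv_out(size, 5)       # second 5x5 convolution
size = pool_out(size, 2, 2)    # second 2x2 pooling, stride 2
flattened = 50 * size * size   # 50 output channels from the second convolution
print(size, flattened)         # 4 800
```

The spatial size goes 28 -> 24 -> 12 -> 8 -> 4, so each sample flattens to 800 values, which is the first dimension of `output_weight`.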
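-Conversely, the output of an OP does not require a gradient only when none of its inputs require one.
-In a subgraph where no ``Variable`` requires a gradient, the backward computation is skipped.
-In imperative programming mode, ``stop_gradient`` defaults to ``True`` for every ``Variable`` except parameters, and to ``False`` for parameters.
+Next we create an instance of the MNIST class and the optimizer. We choose [AdamOptimizer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/optimizer_cn/AdamOptimizer_cn.html#adamoptimizer) as the optimizer and read all of the network's parameters through the [parameters](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#parameters) interface of [Layer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#layer), as follows:
-This attribute is used for automatic pruning to avoid unnecessary backward computation.
-For example: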
 ```python
-import paddle.fluid as fluid
 import numpy as np
+from paddle.fluid.optimizer import AdamOptimizer
+from paddle.fluid.dygraph.base import to_variable
 with fluid.dygraph.guard():
-    x = fluid.dygraph.to_variable(np.random.randn(5, 5))  # stop_gradient=True by default
+    # Create an instance of the MNIST class
-    y = fluid.dygraph.to_variable(np.random.randn(5, 5))  # stop_gradient=True by default
+    mnist = MNIST()
-    z = fluid.dygraph.to_variable(np.random.randn(5, 5))
+    # Use AdamOptimizer as the optimizer, with learning_rate set to 0.001
-    z.stop_gradient = False
+    # Note: in dynamic graph mode the parameter_list argument is required; it lists the network parameters to optimize, here all parameters of the mnist network
-    a = x + y
+    adam = AdamOptimizer(learning_rate=0.001, parameter_list=mnist.parameters())
-    a.stop_gradient  # True
-    b = a + z
-    b.stop_gradient  # False
 ```
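-This feature is very useful when you want to freeze part of your model, or when you know in advance that you will not use the gradients of certain parameters.
+### 2.3 Training
+Once the network structure above is defined, training can begin.
+The implementation is as follows:
+* Reading data: read each batch and convert numpy.ndarray objects into [Variable](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/Variable_cn.html#variable) objects through the [to_variable](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/to_variable_cn.html#to-variable) interface.
+* Forward pass: after constructing img and label, pass them as arguments in a function-call style (for example, mnist(img, label)) to run the network's forward function.
+* Computing the loss: compute the loss from the network's output so that the backward pass can follow.
+* Backward pass: the user must call the [backward](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/Variable_cn.html#backward) interface explicitly to run the backward computation.
+* Updating parameters: call the optimizer's [minimize](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/optimizer_cn/AdamOptimizer_cn.html#minimize) interface to update the parameters.
+* Resetting gradients: clear the gradients computed in this iteration so that the next iteration and update can proceed.
+* Saving the trained model: get the model's parameters through the [state_dict](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#state_dict) of [Layer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#layer), and save them with [save_dygraph](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/save_dygraph_cn.html#save-dygraph).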
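The order of the steps above (forward, loss, backward, update, reset) can be traced on a one-parameter toy problem. This sketch uses plain gradient descent rather than Adam and assumes that the backward step adds into a gradient buffer, which is the usual reason a reset step is needed; `ToyParam` and `train_step` are hypothetical names.

```python
class ToyParam:
    """Hypothetical parameter with a gradient buffer."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def train_step(w, x, y, lr):
    pred = w.value * x                  # forward pass
    loss = (pred - y) ** 2              # squared-error loss
    w.grad += 2.0 * (pred - y) * x      # backward pass: accumulate d(loss)/dw
    w.value -= lr * w.grad              # parameter update (plain gradient descent)
    w.grad = 0.0                        # reset the gradient for the next iteration
    return loss


w = ToyParam(0.0)
losses = [train_step(w, x=1.0, y=2.0, lr=0.1) for _ in range(50)]
print(losses[0], losses[-1], w.value)
```

If the reset line were removed, each update would reuse gradients from earlier iterations, so the step taken would no longer match the current loss.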
-For example:
 ```python
-import paddle.fluid as fluid
 import numpy as np
+from paddle.fluid.optimizer import AdamOptimizer
+from paddle.fluid.dygraph.base import to_variable
 with fluid.dygraph.guard():
-    value0 = np.arange(26).reshape(2, 13).astype("float32")
+    # Create an instance of the MNIST class
-    value1 = np.arange(6).reshape(2, 3).astype("float32")
+    mnist = MNIST()
-    value2 = np.arange(10).reshape(2, 5).astype("float32")
+    # Use AdamOptimizer as the optimizer, with learning_rate set to 0.001
-    fc = fluid.Linear(13, 5, dtype="float32")
+    # Note: in dynamic graph mode the parameter_list argument is required; it lists the network parameters to optimize, here all parameters of the mnist network
-    fc2 = fluid.Linear(3, 3, dtype="float32")
+    adam = AdamOptimizer(learning_rate=0.001, parameter_list=mnist.parameters())
-    a = fluid.dygraph.to_variable(value0)
-    b = fluid.dygraph.to_variable(value1)
+    # Number of passes over the full training set
-    c = fluid.dygraph.to_variable(value2)
+    epoch_num = 5
-    out1 = fc(a)
-    out2 = fc2(b)
+    # Train for epoch_num epochs
-    out1.stop_gradient = True  # no backward computation is done for the subgraph behind out1
+    for epoch in range(epoch_num):
-    out = fluid.layers.concat(input=[out1, out2, c], axis=1)
+        # Read the training data and train
-    out.backward()
+        for batch_id, data in enumerate(train_reader()):
-    # the gradients of the fc parameters are all 0 here
+            dy_x_data = np.array([x[0].reshape(1, 28, 28) for x in data]).astype('float32')
-    assert (fc.weight.gradient() == 0).all()
+            y_data = np.array([x[1] for x in data]).astype('int64').reshape(-1, 1)
-    assert (out1.gradient() == 0).all()
+            # Convert the ndarray data into Variables
+            img = to_variable(dy_x_data)
+            label = to_variable(y_data)
+            # Forward pass
+            cost, acc = mnist(img, label)
+            # Compute the loss
+            loss = fluid.layers.cross_entropy(cost, label)
+            avg_loss = fluid.layers.mean(loss)
+            # Backward pass
+            avg_loss.backward()
+            # Update the parameters
+            adam.minimize(avg_loss)
+            # Clear the gradients from this iteration so the next iteration and update can proceed
+            mnist.clear_gradients()
+            # Print the loss for the current epoch and batch_id
+            if batch_id % 100 == 0:
+                print("Loss at epoch {} step {}: {:}".format(
+                    epoch, batch_id, avg_loss.numpy()))
+    # Save the trained model
+    model_dict = mnist.state_dict()
+    fluid.save_dygraph(model_dict, "save_temp")
+```
+    Loss at epoch 0 step 0: [3.362183]
+    Loss at epoch 0 step 100: [0.20108832]
+    Loss at epoch 0 step 200: [0.1681692]
+    Loss at epoch 0 step 300: [0.11894853]
+    Loss at epoch 0 step 400: [0.13005154]
+    Loss at epoch 0 step 500: [0.10004535]
+    Loss at epoch 0 step 600: [0.11465541]
+    Loss at epoch 0 step 700: [0.14584845]
+    Loss at epoch 0 step 800: [0.21515566]
+    Loss at epoch 0 step 900: [0.13847716]
+    Loss at epoch 1 step 0: [0.03004131]
+    Loss at epoch 1 step 100: [0.1855965]
+    Loss at epoch 1 step 200: [0.07302501]
+    Loss at epoch 1 step 300: [0.02016284]
+    Loss at epoch 1 step 400: [0.03899964]
+    Loss at epoch 1 step 500: [0.05415711]
+    Loss at epoch 1 step 600: [0.09633664]
+    Loss at epoch 1 step 700: [0.07155745]
+    Loss at epoch 1 step 800: [0.13023862]
+    Loss at epoch 1 step 900: [0.09051394]
+    Loss at epoch 2 step 0: [0.00580437]
+    Loss at epoch 2 step 100: [0.1506507]
+    Loss at epoch 2 step 200: [0.03713503]
+    Loss at epoch 2 step 300: [0.01145383]
+    Loss at epoch 2 step 400: [0.0497771]
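+### 2.4 Evaluation and Testing
+Training is complete and the trained model has been saved; next comes evaluation and testing. Some OPs (such as dropout and batch_norm) behave differently during training and evaluation, so the two modes must be distinguished. OPs in PaddlePaddle default to train mode, which can be switched as follows: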
+ ```
+model.eval()      # switch to evaluation mode
+model.train()     # switch to train mode
+ ```
+Model evaluation is implemented as follows:
+* First create an MNIST instance mnist_eval, then load the saved model parameters through the [load_dygraph](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/load_dygraph_cn.html#load-dygraph) interface, import them into the model through the [set_dict](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#set_dict) interface of [Layer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#layer), and switch to evaluation mode through the eval interface of [Layer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#layer).
+* Read the test data, run the network's forward computation to evaluate, and output the mean loss and mean accuracy over the batches.
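Why an OP such as dropout needs a mode switch can be seen in a toy layer. The sketch below is a hypothetical `ToyDropout` written for illustration; it uses the "inverted" scaling convention for simplicity, which may differ from the framework's default dropout behaviour.

```python
import random


class ToyDropout:
    """Hypothetical dropout layer with a train/eval switch (illustrative only)."""

    def __init__(self, p=0.5, seed=0):
        self.p = p
        self.training = True            # train mode by default
        self._rng = random.Random(seed)

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    def __call__(self, values):
        if not self.training:
            # Evaluation mode: pass the input through unchanged.
            return list(values)
        keep = 1.0 - self.p
        # Train mode: zero each value with probability p and rescale the survivors.
        return [v / keep if self._rng.random() < keep else 0.0 for v in values]


layer = ToyDropout(p=0.5)
train_out = layer([1.0] * 8)
layer.eval()
eval_out = layer([1.0] * 8)
print(train_out)
print(eval_out)
```

In train mode the output is random, which would make evaluation metrics noisy; switching to evaluation mode makes the layer deterministic.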
+```python
+with fluid.dygraph.guard():
+    # Create an instance of the MNIST class
+    mnist_eval = MNIST()
+    # Load the saved model
+    model_dict, _ = fluid.load_dygraph("save_temp")
+    mnist_eval.set_dict(model_dict)
+    print("checkpoint loaded")
+    # Switch to evaluation mode
+    mnist_eval.eval()
+    acc_set = []
+    avg_loss_set = []
+    # Read the test data and evaluate
+    for batch_id, data in enumerate(test_reader()):
+        dy_x_data = np.array([x[0].reshape(1, 28, 28)
+                              for x in data]).astype('float32')
+        y_data = np.array(
+            [x[1] for x in data]).astype('int64').reshape(-1, 1)
+        # Convert the ndarray data into Variables
+        img = to_variable(dy_x_data)
+        label = to_variable(y_data)
+        # Forward pass
+        prediction, acc = mnist_eval(img, label)
+        # Compute the loss
+        loss = fluid.layers.cross_entropy(input=prediction, label=label)
+        avg_loss = fluid.layers.mean(loss)
+        acc_set.append(float(acc.numpy()))
+        avg_loss_set.append(float(avg_loss.numpy()))
+    # Output the mean loss and mean accuracy over the batches
+    acc_val_mean = np.array(acc_set).mean()
+    avg_loss_val_mean = np.array(avg_loss_set).mean()
+    print("Eval avg_loss is: {}, acc is: {}".format(avg_loss_val_mean, acc_val_mean))
 ```
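-## Training a Model with DyGraph
+### 2.5 Saving and Loading Model Parameters
+In dynamic graph mode, the model and the optimizer live in separate modules and are stored in separate objects, so model parameters and optimizer information must be stored separately.
+Saving a model therefore means calling the state_dict() interface of the model and of the optimizer separately, and loading is likewise handled separately.
+Saving a model:
+1. Save the model parameters: first get all parameters of the mnist network through mnist.state_dict, then save them with fluid.save_dygraph to a file prefixed with save_path.
+1. Save the optimizer information: first get the adam optimizer's information through adam.state_dict, then save it with fluid.save_dygraph to a file prefixed with save_path.
+   * The [state_dict](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#state_dict) interface of [Layer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#layer): returns all parameters of the current layer and its sublayers in a dict.
+   * The state_dict interface of [Optimizer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/optimizer_cn/AdamOptimizer_cn.html#adamoptimizer): returns the optimizer's information in a dict. It contains every variable the optimizer uses; for the Adam optimizer, for example, this includes beta1, beta2, momentum, and so on. Note that the optimizer's information is empty if its minimize function has never been called.
+   * The [save_dygraph](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/save_dygraph_cn.html#save-dygraph) interface: saves the parameter dict or optimizer dict passed in to disk.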
+```
+# Save the model parameters
+fluid.save_dygraph(mnist.state_dict(), "save_path")
+# Save the optimizer information
+fluid.save_dygraph(adam.state_dict(), "save_path")
+```
+Loading a model:
+1. Get the model parameter information model_state and the optimizer information opt_state through fluid.load_dygraph;
+1. Set the parameters of the mnist network with the loaded model parameter information through mnist.set_dict;
+1. Set the adam optimizer's information with the loaded optimizer information through adam.set_dict.
+   * The [set_dict](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#set_dict) interface of [Layer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/Layer_cn.html#layer): sets parameters from the dict passed in; all parameters are set from the Tensors in the dict.
+   * The set_dict interface of [Optimizer](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/optimizer_cn/AdamOptimizer_cn.html#adamoptimizer): sets optimizer information from the dict passed in; for the Adam optimizer, for example, this includes beta1, beta2, momentum, and so on. If LearningRateDecay is used, the global_step information is set as well.
+   * The [load_dygraph](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/load_dygraph_cn.html#load-dygraph) interface: attempts to load a parameter dict or an optimizer dict from disk.
+```
+# Get the model parameters and the optimizer information
+model_state, opt_state = fluid.load_dygraph("save_path")
+# Load the model parameters
+mnist.set_dict(model_state)
+# Load the optimizer information
+adam.set_dict(opt_state)
+```
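The pattern of storing model and optimizer state separately under one path prefix can be sketched with the standard library. Everything here is hypothetical: the helper names, the `.model.pkl` and `.opt.pkl` suffixes, and the toy state dicts are assumptions for illustration and are not the framework's file format.

```python
import os
import pickle
import tempfile


def save_state(state, prefix, suffix):
    """Write one state dict to '<prefix><suffix>' (hypothetical layout)."""
    path = prefix + suffix
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path


def load_state(prefix, suffix):
    """Read one state dict back, or return None if it was never saved."""
    path = prefix + suffix
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy stand-ins for the two dicts returned by the state_dict() calls.
model_state = {"output_weight": [0.1, 0.2]}
opt_state = {"beta1_pow": 0.9, "global_step": 10}

with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "save_path")
    save_state(model_state, prefix, ".model.pkl")
    loaded_before_opt_saved = load_state(prefix, ".opt.pkl")
    save_state(opt_state, prefix, ".opt.pkl")
    loaded_model = load_state(prefix, ".model.pkl")
    loaded_opt = load_state(prefix, ".opt.pkl")

print(loaded_model == model_state, loaded_opt == opt_state)
```

Because the two dicts live in separate files, a model can be restored for evaluation even when no optimizer state was saved; in that case only the model dict is used.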
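-Next we use handwritten digit recognition, the most basic model, to show how to build and train a model in DyGraph mode:
-For the theory behind handwritten digit recognition, see [PaddleBook](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits); we assume here that you already know the deep learning theory the model requires.
+## 3. Multi-GPU Training
-1. Prepare the data; we use `paddle.dataset.mnist` as the training dataset:
+Tasks with large data volumes and heavy computation call for parallel training on multiple GPUs to improve training efficiency. Dynamic graph mode currently supports single-machine multi-GPU training. Launching multi-GPU training differs slightly from the single-GPU case: a separate Python program is started on each GPU through subprocess.Popen from the Python standard library. Each GPU's program runs independently; after each round of gradient computation, all programs synchronize their gradients and then update the training parameters.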
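The synchronization step described above can be pictured as follows. This is a single-process simulation with plain lists, assuming that synchronization means averaging the gradients across workers; the real mechanism and its communication backend are not shown, and `all_reduce_mean` and `sync_step` are hypothetical names.

```python
def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers (assumed reduction)."""
    n = len(per_worker_grads)
    length = len(per_worker_grads[0])
    return [sum(g[i] for g in per_worker_grads) / n for i in range(length)]


def sync_step(weights_per_worker, grads_per_worker, lr):
    """Apply the same averaged gradient to every worker's copy of the weights."""
    mean_grad = all_reduce_mean(grads_per_worker)
    return [[w - lr * g for w, g in zip(weights, mean_grad)]
            for weights in weights_per_worker]


# Two simulated workers start from identical weights but see different data,
# so their local gradients differ.
weights = [[1.0, 1.0], [1.0, 1.0]]
grads = [[0.2, 0.4], [0.6, 0.0]]
new_weights = sync_step(weights, grads, lr=0.5)
print(new_weights)
```

Since every worker applies the same averaged gradient, the copies of the parameters stay identical after each update even though each worker processed a different batch.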
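+We use an example to show how to run multi-GPU training:
+><font size=2>Because AI Studio has no multi-GPU environment configured, this example must be run in a multi-GPU environment set up locally.</font>
+1. This example still uses the MNIST network defined earlier. Copy the SimpleImgConvPool and MNIST network definitions, the related import statements, and the multi-GPU training example code below into a local file train.py.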
-    ```python
-    train_reader = paddle.batch(
-    paddle.dataset.mnist.train(), batch_size=BATCH_SIZE, drop_last=True)
-    ```
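-2. Build the network. You can define the whole network structure yourself as described earlier, but you can also use the basic network structures we provide under `fluid.dygraph.Layer`. Here we use `fluid.dygraph.Conv2D` and `fluid.dygraph.Pool2D` to build a basic `SimpleImgConvPool`: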
-    ```python
-    class SimpleImgConvPool(fluid.dygraph.Layer):
-        def __init__(self,
-                     num_channels,
-                     num_filters,
-                     filter_size,
-                     pool_size,
-                     pool_stride,
-                     pool_padding=0,
-                     pool_type='max',
-                     global_pooling=False,
-                     conv_stride=1,
-                     conv_padding=0,
-                     conv_dilation=1,
-                     conv_groups=1,
-                     act=None,
-                     use_cudnn=False,
-                     param_attr=None,
-                     bias_attr=None):
-            super(SimpleImgConvPool, self).__init__()
-            self._conv2d = fluid.dygraph.Conv2D(
-                num_channels=num_channels,
-                num_filters=num_filters,
-                filter_size=filter_size,
-                stride=conv_stride,
-                padding=conv_padding,
-                dilation=conv_dilation,
-                groups=conv_groups,
-                param_attr=param_attr,
-                bias_attr=bias_attr,
-                act=act,
-                use_cudnn=use_cudnn)
-            self._pool2d = fluid.dygraph.Pool2D(
-                pool_size=pool_size,
-                pool_type=pool_type,
-                pool_stride=pool_stride,
-                pool_padding=pool_padding,
-                global_pooling=global_pooling,
-                use_cudnn=use_cudnn)
-        def forward(self, inputs):
-            x = self._conv2d(inputs)
-            x = self._pool2d(x)
-            return x
-    ```
-    > Note: when building a network, define sub-networks in `__init__` and execute them in the `forward` function
-3. Compose the final `MNIST` network from the `SimpleImgConvPool` already built:
-    ```python
-    class MNIST(fluid.dygraph.Layer):
-        def __init__(self):
-            super(MNIST, self).__init__()
-            self._simple_img_conv_pool_1 = SimpleImgConvPool(
-                1, 20, 5, 2, 2, act="relu")
-            self._simple_img_conv_pool_2 = SimpleImgConvPool(
-                20, 50, 5, 2, 2, act="relu")
-            self.pool_2_shape = 50 * 4 * 4
-            SIZE = 10
-            scale = (2.0 / (self.pool_2_shape**2 * SIZE))**0.5
-            self._fc = fluid.dygraph.Linear(
-                        self.pool_2_shape,
-                        10,
-                        param_attr=fluid.param_attr.ParamAttr(
-                            initializer=fluid.initializer.NormalInitializer(
-                                loc=0.0, scale=scale)),
-                        act="softmax")
-        def forward(self, inputs, label=None):
-            x = self._simple_img_conv_pool_1(inputs)
-            x = self._simple_img_conv_pool_2(x)
-            x = fluid.layers.reshape(x, shape=[-1, self.pool_2_shape])
-            x = self._fc(x)
-            if label is not None:
-                acc = fluid.layers.accuracy(input=x, label=label)
-                return x, acc
-            else:
-                return x
-   ```
-4. Define the configured `MNIST` network inside `fluid.dygraph.guard()`; even before training, the model can be called inside `fluid.dygraph.guard()` and its output inspected:
-    ```python
-    with fluid.dygraph.guard():
-        mnist = MNIST()
-        train_reader = paddle.batch(
-                paddle.dataset.mnist.train(), batch_size=32, drop_last=True)
-        id, data = list(enumerate(train_reader()))[0]
-        dy_x_data = np.array(
-            [x[0].reshape(1, 28, 28)
-             for x in data]).astype('float32')
-        img = fluid.dygraph.to_variable(dy_x_data)
-        print("result is: {}".format(mnist(img).numpy()))
-   ```
-   Output:
-   ```
-   result is: [[0.10135901 0.1051138  0.1027941  ... 0.0972859  0.10221873 0.10165327]
-           [0.09735426 0.09970362 0.10198303 ... 0.10134517 0.10179105 0.10025002]
-           [0.09539858 0.10213123 0.09543551 ... 0.10613529 0.10535969 0.097991  ]
-           ...
-           [0.10120598 0.0996111  0.10512722 ... 0.10067689 0.10088114 0.10071224]
-           [0.09889644 0.10033772 0.10151272 ... 0.10245881 0.09878646 0.101483  ]
-           [0.09097178 0.10078511 0.10198414 ... 0.10317434 0.10087223 0.09816764]]
-   ```
-5. 构建训练循环，在每一轮参数更新完成后我们调用`mnist.clear_gradients()`来重置梯度：
-    ```python
-    with fluid.dygraph.guard():
-        epoch_num = 5
-        BATCH_SIZE = 64
-        train_reader = paddle.batch(
-            paddle.dataset.mnist.train(), batch_size=32, drop_last=True)
-        mnist = MNIST()
-        adam = fluid.optimizer.AdamOptimizer(learning_rate=0.001, parameter_list=mnist.parameters())
-        for epoch in range(epoch_num):
-            for batch_id, data in enumerate(train_reader()):
-                dy_x_data = np.array([x[0].reshape(1, 28, 28)
-                                      for x in data]).astype('float32')
-                y_data = np.array(
-                    [x[1] for x in data]).astype('int64').reshape(-1, 1)
-                img = fluid.dygraph.to_variable(dy_x_data)
-                label = fluid.dygraph.to_variable(y_data)
-                cost = mnist(img)
-                loss = fluid.layers.cross_entropy(cost, label)
-                avg_loss = fluid.layers.mean(loss)
-                if batch_id % 100 == 0 and batch_id is not 0:
-                    print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, avg_loss.numpy()))
-                avg_loss.backward()
-                adam.minimize(avg_loss)
-                mnist.clear_gradients()
-    ```
-6. 变量及优化器
-    模型的参数或者任何您希望检测的值可以作为变量封装在类中，然后通过对象获取并使用`numpy()`方法获取其`ndarray`的输出， 在训练过程中您可以使用`mnist.parameters()`来获取到网络中所有的参数，也可以指定某一个`Layer`的某个参数或者`parameters()`来获取该层的所有参数，使用`numpy()`方法随时查看参数的值
-    反向运行后调用之前定义的`Adam`优化器对象的`minimize`方法进行参数更新:
-    ```python
-    with fluid.dygraph.guard():
-        epoch_num = 5
-        BATCH_SIZE = 64
-        mnist = MNIST()
-        adam = fluid.optimizer.AdamOptimizer(learning_rate=0.001, parameter_list=mnist.parameters())
-        train_reader = paddle.batch(
-            paddle.dataset.mnist.train(), batch_size= BATCH_SIZE, drop_last=True)
-        np.set_printoptions(precision=3, suppress=True)
-        for epoch in range(epoch_num):
-            for batch_id, data in enumerate(train_reader()):
-                dy_x_data = np.array(
-                    [x[0].reshape(1, 28, 28)
-                     for x in data]).astype('float32')
-                y_data = np.array(
-                    [x[1] for x in data]).astype('int64').reshape(BATCH_SIZE, 1)
-                img = fluid.dygraph.to_variable(dy_x_data)
-                label = fluid.dygraph.to_variable(y_data)
-                label.stop_gradient = True
-                cost = mnist(img)
-                loss = fluid.layers.cross_entropy(cost, label)
-                avg_loss = fluid.layers.mean(loss)
-                dy_out = avg_loss.numpy()
-                avg_loss.backward()
-                adam.minimize(avg_loss)
-                mnist.clear_gradients()
-                dy_param_value = {}
-                for param in mnist.parameters():
-                    dy_param_value[param.name] = param.numpy()
-                if batch_id % 20 == 0:
-                    print("Loss at step {}: {}".format(batch_id, avg_loss.numpy()))
-        print("Final loss: {}".format(avg_loss.numpy()))
-        print("_simple_img_conv_pool_1_conv2d W's mean is: {}".format(mnist._simple_img_conv_pool_1._conv2d._filter_param.numpy().mean()))
-        print("_simple_img_conv_pool_1_conv2d Bias's mean is: {}".format(mnist._simple_img_conv_pool_1._conv2d._bias_param.numpy().mean()))
-    ```
-    输出：
-        ```
-        Loss at step 0: [2.302]
-        Loss at step 20: [1.616]
-        Loss at step 40: [1.244]
-        Loss at step 60: [1.142]
-        Loss at step 80: [0.911]
-        Loss at step 100: [0.824]
-        Loss at step 120: [0.774]
-        Loss at step 140: [0.626]
-        Loss at step 160: [0.609]
-        Loss at step 180: [0.627]
-        Loss at step 200: [0.466]
-        Loss at step 220: [0.499]
-        Loss at step 240: [0.614]
-        Loss at step 260: [0.585]
-        Loss at step 280: [0.503]
-        Loss at step 300: [0.423]
-        Loss at step 320: [0.509]
-        Loss at step 340: [0.348]
-        Loss at step 360: [0.452]
-        Loss at step 380: [0.397]
-        Loss at step 400: [0.54]
-        Loss at step 420: [0.341]
-        Loss at step 440: [0.337]
-        Loss at step 460: [0.155]
-        Final loss: [0.164]
-        _simple_img_conv_pool_1_conv2d W's mean is: 0.00606656912714
-        _simple_img_conv_pool_1_conv2d Bias's mean is: -3.4576318285e-05
-        ```
-7.    性能
-在使用`fluid.dygraph.guard()`时可以通过传入`fluid.CUDAPlace(0)`或者`fluid.CPUPlace()`来选择执行DyGraph的设备，通常如果不做任何处理将会自动适配您的设备。
-## 使用多卡训练模型
-目前PaddlePaddle支持通过多进程方式进行多卡训练，即每个进程对应一张卡。训练过程中，在第一次执行前向操作时，如果该操作需要参数，则会将0号卡的参数Broadcast到其他卡上，确保各个卡上的参数一致；在计算完反向操作之后，将产生的参数梯度在所有卡之间进行聚合；最后在各个GPU卡上分别进行参数更新。
 ```python
+import numpy as np
+from paddle.fluid.optimizer import AdamOptimizer
+from paddle.fluid.dygraph.base import to_variable
+# Select the device for this process through dev_id of Env()
 place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id)
 with fluid.dygraph.guard(place):
+    # Prepare the multi-GPU context
    strategy = fluid.dygraph.parallel.prepare_context()
    epoch_num = 5
    BATCH_SIZE = 64
    mnist = MNIST()
    adam = fluid.optimizer.AdamOptimizer(learning_rate=0.001, parameter_list=mnist.parameters())
+    # Wrap the model in the data-parallel module
    mnist = fluid.dygraph.parallel.DataParallel(mnist, strategy)
    train_reader = paddle.batch(
        paddle.dataset.mnist.train(), batch_size=BATCH_SIZE, drop_last=True)
+    # Shard the data across processes
    train_reader = fluid.contrib.reader.distributed_batch_reader(
        train_reader)
@@ -480,61 +544,36 @@ with fluid.dygraph.guard(place):
            label = fluid.dygraph.to_variable(y_data)
            label.stop_gradient = True
+            # Run the forward pass of the network
            cost, acc = mnist(img, label)
+            # Compute the loss
            loss = fluid.layers.cross_entropy(cost, label)
            avg_loss = fluid.layers.mean(loss)
+            # Single training step: scale (normalize) the loss, compute the gradients on this GPU, then aggregate the gradients across all GPUs
            avg_loss = mnist.scale_loss(avg_loss)
            avg_loss.backward()
            mnist.apply_collective_grads()
+            # Update the parameters
            adam.minimize(avg_loss)
+            # Clear the gradients of this step before the next iteration
            mnist.clear_gradients()
+            # Print the loss for the current epoch and batch_id
            if batch_id % 100 == 0 and batch_id != 0:
                print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, avg_loss.numpy()))
 ```
-命令式编程模式单卡训练转多卡训练需要修改的地方主要有四处：
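+2. When launching multi-process, multi-GPU training in PaddlePaddle dynamic graph mode, you need to specify which GPUs to use. For example, to use GPUs 0 and 1, start training with the following command: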
-1. 需要从环境变量获取设备的ID，即：
-    ```python
-    place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id)
-    ```
-2. 需要对原模型做一些预处理，即：
-    ```python
-    strategy = fluid.dygraph.parallel.prepare_context()
-    mnist = MNIST()
-    adam = AdamOptimizer(learning_rate=0.001, parameter_list=mnist.parameters())
-    mnist = fluid.dygraph.parallel.DataParallel(mnist, strategy)
-    ```
-3. 数据读取，必须确保每个进程读取的数据是不同的，即所有进程读取数据的交集为空，所有进程读取数据的并集是完整的数据集：
-    ```python
-    train_reader = paddle.batch(
-        paddle.dataset.mnist.train(), batch_size=BATCH_SIZE, drop_last=True)
-    train_reader = fluid.contrib.reader.distributed_batch_reader(
-        train_reader)
-    ```
-4. 需要对loss进行调整，以及对参数的梯度进行聚合，即：
-    ```python
-    avg_loss = mnist.scale_loss(avg_loss)
-    avg_loss.backward()
-    mnist.apply_collective_grads()
-    ```
-Paddle命令式编程模式多进程多卡模型训练启动时需要指定使用的GPU，即如果使用`0,1,2,3`卡，启动方式如下：
 ```
-python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py
+CUDA_VISIBLE_DEVICES=0,1 python -m paddle.distributed.launch --log_dir ./mylog train.py
 ```
+Here log_dir is the directory where the logs are stored, and train.py is the name of the training program.
-输出结果为：
+The output is as follows:
 ```
 -----------  Configuration Arguments -----------
@@ -542,319 +581,426 @@ cluster_node_ips: 127.0.0.1
 log_dir: ./mylog
 node_ip: 127.0.0.1
 print_config: True
-selected_gpus: 0,1,2,3
+selected_gpus: 0,1
 started_port: 6170
 training_script: train.py
-training_script_args: ['--use_data_parallel', '1']
+training_script_args: []
-use_paddlecloud: True
+use_paddlecloud: False
 ------------------------------------------------
-trainers_endpoints: 127.0.0.1:6170,127.0.0.1:6171,127.0.0.1:6172,127.0.0.1:6173 , node_id: 0 , current_node_ip: 127.0.0.1 , num_nodes: 1 , node_ips: ['127.0.0.1'] , nranks: 4
+trainers_endpoints: 127.0.0.1:6170,127.0.0.1:6171 , node_id: 0 , current_node_ip: 127.0.0.1 , num_nodes: 1 , node_ips: ['127.0.0.1'] , nranks: 2
 ```
-此时，程序会将每个进程的输出log导出到./mylog路径下：
+The program writes the output log of each process to the ./mylog directory; open workerlog.0 and workerlog.1 to view the results:
 ```
 .
 ├── mylog
 │   ├── workerlog.0
-│   ├── workerlog.1
+│   └── workerlog.1
-│   ├── workerlog.2
-│   └── workerlog.3
 └── train.py
 ```
-如果不指定`--log_dir`，程序会将打印出所有进程的输出，即：
+To summarize, multi-GPU training differs from single-GPU training in the following steps:
+1. Set the device for the program through dev_id of Env().
+```
+place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id)
+with fluid.dygraph.guard(place):
+```
+2. Prepare the multi-GPU context.
+```
+strategy = fluid.dygraph.parallel.prepare_context()
+```
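+3. Wrap the model in the data-parallel module.
+Data parallelism needs to store and initialize some multi-GPU information. This information and the related operations live in the DataParallel class, which is initialized with model (the model you defined) and strategy (the multi-GPU context obtained in step 2).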
 ```
-----------  Configuration Arguments -----------
+mnist = fluid.dygraph.parallel.DataParallel(mnist, strategy)
-cluster_node_ips: 127.0.0.1
+```
-log_dir: None
+4. Shard the data.
-node_ip: 127.0.0.1
-print_config: True
-selected_gpus: 0,1,2,3
-started_port: 6170
-training_script: train.py
-training_script_args: ['--use_data_parallel', '1']
-use_paddlecloud: True
------------------------------------------------
-trainers_endpoints: 127.0.0.1:6170,127.0.0.1:6171,127.0.0.1:6172,127.0.0.1:6173 , node_id: 0 , current_node_ip: 127.0.0.1 , num_nodes: 1 , node_ips: ['127.0.0.1'] , nranks: 4
-grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
-grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
-grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
-grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
-I0923 09:32:36.423513 56410 nccl_context.cc:120] init nccl context nranks: 4 local rank: 1 gpu id: 1
-I0923 09:32:36.425287 56411 nccl_context.cc:120] init nccl context nranks: 4 local rank: 2 gpu id: 2
-I0923 09:32:36.429337 56409 nccl_context.cc:120] init nccl context nranks: 4 local rank: 0 gpu id: 0
-I0923 09:32:36.429440 56412 nccl_context.cc:120] init nccl context nranks: 4 local rank: 3 gpu id: 3
-W0923 09:32:42.594097 56412 device_context.cc:198] Please NOTE: device: 3, CUDA Capability: 70, Driver API Version: 9.0, Runtime API Version: 9.0
-W0923 09:32:42.605836 56412 device_context.cc:206] device: 3, cuDNN Version: 7.5.
-W0923 09:32:42.632463 56410 device_context.cc:198] Please NOTE: device: 1, CUDA Capability: 70, Driver API Version: 9.0, Runtime API Version: 9.0
-W0923 09:32:42.637948 56410 device_context.cc:206] device: 1, cuDNN Version: 7.5.
-W0923 09:32:42.648674 56411 device_context.cc:198] Please NOTE: device: 2, CUDA Capability: 70, Driver API Version: 9.0, Runtime API Version: 9.0
-W0923 09:32:42.654021 56411 device_context.cc:206] device: 2, cuDNN Version: 7.5.
-W0923 09:32:43.048696 56409 device_context.cc:198] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.0, Runtime API Version: 9.0
-W0923 09:32:43.053236 56409 device_context.cc:206] device: 0, cuDNN Version: 7.5.
-start data reader (trainers_num: 4, trainer_id: 2)
-start data reader (trainers_num: 4, trainer_id: 3)
-start data reader (trainers_num: 4, trainer_id: 1)
-start data reader (trainers_num: 4, trainer_id: 0)
-Loss at epoch 0 step 0: [0.57390565]
-Loss at epoch 0 step 0: [0.57523954]
-Loss at epoch 0 step 0: [0.575606]
-Loss at epoch 0 step 0: [0.5767452]
-```
-## 模型参数的保存
-命令式编程模式由于模型和优化器在不同的对象中存储，模型参数和优化器信息要分别存储。
- 在模型训练中可以使用 `paddle.fluid.dygraph.save_dygraph(state_dict, model_path)` 来保存模型参数的dict或优化器信息的dict。
-同样可以使用 `paddle.fluid.dygraph.load_dygraph(model_path)` 获取保存的模型参数的dict和优化器信息的dict。
-再使用`your_modle_object.set_dict(para_dict)`接口来恢复保存的模型参数从而达到继续训练的目的。
-以及使用`your_optimizer_object.set_dict(opti_dict)`接口来恢复保存的优化器中的`learning rate decay`值。
-下面的代码展示了如何在“手写数字识别”任务中保存参数并且读取已经保存的参数来继续训练。
-```python
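+Data sharding is an important step: it prevents every GPU from seeing the same data in each training round. You can use distributed_batch_reader to shard a single-GPU reader. You can also use another strategy to achieve the same goal, for example assigning each GPU its data in advance, in which case the single-GPU reader can be used directly without distributed_batch_reader.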
-import paddle.fluid as fluid
-with fluid.dygraph.guard():
+```
-    epoch_num = 5
+train_reader = fluid.contrib.reader.distributed_batch_reader(train_reader)
-    BATCH_SIZE = 64
+```
-    mnist = MNIST()
+5. Run a single training step.
-    adam = fluid.optimizer.Adam(learning_rate=0.001, parameter_list=mnist.parameters())
-    train_reader = paddle.batch(
-        paddle.dataset.mnist.train(), batch_size= BATCH_SIZE, drop_last=True)
-    np.set_printoptions(precision=3, suppress=True)
+First scale (normalize) the loss, then compute the gradients on each GPU, and finally aggregate the gradients across all GPUs.
-    dy_param_init_value={}
+```
-    for epoch in range(epoch_num):
+avg_loss = mnist.scale_loss(avg_loss)
-        for batch_id, data in enumerate(train_reader()):
+avg_loss.backward()
-            dy_x_data = np.array(
+mnist.apply_collective_grads()
-                [x[0].reshape(1, 28, 28)
+```
-                 for x in data]).astype('float32')
+6. Save the model.
-            y_data = np.array(
-                [x[1] for x in data]).astype('int64').reshape(BATCH_SIZE, 1)
-            img = fluid.dygraph.to_variable(dy_x_data)
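+Unlike single-GPU training, in multi-GPU training the save operation must be performed by one process at a time; if several processes save simultaneously, the model file can be corrupted.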
-            label = fluid.dygraph.to_variable(y_data)
+```
-            label.stop_gradient = True
+if fluid.dygraph.parallel.Env().local_rank == 0:
+    fluid.save_dygraph(mnist.state_dict(), "path")
+```
+7. Evaluate and test.
-            cost = mnist(img)
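+When evaluating a model, if the model has to be loaded first, make sure evaluation and saving happen in the same process. Otherwise evaluation may start before the model has finished saving, which causes a loading error. If the model does not need to be loaded, this problem does not arise, and evaluation can run in one process or in several.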
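The sharding and single-step items above can be illustrated with a small pure-Python sketch. It uses no Paddle APIs. The helpers `shard_batches` and `grad_of_mean_squared_error` are hypothetical names introduced only for illustration, and the sketch assumes that "normalizing" the loss means dividing by the number of processes and that "aggregating" gradients means summing them across processes.

```python
# Hypothetical pure-Python sketch; no Paddle APIs are used here.

def shard_batches(batches, nranks, rank):
    """Keep every nranks-th batch, starting at index `rank`."""
    return [b for i, b in enumerate(batches) if i % nranks == rank]

def grad_of_mean_squared_error(w, xs, ys):
    """d/dw of mean((w * x - y) ** 2) over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

nranks = 2
w = 0.5
# One (xs, ys) pair per batch.
batches = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]

# Each rank reads a disjoint subset of the batches.
shards = [shard_batches(batches, nranks, r) for r in range(nranks)]

# Each rank scales its local gradient by 1 / nranks (assumed role of scale_loss);
# the scaled gradients are then summed (assumed role of apply_collective_grads).
aggregated = sum(
    grad_of_mean_squared_error(w, *shards[r][0]) / nranks for r in range(nranks)
)

# Reference value: the gradient over the combined batch on a single device.
all_xs = [x for xs, _ in batches for x in xs]
all_ys = [y for _, ys in batches for y in ys]
single_device = grad_of_mean_squared_error(w, all_xs, all_ys)
```

Under these assumptions, and with equally sized batches, the aggregated gradient matches the gradient computed over the combined batch on one device, which is why the loss is scaled before the backward pass.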
-            loss = fluid.layers.cross_entropy(cost, label)
-            avg_loss = fluid.layers.mean(loss)
-            dy_out = avg_loss.numpy()
+## 4. Model Deployment
-            avg_loss.backward()
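+Although dynamic graphs have many advantages, deploying a trained model with C++ is less convenient. For example, dynamic graphs can use native Python control flow, including if/else, switch, and for/while, and this control flow needs some mechanism to map it to C++ before the model can be deployed there.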
-            adam.minimize(avg_loss)
-            if batch_id == 20:
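+<ul><li>If the if/else, switch, and for/while statements you use do not depend on the input (neither its values nor its shape), you can use the following dynamic graph deployment approach: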
-                fluid.dygraph.save_dygraph(mnist.state_dict(), "paddle_dy")
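+<ul><li>Use TracedLayer to convert the forward dynamic graph model into a static graph model. The saved model can be used for online C++ inference; you can also run inference on the converted static graph model in Python, which usually performs better than the original dynamic graph.</li>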
-            mnist.clear_gradients()
+<li>TracedLayer objects should never be created through the constructor; create them by calling the static method TracedLayer.trace(layer, inputs).</li>
+<li>TracedLayer runs the static graph model with Executor and CompiledProgram.</li></ul></li>
+</ul>
-            if batch_id == 20:
-                for param in mnist.parameters():
-                    dy_param_init_value[param.name] = param.numpy()
-                model, _ = fluid.dygraph.load_dygraph("paddle_dy")
-                mnist.set_dict(model)
-                break
-        if epoch == 0:
-            break
-    restore = mnist.parameters()
-    # check save and load
-    success = True
-    for value in restore:
+```python
-        if (not np.array_equal(value.numpy(), dy_param_init_value[value.name])) or (not np.isfinite(value.numpy().all())) or (np.isnan(value.numpy().any())):
+from paddle.fluid.dygraph import TracedLayer
-            success = False
-    print("model save and load success? {}".format(success))
+with fluid.dygraph.guard():
+    # Create an instance of the MNIST class
+    mnist = MNIST()
+    in_np = np.random.random([10, 1, 28, 28]).astype('float32')
+    # Convert the numpy ndarray to a Variable
+    input_var = fluid.dygraph.to_variable(in_np)
+    # Convert the dynamic graph model to a static graph model with TracedLayer.trace
+    out_dygraph, static_layer = TracedLayer.trace(mnist, inputs=[input_var])
+    save_dirname = './saved_infer_model'
+    # Save the converted model
+    static_layer.save_inference_model(save_dirname, feed=[0], fetch=[0])
 ```
-需要注意的是，如果采用多卡训练，只需要一个进程对模型参数进行保存，因此在保存模型参数时，需要进行指定保存哪个进程的参数，比如
 ```python
-    if fluid.dygraph.parallel.Env().local_rank == 0:
+# In static graph mode, an executor runs the previously defined network
-        fluid.dygraph.save_dygraph(mnist.state_dict(), "paddle_dy")
+place = fluid.CPUPlace()
+exe = fluid.Executor(place)
+program, feed_vars, fetch_vars = fluid.io.load_inference_model(save_dirname, exe)
+# In static graph mode, call the executor's run method to perform the computation
+fetch, = exe.run(program, feed={feed_vars[0]: in_np}, fetch_list=fetch_vars)
 ```
-## 模型评估
+In the example above, the [TracedLayer.trace](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/TracedLayer_cn.html#trace) interface runs the dynamic graph model and converts it into a static graph model; it takes the dynamic graph network mnist and the list of input variables [input_var]. The [save_inference_model](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dygraph_cn/TracedLayer_cn.html#save_inference_model) interface then saves the static graph model as a model for inference deployment. After that, [load_inference_model](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/io_cn/load_inference_model_cn.html) loads the saved model and [Executor](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/executor_cn/Executor_cn.html#executor) runs it to check that the results are correct.
-当我们需要在DyGraph模式下利用搭建的模型进行预测任务，请在`fluid.dygraph.guard()`上下文中调用一次`YourModel.eval()`接口来切换到预测模式。例如，在之前的手写数字识别模型中我们可以使用`mnist.eval()`来切换到预测模式。需要显示地调用`YourModel.eval()`切换到预测模式的原因是，我们默认在`fluid.dygraph.guard()`上下文中是训练模式，训练模式下DyGraph在运行前向网络的时候会自动求导，添加反向网络；而在预测时，DyGraph只需要执行前向的预测网络，不需要进行自动求导并执行反向网络。
+A model saved by [save_inference_model](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/dygraph_cn/TracedLayer_cn.html#save_inference_model) can also be loaded and deployed with C++. For details, see [Introduction to the C++ inference API](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/inference_deployment/inference/native_infer.html)
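To see why tracing only suits control flow that does not depend on the input, consider the toy tracer below. It is a hypothetical illustration written in plain Python, not TracedLayer and not any Paddle API: it records the operations executed for one example input and later replays exactly those operations.

```python
# Hypothetical toy tracer; it is not TracedLayer and uses no Paddle APIs.

def model(x, record):
    """Doubles long inputs, negates short ones, and records each op it runs."""
    if len(x) > 1:  # this branch depends on the shape of the input
        record.append("scale_by_2")
        return [v * 2 for v in x]
    record.append("negate")
    return [-v for v in x]

def trace(fn, example_input):
    """Run fn once and return the list of ops executed for that input."""
    ops = []
    fn(example_input, ops)
    return ops

def replay(ops, x):
    """Re-execute the recorded ops; the original `if` no longer exists."""
    table = {"scale_by_2": lambda v: v * 2, "negate": lambda v: -v}
    for op in ops:
        x = [table[op](v) for v in x]
    return x

traced_ops = trace(model, [1.0, 2.0])     # only the taken branch is recorded
eager_out = model([3.0], [])              # eager execution takes the other branch
replayed_out = replay(traced_ops, [3.0])  # the replay still follows the traced branch
```

The eager run and the replay disagree for the one-element input, because the branch was fixed at trace time. Input-dependent control flow therefore needs a conversion that preserves the branch itself.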
-**请注意，如果您在`GPU`设备中运行`YourModel`模型，并且未调用`loss.backward`（通常来说，是进行预测时），则必须调用`YourModel.eval()`，以避免构建反向网络，否则有可能会导致显存不足。**
+* If the task contains data-dependent control flow, as in the example below where the if condition depends on the shape of the input, you can convert the model to a static graph program with ProgramTranslator and save it as an inference model with the save_inference_model interface. Then load the saved model with [load_inference_model](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/io_cn/load_inference_model_cn.html) and run it with [Executor](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/executor_cn/Executor_cn.html#executor) to check that the results are correct.
-下面的代码展示了如何使用DyGraph模式训练一个用于执行“手写数字识别”任务的模型并保存，并且利用已经保存好的模型进行预测。
+The saved model can also be loaded and deployed with C++. For details, see [Introduction to the C++ inference API](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/inference_deployment/inference/native_infer.html)
+```python
+with fluid.dygraph.guard():
+    in_np = np.array([-2]).astype('int')
+    # Convert the numpy ndarray to a Variable
+    input_var = fluid.dygraph.to_variable(in_np)
+    # The if condition depends on the shape of input_var
+    if input_var.shape[0] > 1:
+        print("input_var's shape[0] > 1")
+    else:
+        print("input_var's shape[0] <= 1")
+```
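+* For data-dependent control flow, the procedure is: 1. add the declarative decorator; 2. convert with ProgramTranslator
+1) Add the declarative decorator
+First, add the declarative decorator to the forward function of the MNIST class to mark the code to be converted. (Note: it must be added to the forward function of the outermost class.)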
-我们在`fluid.dygraph.guard()`上下文中进行了模型的保存和训练，值得注意的是，当我们需要在训练的过程中进行预测时需要使用`YourModel.eval()`切换到预测模式，并且在预测完成后使用`YourModel.train()`切换回训练模式继续训练。
-我们在`inference_mnist `中启用另一个`fluid.dygraph.guard()`，并在其上下文中`load`之前保存的`checkpoint`进行预测，同样的在执行预测前需要使用`YourModel.eval()`来切换到预测模式。
 ```python
-def test_mnist(reader, model, batch_size):
+from paddle.fluid.dygraph.jit import declarative
-    acc_set = []
-    avg_loss_set = []
+# Define the MNIST network; it must inherit from fluid.dygraph.Layer
-    for batch_id, data in enumerate(reader()):
+# The network consists of two SimpleImgConvPool sub-networks plus reshape, matmul, softmax, and accuracy layers
-        dy_x_data = np.array([x[0].reshape(1, 28, 28)
+class MNIST(fluid.dygraph.Layer):
-                              for x in data]).astype('float32')
+    # The __init__ constructor initializes variables, parameters, and sub-networks
-        y_data = np.array(
+    # Here it initializes self.pool_2_shape, the matmul parameter self.output_weight, and the SimpleImgConvPool sub-networks
-            [x[1] for x in data]).astype('int64').reshape(batch_size, 1)
+    def __init__(self):
+        super(MNIST, self).__init__()
+        self._simple_img_conv_pool_1 = SimpleImgConvPool(
+            1, 20, 5, 2, 2, act="relu")
+        self._simple_img_conv_pool_2 = SimpleImgConvPool(
+            20, 50, 5, 2, 2, act="relu")
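+        # self.pool_2_shape is the product of all dimensions except batch_size
+        # of the data produced by self._simple_img_conv_pool_2
+        self.pool_2_shape = 50 * 4 * 4
+        # self.pool_2_shape and SIZE define the shape of the self.output_weight parameter
+        SIZE = 10
+        # Define the parameter of the fully connected layer
+        self.output_weight = self.create_parameter(
+            [self.pool_2_shape, SIZE])
+    # The forward function implements the execution logic of the MNIST network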
+    @declarative
+    def forward(self, inputs, label=None):
+        x = self._simple_img_conv_pool_1(inputs)
+        x = self._simple_img_conv_pool_2(x)
+        x = fluid.layers.reshape(x, shape=[-1, self.pool_2_shape])
+        x = fluid.layers.matmul(x, self.output_weight)
+        x = fluid.layers.softmax(x)
+        if label is not None:
+            acc = fluid.layers.accuracy(input=x, label=label)
+            return x, acc
+        else:
+            return x
-        img = fluid.dygraph.to_variable(dy_x_data)
+```
-        label = fluid.dygraph.to_variable(y_data)
-        label.stop_gradient = True
-        prediction, acc = model(img, label)
-        loss = fluid.layers.cross_entropy(input=prediction, label=label)
-        avg_loss = fluid.layers.mean(loss)
-        acc_set.append(float(acc.numpy()))
-        avg_loss_set.append(float(avg_loss.numpy()))
-        # get test acc and loss
-    acc_val_mean = np.array(acc_set).mean()
-    avg_loss_val_mean = np.array(avg_loss_set).mean()
-    return avg_loss_val_mean, acc_val_mean
-def inference_mnist():
-    with fluid.dygraph.guard():
-        mnist_infer = MNIST()
-        # load checkpoint
-        model_dict, _ = fluid.dygraph.load_dygraph("paddle_dy")
-        mnist_infer.load_dict(model_dict)
-        print("checkpoint loaded")
-        # start evaluate mode
+2) Convert with ProgramTranslator
-        mnist_infer.eval()
-        def load_image(file):
-            im = Image.open(file).convert('L')
-            im = im.resize((28, 28), Image.ANTIALIAS)
-            im = np.array(im).reshape(1, 1, 28, 28).astype(np.float32)
-            im = im / 255.0 * 2.0 - 1.0
-            return im
-        cur_dir = os.path.dirname(os.path.realpath(__file__))
-        tensor_img = load_image(cur_dir + '/image/infer_3.png')
-        results = mnist_infer(fluid.dygraph.to_variable(tensor_img))
+```python
-        lab = np.argsort(results.numpy())
+from paddle.fluid.dygraph.dygraph_to_static import ProgramTranslator
-        print("Inference result of image/infer_3.png is: %d" % lab[0][-1])
 with fluid.dygraph.guard():
-    epoch_num = 1
+    prog_trans = fluid.dygraph.ProgramTranslator()
-    BATCH_SIZE = 64
    mnist = MNIST()
-    adam = fluid.optimizer.AdamOptimizer(learning_rate=0.001, parameter_list=mnist.parameters())
-    test_reader = paddle.batch(
-        paddle.dataset.mnist.test(), batch_size=BATCH_SIZE, drop_last=True)
+    in_np = np.random.random([10, 1, 28, 28]).astype('float32')
+    label_np = np.random.randint(0, 10, size=(10, 1)).astype("int64")
+    input_var = fluid.dygraph.to_variable(in_np)
+    label_var = fluid.dygraph.to_variable(label_np)
+    out = mnist(input_var, label_var)
+    prog_trans.save_inference_model("./mnist_dy2stat", fetch=[0, 1])
+```
+## 5. Tips
+### 5.1 Printing Intermediate Values and Gradients
+1. To inspect the value of any variable, use the [numpy](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/Variable_cn.html#numpy) interface.
+```
+x = y * 10
+print(x.numpy())
+```
+This prints the value of the variable directly.
+2. Inspect gradients
+After calling backward, use the [gradient](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/Variable_cn.html#gradient) interface to view the gradient of any variable.
+```
+x = y * 10
+x.backward()
+print(y.gradient())
+```
+This prints the value of the gradient directly.
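As a sanity check on what the gradient of `x = y * 10` with respect to `y` should be, the following pure-Python sketch estimates the derivative by central finite differences. It uses no Paddle APIs, and `numerical_gradient` is a hypothetical helper introduced only for illustration.

```python
def f(y):
    """The same relation as in the snippet above: x = y * 10."""
    return y * 10

def numerical_gradient(fn, y, eps=1e-6):
    """Central finite-difference estimate of d fn / d y at the point y."""
    return (fn(y + eps) - fn(y - eps)) / (2 * eps)

# The derivative of y * 10 with respect to y is 10 at every point.
grad = numerical_gradient(f, 3.0)
```

Comparing a printed gradient against an estimate of this kind is a simple way to confirm that a backward pass computes what you expect.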
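+### 5.2 Breakpoint Debugging
+Because dynamic graphs use an imperative programming style, results are available as soon as each statement executes. You can therefore use the breakpoint debugging provided by an IDE to inspect a Variable's shape, actual values, and other information, which helps locate problems in the program.
+1. As shown in the figures below, set two breakpoints in the sample program. When execution reaches the first breakpoint, you can inspect the variables x and linear1.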
+![](https://ai-studio-static-online.cdn.bcebos.com/b9bade026bea4ae797d26dcd4590452d0d563574df6b4e1cbedd0645dcbcb349)
+![](https://ai-studio-static-online.cdn.bcebos.com/c2a9096e653044849b98d94758a4ac3a77025351c1134453b2c8d18dc8ad8a73)
+2. You can also inspect the weight values in linear1.
+![](https://ai-studio-static-online.cdn.bcebos.com/e46576c64de84fa780830e1146afda0acc67fb20ea43452dadfc4949a3aad684)
+![](https://ai-studio-static-online.cdn.bcebos.com/c00a6152805a492485ba0bdde773b2ac7f544f56a0364038aa2d0681ed8d0483)
+![](https://ai-studio-static-online.cdn.bcebos.com/f9bc8a52eaa24181a6a6832e992feb9e726afa17764146c38fd69e8d008e7994)
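+### 5.3 Running in Declarative Programming Mode
+Although dynamic graphs are easy to write and debug, they require frequent interaction between Python and C++, which makes some tasks run slower than in static graph mode. In our experience, these tasks contain many fine-grained OPs (OPs with relatively little computation, such as add, subtract, multiply, divide, and sigmoid; coarse-grained OPs such as conv and matmul are not in this category).
+In practice, if such a task runs slowly, there are two ways to handle it:
+* 1. When the if/else, switch, and for/while statements you use do not depend on the input (neither its values nor its shape), you can run in static graph mode without changing the model definition. This switches model training to static graph mode, unlike Section 4, where only inference deployment was switched to static graph mode.
+* 2. If you use input-dependent control flow, see the section [How to rewrite a dynamic graph as a static graph](https://www.paddlepaddle.org.cn/tutorials/projectdetail/360460#anchor-3) and rewrite the dynamic graph code.
+The first approach is described below, again using the handwritten digit recognition task as the example. The static graph implementation is as follows: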
+```python
+# Set the number of passes over all samples (epoch_num) and the batch size (BATCH_SIZE).
+epoch_num = 1
+BATCH_SIZE = 64
+main_program = fluid.Program()
+startup_program = fluid.Program()
+with fluid.program_guard(main_program=main_program, startup_program=startup_program):
+    # In static graph mode, an executor runs the previously defined network
+    exe = fluid.Executor(fluid.CPUPlace())
+    # Create an instance of the MNIST class; the network defined for dynamic graph mode can be reused
+    mnist_static = MNIST()
+    # Create the optimizer; static graph mode does not need the parameter_list argument
+    sgd_static = fluid.optimizer.SGDOptimizer(learning_rate=1e-3)
+    # Get the preprocessed MNIST training set by calling the train function of paddle.dataset.mnist
    train_reader = paddle.batch(
-        paddle.dataset.mnist.train(),
+        paddle.dataset.mnist.train(), batch_size=BATCH_SIZE, drop_last=True)
-        batch_size=BATCH_SIZE,
-        drop_last=True)
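+    # Static graphs must declare input variables (placeholders) explicitly: no data is read while the network is being built, so placeholders specify the dtype, shape, and other information of the input data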
+    img_static = fluid.data(
+        name='pixel', shape=[None, 1, 28, 28], dtype='float32')
+    label_static = fluid.data(name='label', shape=[None, 1], dtype='int64')
+    # Call the network to run the forward computation
+    cost_static = mnist_static(img_static)
+    # Compute the loss
+    loss_static = fluid.layers.cross_entropy(cost_static, label_static)
+    avg_loss_static = fluid.layers.mean(loss_static)
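+    # Call the optimizer's minimize interface to compute gradients and apply the update
+    sgd_static.minimize(avg_loss_static)
+    # Static graphs require the network to be initialized explicitly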
+    exe.run(fluid.default_startup_program())
    for epoch in range(epoch_num):
        for batch_id, data in enumerate(train_reader()):
-            dy_x_data = np.array([x[0].reshape(1, 28, 28)
+            x_data_static = np.array(
-                                  for x in data]).astype('float32')
+                [x[0].reshape(1, 28, 28)
-            y_data = np.array(
+                for x in data]).astype('float32')
-                [x[1] for x in data]).astype('int64').reshape(-1, 1)
+            y_data_static = np.array(
+                [x[1] for x in data]).astype('int64').reshape([BATCH_SIZE, 1])
-            img = fluid.dygraph.to_variable(dy_x_data)
+            fetch_list = [avg_loss_static.name]
-            label = fluid.dygraph.to_variable(y_data)
+            # In static graph mode, call the executor's run method to perform the computation; results to retrieve (such as avg_loss) are specified through fetch_list
-            label.stop_gradient = True
+            out = exe.run(
+                fluid.default_main_program(),
+                feed={"pixel": x_data_static,
+                    "label": y_data_static},
+                fetch_list=fetch_list)
-            cost, acc = mnist(img, label)
+            static_out = out[0]
-            loss = fluid.layers.cross_entropy(cost, label)
+            if batch_id % 100 == 0 and batch_id != 0:
-            avg_loss = fluid.layers.mean(loss)
+                print("epoch: {}, batch_id: {}, loss: {}".format(epoch, batch_id, static_out))
+```
+Rewriting a dynamic graph as a static graph involves the following changes:
+1. Define placeholders
-            avg_loss.backward()
+* Use fluid.data to define placeholders; in a static graph this means the data is supplied only when the executor runs.
-            adam.minimize(avg_loss)
+2. Build the network
-            # save checkpoint
-            mnist.clear_gradients()
+* In static graph mode the optimizer does not need the parameter_list argument.
-            if batch_id % 100 == 0:
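+* Feed the placeholders to the model to run the forward pass, compute the loss, and then use the optimizer to minimize the loss, which yields the network to be trained.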
-                print("Loss at epoch {} step {}: {:}".format(
-                    epoch, batch_id, avg_loss.numpy()))
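+3. Execute
+* The network must be initialized.
+* Use an executor to run the previously defined network by calling its run method to perform the computation.
+### 5.4 Blocking Backpropagation
+In some tasks you only want the forward prediction without updating parameters, or you want to prune the backward pass to reduce computation by blocking gradient propagation. Paddle provides two solutions: the [detach](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/Variable_cn.html#detach) interface and the [stop_gradient](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/Variable_cn.html#stop_gradient) interface. The detach interface is recommended.
+1. The detach interface (recommended)
+Usage: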
+```
+fw_out = fw_out.detach()
+```
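+The detach() interface creates a new temporary variable that is separated from the current computation graph but holds the content of the current variable.
+This interface blocks gradient propagation in the backward pass.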
+```python
+import paddle.fluid as fluid
+import numpy as np
+with fluid.dygraph.guard():
+    value0 = np.arange(26).reshape(2, 13).astype("float32")
+    value1 = np.arange(6).reshape(2, 3).astype("float32")
+    value2 = np.arange(10).reshape(2, 5).astype("float32")
+    # Convert the ndarrays to Variables
+    a = fluid.dygraph.to_variable(value0)
+    b = fluid.dygraph.to_variable(value1)
+    c = fluid.dygraph.to_variable(value2)
-        mnist.eval()
+    # Build the fc and fc2 layers
-        test_cost, test_acc = test_mnist(test_reader, mnist, BATCH_SIZE)
+    fc = fluid.Linear(13, 5, dtype="float32")
-        mnist.train()
+    fc2 = fluid.Linear(3, 3, dtype="float32")
-        print("Loss at epoch {} , Test avg_loss is: {}, acc is: {}".format(
-            epoch, test_cost, test_acc))
-    fluid.dygraph.save_dygraph(mnist.state_dict(), "paddle_dy")
+    # Run the forward computation of fc and fc2
-    print("checkpoint saved")
+    out1 = fc(a)
+    out2 = fc2(b)
-    inference_mnist()
+    # No backward computation is performed for the out1 subgraph
+    out1 = out1.detach()
+    out = fluid.layers.concat(input=[out1, out2, c], axis=1)
+    out.backward()
+    # out1.gradient() is all zeros here, and the gradient of fc.weight is never initialized
+    assert (out1.gradient() == 0).all()
 ```
-输出：
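+2. The stop_gradient interface
+Every Variable has a stop_gradient attribute, which can be used to exclude parts of the graph from the backward gradient computation at a fine granularity to improve efficiency.
+If any input of an OP requires a gradient, the output of that OP also requires a gradient. Conversely, the output of an OP does not require a gradient only when none of its inputs requires one. In a subgraph where no Variable requires a gradient, the backward computation is skipped.
+In dynamic graph mode, stop_gradient defaults to True for every Variable except parameters, whose stop_gradient defaults to False. This attribute is used for automatic pruning to avoid unnecessary backward computation.
+Usage: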
 ```
-Loss at epoch 0 step 0: [2.2991252]
+fw_out.stop_gradient = True
-Loss at epoch 0 step 100: [0.15491392]
-Loss at epoch 0 step 200: [0.13315125]
-Loss at epoch 0 step 300: [0.10253005]
-Loss at epoch 0 step 400: [0.04266362]
-Loss at epoch 0 step 500: [0.08894891]
-Loss at epoch 0 step 600: [0.08999012]
-Loss at epoch 0 step 700: [0.12975612]
-Loss at epoch 0 step 800: [0.15257305]
-Loss at epoch 0 step 900: [0.07429226]
-Loss at epoch 0 , Test avg_loss is: 0.05995981965082674, acc is: 0.9794671474358975
-checkpoint saved
-No optimizer loaded. If you didn't save optimizer, please ignore this. The program can still work with new optimizer.
-checkpoint loaded
-Inference result of image/infer_3.png is: 3
 ```
-## 编写兼容的模型
+When a Variable's stop_gradient attribute is set to True, gradients stop propagating at that Variable during backpropagation.
-以上一步中手写数字识别的例子为例，命令式编程模式的模型代码可以直接用于声明式编程模式中作为模型代码，执行时，直接使用PaddlePaddle声明式编程模式执行方式即可，这里以声明式编程模式中的`executor`为例, 模型代码可以直接使用之前的模型代码，执行时使用`Executor`执行即可
 ```python
-epoch_num = 1
+import paddle.fluid as fluid
-BATCH_SIZE = 64
+import numpy as np
-exe = fluid.Executor(fluid.CPUPlace())
-mnist = MNIST()
+with fluid.dygraph.guard():
-sgd = fluid.optimizer.SGDOptimizer(learning_rate=1e-3, parameter_list=mnist.parameters())
+    value0 = np.arange(26).reshape(2, 13).astype("float32")
-train_reader = paddle.batch(
+    value1 = np.arange(6).reshape(2, 3).astype("float32")
-    paddle.dataset.mnist.train(), batch_size=BATCH_SIZE, drop_last=True)
+    value2 = np.arange(10).reshape(2, 5).astype("float32")
-img = fluid.layers.data(
-    name='pixel', shape=[1, 28, 28], dtype='float32')
+    # Convert the ndarray data to Variable
-label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+    a = fluid.dygraph.to_variable(value0)
-cost = mnist(img)
+    b = fluid.dygraph.to_variable(value1)
-loss = fluid.layers.cross_entropy(cost, label)
+    c = fluid.dygraph.to_variable(value2)
-avg_loss = fluid.layers.mean(loss)
-sgd.minimize(avg_loss)
+    # Build the fc and fc2 layers
+    fc = fluid.Linear(13, 5, dtype="float32")
-out = exe.run(fluid.default_startup_program())
+    fc2 = fluid.Linear(3, 3, dtype="float32")
-for epoch in range(epoch_num):
+    # Run the forward computation of fc and fc2
-    for batch_id, data in enumerate(train_reader()):
+    out1 = fc(a)
-        static_x_data = np.array(
+    out2 = fc2(b)
-            [x[0].reshape(1, 28, 28)
-             for x in data]).astype('float32')
-        y_data = np.array(
-            [x[1] for x in data]).astype('int64').reshape([BATCH_SIZE, 1])
-        fetch_list = [avg_loss.name]
+    # Equivalent to skipping the backward computation for the out1 subgraph
-        out = exe.run(
+    out1.stop_gradient = True
-            fluid.default_main_program(),
-            feed={"pixel": static_x_data,
-                  "label": y_data},
-            fetch_list=fetch_list)
-        static_out = out[0]
+    out = fluid.layers.concat(input=[out1, out2, c], axis=1)
+    out.backward()
-        if batch_id % 100 == 0 and batch_id is not 0:
+    # The gradients of the fc parameters are all zeros here
-            print("epoch: {}, batch_id: {}, loss: {}".format(epoch, batch_id, static_out))
+    assert (fc.weight.gradient() == 0).all()
+    assert (out1.gradient() == 0).all()
 ```
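The pruning rule described in this section (an OP's output needs a gradient as soon as one of its inputs does) can be sketched in plain Python. This only illustrates the rule and is not Paddle's implementation; `Var` and `apply_op` are hypothetical names introduced for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Var:
    """Hypothetical stand-in for a Variable; it only tracks stop_gradient."""
    name: str
    stop_gradient: bool = True  # non-parameter Variables default to True


def apply_op(name, inputs):
    """Return the output Var of an OP applied to `inputs`.

    The output needs a gradient if at least one input needs one, so its
    stop_gradient is True only when every input has stop_gradient=True.
    """
    return Var(name, stop_gradient=all(v.stop_gradient for v in inputs))


# Parameters default to stop_gradient=False, data inputs to True.
weight = Var("weight", stop_gradient=False)
data = Var("data")
label = Var("label")

out1 = apply_op("out1", [data, weight])  # one input needs a gradient
out2 = apply_op("out2", [data, label])   # no input needs a gradient

print(out1.stop_gradient, out2.stop_gradient)
```

A subgraph made only of Variables like `out2` would be skipped entirely during the backward pass.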
--- a/doc/fluid/release_note_cn.md
+++ b/doc/fluid/release_note_cn.md
 # Release Note
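 ## Important Updates
-This release deeply optimizes the functionality, performance, and experience of imperative programming mode (dynamic graph) and further strengthens the framework's basic functions; the performance of the native inference library is significantly optimized, the lightweight inference engine PaddleLite greatly extends its hardware coverage, and PaddleServing is fully upgraded to provide powerful, easy-to-use service deployment. The corresponding development kits and utility components are further enriched and improved, the functions and experience of existing kits and components keep improving, and the PaddleClas image classification kit and the Paddle Quantum quantum machine learning framework are newly released.
+This release deeply optimizes the functionality, performance, and experience of imperative programming mode (dynamic graph) and further strengthens the framework's basic functions; the performance of the native inference library is significantly optimized, the lightweight inference engine PaddleLite greatly extends its hardware coverage, the front-end inference engine Paddle.js is newly released, and PaddleServing is fully upgraded to provide powerful, easy-to-use service deployment. The corresponding development kits and utility components are further enriched and improved, the functions and experience of existing kits and components keep improving, and the PaddleClas image classification kit and the Paddle Quantum quantum machine learning framework are newly released.
-**Training framework** :Deeply optimizes the functionality, performance, and experience of imperative programming (dynamic graph), in particular strengthening dynamic-to-static conversion: dynamic graph implementations with data-dependent control flow can be saved statically for deployment or converted to static graph mode for training; the functions of Data Loader and the usage of gradient clipping are further optimized; problems in declarative programming mode such as fetching variable-length Tensors in multi-card runs are fixed, and mixed precision combined with recomputation shows good results in supporting large-batch training. A large number of APIs are added, along with ComplexVariable, which supports representing complex tensors and common complex operations.
+**Training framework:** Deeply optimizes the functionality, performance, and experience of imperative programming (dynamic graph), in particular strengthening dynamic-to-static conversion: dynamic graph implementations with data-dependent control flow can be saved statically for deployment or converted to static graph mode for training; the functions of Data Loader and the usage of gradient clipping are further optimized; problems in declarative programming mode such as fetching variable-length Tensors in multi-card runs are fixed, and mixed precision combined with recomputation shows good results in supporting large-batch training. A large number of APIs are added, along with ComplexVariable, which supports representing complex tensors and common complex operations.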
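-**Inference deployment** :Paddle inference adds multi-threaded multi-stream support under CUDA and TRT subgraph support for dynamic shape input, strengthens quantized inference, and significantly optimizes performance; Paddle Serving is fully upgraded, with more complete functions and significantly better usability; Paddle Lite further optimizes the compilation and installation experience and comprehensively improves the coverage of supported chips (including RK, MTK, Baidu Kunlun, Cambricon, Bitmain, Huawei NPU, and others) as well as the number and performance of the corresponding models; PaddleSlim quantization, pruning, and NAS functions continue to be strengthened; Paddle.js is newly released: it is the first open-source JavaScript deep learning front-end inference engine in China and helps users deploy deep learning models on web pages and build applications such as mini programs, interactive web games, beauty makeup, and virtual try-on;
+**Inference deployment:** Paddle inference adds multi-threaded multi-stream support under CUDA and TRT subgraph support for dynamic shape input, strengthens quantized inference, and significantly optimizes performance; Paddle Serving is fully upgraded, with more complete functions and significantly better usability; Paddle Lite further optimizes the compilation and installation experience and comprehensively improves the coverage of supported chips (including RK, MTK, Baidu Kunlun, Cambricon, Bitmain, Huawei NPU, and others) as well as the number and performance of the corresponding models; PaddleSlim quantization, pruning, and NAS functions continue to be strengthened; releases Paddle.js, the first open-source JavaScript deep learning front-end inference engine in China, which helps users deploy deep learning models on web pages.
-**Development kits** :Releases the new PaddleClas, which includes 23 image classification network implementations and 117 pre-trained image models, and adds auxiliary strategies such as data augmentation and SSLD distillation as well as featured application cases; fully upgrades the PaddleSeg portrait segmentation model series and adds several remote-sensing-related strategies; PaddleDetection, PaddleOCR, and the text-to-speech kit Parakeet cover more algorithms and run significantly faster.
+**Development kits:** Releases the new PaddleClas, which includes 23 image classification network implementations and 117 pre-trained image models, and adds auxiliary strategies such as data augmentation and SSLD distillation as well as featured application cases; fully upgrades the PaddleSeg portrait segmentation model series and adds several remote-sensing-related strategies; PaddleDetection, PaddleOCR, and the text-to-speech kit Parakeet cover more algorithms and run significantly faster.
-**Utility components** :PaddleHub adds more models, including a series of pre-trained vision models, and BERT-type pre-trained models support one-click loading in dynamic graph mode; PaddleFL releases version 1.0, open-sources federated learning based on Multi-party Computation (MPC), and supports horizontal, vertical, and other federated learning scenarios; PGL releases ERNIESage, the industry's first graph neural network model that combines semantic and structural information; PARL open-sources Evokit, the industry's first evolutionary learning application framework; the quantum machine learning framework Paddle Quantum is newly released.
+**Utility components:** PaddleHub adds more models, including a series of pre-trained vision models, for a total of more than 120 models; PaddleFL releases version 1.0, open-sources federated learning based on Multi-party Computation (MPC), and supports horizontal, vertical, and other federated learning scenarios; PGL releases ERNIESage, the industry's first graph neural network model that combines semantic and structural information; PARL open-sources Evokit, the industry's first evolutionary learning application framework; the quantum machine learning framework Paddle Quantum is newly released.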
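-## Basic Framework
+##  Basic Framework
 ### New APIs
 - Adds `fluid.device_guard`: sets the device on which an OP runs to CPU or GPU.
- Adds the `fluid.enable_dygraph` and `fluid.disable_dygraph` interfaces, which turn dynamic graph mode on and off through function calls and reduce code indentation compared with `with fluid.dygraph.guard()`.
+- Adds the `fluid.enable_imperative` and `fluid.disable_imperative` interfaces, which turn dynamic graph mode on and off through function calls and reduce code indentation compared with `with fluid.dygraph.guard()`.
 - Adds 4 APIs in the fluid.dygraph directory (see the documentation for definitions): BCELoss, L1Loss, MSELoss, NLLLoss, InstanceNorm
 - Adds 30 APIs in the fluid.layers directory (see the documentation for definitions): addmm, allclose, arange, bmm, clamp, cross, diag_embed, dist, dot, elementwise_equal, flip, full, full_like, index_select, interpolate, log1p, log_softmax, logsumexp, meshgrid, nonzero, randint, randn, randperm, resize_bicubic, resize_linear, roll, t, tril, triu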
@@ -31,7 +29,7 @@
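    -  `no_grad` could previously be used only as a decorator in dynamic graph mode; it can now also be used as a context manager, which makes it easier to write gradient-free code in dynamic graph mode.
    - To make it easy to set the train/eval mode of the batchnorm and dropout ops separately, the train/eval mode information moves from a global setting to a setting inside each Layer; adds a Layer-form Dropout that records the mode information.
    - The `cond`, `switch`, and `while_loop` control flow interfaces and tensor array reads and writes can also be used in dynamic graph mode, which helps unify the high-level APIs.
-    - Changes the behavior of if var in dynamic graph mode so that the test is made on the value in var, which fixes the problem that if x > y did not behave as expected in dynamic graph mode; also supports converting var to float/long/int/len/index, improving the usability of dynamic graphs.
+    - Changes the behavior of `if var` in dynamic graph mode (incompatible upgrade) so that the test is made on the value in var, which fixes the problem that if x > y did not behave as expected in dynamic graph mode; also supports converting var to float/long/int/len/index, improving the usability of dynamic graphs.
    - For tasks that depend heavily on hooks, adds forward pre-hook and forward post-hook interfaces to Layer, so that the values of intermediate variables in the network can be read or changed conveniently without changing the structure of the network's inputs and outputs, improving the usability of dynamic graphs.
    - The cudnn algorithm cache now takes effect in dynamic graph mode, improving performance on the waveflow model by 200%.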
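The forward pre-hook and post-hook item in the list above can be illustrated with a small pure-Python sketch. It assumes a simplified `Layer` class written only for this example. The class, its method names, and the hook signatures are hypothetical and are not taken from Paddle's source.

```python
class Layer:
    """Hypothetical minimal layer with forward pre-hooks and post-hooks."""

    def __init__(self):
        self._pre_hooks = []
        self._post_hooks = []

    def register_forward_pre_hook(self, hook):
        self._pre_hooks.append(hook)

    def register_forward_post_hook(self, hook):
        self._post_hooks.append(hook)

    def forward(self, x):
        raise NotImplementedError

    def __call__(self, x):
        # A pre-hook may return a replacement input, or None to keep it.
        for hook in self._pre_hooks:
            replaced = hook(self, x)
            if replaced is not None:
                x = replaced
        out = self.forward(x)
        # A post-hook may return a replacement output, or None to keep it.
        for hook in self._post_hooks:
            replaced = hook(self, x, out)
            if replaced is not None:
                out = replaced
        return out


class Double(Layer):
    def forward(self, x):
        return 2 * x


seen = []
layer = Double()
layer.register_forward_pre_hook(lambda lyr, x: x + 1)                 # changes the input
layer.register_forward_post_hook(lambda lyr, x, out: seen.append(out))  # only observes
result = layer(3)
```

The hooks read and change intermediate values while `Double.forward` and its call signature stay untouched, which is the property the release note highlights.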

--- a/doc/fluid/release_note_en.md
+++ b/doc/fluid/release_note_en.md
+#  Release Note
+## Important Updates
+This version deeply optimizes the function, performance, and experience of the imperative programming mode (dynamic graph), and further strengthens the basic functions of the framework. It also significantly optimizes the performance of the native inference library, greatly extends the hardware coverage of the lightweight inference engine Paddle Lite, and comprehensively upgrades Paddle Serving to provide a powerful and easy-to-use service deployment capability. This version further enriches and improves the corresponding development kits and utility components, continues to improve the function and experience of the existing kits and components, and releases a new image classification kit, PaddleClas, and the quantum machine learning framework Paddle Quantum.
+**Training framework:** Deeply optimizes the function, performance, and experience of imperative programming (dynamic graph) and especially enhances the capability of converting dynamic graph to static graph. Supports converting dynamic graph models with data-dependent control flow into static graphs for saving and deployment, or for training in static graph mode. Further optimizes the function of Data Loader and the usage of gradient clipping. Fixes problems in declarative programming mode such as fetching variable-length tensors in multi-card runs. The combination of mixed precision and recomputation shows good results in large-batch training. Adds a number of APIs and ComplexVariable, which supports complex number tensor expressions and common complex number operations.
+**Inference Deployment:**  For Paddle inference, adds multi-threaded multi-stream support under CUDA and TRT subgraph support for dynamic shape input, strengthens quantization inference, and significantly optimizes the performance. Fully upgrades Paddle Serving, improves its function, and significantly enhances its usability. Further optimizes the compilation and installation experience of Paddle Lite, comprehensively improves the coverage of supported chips (including RK, MTK, Baidu Kunlun, Cambricon, Bitmain, and Huawei NPU) as well as the corresponding model quantity and performance. Continues to strengthen the PaddleSlim quantization, pruning, and NAS functions. Releases Paddle.js, the first open source JavaScript front-end inference engine for deep learning in China, which helps users deploy deep learning models on web pages.
+**Development kits:**  Releases PaddleClas including 23 image classification network implementations and 117 image pre-training models. Adds data augmentation, SSLD distillation, and other auxiliary strategies as well as characteristic application cases. Fully upgrades the PaddleSeg portrait segmentation series of models and adds multiple remote sensing related strategies. The coverage of PaddleDetection, PaddleOCR, and text-to-speech kit Parakeet algorithms is more comprehensive and the speed is increased significantly.
+**Utility Components:**  For PaddleHub, adds more models including a series of vision pre-training models, bringing the total number of pre-trained models to more than 120. Releases PaddleFL Version 1.0, open sources federated learning based on multi-party computation (MPC), and supports multiple federated learning scenarios such as horizontal and vertical federated learning. For PGL, releases the industry's first graph neural network model ERNIESage, which combines semantic information with structural information. For PARL, open sources the industry's first evolutionary learning application framework Evokit. Releases a new quantum machine learning framework Paddle Quantum.
+## Basic Framework
+### New APIs
+- Adds `fluid.device_guard`: Sets an OP's running device to CPU or GPU.
+- Adds `fluid.enable_imperative` and `fluid.disable_imperative`, which enable and disable dynamic graph mode through function calls and reduce code indentation compared with `with fluid.dygraph.guard()`.
+- Adds four APIs in the fluid.dygraph directory (see the document for details): BCELoss, L1Loss, MSELoss, NLLLoss, and InstanceNorm
+- Adds 30 APIs in the fluid.layers directory (see the document for details): addmm, allclose, arange, bmm, clamp, cross, diag\_embed, dist, dot, elementwise\_equal, flip, full, full\_like, index\_select, interpolate, log1p, log\_softmax, logsumexp, meshgrid, nonzero, randint, randn, randperm, resize\_bicubic, resize\_linear, roll, t, tril, and triu
+### Function Optimization
+- Imperative Programming Mode (Dynamic Graph):
+  - Enhances the dynamic-to-static function, adds ProgramTranslator based on grammar analysis and transformation, and supports the deployment of dynamic graph model with data-dependent control flow; supports the transformation of the dynamic graph model into the static graph model for training and improves the training performance of tasks such as RNN.
+  - Reconstitutes the variable life cycle management mechanism of the dynamic graph to ensure that memory/GPU memory resources can be released correctly in train mode without calling the var.backward() API.
+  - Adds the double grad function in dynamic graph mode and supports the GAN model training relying on gradient penalty.
+  - `no_grad` could previously be set only through a decorator in dynamic graph mode; adds a context manager form to make it easier to write gradient-free operations in dynamic graph code.
+  - To facilitate separate setting of the train/eval mode of batchnorm and dropout ops, changes the train/eval mode information to the internal setting of Layer from the global setting. Adds Layer-formed Dropout with the mode information.
+  - Supports the use of the `cond` `switch` `while_loop` control flow interfaces and tensor array read-write in dynamic graph mode to facilitate the unification of high-level APIs.
+  - Modifies the behavior of `if var` in dynamic graph mode so that the test is made on the value in var (incompatible upgrade). This fixes the problem that `if x > y` did not behave as expected in dynamic graph mode. Supports converting var into float/long/int/len/index to enhance the usability of the dynamic graph.
+  - For the functions that strongly rely on hook in tasks, adds the forward pre-hook and forward post-hook APIs for Layer to easily obtain and change the values of variables of the network without changing the structure of network input and output, thus improving the usability of the dynamic graph.
+  - Supports the validity of the cudnn algorithm cache in dynamic graph mode and improves the performance by 200% on the waveflow model.
+- Declarative Programming Mode (Static Graph):
+  - The executor supports automatic pruning of the network during run time according to the feed and fetch variables. Remove the parts irrelevant to the current feed and fetch to improve the running efficiency. Supports the multi-task learning network.
+  - Optimizes the back propagation process. Automatically prunes the variables that do not need to be back propagated, so it is no longer necessary to explicitly set `stop_gradient=True` for variables when building the network.
+  - The executor supports fetching variable-length tensors in multi-card runs, which better supports tasks that use variable-length data (e.g. some NLP tasks).
+  - Fixes the problem that some tail data was discarded in the single-process multi-card inference phase when there was less data than cards; setting `drop_last=False` in DataLoader avoids discarding the tail data.
+  - Adds a mixed precision (AMP) and recomputation combination mechanism. When they are jointly used for the Bert-large model, the maximum batch size and the throughput are increased by 400% and 17.5%-31.4% respectively.
+- DataLoader:
+  - Adds accelerated data reading in multi-process mode. For Map-style datasets, users can improve the data reading performance by implementing a user-defined Dataset and BatchSampler. The speed is significantly increased for tasks with a large amount of data reading or complex pre-processing. For example, with the multi-process reading mode on the video classification TSM model, the training performance is improved by 419% in declarative programming mode ("static graph") and 89.6% in imperative programming mode ("dynamic graph").
+- Usage Method of Gradient Clipping:
+  - The clipping type is passed in by the optimizer's `grad_clip` parameter. Clipping all parameters and clipping only some parameters are both supported. The original `set_gradient_clip` API is no longer recommended and may be removed in subsequent versions. The `grad_clip` parameter is removed from `ParamAttr` (incompatible upgrade), so gradient clipping of a single parameter can no longer be performed through ParamAttr. Gradient clipping of some parameters can only be implemented through the above new APIs.
+- Dynamic graphs, static graphs, and high-level APIs support consistent call of collective operators.
+- Removes the code related to the nGraph library because Intel has stopped maintaining nGraph.
+- Removes unused or incompatible attributes such as the `is_test` attribute from all MKL-DNN-related ops.
+- Adds the Support for Complex Number Computation:
+  - Adds ComplexVariable and supports complex number tensor expressions and common complex number operations, including four basic operations, matmul, kron product, reshape, and transpose.
+- Function Upgrade of the Performance Analysis Tool (Profiler):
+  - Supports hierarchical statistics and printing of profile results based on nested call relationships between events.
+  - Adds the tracer\_option parameter which can be configured as `Default`, `OpDetail`, and `AllOpDetail`. Supports users to select different levels of timing and analysis objects.
+  - Adds the statistic function of framework overhead and GpuMemcpy operations.
+- Full Optimization of Error Messages
+  - Optimizes an accumulative total of thousands of vague error messages and standardizes the error type and description.
+  - Automatically detects some user misoperations and gives clear error messages.
+  - Optimizes GPU-related API error messages, converts unreadable error codes into specific messages, and keeps synchronous with information on NVIDIA's official website.
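The `no_grad` item in the list above says the feature can now be used as a context manager as well as a decorator. One way a single Python object can support both forms is `contextlib.ContextDecorator`. The sketch below shows that pattern with a hypothetical module-level flag standing in for the framework's gradient-tracking state. It is not Paddle's implementation.

```python
import contextlib

_grad_enabled = True  # hypothetical flag standing in for the tracer state


def grad_enabled():
    return _grad_enabled


class no_grad(contextlib.ContextDecorator):
    """Turn the hypothetical flag off inside a with-block or a decorated function."""

    def __enter__(self):
        global _grad_enabled
        self._previous = _grad_enabled
        _grad_enabled = False
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        global _grad_enabled
        _grad_enabled = self._previous
        return False  # never swallow exceptions


@no_grad()
def evaluate():
    # Decorator form: the flag is off for the whole call.
    return grad_enabled()


with no_grad():
    # Context manager form: the flag is off only inside this block.
    inside_block = grad_enabled()

inside_function = evaluate()
after = grad_enabled()
```

The context manager form is convenient when only a few statements inside a larger function should run without gradients.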
+### Performance Optimization
+- Imperative Programming Mode ("Dynamic Graph"):
+  - Optimizes the data structure of the automatically generated OP function to reduce the framework overhead, and increases the single-card training speed of the ptb lm model by 4%.
+  - Optimizes the design of the InferVarType interface to reduce the framework overhead, raises the speed of InferVarType, and increases the training speed of ptb lm model by over 5%.
+  - Reduces the unnecessary addition of attributes in the dynamic graph ops to reduce the framework overhead, and increases the training speed of ptb lm model by 4%.
+  - To improve the data loading performance, implements shared memory storage allocation for tensors and the tensor serialization and deserialization mechanism, supports the transmission of tensors between processes, optimizes the asynchronous performance of DataLoader in dynamic graph mode, and further increases the single card training speed of ResNet model on the P40 machine by over 15%.
+  - Optimizes the performance of the dynamic graph variable slice by 60% and supports step in the slice to be negative.
+- Declarative Programming Mode ("Static Graph"):
+  - Adds the automatic fusion function. Supports the fusion of subgraphs composed of elementwise, activation, sum, cast, scale, fill\_constant, and other element-wise operators. The performance improvement depends on the number of related subgraphs matched in the network. Currently, the training speed of the RNN language model is greatly improved.
+- OP Performance Optimization
+  - Adds caches for Prepare Data during the OP execution process, accelerates by an average of 2% for 10+ model training tasks, and reduces the framework overhead by up to 6%.
+  - Optimizes the GPU performance of depthwise\_conv2d to accelerate by 20% at the common parameter settings.
+  - Optimizes the implementation of the GPU broadcast mode of elementwise\_mul to accelerate by 2-50 times for different inputs.
+  - Optimizes the GPU implementation of conv2d\_transpose to achieve significant performance improvement for fp16.
+  - Optimizes the implementation of shape OP to avoid waiting due to unnecessary data transfer between different devices.
+#### Bug Fixes
+- To ensure successful operation at a large SGD data size, fix the SGD error `Xbyak::Error` problem that occurs when the data size is very large.
+- Fix the MKL memory leak problem under the Linux version.
+- Fix the bug of command line parameter parsing at the dynamic graph multi-card startup.
+- Fix the problem that occurs when the clone(for\_test=True) interface processes a network containing the control flow op.
+- Fix cyclic dependency between the dynamic and static graph modules.
+- Fix the compatibility problem of pickle dump/load between Python 2 \& 3.
+- Fix the problem of the parameter not being registered or overridden as none for the dynamic graph layer.
+- Fix the problem of Op output Var naming conflict caused when different Op names have the same attribute value.
+- Fix the output LoD setting when axis=0, which shall be the splicing of input LoD.
+- Fix the bug of BatchNorm mean and var that cannot be updated in eval mode.
+## Inference Deployment
+### Paddle Inference
+#### Function Upgrade
+- Adds TRT subgraph support for dynamic shape input as well as the `config.SetTRTDynamicShapeInfo(min_input_shape, max_input_shape, opt_input_shape)` interface. This interface is used to specify the minimum, maximum, and optimal shape information of the input of the subgraph (optimal shape means that TRT will select the runtime optimal kernel at this shape). After the shape information is specified, the dynamic shape mode is used during the Paddle-TRT operation, and input of any shape between `min_input_shape` and `max_input_shape` is supported during the inference. This function supports FCN, Faster RCNN, Ernie/Bert, and other dynamic shape input models.
+- To meet the need for binding the computation flow to the current thread during user inference, refactors the device context data structure to support the CUDA computation flow priority, and adds a thread local GPU memory allocator ThreadLocalAllocator. Has the ability to bind different threads to different CUDA streams.
+- The MKL-DNN quantization function fully supports all quantitative models. Adds the support for 'weight\_quantize\_type' as range\_abs\_max and 'channel\_wise\_abs\_max'. Supports the out\_threshold attribute.
+- Adds official website inference API reference
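The dynamic shape item above says inference accepts any input shape between `min_input_shape` and `max_input_shape`. A minimal sketch of that range check follows. The helper `shape_in_range` and the plain-list shapes are assumptions made for this example, since the note does not describe the argument types of the real interface.

```python
def shape_in_range(shape, min_shape, max_shape):
    """Return True if every dimension lies within [min, max] for that dimension."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= dim <= hi for dim, lo, hi in zip(shape, min_shape, max_shape))


# Hypothetical NCHW bounds for one input tensor.
min_input_shape = [1, 3, 224, 224]
max_input_shape = [8, 3, 640, 640]
opt_input_shape = [1, 3, 416, 416]  # the shape the kernels would be tuned for

accepted = shape_in_range([4, 3, 320, 320], min_input_shape, max_input_shape)
rejected = shape_in_range([16, 3, 320, 320], min_input_shape, max_input_shape)
```

In this sketch the optimal shape is just another shape that must fall inside the same bounds.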
+#### Performance Optimization
+- For the targeted optimization of CUDA Bert/Ernie, adds `embedding_eltwise_layernorm` fusion implementation and optimizes the `multihead_matmul` and `fc_elementwise_layernorm` fusion implementation. Compared with the previous version, the ernie fp32 inference is optimized to 8.7 ms from 10 ms or by 13% under the conditions of P4 card, cuda10, and batch\_size=1.
+- Adds TRT subgraph support for the dynamic shape of the Ernie/Bert model. Under the conditions of T4 card, cuda10, and batch\_size=1, the ernie fp16 inference latency is 2.9 ms, 56% lower than the 6.6 ms of fp32.
+- Optimization of mobilenet v3 by Paddle-TRT. Supports TRT hard sigmoid OP and adds a hard swish plugin. The inference is optimized to 2.29 ms from 3.48 ms or by 34% under the conditions of batch\_size = 1 and P4, or to 1.33 ms from 2.76 ms or by 51% under the conditions of V100.
+- Adds the support for the swish activation function DNNL so that the ShuffleNet performance is improved by 76% on the 6248 single-core processor.
+- Quantization: Adds the support for matmul op quantization; adds the `matmul+transpose+reshape` fuse and the `scale+matmul` fuse. After matmul quantization and fuse addition, the performance of the Ernie fp32 and quantized INT8 models is improved by about 10%(on the 6271 machine).
+- Adds the support for DNNL inplace op: Currently, the execution of inplace of `elementwise_add` and most activation functions including softmax, gelu, and relu are supported so that the Ernie performance is improved by about 2% on 6248.
+- After the above optimization and quantization, the speed of the current Ernie INT8 model is increased by about 5.51 times compared with the FP32 model on which DNNL optimization (including fuses) and quantization are not performed.
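The latency figures quoted in this list can be cross-checked with a few lines of arithmetic. The sketch below treats each percentage as the relative drop in latency, which is an assumption about how the note computes them. The helper name is hypothetical.

```python
def latency_reduction_percent(before_ms, after_ms):
    """Percentage by which latency dropped going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms * 100.0


# (before, after) latencies in milliseconds, copied from the items above.
cases = {
    "ernie fp32 on P4": (10.0, 8.7),
    "ernie fp16 vs fp32 on T4": (6.6, 2.9),
    "mobilenet v3 on P4": (3.48, 2.29),
    "mobilenet v3 on V100": (2.76, 1.33),
}
reductions = {
    name: latency_reduction_percent(before, after)
    for name, (before, after) in cases.items()
}
for name, percent in reductions.items():
    print(f"{name}: {percent:.1f}%")
```

Under that assumption the computed values agree with the stated 13%, 56%, 34%, and 51% to within about one percentage point.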
+#### Bug Fixes
+- Fixes a problem in TRT int8 off-line quantization in the inference phase: unstable fusion strategies made the calibration table names generated locally and on the server inconsistent, so a locally generated calibration table could not be identified and loaded in the service and was regenerated. The calibration table name now stays consistent across multiple runs of TRT off-line quantization calibration.
+- Fix the problem of parameter transmission error during the generation of a calibration table in the TRT off-line quantization in the Inference phase. This problem will affect the final quantitative inference precision to some extent.
+### Paddle Serving
+#### Improved Usability
+- Uses pybind to encapsulate the c++ code. Provides a python API. Provides python2 and python3 whl installation packages for paddle\_serving\_server, paddle\_serving\_server\_gpu, and paddle\_serving\_client. Releases Version 0.2.1
+- Provides cpu and gpu Docker images in the centos6/7 environment, including executable images and compilable images
+- Provides an API to directly save the models and configuration files required for Serving deployment. Seamlessly connects the Paddle training framework
+- Starts the model inference service with a single command
+#### Function Improvement
+- Provides RPC and HTTP inference service methods
+- Supports Python and Go language clients
+- Supports the A/B test
+- Releases Paddle\_serving\_app Version 0.0.2. Provides preprocessing APIs for LAC words segmentation preprocessing, Chinese BERT model preprocessing, and image processing
+- Supports the timeline visualization of the inference service
+#### Performance Optimization
+- In RPC service mode, the inference speed of the Chinese BERT semantic vector representation inference service is increased by 2.04 times compared with paddle\_gpu\_serving Version 0.8.2 under the conditions of a single P4 card and batch size 1.
+#### Documents and Examples
+- Improves and adds Chinese and English operating documents, Chinese and English development and deployment documents, and Chinese performance tuning documents.
+- Provides seven types of model inference service examples, covering Chinese word segmentation, English sentiment analysis, Chinese semantic vector representation, CTR estimation, image classification, and other fields.
+### Paddle Lite
+#### Function Upgrade
+- Compilation and Installation
+  - Optimization of Paddle-Lite compilation scripts: Splits the compilation scripts into separate scripts for the Android/iOS/ArmLinux platforms to improve the script usability.
+  - Support for Python installation: The Paddle-Lite Python library can be installed on PC Linux/Windows/Mac. Python can call the Lite opt optimization model.
+  - Support for Windows compilation: Paddle-Lite can be compiled in the Windows environment. Currently, the Windows environment supports only the x86 compilation.
+- Basic Functions
+  - Adds the subgraph partition function. For models that Lite runs through subgraph access, a subgraph can be partitioned manually through a configuration file so that specified OPs run on the host, which improves performance (ssd\_mobilenet\_v1 is accelerated by about 4.3 times).
+  - Optimizes the support for quantitative models generated using the uncalibrated post-training quantization method. Quantizes common classification models to 8bit. Decreases the precision loss to 0.1% from 2%.
+- Hardware Support
+  - Adds RK 1808 NPU to support the fully quantitative MobileNetV1 model.
+  - Adds MTK MT8175 APU to support the fully quantitative MobileNetV1 model.
+  - Adds a method of access to Baidu Kunlun XPU Kernel to support ERNIE, ResNet-50, and BERT models.
+  - Adds Cambricon MLU270 to support the following models: Resnet50 (int8) and Senet101 (int8).
+  - Adds Bitmain BM1682 to support the following models: Mobilenet, Yolov3, Mobilenet-ssd, Inceptionv4, Vgg16, DarkNet-YOLOv3, and PyramidBox.
+  - Mobile GPU (opencl): Supports mobilenetv1/v2, GAN-related models, mnasnet, squeezenet, shufflenet, resnet, unet, and vgg16 models
+  - Nvidia GPU: Adds exec multi-stream support. For model structures with parallelism, performance is expected to improve by 5-15% compared with a single stream. Common vision models generally have no parallel structure and gain nothing from enabling multi-stream. On the cuda platform, multi-stream is enabled with `config.set_multi_stream(true);`.
+  - Optimization of the x86 platform: Reduces the size of the inference library (from 200M to 16M), supports LOG shutdown (--shutdown\_log=ON), supports sharing model weight parameters across threads in full\_api, and adds x86 cxx\_demo
+  - Huawei NPU:
+    - Increases the speed of Benchmark models (mobilenet\_v1, mobilenet\_v2, squeezenet\_v1.1, mnasnet, and shufflenet\_v2) by 5-10 times.
+    - Supports caching different sizes of NPU models to improve the performance of models with a variable input size.
+- Demo:
+  - Adds an Android Demo for real-time mask detection based on camera preview
+  - Adds an Android Demo for real-time face key point detection and beautification
+  - Adds an Android Demo for Boston house price prediction with on-device training
+#### Performance Optimization
+- Reduction in time consumption of InferShape: When the predictor continuously runs, the total time consumption of infershape is reduced to 0.08 ms from 0.25 ms.
+- The kernel of the opencl part supports dynamic shape and transfers partial computation to ReinitWhenNeeded for fc\_buffer, elementwise\_add, scale, activation, and grid\_sampler.
+- Optimizes the sgemm performance on the low-end machine.
+- Optimizes the Precision Profiler function. Optimizes the type setting. Adds the support for the standard deviation and growth rate attributes (a sequence can be compared when the mean value is the same as the standard deviation). Supports the precision printing of output of the OpenCL image/buffer at every layer. Supports writing precision results (final precision summary) at every layer to mobile devices to facilitate APP debugging. Separates precision printing from dependency on the original profiler for time consumption statistics.
+#### Bug Fixes
+- Fix the problem that the predictor results are random because the act\_type of the conv op is not initialized.
+- Fix opencl kernel problems: the bilinear kernel compatibility problem on the mali gpu, incorrect computation results of the instance norm, a kernel registration error for reshape that caused model transformation to fail, and missing kernels for exp and tanh that caused kernel name errors and model op binding failures.
+- Fix the problem that the opencl gets stuck at the end of the mali gpu's computation.
+- Fix the opencl resource-related problem. Isolates every cl::kernel/cl::program and other resources in every predictor.
+### PaddleSlim
+#### Quantization
+- Adds a post-training quantization method without calibration data. The int16 precision is lossless. The int8 precision loss is smaller than 0.1%.
+- Enhances the quantization function, improves the output scale information of the quantization OP, and supports the CPU inference-side comprehensive adaptive quantitative model.
+#### Pruning
+- Adds two pruning strategies including FPGM and BN scale. Improves the precision by 0.6% at the same compressibility on the MobileNetV3-YOLOV3-COCO task.
+- Adds a user-defined pruning strategy API to facilitate developers to quickly add compression strategies.
+- Adds the default processing logic of added operators in the pruning function and extends the support for pruning more complex networks.
+#### NAS
+- Adds the DARTS series of search algorithms and provides an extended interface to facilitate users to investigate and implement a new model structure search strategy.
+- Adds an early stop mechanism in the model structure to improve the usability of the search function.
+- Adds a model structure search strategy based on reinforcement learning and provides an extended interface as a reference for users who investigate and implement new strategies.
+#### Pantheon
+- Supports data transmission and storage in fp16 format. Supports knowledge transmission with multiple channels in online distillation mode. Increases the transmission efficiency of knowledge data.
+- Adds lexical analysis examples to facilitate users to build their own distillation tasks based on the examples
+## Development Kits
+### PaddleDetection
+- Enhancement of the Richness of Models
+  - Adds Efficientdet-D0: The COCO val2017 precision is 0.3 higher than the TF precision (33.8 vs 33.5). Excluding postprocessing, the inference speed is roughly equal or slightly faster (about 13 ms vs about 14 ms, measured on T4).
+  - Adds the instance segmentation model HTC. The inference speed under the V100 is 11.5 FPS which is 7.4 FPS higher than that of the competing product. The BBox mAP under COCO 2017 is 42.1% and the Mask mAP is 37.1%.
+  - Adds the anchor-free model FCOS:  The COCO val2017 precision is 1.1 higher than the pytorch precision (39.8 vs 38.7).
+  - Adds a MobileNetV3 backbone network in YOLOv3. The COCO dataset  precision is 31.6.
+  - Adds the anchor-free model CornernetSqueeze: The COCO val2017 precision is 34.5 which is 0.1 higher than that of the competing product. The precision of the optimized model is 38.2, up 3.7 points. The speed is 5% faster than that of yolo\_v3 darknet
+  - Adds the server-side practical object detection model cascade\_rcnn\_resnet50\_vd\_fpn\_dcn. The COCO mAP is 47.8% at 20 FPS on V100, better than that of the competing product EfficientDet.
+- Launch of Three Mobile Models
+  - SSDLite series of models: For the ssdlite-mobilenet\_v3 large model, the mAP on the COCO dataset is 22.8% and the single-thread inference speed on the Qualcomm Snapdragon 845 is 95 ms. For the ssdlite-mobilenet\_v3 small model, the mAP on the COCO dataset is 16.6%, the single-thread inference speed on the Qualcomm Snapdragon 845 is 40 ms, and the precision is better than that of the competing product. For the ssdlite-mobilenet\_v1 model, the mAP on the COCO dataset is 23.6%, the single-thread inference speed on the Qualcomm Snapdragon 845 is 140 ms, and the precision is better than that of the competing product.
+  - yolo v3: For the pruned yolov3\_mobilenet\_v3 model, the single-thread inference speed on the Qualcomm Snapdragon 845 is 91 ms and the precision is 24.6 (input dimensions 320 \* 320). Both speed and precision are better than those of the competing framework's SSDLite model.
+  - Faster RCNN: On the COCO dataset, the mAP of cascade\_rcnn\_mobilenet\_v3 large\_fpn is 25.0% and the single-thread inference speed on the Qualcomm Snapdragon 845 is 87 ms at input image dimensions 320 x 320; the mAP is 30.2% and the single-thread inference speed on the Qualcomm Snapdragon 845 is 351 ms at input image dimensions 640 x 640.
+- Inference Deployment Refactoring:
+  - Adds a Python inference deployment process and supports the RCNN, YOLO, SSD, RetinaNet, and face series of models. Supports video inference.
+  - Refactors C++ inference deployment and improves the usability.
+- Usability Improvement and Function Components
+  - Adds AutoAugment data augmentation.
+  - Upgrades the PaddleDetection library documentation structure.
+  - Supports automatic shape matching in transfer learning.
+  - Optimizes the memory usage in the mask branch evaluation phase.
+  - Upgrades the inference deployment function and adds Python image and video inference.
+### PaddleSeg
+- Adds the Lovasz Loss function to effectively improve the precision of multi-class segmentation
+- Fully upgrades the portrait segmentation model series
+  * Releases the first mobile real-time portrait segmentation model HumanSeg-lite
+  * Adds a video-level segmentation postprocessing solution based on an optical flow algorithm
+- Adds a remote sensing image segmentation solution
+  * Adds a data pre-processing solution for multi-channel remote sensing images
+  * Adds a data augmentation strategy for multi-channel images
+  * Provides a tutorial on segmentation in two meteorological remote sensing fields: snow detection and cloud detection
+### PaddleClas
+- Adds the MobileNetV3 series of models and evaluates the performance of 117 pre-trained models across 23 series.
+- Adds an SSLD knowledge distillation solution, improves the recognition accuracy by more than 3%, and releases six distillation models including resnet50\_vd (82.4%) and mobilenetv3 (78.9%).
+- Adds eight data augmentation modes, namely AutoAugment, RandAugment, CutOut, RandErasing, HideAndSeek, GridMask, Mixup, and Cutmix, which increase the diversity of training samples and improve the generalization performance of the models.
+- Adds 100,000-class image classification pre-trained models and improves the recognition accuracy by up to 30% in image classification business scenarios.
+### PaddleOCR
+- Adds DB and EAST text detection algorithms.
+- Adds Rosetta, CRNN, STAR-Net, and RARE text recognition algorithms.
+- Adds an ultra-lightweight OCR model with a total size of only 8.6M (4.1M for text detection and 4.5M for text recognition). Supports text recognition in scenarios such as horizontal and vertical layouts, long text, and mixed Chinese, English, and numeric text.
+### Parakeet
+- Releases English pre-trained models and audio samples of WaveFlow (res channel=64/128), ClariNet, WaveNet, and other models.
+- Fixes the slow Conv2DTranspose fp16 kernel and simplifies the WaveFlow inference logic in fp16 mode.
+- Increases the model training speed significantly. Doubles the speed in DeepVoice3, TransformerTTS, and other models by optimizing the data preprocessing and OP calculation logic.
+## Utility Components
+### PaddleHub
+* Enriches the vision model collection. The total number of pre-trained models exceeds 120.
+  * Adds large-scale vision pre-trained models that greatly improve the fine-tuning results of image classification and object detection tasks
+  * Adds the industrial short video classification model VideoTag and supports the recognition of more than 3000 types of Chinese tags
+  * Adds a lightweight Chinese OCR model and supports one-click OCR recognition
+  * Adds pedestrian detection, vehicle detection, animal recognition, and Object365 2019 large-scale object detection models, which won first prize in the detection contest.
+* Fine-tune API Upgrade
+  * Adds five predefined networks for the text classification tasks, including CNN, BOW, LSTM, BiLSTM and DPCNN.
+### PaddleX
+* Releases PaddleX 1.0, the Entire Process Development Toolkit
+  - Covers the entire deep learning development process from data access to inference deployment and provides an easy-to-use Python API
+  - Covers the four mainstream CV task scenarios: image classification, object detection, semantic segmentation, and instance segmentation. Integrates utility components such as PaddleHub, PaddleSlim, and VisualDL.
+  - Presets 43 models of 26 types, including pre-trained models refined through industrial practice and a number of distinctive Paddle models.
+  - Provides advanced functions such as automatic data analysis, automatic hyper-parameter recommendation, data augmentation strategy, model pruning training, model quantization, pre-trained model saving and reuse, multi-platform deployment, model interpretation, and encryption.
+  - Innovatively integrates the model explainability analysis function
+  - Provides an officially implemented GUI and supports one-click installation on Windows, Linux, and Mac systems.
+### VisualDL
+* Releases VisualDL Version 2.0 beta
+  - Upgrades the back-end kernel to be lighter, faster, and more compatible, with support for extending file storage systems
+  - Fully upgrades APIs so that visual analysis requires less code, significantly improving the usability
+  - Upgrades UI and interaction, provides better localization support, achieves clearer and more intuitive visual analysis, and gives users an immersive experience
+  - Deeply integrates with Paddle development kits and utility components and provides a smoother deep learning development experience
+### PaddleFL
+- Releases PaddleFL Version 1.0
+  - Open sources federated learning based on multi-party computation (MPC) to support horizontal, vertical, and other federated learning scenarios
+  - Refactors the original framework to integrate and open source the new and original federated learning solutions
+  - Adds the function of converting a single-machine model into an FL trainable program to support more models and scenarios
+### PGL
+* Releases the industry's first graph neural network model ERNIESage, which combines semantic information with structural information
+* Adds PGL-KE. Currently, PGL covers 25+ graph learning models in the walk-based, message-passing, and knowledge embedding categories
+* Adds graph batch, graph pooling, and other graph operators
+* Fully supports the Open Graph Benchmark test sets and releases the corresponding SOTA results
+* Adds MetaPath2Vec++, Multi-MetaPath2Vec++, STGCN, GIN, and PinSage models in the model zoo
+### PARL
+* Open sources the industry's first evolutionary learning application framework EvoKit
+* Adds support for multi-agent RL algorithms, including MADDPG
+* Adds support for multi-card training and releases an example of a multi-card DQN algorithm
+* Open sources SOTA algorithms TD3 and SAC in the continuous control field
+* Open sources the NeurIPS2019 reinforcement learning challenge champion model and training solution
+### Paddle Quantum (Quantum Computation Laboratory)
+* First release of Paddle Quantum. Paddle Quantum is a quantum machine learning tool set developed based on Baidu Paddle. It supports the setup and training of the quantum neural network and provides easy-to-use quantum machine learning development kits and cutting-edge quantum application tool sets such as quantum optimization and quantum chemistry, making Paddle the first deep learning platform supporting quantum machine learning in China.
+  - Supports a QAOA algorithm to solve the max-cut problem
+  - Supports a VQE algorithm to calculate the minimum eigenvalue of H\_2
+  - Supports an SSVQE algorithm to calculate the eigenspectrum of a given Hamiltonian
+  - Supports a VQSD algorithm to calculate the diagonalized form of the quantum state and give the eigendecomposition of the quantum state
+  - Supports a Gibbs algorithm to generate the Gibbs state of a given Hamiltonian at a certain temperature
+  - Supports common functions in quantum computation
+  - Supports the description of the U\_Ansatz quantum circuit