Add post training quantization docs and demo (#3948)

* add post training quantization demo, test=develop
* update, test=develop

c08160b8 · juncaipeng · GitHub · 85ba4e7f · c08160b8 · c08160b8
8 changed files
--- a/PaddleSlim/docs/tutorial.md
+++ b/PaddleSlim/docs/tutorial.md
@@ -44,7 +44,7 @@
 <strong>Table 2: Accuracy comparison before and after model quantization</strong>
 </p>
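-Currently, academia divides quantization into two main categories: `Post Training Quantization` and `Quantization Aware Training`. `Post Training Quantization` refers to fixed-point quantization methods that determine the quantization parameters with techniques such as KL divergence or moving averages and require no retraining. `Quantization Aware Training` models quantization during training to determine the quantization parameters, and it can provide higher prediction accuracy than `Post Training Quantization`. This document focuses on the `Quantization Aware Training` mode.
+Currently, academia divides quantization into two main categories: `Post Training Quantization` and `Quantization Aware Training`. `Post Training Quantization` refers to fixed-point quantization methods that determine the quantization parameters with techniques such as KL divergence or moving averages and require no retraining. `Quantization Aware Training` models quantization during training to determine the quantization parameters, and it can provide higher prediction accuracy than `Post Training Quantization`.
 ### 1.2 Quantization principles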
@@ -59,7 +59,7 @@ $$ M = max(abs(x)) $$ $$ q = \left \lfloor \frac{x}{M} * (n - 1) \right \rceil $
 $q = scale * r + b$
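 Here, `min-max` and `max-abs` are referred to as the quantization parameters, the quantization scale, or the quantization range.
-#### 1.2.2 Quantization-aware training framework
+#### 1.2.2 Quantization-aware training
 ##### 1.2.2.1 Forward pass
 The forward pass uses simulated quantization, described as follows: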
@@ -104,7 +104,7 @@ $$
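 Therefore, the quantization pass also changes some inputs of the corresponding backward operators.
-#### 1.2.3 Determining the quantization parameters
+##### 1.2.2.3 Determining the quantization scale factor
 There are two strategies for computing the quantization scale factor: dynamic and static. The dynamic strategy recomputes the scale factor in every iteration. The static strategy uses the same scale factor for all inputs.
 For weights, the dynamic strategy is used during training. In other words, the scale factor is recomputed in every iteration until training ends.
 For activations, either the dynamic or the static strategy can be chosen. With the static strategy, the scale factor is estimated during training and then used during inference (it stays the same for all inputs). Under the static strategy, the scale factor can be estimated during training in the following three ways: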
@@ -119,7 +119,18 @@ $$ Vt = (1 - k) * V + k * V_{t-1} $$
 In this formula, $V$ is the maximum absolute value of the current batch, $Vt$ is the moving average, and $k$ is a factor, for example 0.9.
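The moving-average update can be sketched in a few lines of NumPy. This is only an illustration, not PaddlePaddle's implementation: the function name `moving_average_abs_max` is hypothetical, and starting the average from the first batch's value is an assumption, since the formula does not say how the first value is initialized.

```python
import numpy as np


def moving_average_abs_max(batches, k=0.9):
    """Apply Vt = (1 - k) * V + k * V_{t-1} over a sequence of batches.

    Assumption: the average starts at the first batch's abs max.
    """
    v_t = None
    for batch in batches:
        v = float(np.max(np.abs(batch)))  # V: abs max of the current batch
        v_t = v if v_t is None else (1 - k) * v + k * v_t
    return v_t


batches = [np.array([0.5, -1.0]), np.array([2.0, -0.1])]
scale = moving_average_abs_max(batches, k=0.9)
```

With `k` close to 1 the estimate changes slowly, so a single batch with an unusually large value has limited influence on the final scale.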
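+#### 1.2.4 Post-training quantization
+Post-training quantization computes the quantization scale factors from sampled data, using methods such as KL divergence. Compared with quantization-aware training, it requires no retraining and produces a quantized model quickly.
+The goal of post-training quantization is to obtain the quantization scale factors. There are two main methods: non-saturating quantization (No Saturation) and saturating quantization (Saturation). The non-saturating method computes `abs_max`, the maximum absolute value in the FP32 Tensor, and maps it to 127, so the scale factor equals `abs_max/127`. The saturating method uses KL divergence to compute a suitable threshold `T` (`0<T<abs_max`) and maps it to 127, so the scale factor equals `T/127`. In general, the non-saturating method is used for the weight Tensors of the ops to be quantized, and the saturating method for their activation Tensors (both inputs and outputs).
+Post-training quantization is implemented in the following steps:
+* Load the pretrained FP32 model and configure the `DataLoader`;
+* Read the sample data, run forward inference, and save the values of the activation Tensors of the ops to be quantized;
+* From the sampled activation Tensor data, compute each activation's scale factor with the saturating method;
+* The weight Tensor data stays unchanged; use the non-saturating method to compute the maximum absolute value of each channel as that channel's quantization scale factor;
+* Convert the FP32 model to an INT8 model and save it.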
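The two ways of obtaining a scale factor can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions, not the code PaddlePaddle runs: all function names are hypothetical, the histogram sizes (`num_bins`, `num_quant_bins`) are arbitrary choices, and the KL search is a generic calibration-style search that may also return `T` equal to `abs_max`. Scale factors follow the `abs_max/127` and `T/127` definitions of section 1.2.4.

```python
import numpy as np


def channel_wise_abs_max_scales(weight, num_levels=127):
    """Non-saturating: one scale per output channel (axis 0)."""
    flat = np.abs(weight).reshape(weight.shape[0], -1)
    return flat.max(axis=1) / num_levels


def kl_divergence(p, q):
    """KL(p || q) over bins where p > 0; q is floored to avoid log(0)."""
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-12))))


def kl_threshold(activations, num_bins=2048, num_quant_bins=128):
    """Search a clipping threshold T in (0, abs_max] that minimizes KL."""
    abs_vals = np.abs(activations).ravel()
    abs_max = float(abs_vals.max())
    if abs_max == 0.0:
        return 0.0
    hist, edges = np.histogram(abs_vals, bins=num_bins, range=(0.0, abs_max))
    hist = hist.astype(np.float64)
    best_kl, best_t = np.inf, abs_max
    for i in range(num_quant_bins, num_bins + 1):
        # Reference: first i bins, clipped outliers folded into the last bin.
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()
        # Candidate: merge the i bins into num_quant_bins groups, then spread
        # each group's mass evenly over its non-empty source bins.
        q = np.zeros(i)
        for idx in np.array_split(np.arange(i), num_quant_bins):
            chunk = hist[idx]
            nonzero = chunk > 0
            if nonzero.any():
                q[idx[nonzero]] = chunk.sum() / nonzero.sum()
        if q.sum() == 0:
            continue
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, float(edges[i])
    return best_t


rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3, 3, 3))   # hypothetical conv weight
acts = rng.normal(size=10000)             # hypothetical sampled activations
weight_scales = channel_wise_abs_max_scales(weights)
act_scale = kl_threshold(acts, num_bins=512) / 127
```

The saturating scale is never larger than the non-saturating one for the same tensor, because the threshold is searched at or below `abs_max`.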
 ## 2. Convolution kernel pruning principles

--- a/PaddleSlim/quant_low_level_api/README.md
+++ b/PaddleSlim/quant_low_level_api/README.md
 <div align="center">
  <h3>
-    <a href="../docs/tutorial.md">
+    <a href="./README.md">
-      Algorithm principles
+      Model quantization overview
    </a>
    <span> | </span>
-    <a href="../docs/usage.md">
+    <a href="../docs/tutorial.md">
-      Usage documentation
+      Model quantization principles
    </a>
    <span> | </span>
-    <a href="../docs/demo.md">
+    <a href="./quantization_aware_training.md">
-      Example documentation
+      Quantization-aware training usage and examples
    </a>
    <span> | </span>
-    <a href="../docs/model_zoo.md">
+    <a href="./post_training_quantization.md">
-      Model Zoo
+      Post-training quantization usage and examples
    </a>
  </h3>
 </div>
 ---
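-# Quantization-Aware Training Low-Level API Usage Examples
+Model quantization represents the weights and activations of a neural network with fewer bits, which speeds up inference, reduces storage size, and lowers power consumption.
-## Table of Contents
- [Introduction to the quantization-aware training Low-Level APIs](#1-量化训练low-level-apis介绍)
- [Quantization-aware training with the Low-Level API](#2-基于low-level-api的量化训练)
-## 1. Introduction to the quantization-aware training Low-Level APIs
-The quantization-aware training Low-Level APIs mainly involve four IrPasses in the PaddlePaddle framework: `QuantizationTransformPass`, `QuantizationFreezePass`, `ConvertToInt8Pass`, and `TransformForMobilePass`. Their functions are described below:
-* `QuantizationTransformPass`: inserts consecutive quantize and dequantize ops before each input of operators such as `conv2d`, `depthwise_conv2d`, and `mul` in the IrGraph, and changes some inputs of the corresponding backward operators, as shown in Figure 1:
-<p align="center">
-<img src="../docs/images/usage/TransformPass.png" height=400 width=520 hspace='10'/> <br />
-<strong>Figure 1: Result of applying QuantizationTransformPass</strong>
-</p>
-* `QuantizationFreezePass`: mainly changes the order of the quantize and dequantize ops in the IrGraph, that is, it rearranges the layout in Figure 1 into the layout in Figure 2. In addition, it quantizes the weights of operators such as `conv2d`, `depthwise_conv2d`, and `mul` offline into the int8_t value range (the data type remains float32), which reduces weight quantization work during inference, as shown in Figure 2:
-<p align="center">
-<img src="../docs/images/usage/FreezePass.png" height=400 width=420 hspace='10'/> <br />
-<strong>Figure 2: Result of applying QuantizationFreezePass</strong>
-</p>
-* `ConvertToInt8Pass`: must run after QuantizationFreezePass. Its main purpose is to change the data type of the weights output by QuantizationFreezePass from `FP32` to `INT8`. In other words, users can save the quantized weights as float32 (skip ConvertToInt8Pass) or as int8_t (run ConvertToInt8Pass), as shown in Figure 3:
-<p align="center">
-<img src="../docs/images/usage/ConvertToInt8Pass.png" height=400 width=400 hspace='10'/> <br />
-<strong>Figure 3: Result of applying ConvertToInt8Pass</strong>
-</p>
-* `TransformForMobilePass`: after this pass, users obtain a quantized model compatible with the [paddle-mobile](https://github.com/PaddlePaddle/paddle-mobile) mobile inference library. In paddle-mobile, the quantize and dequantize ops are named `quantize` and `dequantize`. The `quantize` operator is similar in function to the `fake_quantize_abs_max` family of operators in PaddlePaddle, and the `dequantize` operator has the same function as the `fake_dequantize_max_abs` family. To run a model produced by quantization-aware training with paddle-mobile, operators such as `fake_quantize_abs_max` must be replaced with `quantize`, and operators such as `fake_dequantize_max_abs` with `dequantize`, as shown in Figure 4:
-<p align="center">
-<img src="../docs/images/usage/TransformForMobilePass.png" height=400 width=400 hspace='10'/> <br />
-<strong>Figure 4: Result of applying TransformForMobilePass</strong>
-</p>
-## 2. Quantization-aware training with the Low-Level API
-This section uses ResNet50 and MobileNetV1 as examples to show how to use the PaddlePaddle quantization-aware training Low-Level API:
-1) Run the following command to clone the [Paddle models repo](https://github.com/PaddlePaddle/models):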
-```bash
-git clone https://github.com/PaddlePaddle/models.git
-```
-2) Prepare the dataset (both training and validation sets). Taking ILSVRC2012 as an example, the dataset should have the following structure:
-```bash
-data
-└──ILSVRC2012
-        ├── train
-        ├── train_list.txt
-        ├── val
-        └── val_list.txt
-```
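-3) Change to the `models/PaddleSlim/quant_low_level_api` directory and edit `run_quant.sh` so that **data_dir** points to the dataset prepared in step 2). Then run `run_quant.sh` to start quantization-aware training.
-### 2.1 Summary of using the quantization-aware training Low-Level API:
-* Based on [quant.py](quant.py), the quantization-aware training Low-Level API is used as follows: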
-```python
-#startup_program = fluid.Program()
-#train_program = fluid.Program()
-#train_cost = build_program(
-#    main_prog=train_program,
-#    startup_prog=startup_program,
-#    is_train=True)
-#build_program(
-#    main_prog=test_program,
-#    startup_prog=startup_program,
-#    is_train=False)
-#test_program = test_program.clone(for_test=True)
-# The above pseudo code is used to build up the model.
-# ---------------------------------------------------------------------------------
-# The following code are part of Quantization Aware Training logic:
-# 0) Convert Programs to IrGraphs.
-main_graph = IrGraph(core.Graph(train_program.desc), for_test=False)
-test_graph = IrGraph(core.Graph(test_program.desc), for_test=True)
-# 1) Make some quantization transforms in the graph before training and testing.
-# According to the weight and activation quantization type, the graph will be added
-# some fake quantize operators and fake dequantize operators.
-transform_pass = QuantizationTransformPass(
-        scope=fluid.global_scope(), place=place,
-        activation_quantize_type=activation_quant_type,
-        weight_quantize_type=weight_quant_type)
-transform_pass.apply(main_graph)
-transform_pass.apply(test_graph)
-# Compile the train_graph for training.
-binary = fluid.CompiledProgram(main_graph.graph).with_data_parallel(
-    loss_name=train_cost.name, build_strategy=build_strategy)
-# Convert the transformed test_graph to test program for testing.
-test_prog = test_graph.to_program()
-# For training
-exe.run(binary, fetch_list=train_fetch_list)
-# For testing
-exe.run(program=test_prog, fetch_list=test_fetch_list)
-# 2) Freeze the graph after training by adjusting the quantize
-# operators' order for the inference.
-freeze_pass = QuantizationFreezePass(
-    scope=fluid.global_scope(),
-    place=place,
-    weight_quantize_type=weight_quant_type)
-freeze_pass.apply(test_graph)
-# 3) Convert the weights into int8_t type.
-# [This step is optional.]
-convert_int8_pass = ConvertToInt8Pass(scope=fluid.global_scope(), place=place)
-convert_int8_pass.apply(test_graph)
-# 4) Convert the freezed graph for paddle-mobile execution.
-# [This step is optional. But, if you execute this step, you must execute the step 3).]
-mobile_pass = TransformForMobilePass()
-mobile_pass.apply(test_graph)
-```
-* Options configured in the [run_quant.sh](run_quant.sh) script:
-```bash
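-   --model: the model to train with quantization, e.g. MobileNet or ResNet50.
-   --pretrained_fp32_model: path to the pretrained float32 model parameters.
-   --checkpoint: checkpoint path for resuming training. If a checkpoint path is given, pretrained_fp32_model should not be set.
-   --use_gpu: whether to train on GPU.
-   --data_dir: location of the training and validation datasets.
-   --batch_size: training batch size.
-   --total_images: total number of training images.
-   --class_dim: total number of classes.
-   --image_shape: image dimensions.
-   --model_save_dir: path where the model is saved.
-   --lr_strategy: learning rate decay strategy.
-   --num_epochs: total number of training epochs.
-   --lr: initial learning rate; when fine-tuning from pretrained parameters, a small initial learning rate is usually used.
-   --act_quant_type: activation quantization type; one of abs_max, moving_average_abs_max, range_abs_max.
-   --wt_quant_type: weight quantization type; one of abs_max, channel_wise_abs_max.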
-```
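-> **Note:** After quantization-aware training finishes, the model save path contains three directories: float, int8, and mobile. The models saved in each are described below:
+Currently, model quantization is mainly divided into quantization-aware training (Quantization Aware Training) and post-training quantization (Post Training Quantization). Quantization-aware training models quantization during training to determine the quantization parameters, and has the advantage of providing higher accuracy for complex models. Post-training quantization computes the quantization scale factors from sampled data using methods such as KL divergence. It has the advantages of requiring no retraining and producing a quantized model quickly.
-> - **float directory:** quantized models whose parameters are in the int8 range but stored as float32.
-> - **int8 directory:** quantized models whose parameters are in the int8 range and stored as int8.
-> - **mobile directory:** quantized models with the same parameter properties as the int8 directory that are also compatible with [paddle-mobile](https://github.com/PaddlePaddle/paddle-mobile).
->
-> **Caution:** Currently, the PaddlePaddle framework on the server side only supports inference with the quantized models in the float directory.
+For the principles of model quantization and the usage of the Low-Level API, see the following documents:
+* [Model quantization principles](../docs/tutorial.md)
+* [Quantization-aware training Low-Level API usage and examples](./quantization_aware_training.md)
+* [Post-training quantization Low-Level API usage and examples](./post_training_quantization.md)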
--- a/PaddleSlim/quant_low_level_api/post_training_quantization.md
+++ b/PaddleSlim/quant_low_level_api/post_training_quantization.md
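+<div align="center">
+  <h3>
+    <a href="./README.md">
+      Model quantization overview
+    </a>
+    <span> | </span>
+    <a href="../docs/tutorial.md">
+      Model quantization principles
+    </a>
+    <span> | </span>
+    <a href="./quantization_aware_training.md">
+      Quantization-aware training usage and examples
+    </a>
+    <span> | </span>
+    <a href="./post_training_quantization.md">
+      Post-training quantization usage and examples
+    </a>
+  </h3>
+</div>
+---
+# Post-Training Quantization Low-Level API Usage and Examples
+## Table of Contents
+- [How to use post-training quantization](#1-训练后量化使用说明)
+- [Post-training quantization example](#2-训练后量化使用示例)
+## 1. How to use post-training quantization
+1) **Prepare the model and calibration data**
+First, prepare a trained FP32 inference model, that is, a model saved by `save_inference_model()`. Post-training quantization runs forward computation on calibration data, so a calibration dataset is also required. The calibration set should be a representative part of the test set (or training set), for example a random subset, so that more accurate quantization scale factors can be computed. The number of samples can be 100 to 500; the more samples are used, the more accurate the computed scale factors.
+2) **Configure the calibration data generator**
+Post-training quantization reads the calibration data asynchronously, so users only need to configure a sample_generator that matches the model's inputs. The sample_generator is a Python generator used as the data source for `DataLoader.set_sample_generator()`, and it **must return a single sample each time**. See the official documentation on [asynchronous data reading](https://www.paddlepaddle.org.cn/documentation/docs/zh/user_guides/howto/prepare_data/use_py_reader.html).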
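A minimal sketch of such a generator is shown here. It is not the reader used in the demo: `make_sample_generator` is a hypothetical helper, the in-memory arrays stand in for real image loading, and it assumes the generator is handed over as a callable that returns an iterator, as `reader.val(...)` appears to be in the example script.

```python
import numpy as np


def make_sample_generator(images, labels):
    """Return a callable that yields one (image, label) sample at a time."""

    def sample_generator():
        for image, label in zip(images, labels):
            # One sample per step, never a batch: batching is left to the
            # data loader that consumes this generator.
            yield image.astype("float32"), int(label)

    return sample_generator


# Hypothetical calibration data: 5 tiny "images" with integer labels.
images = np.zeros((5, 3, 8, 8))
labels = [0, 1, 2, 3, 4]
sample_generator = make_sample_generator(images, labels)
first_image, first_label = next(sample_generator())
```

Yielding single samples keeps the generator independent of `batch_size`, which is configured separately on the quantization API.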
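+3) **Run post-training quantization**
+Install PaddlePaddle on the machine, then call PostTrainingQuantization to perform post-training quantization. The API is described in detail below.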
+``` python
+class PostTrainingQuantization(
+    executor,
+    sample_generator,
+    model_dir,
+    model_filename=None,
+    params_filename=None,
+    batch_size=10,
+    batch_nums=None,
+    scope=None,
+    algo="KL",
+    quantizable_op_type=["conv2d", "depthwise_conv2d", "mul"],
+    is_full_quantize=False)
+```
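+Call the API above and pass in the parameters required for post-training quantization. Parameter descriptions:
+* executor: the executor that runs the model; it can run on CPU or GPU.
+* sample_generator: the calibration data generator configured in step 2.
+* model_dir: the path of the model to be quantized, which contains the model file and the weight files.
+* model_filename: the model file name of the model to be quantized. If the model file is not named `__model__`, set it with model_filename.
+* params_filename: the weight file name of the model to be quantized. If all weights are saved in a single file, set it with params_filename.
+* batch_size: the number of samples read at a time.
+* batch_nums: the number of times sample data is read. If None, all samples from sample_generator are used for post-training quantization; otherwise, `batch_size*batch_nums` samples are read from sample_generator.
+* scope: the scope used when running the model. The default is None, in which case global_scope() is used.
+* algo: the method for computing the scale factors of the activation Tensors to be quantized. `KL` uses the KL divergence method and `direct` uses the abs max method. The default is `KL`.
+* quantizable_op_type: the op types to quantize. The default is `["conv2d", "depthwise_conv2d", "mul"]`; the list may contain any op type that supports quantization.
+* is_full_quantize: whether to perform full quantization. If True, all ops in the model that support quantization are quantized; if False, only the op types in `quantizable_op_type` are quantized. The op types that currently support quantization are: 'conv2d', 'depthwise_conv2d', 'mul', "pool2d", "elementwise_add", "concat", "softmax", "argmax", "transpose", "equal", "gather", "greater_equal", "greater_than", "less_equal", "less_than", "mean", "not_equal", "reshape", "reshape2", "bilinear_interp", "nearest_interp", "trilinear_interp", "slice", "squeeze", "elementwise_sub".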
+```
+PostTrainingQuantization.quantize()
+```
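+Call the interface above to start post-training quantization. The time required depends on the number of samples, the model size, and the op types being quantized. For example, post-training quantization of `MobileNetV1` with 100 images from the ImageNet2012 dataset takes about 1 minute.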
+```
+PostTrainingQuantization.save_quantized_model(save_model_path)
+```
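+Call the interface above to save the post-training quantized model; save_model_path is the path to save it to.
+## 2. Post-training quantization example
+The following uses MobileNetV1 as an example to show how to use the post-training quantization Low-Level API.
+> The code for this example is in the [models/PaddleSlim/quant_low_level_api/](https://github.com/PaddlePaddle/models/tree/develop/PaddleSlim/quant_low_level_api) directory. To run it, first clone [models](https://github.com/PaddlePaddle/models.git), then run the [run_post_training_quanzation.sh](run_post_training_quanzation.sh) script. The quantized model is saved in the `mobilenetv1_int8_model` directory.
+1) **Prepare the model and calibration data**
+Install the latest version of PaddlePaddle and prepare a trained FP32 inference model.
+Prepare the calibration data with the file structure below. The val folder contains 100 images, and val_list.txt contains the image labels.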
+```bash
+samples_100
+└──val
+└──val_list.txt
+```
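+2) **Configure the calibration data generator**
+The inputs of MobileNetV1 are an image and a label, so the sample_generator that reads the calibration data returns one image and one label each time. See [models/PaddleSlim/reader.py](https://github.com/PaddlePaddle/models/blob/develop/PaddleSlim/reader.py) for the full code.
+3) **Run post-training quantization**
+The core code for running post-training quantization is shown below; see [post_training_quantization.py](post_training_quantization.py) for the full code.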
+``` python
+place = fluid.CUDAPlace(0) if args.use_gpu == "True" else fluid.CPUPlace()
+exe = fluid.Executor(place)
+sample_generator = reader.val(data_dir=args.data_path)
+ptq = PostTrainingQuantization(
+    executor=exe,
+    sample_generator=sample_generator,
+    model_dir=args.model_dir,
+    model_filename=args.model_filename,
+    params_filename=args.params_filename,
+    batch_size=args.batch_size,
+    batch_nums=args.batch_nums,
+    algo=args.algo,
+    is_full_quantize=args.is_full_quantize == "True")
+quantized_program = ptq.quantize()
+ptq.save_quantized_model(args.save_model_path)
+```
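+4) **Test the accuracy of the post-training quantized model**
+Using 100 images from the ImageNet2012 test set as the calibration dataset, post-training quantization is applied to `conv2d`, `depthwise_conv2d`, `mul`, `pool2d`, `elementwise_add`, and `concat`, and the result is evaluated on the ImageNet2012 validation set. The table below compares the accuracy of common classification models before and after post-training quantization.
+Model | FP32 Top1 | FP32 Top5 | INT8 Top1 | INT8 Top5 | Top1 Diff | Top5 Diff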
+-|:-:|:-:|:-:|:-:|:-:|:-:
+googlenet   | 70.50% | 89.59% | 70.12% | 89.38% | -0.38% | -0.21%
+mobilenetv1 | 70.91% | 89.54% | 70.24% | 89.03% | -0.67% | -0.51%
+mobilenetv2 | 71.90% | 90.56% | 71.36% | 90.17% | -0.54% | -0.39%
+resnet50    | 76.35% | 92.80% | 76.26% | 92.81% | -0.09% | +0.01%
+resnet101   | 77.49% | 93.57% | 75.44% | 92.56% | -2.05% | -1.01%
+vgg16       | 72.08% | 90.63% | 71.93% | 90.64% | -0.15% | +0.01%
+vgg19       | 72.56% | 90.83% | 72.55% | 90.77% | -0.01% | -0.06%
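The Diff columns are simply the INT8 accuracy minus the FP32 accuracy. The short script below recomputes them from the values in the table and finds the model with the largest Top1 drop; the variable and function names are illustrative only.

```python
# (model, FP32 Top1, FP32 Top5, INT8 Top1, INT8 Top5), copied from the table.
ptq_results = [
    ("googlenet", 70.50, 89.59, 70.12, 89.38),
    ("mobilenetv1", 70.91, 89.54, 70.24, 89.03),
    ("mobilenetv2", 71.90, 90.56, 71.36, 90.17),
    ("resnet50", 76.35, 92.80, 76.26, 92.81),
    ("resnet101", 77.49, 93.57, 75.44, 92.56),
    ("vgg16", 72.08, 90.63, 71.93, 90.64),
    ("vgg19", 72.56, 90.83, 72.55, 90.77),
]


def accuracy_diffs(rows):
    """Return {model: (top1_diff, top5_diff)} in percentage points."""
    return {
        name: (round(int8_top1 - fp32_top1, 2), round(int8_top5 - fp32_top5, 2))
        for name, fp32_top1, fp32_top5, int8_top1, int8_top5 in rows
    }


diffs = accuracy_diffs(ptq_results)
# Model whose Top1 accuracy drops the most after post-training quantization.
worst = min(diffs, key=lambda name: diffs[name][0])
print(worst, diffs[worst])
```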
--- a/PaddleSlim/quant_low_level_api/post_training_quantization.py
+++ b/PaddleSlim/quant_low_level_api/post_training_quantization.py
+import sys
+import os
+import six
+import numpy as np
+import argparse
+import paddle.fluid as fluid
+sys.path.append('..')
+import reader
+from paddle.fluid.contrib.slim.quantization import PostTrainingQuantization
+parser = argparse.ArgumentParser()
+parser.add_argument(
+    "--model_dir", type=str, default="", help="path/to/fp32_model_params")
+parser.add_argument(
+    "--data_path", type=str, default="/dataset/ILSVRC2012/", help="")
+parser.add_argument("--save_model_path", type=str, default="")
+parser.add_argument(
+    "--model_filename",
+    type=str,
+    default=None,
+    help="The name of file to load the inference program, If it is None, the default filename __model__ will be used"
+)
+parser.add_argument(
+    "--params_filename",
+    type=str,
+    default=None,
+    help="The name of file to load all parameters, If parameters were saved in separate files, set it as None"
+)
+parser.add_argument(
+    "--algo",
+    type=str,
+    default="KL",
+    help="use KL or direct method to quantize the activation tensor, set it as KL or direct"
+)
+parser.add_argument("--is_full_quantize", type=str, default="False", help="")
+parser.add_argument("--batch_size", type=int, default=10, help="")
+parser.add_argument("--batch_nums", type=int, default=10, help="")
+parser.add_argument("--use_gpu", type=str, default="False", help="")
+args = parser.parse_args()
+print("-------------------args----------------------")
+for arg, value in sorted(six.iteritems(vars(args))):
+    print("%s: %s" % (arg, value))
+print("---------------------------------------------")
+place = fluid.CUDAPlace(0) if args.use_gpu == "True" else fluid.CPUPlace()
+exe = fluid.Executor(place)
+sample_generator = reader.val(data_dir=args.data_path)
+ptq = PostTrainingQuantization(
+    executor=exe,
+    sample_generator=sample_generator,
+    model_dir=args.model_dir,
+    model_filename=args.model_filename,
+    params_filename=args.params_filename,
+    batch_size=args.batch_size,
+    batch_nums=args.batch_nums,
+    algo=args.algo,
+    is_full_quantize=args.is_full_quantize == "True")
+quantized_program = ptq.quantize()
+ptq.save_quantized_model(args.save_model_path)
+print("post training quantization finish.\n")
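The script above takes `--use_gpu` and `--is_full_quantize` as strings and compares them with `"True"`, so a value such as `true` or `1` silently counts as false. One possible alternative, shown only as a sketch and not part of this commit, is a small `argparse` type function; `str2bool` is a hypothetical helper name.

```python
import argparse


def str2bool(value):
    """Parse common textual booleans for argparse; reject anything else."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


parser = argparse.ArgumentParser()
parser.add_argument("--use_gpu", type=str2bool, default=False)
parser.add_argument("--is_full_quantize", type=str2bool, default=False)

# Parse an explicit list here so the sketch does not depend on sys.argv.
args = parser.parse_args(["--use_gpu=true", "--is_full_quantize=False"])
```

With this approach the later code can use `args.use_gpu` directly as a boolean, and a mistyped value fails at startup instead of falling back to CPU without notice.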
--- a/PaddleSlim/quant_low_level_api/quantization_aware_training.md
+++ b/PaddleSlim/quant_low_level_api/quantization_aware_training.md
+<div align="center">
+  <h3>
+    <a href="./README.md">
+      Model quantization overview
+    </a>
+    <span> | </span>
+    <a href="../docs/tutorial.md">
+      Model quantization principles
+    </a>
+    <span> | </span>
+    <a href="./quantization_aware_training.md">
+      Quantization-aware training usage and examples
+    </a>
+    <span> | </span>
+    <a href="./post_training_quantization.md">
+      Post-training quantization usage and examples
+    </a>
+  </h3>
+</div>
+---
+# Quantization-Aware Training Low-Level API Usage Examples
+## Table of Contents
+- [Introduction to the quantization-aware training Low-Level APIs](#1-量化训练low-level-apis介绍)
+- [Quantization-aware training with the Low-Level API](#2-基于low-level-api的量化训练)
+## 1. Introduction to the quantization-aware training Low-Level APIs
+The quantization-aware training Low-Level APIs mainly involve five IrPasses in the PaddlePaddle framework: `QuantizationTransformPass`, `AddQuantDequantPass`, `QuantizationFreezePass`, `ConvertToInt8Pass`, and `TransformForMobilePass`. Their functions are described below:
+* `QuantizationTransformPass`: inserts consecutive quantize and dequantize ops before each input of operators such as `conv2d`, `depthwise_conv2d`, and `mul` in the IrGraph, and changes some inputs of the corresponding backward operators, as shown in Figure 1.
+<p align="center">
+<img src="../docs/images/usage/TransformPass.png" height=400 width=520 hspace='10'/> <br />
+<strong>Figure 1: Result of applying QuantizationTransformPass</strong>
+</p>
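What a consecutive quantize/dequantize pair computes can be illustrated with NumPy. This is a sketch of max-abs simulated quantization, not the PaddlePaddle operators themselves: the function name is hypothetical, and using 127 levels for 8 bits is an assumption chosen to match the `abs_max/127` mapping described in the quantization principles document.

```python
import numpy as np


def fake_quant_dequant_abs_max(x, bits=8):
    """Quantize x to signed integers by its abs max, then map back to float."""
    bin_cnt = 2 ** (bits - 1) - 1          # 127 for 8 bits (assumption)
    scale = float(np.max(np.abs(x)))       # max-abs quantization range
    if scale == 0.0:
        return x.copy(), scale
    q = np.round(x / scale * bin_cnt)      # integer-valued, still float dtype
    return q * scale / bin_cnt, scale      # dequantized values and the scale


x = np.array([-1.0, 0.3, 0.52, 1.0])
x_sim, scale = fake_quant_dequant_abs_max(x)
max_error = float(np.max(np.abs(x - x_sim)))
```

The output stays in floating point, so training continues with ordinary float kernels, while the values already carry the rounding error that INT8 inference would introduce.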
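+QuantizationTransformPass supports quantizing only specific op types in the model through the `quantizable_op_type` parameter, which defaults to `quantizable_op_type=['conv2d', 'depthwise_conv2d', 'mul']`. For example, with `quantizable_op_type=['conv2d']`, the pass quantizes only the `conv2d` ops in the model. Note that after setting `quantizable_op_type` on QuantizationTransformPass, the same `quantizable_op_type` must also be passed to QuantizationFreezePass and ConvertToInt8Pass.
+QuantizationTransformPass also supports excluding individual ops from quantization, as in the following example: first define `skip_pattern`; then, when building the model, define the ops that should not be quantized (`conv1` in the example) inside the name_scope of skip_pattern; finally, pass the `skip_pattern` parameter when constructing QuantizationTransformPass, so that `conv1` is not quantized.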
+```
+# define network
+skip_pattern=['skip_quant']
+......
+with fluid.name_scope(skip_pattern[0]):
+  conv1 = fluid.layers.conv2d(
+            input=input,
+            filter_size=filter_size,
+            num_filters=ch_out,
+            stride=stride,
+            padding=padding,
+            act=None,
+            bias_attr=bias_attr)
+......
+# init QuantizationTransformPass and set skip_pattern
+transform_pass = QuantizationTransformPass(
+            scope=fluid.global_scope(),
+            place=place,
+            activation_quantize_type=activation_quant_type,
+            weight_quantize_type=weight_quantize_type,
+            skip_pattern=skip_pattern)
+# apply QuantizationTransformPass
+```
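+* `AddQuantDequantPass`: inserts a `QuantDequant` op before each input of operators such as `elementwise_add` and `pool2d` in the IrGraph, and collects the quantization `scale` of the input `Tensor`s of the ops to be quantized during quantization-aware training. It is used in a similar way to QuantizationTransformPass: it also supports quantizing only specific op types and excluding individual ops. Note that PaddleLite does not yet support int8 kernels for `elementwise_add` and `pool2d`.
+* `QuantizationFreezePass`: mainly changes the order of the quantize and dequantize ops in the IrGraph, that is, it rearranges the layout in Figure 1 into the layout in Figure 2. In addition, it quantizes the weights of operators such as `conv2d`, `depthwise_conv2d`, and `mul` offline into the int8_t value range (the data type remains float32), which reduces weight quantization work during inference, as shown in Figure 2:
+<p align="center">
+<img src="../docs/images/usage/FreezePass.png" height=400 width=420 hspace='10'/> <br />
+<strong>Figure 2: Result of applying QuantizationFreezePass</strong>
+</p>
+* `ConvertToInt8Pass`: must run after QuantizationFreezePass. Its main purpose is to change the data type of the weights output by QuantizationFreezePass from `FP32` to `INT8`. In other words, users can save the quantized weights as float32 (skip ConvertToInt8Pass) or as int8_t (run ConvertToInt8Pass), as shown in Figure 3:
+<p align="center">
+<img src="../docs/images/usage/ConvertToInt8Pass.png" height=400 width=400 hspace='10'/> <br />
+<strong>Figure 3: Result of applying ConvertToInt8Pass</strong>
+</p>
+* `TransformForMobilePass`: after this pass, users obtain a quantized model compatible with the [paddle-mobile](https://github.com/PaddlePaddle/paddle-mobile) mobile inference library. In paddle-mobile, the quantize and dequantize ops are named `quantize` and `dequantize`. The `quantize` operator is similar in function to the `fake_quantize_abs_max` family of operators in PaddlePaddle, and the `dequantize` operator has the same function as the `fake_dequantize_max_abs` family. To run a model produced by quantization-aware training with paddle-mobile, operators such as `fake_quantize_abs_max` must be replaced with `quantize`, and operators such as `fake_dequantize_max_abs` with `dequantize`, as shown in Figure 4:
+<p align="center">
+<img src="../docs/images/usage/TransformForMobilePass.png" height=400 width=400 hspace='10'/> <br />
+<strong>Figure 4: Result of applying TransformForMobilePass</strong>
+</p>
+## 2. Quantization-aware training with the Low-Level API
+This section uses ResNet50 and MobileNetV1 as examples to show how to use the PaddlePaddle quantization-aware training Low-Level API:
+1) Run the following command to clone the [Paddle models repo](https://github.com/PaddlePaddle/models):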
+```bash
+git clone https://github.com/PaddlePaddle/models.git
+```
+2) Prepare the dataset (both training and validation sets). Taking ILSVRC2012 as an example, the dataset should have the following structure:
+```bash
+data
+└──ILSVRC2012
+        ├── train
+        ├── train_list.txt
+        ├── val
+        └── val_list.txt
+```
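+3) Change to the `models/PaddleSlim/quant_low_level_api` directory and edit `run_quantization_aware_training.sh` so that **data_dir** points to the dataset prepared in step 2). Then run `run_quantization_aware_training.sh` to start quantization-aware training.
+### 2.1 Summary of using the quantization-aware training Low-Level API:
+* Based on [quantization_aware_training.py](quantization_aware_training.py), the quantization-aware training Low-Level API is used as follows: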
+```python
+#startup_program = fluid.Program()
+#train_program = fluid.Program()
+#train_cost = build_program(
+#    main_prog=train_program,
+#    startup_prog=startup_program,
+#    is_train=True)
+#build_program(
+#    main_prog=test_program,
+#    startup_prog=startup_program,
+#    is_train=False)
+#test_program = test_program.clone(for_test=True)
+# The above pseudo code is used to build up the model.
+# ---------------------------------------------------------------------------------
+# The following code are part of Quantization Aware Training logic:
+# 0) Convert Programs to IrGraphs.
+main_graph = IrGraph(core.Graph(train_program.desc), for_test=False)
+test_graph = IrGraph(core.Graph(test_program.desc), for_test=True)
+# 1) Make some quantization transforms in the graph before training and testing.
+# According to the weight and activation quantization type, the graph will be added
+# some fake quantize operators and fake dequantize operators.
+transform_pass = QuantizationTransformPass(
+        scope=fluid.global_scope(), place=place,
+        activation_quantize_type=activation_quant_type,
+        weight_quantize_type=weight_quant_type)
+transform_pass.apply(main_graph)
+transform_pass.apply(test_graph)
+# Compile the train_graph for training.
+binary = fluid.CompiledProgram(main_graph.graph).with_data_parallel(
+    loss_name=train_cost.name, build_strategy=build_strategy)
+# Convert the transformed test_graph to test program for testing.
+test_prog = test_graph.to_program()
+# For training
+exe.run(binary, fetch_list=train_fetch_list)
+# For testing
+exe.run(program=test_prog, fetch_list=test_fetch_list)
+# 2) Freeze the graph after training by adjusting the quantize
+# operators' order for the inference.
+freeze_pass = QuantizationFreezePass(
+    scope=fluid.global_scope(),
+    place=place,
+    weight_quantize_type=weight_quant_type)
+freeze_pass.apply(test_graph)
+# 3) Convert the weights into int8_t type.
+# [This step is optional.]
+convert_int8_pass = ConvertToInt8Pass(scope=fluid.global_scope(), place=place)
+convert_int8_pass.apply(test_graph)
+# 4) Convert the freezed graph for paddle-mobile execution.
+# [This step is optional. But, if you execute this step, you must execute the step 3).]
+mobile_pass = TransformForMobilePass()
+mobile_pass.apply(test_graph)
+```
+* Options configured in the [run_quantization_aware_training.sh](run_quantization_aware_training.sh) script:
+```bash
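+   --model: the model to train with quantization, e.g. MobileNet or ResNet50.
+   --pretrained_fp32_model: path to the pretrained float32 model parameters.
+   --checkpoint: checkpoint path for resuming training. If a checkpoint path is given, pretrained_fp32_model should not be set.
+   --use_gpu: whether to train on GPU.
+   --data_dir: location of the training and validation datasets.
+   --batch_size: training batch size.
+   --total_images: total number of training images.
+   --class_dim: total number of classes.
+   --image_shape: image dimensions.
+   --model_save_dir: path where the model is saved.
+   --lr_strategy: learning rate decay strategy.
+   --num_epochs: total number of training epochs.
+   --lr: initial learning rate; when fine-tuning from pretrained parameters, a small initial learning rate is usually used.
+   --act_quant_type: activation quantization type; one of moving_average_abs_max, range_abs_max, abs_max.
+   --wt_quant_type: weight quantization type; one of abs_max, channel_wise_abs_max.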
+```
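+> **Note:** After quantization-aware training finishes, the model save path contains three directories: float, int8, and mobile. The models saved in each are described below:
+> - **float directory:** quantized models whose parameters are in the int8 range but stored as float32.
+> - **int8 directory:** quantized models whose parameters are in the int8 range and stored as int8.
+> - **mobile directory:** quantized models with the same parameter properties as the int8 directory that are also compatible with [paddle-mobile](https://github.com/PaddlePaddle/paddle-mobile).
+>
+> **Caution:** Currently, the PaddlePaddle framework on the server side only supports inference with the quantized models in the float directory.
+### 2.2 Testing the accuracy of the QAT quantized model
+Models are trained on the ImageNet2012 training set and evaluated on the ImageNet2012 validation set. `conv2d`, `depthwise_conv2d`, `mul`, `pool2d`, `elementwise_add`, and `concat` are quantized, and training runs for 5 epochs. The table below lists the accuracy of common classification models before and after QAT quantization.
+Model | FP32 Top1 | FP32 Top5 | INT8 Top1 | INT8 Top5 | Top1 Diff | Top5 Diff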
+-|:-:|:-:|:-:|:-:|:-:|:-:
+googlenet   | 70.50% | 89.59% | 69.96% | 89.18% | -0.54% | -0.41%
+mobilenetv1 | 70.91% | 89.54% | 70.50% | 89.42% | -0.41% | -0.12%
+mobilenetv2 | 71.90% | 90.56% | 72.05% | 90.56% | +0.15% | 0.00%
+resnet50    | 76.35% | 92.80% | 76.52% | 92.93% | +0.17% | +0.13%
+resnet101   | 77.49% | 93.57% | 77.80% | 93.78% | +0.31% | +0.21%
+vgg16       | 72.08% | 90.63% | 71.53% | 89.70% | -0.55% | -0.93%
+vgg19       | 72.56% | 90.83% | 71.99% | 89.93% | -0.57% | -0.90%
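The INT8 Top1 numbers in this table can be set side by side with those in the post-training quantization table of `post_training_quantization.md`. The script below does this using only values copied from the two tables; it is a rough comparison, since the two methods use different amounts of data and compute, and the variable names are illustrative.

```python
# INT8 Top1 accuracy (%) per model: (post-training quantization, QAT),
# copied from the two accuracy tables in this change.
int8_top1 = {
    "googlenet": (70.12, 69.96),
    "mobilenetv1": (70.24, 70.50),
    "mobilenetv2": (71.36, 72.05),
    "resnet50": (76.26, 76.52),
    "resnet101": (75.44, 77.80),
    "vgg16": (71.93, 71.53),
    "vgg19": (72.55, 71.99),
}

# Positive values mean QAT ends up more accurate than post-training quantization.
qat_gain = {model: round(qat - ptq, 2) for model, (ptq, qat) in int8_top1.items()}

for model, gain in sorted(qat_gain.items(), key=lambda kv: kv[1], reverse=True):
    print("%-12s %+.2f" % (model, gain))
```

In these tables QAT helps most on resnet101, where post-training quantization loses the most accuracy, while for googlenet, vgg16, and vgg19 the post-training result is slightly higher.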
--- a/PaddleSlim/quant_low_level_api/quant.py
+++ b/PaddleSlim/quant_low_level_api/quant.py
--- a/PaddleSlim/quant_low_level_api/run_post_training_quanzation.sh
+++ b/PaddleSlim/quant_low_level_api/run_post_training_quanzation.sh
+export CUDA_VISIBLE_DEVICES=0
+root_url="https://paddle-inference-dist.bj.bcebos.com/int8"
+mobilenetv1="mobilenetv1_fp32_model"
+samples="samples_100"
+if [ ! -d ${mobilenetv1} ]; then
+    wget ${root_url}/${mobilenetv1}.tgz
+    tar zxf ${mobilenetv1}.tgz
+fi
+if [ ! -d ${samples} ]; then
+    wget ${root_url}/${samples}.tgz
+    tar zxf ${samples}.tgz
+fi
+python post_training_quantization.py \
+    --model_dir=${mobilenetv1} \
+    --data_path=${samples} \
+    --save_model_path="mobilenetv1_int8_model" \
+    --algo="KL" \
+    --is_full_quantize=True \
+    --batch_size=10 \
+    --batch_nums=10 \
+    --use_gpu=True
--- a/PaddleSlim/quant_low_level_api/run_quant.sh
+++ b/PaddleSlim/quant_low_level_api/run_quant.sh