[Cherry-pick] [Doc] Use paddleslim docs in quantization (#3567)

* Use paddleslim docs in quantization, test=develop, test=document_fix (#3565)

3402c693 · cc · GitHub · 2471f513 · 3402c693 · 3402c693
3 changed files
--- a/docs/user_guides/model_quantization.md
+++ b/docs/user_guides/model_quantization.md
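 # Model Quantization: Quantization-Aware Training

-This document describes how to load a quantized model produced by PaddlePaddle with Paddle-Lite and run inference. Using MobileNetV1 as an example, it first explains how to produce the quantized model and then how to deploy it for inference.
+This document describes how to load a quantized model produced by PaddlePaddle with Paddle-Lite and run inference.

 ## 1 Introduction

-Quantization-aware training quantizes an already trained inference model using a large amount of training data. It relies on simulated quantization: weights are updated during training so that the quantization error shrinks.
+Quantization-aware training quantizes an already trained inference model using a fairly large amount of training data. It relies on simulated quantization: weights are updated during training so that the quantization error shrinks.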
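
As a rough illustration of the simulated-quantization idea described above (not PaddlePaddle's or PaddleSlim's actual implementation), the NumPy sketch below rounds a tensor onto an 8-bit integer grid and immediately converts it back to float. The function name `fake_quantize`, the sample weights, and the choice of an `abs_max` scale are assumptions made for this example.

```python
import numpy as np

def fake_quantize(x, bits=8):
    """Quantize x with an abs_max scale, then dequantize it again (hypothetical helper)."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    scale = float(np.max(np.abs(x)))      # abs_max scale of the whole tensor
    if scale == 0.0:
        return x.copy()                   # an all-zero tensor has nothing to quantize
    q = np.round(x / scale * qmax)        # integer grid in [-qmax, qmax]
    return q * scale / qmax               # float values that carry the rounding error

weights = np.array([-1.0, -0.5, 0.0, 0.26, 1.0], dtype=np.float32)
simulated = fake_quantize(weights)
max_error = float(np.max(np.abs(simulated - weights)))
print(simulated, max_error)
```

In a training loop built on this idea, the forward pass would use values like `simulated` in place of `weights`, so the weight updates can compensate for the rounding error.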

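 Requirements:
 * A pretrained model
-* A fairly large amount of training data
+* A fairly large amount of training data (more than 5000)

 Steps:
 * Produce the quantized model: call the quantization-aware training API through PaddlePaddle to produce the quantized model
@@ -23,271 +23,37 @@

 We recommend first quantizing the model with "post-training quantization with calibration data" and running inference with the resulting model. If its accuracy falls short of your requirements, switch to quantization-aware training.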

-
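 ## 2 Producing the Quantized Model

-Currently, quantization-aware training in the PaddlePaddle framework mainly targets convolution layers (2D and depthwise convolutions) and fully connected layers, whose operators are conv2d, depthwise_conv2d, and mul. For more on how quantization-aware training works, see the [documentation](https://github.com/PaddlePaddle/models/blob/develop/PaddleSlim/docs/tutorial.md#1-quantization-aware-training%E9%87%8F%E5%8C%96%E4%BB%8B%E7%BB%8D). Paddle-Lite can run models produced by quantization-aware training in the PaddlePaddle framework, which further speeds up model execution on mobile devices.
+Currently, quantization-aware training in the PaddleSlim framework mainly targets convolution layers (2D and depthwise convolutions) and fully connected layers, whose operators are conv2d, depthwise_conv2d, and mul. Paddle-Lite can run models produced by quantization-aware training in the PaddlePaddle framework, which further speeds up model execution on mobile devices.

 Tip: if you are new to the PaddlePaddle framework, we recommend starting with the [Beginner's Guide](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/index_cn.html) and the [User Guide](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/index_cn.html).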

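-You can either download a trained quantized model or train one with the PaddleSlim model compression tool.
-
-### Download the Quantized Model
-
-An official [MobileNetV1 quantized model](https://paddle-inference-dist.bj.bcebos.com/int8%2Fpretrain%2Fmobilenet_v1_quant%2Ffloat.zip) has been released; download it directly to your local machine.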
-
-```bash
-wget https://paddle-inference-dist.bj.bcebos.com/int8%2Fpretrain%2Fmobilenet_v1_quant%2Ffloat.zip
-```
-
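-### Train the Quantized Model with the PaddleSlim Model Compression Tool
-
-#### Install PaddlePaddle
-
-Install PaddlePaddle by following the [official instructions](https://paddlepaddle.org.cn/start) for your operating system, installation method, Python version, and CUDA version. For example:
-
-GPU build on Ubuntu 16.04.4 LTS with CUDA 9 and cuDNN 7: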
-```bash
-pip install paddlepaddle-gpu==1.6.0.post97 -i https://mirrors.aliyun.com/pypi/simple/
-```
-
-CPU build on Ubuntu 16.04.4 LTS:
-```bash
-pip install paddlepaddle==1.6.0 -i https://mirrors.aliyun.com/pypi/simple/
-```
-
-#### Clone the Repository Required for Quantization-Aware Training
-
-Clone [PaddlePaddle/models](https://github.com/PaddlePaddle/models) locally and enter the models/PaddleSlim directory.
-
-```bash
-git clone https://github.com/PaddlePaddle/models.git
-cd models/PaddleSlim
-```
-
-#### Prepare the Data and Model
-
-##### Prepare the Training Data
-
-Follow the data preparation tutorial in [models/PaddleCV/image_classification](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification#data-preparation) to download the training data and save it under PaddleSlim/data.
-
-##### Prepare the Pretrained Model
-
-Following the /models/PaddleSlim/run.sh script, download the MobileNetV1 pretrained model from [models/PaddleCV/image_classification](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification#supported-models-and-performances) and save it under PaddleSlim/pretrain.
-
-After the three steps above, the PaddleSlim directory has the following structure:
-
-```bash
-.
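-├── compress.py # Main script of the compression task; defines the model information the task needs
-├── configs # Configuration files for compression tasks: distillation, int8 quantization, filter pruning, and combined strategies
-├── data # Training data (must be created by the user)
-│   └── ILSVRC2012
-├── pretrain # Pretrained model parameters, generated automatically by run.sh
-│   ├── MobileNetV1_pretrained
-│   ├── MobileNetV1_pretrained.tar
-│   ├── ResNet50_pretrained
-│   └── ResNet50_pretrained.tar
-├── docs # Documentation
-├── light_nas
-├── models # Network definitions, such as MobileNetV1
-├── quant_low_level_api # Low-level quantization-aware training API for customizing the training process; intended for advanced users
-├── reader.py # Data processing logic
-├── README.md
-├── run.sh # Launch script for compression tasks
-└── utility.py # Common utility functions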
-```
-
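-##### About the Compression Script
-
-`compress.py` defines all the model information needed to run a compression task. The key steps are summarized below.
-
-**Defining the target network**
-The following snippet from compress.py defines the train program, which here contains only forward operations.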
-```python
-out = model.net(input=image, class_dim=args.class_dim)
-cost = fluid.layers.cross_entropy(input=out, label=label)
-avg_cost = fluid.layers.mean(x=cost)
-acc_top1 = fluid.layers.accuracy(input=out, label=label, k=1)
-acc_top5 = fluid.layers.accuracy(input=out, label=label, k=5)
-```
-
-Then the clone method produces eval_program, which is used to evaluate model accuracy during compression:
-
-```python
-val_program = fluid.default_main_program().clone()
-```
-
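-Once the target network is defined, initialize it and load a pretrained model if needed.
-
-**Defining feed_list and fetch_list**
-For the train program, train_feed_list specifies which variables receive the data read from the train data reader, and train_fetch_list specifies which results are shown in the log during training. To print accuracy information in the training log, add ('acc_top1', acc_top1.name) to train_fetch_list.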
-```python
-train_feed_list = [('image', image.name), ('label', label.name)]
-train_fetch_list = [('loss', avg_cost.name)]
-```
-
-> Note: train_fetch_list must contain the loss entry.
-
-For the eval program, define eval_feed_list and eval_fetch_list in the same way:
-
-```python
-val_feed_list = [('image', image.name), ('label', label.name)]
-val_fetch_list = [('acc_top1', acc_top1.name), ('acc_top5', acc_top5.name)]
-```
-
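-**Compressor and the quantization configuration file**
-`compress.py` mainly uses Compressor and a yaml file to run quantization-aware training. The Compressor class is defined as follows: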
-```python
-class Compressor(object):
-    def __init__(self,
-                 place,
-                 scope,
-                 train_program,
-                 train_reader=None,
-                 train_feed_list=None,
-                 train_fetch_list=None,
-                 eval_program=None,
-                 eval_reader=None,
-                 eval_feed_list=None,
-                 eval_fetch_list=None,
-                 teacher_programs=[],
-                 checkpoint_path='./checkpoints',
-                 train_optimizer=None,
-                 distiller_optimizer=None):
-```
-
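-When creating a Compressor object, note the following:
-* If the train program already contains backward operators and optimizer update operators, set train_optimizer to None.
-* Parameter names in eval_program must exactly match those in train_program.
-* The final quantized model is obtained by pruning the eval_program network and saving the result. If the saved model is meant for inference, the eval program must contain all operators needed at inference time.
-* Checkpoints store the model in float data type.
-
-An example `configs/quantization.yaml` quantization configuration file: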
-
-```python
-version: 1.0
-strategies:
-    quantization_strategy:
-        class: 'QuantizationStrategy'
-        start_epoch: 0
-        end_epoch: 9
-        float_model_save_path: './output/float'
-        mobile_model_save_path: './output/mobile'
-        int8_model_save_path: './output/int8'
-        weight_bits: 8
-        activation_bits: 8
-        weight_quantize_type: 'abs_max'
-        activation_quantize_type: 'moving_average_abs_max'
-        save_in_nodes: ['image']
-        save_out_nodes: ['fc_0.tmp_2']
-compressor:
-    epoch: 10
-    checkpoint_path: './checkpoints_quan/'
-    strategies:
-        - quantization_strategy
-```
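-The configurable parameters are:
- **class:** Class name of the quantization strategy; currently only `QuantizationStrategy` is supported.
- **start_epoch:** Before start_epoch begins, the strategy inserts quantization and dequantization operators into train_program and eval_program. Quantization-aware training begins at start_epoch.
- **end_epoch:** After end_epoch finishes, the model is saved in the formats the user specified. Note: training does not stop after end_epoch; it continues until the epoch count reaches compressor.epoch. For example, with start_epoch=0, end_epoch=0, and compressor.epoch=2, quantization-aware training starts at epoch 0 and ends at epoch 1, but the saved model holds the parameters from the end of epoch 0.
- **float_model_save_path:**  Path for saving the model in float format, i.e. the parameter values lie in the int8 range but the data type is float32. If set to None, the float model is not saved. Defaults to None. **Note: Paddle-Lite uses the model in this directory for quantized inference optimization; see [Running Quantized Model Inference with Paddle-Lite](#二使用Paddle-Lite运行量化模型推理) in this document.**
- **int8_model_save_path:** Path for saving the model in int8 format, i.e. the parameter values lie in the int8 range and the data type is int8. If set to None, the int8 model is not saved. Defaults to None.
- **mobile_model_save_path:** Path for saving a model compatible with the paddle-mobile framework. If set to None, the paddle-mobile model is not saved. Defaults to None. paddle-mobile has since been upgraded to Paddle-Lite.
- **weight_bits:** Number of bits for quantizing weights. Note that bias parameters are not quantized.
- **activation_bits:** Number of bits for quantizing activations.
-  **weight_quantize_type:** Weight quantization method; quantization-aware training currently supports `abs_max` and `channel_wise_abs_max`.
- **activation_quantize_type:** Activation quantization method; quantization-aware training currently supports `range_abs_max` and `moving_average_abs_max`. PaddlePaddle also supports `abs_max` for activations, but that method computes the input quantization scale dynamically, which adds computation and slows inference, so Lite does not support `abs_max` activation quantization.
- **save_in_nodes:** List of variable names. When the quantized model is saved, the eval program network is pruned by forward traversal from save_in_nodes. Defaults to the variable names specified in eval_feed_list.
- **save_out_nodes:** List of variable names. When the quantized model is saved, the eval program network is pruned by backtracking from save_out_nodes. Defaults to the variable names specified in eval_fetch_list.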
-
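-> **Notes:**
->
-> 1) `abs_max` means the quantization scale is computed dynamically at every training step and during inference. `channel_wise_abs_max` is similar, except that it computes a separate scale for each channel of the convolution weights. In other words, `abs_max` is tensor-wise quantization and `channel_wise_abs_max` is channel-wise quantization; see [here](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/quantization/training_quantization_model_format.md) for details.
-> 
-> 2) `moving_average_abs_max` and `range_abs_max` compute a static quantization scale during training and reuse it at inference. `moving_average_abs_max` computes the scale with a moving average over a window, while `range_abs_max` uses the maximum absolute value within a window.
-> 
-> 3) **Currently, Paddle-Lite only runs quantized models whose weights use `abs_max` and whose activations use `moving_average_abs_max` or `range_abs_max`.**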
-
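-#### Run Quantization-Aware Training
-
-Edit run.sh: comment out everything between `# enable GC strategy` and `# for sensitivity filter pruning`, and uncomment the commands under `#for quantization` (the commands to uncomment are shown below).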
-
-```bash
-# for quantization
-#---------------------------
-export CUDA_VISIBLE_DEVICES=0
-python compress.py \
--batch_size 64 \
--model "MobileNet" \
--pretrained_model ./pretrain/MobileNetV1_pretrained \
--compress_config ./configs/quantization.yaml \
--quant_only True
-```
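-Finally, run `sh run.sh` to start int8 quantization-aware training.
-
-When training finishes, and if the output paths are configured as in the `configs/quantization.yaml` shown in this document, the models/PaddleSlim/output directory contains three subdirectories, `float`, `int8`, and `mobile`:
-* float: quantized model whose parameters lie in the int8 range but are stored as float32. Paddle-Lite deploys the quantized model from the model file and parameters in this directory.
-* int8: quantized model whose parameters lie in the int8 range and are stored as int8.
-* mobile: quantized model with the same parameter properties as the int8 directory and compatible with paddle-mobile (paddle-mobile has since been upgraded to Paddle-Lite).
+To train a quantized model with the PaddleSlim model compression tool, see:
+* Quantization-aware training [quick start tutorial](https://paddlepaddle.github.io/PaddleSlim/quick_start/quant_aware_tutorial.html)
+* Quantization-aware training [API reference](https://paddlepaddle.github.io/PaddleSlim/api_cn/quantization_api.html)
+* Quantization-aware training [Demo](https://github.com/PaddlePaddle/PaddleSlim/tree/release/1.0.1/demo/quant/quant_aware)

 ## 3 Running Quantized Model Inference with Paddle-Lite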

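-### Optimize the Quantized Model with the Model Optimization Tool
-
-Next, generate a model suitable for direct deployment on mobile devices from the original quantized model.
-
-Set up the build environment as described in [Building from Source](source_compile) and make sure the build succeeds. Then, following [Model Conversion](model_optimize_tool), build the model_optimize_tool and run the command below to optimize the model produced by quantization-aware training (adjust model_file, param_file, and optimize_out yourself).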
-```bash
-./model_optimize_tool                         \
--model_file=mobilenet_v1_quant/float/model   \
--param_file=mobilenet_v1_quant/float/weights \
--optimize_out_type=naive_buffer              \
--optimize_out=mobilenet_v1_quant_opt         \
--valid_targets=arm                           \
-```
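+First, use the model conversion tool provided by PaddleLite (model_optimize_tool) to convert the quantized model into a model for mobile inference, then load the converted model for deployment.

-As noted earlier, after quantization-aware training the parameters in the float directory lie in the int8 range but are still stored as float32, so the model parameters are not actually compressed. After optimization by model\_optimize\_tool, however, the quantized parameters are stored again as int8, which compresses them, and the model structure is also optimized (for example, through various operator fusions).
+### 3.1 Model Conversion

-### Prepare the Quantized Model Files on the Phone
-
-Push the quantized model files in the mobilenet_v1_quant_opt directory to the phone with the following command:
+Prepare the model conversion tool as described in [Model Conversion](../user_guides/model_optimize_tool); downloading it from the Release page is recommended.

+Use the tool as described in [Model Conversion](../user_guides/model_optimize_tool), setting the arguments to match your situation. For example, for inference on the ARM CPU of an Android phone, the conversion command is: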
 ```bash
-adb push mobilenet_v1_quant_opt /data/local/tmp
+./opt --model_dir=./mobilenet_v1_quant \
+      --optimize_out_type=naive_buffer \
+      --optimize_out=mobilenet_v1_quant_opt \
+      --valid_targets=arm
 ```

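-### Run the Optimized Quantized Model with mobilenetv1\_light\_api
-
-After setting up the build environment as described in [Building from Source](source_compile), run the following commands in Paddle-Lite to build the lightweight API demo:
+### 3.2 Quantized Model Inference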

-```bash
-cd /Paddle-Lite/build.lite.android.armv8.gcc/inference_lite_lib.android.armv8/demo/cxx/mobile_light
-make clean && make -j
-```
-After these commands finish, the `mobilenetv1_light_api` executable appears under `Paddle-Lite/build.lite.android.armv8.gcc/inference_lite_lib.android.armv8/demo/cxx/mobile_light/`. Push `mobilenetv1_light_api` to the phone and run quantized model inference with the following commands:
+Like an FP32 model, the converted quantized model can be loaded for inference in an Android/iOS app. See the [C++ Demo](../demo_guides/cpp_demo), [Java Demo](../demo_guides/java_demo), and [Android/IOS Demo](../demo_guides/android_app_demo).

-```bash
-adb push Paddle-Lite/build.lite.android.armv8.gcc/inference_lite_lib.android.armv8/demo/cxx/mobile_light/mobilenetv1_light_api /data/local/tmp
-adb shell chmod +x /data/local/tmp/mobilenetv1_light_api
-adb shell /data/local/tmp/mobilenetv1_light_api               \
-    --model_dir=/data/local/tmp/mobilenet_v1_quant_opt
-```
-**The program output is as follows:**
-```bash
-Output dim: 1000
-Output[0]: 0.000228
-Output[100]: 0.000260
-Output[200]: 0.000250
-Output[300]: 0.000560
-Output[400]: 0.000950
-Output[500]: 0.000275
-Output[600]: 0.005143
-Output[700]: 0.002509
-Output[800]: 0.000538
-Output[900]: 0.000969
-```
-For how to use the Paddle-Lite API in C++, see [here](../demo_guides/cpp_demo); you can also refer to the sample code in [mobilenetv1_light_api.cc](https://github.com/PaddlePaddle/Paddle-Lite/blob/develop/lite/demo/cxx/mobile_light/mobilenetv1_light_api.cc).

 ## FAQ


--- a/docs/user_guides/post_quant_no_data.md
+++ b/docs/user_guides/post_quant_no_data.md
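 # Model Quantization: Post-Training Quantization without Calibration Data

-This document first briefly introduces post-training quantization without calibration data, then explains how to produce the quantized model, and best describes inference with the quantized model.
+This document first briefly introduces post-training quantization without calibration data, then explains how to produce the quantized model, and finally describes inference with the quantized model.

 ## 1 Introduction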

@@ -18,7 +18,7 @@
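 * Quantizing weights to INT8 affects model accuracy; the model size is 1/4 of the original

 Disadvantages:
-* None so far
+* It only reduces model size and does not speed up inference

 ## 2 Producing the Quantized Model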

@@ -43,10 +43,15 @@ model_dir = path/to/fp32_model_params
 save_model_dir = path/to/save_model_path
 weight_quant = WeightQuantization(model_dir=model_dir)
 weight_quant.quantize_weight_to_int(save_model_dir=save_model_dir,
-                                    weight_bits=16,
-                                    quantizable_op_type=['conv2d', 'depthwise_conv2d', 'mul'])
+                                    weight_bits=8,
+                                    quantizable_op_type=['conv2d', 'mul'],
+                                    weight_quantize_type="channel_wise_abs_max",
+                                    generate_test_model=False)
 ```

+After it finishes, the quantized model is available in the `save_model_dir/quantized_model` directory.
+
+
 The API for post-training quantization without calibration data is described in detail below.

 ```python
@@ -58,24 +63,29 @@ class WeightQuantization(model_dir, model_filename=None, params_filename=None)
 * params_filename(str, optional): File name of the weights of the model to be quantized. If all weights are saved in a single file, set params_filename to that file name.

 ```python
-WeightQuantization.quantize_weight_to_int(save_model_dir,
-                                          save_model_filename=None,
-                                          save_params_filename=None,
-                                          quantizable_op_type=['conv2d', 'mul'],
-                                          weight_bits=8,
-                                          threshold_rate=0.0)
+WeightQuantization.quantize_weight_to_int(self,
+                               save_model_dir,
+                               save_model_filename=None,
+                               save_params_filename=None,
+                               quantizable_op_type=["conv2d", "mul"],
+                               weight_bits=8,
+                               weight_quantize_type="channel_wise_abs_max",
+                               generate_test_model=False,
+                               threshold_rate=0.0)
 ```
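 Parameters:
 * save_model_dir(str): Path where the quantized model is saved.
 * save_model_filename(str, optional): If None, the network structure is saved to the __model__ file; otherwise it is saved to the specified file. Defaults to None.
 * save_params_filename(str, optional): If None, the parameters are saved to a series of separate files; otherwise all parameters are saved to a single file named save_params_filename. Defaults to None.
-* quantizable_op_type(list[str]): Op types to quantize. Defaults to `['conv2d', 'mul']`; the list may contain any op type that supports quantization: `['conv2d', 'depthwise_conv2d', 'mul']`.
-* weight_bits(int, optional): Number of bits used to store quantized weights, from 8 to 16; usually 8 or 16. Defaults to 8.
+* quantizable_op_type(list[str]): Op types to quantize. Defaults to `['conv2d', 'mul']`; the list may contain any op type that supports quantization: `['conv2d', 'depthwise_conv2d', 'mul']`. `depthwise_conv2d` is usually not quantized, because doing so does little to reduce model size and may hurt accuracy.
+* weight_bits(int, optional): Number of bits used to store quantized weights, from 8 to 16; usually 8 or 16. Defaults to 8. With 8 bits the model can shrink by up to 4x, possibly with a slight accuracy loss. With 16 bits it can shrink by up to 2x with essentially no accuracy loss.
+* weight_quantize_type(str, optional): Weight quantization method; `channel_wise_abs_max` and `abs_max` are supported. `channel_wise_abs_max` is the usual choice because the quantized model loses less accuracy.
+* generate_test_model(bool, optional): Whether to produce a test model for checking the accuracy of the quantized model at deployment. The test model is saved under `save_model_dir/test_model` and can be loaded and tested with Fluid like an FP32 model, but it cannot be used for deployment on the inference side.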
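
To make the difference between the two `weight_quantize_type` options concrete, here is a small NumPy sketch. It is not the `WeightQuantization` implementation; the helper names, the 2-D weight layout (one row per output channel), and the sample values are assumptions chosen so that one channel has a much smaller range than the other.

```python
import numpy as np

QMAX = 127  # largest magnitude of a signed 8-bit integer grid

def quantize_abs_max(w):
    """Tensor-wise: a single scale for the whole weight tensor."""
    scale = np.max(np.abs(w))
    q = np.round(w / scale * QMAX).astype(np.int8)
    return q, scale

def quantize_channel_wise_abs_max(w):
    """Channel-wise: one scale per output channel (axis 0)."""
    scales = np.max(np.abs(w), axis=1, keepdims=True)
    q = np.round(w / scales * QMAX).astype(np.int8)
    return q, scales

def dequantize(q, scale):
    """Map int8 values back to float using the stored scale(s)."""
    return q.astype(np.float32) * scale / QMAX

# Two output channels with very different value ranges (hypothetical data).
w = np.array([[1.0, -0.5, 0.25],
              [0.01, -0.005, 0.0025]], dtype=np.float32)

q_t, s_t = quantize_abs_max(w)
q_c, s_c = quantize_channel_wise_abs_max(w)

# Largest absolute error per channel for each method.
err_tensor = np.max(np.abs(dequantize(q_t, s_t) - w), axis=1)
err_channel = np.max(np.abs(dequantize(q_c, s_c) - w), axis=1)
print("tensor-wise error per channel: ", err_tensor)
print("channel-wise error per channel:", err_channel)
print("bytes fp32 vs int8:", w.nbytes, q_c.nbytes)
```

With a shared scale, the small-range channel is squeezed into only a few integer levels, which is why the per-channel variant tends to lose less accuracy. The byte counts also show where the "up to 4x" size reduction for 8-bit weights comes from.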


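 ## 3 Quantized Model Inference

-Currently, quantized models produced by post-training quantization without calibration data cannot be loaded and run by PaddlePaddle; they can only be deployed for inference with PaddleLite.
+Currently, quantized models produced by post-training quantization without calibration data can only be deployed for inference with PaddleLite.

 The process is simple: first use the model conversion tool provided by PaddleLite (opt) to convert the quantized model into a model for mobile inference, then load the converted model for deployment.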


--- a/docs/user_guides/post_quant_with_data.md
+++ b/docs/user_guides/post_quant_with_data.md
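 # Model Quantization: Post-Training Quantization with Calibration Data

-This document first briefly introduces post-training quantization with calibration data, then explains how to produce the quantized model and run inference with it, and finally gives a usage example.
-To get started quickly, you can read the usage example first and then the detailed instructions.
-
 ## 1 Introduction

 Post-training quantization with calibration data uses a small amount of calibration data to compute quantization factors, so a quantized model can be obtained quickly. Running inference with this model reduces computation, lowers compute memory usage, and shrinks the model.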
@@ -14,7 +11,7 @@
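 * A small amount of calibration data, for example 100 to 500 images

 Steps:
-* Produce the quantized model: call the API for post-training quantization with calibration data through PaddlePaddle or PaddleSlim to produce the quantized model
+* Produce the quantized model: call the API for post-training quantization with calibration data through PaddleSlim to produce the quantized model
 * Quantized model inference: load the quantized model with PaddleLite for inference

 Advantages: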
@@ -27,11 +24,11 @@

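 ## 2 Producing the Quantized Model

-You can call the API for post-training quantization with calibration data through PaddlePaddle or PaddleSlim to obtain a quantized model. This document mainly covers producing the quantized model with PaddlePaddle; for PaddleSlim, see the [documentation](https://github.com/PaddlePaddle/models/tree/develop/PaddleSlim).
+You can call the API for post-training quantization with calibration data through PaddleSlim to obtain a quantized model.

-### 2.1 Install PaddlePaddle
+### 2.1 Install PaddleSlim

-Install PaddlePaddle CPU/GPU version 1.7 by following the PaddlePaddle [website](https://www.paddlepaddle.org.cn/install/quick).
+Install PaddleSlim by following its [documentation](https://paddlepaddle.github.io/PaddleSlim/install.html).

 ### 2.2 Prepare the Model and Calibration Data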

@@ -49,7 +46,7 @@

 ```python
 import paddle.fluid as fluid
-from paddle.fluid.contrib.slim.quantization import PostTrainingQuantization
+from paddleslim.quant import quant_post

 exe = fluid.Executor(fluid.CPUPlace())
 model_dir = path/to/fp32_model_params
@@ -69,75 +66,23 @@ batch_size = 10
 batch_nums = 10
 algo = "KL"
 quantizable_op_type = ["conv2d", "depthwise_conv2d", "mul"]
-ptq = PostTrainingQuantization(
-            executor=exe,
-            sample_generator=sample_generator,
-            model_dir=model_dir,
-            model_filename=model_filename,
-            params_filename=params_filename,
-            batch_size=batch_size,
-            batch_nums=batch_nums,
-            algo=algo,
-            quantizable_op_type=quantizable_op_type)
-ptq.quantize()
-ptq.save_quantized_model(save_model_path)
+quant_post(executor=exe,
+           model_dir=model_dir,
+           model_filename=model_filename,
+           params_filename=params_filename,
+           quantize_model_path=save_model_path,
+           sample_generator=sample_generator,
+           batch_size=batch_size,
+           batch_nums=batch_nums,
+           algo=algo,
+           quantizable_op_type=quantizable_op_type)
 ```
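
The `algo` setting in the call above selects how activation scales are derived from the calibration batches. The sketch below is not the KL method and not PaddleSlim code; it swaps in the simplest alternative, a non-saturating scale that just tracks the largest absolute activation seen, to show what "computing a quantization factor from calibration data" means. The function name and the random stand-in batches are assumptions for illustration.

```python
import numpy as np

def calibrate_abs_max(batches):
    """Return the largest absolute value seen across all calibration batches."""
    scale = 0.0
    for batch in batches:
        scale = max(scale, float(np.max(np.abs(batch))))
    return scale

# Random stand-ins for activations: 10 batches of 10 samples, mirroring
# batch_nums=10 and batch_size=10 in the example above.
rng = np.random.default_rng(0)
calibration_batches = [
    rng.standard_normal((10, 8)).astype(np.float32) for _ in range(10)
]

activation_scale = calibrate_abs_max(calibration_batches)

# With the scale fixed, any activation can be mapped onto the int8 grid.
quantized = np.round(calibration_batches[0] / activation_scale * 127).astype(np.int8)
print(activation_scale, quantized.min(), quantized.max())
```

A saturating method such as KL would typically pick a smaller threshold than this maximum and clip outliers, trading clipping error for finer resolution on the bulk of the values.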

-The API for post-training quantization with calibration data is described in detail below.
-
-``` python
-class PostTrainingQuantization(
-                 executor=None,
-                 scope=None,
-                 model_dir=None,
-                 model_filename=None,
-                 params_filename=None,
-                 sample_generator=None,
-                 batch_size=10,
-                 batch_nums=None,
-                 algo="KL",
-                 quantizable_op_type=["conv2d", "depthwise_conv2d", "mul"],
-                 is_full_quantize=False,
-                 weight_bits=8,
-                 activation_bits=8,
-                 is_use_cache_file=False,
-                 cache_dir="./temp_post_training"):
-```
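-Call the API above with the required arguments. Parameters:
-* executor(fluid.Executor): Executor that runs the model; it can be set to run on CPU or GPU.
-* scope(fluid.Scope, optional): Scope used when running the model. Defaults to None, in which case global_scope() is used. "optional" means the argument may be omitted and the default used; the same applies below.
-* model_dir(str): Path of the model to be quantized, containing the model file and weight files.
-* model_filename(str, optional): Model file name of the model to be quantized. Set it if the model file is not named `__model__`.
-* params_filename(str, optional): File name of the weights of the model to be quantized. If all weights are saved in a single file, set params_filename to that file name.
-* sample_generator(Python Generator): The configured calibration data generator.
-* batch_size(int, optional): Number of calibration samples read at a time.
-* batch_nums(int, optional): Number of times calibration data is read. If None, all calibration data from sample_generator is used for post-training quantization; otherwise `batch_size*batch_nums` calibration samples are read from sample_generator.
-* algo(str, optional): Method for computing the quantization factor of the activation tensors to be quantized. `KL` uses saturating quantization; `direct` uses non-saturating quantization. Defaults to `KL`.
-* quantizable_op_type(list[str], optional): Op types to quantize. Defaults to `["conv2d", "depthwise_conv2d", "mul"]`; the list may contain any op type that supports quantization.
-* is_full_quantize(bool, optional): Whether to quantize fully. If True, all ops in the model that support quantization are quantized; if False, only the op types in `quantizable_op_type` are quantized. The currently supported types are: 'conv2d', 'depthwise_conv2d', 'mul', "pool2d", "elementwise_add", "concat", "softmax", "argmax", "transpose", "equal", "gather", "greater_equal", "greater_than", "less_equal", "less_than", "mean", "not_equal", "reshape", "reshape2", "bilinear_interp", "nearest_interp", "trilinear_interp", "slice", "squeeze", "elementwise_sub".
-* weight_bits(int, optional): Number of bits for weight quantization, from 1 to 16. PaddleLite currently only loads quantized models with 8-bit weights.
-* activation_bits(int, optional): Number of bits for activation quantization, from 1 to 16. PaddleLite currently only loads quantized models with 8-bit activations.
-* is_use_cache_file(bool, optional): Whether to use cache files. If True, sampled data is saved to disk during post-training quantization; if False, all sampled data is kept in memory. Setting it to True is recommended when the model or the calibration set is large. Defaults to False.
-* cache_dir(str, optional): When is_use_cache_file is True, sampled data is saved to this directory. Temporary files in it are deleted automatically after quantization finishes.
+For a quick start, see the [documentation](https://paddlepaddle.github.io/PaddleSlim/quick_start/quant_post_tutorial.html#).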

-```python
-PostTrainingQuantization.quantize()
-```
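-Call the method above to start post-training quantization. The time required depends on the amount of calibration data, the model size, and the quantized op types. For example, post-training quantization of `MobileNetV1` with 100 images from the ImageNet2012 dataset takes about 1 minute.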
-
-```python
-PostTrainingQuantization.save_quantized_model(save_model_path)
-```
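-Call the method above to save the post-training quantized model; save_model_path is the path to save to.
+For the API, see the [documentation](https://paddlepaddle.github.io/PaddleSlim/api_cn/quantization_api.html#quant-post).

-Post-training quantization supports partial quantization:
-* Method 1: set quantizable_op_type; only the op types in quantizable_op_type are quantized and all other op types in the model stay unquantized.
-* Method 2: when building the network, define the specific ops that should not be quantized inside the `skip_quant` name_scope to skip them, as in the example below.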
-```python
-with fluid.name_scope('skip_quant'):
-    pool = fluid.layers.pool2d(input=hidden, pool_size=2, pool_type='avg', pool_stride=2)
-    # pool2d is not quantized
-```
+For a demo, see the [documentation](https://github.com/PaddlePaddle/PaddleSlim/tree/release/1.0.1/demo/quant/quant_post).

 ## 3 Quantized Model Inference

@@ -158,45 +103,3 @@ with fluid.name_scope('skip_quant'):
 ### 3.2 Quantized Model Inference

 Like an FP32 model, the converted quantized model can be loaded for inference in an Android/iOS app. See the [C++ Demo](../demo_guides/cpp_demo), [Java Demo](../demo_guides/java_demo), and [Android/IOS Demo](../demo_guides/android_app_demo).
-
-## 4 Usage Example
-
-### 4.1 Produce the Quantized Model
-
-Install PaddlePaddle as described in "2.1 Install PaddlePaddle" of this document.
-
-Download the [archive](https://paddle-inference-dist.cdn.bcebos.com/PaddleLite/quantization_demo/post_training_quantization_withdata.tgz) and extract it locally.
-```bash
-wget https://paddle-inference-dist.cdn.bcebos.com/PaddleLite/quantization_demo/post_training_quantization_withdata.tgz
-tar zxvf post_training_quantization_withdata.tgz
-cd post_training_quantization_withdata
-```
-
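-Run the command below to automatically download the inference model (mobilenetv1_fp32_model) and the calibration dataset, then call the method for post-training quantization with calibration data to produce the quantized model.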
-```bash
-sh run_post_training_quanzation.sh
-```
-
-The quantized model is saved in the mobilenetv1_int8_model folder.
-
-### 4.2 Quantized Model Inference
-
-Download the test binary ([benchmark_bin](https://paddle-inference-dist.cdn.bcebos.com/PaddleLite/quantization_demo/benchmark_bin)) or build it by following the [Benchmark testing guide](../benchmark/benchmark_tools).
-
-Save mobilenetv1_fp32_model, mobilenetv1_int8_model, and benchmark_bin to the phone.
-```bash
-adb push mobilenetv1_fp32_model /data/local/tmp
-adb push mobilenetv1_int8_model /data/local/tmp
-chmod 777 benchmark_bin
-adb push benchmark_bin /data/local/tmp
-```
-
-To measure the performance of the quantized model and the original model, run the following commands in order:
-```bash
-./benchmark_bin --is_quantized_model=true --run_model_optimize=true  --result_filename=res.txt --warmup=10 --repeats=30  --model_dir=mobilenetv1_int8_model/
-./benchmark_bin --is_quantized_model=true --run_model_optimize=true  --result_filename=res.txt --warmup=10 --repeats=30 --model_dir=mobilenetv1_fp32_model/
-cat res.txt
-```
-
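-The res.txt file shows the speeds of the INT8 quantized model and the FP32 original model.
-For example, on a Snapdragon 855 phone with a single thread, mobilenetv1 takes 14.52 ms with the INT8 quantized model and 31.7 ms with the FP32 original model.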