Refine the docs of quantization, test=develop, test=document_fix (#704)

c2cbef1a · cc · GitHub · 1729bf03 · c2cbef1a · c2cbef1a
7 changed files
--- a/demo/mkldnn_quant/README.md
+++ b/demo/mkldnn_quant/README.md
@@ -9,6 +9,9 @@
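 - Convert the quantized model on CPU: use the DNNL library on CPU to convert the quantized model into an INT8 model.
 - Deploy inference on CPU: deploy the sample on CPU and run inference.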

+References:
+* [Documentation](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html) for deploying quantized models on Intel CPU with PaddleInference
+
 ## 1. Preparation

 #### Install Paddle and PaddleSlim

--- a/demo/quant/deploy/TensorRT/README.md
+++ b/demo/quant/deploy/TensorRT/README.md
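@@ -8,6 +8,8 @@ NVIDIA TensorRT is a high-performance deep learning inference library for Nvidia GPU
 - Produce a quantized model: use PaddleSlim quantization-aware training or offline (post-training) quantization to obtain a quantized model. Note that the parameter values of the quantized operators should lie within the INT8 range, although their data type is still float.
 - Deploy inference on Nvidia GPU: run inference on the GPU with the INT8 data type.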

+References:
+* [Documentation](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html) for deploying quantized models on NV GPU with PaddleInference

 ## 1. Prepare the environment


--- a/docs/zh_cn/deploy/deploy_cls_model_on_mobile_device.md
+++ b/docs/zh_cn/deploy/deploy_cls_model_on_mobile_device.md
@@ -4,6 +4,8 @@

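 Paddle Lite is PaddlePaddle's lightweight inference engine. It provides efficient inference for mobile phones and IoT devices, integrates a wide range of cross-platform hardware, and offers a lightweight solution for on-device deployment and application delivery.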

+References:
+* [Documentation](https://paddle-lite.readthedocs.io/zh/latest/user_guides/quant_aware.html) for deploying quantized models with PaddleLite

 ## 1. Prepare the environment


--- a/docs/zh_cn/quick_start/dygraph/dygraph_quant_aware_training_tutorial.md
+++ b/docs/zh_cn/quick_start/dygraph/dygraph_quant_aware_training_tutorial.md
@@ -70,6 +70,9 @@ model.evaluate(val_dataset, batch_size=256, verbose=1)
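 ### 4.1 Convert the model to a simulated-quantization model

 For ordinary online quantization (quantization-aware training), leave `weight_preprocess_type` at its default of None; to use PACT online quantization, set it to 'PACT'.
+
+Note: quantized models produced by PACT online quantization currently give correct accuracy when deployed with PaddleLite on ARM CPU, but may show accuracy problems when deployed with PaddleInference on NV GPU or Intel CPU. Choose the online quantization method accordingly.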
+
 ```python
 quant_config = {
    # weight preprocess type, default is None and no preprocessing is performed.
@@ -108,4 +111,12 @@ quanter.save_quantized_model(

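 After exporting, the quantized inference model can be found under `path`.

-The quantized inference model can be opened and visualized with `netron`. Like an ordinary FP32 inference model, it can be loaded for inference with PaddleLite and PaddleInference; see the `Inference Deployment` chapter for details.
+Depending on the deployment scenario, you can use PaddleLite to deploy the quantized model to mobile devices (ARM CPU), or use PaddleInference to deploy it on servers (NV GPU and Intel CPU).
+
+Compared with the original FP32 model, the exported quantized model shows no obvious difference in size, because the weights in the quantized inference model are still stored as FP32. The model size only actually shrinks after the quantized inference model is converted with the PaddleLite opt tool at deployment time.
+
+Deployment references:
+* Deployment [documentation](../../deploy/index.html)
+* [Documentation](https://paddle-lite.readthedocs.io/zh/latest/user_guides/quant_aware.html) for deploying quantized models with PaddleLite
+* [Documentation](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html) for deploying quantized models on Intel CPU with PaddleInference
+* [Documentation](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html) for deploying quantized models on NV GPU with PaddleInference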
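The model-size note added in this file can be illustrated with a small standard-library sketch. It is not PaddleSlim or PaddleLite code: the `fake_quantize` helper and the abs-max scaling are assumptions chosen for illustration. It only shows why weights whose values lie in the INT8 range but are stored as FP32 take four times the space of the same values packed as 8-bit integers.

```python
from array import array


def fake_quantize(weights, num_bits=8):
    """Snap each weight to a signed integer level (hypothetical helper).

    Uses abs-max scaling, which is an assumption made for this sketch.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    levels = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return levels, scale


weights = [0.5, -1.27, 0.031, 1.0, -0.64, 0.0, 0.9, -0.25]
levels, scale = fake_quantize(weights)

# Exported form: values already lie in the INT8 range but are stored as FP32.
fp32_storage = array("f", [float(q) for q in levels])
# Converted form: the same values packed as signed 8-bit integers.
int8_storage = array("b", levels)

fp32_bytes = fp32_storage.itemsize * len(fp32_storage)
int8_bytes = int8_storage.itemsize * len(int8_storage)
print(f"FP32 storage: {fp32_bytes} bytes, INT8 storage: {int8_bytes} bytes")
```

The values are identical in both containers; only the storage type changes, which is the step the text attributes to the PaddleLite opt tool.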
--- a/docs/zh_cn/quick_start/dygraph/dygraph_quant_post_tutorial.md
+++ b/docs/zh_cn/quick_start/dygraph/dygraph_quant_post_tutorial.md
@@ -82,3 +82,15 @@ paddleslim.quant.quant_post_static(
        sample_generator=train_dataset,
        batch_nums=10)
 ```
+
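+Note: if the model's accuracy is completely wrong after offline quantization, the model may contain control-flow OPs; offline quantization does not yet support quantizing such models.
+
+Depending on the deployment scenario, you can use PaddleLite to deploy the quantized model to mobile devices (ARM CPU), or use PaddleInference to deploy it on servers (NV GPU and Intel CPU).
+
+Compared with the original FP32 model, the exported quantized model shows no obvious difference in size, because the weights in the quantized inference model are still stored as FP32. The model size only actually shrinks after the quantized inference model is converted with the PaddleLite opt tool at deployment time.
+
+Deployment references:
+* Deployment [documentation](../../deploy/index.html)
+* [Documentation](https://paddle-lite.readthedocs.io/zh/latest/user_guides/quant_aware.html) for deploying quantized models with PaddleLite
+* [Documentation](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html) for deploying quantized models on Intel CPU with PaddleInference
+* [Documentation](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html) for deploying quantized models on NV GPU with PaddleInference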
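The `quant_post_static` call in this file takes a `sample_generator` and `batch_nums=10`, meaning calibration data is drawn from a limited number of batches. The sketch below shows one simple calibration strategy, abs-max, in plain Python. It is an illustration only: the source does not say which algorithm `quant_post_static` uses, and `calibrate_abs_max`, `quantize`, and `toy_batches` are hypothetical names.

```python
def calibrate_abs_max(sample_generator, batch_nums=10):
    """Return the largest absolute value seen in the first `batch_nums` batches."""
    max_abs = 0.0
    for i, batch in enumerate(sample_generator()):
        if i >= batch_nums:
            break
        max_abs = max(max_abs, max(abs(x) for x in batch))
    return max_abs


def quantize(values, max_abs, qmax=127):
    """Map values to signed integer levels using the calibrated range."""
    scale = max_abs / qmax if max_abs else 1.0
    levels = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return levels, scale


def toy_batches():
    """Hypothetical stand-in for a dataset: 20 batches of growing magnitude."""
    for step in range(20):
        yield [float(step), -0.5 * step, 0.25]


# Only the first 10 batches contribute to the calibrated range.
max_abs = calibrate_abs_max(toy_batches, batch_nums=10)
levels, scale = quantize([9.0, -4.5, 0.25], max_abs)
print(max_abs, levels)
```

Values larger than the calibrated range are clipped, so a range taken from too few or unrepresentative batches can cost accuracy.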
--- a/docs/zh_cn/quick_start/static/quant_aware_tutorial.md
+++ b/docs/zh_cn/quick_start/static/quant_aware_tutorial.md
@@ -21,7 +21,8 @@ paddle.enable_static()
 ```

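 ## 2. Build the network
-This section builds a classification model for MNIST data. It uses `MobileNetV1`, with the input size set to `[1, 28, 28]` and 10 output classes.               For convenience, methods for building classification models are predefined under `paddleslim.models`; run the following code to build the classification model:
+This section builds a classification model for MNIST data. It uses `MobileNetV1`, with the input size set to `[1, 28, 28]` and 10 output classes.
+For convenience, methods for building classification models are predefined under `paddleslim.models`; run the following code to build the classification model: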



@@ -136,6 +137,7 @@ quant_program = slim.quant.quant_aware(train_program, exe.place, for_test=False)
 val_quant_program = slim.quant.quant_aware(val_program, exe.place, for_test=True)
 ```

+Note: if the static-graph model contains control-flow OPs, static offline quantization cannot be used.

 ## 5. Train and test the quantized model
 Fine-tune the quantized model, then test it after training for one epoch.
@@ -156,22 +158,26 @@ test(val_quant_program)

 ## 6. Save the quantized model

-The model obtained from the ``slim.quant.quant_aware`` interface in ``4. Quantization`` is only suitable for training. To obtain the model for final use, call the [slim.quant.convert](https://paddleslim.readthedocs.io/zh_CN/latest/api_cn/static/quant/quantization_api.html#convert) interface, then save the model with [fluid.io.save_inference_model](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/static/save_inference_model_cn.html#save-inference-model). The parameters of ``float_prog`` have data type float32 but values in the int8 range; after saving, it can be loaded and run with the Paddle executor, the PaddleInference predictor, and the Paddle-Lite predictor. The parameters of ``int8_prog`` have data type int8; after saving, the quantized model is visibly smaller, but **this model cannot be used for inference deployment**.
+The model obtained from the ``slim.quant.quant_aware`` interface in ``4. Quantization`` is only suitable for training. To obtain the model for final use, call the [slim.quant.convert](https://paddleslim.readthedocs.io/zh_CN/latest/api_cn/static/quant/quantization_api.html#convert) interface, then save the model with [fluid.io.save_inference_model](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/static/save_inference_model_cn.html#save-inference-model).


 ```python
-float_prog, int8_prog = slim.quant.convert(val_quant_program, exe.place, save_int8=True)
-target_vars = [float_prog.global_block().var(outputs[-1])]
+quant_infer_program = slim.quant.convert(val_quant_program, exe.place)
+target_vars = [quant_infer_program.global_block().var(outputs[-1])]
 paddle.static.save_inference_model(
-        path_prefix='./inference_model/float',
+        path_prefix='./quant_infer_model',
        feed_vars=[image],
        fetch_vars=target_vars,
        executor=exe,
        program=float_prog)
-paddle.static.save_inference_model(
-        path_prefix='./inference_model/int8',
-        feed_vars=[image],
-        fetch_vars=target_vars,
-        executor=exe,
-        program=int8_prog)
 ```
+
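+Depending on the business scenario, you can use PaddleLite to deploy the quantized model to mobile devices (ARM CPU), or use PaddleInference to deploy it on servers (NV GPU and Intel CPU).
+
+Compared with the original FP32 model, the saved quantized model shows no obvious difference in size, because the weights in the quantized inference model are still stored as FP32. The model size only actually shrinks after the quantized inference model is converted with the PaddleLite opt tool at deployment time.
+
+Deployment references:
+* Deployment [overview](../../deploy/index.html)
+* [Documentation](https://paddle-lite.readthedocs.io/zh/latest/user_guides/quant_aware.html) for deploying quantized models with PaddleLite
+* [Documentation](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html) for deploying quantized models on Intel CPU with PaddleInference
+* [Documentation](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html) for deploying quantized models on NV GPU with PaddleInference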
--- a/docs/zh_cn/quick_start/static/quant_post_static_tutorial.md
+++ b/docs/zh_cn/quick_start/static/quant_post_static_tutorial.md
@@ -152,6 +152,7 @@ slim.quant.quant_post_static(
        batch_nums=10)
 ```

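+Note: if the static-graph model contains control-flow OPs, static offline quantization cannot be used.

 Load the quantized model saved in the ``'./quant_post_static_model'`` folder and test it. The accuracy is close to the test accuracy obtained in ``3.2 Train and test``, so static offline quantization is almost lossless for this classification model.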

@@ -162,3 +163,13 @@ quant_post_static_prog, feed_target_names, fetch_targets = paddle.static.load_in
        executor=exe)
 test(quant_post_static_prog, fetch_targets)
 ```
+
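+Depending on the deployment scenario, you can use PaddleLite to deploy the quantized model to mobile devices (ARM CPU), or use PaddleInference to deploy it on servers (NV GPU and Intel CPU).
+
+Compared with the original FP32 model, the saved quantized model shows no obvious difference in size, because the weights in the quantized inference model are still stored as FP32. The model size only actually shrinks after the quantized inference model is converted with the PaddleLite opt tool at deployment time.
+
+Deployment references:
+* Deployment [overview](../../deploy/index.html)
+* [Documentation](https://paddle-lite.readthedocs.io/zh/latest/user_guides/quant_aware.html) for deploying quantized models with PaddleLite
+* [Documentation](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html) for deploying quantized models on Intel CPU with PaddleInference
+* [Documentation](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html) for deploying quantized models on NV GPU with PaddleInference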
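The control-flow note added in this file suggests checking a static-graph program before running static offline quantization. A minimal sketch of such a check follows. It assumes the program object exposes `blocks`, each block exposes `ops`, and each op exposes `type`; the op type names in `CONTROL_FLOW_OP_TYPES` are also assumptions to verify against the Paddle version in use. Stand-in objects are used so the example runs without Paddle installed.

```python
from types import SimpleNamespace

# Assumed control-flow op type names; verify against your Paddle version.
CONTROL_FLOW_OP_TYPES = {"while", "conditional_block"}


def find_control_flow_ops(program):
    """Return (block_index, op_type) for every op that looks like control flow."""
    found = []
    for block_idx, block in enumerate(program.blocks):
        for op in block.ops:
            if op.type in CONTROL_FLOW_OP_TYPES:
                found.append((block_idx, op.type))
    return found


# Hypothetical stand-in for a static-graph program with one `while` op.
toy_program = SimpleNamespace(blocks=[
    SimpleNamespace(ops=[SimpleNamespace(type="conv2d"),
                         SimpleNamespace(type="while")]),
    SimpleNamespace(ops=[SimpleNamespace(type="relu")]),
])

hits = find_control_flow_ops(toy_program)
if hits:
    print("Control-flow ops found; static offline quantization may not apply:", hits)
```

An empty result means no op matched the assumed type names. It does not prove the model is supported.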