# Image classification INT8 model deployment and Inference on CPU
## Overview
This document describes the process of converting, deploying, and executing a DNNL INT8 model obtained from a fake-quantized (quant) model generated by PaddleSlim. On Cascade Lake machines (e.g. Intel(R) Xeon(R) Gold 6271, 6248, and other X2XX models), inference with the INT8 model is usually 3-3.7 times faster than with the FP32 model. On SkyLake machines (e.g. Intel(R) Xeon(R) Gold 6148, 8180, and other X1XX models), inference with the INT8 model is about 1.5 times faster than with the FP32 model.
The process comprises the following steps:
- Generating a fake-quantized model: Use PaddleSlim to generate a fake-quantized model with the `quant-aware` or `post-training` quantization strategy. Note that the parameters of the quantized ops will be in the `INT8` range, but their type remains `float32`.
- Converting the fake-quantized model into the final DNNL INT8 model: Use the provided Python script to convert the quant model into a DNNL-based, CPU-optimized INT8 model.
- Deployment and inference on CPU: Deploy the demo on a CPU and run inference.
## 1. Preparation
#### Install PaddleSlim
For PaddleSlim installation instructions, see the [PaddleSlim Installation Document](https://paddlepaddle.github.io/PaddleSlim/install.html).
In the sample tests, import Paddle and PaddleSlim as follows:
```
import paddle
import paddle.fluid as fluid
import paddleslim as slim
import numpy as np
```
## 2. Use PaddleSlim to generate a fake-quantized model
One can generate a fake-quantized model with either the post-training or the quant-aware strategy. Users who would like to skip generating the fake-quantized model and check the quantization speedup directly can download the [mobilenetv2 post-training quant model](https://paddle-inference-dist.cdn.bcebos.com/quantizaiton/quant_post_models/mobilenetv2_quant_post.tgz) and its original FP32 model [mobilenetv2 fp32 model](https://paddle-inference-dist.cdn.bcebos.com/quantizaiton/fp32_models/mobilenetv2.tgz), skip this section, and go directly to section 3.
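For example, both archives can be fetched and unpacked with the following commands (a minimal sketch using the URLs above; the target directory is arbitrary):
```
wget https://paddle-inference-dist.cdn.bcebos.com/quantizaiton/quant_post_models/mobilenetv2_quant_post.tgz
wget https://paddle-inference-dist.cdn.bcebos.com/quantizaiton/fp32_models/mobilenetv2.tgz
tar -xzf mobilenetv2_quant_post.tgz
tar -xzf mobilenetv2.tgz
```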
#### 2.1 Quant-aware training
To generate a fake-quantized model with the quant-aware strategy, see the [Quant-aware training tutorial](https://paddlepaddle.github.io/PaddleSlim/tutorials/quant_aware_demo/).
**The parameters during quant-aware training:**
- **quantize_op_types:** A list of operators around which `fake_quantize` and `fake_dequantize` ops are inserted. In PaddlePaddle, quantization of the following operators is supported for CPU: `depthwise_conv2d`, `conv2d`, `fc`, `matmul`, `transpose2`, `reshape2`, `pool2d`, `scale`, `concat`. However, inserting fake_quantize/fake_dequantize operators during training is needed only for the first four of them (`depthwise_conv2d`, `conv2d`, `fc`, `matmul`), so setting the `quantize_op_types` parameter to the list of those four ops is enough. The scale data needed for quantization of the other five operators is reused from the fake ops or gathered from the `out_threshold` attributes of the operators.
#### 2.2 Post-training quantization
To generate a post-training fake-quantized model, see the [Offline post-training quantization tutorial](https://paddlepaddle.github.io/PaddleSlim/tutorials/quant_post_demo/#_1).
## 3. Convert the fake quantized model to DNNL INT8 model
In order to deploy an INT8 model on the CPU, we need to collect scales, remove all fake_quantize/fake_dequantize operators, optimize the graph, and quantize it, turning it into the final DNNL INT8 model. This is done by the script [save_quant_model.py](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/contrib/slim/tests/save_quant_model.py). Copy the script to the directory where the demo is located (`/PATH_TO_PaddleSlim/demo/mkldnn_quant/`) and run it as shown below.
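A typical invocation looks like the following (a sketch with placeholder paths; only the first two options are required):
```
cd /PATH_TO_PaddleSlim/demo/mkldnn_quant/
python save_quant_model.py \
    --quant_model_path=/PATH/TO/SAVE/FLOAT32/QUANT/MODEL \
    --int8_model_save_path=/PATH/TO/SAVE/INT8/MODEL
```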
**Available options in the above command and their descriptions are as follows:**
- **quant_model_path:** Required. The input model path, i.e. the quant model produced by quant-aware training or post-training quantization.
- **int8_model_save_path:** The path where the final INT8 model is saved after the quant model is optimized and quantized by DNNL.
- **ops_to_quantize:** A comma-separated list of op types to be quantized; optional. If the option is skipped, all quantizable operators will be quantized. Skipping the option is recommended at first, as it usually yields the best performance and accuracy for the image classification and NLP models listed in the benchmark.
- **op_ids_to_skip:** A comma-separated list of operator ID numbers; optional, empty by default. The ops with IDs from this list will not be quantized and will remain FP32. To get the ID of a specific op, first run the script with the `--debug` option, then open the generated file `int8_<number>_cpu_quantize_placement_pass.dot` and find the op that should not be quantized; its ID number is given in parentheses after the op name.
- **debug:** Whether to generate model graphs. If this option is present, `.dot` files with graphs of the model will be generated after each optimization step that modifies the graph. For a description of the DOT format, see [DOT](https://graphviz.gitlab.io/_pages/doc/info/lang.html). To open a `*.dot` file, use any Graphviz tool available on the system (such as the `xdot` tool on Linux or the `dot` tool on Windows). For Graphviz documentation, see [Graphviz](http://www.graphviz.org/documentation/).
**Note:**
- The DNNL supported quantizable ops are `conv2d`, `depthwise_conv2d`, `fc`, `matmul`, `pool2d`, `reshape2`, `transpose2`, `scale`, `concat`.
- If you want to skip quantization of particular operators, use the `--op_ids_to_skip` option and set it exactly to the list of ids of the operators you want to keep as FP32 operators.
- Quantization yields the best performance and accuracy when long sequences of consecutive quantized operators are present in the model. When a model contains quantizable operators surrounded by non-quantizable ones, quantizing single operators (or very short sequences of them) can give no speedup or even cause a drop in performance because of frequent quantizing and dequantizing of the data. In that case the user can tweak the quantization process by limiting it to particular types of operators (using the `--ops_to_quantize` option) or by disabling quantization of particular operators, e.g. single ones or short sequences (using the `--op_ids_to_skip` option), as in the example below.
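For illustration, an invocation that limits quantization to `conv2d` and `pool2d` and additionally keeps two specific operators in FP32 could look like this (a sketch; the paths and the operator IDs are placeholders):
```
python save_quant_model.py \
    --quant_model_path=/PATH/TO/SAVE/FLOAT32/QUANT/MODEL \
    --int8_model_save_path=/PATH/TO/SAVE/INT8/MODEL \
    --ops_to_quantize="conv2d,pool2d" \
    --op_ids_to_skip="12,17"
```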
## 4. Inference
### 4.1 Data preprocessing
To deploy the model on the CPU, the validation dataset needs to be converted into a binary format. Run the command shown below in the root directory of the Paddle repository to convert the complete ILSVRC2012 val dataset; use the `--local` option to convert your own image classification dataset instead. The script is also available on the official website: [full_ILSVRC2012_val_preprocess.py](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py).
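A typical invocation from the Paddle repository root could look like this (a sketch; the output path is a placeholder):
```
python paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py \
    --output_file=/PATH/TO/SAVE/BINARY/FILE
```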
**Available options in the above command and their descriptions are as follows:**
- If no parameters are set, the script will download the ILSVRC2012_img_val dataset and convert it into a binary file.
- **local:** If set, a user's own dataset is expected in the directory given by the `data_dir` option.
- **data_dir:** The user's own data directory.
- **label_list:** A list file mapping image paths to image categories, similar to `val_list.txt`.
- **output_file:** The path for the generated binary file.
- **data_dim:** The width and height of the pre-processed images in the resulting binary file. The default value is 224.
The structure of the directory with the user's dataset should be as follows:
```
imagenet_user
├── val
│   ├── ILSVRC2012_val_00000001.jpg
│   ├── ILSVRC2012_val_00000002.jpg
│   ├── ...
└── val_list.txt
```
Then, the contents of val_list.txt should be as follows:
```
val/ILSVRC2012_val_00000001.jpg 0
val/ILSVRC2012_val_00000002.jpg 0
```
**Note:**
- Measuring performance with the C++ test rather than the Python test is recommended, because the Python test incurs a large overhead from Python itself. However, testing requires the dataset images to be preprocessed first. This can be done easily using native tools in Python, but in C++ it requires additional libraries. To avoid introducing external C++ dependencies on image processing libraries like OpenCV, we preprocess the dataset using the Python script and save the result in a binary format, ready to use by the C++ test. Users can modify the C++ test code to enable image preprocessing with OpenCV or any other library and read the image data directly from the original dataset. The accuracy result should differ only slightly from the accuracy obtained using the preprocessed binary dataset. The Python test `sample_tester.py` is provided as a reference to show the difference in performance between it and the C++ test `sample_tester.cc`.
### 4.2 Deploying Inference demo
#### Deployment prerequisites
- Users can check which instruction sets are supported by their machines' CPUs by issuing the command `lscpu`.
- INT8 performance and accuracy are best on CPU servers which support the `avx512_vnni` instruction set (e.g. Intel Cascade Lake CPUs: Intel(R) Xeon(R) Gold 6271, 6248, or other X2XX models). INT8 inference performance is then 3-3.7 times better than FP32.
- On CPU servers that support `avx512` but not `avx512_vnni` instructions (SkyLake, model name: Intel(R) Xeon(R) Gold X1XX, such as 6148), the performance of INT8 models is around 1.5 times better than that of FP32 models.
#### Prepare Paddle inference library
Users can compile the Paddle inference library from the source code or download the inference library directly.
- For instructions on how to compile the Paddle inference library from source, see [Compile from Source](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html#id12); check out the release/2.0 or develop branch and compile it.
- Users can also download the published [inference library](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html). Please select the latest release or develop version of `ubuntu14.04_cpu_avx_mkl`. The downloaded library has to be decompressed, renamed to `fluid_inference`, and placed in the current directory (`/PATH_TO_PaddleSlim/demo/mkldnn_quant/`) for the library to be found, for example as shown below. Another option is to set the `PADDLE_ROOT` cmake variable to the location of the `fluid_inference` directory to link the tests with the Paddle inference library properly.
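For example, unpacking and renaming the downloaded archive could look like this (a minimal sketch; `fluid_inference.tgz` and the extracted directory name are placeholders that depend on the package chosen on the download page):
```
cd /PATH_TO_PaddleSlim/demo/mkldnn_quant/
# The archive name below is a placeholder for the package downloaded from the official page
tar -xzf fluid_inference.tgz
# Rename the extracted directory (its original name depends on the package) to fluid_inference
mv EXTRACTED_DIR_NAME fluid_inference
```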
#### Compile the application
The source code of the sample test (`sample_tester.cc`) and the `cmake` files are all located in the `demo/mkldnn_quant/` directory.
```
cd /PATH/TO/PaddleSlim
cd demo/mkldnn_quant/
mkdir build
cd build
cmake -DPADDLE_ROOT=$PADDLE_ROOT ..
make -j
```
- The default value of `-DPADDLE_ROOT` is `demo/mkldnn_quant/fluid_inference`. If users have downloaded and unpacked the [inference library from the official website](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html) into the current directory `demo/mkldnn_quant/`, they can skip this option.
#### Run the test
```
# Bind threads to cores
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
# Turbo Boost could be set to OFF using the command
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# In the file run.sh, set `MODEL_DIR` to `/PATH/TO/FLOAT32/MODEL` or `/PATH/TO/SAVE/INT8/MODEL`
# In the file run.sh, set `DATA_FILE` to `/PATH/TO/SAVE/BINARY/FILE`
# For 1 thread performance:
./run.sh
# For 20 thread performance:
./run.sh -1 20
```
**Available options in the above command and their descriptions are as follows:**
- **infer_model:** Required. The path of the model to be tested. Note that the model parameters need to be saved as multiple separate files.
- **infer_data:** Required. The path of the tested data file. Note that it needs to be a binary file converted by `full_ILSVRC2012_val_preprocess`.
- **batch_size:** Batch size. The default value is 50.
- **iterations:** The number of batches to run. The default is 0, which means all batches (number of images / batch size) in infer_data are predicted.
- **num_threads:** The number of CPU threads used. The default value is 1.
- **with_accuracy_layer:** Whether the model contains an accuracy layer. The default value is false.
- **use_analysis:** Whether to use paddle::AnalysisConfig to optimize the model. The default value is false.
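For reference, a direct invocation of the compiled test with these flags could look as follows (a sketch; the binary name `build/sample_tester` and all paths are assumptions based on this demo):
```
./build/sample_tester \
    --infer_model=/PATH/TO/SAVE/INT8/MODEL \
    --infer_data=/PATH/TO/SAVE/BINARY/FILE \
    --batch_size=50 \
    --num_threads=1 \
    --with_accuracy_layer=false \
    --use_analysis=false
```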
One can directly modify `MODEL_DIR` and `DATA_FILE` in `run.sh` under the `/PATH_TO_PaddleSlim/demo/mkldnn_quant/` directory and then execute `./run.sh` for CPU inference.
### 4.3 Writing your own tests
When writing their own test, users can:
1. Test the resulting INT8 model - then paddle::NativeConfig should be used (without applying additional optimizations) and the option `use_analysis` should be set to `false` in the demo.
2. Test the original FP32 model - then paddle::AnalysisConfig should be used (applying FP32 fuses and optimizations) and the option `use_analysis` should be set to `true` in the demo.
The AnalysisConfig configuration in this demo is set as follows:
```
cfg->SetModel(FLAGS_infer_model);                     // Required. The model to be tested
cfg->DisableGpu();                                    // Required. Disable the GPU; inference runs on the CPU
cfg->EnableMKLDNN();                                  // Required. Enabling MKL-DNN makes inference faster than the native configuration
cfg->SwitchIrOptim();                                 // Required. IR optimization fuses many ops and improves performance
cfg->SetCpuMathLibraryNumThreads(FLAGS_num_threads);  // The default value is 1
```
**Notes:**
- If `infer_model` is a path to an FP32 model and `use_analysis` is set to true, paddle::AnalysisConfig will be used. Hence the FP32 model will be fused and optimized, and the performance should be better than FP32 inference using paddle::NativeConfig.
- If `infer_model` is a path to a converted DNNL INT8 model, the `use_analysis` option makes no difference, because the INT8 model has already been fused, optimized, and quantized.
- If `infer_model` is a path to a fake-quantized model generated by PaddleSlim, `use_analysis` will not work even if it is set to true, because the fake-quantized model contains fake_quantize/fake_dequantize ops which cannot be fused or optimized.
## 5. Accuracy and performance benchmark
For INT8 model accuracy and performance results, see [Accuracy and performance of INT8 models deployed on CPU](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/docs/zh_cn/tutorials/image_classification_mkldnn_quant_tutorial.md)
## FAQ
- For deploying INT8 NLP models on CPU, see [ERNIE model quant INT8 accuracy and performance reproduction](https://github.com/PaddlePaddle/benchmark/tree/master/Inference/c++/ernie/mkldnn)
- The detailed DNNL quantization process can be viewed in [SLIM quant for INT8 DNNL](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/contrib/slim/tests/README.md)