Doc update with Ernie QAT INT8 benchmarking (#22519)

* Doc update with Ernie QAT INT8 benchmarking test=develop
* fixes after review test=develop
* remove ernie part, test=develop test=document_fix
* Fix model name for qatv2 test=develop test=document_fix
* Add Ernie data test=develop test=document_fix
* update ERNIE benchmark with baidu QA results, test=develop test=document_fix

Co-authored-by: bingyanghuang <33643817+bingyanghuang@users.noreply.github.com>
Co-authored-by: Michał Gallus <sand3r@interia.eu>

fce37bc5 · Wojciech Uss · GitHub · 7cf648b3 · fce37bc5

Showing 65 additions and 31 deletions

python/paddle/fluid/contrib/slim/tests/QAT_mkldnn_int8_readme.md +65 -31

--- a/python/paddle/fluid/contrib/slim/tests/QAT_mkldnn_int8_readme.md
+++ b/python/paddle/fluid/contrib/slim/tests/QAT_mkldnn_int8_readme.md
 # SLIM Quantization-aware training (QAT) on INT8 MKL-DNN

-This document describes how to use [Paddle Slim](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/advanced_usage/paddle_slim/paddle_slim.md) to convert a quantization-aware trained model to INT8 MKL-DNN quantized model. In **Release 1.5**, we have released the QAT1.0 MKL-DNN which enabled the INT8 MKL-DNN kernel for QAT trained model within 0.05% accuracy diff on GoogleNet, MobileNet-V1, MobileNet-V2, ResNet-101, ResNet-50, VGG16 and VGG19. In **Release 1.6**, QAT2.0 MKL-DNN, we did the performance optimization based on fake QAT models: ResNet50, ResNet101, Mobilenet-v1, Mobilenet-v2, VGG16 and VGG19 with the minor accuracy drop. Compared with Release 1.5, the QAT2.0 MKL-DNN got better performance gain on inference compared with fake QAT models but got a little bit bigger accuracy diff. We provide the accuracy benchmark both for QAT1.0 MKL-DNN and QAT2.0 MKL-DNN, and performance benchmark on QAT2.0 MKL-DNN.  
+This document describes how to use [Paddle Slim](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/advanced_usage/paddle_slim/paddle_slim.md) to convert a quantization-aware trained model to INT8 MKL-DNN quantized model. In **Release 1.5**, we have released the QAT1.0 MKL-DNN which enabled the INT8 MKL-DNN kernel for QAT trained model within 0.05% accuracy diff on GoogleNet, MobileNet-V1, MobileNet-V2, ResNet-101, ResNet-50, VGG16 and VGG19. In **Release 1.6**, QAT2.0 MKL-DNN, we did the performance optimization based on fake QAT models: ResNet50, ResNet101, Mobilenet-v1, Mobilenet-v2, VGG16 and VGG19 with the minor accuracy drop. Compared with Release 1.5, the QAT2.0 MKL-DNN got better performance gain on inference compared with fake QAT models but got a little bit bigger accuracy diff. In **Release 1.7**, support for the [Ernie (NLP) QAT trained model](https://github.com/PaddlePaddle/benchmark/tree/master/Inference/c%2B%2B/ernie) was added to the QAT2.0 MKL-DNN. We provide the accuracy benchmark both for QAT1.0 MKL-DNN and QAT2.0 MKL-DNN, and performance benchmark on QAT2.0 MKL-DNN.  

 Notes:

 * MKL-DNN and MKL are required. The performance gain can only be obtained with AVX512 series CPU servers.
+* INT8 accuracy is best on CPU servers supporting AVX512 VNNI extension.

 ## 0. Prerequisite
-You need to install at least PaddlePaddle-1.6 python package `pip install paddlepaddle==1.6`.
+You need to install at least PaddlePaddle-1.7 python package `pip install paddlepaddle==1.7`.

 ## 1. How to generate INT8 MKL-DNN QAT model
-You can refer to the unit test in [test_quantization_mkldnn_pass.py](test_quantization_mkldnn_pass.py). Users firstly use PaddleSlim quantization strategy to get a saved fake QAT model by [QuantizationFreezePass](https://github.com/PaddlePaddle/models/tree/develop/PaddleSlim/quant_low_level_api), then use the `FakeQAT2MkldnnINT8KernelPass` to get the graph which can be run with MKL-DNN INT8 kernel. In Paddle Release 1.6, this pass supports `conv2d` and `depthwise_conv2d` ops with channel-wise quantization for weights. Apart from it, another pass called FakeQAT2MkldnnINT8PerfPass is available for use. This pass allows users to transform their QAT INT8 model into a highly performance-optimized model that is ran using INT8 MKL-DNN kernels.
+You can refer to the unit test in [test_quantization_mkldnn_pass.py](test_quantization_mkldnn_pass.py). Users firstly use PaddleSlim quantization strategy to get a saved fake QAT model by [QuantizationFreezePass](https://github.com/PaddlePaddle/models/tree/develop/PaddleSlim/quant_low_level_api), then use the `QatInt8MkldnnPass` (from QAT1.0 MKL-DNN) to get a graph which can be run with MKL-DNN INT8 kernel. In Paddle Release 1.6, this pass supports `conv2d` and `depthwise_conv2d` ops with channel-wise quantization for weights. Apart from it, another pass called `Qat2Int8MkldnnPass` (from QAT2.0 MKL-DNN) is available for use. In Release 1.6, this pass additionally supports `pool2d` op and allows users to transform their QAT model into a highly performance-optimized INT8 model that is run using INT8 MKL-DNN kernels. In Release 1.7, support for `fc`, `reshape2` and `transpose2` ops was added to the pass.

 ```python
    import paddle.fluid as fluid
-    from paddle.fluid.contrib.slim.quantization import FakeQAT2MkldnnINT8KernelPass
+    from paddle.fluid.contrib.slim.quantization import QatInt8MkldnnPass
+    from paddle.fluid.contrib.slim.quantization import Qat2Int8MkldnnPass
    from paddle.fluid.framework import IrGraph
    from paddle.fluid import core	
    
@@ -23,14 +25,14 @@ You can refer to the unit test in [test_quantization_mkldnn_pass.py](test_quanti
    place = fluid.CPUPlace()
    # Convert the IrGraph to MKL-DNN supported INT8 IrGraph by using
    # QAT1.0 MKL-DNN
-    # FakeQAT2MkldnnINT8KernelPass
-    mkldnn_pass = FakeQAT2MkldnnINT8KernelPass(fluid.global_scope(), place)
-    # Apply FakeQAT2MkldnnINT8KernelPass to IrGraph
+    # QatInt8MkldnnPass
+    mkldnn_pass = QatInt8MkldnnPass(fluid.global_scope(), place)
+    # Apply QatInt8MkldnnPass to IrGraph
    mkldnn_pass.apply(graph)
    # QAT2.0 MKL-DNN
-    # FakeQAT2MkldnnINT8PerfPass
-    mkldnn_pass = FakeQAT2MkldnnINT8PerfPass(fluid.global_scope(), place, fluid.core, False)
-    # Apply FakeQAT2MkldnnINT8PerfPass to IrGraph
+    # Qat2Int8MkldnnPass, it requires a list of operators to be quantized
+    mkldnn_pass = Qat2Int8MkldnnPass({'conv2d', 'pool2d'}, fluid.global_scope(), place, fluid.core, False)
+    # Apply Qat2Int8MkldnnPass to IrGraph
    mkldnn_pass.apply(graph)

 ```
@@ -61,9 +63,10 @@ You can refer to the unit test in [test_quantization_mkldnn_pass.py](test_quanti
 |     VGG16    |         71.74%         |         71.75%         |   +0.01%  |         89.96%         |         89.73%         |   -0.23%  |
 |     VGG19    |         72.30%         |         72.09%         |   -0.21%  |         90.19%         |         90.13%         |   -0.06%  |

+
 >**III. QAT2.0 MKL-DNN C-API Performance on Intel(R) Xeon(R) Gold 6271**

-|     Model    | FP32 Optimized Throughput (images/s) | INT8 QAT Throughput(images/s) | Ratio(INT8/FP32) |
+|     Model    |        FP32 Optimized Throughput     |       INT8 QAT Throughput     | Ratio(INT8/FP32) |
 |:------------:|:------------------------------------:|:-----------------------------:|:----------------:|
 | MobileNet-V1 |                 73.98                |             227.73            |       3.08       |
 | MobileNet-V2 |                 86.59                |             206.74            |       2.39       |
@@ -76,15 +79,37 @@ Notes:

 * FP32 Optimized Throughput (images/s) is from [int8_mkldnn_quantization.md](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md).

+>**IV. Ernie QAT2.0 MKL-DNN Accuracy on Intel(R) Xeon(R) Gold 6271**
+
+|     Model    |  FP32 Accuracy | QAT INT8 Accuracy | Accuracy Diff |
+|:------------:|:----------------------:|:----------------------:|:---------:|
+|   Ernie      |      0.80              |         0.82           |     +0.02 |               
+
+
+>**V. Ernie QAT2.0 MKL-DNN Performance on Intel(R) Xeon(R) Gold 6271**
+
+|     Threads  | FP32 Latency (ms) | QAT INT8 Latency (ms)    | Latency Diff |
+|:------------:|:----------------------:|:-------------------:|:---------:|
+| 1 thread     |        252.131         |         93.8023    |     2.687x   |
+| 20 threads   |        29.1853         |         17.3765    |     1.680x   |
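
The "Latency Diff" column in the Ernie performance table reads as a speedup ratio (FP32 latency divided by QAT INT8 latency) rather than a difference in milliseconds. A minimal sketch to recompute it from the two latency columns, assuming that interpretation; the helper name `speedup` is hypothetical and not part of Paddle:

```python
# Latencies in milliseconds, copied from the Ernie performance table.
# Each entry is (FP32 latency, QAT INT8 latency).
latencies_ms = {
    "1 thread": (252.131, 93.8023),
    "20 threads": (29.1853, 17.3765),
}


def speedup(fp32_ms, int8_ms):
    """Return how many times faster INT8 is than FP32 (assumed meaning of the column)."""
    return fp32_ms / int8_ms


for threads, (fp32_ms, int8_ms) in latencies_ms.items():
    print(f"{threads}: {speedup(fp32_ms, int8_ms):.3f}x")
```

The recomputed ratios agree with the table to within rounding in the last digit. The same division applies to the `Ratio(INT8/FP32)` column of the throughput table, with the operands swapped because higher throughput is better.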
+
 ## 3. How to reproduce the results
-Three steps to reproduce the above-mentioned accuracy results, and we take ResNet50 benchmark as an example:
- * ### Prepare dataset
+Three steps are needed to reproduce the above-mentioned accuracy and performance results.  Below we explain the steps taking ResNet50 as an example of image classification models. In order to reproduce NLP results, please follow [this guide](https://github.com/PaddlePaddle/benchmark/tree/master/Inference/c%2B%2B/ernie/mkldnn/README.md).
+### Prepare dataset
+
+#### Image classification
+
+In order to download the dataset for image classification models benchmarking, execute:
+
 ```bash
 cd /PATH/TO/PADDLE
 python paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py
 ```
 The converted data binary file is saved by default in `$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin`
- * ### Prepare model
+
+### Prepare model
+
+#### Image classification
 You can run the following commands to download ResNet50 model. The exemplary code snippet provided below downloads a ResNet50 QAT model. The reason for having two different versions of the same model originates from having two different QAT training strategies: One for an non-optimized and second for an optimized graph transform which correspond to QAT1.0 and QAT2.0 respectively.

 ```bash
@@ -92,48 +117,57 @@ mkdir -p /PATH/TO/DOWNLOAD/MODEL/
 cd /PATH/TO/DOWNLOAD/MODEL/
 # uncomment for QAT1.0 MKL-DNN
 # export MODEL_NAME=ResNet50
-# export MODEL_FILE_NAME= QAT_models/${MODEL_NAME}_qat_model.tar.gz
+# export MODEL_FILE_NAME=QAT_models/${MODEL_NAME}_qat_model.tar.gz
 # uncomment for QAT2.0 MKL-DNN
 # export MODEL_NAME=resnet50
-# export MODEL_FILE_NAME= QAT2_models/${MODEL_NAME}_quant.tar.gz
+# export MODEL_FILE_NAME=QAT2_models/${MODEL_NAME}_quant.tar.gz
 wget http://paddle-inference-dist.bj.bcebos.com/int8/${MODEL_FILE_NAME}
+mkdir ${MODEL_NAME} && tar -xvf $(basename ${MODEL_FILE_NAME}) -C ${MODEL_NAME}
 ```

-Unzip the downloaded model to the folder. To verify all the 7 models, you need to set `MODEL_NAME` to one of the following values in command line:
+Extract the downloaded model to the folder. To verify all the 7 models, you need to set `MODEL_NAME` to one of the following values in command line:
 ```text
 QAT1.0 models
 MODEL_NAME=ResNet50, ResNet101, GoogleNet, MobileNetV1, MobileNetV2, VGG16, VGG19
 QAT2.0 models
 MODEL_NAME=resnet50, resnet101, mobilenetv1, mobilenetv2, vgg16, vgg19
 ```
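
The archive name depends on both the model name and the QAT version, so it is easy to mix the two naming schemes. A minimal sketch that builds the download URL from the patterns used in the `wget` snippet of this README; the function names are hypothetical, and only the base URL and the two path patterns come from the text:

```python
# Base URL and path patterns as they appear in the README download snippet.
BASE_URL = "http://paddle-inference-dist.bj.bcebos.com/int8/"


def model_file_name(model_name, qat2):
    """Return the archive path relative to BASE_URL for a QAT1.0 or QAT2.0 model."""
    if qat2:
        # QAT2.0 models use lowercase names, e.g. resnet50
        return f"QAT2_models/{model_name}_quant.tar.gz"
    # QAT1.0 models use mixed-case names, e.g. ResNet50
    return f"QAT_models/{model_name}_qat_model.tar.gz"


def model_url(model_name, qat2):
    """Return the full download URL for the given model."""
    return BASE_URL + model_file_name(model_name, qat2)


print(model_url("ResNet50", qat2=False))
print(model_url("resnet50", qat2=True))
```

Note that the model names differ in case between the two lists, so `ResNet50` is only valid with `qat2=False` and `resnet50` only with `qat2=True`.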
-* ### Commands to reproduce benchmark
-You can run `qat_int8_comparison.py` with the following arguments to reproduce the accuracy result on ResNet50. The difference of command line between the QAT1.0 MKL-DNN and QAT2.0 MKL-DNN is that we use argument `qat2` to enable QAT2.0 MKL-DNN. To perform QAT2.0 MKL-DNN the performance test, the environmental variable `OMP_NUM_THREADS=1` and `batch_size=1` parameter should be set.
+### Commands to reproduce benchmark
+
+#### Image classification
+You can use the `qat_int8_image_classification_comparison.py` script to reproduce the accuracy result on ResNet50. The difference between commands used in the QAT1.0 MKL-DNN and QAT2.0 MKL-DNN is that for QAT2.0 MKL-DNN two additional options are required: the `--qat2` option to enable QAT2.0 MKL-DNN, and the `--quantized_ops` option with a comma-separated list of operators to be quantized. To perform the QAT2.0 MKL-DNN performance test, the environment variable `OMP_NUM_THREADS=1` and `--batch_size=1` option should be set.
 >*QAT1.0*

 - Accuracy benchmark command on QAT1.0 models

 ```bash
 cd /PATH/TO/PADDLE
-OMP_NUM_THREADS=28 FLAGS_use_mkldnn=true python python/paddle/fluid/contrib/slim/tests/qat_int8_comparison.py --qat_model=/PATH/TO/DOWNLOAD/MODEL/${MODEL_NAME}/model --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=50 --batch_num=1000 --acc_diff_threshold=0.001
+OMP_NUM_THREADS=28 FLAGS_use_mkldnn=true python python/paddle/fluid/contrib/slim/tests/qat_int8_image_classification_comparison.py --qat_model=/PATH/TO/DOWNLOAD/MODEL/${MODEL_NAME}/model --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=50 --batch_num=1000 --acc_diff_threshold=0.001
 ```
 >*QAT2.0*

 - Accuracy benchmark command on QAT2.0 models
 ```bash
 cd /PATH/TO/PADDLE
-OMP_NUM_THREADS=28 FLAGS_use_mkldnn=true python python/paddle/fluid/contrib/slim/tests/qat_int8_comparison.py --qat_model=/PATH/TO/DOWNLOAD/MODEL/${MODEL_NAME} --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=50 --batch_num=1000 --acc_diff_threshold=0.01 --qat2
+OMP_NUM_THREADS=28 FLAGS_use_mkldnn=true python python/paddle/fluid/contrib/slim/tests/qat_int8_image_classification_comparison.py --qat_model=/PATH/TO/DOWNLOAD/MODEL/${MODEL_NAME}_quant --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=50 --batch_num=1000 --acc_diff_threshold=0.01 --qat2 --quantized_ops="conv2d,pool2d"
 ```
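
The `--quantized_ops` option takes a comma-separated string, while the Python example for `Qat2Int8MkldnnPass` passes a set such as `{'conv2d', 'pool2d'}`. A minimal sketch of how such a string could be turned into that set; `parse_quantized_ops` is a hypothetical helper written for illustration, not necessarily how the test scripts do it:

```python
def parse_quantized_ops(arg):
    """Split a comma-separated operator list into a set of operator names.

    Surrounding whitespace and empty entries are dropped, so
    "conv2d, pool2d," is treated the same as "conv2d,pool2d".
    """
    return {name.strip() for name in arg.split(",") if name.strip()}


ops = parse_quantized_ops("conv2d,pool2d")
print(sorted(ops))
```

A set fits here because the order of operators does not matter and duplicates carry no meaning.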

 * Performance benchmark command on QAT2.0 models

-```bash
-# 1. Save QAT2.0 INT8 model
-cd /PATH/TO/PADDLE/build
-python ../python/paddle/fluid/contrib/slim/tests/save_qat_model.py --qat_model_path /PATH/TO/DOWNLOAD/MODEL/${QAT2_MODEL_NAME} --int8_model_save_path /PATH/TO/${QAT2_MODEL_NAME}_qat_int8
+In order to run performance benchmark, follow the steps below.

-# 2. Run the QAT2.0 C-API for performance benchmark
-cd /PATH/TO/PADDLE/build
-OMP_NUM_THREADS=1 paddle/fluid/inference/tests/api/test_analyzer_qat_image_classification ARGS --enable_fp32=false --with_accuracy_layer=false --int8_model=/PATH/TO/${QAT2_MODEL_NAME}_qat_int8 --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=1 --paddle_num_threads=1
-```
+1. Save QAT2.0 INT8 model. You can use the script `save_qat_model.py` for this purpose. It also requires the option `--quantized_ops`  to indicate which operators are to be quantized.
+
+   ```bash
+   cd /PATH/TO/PADDLE/build
+   python ../python/paddle/fluid/contrib/slim/tests/save_qat_model.py --qat_model_path=/PATH/TO/DOWNLOAD/MODEL/${QAT2_MODEL_NAME} --int8_model_save_path=/PATH/TO/${QAT2_MODEL_NAME}_qat_int8 --quantized_ops="conv2d,pool2d"
+   ```
+
+2. Run the QAT2.0 C-API test for performance benchmark.
+
+   ```bash
+   cd /PATH/TO/PADDLE/build
+   OMP_NUM_THREADS=1 paddle/fluid/inference/tests/api/test_analyzer_qat_image_classification ARGS --enable_fp32=false --with_accuracy_layer=false --int8_model=/PATH/TO/${QAT2_MODEL_NAME}_qat_int8 --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=1 --paddle_num_threads=1
+   ```

 > Notes: Due to a large amount of images contained in `int8_full_val.bin` dataset (50 000), the accuracy benchmark which includes comparison of unoptimized and optimized QAT model may last long (even several hours). To accelerate the process, it is recommended to set `OMP_NUM_THREADS` to the max number of physical cores available on the server.