Unverified commit fce37bc5, authored by Wojciech Uss, committed by GitHub

Doc update with Ernie QAT INT8 benchmarking (#22519)

* Doc update with Ernie QAT INT8 benchmarking

test=develop

* fixes after review

test=develop

* remove ernie part, test=develop test=document_fix

* Fix model name for qatv2

test=develop test=document_fix

* Add Ernie data

test=develop test=document_fix

* update ERNIE benchmark with baidu QA results, test=develop test=document_fix
Co-authored-by: bingyanghuang <33643817+bingyanghuang@users.noreply.github.com>
Co-authored-by: Michał Gallus <sand3r@interia.eu>
Parent: 7cf648b3
# SLIM Quantization-aware training (QAT) on INT8 MKL-DNN
This document describes how to use [Paddle Slim](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/advanced_usage/paddle_slim/paddle_slim.md) to convert a quantization-aware trained model into an INT8 MKL-DNN quantized model. In **Release 1.5**, we released QAT1.0 MKL-DNN, which enabled the INT8 MKL-DNN kernel for QAT-trained models within a 0.05% accuracy difference on GoogleNet, MobileNet-V1, MobileNet-V2, ResNet-101, ResNet-50, VGG16 and VGG19. In **Release 1.6**, QAT2.0 MKL-DNN, we optimized the performance of fake QAT models: ResNet50, ResNet101, MobileNet-V1, MobileNet-V2, VGG16 and VGG19, with a minor accuracy drop. Compared with Release 1.5, QAT2.0 MKL-DNN achieves a larger inference performance gain over fake QAT models, at the cost of a slightly bigger accuracy difference. In **Release 1.7**, support for the [Ernie (NLP) QAT trained model](https://github.com/PaddlePaddle/benchmark/tree/master/Inference/c%2B%2B/ernie) was added to QAT2.0 MKL-DNN. We provide accuracy benchmarks for both QAT1.0 MKL-DNN and QAT2.0 MKL-DNN, and a performance benchmark for QAT2.0 MKL-DNN.
Notes:
* MKL-DNN and MKL are required. The performance gain can only be obtained on CPU servers supporting the AVX512 instruction set.
* INT8 accuracy is best on CPU servers supporting the AVX512 VNNI extension (see the check below).
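If you are unsure whether your server meets these requirements, you can check the CPU flags reported by the Linux kernel. The snippet below is only a sanity check; `avx512f` and `avx512_vnni` are the flag names used in `/proc/cpuinfo` on recent kernels:

```bash
# list the AVX512-related CPU flags; look for avx512f (AVX512) and avx512_vnni (VNNI)
lscpu | grep -o 'avx512[a-z_]*' | sort -u
```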
## 0. Prerequisite
You need to install at least the PaddlePaddle-1.7 python package: `pip install paddlepaddle==1.7`.
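A minimal install-and-verify sketch is shown below. It assumes a working Python 3 environment with `pip`; the second command only prints the installed package version:

```bash
pip install paddlepaddle==1.7
# confirm the installed version
python -c "import paddle; print(paddle.__version__)"
```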
## 1. How to generate INT8 MKL-DNN QAT model
You can refer to the unit test in [test_quantization_mkldnn_pass.py](test_quantization_mkldnn_pass.py). Users first apply the PaddleSlim quantization strategy to obtain a saved fake QAT model via [QuantizationFreezePass](https://github.com/PaddlePaddle/models/tree/develop/PaddleSlim/quant_low_level_api), then use the `QatInt8MkldnnPass` (from QAT1.0 MKL-DNN) to get a graph which can be run with MKL-DNN INT8 kernels. In Paddle Release 1.6, this pass supports `conv2d` and `depthwise_conv2d` ops with channel-wise quantization for weights. Apart from it, another pass called `Qat2Int8MkldnnPass` (from QAT2.0 MKL-DNN) is available. In Release 1.6, this pass additionally supports the `pool2d` op and allows users to transform their QAT model into a highly performance-optimized INT8 model that is run using INT8 MKL-DNN kernels. In Release 1.7, support for the `fc`, `reshape2` and `transpose2` ops was added to this pass.
```python
import paddle.fluid as fluid
from paddle.fluid.contrib.slim.quantization import QatInt8MkldnnPass
from paddle.fluid.contrib.slim.quantization import Qat2Int8MkldnnPass
from paddle.fluid.framework import IrGraph
from paddle.fluid import core
# ... (create the IrGraph `graph` from the saved fake QAT model here)
place = fluid.CPUPlace()
# Convert the IrGraph to MKL-DNN supported INT8 IrGraph by using
# QAT1.0 MKL-DNN
# QatInt8MkldnnPass
mkldnn_pass = QatInt8MkldnnPass(fluid.global_scope(), place)
# Apply QatInt8MkldnnPass to IrGraph
mkldnn_pass.apply(graph)
# QAT2.0 MKL-DNN
# Qat2Int8MkldnnPass, it requires a list of operators to be quantized
mkldnn_pass = Qat2Int8MkldnnPass({'conv2d', 'pool2d'}, fluid.global_scope(), place, fluid.core, False)
# Apply Qat2Int8MkldnnPass to IrGraph
mkldnn_pass.apply(graph)
```
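If you want to exercise both passes through the referenced unit test, one straightforward way is to run the test file directly. This is only a sketch; it assumes PaddlePaddle is installed in the current Python environment and that the command is issued from the Paddle source root:

```bash
# run the MKL-DNN quantization pass unit test
python python/paddle/fluid/contrib/slim/tests/test_quantization_mkldnn_pass.py
```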
...

| Model | FP32 Top1 Accuracy | INT8 QAT Top1 Accuracy | Top1 Diff | FP32 Top5 Accuracy | INT8 QAT Top5 Accuracy | Top5 Diff |
|:-----:|:------------------:|:----------------------:|:---------:|:------------------:|:----------------------:|:---------:|
| VGG16 | 71.74%             | 71.75%                 | +0.01%    | 89.96%             | 89.73%                 | -0.23%    |
| VGG19 | 72.30%             | 72.09%                 | -0.21%    | 90.19%             | 90.13%                 | -0.06%    |
>**III. QAT2.0 MKL-DNN C-API Performance on Intel(R) Xeon(R) Gold 6271**

| Model        | FP32 Optimized Throughput (images/s) | INT8 QAT Throughput (images/s) | Ratio (INT8/FP32) |
|:------------:|:------------------------------------:|:------------------------------:|:-----------------:|
| MobileNet-V1 | 73.98                                | 227.73                         | 3.08              |
| MobileNet-V2 | 86.59                                | 206.74                         | 2.39              |
| ResNet101    | 7.15                                 | 26.69                          | 3.73              |
...

Notes:
* FP32 Optimized Throughput (images/s) is from [int8_mkldnn_quantization.md](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md).
>**IV. Ernie QAT2.0 MKL-DNN Accuracy on Intel(R) Xeon(R) Gold 6271**

| Model | FP32 Accuracy | QAT INT8 Accuracy | Accuracy Diff |
|:------------:|:----------------------:|:----------------------:|:---------:|
| Ernie | 0.80 | 0.82 | +0.02 |
>**V. Ernie QAT2.0 MKL-DNN Performance on Intel(R) Xeon(R) Gold 6271**

| Threads    | FP32 Latency (ms) | QAT INT8 Latency (ms) | Ratio (FP32/INT8) |
|:----------:|:-----------------:|:---------------------:|:-----------------:|
| 1 thread | 252.131 | 93.8023 | 2.687x |
| 20 threads | 29.1853 | 17.3765 | 1.680x |
## 3. How to reproduce the results
Three steps are needed to reproduce the accuracy and performance results above. Below we explain the steps using ResNet50 as an example of the image classification models. To reproduce the NLP (Ernie) results, please follow [this guide](https://github.com/PaddlePaddle/benchmark/tree/master/Inference/c%2B%2B/ernie/mkldnn/README.md).
### Prepare dataset
#### Image classification
To download and convert the dataset for benchmarking the image classification models, execute:
```bash
cd /PATH/TO/PADDLE
python paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py
```
The converted data binary file is saved by default in `$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin`.
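As a quick sanity check that the preprocessing finished, you can verify that the binary file exists:

```bash
ls -lh $HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin
```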
### Prepare model
#### Image classification
The exemplary code snippet below downloads a ResNet50 QAT model. There are two different versions of the same model because there are two different QAT training strategies: one for a non-optimized and one for an optimized graph transform, which correspond to QAT1.0 and QAT2.0 respectively.
```bash
mkdir -p /PATH/TO/DOWNLOAD/MODEL/
cd /PATH/TO/DOWNLOAD/MODEL/
# uncomment for QAT1.0 MKL-DNN
# export MODEL_NAME=ResNet50
# export MODEL_FILE_NAME=QAT_models/${MODEL_NAME}_qat_model.tar.gz
# uncomment for QAT2.0 MKL-DNN
# export MODEL_NAME=resnet50
# export MODEL_FILE_NAME=QAT2_models/${MODEL_NAME}_quant.tar.gz
wget http://paddle-inference-dist.bj.bcebos.com/int8/${MODEL_FILE_NAME}
mkdir ${MODEL_NAME} && tar -xvf $(basename ${MODEL_FILE_NAME}) -C ${MODEL_NAME}
```
Extract the downloaded model into the folder. To verify all the models, you need to set `MODEL_NAME` to one of the following values in the command line (a concrete example follows the list below):
```text
QAT1.0 models
MODEL_NAME=ResNet50, ResNet101, GoogleNet, MobileNetV1, MobileNetV2, VGG16, VGG19
QAT2.0 models
MODEL_NAME=resnet50, resnet101, mobilenetv1, mobilenetv2, vgg16, vgg19
```
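For instance, to fetch and unpack the QAT2.0 MobileNet-V1 model, the variables from the snippet above would be set as follows (this simply instantiates the commands already shown; the URL and archive naming are taken from that snippet):

```bash
cd /PATH/TO/DOWNLOAD/MODEL/
export MODEL_NAME=mobilenetv1
export MODEL_FILE_NAME=QAT2_models/${MODEL_NAME}_quant.tar.gz
wget http://paddle-inference-dist.bj.bcebos.com/int8/${MODEL_FILE_NAME}
mkdir ${MODEL_NAME} && tar -xvf $(basename ${MODEL_FILE_NAME}) -C ${MODEL_NAME}
```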
### Commands to reproduce benchmark
#### Image classification
You can use the `qat_int8_image_classification_comparison.py` script to reproduce the accuracy result on ResNet50. The difference between the commands used for QAT1.0 MKL-DNN and QAT2.0 MKL-DNN is that QAT2.0 MKL-DNN requires two additional options: `--qat2`, which enables QAT2.0 MKL-DNN, and `--quantized_ops`, which takes a comma-separated list of operators to be quantized. To perform the QAT2.0 MKL-DNN performance test, the environment variable `OMP_NUM_THREADS=1` and the `--batch_size=1` option should be set.
>*QAT1.0*
- Accuracy benchmark command on QAT1.0 models
```bash
cd /PATH/TO/PADDLE
OMP_NUM_THREADS=28 FLAGS_use_mkldnn=true python python/paddle/fluid/contrib/slim/tests/qat_int8_image_classification_comparison.py --qat_model=/PATH/TO/DOWNLOAD/MODEL/${MODEL_NAME}/model --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=50 --batch_num=1000 --acc_diff_threshold=0.001
```
>*QAT2.0*
- Accuracy benchmark command on QAT2.0 models
```bash
cd /PATH/TO/PADDLE
OMP_NUM_THREADS=28 FLAGS_use_mkldnn=true python python/paddle/fluid/contrib/slim/tests/qat_int8_image_classification_comparison.py --qat_model=/PATH/TO/DOWNLOAD/MODEL/${MODEL_NAME}_quant --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=50 --batch_num=1000 --acc_diff_threshold=0.01 --qat2 --quantized_ops="conv2d,pool2d"
```
* Performance benchmark command on QAT2.0 models
In order to run the performance benchmark, follow the steps below.

1. Save the QAT2.0 INT8 model. You can use the script `save_qat_model.py` for this purpose. It also requires the `--quantized_ops` option to indicate which operators are to be quantized.

   ```bash
   cd /PATH/TO/PADDLE/build
   python ../python/paddle/fluid/contrib/slim/tests/save_qat_model.py --qat_model_path=/PATH/TO/DOWNLOAD/MODEL/${QAT2_MODEL_NAME} --int8_model_save_path=/PATH/TO/${QAT2_MODEL_NAME}_qat_int8 --quantized_ops="conv2d,pool2d"
   ```

2. Run the QAT2.0 C-API test for the performance benchmark.

   ```bash
   cd /PATH/TO/PADDLE/build
   OMP_NUM_THREADS=1 paddle/fluid/inference/tests/api/test_analyzer_qat_image_classification ARGS --enable_fp32=false --with_accuracy_layer=false --int8_model=/PATH/TO/${QAT2_MODEL_NAME}_qat_int8 --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=1 --paddle_num_threads=1
   ```
> Notes: Due to the large number of images in the `int8_full_val.bin` dataset (50 000), the accuracy benchmark, which includes a comparison of the unoptimized and optimized QAT models, may take a long time (even several hours). To accelerate the process, it is recommended to set `OMP_NUM_THREADS` to the maximum number of physical cores available on the server.
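One way to determine that number on Linux is sketched below; it counts the unique physical core/socket pairs reported by `lscpu`, which on most servers equals the number of physical cores:

```bash
# set OMP_NUM_THREADS to the number of physical cores
export OMP_NUM_THREADS=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
```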