Unverified commit 2383a9f7, authored by Wojciech Uss, committed by GitHub

[Doc update] Update for QAT INT8 MKL-DNN document (#23361)

* Update for QAT INT8 MKL-DNN document, added info on VNNI in Windows, benchmark results added and updated
Parent ed5766ff
# SLIM Quantization-aware training (QAT) for INT8 MKL-DNN
This document describes how to use [Paddle Slim](https://paddlepaddle.github.io/PaddleSlim/index.html) to convert a quantization-aware trained model into an INT8 MKL-DNN quantized model and run it.
In **Release 1.5**, we released the first approach to the MKL-DNN-based quantization of QAT models, called QAT1. It enabled the `conv2d` and `mul` INT8 MKL-DNN kernels for QAT trained models (GoogleNet, MobileNetV1, MobileNetV2, ResNet50, ResNet101, VGG16, and VGG19) with an accuracy difference within 0.05%.
In **Release 1.6**, a new approach called QAT2 was introduced, which adds support for more performance optimizations and more INT8 MKL-DNN kernels. INT8 MKL-DNN models obtained using QAT2 have much better inference performance than those obtained using QAT1, with only a slightly larger accuracy difference.
In **Release 1.7**, support for the [Ernie (NLP) QAT trained model](https://github.com/PaddlePaddle/benchmark/tree/master/Inference/c%2B%2B/ernie/mkldnn) was added to QAT2.
In this document we focus on the QAT2 approach only.
## 0. Prerequisites
* PaddlePaddle in version 1.7.1 or higher is required. For instructions on how to install it see the [installation document](https://www.paddlepaddle.org.cn/install/quick).
* MKL-DNN and MKL are required. The highest performance gain can be observed using CPU servers supporting AVX512 instructions.
* INT8 accuracy is best on CPU servers supporting the AVX512 VNNI extension (e.g. CLX class Intel processors). A Linux server supports AVX512 VNNI instructions if the output of the command `lscpu` contains the `avx512_vnni` entry in the `Flags` section. AVX512 VNNI support on Windows can be checked using the [`coreinfo`](https://docs.microsoft.com/en-us/sysinternals/downloads/coreinfo) tool.
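The `lscpu` check can also be scripted. The following is a minimal sketch, assuming a Linux system where `/proc/cpuinfo` lists the same CPU flags that `lscpu` reports. The helper names `cpu_flags` and `supports_avx512_vnni` are hypothetical and are not part of PaddlePaddle.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # The first "flags" entry is enough: all cores of one CPU share it.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports_avx512_vnni(cpuinfo_text):
    """Return True if the avx512_vnni flag is listed (hypothetical helper)."""
    return "avx512_vnni" in cpu_flags(cpuinfo_text)
```

On a Linux server the check could be run as `supports_avx512_vnni(open("/proc/cpuinfo").read())`. The sketch does not apply to Windows, where the `coreinfo` tool is the documented option.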
## 1. Introduction
There are two forms of quantization supported in PaddlePaddle: [post-training quantization](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md) (PTQ) and quantization-aware training (QAT). With either PTQ or QAT, a user can convert models created by PaddleSlim into INT8 models and run INT8 inference on CPU. PTQ is more automatic and requires less model preparation than QAT, but QAT usually gives better accuracy with similar performance. This document focuses on the QAT2 approach to QAT and INT8 quantization.
## 2. How to turn an FP32 model into a QAT model?
The procedure for transforming an FP32 model into a QAT model supported by the QAT2 approach is described in [this document](https://github.com/PaddlePaddle/PaddleSlim/blob/80c9fab3f419880dd19ca6ea30e0f46a2fedf6b3/demo/mkldnn_quant/quant_aware/PaddleCV_mkldnn_quantaware_tutorial.md).
## 3. How to turn a QAT model into an INT8 MKL-DNN model?
A QAT model can be transformed into an INT8 quantized model if it contains enough information about quantization scales for every quantized operator in the graph. The process of quantization is done by the `Qat2Int8MkldnnPass` pass which comprises several steps:
### Gathering scales
The information about the quantization scales is collected from three types of operators:
* `fake_quantize_moving_average_abs_max` - imitates INT8 quantization of FP32 tensors, but keeps quantized output values as floats; it is used before a quantized operator (e.g. `conv2d`) to gather scale information for the op's input.
* `fake_dequantize_max_abs` - imitates dequantization of INT8 tensors back into floats; it is used after a quantized operator and contains the scale used for dequantizing the op's weights.
* `fake_quantize_dequantize_moving_average_abs_max` - imitates immediate quantization and dequantization; it can be used after a quantized operator to get the scale value for the op's output.
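To make the idea of "fake" quantization concrete, here is a minimal sketch of a symmetric max-abs scheme in plain Python. It is an illustration under simplifying assumptions, not the PaddlePaddle kernels: the moving-average update of the scale is not modeled, the scale is taken as the maximum absolute value of the tensor, and all function names are hypothetical.

```python
def abs_max_scale(values):
    """Scale taken as the largest absolute value in the tensor."""
    return max(abs(v) for v in values)


def fake_quantize(values, scale, bits=8):
    """Round values onto the signed integer grid but keep them as floats."""
    max_int = 2 ** (bits - 1) - 1  # 127 for INT8
    quantized = []
    for v in values:
        q = round(v / scale * max_int)  # Python rounds exact ties to even
        q = max(-max_int, min(max_int, q))  # clip values outside the scale
        quantized.append(float(q))
    return quantized


def fake_quantize_dequantize(values, scale, bits=8):
    """Quantize and immediately map back to the original float range."""
    max_int = 2 ** (bits - 1) - 1
    return [q * scale / max_int for q in fake_quantize(values, scale, bits)]
```

The output of `fake_quantize` holds integer values stored as floats, while `fake_quantize_dequantize` returns values close to the input that carry the rounding error INT8 inference would introduce.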
Notes:
1. As the next steps describe, quantization will be applied later to an optimized FP32 model. It means that quantization scales for inputs and outputs of each quantized operator have to be gathered for tensors which are inputs and outputs of already optimized or fused operators. For example, if a model contains the following sequence of tensors and operators in the graph
```... → input1 → conv2d → output1 → batch_norm → output2 → relu → output3 → ...```
and we want to quantize the `conv2d` op, then after applying FP32 optimizations the sequence will become
```... → input1 → conv2d → output3 → ...```
and the quantization scales have to be collected for the `input1` and `output3` tensors in the QAT model.
2. Quantization of the following operators is supported: `conv2d`, `depthwise_conv2d`, `mul`, `fc`, `pool2d`, `reshape2`, `transpose2`, `concat`.
3. The longer the sequence of consecutive quantizable operators in the model, the bigger the performance boost that can be achieved through quantization:
```... → conv2d → conv2d → pool2d → conv2d → conv2d → ...```
Quantizing a single operator separated from other quantizable operators can give no performance benefit or even slow down the inference:
```... → swish → fc → softmax → ...```
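The last note can be checked on a model before choosing which operators to quantize. The sketch below is a simplified illustration that treats the model as a linear chain of operator types rather than a real graph; the function `quantizable_runs` is hypothetical, and only the operator list is taken from the notes.

```python
QUANTIZABLE_OPS = {
    "conv2d", "depthwise_conv2d", "mul", "fc",
    "pool2d", "reshape2", "transpose2", "concat",
}


def quantizable_runs(op_sequence, quantizable=QUANTIZABLE_OPS):
    """Split a linear op sequence into maximal runs of quantizable ops."""
    runs, current = [], []
    for op in op_sequence:
        if op in quantizable:
            current.append(op)
        elif current:
            # A non-quantizable op ends the current run.
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs
```

For the two sequences in the note, `conv2d → conv2d → pool2d → conv2d → conv2d` forms one run of five operators, while `swish → fc → softmax` yields a single isolated `fc`, which is the case where quantization may not pay off.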
### Removing fake operators
All the `fake_quantize_*` and `fake_dequantize_*` operators are removed from the graph.
### Dequantizing weights
Weights of `conv2d`, `depthwise_conv2d` and `mul` operators are assumed to be fake-quantized (with integer values in the `int8` range, but kept as `float`s) in QAT models. Here, the information about the scale from `fake_dequantize_max_abs` operators is used to fake-dequantize the weights back to the full float range of values. At this point the model becomes an unoptimized clean FP32 inference model.
### Optimizing FP32 graph
A series of standard optimization passes is applied to the FP32 graph. This gives us an optimized FP32 inference model, and we can proceed with INT8 quantization.
### Computing weight scales
After the optimization fuses are applied, the weight tensors of `conv2d` or `fc` operators are likely to have different values and require new quantization scales. The weights are static, i.e. they do not change during the inference process, and the scales can be calculated simply as the maximum of the absolute values in the tensor. To improve the inference accuracy we calculate the scales for each output channel separately, getting an array of quantization scales for a weight tensor.
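A minimal sketch of the per-channel computation, in plain Python. It assumes the weight tensor is given as a list of output channels, each flattened to a list of floats (that layout is an assumption made for the example), and the function names are hypothetical.

```python
def channel_wise_scales(weights):
    """Return max(|w|) for every output channel of a weight tensor.

    `weights` is a list of output channels, each a flat list of floats.
    """
    return [max(abs(w) for w in channel) for channel in weights]


def tensor_wise_scale(weights):
    """Single scale for the whole tensor, shown for comparison."""
    return max(channel_wise_scales(weights))


# Two output channels with very different value ranges.
weights = [[0.5, -2.0], [0.01, -0.02]]
print(channel_wise_scales(weights))
print(tensor_wise_scale(weights))
```

With a single tensor-wide scale, the second channel would occupy only a tiny part of the INT8 range; a separate scale per channel lets each channel use the full range, which is why the per-channel variant improves accuracy.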
### Taking activations into account
The basic datatype used during INT8 inference is signed INT8, with possible values from -128 to 127. However, if a `conv2d` or `fc` operator has a `relu` or `relu6` activation integrated into it, the output of the operator is known to have non-negative values. In that case we use the unsigned INT8 datatype for output tensors, with a wider range for positive values (0 to 255), improving the inference accuracy further.
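The effect of the wider positive range can be illustrated with a small sketch. It is a simplified model of quantization written for this explanation, not the MKL-DNN implementation: values are scaled by a max-abs `scale`, rounded, and clipped, and all function names are hypothetical.

```python
def quantize(values, scale, unsigned=False):
    """Map floats to integers in [-128, 127] (signed) or [0, 255] (unsigned)."""
    hi = 255 if unsigned else 127
    lo = 0 if unsigned else -128
    return [max(lo, min(hi, round(v / scale * hi))) for v in values]


def dequantize(q_values, scale, unsigned=False):
    """Map quantized integers back to floats."""
    hi = 255 if unsigned else 127
    return [q * scale / hi for q in q_values]


def max_round_trip_error(values, scale, unsigned=False):
    """Largest absolute error after quantizing and dequantizing."""
    restored = dequantize(quantize(values, scale, unsigned), scale, unsigned)
    return max(abs(a - b) for a, b in zip(values, restored))
```

For non-negative data with `scale = 1.0`, the rounding error is bounded by `0.5 / 127` with the signed type and by `0.5 / 255` with the unsigned type, so the unsigned representation roughly halves the worst-case error.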
### Propagation of scales
Some of the operators (e.g. `reshape2`, `transpose2`, `pool2d` with max pooling) transform the data without changing the quantization scale. For this reason we propagate the quantization scale values through these operators without any modifications. We propagate the quantization scales also through the `scale` operator, updating the quantization scale accordingly. This approach lets us minimize the number of `fake_quantize` and `fake_dequantize` operators in the graph, because the information about the scales required for the quantization process to succeed spreads between quantized operators.
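The propagation step can be sketched on a toy graph. The example below is only an illustration under stated assumptions: operators are given as `(op_type, input, output, factor)` tuples in execution order, scales are max-abs values, the `scale` operator is assumed to have no bias so that its output scale is the input scale multiplied by `|factor|`, and scales are propagated from inputs to outputs only. None of these names come from the actual pass.

```python
# pool2d preserves the scale only for max pooling (assumed here).
SCALE_PRESERVING_OPS = {"reshape2", "transpose2", "pool2d"}


def propagate_scales(ops, scales):
    """Spread known tensor scales through scale-preserving operators.

    ops: list of (op_type, input_name, output_name, factor) tuples;
         `factor` is used only by the `scale` operator.
    scales: dict mapping tensor names to known max-abs scales.
    """
    scales = dict(scales)
    changed = True
    while changed:  # repeat until no new scale can be derived
        changed = False
        for op_type, src, dst, factor in ops:
            if op_type in SCALE_PRESERVING_OPS:
                multiplier = 1.0
            elif op_type == "scale":
                multiplier = abs(factor)
            else:
                continue
            if src in scales and dst not in scales:
                scales[dst] = scales[src] * multiplier
                changed = True
    return scales
```

Given a known scale for the output of a `conv2d`, the sketch fills in the scales of the tensors that follow it through `reshape2`, `transpose2`, and `scale` operators, so no additional fake operators are needed for those tensors.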
### Applying quantization passes
Having gathered all the data needed for quantization, we apply the `cpu_quantize_pass`, which quantizes the graph, and the `cpu_quantize_squash_pass`, which optimizes the INT8 graph.
## 4. Code example
The code snippet shows how the `Qat2Int8MkldnnPass` can be applied to a model graph:
```python
import paddle.fluid as fluid
from paddle.fluid.contrib.slim.quantization import Qat2Int8MkldnnPass
from paddle.fluid.framework import IrGraph
from paddle.fluid import core

# Create the IrGraph by Program
graph = IrGraph(core.Graph(fluid.Program().desc), for_test=False)
place = fluid.CPUPlace()
# Convert the IrGraph to MKL-DNN supported INT8 IrGraph using the
# Qat2Int8MkldnnPass. It requires a list of operators to be quantized
mkldnn_pass = Qat2Int8MkldnnPass({'conv2d', 'pool2d'}, fluid.global_scope(), place, fluid.core, False)
# Apply Qat2Int8MkldnnPass to IrGraph
mkldnn_pass.apply(graph)
```
## 5. Accuracy and Performance benchmark
This section contains QAT2 MKL-DNN accuracy and performance benchmark results measured on two servers:
* Intel(R) Xeon(R) Gold 6271 (with AVX512 VNNI support),
* Intel(R) Xeon(R) Gold 6148.
Performance benchmarks were run with the following environment settings:
* The benchmark threads were assigned to cores by setting
```bash
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
```
* Turbo Boost was set to OFF using the command
```bash
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```
### Image classification models benchmark results
#### Accuracy
>**Intel(R) Xeon(R) Gold 6271**
| Model | FP32 Top1 Accuracy | INT8 QAT Top1 Accuracy | Top1 Diff | FP32 Top5 Accuracy | INT8 QAT Top5 Accuracy | Top5 Diff |
| :----------: | :----------------: | :--------------------: | :-------: | :----------------: | :--------------------: | :-------: |
| MobileNet-V1 | 70.78% | 70.71% | -0.07% | 89.69% | 89.41% | -0.28% |
| MobileNet-V2 | 71.90% | 72.11% | +0.21% | 90.56% | 90.62% | +0.06% |
| ResNet101 | 77.50% | 77.64% | +0.14% | 93.58% | 93.58% | 0.00% |
| ResNet50 | 76.63% | 76.47% | -0.16% | 93.10% | 92.98% | -0.12% |
| VGG16 | 72.08% | 71.73% | -0.35% | 90.63% | 89.71% | -0.92% |
| VGG19 | 72.57% | 72.12% | -0.45% | 90.84% | 90.15% | -0.69% |
>**Intel(R) Xeon(R) Gold 6148**
| Model | FP32 Top1 Accuracy | INT8 QAT Top1 Accuracy | Top1 Diff | FP32 Top5 Accuracy | INT8 QAT Top5 Accuracy | Top5 Diff |
| :----------: | :----------------: | :--------------------: | :-------: | :----------------: | :--------------------: | :-------: |
| MobileNet-V1 | 70.78% | 70.85% | +0.07% | 89.69% | 89.41% | -0.28% |
| MobileNet-V2 | 71.90% | 72.08% | +0.18% | 90.56% | 90.66% | +0.10% |
| ResNet101 | 77.50% | 77.51% | +0.01% | 93.58% | 93.50% | -0.08% |
| ResNet50 | 76.63% | 76.55% | -0.08% | 93.10% | 92.96% | -0.14% |
| VGG16 | 72.08% | 71.72% | -0.36% | 90.63% | 89.75% | -0.88% |
| VGG19 | 72.57% | 72.08% | -0.49% | 90.84% | 90.11% | -0.73% |
#### Performance
Image classification models performance was measured using a single thread. The setting is included in the benchmark reproduction commands below.
>**Intel(R) Xeon(R) Gold 6271**
| Model | FP32 (images/s) | INT8 QAT (images/s) | Ratio (INT8/FP32) |
| :----------: | :-------------: | :-----------------: | :---------------: |
| MobileNet-V1 | 74.36 | 210.68 | 2.83 |
| MobileNet-V2 | 89.59 | 186.55 | 2.08 |
| ResNet101 | 7.21 | 26.41 | 3.67 |
| ResNet50 | 13.23 | 48.89 | 3.70 |
| VGG16 | 3.49 | 10.11 | 2.90 |
| VGG19 | 2.84 | 8.69 | 3.06 |
>**Intel(R) Xeon(R) Gold 6148**
| Model | FP32 (images/s) | INT8 QAT (images/s) | Ratio (INT8/FP32) |
| :----------: | :-------------: | :-----------------: | :---------------: |
| MobileNet-V1 | 75.23 | 111.15 | 1.48 |
| MobileNet-V2 | 86.65 | 127.21 | 1.47 |
| ResNet101 | 6.61 | 10.60 | 1.60 |
| ResNet50 | 12.42 | 19.74 | 1.59 |
| VGG16 | 3.31 | 4.74 | 1.43 |
| VGG19 | 2.68 | 3.91 | 1.46 |
Notes:
* Performance FP32 (images/s) values come from the [INT8 MKL-DNN post-training quantization](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md) document.
### NLP models benchmark results
#### Accuracy
>**Intel(R) Xeon(R) Gold 6271**
| Model | FP32 Accuracy | QAT INT8 Accuracy | Accuracy Diff |
|:------------:|:----------------------:|:----------------------:|:---------:|
| Ernie | 80.20% | 79.88% | -0.32% |
>**Intel(R) Xeon(R) Gold 6148**
| Model | FP32 Accuracy | QAT INT8 Accuracy | Accuracy Diff |
| :---: | :-----------: | :---------------: | :-----------: |
| Ernie | 80.20% | 79.64% | -0.56% |
#### Performance
>**Intel(R) Xeon(R) Gold 6271**
| Model | Threads | FP32 Latency (ms) | QAT INT8 Latency (ms) | Ratio (FP32/INT8) |
|:------------:|:----------------------:|:-------------------:|:---------:|:---------:|
| Ernie | 1 thread | 256.11 | 93.80 | 2.73 |
| Ernie | 20 threads | 30.06 | 16.88 | 1.78 |
>**Intel(R) Xeon(R) Gold 6148**
| Model | Threads | FP32 Latency (ms) | QAT INT8 Latency (ms) | Ratio (FP32/INT8) |
| :---: | :--------: | :---------------: | :-------------------: | :---------------: |
| Ernie | 1 thread | 254.20 | 169.54 | 1.50 |
| Ernie | 20 threads | 30.99 | 21.81 | 1.42 |
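The two ratio columns in the performance tables above are computed in opposite directions: throughput ratios are INT8/FP32, while latency ratios are FP32/INT8, so that a value above 1 means INT8 is faster in both cases. A minimal sketch of the arithmetic, with hypothetical function names:

```python
def int8_speedup_from_throughput(fp32_images_per_s, int8_images_per_s):
    """Ratio (INT8/FP32): higher throughput is better."""
    return round(int8_images_per_s / fp32_images_per_s, 2)


def int8_speedup_from_latency(fp32_latency_ms, int8_latency_ms):
    """Ratio (FP32/INT8): lower latency is better, so the ratio is inverted."""
    return round(fp32_latency_ms / int8_latency_ms, 2)


# MobileNet-V1 on Gold 6271: 74.36 images/s (FP32), 210.68 images/s (INT8 QAT)
print(int8_speedup_from_throughput(74.36, 210.68))
# Ernie, 1 thread, on Gold 6148: 254.20 ms (FP32), 169.54 ms (QAT INT8)
print(int8_speedup_from_latency(254.20, 169.54))
```

These two calls reproduce the 2.83 and 1.50 entries of the tables. A few table entries may differ from this calculation in the last digit, presumably because they were derived from unrounded measurements.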
## 6. How to reproduce the results
The steps below show, taking ResNet50 as an example, how to reproduce the above accuracy and performance results for Image Classification models.
To reproduce NLP models results (Ernie), please follow [How to reproduce Ernie QAT results on MKL-DNN](https://github.com/PaddlePaddle/benchmark/tree/master/Inference/c%2B%2B/ernie/mkldnn/README.md).
### Prepare dataset
Download the dataset for image classification models benchmarking by executing:
```bash
cd /PATH/TO/PADDLE
python paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py
```
The converted data binary file is saved by default in `$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin`.
### Prepare models
Run the following commands to download and extract the QAT model:
```bash
mkdir -p /PATH/TO/DOWNLOAD/MODEL/
cd /PATH/TO/DOWNLOAD/MODEL/
export QAT_MODEL_NAME=resnet50
export QAT_MODEL_ARCHIVE=${QAT_MODEL_NAME}_quant.tar.gz
wget http://paddle-inference-dist.bj.bcebos.com/int8/QAT2_models/${QAT_MODEL_ARCHIVE}
mkdir ${QAT_MODEL_NAME} && tar -xvf ${QAT_MODEL_ARCHIVE} -C ${QAT_MODEL_NAME}
```
To download other QAT models, set the `QAT_MODEL_NAME` variable in the above commands to one of the values: `resnet101`, `mobilenetv1`, `mobilenetv2`, `vgg16`, `vgg19`.
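When all QAT models are needed, the archive URLs can be generated instead of editing `QAT_MODEL_NAME` by hand. The sketch below only builds the URLs from the naming pattern used in the download commands; it does not download anything or check that the archives are still available, and the function name is hypothetical.

```python
BASE_URL = "http://paddle-inference-dist.bj.bcebos.com/int8/QAT2_models"
QAT_MODEL_NAMES = [
    "resnet50", "resnet101", "mobilenetv1", "mobilenetv2", "vgg16", "vgg19",
]


def qat_archive_url(model_name):
    """Build the archive URL following the <name>_quant.tar.gz pattern."""
    return "{}/{}_quant.tar.gz".format(BASE_URL, model_name)


for name in QAT_MODEL_NAMES:
    print(qat_archive_url(name))
```

Each printed URL can be passed to `wget` in place of the one built from `QAT_MODEL_ARCHIVE`.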
Download a clean FP32 model for accuracy comparison against the INT8 model:
```bash
cd /PATH/TO/DOWNLOAD/MODEL/
export FP32_MODEL_NAME=resnet50
export FP32_MODEL_ARCHIVE=${FP32_MODEL_NAME}_int8_model.tar.gz
wget http://paddle-inference-dist.bj.bcebos.com/int8/${FP32_MODEL_ARCHIVE}
mkdir ${FP32_MODEL_NAME} && tar -xzvf ${FP32_MODEL_ARCHIVE} -C ${FP32_MODEL_NAME}
```
To download other FP32 models, set the `FP32_MODEL_NAME` variable to one of the values: `Res101`, `mobilenetv1`, `mobilenet_v2`, `VGG16`, and `VGG19`.
### Run benchmark
#### Accuracy benchmark commands
You can use the `qat2_int8_image_classification_comparison.py` script to reproduce the accuracy result of the INT8 QAT models. The following options are required:
* `--qat_model` - a path to a QAT model that will be transformed into INT8 model.
* `--fp32_model` - a path to an FP32 model whose accuracy will be measured and compared to the accuracy of the INT8 model.
* `--quantized_ops` - a comma-separated list of names of operators to be quantized. The list depends on which operators have quantization scales provided in the model. Also, it may be better for performance to choose only certain types of operators for quantization. For the Image Classification models mentioned above the list comprises the `conv2d` and `pool2d` operators.
* `--infer_data` - a path to the validation dataset.
```bash
cd /PATH/TO/PADDLE
OMP_NUM_THREADS=28 FLAGS_use_mkldnn=true python python/paddle/fluid/contrib/slim/tests/qat2_int8_image_classification_comparison.py --qat_model=/PATH/TO/DOWNLOADED/QAT/MODEL --fp32_model=/PATH/TO/DOWNLOADED/FP32/MODEL --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=50 --batch_num=1000 --acc_diff_threshold=0.01 --quantized_ops="conv2d,pool2d"
```
> Notes: Due to the large number of images in the `int8_full_val.bin` dataset (50 000), the accuracy benchmark may take a long time. To speed up the accuracy measurement, it is recommended to set `OMP_NUM_THREADS` to the maximum number of physical cores available on the server.
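The kind of comparison performed by an accuracy benchmark can be sketched as follows. This is an illustration only: the functions are hypothetical, and the assumption that `--acc_diff_threshold` bounds the drop `fp32_acc - int8_acc` reflects a plausible reading of the option, not the script's confirmed logic.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of images whose true label is among the k highest scores.

    scores: one list of class scores per image; labels: true class indices.
    """
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


def accuracy_drop_within(fp32_acc, int8_acc, threshold):
    """Assumed check: the INT8 model may lose at most `threshold` accuracy."""
    return fp32_acc - int8_acc <= threshold
```

For example, with the ResNet50 Top1 results measured on Gold 6271 (76.63% for FP32 and 76.47% for INT8 QAT), `accuracy_drop_within(0.7663, 0.7647, 0.01)` holds, while a stricter threshold of `0.001` would not be met.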
#### Performance benchmark commands
To reproduce the performance results, the environment variable `OMP_NUM_THREADS=1` and the `--batch_size=1` option should be set.
1. Transform the QAT model into an INT8 model by applying the `Qat2Int8MkldnnPass` pass and save the result. You can use the script `save_qat_model.py` for this purpose. It also requires the option `--quantized_ops` with a list of operators to be quantized.
```bash
cd /PATH/TO/PADDLE/build
python ../python/paddle/fluid/contrib/slim/tests/save_qat_model.py --qat_model_path=/PATH/TO/DOWNLOADED/QAT/MODEL --int8_model_save_path=/PATH/TO/SAVE/QAT/INT8/MODEL --quantized_ops="conv2d,pool2d"
```
2. Run the C-API test for performance benchmark.
```bash
cd /PATH/TO/PADDLE/build
OMP_NUM_THREADS=1 paddle/fluid/inference/tests/api/test_analyzer_qat_image_classification ARGS --enable_fp32=false --with_accuracy_layer=false --int8_model=/PATH/TO/SAVED/QAT/INT8/MODEL --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=1 --paddle_num_threads=1
```