From 61ec30f03049c544d04df49e2a8983735f60f011 Mon Sep 17 00:00:00 2001
From: lidanqing <danqing.li@intel.com>
Date: Tue, 28 Apr 2020 15:09:29 +0200
Subject: [PATCH] Update QAT INT8 2.0 doc  (#24127)

* update local data preprocess doc

* update for 2.0 QAT

test=develop

test=document_fix

* update benchmark data

test=develop

test=document_fix

Co-authored-by: Wojciech Uss <wojciech.uss@intel.com>
---
 .../tests/api/int8_mkldnn_quantization.md     | 59 +++++++++++------
 .../slim/tests/QAT_mkldnn_int8_readme.md      | 66 +++++++++++--------
 2 files changed, 77 insertions(+), 48 deletions(-)

diff --git a/paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md b/paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md
index 01c4345cca..1fc35f86bc 100644
--- a/paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md
+++ b/paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md
@@ -8,7 +8,6 @@ Follow PaddlePaddle [installation instruction](https://github.com/PaddlePaddle/m
 
 ```bash
 cmake ..  -DWITH_TESTING=ON -WITH_FLUID_ONLY=ON -DWITH_GPU=OFF -DWITH_MKL=ON -DWITH_MKLDNN=ON -DWITH_INFERENCE_API_TEST=ON -DON_INFER=ON
-
 ```
 
 Note: MKL-DNN and MKL are required.
@@ -64,14 +63,32 @@ We provide the results of accuracy and performance measured on Intel(R) Xeon(R)
 
 * ## Prepare dataset
 
-Run the following commands to download and preprocess the ILSVRC2012 Validation dataset.
+* Download and preprocess the full ILSVRC2012 Validation dataset.
 
 ```bash
-cd /PATH/TO/PADDLE/build
-python ../paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py
+cd /PATH/TO/PADDLE
+python paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py
 ```
 
-Then the ILSVRC2012 Validation dataset will be preprocessed and saved by default in `$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin`
+Then the ILSVRC2012 Validation dataset binary file is saved by default in `$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin`
+
+* Prepare user local dataset.
+
+```bash
+cd /PATH/TO/PADDLE/
+python paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py --local --data_dir=/PATH/TO/USER/DATASET --output_file=/PATH/TO/OUTPUT/BINARY
+```
+
+Available options in the above command and their descriptions are as follows:
+- **No parameters set:** The script will download the ILSVRC2012_img_val data from server and convert it into a binary file.
+- **local:** Once set, the script will process user local data.
+- **data_dir:** Path to user local dataset. Default value: None.
+- **label_list:** Path to image_label list file. Default value: `val_list.txt`.
+- **output_file:** Path to the generated binary file. Default value: `imagenet_small.bin`.
+- **data_dim:** The length and width of the preprocessed image. The default value: 224.
+
+The user dataset preprocessed binary file by default is saved in `imagenet_small.bin`.
+
 
 * ## Commands to reproduce image classification benchmark
 
@@ -108,30 +125,30 @@ MODEL_NAME=googlenet, mobilenetv1, mobilenetv2, resnet101, resnet50, vgg16, vgg1
 
 * ## Prepare dataset
 
-* Run the following commands to download and preprocess the Pascal VOC2007 test set.
+* Download and preprocess the full Pascal VOC2007 test set.
   
 ```bash
-cd /PATH/TO/PADDLE/build
-python ../paddle/fluid/inference/tests/api/full_pascalvoc_test_preprocess.py --choice=VOC_test_2007 
+cd /PATH/TO/PADDLE
+python paddle/fluid/inference/tests/api/full_pascalvoc_test_preprocess.py
 ```
 
-Then the Pascal VOC2007 test set will be preprocessed and saved by default in `$HOME/.cache/paddle/dataset/pascalvoc/pascalvoc_full.bin`
+The Pascal VOC2007 test set binary file is saved by default in `$HOME/.cache/paddle/dataset/pascalvoc/pascalvoc_full.bin`
 
-* Run the following commands to prepare your own dataset.
+* Prepare user local dataset.
 
 ```bash
-cd /PATH/TO/PADDLE/build
-python ../paddle/fluid/inference/tests/api/full_pascalvoc_test_preprocess.py --choice=local \\
-                                         --data_dir=./third_party/inference_demo/int8v2/pascalvoc_small \\
-                                         --img_annotation_list=test_100.txt \\
-                                         --label_file=label_list \\
-                                         --output_file=pascalvoc_small.bin \\
-                                         --resize_h=300 \\
-                                         --resize_w=300 \\
-                                         --mean_value=[127.5, 127.5, 127.5] \\
-                                         --ap_version=11point \\
+cd /PATH/TO/PADDLE
+python paddle/fluid/inference/tests/api/full_pascalvoc_test_preprocess.py --local --data_dir=/PATH/TO/USER/DATASET --img_annotation_list=/PATH/TO/ANNOTATION/LIST --label_file=/PATH/TO/LABEL/FILE --output_file=/PATH/TO/OUTPUT/FILE
 ```
-Then the user dataset will be preprocessed and saved by default in `/PATH/TO/PADDLE/build/third_party/inference_demo/int8v2/pascalvoc_small/pascalvoc_small.bin`
+Available options in the above command and their descriptions are as follows:
+- **No parameters set:** The script will download the full pascalvoc test dataset and preprocess and convert it into a binary file.
+- **local:** Once set, the script will process user local data.
+- **data_dir:** Path to user local dataset. Default value: None.
+- **img_annotation_list:** Path to img_annotation list file. Default value: `test_100.txt`.
+- **label_file:** Path to labels list. Default value: `label_list`.
+- **output_file:** Path to generated binary file. Default value: `pascalvoc_small.bin`.
+
+The user dataset preprocessed binary file by default is saved in `pascalvoc_small.bin`.
 
 * ## Commands to reproduce object detection benchmark
 
diff --git a/python/paddle/fluid/contrib/slim/tests/QAT_mkldnn_int8_readme.md b/python/paddle/fluid/contrib/slim/tests/QAT_mkldnn_int8_readme.md
index 7ccee1da5c..797e6c4343 100644
--- a/python/paddle/fluid/contrib/slim/tests/QAT_mkldnn_int8_readme.md
+++ b/python/paddle/fluid/contrib/slim/tests/QAT_mkldnn_int8_readme.md
@@ -8,10 +8,12 @@ In **Release 1.6**, a new approach was introduced, called QAT2, which adds suppo
 
 In **Release 1.7**, a support for [Ernie (NLP) QAT trained model](https://github.com/PaddlePaddle/benchmark/tree/master/Inference/c%2B%2B/ernie/mkldnn) was added to the QAT2.
 
+In **Release 2.0**, further optimizations were added to the QAT2: INT8 `matmul` kernel, inplace execution of activation and `elementwise_add` operators, and broader support for quantization aware strategy from PaddleSlim.
+
 In this document we focus on the QAT2 approach only. 
 
 ## 0. Prerequisites
-* PaddlePaddle in version 1.7.1 or higher is required. For instructions on how to install it see the [installation document](https://www.paddlepaddle.org.cn/install/quick).
+* PaddlePaddle in version 2.0 or higher is required. For instructions on how to install it see the [installation document](https://www.paddlepaddle.org.cn/install/quick).
   
 * MKL-DNN and MKL are required. The highest performance gain can be observed using CPU servers supporting AVX512 instructions.
 * INT8 accuracy is best on CPU servers supporting AVX512 VNNI extension (e.g. CLX class Intel processors). A linux server supports AVX512 VNNI instructions if the output of the command `lscpu` contains the `avx512_vnni` entry in the `Flags` section. AVX512 VNNI support on Windows can be checked using the [`coreinfo`]( https://docs.microsoft.com/en-us/sysinternals/downloads/coreinfo) tool.
@@ -30,11 +32,18 @@ A QAT model can be transformed into an INT8 quantized model if it contains enoug
 
 ### Gathering scales
 
-The information about the quantization scales is being collected from three types of operators:
+The information about the quantization scales is collected from two sources:
+
+1. the `out_threshold` attribute of quantizable operators - it contains a single value quantization scale for the operator's output,
+2.  fake quantize/dequantize operators - they imitate quantization from FP32 into INT8, or dequantization in reverse direction, but keep the quantized tensor values as floats.
+
+There are three types of fake quantize/dequantize operators:
 
-* `fake_quantize_moving_average_abs_max` - imitates INT8 quantization of FP32 tensors, but keeps quantized output values as floats; is used before quantized operator (e.g. `conv2d`) to gather scale information for the op's input.
-* `fake_dequantize_max_abs` - imitates dequantization of INT8 tensors back into floats; it is used after quantized operator, and contains scale used for the op's weights dequantization.
-* `fake_quantize_dequantize_moving_average_abs_max` - imitates immediate quantization and dequantization; it can be used after a quantized operator to get the scale value for the op's output.
+* `fake_quantize_moving_average_abs_max` and `fake_quantize_range_abs_max` - used before quantized operator (e.g. `conv2d`), gather single value scale information for the op's input,
+* `fake_dequantize_max_abs` and `fake_channel_wise_dequantize_max_abs` - used after quantized operators, contain scales used for the operators' weights dequantization; the first one collects a single value scale for the weights tensor, whereas the second one collects a vector of scales for each output channel of the weights,
+* `fake_quantize_dequantize_moving_average_abs_max` - used after a quantized operator to get the scale value for the op's output; imitates immediate quantization and dequantization.
+
+Scale values gathered from the fake quantize/dequantize operators have precedence over the scales collected from the `out_threshold` attributes.
 
 Notes:
 
@@ -43,7 +52,7 @@ Notes:
    and we want to quantize the `conv2d` op, then after applying FP32 optimizations the sequence will become
    ```... → input1 → conv2d → output3 → ...```
    and the quantization scales have to be collected for the `input1` and `outpu3` tensors in the QAT model.
-2. Quantization of the following operators is supported: `conv2d`, `depthwise_conv2d`, `mul`, `fc`, `pool2d`, `reshape2`, `transpose2`, `concat`.
+2. Quantization of the following operators is supported: `conv2d`, `depthwise_conv2d`, `mul`, `fc`, `matmul`, `pool2d`, `reshape2`, `transpose2`, `concat`.
 3. The longest sequence of consecutive quantizable operators in the model, the biggest performance boost can be achieved through quantization:
    ```... → conv2d → conv2d → pool2d → conv2d → conv2d → ...``` 
    Quantizing single operator separated from other quantizable operators can give no performance benefits or even slow down the inference:
@@ -55,7 +64,7 @@ All the `fake_quantize_*` and `fake_dequantize_*` operators are being removed fr
 
 ### Dequantizing weights
 
-Weights of `conv2d`, `depthwise_conv2d` and `mul` operators are assumed to be fake-quantized (with integer values in the `int8` range, but kept as `float`s) in QAT models. Here, the information about the scale from `fake_dequantize_max_abs` operators is used to fake-dequantize the weights back to the full float range of values. At this moment the model becomes an unoptimized clean FP32 inference model.
+Weights of `conv2d`, `depthwise_conv2d` and `mul` operators are assumed to be fake-quantized (with integer values in the `int8` range, but kept as `float`s) in QAT models. Here, the information about the scale from `fake_dequantize_max_abs` and `fake_channel_wise_dequantize_max_abs` operators is used to fake-dequantize the weights back to the full float range of values. At this moment the model becomes an unoptimized clean FP32 inference model.
 
 ### Optimizing FP32 graph
 
@@ -71,7 +80,7 @@ The basic datatype used during INT8 inference is signed INT8, with possible valu
 
 ### Propagation of scales
 
-Some of the operators (e.g. `reshape2`, `transpose2`, `pool2d` with max pooling) transform the data without changing the quantization scale. For this reason we propagate the quantization scale values through these operators without any modifications. We propagate the quantization scales also through the `scale` operator, updating the quantization scale accordingly. This approach lets us minimize the number of `fake_quantize` and `fake_dequantize` operators in the graph, because the information about the scales required for the quantization process to succeed spreads between quantized operators.
+Some of the operators (e.g. `reshape2`, `transpose2`, `pool2d` with max pooling) transform the data without changing the quantization scale. For this reason we propagate the quantization scale values through these operators without any modifications. We propagate the quantization scales also through the `scale` operator, updating the quantization scale accordingly. This approach lets us minimize the number of fake quantize/dequantize operators in the graph, because the information about the scales required for the quantization process to succeed spreads between quantized operators.
 
 ### Applying quantization passes
 
@@ -153,25 +162,25 @@ Image classification models performance was measured using a single thread. The
 
 >**Intel(R) Xeon(R) Gold 6271**
 
-|    Model     | FP32 (images/s) | INT8 QAT (images/s) | Ratio (INT8/FP32) |
-| :----------: | :-------------: | :-----------------: | :---------------: |
-| MobileNet-V1 |      74.36      |       210.68        |       2.83        |
-| MobileNet-V2 |      89.59      |       186.55        |       2.08        |
-|  ResNet101   |      7.21       |        26.41        |       3.67        |
-|   ResNet50   |      13.23      |        48.89        |       3.70        |
-|    VGG16     |      3.49       |        10.11        |       2.90        |
-|    VGG19     |      2.84       |        8.69         |       3.06        |
+|    Model     | FP32 (images/s) | INT8 QAT (images/s) | Ratio (INT8/FP32)  |
+| :----------: | :-------------: | :-----------------: | :---------------:  |
+| MobileNet-V1 |      77.00      |       210.76        |      2.74          |
+| MobileNet-V2 |      88.43      |       182.47        |      2.06          |
+|  ResNet101   |      7.20       |        25.88        |      3.60          |
+|   ResNet50   |      13.26      |        47.44        |      3.58          |
+|    VGG16     |      3.48       |        10.11        |      2.90          |
+|    VGG19     |      2.83       |        8.77         |      3.10          |
 
 >**Intel(R) Xeon(R) Gold 6148**
 
 |    Model     | FP32 (images/s) | INT8 QAT (images/s) | Ratio (INT8/FP32) |
 | :----------: | :-------------: | :-----------------: | :---------------: |
-| MobileNet-V1 |      75.23      |       111.15        |       1.48        |
-| MobileNet-V2 |      86.65      |       127.21        |       1.47        |
-|  ResNet101   |      6.61       |        10.60        |       1.60        |
-|   ResNet50   |      12.42      |        19.74        |       1.59        |
-|    VGG16     |      3.31       |        4.74         |       1.43        |
-|    VGG19     |      2.68       |        3.91         |       1.46        |
+| MobileNet-V1 |      75.23      |       103.63        |      1.38         |
+| MobileNet-V2 |      86.65      |       128.14        |      1.48         |
+|  ResNet101   |      6.61       |       10.79         |      1.63         |
+|   ResNet50   |      12.42      |       19.65         |      1.58         |
+|    VGG16     |      3.31       |        4.74         |      1.43         |
+|    VGG19     |      2.68       |        3.91         |      1.46         |
 
 Notes:
 
@@ -200,16 +209,16 @@ Notes:
 
 |  Model  |     Threads  | FP32 Latency (ms) | QAT INT8 Latency (ms)    | Ratio (FP32/INT8) |
 |:------------:|:----------------------:|:-------------------:|:---------:|:---------:|
-| Ernie | 1 thread     |        256.11        |     93.80    |   2.73   |
-| Ernie | 20 threads   |        30.06        |    16.88    |   1.78   |
+| Ernie | 1 thread     |       236.72        |     83.70    |   2.82x   |
+| Ernie | 20 threads   |       27.40         |     15.01    |   1.83x   |
 
 
 >**Intel(R) Xeon(R) Gold 6148**
 
 | Model |  Threads   | FP32 Latency (ms) | QAT INT8 Latency (ms) | Ratio (FP32/INT8) |
 | :---: | :--------: | :---------------: | :-------------------: | :---------------: |
-| Ernie |  1 thread  |      254.20       |        169.54         |       1.50        |
-| Ernie | 20 threads |       30.99       |         21.81         |       1.42        |
+| Ernie |  1 thread  |    248.42         |       169.30           |       1.46       |
+| Ernie | 20 threads |    28.92          |       20.83            |       1.39       |
 
 ## 6. How to reproduce the results
 
@@ -261,7 +270,10 @@ You can use the `qat2_int8_image_classification_comparison.py` script to reprodu
 
 * `--qat_model` - a path to a QAT model that will be transformed into INT8 model.
 * `--fp32_model` - a path to an FP32 model whose accuracy will be measured and compared to the accuracy of the INT8 model.
-* `--quantized_ops` - a comma-separated list of names of operators to be quantized. The list depends on which operators have quantization scales provided in the model. Also, it may be more optimal in terms of performance to choose only certain types of operators for quantization. For Image Classification models mentioned above the list comprises of `conv2d` and `pool2d` operators.
+* `--quantized_ops` - a comma-separated list of names of operators to be quantized. When deciding which operators to put on the list, the following have to be considered:
+  * Only operators which support quantization will be taken into account.
+  * All the quantizable operators from the list, which are present in the model, must have quantization scales provided in the model. Otherwise, the quantization procedure will fail with a message saying which variable is missing a quantization scale.
+  * Sometimes it may be suboptimal to quantize all quantizable operators in the model (cf. *Notes* in the **Gathering scales** section above). To find the optimal configuration for this option, user can run benchmark a few times with different lists of quantized operators present in the model and compare the results. For Image Classification models mentioned above the list comprises of `conv2d` and `pool2d` operators.
 * `--infer_data` - a path to the validation dataset.
 
 ```bash
-- 
GitLab