# Image classification INT8 model deployment and Inference on CPU
## Overview
This document describes the process of converting, deploying, and executing a DNNL INT8 model obtained from a fake-quantized (quant) model generated by PaddleSlim. On Cascade Lake machines (e.g. Intel(R) Xeon(R) Gold 6271, 6248, and other X2XX models), inference with the INT8 model is usually 3-3.7 times faster than with the FP32 model. On SkyLake machines (e.g. Intel(R) Xeon(R) Gold 6148, 8180, and other X1XX models), inference with the INT8 model is about 1.5 times faster than with the FP32 model.
The process comprises the following steps:
- Generating a fake-quantized model: Use PaddleSlim to generate a fake-quantized model with the `quant-aware` or `post-training` quantization strategy. Note that the parameters of the quantized ops will be in the `INT8` range, but their type remains `float32`.
- Converting the fake-quantized model into the final DNNL INT8 model: Use the provided Python script to convert the quant model into a DNNL-based, CPU-optimized INT8 model.
- Deployment and inference on CPU: Deploy the demo on a CPU and run inference.
## 1. Preparation
#### Install PaddleSlim
For PaddleSlim installation instructions, see the [PaddleSlim Installation Document](https://paddlepaddle.github.io/PaddleSlim/install.html).
In the sample tests, import Paddle and PaddleSlim as follows:
```
import paddle
import paddle.fluid as fluid
import paddleslim as slim
import numpy as np
```
## 2. Use PaddleSlim to generate a fake-quantized model
One can generate a fake-quantized model with either the post-training or the quant-aware strategy. Users who would like to skip generating the fake-quantized model and check the quantization speedup directly can download the [mobilenetv2 post-training quant model](https://paddle-inference-dist.cdn.bcebos.com/quantizaiton/quant_post_models/mobilenetv2_quant_post.tgz) and its original FP32 model [mobilenetv2 fp32 model](https://paddle-inference-dist.cdn.bcebos.com/quantizaiton/fp32_models/mobilenetv2.tgz), skip this section, and go directly to section 3.
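For example, both archives can be fetched and unpacked with the following commands (a minimal sketch using the URLs above; the target directory is arbitrary):
```
wget https://paddle-inference-dist.cdn.bcebos.com/quantizaiton/quant_post_models/mobilenetv2_quant_post.tgz
wget https://paddle-inference-dist.cdn.bcebos.com/quantizaiton/fp32_models/mobilenetv2.tgz
tar -xzf mobilenetv2_quant_post.tgz
tar -xzf mobilenetv2.tgz
```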
#### 2.1 Quant-aware training
To generate a fake-quantized model with the quant-aware strategy, see the [Quant-aware training tutorial](https://paddlepaddle.github.io/PaddleSlim/tutorials/quant_aware_demo/).
**The parameters during quant-aware training:**
- **quantize_op_types:** A list of operators around which `fake_quantize` and `fake_dequantize` ops are inserted. In PaddlePaddle, quantization of the following operators is supported for CPU: `depthwise_conv2d`, `conv2d`, `fc`, `matmul`, `transpose2`, `reshape2`, `pool2d`, `scale`, `concat`. However, inserting fake_quantize/fake_dequantize operators during training is needed only for the first four of them (`depthwise_conv2d`, `conv2d`, `fc`, `matmul`), so setting the `quantize_op_types` parameter to the list of those four ops is enough. The scale data needed for quantization of the other five operators is reused from the fake ops or gathered from the `out_threshold` attributes of the operators.
#### 2.2 Post-training quantization
To generate a post-training fake-quantized model, see the [Offline post-training quantization tutorial](https://paddlepaddle.github.io/PaddleSlim/tutorials/quant_post_demo/#_1).
## 3. Convert the fake quantized model to DNNL INT8 model
In order to deploy an INT8 model on the CPU, we need to collect scales, remove all fake_quantize/fake_dequantize operators, optimize the graph, and quantize it, turning it into the final DNNL INT8 model. This is done by the script [save_quant_model.py](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/contrib/slim/tests/save_quant_model.py). Copy the script to the directory where the demo is located (`/PATH_TO_PaddleSlim/demo/mkldnn_quant/`) and run it as shown below.
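A typical invocation looks like the following (a sketch with placeholder paths; only the first two options are required):
```
cd /PATH_TO_PaddleSlim/demo/mkldnn_quant/
python save_quant_model.py \
    --quant_model_path=/PATH/TO/SAVE/FLOAT32/QUANT/MODEL \
    --int8_model_save_path=/PATH/TO/SAVE/INT8/MODEL
```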
**Available options in the above command and their descriptions are as follows:**
- **quant_model_path:** Required. The input model path, i.e. the quant model produced by quant-aware training or post-training quantization.
- **int8_model_save_path:** The path where the final INT8 model is saved after the quant model is optimized and quantized by DNNL.
- **ops_to_quantize:** A comma-separated list of op types to be quantized; optional. If the option is skipped, all quantizable operators will be quantized. Skipping the option is recommended at first, as it usually yields the best performance and accuracy for the image classification and NLP models listed in the benchmark.
- **op_ids_to_skip:** A comma-separated list of operator ID numbers; optional, empty by default. The ops with IDs from this list will not be quantized and will remain FP32. To get the ID of a specific op, first run the script with the `--debug` option, then open the generated file `int8_<number>_cpu_quantize_placement_pass.dot` and find the op that should not be quantized; its ID number is given in parentheses after the op name.
- **debug:** Whether to generate model graphs. If this option is present, `.dot` files with graphs of the model will be generated after each optimization step that modifies the graph. For a description of the DOT format, see [DOT](https://graphviz.gitlab.io/_pages/doc/info/lang.html). To open a `*.dot` file, use any Graphviz tool available on the system (such as the `xdot` tool on Linux or the `dot` tool on Windows). For Graphviz documentation, see [Graphviz](http://www.graphviz.org/documentation/).
**Note:**
- The DNNL supported quantizable ops are `conv2d`, `depthwise_conv2d`, `fc`, `matmul`, `pool2d`, `reshape2`, `transpose2`, `scale`, `concat`.
- If you want to skip quantization of particular operators, use the `--op_ids_to_skip` option and set it exactly to the list of ids of the operators you want to keep as FP32 operators.
- Quantization yields the best performance and accuracy when long sequences of consecutive quantized operators are present in the model. When a model contains quantizable operators surrounded by non-quantizable ones, quantizing single operators (or very short sequences of them) can give no speedup or even cause a drop in performance because of frequent quantizing and dequantizing of the data. In that case the user can tweak the quantization process by limiting it to particular types of operators (using the `--ops_to_quantize` option) or by disabling quantization of particular operators, e.g. single ones or short sequences (using the `--op_ids_to_skip` option), as in the example below.
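For illustration, an invocation that limits quantization to `conv2d` and `pool2d` and additionally keeps two specific operators in FP32 could look like this (a sketch; the paths and the operator IDs are placeholders):
```
python save_quant_model.py \
    --quant_model_path=/PATH/TO/SAVE/FLOAT32/QUANT/MODEL \
    --int8_model_save_path=/PATH/TO/SAVE/INT8/MODEL \
    --ops_to_quantize="conv2d,pool2d" \
    --op_ids_to_skip="12,17"
```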
## 4. Inference
### 4.1 Data preprocessing
To deploy the model on the CPU, the validation dataset needs to be converted into a binary format. Run the command shown below in the root directory of the Paddle repository to convert the complete ILSVRC2012 val dataset; use the `--local` option to convert your own image classification dataset instead. The script is also available on the official website: [full_ILSVRC2012_val_preprocess.py](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py).
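A typical invocation from the Paddle repository root could look like this (a sketch; the output path is a placeholder):
```
python paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py \
    --output_file=/PATH/TO/SAVE/BINARY/FILE
```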
**Available options in the above command and their descriptions are as follows:**
- If no parameters are set, the script will download the ILSVRC2012_img_val dataset and convert it into a binary file.
- **local:** If set, a user's own dataset is expected in the directory given by the `data_dir` option.
- **data_dir:** The user's own data directory.
- **label_list:** A list file mapping image paths to image categories, similar to `val_list.txt`.
- **output_file:** The path for the generated binary file.
- **data_dim:** The width and height of the pre-processed images in the resulting binary file. The default value is 224.
The structure of the directory with the user's dataset should be as follows:
```
imagenet_user
├── val
│   ├── ILSVRC2012_val_00000001.jpg
│   ├── ILSVRC2012_val_00000002.jpg
│   ├── ...
└── val_list.txt
```
Then, the contents of val_list.txt should be as follows:
```
val/ILSVRC2012_val_00000001.jpg 0
val/ILSVRC2012_val_00000002.jpg 0
```
**Note:**
- Measuring performance with the C++ test rather than the Python test is recommended, because the Python test incurs a large overhead from Python itself. However, testing requires the dataset images to be preprocessed first. This can be done easily using native tools in Python, but in C++ it requires additional libraries. To avoid introducing external C++ dependencies on image processing libraries like OpenCV, we preprocess the dataset using the Python script and save the result in a binary format, ready to use by the C++ test. Users can modify the C++ test code to enable image preprocessing with OpenCV or any other library and read the image data directly from the original dataset. The accuracy result should differ only slightly from the accuracy obtained using the preprocessed binary dataset. The Python test `sample_tester.py` is provided as a reference to show the difference in performance between it and the C++ test `sample_tester.cc`.
### 4.2 Deploying Inference demo
#### Deployment prerequisites
- Users can check which instruction sets are supported by their machines' CPUs by issuing the command `lscpu`.
- INT8 performance and accuracy are best on CPU servers which support the `avx512_vnni` instruction set (e.g. Intel Cascade Lake CPUs: Intel(R) Xeon(R) Gold 6271, 6248, or other X2XX models). INT8 inference performance is then 3-3.7 times better than FP32.
- On CPU servers that support `avx512` but not `avx512_vnni` instructions (SkyLake, model name: Intel(R) Xeon(R) Gold X1XX, such as 6148), the performance of INT8 models is around 1.5 times better than that of FP32 models.
#### Prepare Paddle inference library
Users can compile the Paddle inference library from the source code or download the inference library directly.
- For instructions on how to compile the Paddle inference library from source, see [Compile from Source](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html#id12); check out the release/2.0 or develop branch and compile it.
- Users can also download the published [inference library](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html). Please select the latest release or develop version of `ubuntu14.04_cpu_avx_mkl`. The downloaded library has to be decompressed, renamed to `fluid_inference`, and placed in the current directory (`/PATH_TO_PaddleSlim/demo/mkldnn_quant/`) for the library to be found, for example as shown below. Another option is to set the `PADDLE_ROOT` cmake variable to the location of the `fluid_inference` directory to link the tests with the Paddle inference library properly.
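For example, unpacking and renaming the downloaded archive could look like this (a minimal sketch; `fluid_inference.tgz` and the extracted directory name are placeholders that depend on the package chosen on the download page):
```
cd /PATH_TO_PaddleSlim/demo/mkldnn_quant/
# The archive name below is a placeholder for the package downloaded from the official page
tar -xzf fluid_inference.tgz
# Rename the extracted directory (its original name depends on the package) to fluid_inference
mv EXTRACTED_DIR_NAME fluid_inference
```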
#### Compile the application
The source code of the sample test (`sample_tester.cc`) and the `cmake` files are all located in the `demo/mkldnn_quant/` directory.
```
cd /PATH/TO/PaddleSlim
cd demo/mkldnn_quant/
mkdir build
cd build
cmake -DPADDLE_ROOT=$PADDLE_ROOT ..
make -j
```
- The default value of `-DPADDLE_ROOT` is `demo/mkldnn_quant/fluid_inference`. If users have downloaded and unpacked the [inference library from the official website](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html) into the current directory `demo/mkldnn_quant/`, they can skip this option.
#### Run the test
```
# Bind threads to cores
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
# Turbo Boost could be set to OFF using the command
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# In the file run.sh, set `MODEL_DIR` to `/PATH/TO/FLOAT32/MODEL` or `/PATH/TO/SAVE/INT8/MODEL`
# In the file run.sh, set `DATA_FILE` to `/PATH/TO/SAVE/BINARY/FILE`
# For 1 thread performance:
./run.sh
# For 20 thread performance:
./run.sh -1 20
```
**Available options in the above command and their descriptions are as follows:**
- **infer_model:** Required. The path of the model to be tested. Note that the model parameters need to be saved as multiple separate files.
- **infer_data:** Required. The path of the tested data file. Note that it needs to be a binary file converted by `full_ILSVRC2012_val_preprocess`.
- **batch_size:** Batch size. The default value is 50.
- **iterations:** The number of batches to run. The default is 0, which means all batches (number of images / batch size) in infer_data are predicted.
- **num_threads:** The number of CPU threads used. The default value is 1.
- **with_accuracy_layer:** Whether the model contains an accuracy layer. The default value is false.
- **use_analysis:** Whether to use paddle::AnalysisConfig to optimize the model. The default value is false.
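For reference, a direct invocation of the compiled test with these flags could look as follows (a sketch; the binary name `build/sample_tester` and all paths are assumptions based on this demo):
```
./build/sample_tester \
    --infer_model=/PATH/TO/SAVE/INT8/MODEL \
    --infer_data=/PATH/TO/SAVE/BINARY/FILE \
    --batch_size=50 \
    --num_threads=1 \
    --with_accuracy_layer=false \
    --use_analysis=false
```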
One can directly modify `MODEL_DIR` and `DATA_FILE` in `run.sh` under the `/PATH_TO_PaddleSlim/demo/mkldnn_quant/` directory and then execute `./run.sh` for CPU inference.
### 4.3 Writing your own tests
When writing their own test, users can:
1. Test the resulting INT8 model - then paddle::NativeConfig should be used (without applying additional optimizations) and the option `use_analysis` should be set to `false` in the demo.
2. Test the original FP32 model - then paddle::AnalysisConfig should be used (applying FP32 fuses and optimizations) and the option `use_analysis` should be set to `true` in the demo.
The AnalysisConfig configuration in this demo is set as follows:
```
cfg->SetModel(FLAGS_infer_model);                     // Required. The model to be tested
cfg->DisableGpu();                                    // Required. Disable the GPU; inference runs on the CPU
cfg->EnableMKLDNN();                                  // Required. Enabling MKL-DNN makes inference faster than the native configuration
cfg->SwitchIrOptim();                                 // Required. IR optimization fuses many ops and improves performance
cfg->SetCpuMathLibraryNumThreads(FLAGS_num_threads);  // The default value is 1
```
**Notes:**
- If `infer_model` is a path to an FP32 model and `use_analysis` is set to true, paddle::AnalysisConfig will be used. Hence the FP32 model will be fused and optimized, and the performance should be better than FP32 inference using paddle::NativeConfig.
- If `infer_model` is a path to a converted DNNL INT8 model, the `use_analysis` option makes no difference, because the INT8 model has already been fused, optimized, and quantized.
- If `infer_model` is a path to a fake-quantized model generated by PaddleSlim, `use_analysis` will not work even if it is set to true, because the fake-quantized model contains fake_quantize/fake_dequantize ops which cannot be fused or optimized.
## 5. Accuracy and performance benchmark
For INT8 model accuracy and performance results, see [Accuracy and performance of INT8 models deployed on CPU](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/docs/zh_cn/tutorials/image_classification_mkldnn_quant_tutorial.md)
## FAQ
- For deploying INT8 NLP models on CPU, see [ERNIE model quant INT8 accuracy and performance reproduction](https://github.com/PaddlePaddle/benchmark/tree/master/Inference/c++/ernie/mkldnn)
- The detailed DNNL quantization process can be viewed in [SLIM quant for INT8 DNNL](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/contrib/slim/tests/README.md)