- [Algorithm Background](https://paddleslim.readthedocs.io/en/latest/intro_en.html): Introduces the background of quantization, pruning, distillation, and NAS.
- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection/tree/master/slim): Introduces how to use PaddleSlim in the PaddleDetection library.
...
#### Install PaddleSlim
For PaddleSlim installation (typically `pip install paddleslim`), please see the [PaddleSlim Installation Document](https://paddleslim.readthedocs.io/en/latest/install_en.html).
...
#### 2.1 Quant-aware training
To generate a fake-quantized model with the quant-aware strategy, see the [Quant-aware training tutorial](https://paddleslim.readthedocs.io/en/latest/quick_start/quant_aware_tutorial_en.html).
**The parameters used during quant-aware training:**

- **quantize_op_types:** A list of operators around which `fake_quantize` and `fake_dequantize` ops will be inserted. In PaddlePaddle, quantization of the following operators is supported for CPU: `depthwise_conv2d`, `conv2d`, `fc`, `matmul`, `transpose2`, `reshape2`, `pool2d`, `scale`, `concat`. However, inserting fake_quantize/fake_dequantize operators during training is needed only for the first four of them (`depthwise_conv2d`, `conv2d`, `fc`, `matmul`), so setting the `quantize_op_types` parameter to the list of those four ops is enough; see the sketch below. Scale data needed for quantization of the other five operators is reused from the fake ops or gathered from the `out_threshold` attributes of the operators.
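As a sketch of how this parameter might be set (the exact set of accepted op-type names can vary between PaddleSlim versions, so treat the list below as illustrative):

```python
import paddleslim as slim

# Per the description above: only these four op types need
# fake_quantize/fake_dequantize ops inserted around them during training;
# scale data for transpose2/reshape2/pool2d/scale/concat is reused from
# the fake ops or taken from their `out_threshold` attributes.
quant_config = {
    'quantize_op_types': ['depthwise_conv2d', 'conv2d', 'fc', 'matmul'],
}

# The dict is passed through the `config` argument of the quant-aware API;
# `train_program` and `place` are assumed to come from the model setup:
# quant_program = slim.quant.quant_aware(train_program, place,
#                                        config=quant_config, for_test=False)
```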
To generate a post-training fake-quantized model, see the [Offline post-training quantization tutorial](https://paddleslim.readthedocs.io/en/latest/quick_start/index_en.html).
## 3. Convert the fake quantized model to DNNL INT8 model
In order to deploy an INT8 model on the CPU, we need to collect scales, remove all fake_quantize/fake_dequantize operators, optimize the graph, and quantize it, turning it into the final DNNL INT8 model. This is done by the script [save_quant_model.py](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/contrib/slim/tests/save_quant_model.py). Copy the script to the directory where the demo is located, `/PATH_TO_PaddleSlim/demo/mkldnn_quant/`, and run it with the options described below:
- **int8_model_save_path:** The final INT8 model output path after the quant model is optimized and quantized by DNNL.
- **ops_to_quantize:** A comma-separated list of operator types to quantize. It is optional. If the option is skipped, all quantizable operators will be quantized. Skipping the option is recommended as a first approach, since it usually yields the best performance and accuracy for the image classification and NLP models listed in the Benchmark.
- **--op_ids_to_skip:** A comma-separated list of operator ID numbers. It is optional. The default value is none. The ops whose IDs are in this list will not be quantized and will keep the FP32 type. To get the ID of a specific op, first run the script with the `--debug` option, then open the generated file `int8_<number>_cpu_quantize_placement_pass.dot` and find the op that should not be quantized; its ID number is in parentheses after the op name.
- **--debug:** Whether to generate model graphs. If this option is present, `.dot` files with graphs of the model will be generated after each optimization step that modifies the graph. For a description of the DOT format, see [DOT](https://graphviz.gitlab.io/_pages/doc/info/lang.html). To open a `*.dot` file, use any Graphviz tool available on the system (such as the `xdot` tool on Linux or the `dot` tool on Windows). For Graphviz documentation, see [Graphviz](http://www.graphviz.org/documentation/).
**Note:**
- The DNNL supported quantizable ops are `conv2d`, `depthwise_conv2d`, `fc`, `matmul`, `pool2d`, `reshape2`, `transpose2`, `scale`, `concat`.
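Once the conversion has finished, the resulting INT8 model can be given a quick functional check from Python with the native inference API. A minimal sketch, assuming the Paddle 1.x `AnalysisConfig` API and an illustrative model path and input shape:

```python
import numpy as np
from paddle.fluid.core import AnalysisConfig, PaddleTensor, create_paddle_predictor

# Illustrative path: the directory given as int8_model_save_path above.
config = AnalysisConfig("./int8_model")
config.disable_gpu()
config.enable_mkldnn()  # execute the quantized kernels through DNNL on CPU

predictor = create_paddle_predictor(config)

# Illustrative input shape for an image classification model.
data = np.random.rand(1, 3, 224, 224).astype("float32")
outputs = predictor.run([PaddleTensor(data)])
```

This only checks that the model loads and runs; for accuracy and performance measurement, follow the demo's own instructions.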
...
## Introduction
Lexical Analysis of Chinese, or LAC for short, is a lexical analysis model that completes the tasks of Chinese word segmentation, part-of-speech tagging, and named entity recognition in a single model. We conduct an overall evaluation of word segmentation, part-of-speech tagging, and named entity recognition on a self-built dataset. We use the fine-tuned ERNIE model as the teacher model and a GRU as the student model, as required by the Pantheon framework for online distillation. A rough sketch of how the two sides are wired together is shown below.
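The sketch follows the `paddleslim.pantheon` module layout, but the port, address, and strategy values are illustrative assumptions:

```python
from paddleslim.pantheon import Teacher, Student

# Teacher side (the fine-tuned ERNIE model): serve "knowledge"
# (e.g. prediction logits) to students over a port.
teacher = Teacher(out_path=None, out_port=5000)
teacher.start()
# teacher.start_knowledge_service(...) would then stream the outputs of
# the ERNIE program to the connected students.

# Student side (the GRU model): subscribe to the teacher's knowledge and
# consume it as an extra supervision signal during training.
student = Student(merge_strategy=None)
student.register_teacher(in_address="127.0.0.1:5000")
student.start()
```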
# Neural Architecture Search for Image Classification
This tutorial shows how to use the SANAS [API](https://paddleslim.readthedocs.io/en/latest/api_en/paddleslim.nas.html) in PaddleSlim. We run the experiment with MobileNetV2 as the example. The tutorial contains the following sections, and a minimal usage sketch is given below.
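A minimal sketch of the SANAS search loop, assuming the static-graph API (the search-space name, port, and reward value are illustrative):

```python
import paddleslim

# Start a SANAS controller searching over the MobileNetV2-based space.
sanas = paddleslim.nas.SANAS(
    configs=['MobileNetV2Space'],
    server_addr=("", 8881),
    save_checkpoint='./nas_checkpoint')

# One search step: fetch candidate architectures, build and evaluate a
# program from them, then report the score back to the controller.
archs = sanas.next_archs()
# ... build a network with archs[0](input_tensor) and train/evaluate it ...
score = 0.0  # illustrative reward, e.g. validation accuracy
sanas.reward(score)
```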
# Training-aware Quantization of image classification model - quick start
This tutorial shows how to do training-aware quantization using the [API](https://paddleslim.readthedocs.io/en/latest/api_en/index_en.html) in PaddleSlim. We use MobileNetV1 as the example image classification model. The tutorial contains the following sections:
1. Necessary imports
2. Model architecture
...
## 4. Quantization
We call the ``quant_aware`` API to add quantization and dequantization operators to ``train_program`` and ``val_program`` according to the [default configuration](https://paddleslim.readthedocs.io/en/latest/api_en/paddleslim.quant.html), as sketched below.
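A minimal sketch of these two calls, assuming ``train_program``, ``val_program`` and ``place`` from the earlier sections (with ``config`` left out, the default configuration applies):

```python
import paddleslim as slim

# Insert fake quantize/dequantize ops using the default configuration.
# for_test=False instruments the training graph; for_test=True builds a
# graph suitable for evaluation and later conversion.
quant_program = slim.quant.quant_aware(train_program, place, for_test=False)
val_quant_program = slim.quant.quant_aware(val_program, place, for_test=True)
```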
The model in ``4. Quantization``, after calling the ``slim.quant.quant_aware`` API, is only suitable for training. To get the inference model, we should use the [slim.quant.convert](https://paddleslim.readthedocs.io/zh_CN/latest/api_cn/static/quant/quantization_api.html#convert) API to change the model architecture and use [fluid.io.save_inference_model](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/io_cn/save_inference_model_cn.html#save-inference-model) to save the model, as sketched below. ``float_prog``'s parameters are float32 dtype but within int8's range, so it can be used in ``fluid`` or ``paddle-lite``; ``paddle-lite`` will change the parameters' dtype from float32 to int8 when it loads the inference model. ``int8_prog``'s parameters are int8 dtype, and saving it shows the model size after quantization; ``int8_prog`` cannot be used in ``fluid`` or ``paddle-lite``.
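A sketch of the conversion and saving step, assuming ``val_quant_program`` from the previous step; the feed/fetch variables and output directory are illustrative:

```python
import paddle.fluid as fluid
import paddleslim as slim

# With save_int8=True, convert returns both a float32 program (weights in
# int8 range) and a real int8 program.
float_prog, int8_prog = slim.quant.convert(
    val_quant_program, place, save_int8=True)

# Save the float32 inference model; `image` and `out` stand for the
# model's input and output variables.
fluid.io.save_inference_model(
    dirname='./quant_model/float',
    feeded_var_names=[image.name],
    target_vars=[out],
    executor=exe,
    main_program=float_prog)
```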
# Post-training Quantization of image classification model - quick start
This tutorial shows how to do post-training quantization using the [API](https://paddleslim.readthedocs.io/en/latest/api_en/index_en.html) in PaddleSlim. We use MobileNetV1 as the example image classification model. The tutorial contains the following sections, and a minimal sketch of the core call is given below.
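A minimal sketch, assuming a saved float32 inference model in `./inference_model` and an illustrative `sample_generator` that yields calibration samples:

```python
import paddle.fluid as fluid
import paddleslim as slim

exe = fluid.Executor(fluid.CPUPlace())

# Calibrate on a few batches and write the quantized model out.
slim.quant.quant_post(
    executor=exe,
    model_dir='./inference_model',
    quantize_model_path='./quant_post_model',
    sample_generator=sample_generator,
    batch_nums=10)
```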
# Pruning of image classification model - sensitivity
In this tutorial, you will learn how to use the [sensitivity API of PaddleSlim](https://paddleslim.readthedocs.io/en/latest/api_en/index_en.html) through a demo of a MobileNetV1 model on the MNIST dataset.
The workflow of this tutorial is as follows:
1. Import dependencies
...
### 7.1 Compute in single process
Apply sensitivity analysis to the pretrained model by calling the [sensitivity API](https://paddleslim.readthedocs.io/en/latest/api_en/index_en.html), as sketched below.
The sensitivities will be appended to the file given by the `sensitivities_file` option during computation.
The information already in this file won't be computed repeatedly.
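A sketch of the call, assuming `val_program`, `place`, the parameter-name list `params` from the earlier step, and an `eval_func` that runs evaluation on a program and returns its score:

```python
import paddleslim as slim

# Evaluate the accuracy drop for each parameter at several pruning ratios;
# results are cached in sensitivities_file across runs.
sen = slim.prune.sensitivity(
    val_program,
    place,
    params,
    eval_func,
    sensitivities_file="sensitivities_0.data",
    pruned_ratios=[0.1, 0.2, 0.3])
print(sen)
```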
...
### 8.1 Get pruning ratios
Get a group of ratios by calling the [get_ratios_by_loss](https://paddleslim.readthedocs.io/en/latest/api_en/index_en.html) function; the snippet below is a minimal sketch:
```python
# A sketch: for each analyzed parameter, pick the largest pruning ratio
# whose estimated accuracy loss stays within 1%; `sen` is the sensitivity
# data computed in section 7.
ratios = slim.prune.get_ratios_by_loss(sen, 0.01)
print(ratios)
```
...
### 8.3 Pruning test network
Note: the `only_graph` option should be set to True while pruning the test network, since its parameters are shared with the already-pruned train network; see the [Pruner API](https://paddleslim.readthedocs.io/en/latest/api_en/index_en.html) and the sketch below.
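A sketch of the pruning call for the test network, assuming `val_program`, `params`, `ratios`, and `place` from the previous sections:

```python
import paddle.fluid as fluid
import paddleslim as slim

# With only_graph=True only the graph is rewritten; the parameters in the
# scope, already pruned together with the train program, stay untouched.
pruner = slim.prune.Pruner()
pruned_val_program, _, _ = pruner.prune(
    val_program,
    fluid.global_scope(),
    params=params,
    ratios=ratios,
    place=place,
    only_graph=True)
```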