Graph kernel fusion analyzes and optimizes the computational graph logic of an existing network, and splits, reconstructs, and fuses the original computing logic to reduce the overhead of gaps between operator executions and improve the utilization of device computing resources, thereby optimizing the overall execution time of the network.
> The example in this tutorial applies to hardware platforms based on the Ascend 910 AI processor and does not support CPU or GPU scenarios.
## Enabling Method
The graph kernel fusion optimization in MindSpore is distributed across multiple compilation and execution steps at the network layer and is disabled by default. You can enable it by specifying `enable_graph_kernel=True` in the `context` of the training script.
```python
from mindspore import context
context.set_context(enable_graph_kernel=True)
```
### Sample Scripts
1. Simple example
To illustrate the fusion scenario, two simple networks are constructed. The `NetBasicFuse` network includes multiplication and addition, and the `NetCompositeFuse` network includes multiplication, addition, and exponentiation. The following code example is saved as the `test_graph_kernel_fusion.py` file.
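The original sample code is not reproduced in this excerpt. The following is a minimal sketch of what `test_graph_kernel_fusion.py` might look like; the tensor shapes, constants, and the `save_graphs=True` setting (assumed here to dump the `.dot` graph files referenced later) are illustrative rather than the tutorial's exact code.
```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor, context
from mindspore.ops import operations as P

# Enable graph kernel fusion and dump intermediate graphs (.dot files).
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend",
                    enable_graph_kernel=True, save_graphs=True)


class NetBasicFuse(nn.Cell):
    """Multiplication followed by addition: candidates for basic operator fusion."""
    def __init__(self):
        super(NetBasicFuse, self).__init__()
        self.mul = P.Mul()
        self.add = P.TensorAdd()

    def construct(self, x, y):
        return self.add(self.mul(x, y), y)


class NetCompositeFuse(nn.Cell):
    """Multiplication, addition, and exponentiation: candidates for composite fusion."""
    def __init__(self):
        super(NetCompositeFuse, self).__init__()
        self.mul = P.Mul()
        self.add = P.TensorAdd()
        self.pow = P.Pow()

    def construct(self, x, y):
        return self.pow(self.add(self.mul(x, y), y), 2.0)


if __name__ == "__main__":
    x = Tensor(np.random.randn(32, 32).astype(np.float32))
    y = Tensor(np.random.randn(32, 32).astype(np.float32))
    print(NetBasicFuse()(x, y))
    print(NetCompositeFuse()(x, y))
```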
2. `BERT-large` training example
Take the training model of the `BERT-large` network as an example. For details about the dataset and training script, see <https://gitee.com/mindspore/mindspore/tree/master/model_zoo/bert>. You only need to modify the `context` parameter to enable graph kernel fusion.
## Effect Evaluation
To verify whether the graph kernel fusion takes effect, you can compare the changes of the computational graph before and after the fusion is enabled as well as the change of the network training time for one step.
### Computational Graph
1. Basic operator fusion: Analyze associated basic operators on the network. Fuse multiple basic operators into a composite operator on the condition that performance benefits can be obtained. The following uses `NetBasicFuse` as an example.
After the script execution is complete, you will find some `.dot` files in the script running directory. Use the `dot` tool to convert the `.dot` files into `.png` files for viewing (a conversion sketch is provided at the end of this section). `6_validate.dot` and `hwopt_d_fuse_basic_opt_end_graph_0.dot` are used to generate the initial computational graph and the computational graph after basic operator fusion.
As shown in Figure 1, the initial computational graph of the constructed network contains two basic operators. After the graph kernel fusion function is enabled, the two basic operators (`Mul` and `TensorAdd`) are automatically fused into one composite operator. In Figure 2, the upper right part is the composite operator after fusion. The network now only needs to execute one composite operator to complete the `Mul` and `TensorAdd` computing.
Figure 2 Computational graph after basic operator fusion
2. Composite operator fusion: Analyze the original composite operator and its related basic operators. The original composite operator and a basic operator compose a larger composite operator on the condition that performance benefits can be obtained. The following uses `NetCompositeFuse` as an example.
Similarly, `6_validate.dot`, `hwopt_d_fuse_basic_opt_end_graph_0.dot`, and `hwopt_d_composite_opt_end_graph_0.dot` are used to generate the initial computational graph, the computational graph after basic operator fusion, and the computational graph after composite operator fusion.
As shown in Figure 3, the initial computational graph of the constructed network contains three basic operators. After the graph kernel fusion function is enabled, the first two basic operators (`Mul` and `TensorAdd`) are automatically fused into one composite operator at the basic operator fusion stage. As shown in Figure 4, the upper right part shows the composite operator after fusion, and the lower left part shows the remaining basic operator `Pow`. At the subsequent composite operator fusion stage, the remaining basic operator (`Pow`) and the existing composite operator are further fused into a new composite operator. In Figure 5, the upper right part is the composite operator after the three basic operators are fused. The network now only needs to execute one composite operator to complete the `Mul`, `TensorAdd`, and `Pow` computing.
Figure 5 Computational graph after composite operator fusion
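For reference, assuming Graphviz is installed, the `.dot` files mentioned above can be converted to `.png` with its `dot` command (invoked here from Python for convenience; the file names match those listed earlier):
```python
import subprocess

# Render the initial graph and the graphs after each fusion stage to PNG.
for name in ("6_validate",
             "hwopt_d_fuse_basic_opt_end_graph_0",
             "hwopt_d_composite_opt_end_graph_0"):
    subprocess.run(["dot", "-Tpng", f"{name}.dot", "-o", f"{name}.png"], check=True)
```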
### Training Time for One Step
BERT-large scenario: After the graph kernel fusion function is enabled for the BERT-large network, the training time for one step can be improved by more than 10% while the accuracy is the same as that before the function is enabled.
Deep learning technologies are used in an increasing number of applications on mobile and edge devices. Take mobile phones as an example: to provide user-friendly and intelligent services, deep learning functions are integrated into operating systems and applications. However, these functions involve training or inference and contain a large number of models and weight files. The original weight file of AlexNet already exceeds 200 MB, and newer models are developing towards more complex structures with more parameters. Because the hardware resources of mobile and edge devices are limited, models need to be simplified, and quantization is one technology used to solve this problem.
## Concept
### Quantization
Quantization is the process of approximating the floating-point weights of a model, which take continuous values (or a large number of possible discrete values), or the tensor data flowing through the model, with a limited number of fixed-point discrete values (usually INT8), at a relatively low loss of inference accuracy. In other words, 32-bit floating-point data is approximately represented with fewer bits, while the input and output of the model remain floating-point data. In this way, the model size and memory usage are reduced, model inference is accelerated, and power consumption is lowered.
As described above, compared with the FP32 type, low-accuracy data representation types such as FP16, INT8, and INT4 occupy less space. Replacing the high-accuracy data representation type with the low-accuracy data representation type can greatly reduce the storage space and transmission time. Low-bit computing has higher performance. Compared with FP32, INT8 has a three-fold or even higher acceleration ratio. For the same computing, INT8 has obvious advantages in power consumption.
Currently, there are two types of quantization solutions in the industry: quantization aware training and post-training quantization.
### Fake Quantization Node
A fake quantization node is a node inserted during quantization aware training. It is used to find the distribution of the network data and to feed back the resulting accuracy loss. Its specific functions are as follows (a minimal numerical sketch is provided after this list):
- Find the distribution of the network data, that is, the maximum and minimum values of the parameters to be quantized.
- Simulate the accuracy loss of low-bit quantization, apply the loss to the network model, and propagate it to the loss function, so that the optimizer can optimize the loss value during training.
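The following is a minimal numerical sketch (plain NumPy, not MindSpore code) of what a fake quantization node computes for a symmetric, per-tensor 8-bit scheme: the data is quantized onto the INT8 grid and immediately dequantized, so the values stay in float32 but carry the quantization error that training then learns to compensate for.
```python
import numpy as np

def fake_quant(x, num_bits=8):
    """Quantize-dequantize: simulate the accuracy loss of low-bit quantization."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for 8 bits
    scale = max(np.max(np.abs(x)), 1e-8) / qmax         # derived from the observed data range
    q = np.clip(np.round(x / scale), -qmax, qmax)       # values on the integer grid
    return (q * scale).astype(np.float32)               # back to float32, with rounding error

x = np.array([-1.2, 0.03, 0.8, 2.5], dtype=np.float32)
print(fake_quant(x))   # close to x, but not identical: this error is what training sees
```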
## Quantization Aware Training
MindSpore quantization aware training replaces high-accuracy data with low-accuracy data to simplify the model training process. The accuracy loss in this process is inevitable; therefore, fake quantization nodes are used to simulate the accuracy loss, and backward propagation learning is used to reduce it. For the quantization of weights and data, MindSpore adopts the solution in reference [1].
The procedure for the quantization aware training model is the same as that for the common training. After the network is defined and the model is generated, additional operations need to be performed. The complete process is as follows:
1. Process data and load datasets.
2. Define a network.
3. Define a fusion network. After a network is defined, replace the specified operators to define a fusion network.
4. Define an optimizer and loss function.
5. Perform model training. Generate a fusion model based on the fusion network training.
6. Generate a quantization network. After the fusion model is obtained based on the fusion network training, insert a fake quantization node into the fusion model by using a conversion API to generate a quantization network.
7. Perform quantization training. Generate a quantization model based on the quantization network training.
Compared with common training, quantization aware training requires the additional steps 3, 6, and 7 in the preceding process.
> - Fusion network: network after the specified operators (`nn.Conv2dBnAct` and `nn.DenseBnAct`) are used for replacement.
> - Fusion model: model in the checkpoint format generated by the fusion network training.
> - Quantization network: network obtained by using a conversion API (`convert_quant_network`) to insert fake quantization nodes into the fusion model.
> - Quantization model: model in the checkpoint format obtained after the quantization network training.
Next, the LeNet network is used as an example to describe steps 3 and 6.
> You can obtain the complete executable sample code at <https://gitee.com/mindspore/mindspore/tree/master/model_zoo/lenet_quant>.
### Defining a Fusion Network
Define a fusion network and replace the specified operators.
1. Use the `nn.Conv2dBnAct` operator to replace the three operators `nn.Conv2d`, `nn.BatchNorm2d`, and `nn.ReLU` in the original network model.
2. Use the `nn.DenseBnAct` operator to replace the three operators `nn.Dense`, `nn.BatchNorm1d`, and `nn.ReLU` in the original network model.
> Even if the `nn.Dense` and `nn.Conv2d` operators are not followed by batch normalization and `nn.ReLU`, the preceding two replacements must still be performed as required.
The definition of the original network model is as follows:
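The original definition is not reproduced in this excerpt. Below is a minimal LeNet-style sketch of what the original model might look like (layer sizes are illustrative), followed by the corresponding fusion network; the keyword arguments passed to `nn.Conv2dBnAct` and `nn.DenseBnAct` (for example, `activation='relu'`) are assumptions about the fusion operators' interface and should be checked against your MindSpore version.
```python
import mindspore.nn as nn


class LeNet5(nn.Cell):
    """Original network: separate Conv2d/Dense and ReLU operators."""
    def __init__(self, num_class=10):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5, pad_mode='valid')
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid')
        self.fc1 = nn.Dense(16 * 5 * 5, 120)
        self.fc2 = nn.Dense(120, 84)
        self.fc3 = nn.Dense(84, num_class)
        self.relu = nn.ReLU()
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()

    def construct(self, x):
        x = self.max_pool2d(self.relu(self.conv1(x)))
        x = self.max_pool2d(self.relu(self.conv2(x)))
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)
```
The fusion network then replaces each `Conv2d`/`Dense` group (including any following batch normalization and activation) with the corresponding fused operator:
```python
class LeNet5Fusion(nn.Cell):
    """Fusion network: Conv2dBnAct/DenseBnAct replace the original operator groups."""
    def __init__(self, num_class=10):
        super(LeNet5Fusion, self).__init__()
        self.conv1 = nn.Conv2dBnAct(1, 6, 5, pad_mode='valid', activation='relu')
        self.conv2 = nn.Conv2dBnAct(6, 16, 5, pad_mode='valid', activation='relu')
        self.fc1 = nn.DenseBnAct(16 * 5 * 5, 120, activation='relu')
        self.fc2 = nn.DenseBnAct(120, 84, activation='relu')
        self.fc3 = nn.DenseBnAct(84, num_class)
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()

    def construct(self, x):
        x = self.max_pool2d(self.conv1(x))
        x = self.max_pool2d(self.conv2(x))
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.fc2(x)
        return self.fc3(x)
```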
### Converting the Fusion Model into a Quantization Network
Use the `convert_quant_network` API to automatically insert a fake quantization node into the fusion model to convert the fusion model into a quantization network.
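A minimal sketch of this conversion step is shown below, assuming the conversion API is exposed under `mindspore.train.quant`; verify the import path and the available options against your MindSpore version.
```python
from mindspore.train.quant import quant

# `net` is the fusion network whose parameters were loaded from the fusion model.
# convert_quant_network inserts fake quantization nodes; options such as the number
# of quantization bits or batch-norm folding may be configurable via keyword arguments.
net = quant.convert_quant_network(net)
```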
The preceding describes quantization aware training from scratch. A more common case is converting an existing model file into a quantization model. The model file and training script obtained through common network model training can be reused for quantization aware training. To use a checkpoint file for retraining, perform the following steps:
1. Process data and load datasets.
2. Define a network.
3. Define a fusion network.
4. Define an optimizer and loss function.
5. Load a model file and retrain the model. Load an existing model file and retrain it based on the fusion network to generate a fusion model (see the sketch after this list). For details, see <https://www.mindspore.cn/tutorial/en/master/use/saving_and_loading_model_parameters.html#id6>.
6. Generate a quantization network.
7. Perform quantization training.
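A minimal sketch of step 5, loading an existing checkpoint into the fusion network before retraining; the checkpoint file name is hypothetical.
```python
from mindspore.train.serialization import load_checkpoint, load_param_into_net

# Checkpoint produced by an earlier, non-quantized training run (hypothetical name).
param_dict = load_checkpoint("checkpoint_lenet-10_1875.ckpt")
# Load the parameters into the fusion network, then continue training as usual.
load_param_into_net(net, param_dict)
```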
### Inference
Inference using a quantization model is the same as common model inference. It can be performed by directly using the checkpoint file or by converting the checkpoint file into a common model format (such as ONNX or GEIR).
For details, see <https://www.mindspore.cn/tutorial/en/master/use/multi_platform_inference.html>.
- To use a checkpoint file obtained after quantization aware training for inference, perform the following steps (a minimal sketch is provided after this list):
  1. Load the quantization model.
  2. Perform the inference.
- Convert the checkpoint file into a common model format such as ONNX for inference. (This function is coming soon.)
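A minimal sketch of checkpoint-based inference with a quantization model, under the same assumptions as above; the network class, the conversion API path, and the checkpoint file name are illustrative.
```python
import numpy as np
from mindspore import Tensor
from mindspore.train.model import Model
from mindspore.train.quant import quant
from mindspore.train.serialization import load_checkpoint, load_param_into_net

# Rebuild the same quantization network that was used during quantization training.
net = LeNet5Fusion()                    # fusion network defined earlier
net = quant.convert_quant_network(net)  # insert fake quantization nodes again

# Load the quantization model (checkpoint name is hypothetical) and run inference.
param_dict = load_checkpoint("checkpoint_lenet_quant-10_937.ckpt")
load_param_into_net(net, param_dict)

model = Model(net)
input_data = Tensor(np.random.rand(1, 1, 32, 32).astype(np.float32))
output = model.predict(input_data)
```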
## References
[1] Jacob B, Kligys S, Chen B, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 2704-2713.
[2] Krishnamoorthi R. Quantizing deep convolutional networks for efficient inference: A whitepaper[J]. arXiv preprint arXiv:1806.08342, 2018.
- [Inference on the Ascend 910 AI processor](#inference-on-the-ascend-910-ai-processor)
    - [Inference Using a Checkpoint File](#inference-using-a-checkpoint-file)
- [Inference on the Ascend 310 AI processor](#inference-on-the-ascend-310-ai-processor)
    - [Inference Using a Checkpoint File](#inference-using-a-checkpoint-file-1)
    - [Inference Using an ONNX or GEIR File](#inference-using-an-onnx-or-geir-file)
- [Inference on a GPU](#inference-on-a-gpu)
    - [Inference Using a Checkpoint File](#inference-using-a-checkpoint-file-2)
    - [Inference Using an ONNX File](#inference-using-an-onnx-file)
- [Inference on a CPU](#inference-on-a-cpu)
    - [Inference Using a Checkpoint File](#inference-using-a-checkpoint-file-3)
    - [Inference Using an ONNX File](#inference-using-an-onnx-file-1)
- [On-Device Inference](#on-device-inference)
<!-- /TOC -->
## Overview
Models trained by MindSpore support inference on different hardware platforms. This document describes the inference process on each platform.
The inference can be performed in either of the following ways, depending on the principle used:
- Use a checkpoint file for inference, that is, use the inference API to load data and the checkpoint file for inference in the MindSpore training environment.
- Convert the checkpoint file into a common model format, such as ONNX or GEIR, for inference. In this case, the inference environment does not depend on MindSpore, so inference can be performed across hardware platforms as long as the platform supports ONNX or GEIR inference. For example, a model trained on the Ascend 910 AI processor can be inferred on a GPU or CPU.
MindSpore supports the following inference scenarios based on the hardware platform:
| Hardware Platform | Model File Format | Description |
| --- | --- | --- |
| Ascend 910 AI processor | Checkpoint | The training environment dependency is the same as that of MindSpore. |
| Ascend 310 AI processor | ONNX or GEIR | Equipped with the ACL framework and supports models in the OM format. A tool is needed to convert a model into the OM format. |
| GPU | Checkpoint | The training environment dependency is the same as that of MindSpore. |
| GPU | ONNX | Supports ONNX Runtime or an SDK such as TensorRT. |
| CPU | Checkpoint | The training environment dependency is the same as that of MindSpore. |
| CPU | ONNX | Supports a runtime such as ONNX Runtime. |
> - Open Neural Network Exchange (ONNX) is an open file format designed for machine learning. It is used to store trained models and enables different AI frameworks (such as PyTorch and MXNet) to store model data in the same format and interact with each other. For details, visit the ONNX official website at <https://onnx.ai/>.
> - Graph Engine Intermediate Representation (GEIR) is an open file format defined by Huawei for machine learning that better adapts to the Ascend AI processor. It is similar to ONNX.
> - Ascend Computing Language (ACL) provides C++ API libraries for developing deep neural network applications, covering device management, context management, stream management, memory management, model loading and execution, operator loading and execution, and media data processing. It matches the Ascend AI processor and enables hardware running management and resource management.
> - Offline Model (OM) is supported by the Huawei Ascend AI processor. It implements preprocessing functions that can be completed without devices, such as operator scheduling optimization, weight data rearrangement and compression, and memory usage optimization.
> - NVIDIA TensorRT is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime to improve the inference speed of deep learning models on edge devices. For details, see <https://developer.nvidia.com/tensorrt>.
## Inference on the Ascend 910 AI processor
### Inference Using a Checkpoint File
1. Input a validation dataset to validate a model using the `model.eval` API. The processing method of the validation dataset is the same as that of the training dataset. For details about the complete code, see <https://gitee.com/mindspore/mindspore/blob/master/model_zoo/lenet/eval.py>. An end-to-end sketch is also provided after this list.
```python
res = model.eval(dataset)
```
In the preceding information:
`model.eval` is an API for model validation. For details about the API, see <https://www.mindspore.cn/api/en/master/api/python/mindspore/mindspore.html#mindspore.Model.eval>.
2. Use the `model.predict` API to perform inference.
```python
model.predict(input_data)
```
In the preceding information:
`model.predict` is an API for inference. For details about the API, see <https://www.mindspore.cn/api/en/master/api/python/mindspore/mindspore.html#mindspore.Model.predict>.
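For reference, a minimal end-to-end sketch combining the two steps above. Here `LeNet5` and `create_dataset` are assumed to come from the training script, the checkpoint file name and the loss/metric configuration are illustrative, and import paths may differ slightly between MindSpore versions.
```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net

# Rebuild the network and load the trained parameters (hypothetical checkpoint name).
net = LeNet5()
param_dict = load_checkpoint("checkpoint_lenet-10_1875.ckpt")
load_param_into_net(net, param_dict)

loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True)
model = Model(net, loss_fn=loss, metrics={"acc"})

# Validation: the dataset is processed the same way as the training dataset.
dataset = create_dataset("./MNIST/test")
res = model.eval(dataset)
print("accuracy:", res)

# Inference on a single input batch.
input_data = Tensor(np.random.rand(1, 1, 32, 32).astype(np.float32))
output = model.predict(input_data)
```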
## Inference on the Ascend 310 AI processor
### Inference Using a Checkpoint File
The inference is the same as that on the Ascend 910 AI processor.
### Inference Using an ONNX or GEIR File
The Ascend 310 AI processor is equipped with the ACL framework and supports models in the OM format, which need to be converted from models in the ONNX or GEIR format. For inference on the Ascend 310 AI processor, perform the following steps:
1. Generate a model in ONNX or GEIR format on the training platform (a minimal export sketch is shown after this list). For details, see [Export GEIR Model and ONNX Model](https://www.mindspore.cn/tutorial/en/master/use/saving_and_loading_model_parameters.html#geironnx).
2. Convert the ONNX or GEIR model file into an OM model file and perform inference.
- For performing inference in the cloud environment (ModelArts), see the [Ascend 910 training and Ascend 310 inference samples](https://support.huaweicloud.com/bestpractice-modelarts/modelarts_10_0026.html).
- For the bare-metal environment where the Ascend 310 AI processor is deployed locally (as opposed to the cloud environment), see the description document of the Ascend 310 AI processor software package.
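A minimal sketch of the export in step 1, using MindSpore's `export` API; the network, input shape, and file name are illustrative.
```python
import numpy as np
from mindspore import Tensor
from mindspore.train.serialization import export

# `net` is the trained network with its checkpoint parameters already loaded.
input_data = Tensor(np.random.rand(1, 1, 32, 32).astype(np.float32))
export(net, input_data, file_name="lenet.onnx", file_format="ONNX")  # or file_format="GEIR"
```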
## Inference on a GPU
### Inference Using a Checkpoint File
The inference is the same as that on the Ascend 910 AI processor.
### Inference Using an ONNX File
1. Generate a model in ONNX format on the training platform. For details, see [Export GEIR Model and ONNX Model](https://www.mindspore.cn/tutorial/en/master/use/saving_and_loading_model_parameters.html#geironnx).
2. Perform inference on a GPU by referring to the runtime or SDK document. For example, use TensorRT to perform inference on the NVIDIA GPU. For details, see [TensorRT backend for ONNX](https://github.com/onnx/onnx-tensorrt).
## Inference on a CPU
### Inference Using a Checkpoint File
The inference is the same as that on the Ascend 910 AI processor.
### Inference Using an ONNX File
Similar to the inference on a GPU, the following steps are required:
1. Generate a model in ONNX format on the training platform. For details, see [Export GEIR Model and ONNX Model](https://www.mindspore.cn/tutorial/en/master/use/saving_and_loading_model_parameters.html#geironnx).
2. Perform inference on a CPU by referring to the runtime or SDK document. For details about how to use the ONNX Runtime, see the [ONNX Runtime document](https://github.com/microsoft/onnxruntime).
## On-Device Inference
MindSpore Predict is an inference engine for on-device inference. For details, see [On-Device Inference](https://www.mindspore.cn/tutorial/en/master/advanced_use/on_device_inference.html).