Commit b5f81005 authored by: leiyuning

add en version of tutorials

Parent 59b07379
# Graph Kernel Fusion
<!-- TOC -->
- [Graph Kernel Fusion](#graph-kernel-fusion)
- [Overview](#overview)
- [Enabling Method](#enabling-method)
- [Sample Scripts](#sample-scripts)
- [Effect Evaluation](#effect-evaluation)
- [Computational Graph](#computational-graph)
- [Training Time for One Step](#training-time-for-one-step)
<!-- /TOC -->
<a href="https://gitee.com/mindspore/docs/blob/master/tutorials/source_en/advanced_use/graph_kernel_fusion.md" target="_blank"><img src="../_static/logo_source.png"></a>
## Overview
Graph kernel fusion analyzes and optimizes the computational graph logic of an existing network, splitting, reconstructing, and fusing the original computing logic to reduce the overhead of gaps between operator executions and improve device computing resource utilization, thereby optimizing the overall execution time of the network.
> The example in this tutorial applies to hardware platforms based on the Ascend 910 AI processor and does not apply to CPU or GPU scenarios.
## Enabling Method
The graph kernel fusion optimization in MindSpore is distributed across multiple compilation and execution steps at the network layer and is disabled by default. You can enable it by setting `enable_graph_kernel=True` in the `context` of the training script:
```python
from mindspore import context
context.set_context(enable_graph_kernel=True)
```
### Sample Scripts
1. Simple example
To illustrate the fusion scenario, two simple networks are constructed: `NetBasicFuse` contains multiplication and addition, and `NetCompositeFuse` contains multiplication, addition, and exponentiation. Save the following code example as `test_graph_kernel_fusion.py`.
```python
import numpy as np
import mindspore.context as context
from mindspore import Tensor
from mindspore.nn import Cell
from mindspore.ops import operations as P

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
# save graph ir files.
context.set_context(save_graphs=True)
# enable graph kernel fusion.
context.set_context(enable_graph_kernel=True)


# example for basic fusion.
class NetBasicFuse(Cell):
    def __init__(self):
        super(NetBasicFuse, self).__init__()
        self.add = P.TensorAdd()
        self.mul = P.Mul()

    def construct(self, x):
        mul_res = self.mul(x, 2.0)
        add_res = self.add(mul_res, 1.0)
        return add_res


# example for composite fusion.
class NetCompositeFuse(Cell):
    def __init__(self):
        super(NetCompositeFuse, self).__init__()
        self.add = P.TensorAdd()
        self.mul = P.Mul()
        self.pow = P.Pow()

    def construct(self, x):
        mul_res = self.mul(x, 2.0)
        add_res = self.add(mul_res, 1.0)
        pow_res = self.pow(add_res, 3.0)
        return pow_res


def test_basic_fuse():
    x = np.random.randn(4, 4).astype(np.float32)
    net = NetBasicFuse()
    result = net(Tensor(x))
    print("================result=======================")
    print("x: {}".format(x))
    print("result: {}".format(result))
    print("=======================================")


def test_composite_fuse():
    x = np.random.randn(4, 4).astype(np.float32)
    net = NetCompositeFuse()
    result = net(Tensor(x))
    print("================result=======================")
    print("x: {}".format(x))
    print("result: {}".format(result))
    print("=======================================")
```
2. `BERT-large` training network
Take the training model of the `BERT-large` network as an example. For details about the dataset and training script, see <https://gitee.com/mindspore/mindspore/tree/master/model_zoo/bert>. You only need to modify the `context` parameter, as shown in the snippet below.
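For example, enabling the fusion in the BERT training entry script only requires adding the flag to the existing `context` configuration (the surrounding arguments below are illustrative; keep whatever the script already sets):
```python
from mindspore import context

# Existing device configuration in the BERT training script (illustrative).
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
# The only change needed for this tutorial: enable graph kernel fusion.
context.set_context(enable_graph_kernel=True)
```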
## Effect Evaluation
To verify whether graph kernel fusion takes effect, you can compare the computational graph before and after the fusion is enabled, as well as the change in the network training time for one step.
### Computational Graph
1. Basic operator fusion: Analyze associated basic operators on the network. Fuse multiple basic operators into a composite operator on the condition that performance benefits can be obtained. The following uses `NetBasicFuse` as an example.
```bash
pytest -s test_graph_kernel_fusion::test_basic_fuse
```
After the script execution is complete, you will find some `.dot` files in the script running directory. Use the `dot` tool to convert the `.dot` files into `.png` files for viewing (see the conversion example after this list). `6_validate.dot` and `hwopt_d_fuse_basic_opt_end_graph_0.dot` correspond to the initial computational graph and the computational graph after basic operator fusion, respectively.
As shown in Figure 1, the initial computation of the constructed network contains two basic operators. After graph kernel fusion is enabled, the two basic operators (`Mul` and `TensorAdd`) are automatically fused into one composite operator. In Figure 2, the upper right part is the composite operator after fusion. The network now only needs to execute one composite operator to complete the `Mul` and `TensorAdd` computing.
![Initial computational graph](./images/graph_kernel_fusion_example_fuse_basic_before.png)
Figure 1 Initial computational graph
![Basic operator fusion](./images/graph_kernel_fusion_example_fuse_basic_after.png)
Figure 2 Computational graph after basic operator fusion
2. Composite operator fusion: Analyze the original composite operator and its related basic operators. Fuse the composite operator with a related basic operator into a larger composite operator when performance benefits can be obtained. The following uses `NetCompositeFuse` as an example.
```bash
pytest -s test_graph_kernel_fusion::test_composite_fuse
```
Similarly, `6_validate.dot`, `hwopt_d_fuse_basic_opt_end_graph_0.dot`, and `hwopt_d_composite_opt_end_graph_0.dot` correspond to the initial computational graph, the computational graph after basic operator fusion, and the computational graph after composite operator fusion, respectively.
As shown in Figure 3, the initial computation of the constructed network contains three basic operators. After graph kernel fusion is enabled, the first two basic operators (`Mul` and `TensorAdd`) are automatically fused into one composite operator at the basic operator fusion stage. As shown in Figure 4, the upper right part shows the composite operator after fusion, and the lower left part shows the remaining basic operator `Pow`. At the subsequent composite operator fusion stage, the remaining basic operator (`Pow`) and the existing composite operator are further fused into a new composite operator. In Figure 5, the upper right part is the composite operator obtained after the three basic operators are fused. The network now only needs to execute one composite operator to complete the `Mul`, `TensorAdd`, and `Pow` computing.
![Initial computational graph](./images/graph_kernel_fusion_example_fuse_composite_before.png)
Figure 3 Initial computational graph
![Basic operator fusion](./images/graph_kernel_fusion_example_fuse_composite_middle.png)
Figure 4 Computational graph after basic operator fusion
![Composite operator fusion](./images/graph_kernel_fusion_example_fuse_composite_after.png)
Figure 5 Computational graph after composite operator fusion
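The `.dot` files mentioned above can be converted with the Graphviz `dot` tool, for example (file names follow the basic-fusion example; adjust them for the composite case):
```bash
dot -Tpng 6_validate.dot -o 6_validate.png
dot -Tpng hwopt_d_fuse_basic_opt_end_graph_0.dot -o fuse_basic_after.png
```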
### Training Time for One Step
BERT-large scenario: After graph kernel fusion is enabled for the BERT-large network, the training time for one step is reduced by more than 10%, while the accuracy remains the same as before the function is enabled.
# Quantization
<!-- TOC -->
- [Quantization](#quantization)
- [Background](#background)
- [Concept](#concept)
- [Quantization](#quantization-1)
- [Fake Quantization Node](#fake-quantization-node)
- [Quantization Aware Training](#quantization-aware-training)
- [Quantization Aware Training Example](#quantization-aware-training-example)
- [Defining a Fusion Network](#defining-a-fusion-network)
- [Converting the Fusion Model into a Quantization Network](#converting-the-fusion-model-into-a-quantization-network)
- [Retraining and Inference](#retraining-and-inference)
- [Importing a Model for Retraining](#importing-a-model-for-retraining)
- [Inference](#inference)
- [References](#references)
<!-- /TOC -->
<a href="https://gitee.com/mindspore/docs/blob/master/tutorials/source_en/advanced_use/quantization_aware.md" target="_blank"><img src="../_static/logo_source.png"></a>
## Background
Deep learning technologies are used in an increasing number of applications on mobile and edge devices. Take mobile phones as an example: to provide user-friendly and intelligent services, the deep learning function is integrated into operating systems and applications. However, this function involves training or inference and therefore a large number of models and weight files. The original AlexNet weight file already exceeds 200 MB, and newer models are developing toward more complex structures with more parameters. Because the hardware resources of mobile and edge devices are limited, models need to be simplified, and the quantization technology is used to solve this problem.
## Concept
### Quantization
Quantization is the process of approximating the floating-point weights of a model, which take continuous values (or a large number of possible discrete values), or the tensor data flowing through the model, with a limited (relatively small) number of discrete values, usually INT8, at a relatively low loss of inference accuracy. In other words, 32-bit floating-point data is approximately represented with fewer bits, while the input and output of the model remain floating-point data. In this way, the model size and memory usage are reduced, model inference is accelerated, and power consumption is lowered.
As described above, compared with the FP32 type, low-accuracy data representation types such as FP16, INT8, and INT4 occupy less space. Replacing a high-accuracy data type with a low-accuracy one greatly reduces storage space and transmission time. Low-bit computing also delivers higher performance: compared with FP32, INT8 achieves a speedup of three times or more, and for the same computation it has an obvious advantage in power consumption.
Currently, there are two types of quantization solutions in the industry: quantization aware training and post-training quantization.
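As a concrete illustration, a commonly used affine (asymmetric) scheme, in the spirit of reference [1] below, maps a floating-point value $r \in [r_{min}, r_{max}]$ to an integer $q \in [q_{min}, q_{max}]$ (for example $[0, 255]$ for unsigned INT8) as follows:

$$
s = \frac{r_{max} - r_{min}}{q_{max} - q_{min}}, \qquad
z = \operatorname{round}\!\left(q_{min} - \frac{r_{min}}{s}\right)
$$

$$
q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{r}{s}\right) + z,\ q_{min},\ q_{max}\right), \qquad
\hat{r} = s\,(q - z)
$$

Here $s$ is the scale, $z$ is the zero point, and $\hat{r}$ is the dequantized approximation; the difference between $\hat{r}$ and $r$ is the accuracy loss discussed in this document.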
### Fake Quantization Node
A fake quantization node is a node inserted during quantization aware training. It is used to find the distribution of the network data and to feed the accuracy loss back into training, as illustrated by the sketch after this list. Its specific functions are as follows:
- Find the distribution of network data, that is, find the maximum and minimum values of the parameters to be quantized.
- Simulate the accuracy loss of low-bit quantization, apply it to the network model, and propagate it to the loss function, so that the optimizer takes the loss into account during training.
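The following minimal NumPy sketch (illustrative only, not the MindSpore implementation) shows what a fake quantization node does in the forward pass using the affine scheme above: it records the observed minimum and maximum, quantizes the data to an 8-bit grid, and immediately dequantizes it, so that the quantization error is carried into subsequent layers and the loss function:
```python
import numpy as np

def fake_quant(x, num_bits=8):
    """Quantize x to num_bits and dequantize it, simulating the accuracy loss."""
    q_min, q_max = 0, 2 ** num_bits - 1
    # 1. Find the data distribution: min/max of the tensor to be quantized.
    r_min, r_max = float(x.min()), float(x.max())
    scale = (r_max - r_min) / (q_max - q_min) if r_max > r_min else 1.0
    zero_point = np.round(q_min - r_min / scale)
    # 2. Quantize to the integer grid, then dequantize back to float.
    q = np.clip(np.round(x / scale) + zero_point, q_min, q_max)
    return (scale * (q - zero_point)).astype(x.dtype)

x = np.random.randn(4, 4).astype(np.float32)
print(np.abs(fake_quant(x) - x).max())  # per-element quantization error
```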
## Quantization Aware Training
MindSpore quantization aware training replaces high-accuracy data with low-accuracy data to simplify the model. Accuracy loss is inevitable in this process; therefore, fake quantization nodes are used to simulate the loss, and backward propagation learning is used to reduce it. For the quantization of weights and data, MindSpore adopts the solution in reference [1].
Quantization aware training specifications:
| Specification | Description |
| ------------- | ---------------------------------------- |
| Hardware | Supports hardware platforms based on the GPU or Ascend 910 AI processor. |
| Network | Supports networks such as LeNet and ResNet50. For details, see <https://gitee.com/mindspore/mindspore/tree/master/model_zoo>. |
| Algorithm | Supports symmetric and asymmetric quantization algorithms in MindSpore fake quantization training. |
| Solution | Supports 4-, 7-, and 8-bit quantization solutions. |
## Quantization Aware Training Example
The procedure for quantization aware training is basically the same as that for common training; additional operations are required after the network is defined and after the fusion model is generated. The complete process is as follows:
1. Process data and load datasets.
2. Define a network.
3. Define a fusion network. After a network is defined, replace the specified operators to define a fusion network.
4. Define an optimizer and loss function.
5. Perform model training. Generate a fusion model based on the fusion network training.
6. Generate a quantization network. After the fusion model is obtained based on the fusion network training, insert a fake quantization node into the fusion model by using a conversion API to generate a quantization network.
7. Perform quantization training. Generate a quantization model based on the quantization network training.
Compared with common training, quantization aware training requires additional steps, namely steps 3, 6, and 7 in the preceding process.
> - Fusion network: network obtained after the specified operators are replaced with `nn.Conv2dBnAct` and `nn.DenseBnAct`.
> - Fusion model: model in the checkpoint format generated by the fusion network training.
> - Quantization network: network obtained after fake quantization nodes are inserted into the fusion model by using the conversion API (`convert_quant_network`).
> - Quantization model: model in the checkpoint format obtained after the quantization network training.
Next, the LeNet network is used as an example to describe steps 3 and 6.
> You can obtain the complete executable sample code at <https://gitee.com/mindspore/mindspore/tree/master/model_zoo/lenet_quant>.
### Defining a Fusion Network
Define a fusion network and replace the specified operators.
1. Use the `nn.Conv2dBnAct` operator to replace the three operators `nn.Conv2d`, `nn.BatchNorm2d`, and `nn.ReLU` in the original network model.
2. Use the `nn.DenseBnAct` operator to replace the three operators `nn.Dense`, `nn.BatchNorm1d`, and `nn.ReLU` in the original network model.
> Even if the `nn.Dense` and `nn.Conv2d` operators are not followed by batch normalization and `nn.ReLU`, the preceding replacements must still be performed as required.
The definition of the original network model is as follows:
```python
import mindspore.nn as nn


class LeNet5(nn.Cell):
    def __init__(self, num_class=10):
        super(LeNet5, self).__init__()
        self.num_class = num_class

        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)
        self.bn1 = nn.BatchNorm2d(6)
        self.act1 = nn.ReLU()

        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        self.bn2 = nn.BatchNorm2d(16)
        self.act2 = nn.ReLU()

        self.fc1 = nn.Dense(16 * 5 * 5, 120)
        self.fc2 = nn.Dense(120, 84)
        self.act3 = nn.ReLU()
        self.fc3 = nn.Dense(84, self.num_class)

        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()

    def construct(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.act1(x)
        x = self.max_pool2d(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.act2(x)
        x = self.max_pool2d(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.act3(x)
        x = self.fc2(x)
        x = self.act3(x)
        x = self.fc3(x)
        return x
```
The following shows the fusion network after operators are replaced:
```python
import mindspore.nn as nn


class LeNet5(nn.Cell):
    def __init__(self, num_class=10):
        super(LeNet5, self).__init__()
        self.num_class = num_class

        self.conv1 = nn.Conv2dBnAct(1, 6, kernel_size=5, batchnorm=True, activation='relu')
        self.conv2 = nn.Conv2dBnAct(6, 16, kernel_size=5, batchnorm=True, activation='relu')

        self.fc1 = nn.DenseBnAct(16 * 5 * 5, 120, activation='relu')
        self.fc2 = nn.DenseBnAct(120, 84, activation='relu')
        self.fc3 = nn.DenseBnAct(84, self.num_class)

        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()

    def construct(self, x):
        x = self.conv1(x)
        x = self.max_pool2d(x)
        x = self.conv2(x)
        x = self.max_pool2d(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return x
```
### Converting the Fusion Model into a Quantization Network
Use the `convert_quant_network` API to automatically insert a fake quantization node into the fusion model to convert the fusion model into a quantization network.
```python
from mindspore.train.quant import quant as qat
net = qat.convert_quant_network(net, quant_delay=0, bn_fold=False, freeze_bn=10000, weight_bits=8, act_bits=8)
```
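After the conversion, step 7 (quantization training) proceeds like common training on the returned network. The following is a minimal sketch rather than the complete LeNet script: `ds_train` stands for a prepared training dataset, the hyperparameters are placeholders, and the import paths follow the MindSpore version used by this tutorial and may differ in later releases.
```python
import mindspore.nn as nn
from mindspore.train import Model
from mindspore.train.callback import ModelCheckpoint, LossMonitor

# `net` is the quantization network returned by convert_quant_network above.
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
opt = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)
model = Model(net, loss_fn=loss, optimizer=opt)

# ModelCheckpoint saves the quantization model in the checkpoint format.
model.train(10, ds_train, callbacks=[ModelCheckpoint(prefix="lenet_quant"), LossMonitor()])
```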
## Retraining and Inference
### Importing a Model for Retraining
The preceding describes quantization aware training from scratch. A more common case is converting an existing model file into a quantization model. The model file and training script obtained through common network model training can be reused for quantization aware training. To use a checkpoint file for retraining, perform the following steps:
1. Process data and load datasets.
2. Define a network.
3. Define a fusion network.
4. Define an optimizer and loss function.
5. Load a model file and retrain the model. Load an existing model file and retrain it based on the fusion network to generate a fusion model, as sketched after this list. For details, see <https://www.mindspore.cn/tutorial/en/master/use/saving_and_loading_model_parameters.html#id6>.
6. Generate a quantization network.
7. Perform quantization training.
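A minimal sketch of step 5 (the checkpoint file name is a placeholder and `net` is the fusion network defined earlier; the import path follows the MindSpore version used by this tutorial):
```python
from mindspore.train.serialization import load_checkpoint, load_param_into_net

# Load the checkpoint produced by common (non-quantization) training and
# copy its parameters into the fusion network before retraining.
param_dict = load_checkpoint("checkpoint_lenet.ckpt")
load_param_into_net(net, param_dict)
# Retrain the fusion network, then apply convert_quant_network and perform
# quantization training as described in the previous sections.
```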
### Inference
Inference using a quantization model is the same as common model inference. It can be performed by directly using the checkpoint file or by converting the checkpoint file into a common model format such as ONNX or GEIR.
For details, see <https://www.mindspore.cn/tutorial/en/master/use/multi_platform_inference.html>.
- To use a checkpoint file obtained after quantization aware training for inference, perform the following steps:
1. Load the quantization model.
2. Perform the inference.
- Convert the checkpoint file into a common model format such as ONNX for inference. (This function is coming soon.)
## References
[1] Jacob B, Kligys S, Chen B, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 2704-2713.
[2] Krishnamoorthi R. Quantizing deep convolutional networks for efficient inference: A whitepaper[J]. arXiv preprint arXiv:1806.08342, 2018.
......@@ -48,6 +48,8 @@ MindSpore Tutorials
advanced_use/distributed_training_tutorials
advanced_use/mixed_precision
advanced_use/graph_kernel_fusion
advanced_use/quantization_aware
.. toctree::
:glob:
......
# Multi-Platform Inference
<!-- TOC -->
- [Multi-Platform Inference](#multi-platform-inference)
- [Overview](#overview)
- [Inference on the Ascend 910 AI processor](#inference-on-the-ascend-910-ai-processor)
- [Inference Using a Checkpoint File](#inference-using-a-checkpoint-file)
- [Inference on the Ascend 310 AI processor](#inference-on-the-ascend-310-ai-processor)
- [Inference Using a Checkpoint File](#inference-using-a-checkpoint-file-1)
- [Inference Using an ONNX or GEIR File](#inference-using-an-onnx-or-geir-file)
- [Inference on a GPU](#inference-on-a-gpu)
- [Inference Using a Checkpoint File](#inference-using-a-checkpoint-file-2)
- [Inference Using an ONNX File](#inference-using-an-onnx-file)
- [Inference on a CPU](#inference-on-a-cpu)
- [Inference Using a Checkpoint File](#inference-using-a-checkpoint-file-3)
- [Inference Using an ONNX File](#inference-using-an-onnx-file-1)
- [On-Device Inference](#on-device-inference)
<!-- /TOC -->
......@@ -12,30 +23,92 @@
## Overview
Models trained by MindSpore support inference on different hardware platforms. This document describes the inference process on each platform.
MindSpore supports the following inference scenarios based on the hardware platform:
| Hardware Platform | Model File Format | Description |
| ----------------------- | ----------------- | ---------------------------------------- |
| Ascend 910 AI processor | Checkpoint | The training environment dependency is the same as that of MindSpore. |
| Ascend 310 AI processor | ONNX or GEIR | Equipped with the ACL framework and supports the model in OM format. You need to use a tool to convert a model into the OM format. |
| GPU | Checkpoint | The training environment dependency is the same as that of MindSpore. |
| GPU | ONNX | Supports ONNX Runtime or SDK, for example, TensorRT. |
| CPU | Checkpoint | The training environment dependency is the same as that of MindSpore. |
| CPU | ONNX | Supports ONNX Runtime or SDK, for example, TensorRT. |
> Open Neural Network Exchange (ONNX) is an open file format designed for machine learning. It is used to store trained models. It enables different AI frameworks (such as PyTorch and MXNet) to store model data in the same format and interact with each other. For details, visit the ONNX official website <https://onnx.ai/>.
> Graph Engine Intermediate Representation (GEIR) is an open file format defined by Huawei for machine learning and can better adapt to the Ascend AI processor. It is similar to ONNX.
> Ascend Computing Language (ACL) provides C++ API libraries for users to develop deep neural network applications, including device management, context management, stream management, memory management, model loading and execution, operator loading and execution, and media data processing. It matches the Ascend AI processor and enables hardware running management and resource management.
> Offline Model (OM) is supported by the Huawei Ascend AI processor. It implements preprocessing functions that can be completed without devices, such as operator scheduling optimization, weight data rearrangement and compression, and memory usage optimization.
> NVIDIA TensorRT is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime to improve the inference speed of the deep learning model on edge devices. For details, see <https://developer.nvidia.com/tensorrt>.
## Inference on the Ascend 910 AI processor
### Inference Using a Checkpoint File
1. Input a validation dataset to validate a model using the `model.eval` API. The processing method of the validation dataset is the same as that of the training dataset.
```python
res = model.eval(dataset)
```
In the preceding information:
`model.eval` is an API for model validation. For details about the API, see <https://www.mindspore.cn/api/en/master/api/python/mindspore/mindspore.html#mindspore.Model.eval>.
> Inference sample code: <https://gitee.com/mindspore/mindspore/blob/master/model_zoo/lenet/eval.py>.
2. Use the `model.predict` API to perform inference.
```python
model.predict(input_data)
```
In the preceding information:
`model.predict` is an API for inference. For details about the API, see <https://www.mindspore.cn/api/en/master/api/python/mindspore/mindspore.html#mindspore.Model.predict>. A more complete sketch combining both steps follows this list.
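For reference, the sketch below builds the `model` object used in both steps. It is illustrative rather than normative: the `LeNet5` network, the checkpoint file name, and the `dataset` variable are placeholders, and the import paths follow the MindSpore version used by this tutorial.
```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor
from mindspore.train import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.nn.metrics import Accuracy

net = LeNet5()  # the network used for training
load_param_into_net(net, load_checkpoint("checkpoint_lenet.ckpt"))
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
model = Model(net, loss_fn=loss, metrics={"Accuracy": Accuracy()})

res = model.eval(dataset)  # step 1: validation
out = model.predict(Tensor(np.ones([1, 1, 32, 32], np.float32)))  # step 2: inference
```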
## Inference on the Ascend 310 AI processor
### Inference Using a Checkpoint File
The inference is the same as that on the Ascend 910 AI processor.
### Inference Using an ONNX or GEIR File
The Ascend 310 AI processor is equipped with the ACL framework and supports models only in the OM format, which must be converted from a model in ONNX or GEIR format. For inference on the Ascend 310 AI processor, perform the following steps:
1. Generate a model in ONNX or GEIR format on the training platform (see the export sketch after this list). For details, see [Export GEIR Model and ONNX Model](https://www.mindspore.cn/tutorial/en/master/use/saving_and_loading_model_parameters.html#geironnx).
2. Convert the ONNX or GEIR model file into an OM model file and perform inference.
- For performing inference in the cloud environment (ModelArts), see the [Ascend 910 training and Ascend 310 inference samples](https://support.huaweicloud.com/bestpractice-modelarts/modelarts_10_0026.html).
- For the bare-metal environment where the Ascend 310 AI processor is deployed locally (as opposed to the cloud environment), see the documentation of the Ascend 310 AI processor software package.
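A minimal sketch of step 1, exporting a trained network with MindSpore's `export` API (the network, input shape, and file names are placeholders; see the linked tutorial for the full procedure):
```python
import numpy as np
from mindspore import Tensor
from mindspore.train.serialization import export

# Export the trained network to GEIR (or ONNX) before converting it to OM.
input_data = Tensor(np.ones([1, 1, 32, 32], np.float32))
export(net, input_data, file_name="lenet.geir", file_format="GEIR")
# export(net, input_data, file_name="lenet.onnx", file_format="ONNX")
```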
## Inference on a GPU
### Inference Using a Checkpoint File
The inference is the same as that on the Ascend 910 AI processor.
### Inference Using an ONNX File
1. Generate a model in ONNX format on the training platform. For details, see [Export GEIR Model and ONNX Model](https://www.mindspore.cn/tutorial/en/master/use/saving_and_loading_model_parameters.html#geironnx).
2. Perform inference on a GPU by referring to the runtime or SDK document. For example, use TensorRT to perform inference on the NVIDIA GPU. For details, see [TensorRT backend for ONNX](https://github.com/onnx/onnx-tensorrt).
## Inference on a CPU
### Inference Using a Checkpoint File
The inference is the same as that on the Ascend 910 AI processor.
### Inference Using an ONNX File
Similar to the inference on a GPU, the following steps are required:
1. Generate a model in ONNX format on the training platform. For details, see [Export GEIR Model and ONNX Model](https://www.mindspore.cn/tutorial/en/master/use/saving_and_loading_model_parameters.html#geironnx).
2. Perform inference on a CPU by referring to the runtime or SDK document. For details about how to use the ONNX Runtime, see the [ONNX Runtime document](https://github.com/microsoft/onnxruntime); a minimal sketch follows this list.
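A minimal sketch of step 2 with ONNX Runtime on the CPU (the model file name and input shape are placeholders matching the export sketch above):
```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("lenet.onnx")      # model exported from MindSpore
input_name = session.get_inputs()[0].name
data = np.ones([1, 1, 32, 32], dtype=np.float32)  # placeholder input batch
outputs = session.run(None, {input_name: data})
print(outputs[0].shape)
```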
## On-Device Inference
MindSpore Predict is an inference engine for on-device inference. For details, see [On-Device Inference](https://www.mindspore.cn/tutorial/en/master/advanced_use/on_device_inference.html).
......@@ -18,7 +18,7 @@
<!-- /TOC -->
<a href="https://gitee.com/mindspore/docs/blob/master/tutorials/source_zh_cn/advanced_use/quantization_aware.md" target="_blank"><img src="../_static/logo_source.png"></a>
## Background
......@@ -32,7 +32,7 @@
As described above, compared with the FP32 type, low-accuracy data representation types such as FP16, INT8, and INT4 occupy less space. Replacing a high-accuracy data type with a low-accuracy one greatly reduces storage space and transmission time. Low-bit computing also delivers higher performance: compared with FP32, INT8 achieves a speedup of three times or more, and for the same computation it has an obvious advantage in power consumption.
Currently, there are two main types of quantization solutions in the industry: quantization aware training and post-training quantization.
### Fake Quantization Node
......
......@@ -49,7 +49,7 @@ MindSpore教程
advanced_use/distributed_training_tutorials
advanced_use/mixed_precision
advanced_use/graph_kernel_fusion
advanced_use/quantization_aware
.. toctree::
:glob:
......
......@@ -67,7 +67,7 @@ CPU | ONNX格式 | 支持ONNX推理的runtime/SDK,如TensorRT。
model.predict(input_data)
```
In the preceding information:
`model.predict` is an API for inference. For details about the API, see <https://www.mindspore.cn/api/zh-CN/master/api/python/mindspore/mindspore.html#mindspore.Model.predict>.
## Inference on the Ascend 310 AI processor
......