doc22_074.md

# 量化

> 原文：[`pytorch.org/docs/stable/quantization.html`](https://pytorch.org/docs/stable/quantization.html)> 
>
> 译者：[飞龙](https://github.com/wizardforcel)
>
> 协议：[CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/)


警告

量化处于测试阶段，可能会有变化。

## 量化简介

量化是指在执行计算和存储张量时使用比浮点精度更低的比特宽度的技术。量化模型在张量上执行一些或所有操作时，使用降低的精度而不是完整精度（浮点）值。这样可以实现更紧凑的模型表示，并在许多硬件平台上使用高性能的矢量化操作。PyTorch 支持 INT8 量化，相比典型的 FP32 模型，可以将模型大小减小 4 倍，内存带宽需求减小 4 倍。与 FP32 计算相比，硬件对 INT8 计算的支持通常快 2 到 4 倍。量化主要是一种加速推断的技术，只支持量化操作符的前向传播。

PyTorch 支持多种量化深度学习模型的方法。在大多数情况下，模型在 FP32 中训练，然后将模型转换为 INT8。此外，PyTorch 还支持量化感知训练，该训练模型在前向和后向传递中使用伪量化模块模拟量化误差。请注意，整个计算过程都是在浮点数中进行的。在量化感知训练结束时，PyTorch 提供转换函数将训练好的模型转换为较低精度。

在较低级别，PyTorch 提供了一种表示量化张量并对其进行操作的方法。它们可以用于直接构建在较低精度中执行全部或部分计算的模型。还提供了更高级别的 API，其中包含将 FP32 模型转换为较低精度的典型工作流程，以最小化精度损失。

## 量化 API 摘要

PyTorch 提供三种不同的量化模式：Eager 模式量化，FX 图模式量化（维护）和 PyTorch 2 导出量化。

Eager 模式量化是一个测试功能。用户需要手动进行融合并指定量化和去量化发生的位置，它只支持模块而不支持功能。

FX 图模式量化是 PyTorch 中的自动量化工作流程，目前是一个原型功能，自从我们有了 PyTorch 2 导出量化以来，它处于维护模式。它通过添加对功能的支持和自动化量化过程来改进 Eager 模式量化，尽管人们可能需要重构模型以使其与 FX 图模式量化兼容（使用`torch.fx`进行符号跟踪）。请注意，FX 图模式量化不适用于任意模型，因为模型可能无法进行符号跟踪，我们将把它集成到领域库中，如 torchvision，用户将能够使用 FX 图模式量化对支持的领域库中的模型进行量化。对于任意模型，我们将提供一般性指导，但要使其正常工作，用户可能需要熟悉`torch.fx`，特别是如何使模型进行符号跟踪。

PyTorch 2 导出量化是新的完整图模式量化工作流程，作为 PyTorch 2.1 原型功能发布。随着 PyTorch 2 的推出，我们正在转向更好的解决方案，用于完整程序捕获（torch.export），因为与 FX 图模式量化使用的程序捕获解决方案（torch.fx.symbolic_trace）相比，它可以捕获更高比例（14K 模型上的 88.8%比 14K 模型上的 72.7%）。torch.export 仍然存在一些限制，涉及到一些 Python 结构，并需要用户参与以支持导出模型中的动态性，但总体而言，它是对以前的程序捕获解决方案的改进。PyTorch 2 导出量化是为 torch.export 捕获的模型构建的，考虑到建模用户和后端开发人员的灵活性和生产力。其主要特点是（1）可编程 API，用于配置如何对模型进行量化，可以扩展到更多用例（2）简化的用户体验，用于建模用户和后端开发人员，因为他们只需要与一个对象（量化器）交互，表达用户关于如何量化模型以及后端支持的意图（3）可选的参考量化模型表示，可以用整数操作表示量化计算，更接近硬件中实际发生的量化计算。

鼓励新用户首先尝试 PyTorch 2 导出量化，如果效果不佳，用户可以尝试急切模式量化。

以下表格比较了急切模式量化、FX 图模式量化和 PyTorch 2 导出量化之间的区别：

|  | 急切模式量化 | FX 图模式量化 | PyTorch 2 导出量化 |
| --- | --- | --- | --- |
| 发布状态 | beta | 原型（维护） | 原型 |
| 运算符融合 | 手动 | 自动 | 自动 |
| 量化/去量化放置 | 手动 | 自动 | 自动 |
| 量化模块 | 支持 | 支持 | 支持 |
| 量化功能/Torch 操作 | 手动 | 自动 | 支持 |
| 自定义支持 | 有限支持 | 完全支持 | 完全支持 |
| 量化模式支持 | 训练后量化：静态、动态、仅权重量化感知训练：静态 | 训练后量化：静态、动态、仅权重量化感知训练：静态 | 由后端特定量化器定义 |
| 输入/输出模型类型 | `torch.nn.Module` | `torch.nn.Module`（可能需要一些重构以使模型与 FX 图模式量化兼容） | `torch.fx.GraphModule`（由`torch.export`捕获） |

支持三种类型的量化：

1.  动态量化（权重量化，激活以浮点数读取/存储并量化计算）

1.  静态量化（权重量化，激活量化，训练后需要校准）

1.  静态量化感知训练（权重量化，激活量化，训练期间建模量化数值）

请查看我们的[PyTorch 量化介绍](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/)博客文章，了解这些量化类型之间的权衡更全面的概述。

动态量化和静态量化的运算符覆盖范围不同，详见下表。

|  | 静态量化 | 动态量化 |
| --- | --- | --- |
| nn.Linearnn.Conv1d/2d/3d | YY | YN |
| nn.LSTMnn.GRU | Y（通过自定义模块）N | YY |
| nn.RNNCellnn.GRUCellnn.LSTMCell | NNN | YYY |
| nn.EmbeddingBag | Y（激活为 fp32） | Y |
| nn.Embedding | Y | Y |
| nn.MultiheadAttention | Y（通过自定义模块） | 不支持 |
| 激活 | 广泛支持 | 保持不变，计算保持在 fp32 中 |

### 急切模式量化

有关量化流程的一般介绍，包括不同类型的量化，请参阅一般量化流程。

#### 训练后动态量化

这是最简单的量化形式，其中权重在预先量化，但激活在推断期间动态量化。这适用于模型执行时间主要由从内存加载权重而不是计算矩阵乘法所主导的情况。这对于批量大小较小的 LSTM 和 Transformer 类型模型是真实的。

图表：

```py
# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                 /
linear_weight_fp32

# dynamically quantized model
# linear and LSTM weights are in int8
previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
                     /
   linear_weight_int8 
```

PTDQ API 示例:

```py
import torch

# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()
# create a quantized model instance
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,  # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32) 
```

要了解更多关于动态量化的信息，请参阅我们的[动态量化教程](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html)。

#### 后训练静态量化

后训练静态量化（PTQ 静态）量化模型的权重和激活。它将激活融合到可能的前置层中。它需要使用代表性数据集进行校准，以确定激活的最佳量化参数。后训练静态量化通常用于 CNN 是典型用例的情况下，其中内存带宽和计算节省都很重要。

在应用后训练静态量化之前，我们可能需要修改模型。请参阅急切模式静态量化的模型准备。

图表：

```py
# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                    /
    linear_weight_fp32

# statically quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                    /
  linear_weight_int8 
```

PTSQ API 示例：

```py
import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')

# Fuse the activations to preceding layers, where applicable.
# This needs to be done manually depending on the model architecture.
# Common fusions include `conv + relu` and `conv + batchnorm + relu`
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare the model for static quantization. This inserts observers in
# the model that will observe activation tensors during calibration.
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, and replaces key operators with quantized
# implementations.
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32) 
```

要了解更多关于静态量化的信息，请参阅[静态量化教程](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html)。

#### 静态量化的量化感知训练

量化感知训练（QAT）在训练期间模拟量化的效果，从而使准确性比其他量化方法更高。我们可以对静态、动态或仅权重量化进行 QAT。在训练期间，所有计算都是在浮点数中进行的，通过 fake_quant 模块模拟量化的效果，通过夹紧和四舍五入来模拟 INT8 的效果。在模型转换后，权重和激活被量化，并且激活被融合到可能的前置层中。它通常与 CNN 一起使用，并且与静态量化相比，准确性更高。

在应用后训练静态量化之前，我们可能需要修改模型。请参阅急切模式静态量化的模型准备。

图表：

```py
# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                      /
    linear_weight_fp32

# model with fake_quants for modeling quantization numerics during training
previous_layer_fp32 -- fq -- linear_fp32 -- activation_fp32 -- fq -- next_layer_fp32
                           /
   linear_weight_fp32 -- fq

# quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                     /
   linear_weight_int8 
```

QAT API 示例：

```py
import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval for fusion to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quants in
# the model needs to be set to train for QAT logic to work
# the model that will observe weight and activation tensors during calibration.
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

# run the training loop (not shown)
training_loop(model_fp32_prepared)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, fuses modules where appropriate,
# and replaces key operators with quantized implementations.
model_fp32_prepared.eval()
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32) 
```

要了解更多关于量化感知训练，请参阅[QAT 教程](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html)。

#### 急切模式静态量化的模型准备

目前在进行急切模式量化之前需要对模型定义进行一些修改。这是因为目前量化是基于模块的。具体来说，对于所有量化技术，用户需要：

1.  将需要输出重新量化的操作（因此具有额外参数）从功能形式转换为模块形式（例如，使用`torch.nn.ReLU`而不是`torch.nn.functional.relu`）。

1.  指定模型的哪些部分需要量化，可以通过在子模块上分配`.qconfig`属性或通过指定`qconfig_mapping`来实现。例如，设置`model.conv1.qconfig = None`意味着`model.conv`层不会被量化，设置`model.linear1.qconfig = custom_qconfig`意味着`model.linear1`的量化设置将使用`custom_qconfig`而不是全局 qconfig。

对于量化激活的静态量化技术，用户需要额外执行以下操作：

1.  指定激活量化和去量化的位置。这是使用`QuantStub`和`DeQuantStub`模块完成的。

1.  使用`FloatFunctional`将需要特殊处理以进行量化的张量操作包装成模块。例如，需要特殊处理以确定输出量化参数的操作如`add`和`cat`。

1.  融合模块：将操作/模块组合成单个模块以获得更高的准确性和性能。这是使用`fuse_modules()`API 完成的，该 API 接受要融合的模块列表。我们目前支持以下融合：[Conv, Relu]，[Conv, BatchNorm]，[Conv, BatchNorm, Relu]，[Linear, Relu]

### （原型-维护模式）FX 图模式量化[]（＃prototype-maintaince-mode-fx-graph-mode-quantization“跳转到此标题”）

后训练量化中有多种量化类型（仅权重、动态和静态），配置通过 qconfig_mapping（prepare_fx 函数的参数）完成。

FXPTQ API 示例：

```py
import torch
from torch.ao.quantization import (
  get_default_qconfig_mapping,
  get_default_qat_qconfig_mapping,
  QConfigMapping,
)
import torch.ao.quantization.quantize_fx as quantize_fx
import copy

model_fp = UserModel()

#
# post training dynamic/weight_only quantization
#

# we need to deepcopy if we still want to keep model_fp unchanged after quantization since quantization apis change the input model
model_to_quantize = copy.deepcopy(model_fp)
model_to_quantize.eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_dynamic_qconfig)
# a tuple of one or more example inputs are needed to trace the model
example_inputs = (input_fp32)
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# no calibration needed when we only have dynamic/weight_only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# post training static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_mapping = get_default_qconfig_mapping("qnnpack")
model_to_quantize.eval()
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# calibrate (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# quantization aware training for static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_mapping = get_default_qat_qconfig_mapping("qnnpack")
model_to_quantize.train()
# prepare
model_prepared = quantize_fx.prepare_qat_fx(model_to_quantize, qconfig_mapping, example_inputs)
# training loop (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# fusion
#
model_to_quantize = copy.deepcopy(model_fp)
model_fused = quantize_fx.fuse_fx(model_to_quantize) 
```

请按照以下教程了解有关 FX 图模式量化的更多信息：

+   [使用 FX 图模式量化的用户指南](https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html)

+   [FX 图模式后训练静态量化](https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_static.html)

+   [FX 图模式后训练动态量化](https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_dynamic.html)

### （原型）PyTorch 2 导出量化[]（＃prototype-pytorch-2-export-quantization“跳转到此标题”）

API 示例：

```py
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 10)

   def forward(self, x):
       return self.linear(x)

# initialize a floating point model
float_model = M().eval()

# define calibration function
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)

# Step 1\. program capture
# NOTE: this API will be updated to torch.export API in the future, but the captured
# result shoud mostly stay the same
m = capture_pre_autograd_graph(m, *example_inputs)
# we get a model with aten ops

# Step 2\. quantization
# backend developer will write their own Quantizer and expose methods to allow
# users to express how they
# want the model to be quantized
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
# or prepare_qat_pt2e for Quantization Aware Training
m = prepare_pt2e(m, quantizer)

# run calibration
# calibrate(m, sample_inference_data)
m = convert_pt2e(m)

# Step 3\. lowering
# lower to target backend 
```

请按照以下教程开始 PyTorch 2 导出量化：

建模用户：

+   [PyTorch 2 导出后训练量化](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html)

+   [通过电感器在 X86 后端进行 PyTorch 2 导出后训练量化](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq_x86_inductor.html)

+   [PyTorch 2 导出量化感知训练](https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html)

后端开发人员（请查看所有建模用户文档）：

+   [如何为 PyTorch 2 导出量化器编写量化器](https://pytorch.org/tutorials/prototype/pt2e_quantizer.html)

## 量化堆栈

量化是将浮点模型转换为量化模型的过程。因此，在高层次上，量化堆栈可以分为两部分：1）用于量化模型的构建块或抽象 2）用于将浮点模型转换为量化模型的量化流的构建块或抽象

### 量化模型

#### 量化张量

为了在 PyTorch 中进行量化，我们需要能够在张量中表示量化数据。量化张量允许存储量化数据（表示为 int8/uint8/int32）以及量化参数，如比例和零点。量化张量允许进行许多有用的操作，使量化算术变得容易，同时允许以量化格式序列化数据。

PyTorch 支持对称和非对称的每个张量和每个通道量化。每个张量意味着张量中的所有值以相同的方式使用相同的量化参数进行量化。每个通道意味着对于每个维度，通常是张量的通道维度，张量中的值使用不同的量化参数进行量化。这样可以减少将张量转换为量化值时的误差，因为异常值只会影响它所在的通道，而不是整个张量。

通过将浮点张量转换为

[！[_images/math-quantizer-equation.png]（../Images/161907b1eaa52e48f126f8041595a277.png）]（_images/math-quantizer-equation.png）

请注意，我们确保在量化后浮点数中的零不会出现错误，从而确保像填充这样的操作不会导致额外的量化误差。

以下是量化张量的一些关键属性：

+   QScheme（torch.qscheme）：指定我们量化张量的方式的枚举

    +   torch.per_tensor_affine

    +   torch.per_tensor_symmetric

    +   torch.per_channel_affine

    +   torch.per_channel_symmetric

+   dtype（torch.dtype）：量化张量的数据类型

    +   torch.quint8

    +   torch.qint8

    +   torch.qint32

    +   torch.float16

+   量化参数（根据 QScheme 的不同而变化）：所选量化方式的参数

    +   torch.per_tensor_affine 将具有量化参数

        +   比例（浮点数）

        +   零点（整数）

    +   torch.per_channel_affine 将具有量化参数

        +   每通道比例（浮点数列表）

        +   每通道零点（整数列表）

        +   轴（整数）

#### 量化和去量化

模型的输入和输出是浮点张量，但量化模型中的激活是量化的，因此我们需要运算符在浮点张量和量化张量之间进行转换。

+   量化（浮点->量化）

    +   torch.quantize_per_tensor(x，scale，zero_point，dtype）

    +   torch.quantize_per_channel(x，scales，zero_points，axis，dtype）

    +   torch.quantize_per_tensor_dynamic(x，dtype，reduce_range)

    +   to(torch.float16)

+   去量化（量化->浮点）

    +   quantized_tensor.dequantize() - 在 torch.float16 张量上调用 dequantize 将张量转换回 torch.float

    +   torch.dequantize(x)

#### 量化运算符/模块[]（＃量化运算符模块“到此标题的永久链接”）

+   量化运算符是将量化张量作为输入的运算符，并输出量化张量。

+   量化模块是执行量化操作的 PyTorch 模块。它们通常用于加权操作，如线性和卷积。

#### 量化引擎

当执行量化模型时，qengine（torch.backends.quantized.engine）指定要用于执行的后端。重要的是要确保 qengine 与量化模型在量化激活和权重的值范围方面是兼容的。

### 量化流程

#### 观察器和 FakeQuantize

+   观察器是 PyTorch 模块，用于：

    +   收集张量统计信息，如通过观察器传递的张量的最小值和最大值

    +   并根据收集的张量统计数据计算量化参数

+   FakeQuantize 是 PyTorch 模块，用于：

    +   在网络中为张量模拟量化（执行量化/去量化）

    +   它可以根据观察器收集的统计数据计算量化参数，也可以学习量化参数

#### QConfig

+   QConfig 是 Observer 或 FakeQuantize 模块类的命名元组，可以配置 qscheme、dtype 等。它用于配置如何观察运算符

    +   运算符/模块的量化配置

        +   不同类型的观察器/FakeQuantize

        +   dtype

        +   qscheme

        +   quant_min/quant_max：可用于模拟低精度张量

    +   当前支持激活和权重的配置

    +   我们根据为给定运算符或模块配置的 qconfig 插入输入/权重/输出观察器

#### 一般量化流程

一般来说，流程如下

+   准备

    +   基于用户指定的 qconfig 插入 Observer/FakeQuantize 模块

+   校准/训练（取决于后训练量化或量化感知训练）

    +   允许 Observer 收集统计信息或 FakeQuantize 模块学习量化参数

+   转换

    +   将校准/训练好的模型转换为量化模型

有不同的量化模式，它们可以按两种方式分类：

在我们应用量化流程的位置方面，我们有：

1.  后训练量化（在训练后应用量化，量化参数是基于样本校准数据计算的）

1.  量化感知训练（在训练过程中模拟量化，以便量化参数可以与使用训练数据训练的模型一起学习）

在我们量化运算符的方式方面，我们可以有：

+   仅权重量化（仅权重是静态量化的）

+   动态量化（权重静态量化，激活动态量化）

+   静态量化（权重和激活均为静态量化）

我们可以在同一量化流程中混合不同的量化运算符方式。例如，我们可以有后训练量化，其中既有静态量化的运算符，也有动态量化的运算符。

## 量化支持矩阵

### 量化模式支持

|  | 量化模式 | 数据集要求 | 最适用于 | 精度 | 注释 |
| --- | --- | --- | --- | --- | --- |
| 后训练量化 | 动态/仅权重量化 | 激活动态量化（fp16，int8）或未量化，权重静态量化（fp16，int8，in4） | 无 | LSTM，MLP，嵌入，Transformer | 良好 | 在性能受计算或内存限制的情况下，易于使用，接近静态量化时的性能 |
| 静态量化 | 激活和权重静态量化（int8） | 校准数据集 | CNN | 良好 | 提供最佳性能，可能对精度有很大影响，适用于仅支持 int8 计算的硬件 |
| 量化感知训练 | 动态量化 | 激活和权重均为虚假量化 | 微调数据集 | MLP，嵌入 | 最佳 | 目前支持有限 |
| 静态量化 | 激活和权重均为虚假量化 | 微调数据集 | CNN，MLP，嵌入 | 最佳 | 通常在静态量化导致精度下降时使用，并用于缩小精度差距 |

请查看我们的[Pytorch 量化简介](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/)博客文章，以获取更全面的这些量化类型之间权衡的概述。

### 量化流程支持

PyTorch 提供两种量化模式：Eager Mode Quantization 和 FX Graph Mode Quantization。

Eager Mode Quantization 是一个测试功能。用户需要手动进行融合并指定量化和去量化发生的位置，它仅支持模块而不支持功能。

FX Graph Mode Quantization 是 PyTorch 中的自动量化框架，目前是一个原型功能。它通过添加对功能的支持和自动化量化过程来改进 Eager Mode Quantization，尽管人们可能需要重构模型以使其与 FX Graph Mode Quantization 兼容（通过 `torch.fx` 进行符号跟踪）。请注意，FX Graph Mode Quantization 不适用于任意模型，因为模型可能无法进行符号跟踪，我们将把它集成到领域库中，如 torchvision，并且用户将能够使用 FX Graph Mode Quantization 对类似于受支持领域库中的模型进行量化。对于任意模型，我们将提供一般性指导，但要使其正常工作，用户可能需要熟悉 `torch.fx`，特别是如何使模型符号可跟踪。

鼓励量化的新用户首先尝试 FX 图模式量化，如果不起作用，用户可以尝试遵循[使用 FX 图模式量化](https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html)的指南或回退到急切模式量化。

以下表格比较了急切模式量化和 FX 图模式量化之间的差异。

|  | 急切模式量化 | FX 图模式量化 |
| --- | --- | --- |
| 发布状态 | beta | 原型 |
| 运算符融合 | 手动 | 自动 |
| 量化/去量化放置 | 手动 | 自动 |
| 量化模块 | 支持 | 支持 |
| 量化功能/火炬操作 | 手动 | 自动 |
| 定制支持 | 有限支持 | 完全支持 |
| 量化模式支持 | 后训练量化：静态，动态，仅权重量化感知训练：静态 | 后训练量化：静态，动态，仅权重量化感知训练：静态 |
| 输入/输出模型类型 | `torch.nn.Module` | `torch.nn.Module`（可能需要一些重构以使模型与 FX 图模式量化兼容） |

### 后端/硬件支持

| 硬件 | 内核库 | 急切模式量化 | FX 图模式量化 | 量化模式支持 |
| --- | --- | --- | --- | --- |
| 服务器 CPU | fbgemm/onednn | 支持 | 所有支持 |
| 移动 CPU | qnnpack/xnnpack |
| 服务器 GPU | TensorRT（早期原型） | 不支持，需要图形 | 支持 | 静态量化 |

今天，PyTorch 支持以下用于高效运行量化运算符的后端：

+   具有 AVX2 支持或更高版本的 x86 CPU（没有 AVX2，某些操作具有低效的实现），通过由[fbgemm](https://github.com/pytorch/FBGEMM)和[onednn](https://github.com/oneapi-src/oneDNN)优化的 x86（请参阅[RFC](https://github.com/pytorch/pytorch/issues/83888)中的详细信息）

+   ARM CPU（通常在移动/嵌入式设备中找到），通过[qnnpack](https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/native/quantized/cpu/qnnpack)

+   （早期原型）通过[fx2trt](https://developer.nvidia.com/tensorrt)支持 NVidia GPU

#### 本机 CPU 后端注意事项

我们同时暴露了 x86 和 qnnpack，使用相同的本机 pytorch 量化运算符，因此我们需要额外的标志来区分它们。根据 PyTorch 构建模式自动选择 x86 和 qnnpack 的相应实现，尽管用户可以通过将 torch.backends.quantization.engine 设置为 x86 或 qnnpack 来覆盖此设置。

在准备量化模型时，必须确保 qconfig 和用于量化计算的引擎与将执行模型的后端匹配。qconfig 控制量化过程中使用的观察者类型。qengine 控制在为线性和卷积函数和模块打包权重时使用 x86 还是 qnnpack 特定的打包函数。例如：

x86 的默认设置：

```py
# set the qconfig for PTQ
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default on x86 CPUs
qconfig = torch.ao.quantization.get_default_qconfig('x86')
# or, set the qconfig for QAT
qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'x86' 
```

qnnpack 的默认设置：

```py
# set the qconfig for PTQ
qconfig = torch.ao.quantization.get_default_qconfig('qnnpack')
# or, set the qconfig for QAT
qconfig = torch.ao.quantization.get_default_qat_qconfig('qnnpack')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'qnnpack' 
```

### 运算符支持

动态和静态量化之间的运算符覆盖范围在下表中捕获。请注意，对于 FX 图模式量化，相应的功能也受支持。

|  | 静态量化 | 动态量化 |
| --- | --- | --- |
| nn.Linear | YY | YN |
| nn.LSTMnn.GRU | NN | YY |
| nn.RNNCellnn.GRUCellnn.LSTMCell | NNN | YYY |
| nn.EmbeddingBag | Y（激活在 fp32 中） | Y |
| nn.Embedding | Y | Y |
| nn.MultiheadAttention | 不支持 | 不支持 |
| 激活 | 广泛支持 | 保持不变，计算保持在 fp32 中 |

注意：这将很快更新一些从本机 backend_config_dict 生成的信息。

## 量化 API 参考

量化 API 参考包含有关量化 API 的文档，例如量化传递、量化张量操作以及支持的量化模块和函数。

## 量化后端配置

量化后端配置包含有关如何为各种后端配置量化工作流程的文档。

## 量化精度调试

量化精度调试包含有关如何调试量化精度的文档。

## 量化定制

尽管提供了默认的观察者实现来根据观察到的张量数据选择比例因子和偏差，但开发人员可以提供自己的量化函数。量化可以选择性地应用于模型的不同部分，或者为模型的不同部分进行不同的配置。

我们还为**conv1d()**、**conv2d()**、**conv3d()**和**linear()**提供了通道量化支持。

量化工作流程通过向模型的模块层次结构添加（例如，将观察者添加为`.observer`子模块）或替换（例如，将`nn.Conv2d`转换为`nn.quantized.Conv2d`）子模块来实现。这意味着模型在整个过程中保持常规的基于`nn.Module`的实例，因此可以与 PyTorch 的其余 API 一起使用。

### 量化定制模块 API

Eager 模式和 FX 图模式的量化 API 提供了一个钩子，供用户以自定义方式指定模块的量化，具有用户定义的观察和量化逻辑。用户需要指定：

1.  源 fp32 模块的 Python 类型（存在于模型中）

1.  观察模块的 Python 类型（由用户提供）。该模块需要定义一个`from_float`函数，该函数定义了如何从原始 fp32 模块创建观察模块。

1.  量化模块的 Python 类型（由用户提供）。该模块需要定义一个`from_observed`函数，该函数定义了如何从观察到的模块创建量化模块。

1.  描述（1）、（2）、（3）的配置，传递给量化 API。

然后框架将执行以下操作：

1.  在准备模块交换期间，它将使用（2）中类的`from_float`函数，将（1）中指定类型的每个模块转换为（2）中指定类型。

1.  在转换模块交换期间，它将使用（3）中的类的`from_observed`函数，将（2）中指定类型的每个模块转换为（3）中指定类型。

目前，ObservedCustomModule 将具有单个张量输出的要求，并且观察者将由框架（而不是用户）添加到该输出上。观察者将作为自定义模块实例的属性存储在`activation_post_process`键下。在未来可能会放宽这些限制。

自定义 API 示例：

```py
import torch
import torch.ao.nn.quantized as nnq
from torch.ao.quantization import QConfigMapping
import torch.ao.quantization.quantize_fx

# original fp32 module to replace
class CustomModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(3, 3)

    def forward(self, x):
        return self.linear(x)

# custom observed module, provided by user
class ObservedCustomModule(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return self.linear(x)

    @classmethod
    def from_float(cls, float_module):
        assert hasattr(float_module, 'qconfig')
        observed = cls(float_module.linear)
        observed.qconfig = float_module.qconfig
        return observed

# custom quantized module, provided by user
class StaticQuantCustomModule(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return self.linear(x)

    @classmethod
    def from_observed(cls, observed_module):
        assert hasattr(observed_module, 'qconfig')
        assert hasattr(observed_module, 'activation_post_process')
        observed_module.linear.activation_post_process = \
            observed_module.activation_post_process
        quantized = cls(nnq.Linear.from_float(observed_module.linear))
        return quantized

#
# example API call (Eager mode quantization)
#

m = torch.nn.Sequential(CustomModule()).eval()
prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        CustomModule: ObservedCustomModule
    }
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        ObservedCustomModule: StaticQuantCustomModule
    }
}
m.qconfig = torch.ao.quantization.default_qconfig
mp = torch.ao.quantization.prepare(
    m, prepare_custom_config_dict=prepare_custom_config_dict)
# calibration (not shown)
mq = torch.ao.quantization.convert(
    mp, convert_custom_config_dict=convert_custom_config_dict)
#
# example API call (FX graph mode quantization)
#
m = torch.nn.Sequential(CustomModule()).eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_qconfig)
prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        "static": {
            CustomModule: ObservedCustomModule,
        }
    }
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        "static": {
            ObservedCustomModule: StaticQuantCustomModule,
        }
    }
}
mp = torch.ao.quantization.quantize_fx.prepare_fx(
    m, qconfig_mapping, torch.randn(3,3), prepare_custom_config=prepare_custom_config_dict)
# calibration (not shown)
mq = torch.ao.quantization.quantize_fx.convert_fx(
    mp, convert_custom_config=convert_custom_config_dict) 
```

## 最佳实践

1. 如果您正在使用`x86`后端，我们需要使用 7 位而不是 8 位。确保您减少`quant_min`、`quant_max`的范围，例如，如果`dtype`是`torch.quint8`，请确保将自定义的`quant_min`设置为`0`，`quant_max`设置为`127`（`255` / `2`）；如果`dtype`是`torch.qint8`，请确保将自定义的`quant_min`设置为`-64`（`-128` / `2`），`quant_max`设置为`63`（`127` / `2`），如果您调用`torch.ao.quantization.get_default_qconfig(backend)`或`torch.ao.quantization.get_default_qat_qconfig(backend)`函数来获取`x86`或`qnnpack`后端的默认`qconfig`，我们已经正确设置了这些。

2. 如果选择了`onednn`后端，将在默认的 qconfig 映射`torch.ao.quantization.get_default_qconfig_mapping('onednn')`和默认的 qconfig`torch.ao.quantization.get_default_qconfig('onednn')`中使用 8 位激活。建议在支持向量神经网络指令（VNNI）的 CPU 上使用。否则，将激活的观察者的`reduce_range`设置为 True，以在没有 VNNI 支持的 CPU 上获得更好的准确性。

## 常见问题

1.  如何在 GPU 上进行量化推断?：

    我们目前还没有官方的 GPU 支持，但这是一个积极开发的领域，您可以在[这里](https://github.com/pytorch/pytorch/issues/87395)找到更多信息

1.  我如何为我的量化模型获得 ONNX 支持?

    如果在导出模型时出现错误（使用`torch.onnx`下的 API），您可以在 PyTorch 存储库中打开一个问题。在问题标题前加上`[ONNX]`并将问题标记为`module: onnx`。

    如果您在 ONNX Runtime 中遇到问题，请在[GitHub - microsoft/onnxruntime](https://github.com/microsoft/onnxruntime/issues/)上打开一个问题。

1.  如何在 LSTM 中使用量化?：

    LSTM 通过我们的自定义模块 API 在急切模式和 fx 图模式量化中得到支持。示例可以在急切模式中找到：[pytorch/test_quantized_op.py TestQuantizedOps.test_custom_module_lstm](https://github.com/pytorch/pytorch/blob/9b88dcf248e717ca6c3f8c5e11f600825547a561/test/quantization/core/test_quantized_op.py#L2782) FX 图模式中：[pytorch/test_quantize_fx.py TestQuantizeFx.test_static_lstm](https://github.com/pytorch/pytorch/blob/9b88dcf248e717ca6c3f8c5e11f600825547a561/test/quantization/fx/test_quantize_fx.py#L4116)

## 常见错误

### 将一个非量化的张量传递给一个量化的内核

如果您看到类似以下错误：

```py
RuntimeError: Could not run 'quantized::some_operator' with arguments from the 'CPU' backend... 
```

这意味着您正在尝试将一个非量化的张量传递给一个量化的内核。一个常见的解决方法是使用`torch.ao.quantization.QuantStub`来对张量进行量化。这在 Eager 模式量化中需要手动完成。一个端到端的例子：

```py
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)

    def forward(self, x):
        # during the convert step, this will be replaced with a
        # `quantize_per_tensor` call
        x = self.quant(x)
        x = self.conv(x)
        return x 
```

### 将一个量化的张量传递给一个非量化的内核

如果您看到类似以下错误：

```py
RuntimeError: Could not run 'aten::thnn_conv2d_forward' with arguments from the 'QuantizedCPU' backend. 
```

这意味着您正在尝试将一个量化的张量传递给一个非量化的内核。一个常见的解决方法是使用`torch.ao.quantization.DeQuantStub`来对张量进行去量化。这在 Eager 模式量化中需要手动完成。一个端到端的例子：

```py
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv1 = torch.nn.Conv2d(1, 1, 1)
        # this module will not be quantized (see `qconfig = None` logic below)
        self.conv2 = torch.nn.Conv2d(1, 1, 1)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # during the convert step, this will be replaced with a
        # `quantize_per_tensor` call
        x = self.quant(x)
        x = self.conv1(x)
        # during the convert step, this will be replaced with a
        # `dequantize` call
        x = self.dequant(x)
        x = self.conv2(x)
        return x

m = M()
m.qconfig = some_qconfig
# turn off quantization for conv2
m.conv2.qconfig = None 
```

### 保存和加载量化模型

在对一个量化模型调用`torch.load`时，如果出现类似以下错误：

```py
AttributeError: 'LinearPackedParams' object has no attribute '_modules' 
```

这是因为直接使用`torch.save`和`torch.load`保存和加载一个量化模型是不受支持的。要保存/加载量化模型，可以使用以下方法：

1.  保存/加载量化模型的 state_dict

一个例子：

```py
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 5)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.linear(x)
        x = self.relu(x)
        return x

m = M().eval()
prepare_orig = prepare_fx(m, {'' : default_qconfig})
prepare_orig(torch.rand(5, 5))
quantized_orig = convert_fx(prepare_orig)

# Save/load using state_dict
b = io.BytesIO()
torch.save(quantized_orig.state_dict(), b)

m2 = M().eval()
prepared = prepare_fx(m2, {'' : default_qconfig})
quantized = convert_fx(prepared)
b.seek(0)
quantized.load_state_dict(torch.load(b)) 
```

1.  使用`torch.jit.save`和`torch.jit.load`保存/加载脚本化的量化模型

一个例子：

```py
# Note: using the same model M from previous example
m = M().eval()
prepare_orig = prepare_fx(m, {'' : default_qconfig})
prepare_orig(torch.rand(5, 5))
quantized_orig = convert_fx(prepare_orig)

# save/load using scripted model
scripted = torch.jit.script(quantized_orig)
b = io.BytesIO()
torch.jit.save(scripted, b)
b.seek(0)
scripted_quantized = torch.jit.load(b) 
```

### 在使用 FX 图模式量化时出现符号跟踪错误

符号跟踪是(原型-维护模式) FX 图模式量化的要求，因此如果您传递一个不能被符号跟踪的 PyTorch 模型到 torch.ao.quantization.prepare_fx 或 torch.ao.quantization.prepare_qat_fx，我们可能会看到以下类似的错误：

```py
torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow 
```

请查看[符号跟踪的限制](https://pytorch.org/docs/2.0/fx.html#limitations-of-symbolic-tracing)并使用-[使用 FX 图模式量化的用户指南](https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html)来解决问题。