Created by: cryoco
PR types
Function optimization
PR changes
Others
Describe
TRT support of PaddleSlim quant models
Remove restrictions on quantizing extra ops for TRT inference
While using Paddle-TRT int8 to perform inference on models trained with PaddleSlim quant-aware training, there were some restrictions during the training phase: if the model contains ops such as elementwise_add, pool2d or leaky_relu, they must be quantized as well, or TRT will throw an error during inference. This limits the flexibility of int8 quantization. We added support so that when training with PaddleSlim QAT, users no longer have to quantize these extra ops in order to run inference with TensorRT, at negligible performance cost.
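For illustration, a minimal sketch of the workflow this enables; the model path, config keys, and exact API signatures below are assumptions based on the PaddleSlim/Paddle Inference APIs of this era, not taken from this PR:

```python
# Sketch: PaddleSlim quant-aware training config. With this change, only
# the weighted ops need fake quant/dequant nodes; extra ops such as
# elementwise_add, pool2d or leaky_relu can be left out of
# quantize_op_types and the model still runs under Paddle-TRT int8.
from paddleslim.quant import quant_aware

quant_config = {
    'quantize_op_types': ['conv2d', 'depthwise_conv2d', 'mul'],
    'weight_bits': 8,
    'activation_bits': 8,
}
# train_program and place come from the user's usual training setup.
quant_program = quant_aware(train_program, place, quant_config, for_test=False)
```

At inference time, a quant model saved this way could then be deployed with TensorRT int8 roughly as follows (old-style AnalysisConfig API; names vary across Paddle versions):

```python
# Sketch: enabling Paddle-TRT int8 on the saved quant model.
from paddle.fluid.core import AnalysisConfig, create_paddle_predictor

config = AnalysisConfig('quant_model')  # hypothetical model directory
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=1,
    min_subgraph_size=3,
    precision_mode=AnalysisConfig.Precision.Int8,
    use_static=False,
    use_calib_mode=False)  # scales come from QAT, no calibration needed
predictor = create_paddle_predictor(config)
```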
quant_conv2d_dequant_fuse_pass refactor
quant_conv2d_dequant_fuse_pass is used to extract input and output scales from quant/dequant ops and set them as attributes on the quantized ops. The former quant_conv2d_dequant_fuse_pass had the following problems:
- Using a matrix-like structure to store IR nodes requires a threshold on the maximum number of quantized nodes preceded by the same quant node, which causes serious problems when the model contains branching structures where a single quant node feeds more quantized nodes than the threshold allows.
- Converting weights from int8 range to fp32 range happened in tensorrt_subgraph_pass, which means that if a quantized op falls outside the TRT subgraph, its weights will not be converted. This could cause wrong results when the TRT subgraph cannot cover all quantized nodes.
We refactored this pass by splitting the fusion into two phases, DeleteQuant fuse and Dequant fuse, so that the threshold on branch nodes is no longer needed. Moreover, the range of quantized weights in conv/mul/fc is converted to fp32 in this pass instead of in tensorrt_subgraph_pass, so the result is correct even when the TRT subgraph cannot cover all quantized nodes.
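For intuition, a minimal NumPy sketch of that weight range conversion; the real pass does this on the IR graph in C++ and handles per-channel scales, so the function name and per-tensor scale here are illustrative only:

```python
import numpy as np

def dequantize_weight(w_quant_range, scale, quant_bits=8):
    """Map a weight tensor stored in int8 range back to fp32 range.

    After quant-aware training, conv/mul/fc weights are fp32 values
    lying in [-bound, bound] (bound = 127 for 8 bits). Dividing by
    bound and multiplying by the scale restores the fp32 range.
    """
    bound = (1 << (quant_bits - 1)) - 1  # 127 for 8 bits
    return w_quant_range * scale / bound

# Example: a weight entry of 127 with scale 0.5 maps back to 0.5.
w = np.array([127.0, -63.5, 0.0], dtype=np.float32)
print(dequantize_weight(w, scale=0.5))  # [ 0.5  -0.25  0. ]
```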