Enable INT8 Inference with a model trained/saved from QAT (Quantization-Aware Training)
Created by: jianhang-liu
There are two possible solutions:
- Solution 1: Save an INT8 model after SLIM QAT, then run it directly with INT8 kernels (e.g. conv2d, mul)
- Solution 2: Save an FP32 model after SLIM QAT, then apply post-training quantization (e.g. INT8v2) and run the result
Solution 1 raises a significant performance concern: the INT8 model saved directly from QAT cannot easily be further optimized with inference-only graph passes such as op fusion. We are evaluating Solution 2; if it proves feasible (i.e. accuracy is preserved), both the effort and the risk will be greatly reduced.
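
For reference, below is a minimal sketch of the Solution 2 flow, assuming PaddleSlim's post-training quantization entry point (`paddleslim.quant.quant_post_static`). The function name, its parameters, the model paths, and the calibration reader are assumptions for illustration and may differ across Paddle/PaddleSlim versions; this is not the exact workflow under evaluation.

```python
# Hedged sketch of Solution 2:
#   SLIM QAT -> FP32 model on disk -> post-training quantization -> INT8 inference model
# All paths and the calibration reader below are hypothetical placeholders.
import numpy as np
import paddle
import paddleslim

paddle.enable_static()

place = paddle.CPUPlace()
exe = paddle.static.Executor(place)

# Step 1 (assumed already done by SLIM QAT): the QAT-trained model has been
# saved as a plain FP32 inference model, e.g. under "./qat_fp32_model".

def sample_generator():
    # Hypothetical calibration data source; replace with a reader that yields
    # real samples in the model's input layout.
    for _ in range(32):
        yield [np.random.rand(1, 3, 224, 224).astype("float32")]

# Step 2: run post-training quantization (the INT8v2-style pass) on the saved
# FP32 model to produce an INT8 inference model. Argument names are taken from
# PaddleSlim's post-training quantization API and may vary by version.
paddleslim.quant.quant_post_static(
    executor=exe,
    model_dir="./qat_fp32_model",        # FP32 model saved after SLIM QAT (assumed path)
    quantize_model_path="./int8_model",  # output INT8 inference model (assumed path)
    sample_generator=sample_generator,
    batch_size=1,
    batch_nums=32,
)

# Step 3: the model in "./int8_model" can then be loaded by the inference
# library, where inference-only graph passes (e.g. op fusion) and INT8 kernels
# (conv2d, mul, ...) are applied.
```

The key point of this flow is that graph optimizations run on a normal FP32 inference model before quantization, which is exactly what Solution 1 makes difficult.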