Enable INT8 Inference with a model trained/saved from QAT (Quantization-Aware Training)
Created by: jianhang-liu
There are two possible solutions:
- Solution 1: Save an INT8 model after SLIM QAT, then run it directly with INT8 kernels (e.g. conv2d, mul)
- Solution 2: Save an FP32 model after SLIM QAT, then apply post-training quantization (e.g. INT8v2) and run the result
Solution 1 raises a significant performance concern: the INT8 model saved directly from QAT cannot easily be further optimized with inference-only graph passes such as op fusion. We are evaluating Solution 2; if it proves feasible (i.e. accuracy is preserved), both the effort and the risk will be greatly reduced.
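
For reference, below is a minimal sketch of the Solution 2 flow, assuming PaddleSlim's post-training quantization entry point (`paddleslim.quant.quant_post_static`). The function name, its parameters, the model paths, and the calibration reader are assumptions for illustration and may differ across Paddle/PaddleSlim versions; this is not the exact workflow under evaluation.

```python
# Hedged sketch of Solution 2:
#   SLIM QAT -> FP32 model on disk -> post-training quantization -> INT8 inference model
# All paths and the calibration reader below are hypothetical placeholders.
import numpy as np
import paddle
import paddleslim

paddle.enable_static()

place = paddle.CPUPlace()
exe = paddle.static.Executor(place)

# Step 1 (assumed already done by SLIM QAT): the QAT-trained model has been
# saved as a plain FP32 inference model, e.g. under "./qat_fp32_model".

def sample_generator():
    # Hypothetical calibration data source; replace with a reader that yields
    # real samples in the model's input layout.
    for _ in range(32):
        yield [np.random.rand(1, 3, 224, 224).astype("float32")]

# Step 2: run post-training quantization (the INT8v2-style pass) on the saved
# FP32 model to produce an INT8 inference model. Argument names are taken from
# PaddleSlim's post-training quantization API and may vary by version.
paddleslim.quant.quant_post_static(
    executor=exe,
    model_dir="./qat_fp32_model",        # FP32 model saved after SLIM QAT (assumed path)
    quantize_model_path="./int8_model",  # output INT8 inference model (assumed path)
    sample_generator=sample_generator,
    batch_size=1,
    batch_nums=32,
)

# Step 3: the model in "./int8_model" can then be loaded by the inference
# library, where inference-only graph passes (e.g. op fusion) and INT8 kernels
# (conv2d, mul, ...) are applied.
```

The key point of this flow is that graph optimizations run on a normal FP32 inference model before quantization, which is exactly what Solution 1 makes difficult.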