QAT & INT8 performance improvement
Created by: wojtuss
Currently we have two new QAT-trained models of ResNet50 (from Baidu) at our disposal: I. with conv2d and mul weights in FP32 full range (Model1), II. with conv2d and mul weights in FP32 but (fake-)quantized (Model2). Model2 has the advantage that it can be used directly in inference (fake INT8). Model1 cannot be used directly; it has to be modified first (either the weights have to be fake-quantized, resulting in a fake INT8 model, or the fake quantize/dequantize ops have to be removed, resulting in a full FP32 inference model).
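The difference between the two models comes down to fake quantization of the weights: a quantize-then-dequantize round trip that keeps the values in FP32 but snaps them onto the INT8 grid. Below is a minimal numpy sketch of the idea, assuming symmetric, per-tensor, abs-max quantization to 8 bits; the helper name and exact formula are illustrative only, not Paddle's implementation.

```python
import numpy as np

def fake_quantize(weights, num_bits=8):
    """Quantize-then-dequantize a weight tensor (symmetric, per-tensor, abs-max).

    The result stays in FP32, but its values lie on the INT8 grid --
    this is what distinguishes Model2's weights from Model1's full-range weights.
    """
    qmax = 2 ** (num_bits - 1) - 1         # 127 for int8
    scale = np.abs(weights).max() / qmax   # one abs-max scale per tensor
    quantized = np.round(weights / scale)  # integer levels
    quantized = np.clip(quantized, -qmax, qmax)
    return quantized * scale               # back to FP32 ("fake" INT8)

w_fp32 = np.random.randn(64, 3, 7, 7).astype(np.float32)  # e.g. a conv2d kernel
w_fake_int8 = fake_quantize(w_fp32)                        # Model2-style weights
```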
Our general approach to getting an optimized INT8 model in both cases is as follows (a rough sketch follows the list):
1. gather scale values from the fake quantize/dequantize operators,
2. extract the FP32 inference model from the QAT model, i.e.:
   a. remove the fake quantize/dequantize operators,
   b. dequantize conv2d and mul weights, if needed,
3. optimize the FP32 inference model using standard fusing passes (e.g. conv2d+bn, conv2d+relu, …),
4. quantize the optimized FP32 model using the standard INT8v2 quantization passes (cpu_quantize_pass, cpu_quantize_squash_pass).
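For steps 1 and 2.a, the sketch below models the graph as a plain list of op descriptors; the real work is done on Paddle's IR graph, and the op names and the scale attribute used here are assumptions made for illustration only. Steps 2.b, 3 and 4 are indicated as comments, since they reuse the scales gathered in step 1 and the existing fuse and cpu_quantize_pass / cpu_quantize_squash_pass passes.

```python
# Toy illustration of steps 1 and 2.a. A real implementation operates on
# Paddle's IR graph; the op/attribute names below are assumptions for the sketch.

FAKE_QUANT_OPS = {"fake_quantize_abs_max", "fake_quantize_moving_average_abs_max"}
FAKE_DEQUANT_OPS = {"fake_dequantize_max_abs"}

def collect_scales(graph):
    """Step 1: remember the scale of every tensor touched by a fake (de)quant op."""
    scales = {}
    for op in graph:
        if op["type"] in FAKE_QUANT_OPS | FAKE_DEQUANT_OPS:
            scales[op["input"]] = op["attrs"]["scale"]
            scales[op["output"]] = op["attrs"]["scale"]
    return scales

def remove_fake_ops(graph):
    """Step 2.a: drop fake quantize/dequantize ops and rewire their consumers."""
    rename = {}  # output var of a removed fake op -> its input var
    kept = []
    for op in graph:
        if op["type"] in FAKE_QUANT_OPS | FAKE_DEQUANT_OPS:
            rename[op["output"]] = op["input"]
        else:
            kept.append(op)
    for op in kept:  # reconnect consumers to the original tensors
        op["input"] = rename.get(op["input"], op["input"])
    return kept

# Step 2.b (if needed): dequantize conv2d and mul weights back to regular FP32
# values using the scales gathered in step 1.
# Steps 3-4: run the standard fuse passes (conv2d+bn, conv2d+relu, ...) and then
# cpu_quantize_pass / cpu_quantize_squash_pass, feeding them the collected scales.

toy_graph = [
    {"type": "fake_quantize_abs_max", "input": "conv_in", "output": "conv_in_q",
     "attrs": {"scale": 0.017}},
    {"type": "conv2d", "input": "conv_in_q", "output": "conv_out", "attrs": {}},
]
scales = collect_scales(toy_graph)       # {'conv_in': 0.017, 'conv_in_q': 0.017}
fp32_graph = remove_fake_ops(toy_graph)  # conv2d now reads 'conv_in' directly
```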
Initially, we started work with Model1 and implemented all but the last step. That is, from Model1 we obtained a fully optimized FP32 inference model (we verified that accuracy is preserved at this point) which is ready to be quantized. Only step 4 remained to be implemented and applied. After switching to Model2, steps 2.b and 4 remain to be implemented and applied.
If step 2.b for Model2 turns out to be too difficult to perform while preserving accuracy, we would recommend continuing the work on Model1.
We expect the whole procedure to be working next week (WW31).