Fixed-point quantization uses fewer bits, for example 2-bit, 3-bit or 8-bit fixed point, to represent weights and activations, which are usually stored as single-precision floating-point values (32 bits). The fixed-point representation has advantages in reducing memory bandwidth, lowering power consumption and computational resources, as well as shrinking the model storage requirements. It is especially important for inference on embedded devices. According to some experiments, directly quantizing a model trained in floating point works well for large models, like the VGG model with its many parameters, but the accuracy drops a lot for small models. In order to improve the trade-off between accuracy and latency, many quantized training approaches have been proposed. This document presents the design of a quantized training framework on Fluid. The first part introduces how to quantize, the second part describes the quantized training framework, and the last part illustrates how to calculate the quantization scale.

### How to quantize

There are many ways to quantize a float value to a fixed-point value. For example:

$$ r = min(max(x, a), b)$$
$$ s = \frac{b - a}{n - 1} $$
$$ q = \left \lfloor \frac{r - a}{s} \right \rceil $$

where $x$ is the float value to be quantized, $[a, b]$ is the quantization range, $a$ is the minimum value and $b$ is the maximal value. $\left \lfloor \right \rceil$ denotes rounding to the nearest integer. If the quantization level is $k$, then $n$ is $2^{k - 1}$; for example, for $k = 8$, $n$ is 128. $q$ is the quantized integer.

The quantization we apply is parameterized by the number of quantization levels and the maximum absolute value:

$$ M = max(abs(x)) $$
$$ q = \left \lfloor \frac{x}{M} * (n - 1) \right \rceil $$

where $x$ is the float value to be quantized and $M$ is the maximum absolute value. $\left \lfloor \right \rceil$ denotes rounding to the nearest integer. For 8-bit quantization, $n=2^{8 - 1}=128$. $q$ is the quantized integer. Both schemes are made concrete in the sketch below.

Whether *min-max* quantization or *max-abs* quantization is used, both can be represented as:

$q = scale * r + b$

We call *min-max* and *max-abs* the quantization arguments; they are also called the quantization scale or quantization range. How to calculate the quantization scale (or maximum absolute value) for inference will be described in the last part.

### Training Framework

#### Forward pass

The forward pass simulates quantization, see Figure 1. The training framework is shown in the figures below.
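Before walking through the framework, here is a minimal NumPy sketch that makes the two quantization schemes from the previous part concrete; the function names and sample values are illustrative, not part of the Fluid API:

```python
import numpy as np

def min_max_quantize(x, a, b, k=8):
    """Quantize x to k bits with the min-max scheme over the range [a, b]."""
    n = 2 ** (k - 1)              # number of quantization levels, 128 for k=8
    r = np.clip(x, a, b)          # r = min(max(x, a), b)
    s = (b - a) / (n - 1)         # step size
    return np.rint((r - a) / s)   # round to the nearest integer

def max_abs_quantize(x, k=8):
    """Quantize x to k bits with the max-abs scheme."""
    n = 2 ** (k - 1)
    M = np.max(np.abs(x))         # maximum absolute value
    return np.rint(x / M * (n - 1)), M

x = np.array([-1.2, 0.3, 0.55, 0.9], dtype=np.float32)
q, M = max_abs_quantize(x)        # q = [-127., 32., 58., 95.], M = 1.2
```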
Figure 1. Forward in training with simulated quantization.
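As Figure 1 suggests, the simulated-quantization forward pass keeps the computation in floating point but passes weights and activations through a quantize/dequantize pair, so the network is trained against the rounding error it will see at inference time. A minimal sketch of this idea for a fully connected layer, reusing the max-abs scheme above (an illustration of the pattern, not the actual Fluid operators):

```python
import numpy as np

def fake_quantize(x, k=8):
    """Quantize then immediately dequantize: the result stays in float,
    but is restricted to the k-bit quantization grid."""
    n = 2 ** (k - 1)
    M = np.max(np.abs(x))
    q = np.rint(x / M * (n - 1))   # quantize: integer in [-(n-1), n-1]
    return q * M / (n - 1)         # dequantize back to float

def simulated_quant_fc(x, w):
    """Forward pass of a fully connected layer with simulated quantization."""
    return fake_quantize(x) @ fake_quantize(w)
```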
Figure 2. Equivalent forward in training with simulated quantization.
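The equivalent form in Figure 2 follows from linearity: dequantization is just a scalar multiply, so it can be moved after the matrix multiplication, letting the core computation run on the quantized integers with a single rescale at the end. A sketch of this equivalence, again with illustrative helper names:

```python
import numpy as np

def quantize(x, k=8):
    n = 2 ** (k - 1)
    M = np.max(np.abs(x))
    return np.rint(x / M * (n - 1)), M

def forward_dequant_first(x, w, k=8):
    """Dequantize both operands, then multiply (as in Figure 1)."""
    n = 2 ** (k - 1)
    qx, Mx = quantize(x, k)
    qw, Mw = quantize(w, k)
    return (qx * Mx / (n - 1)) @ (qw * Mw / (n - 1))

def forward_rescale_last(x, w, k=8):
    """Multiply the quantized integers, then rescale once (as in Figure 2)."""
    n = 2 ** (k - 1)
    qx, Mx = quantize(x, k)
    qw, Mw = quantize(w, k)
    return (qx @ qw) * (Mx * Mw / (n - 1) ** 2)

x = np.random.randn(2, 3).astype(np.float32)
w = np.random.randn(3, 4).astype(np.float32)
assert np.allclose(forward_dequant_first(x, w), forward_rescale_last(x, w))
```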
Figure 3. Backward and weight updating in training with simulated quantization.
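A common way to realize the backward pass of Figure 3 in simulated-quantization training is the straight-through estimator: rounding has zero gradient almost everywhere, so the quantize/dequantize pair is treated as the identity when back-propagating, while the update is applied to the full-precision weights. A minimal sketch under that assumption (not necessarily the exact Fluid implementation):

```python
import numpy as np

def fake_quantize(x, k=8):
    n = 2 ** (k - 1)
    M = np.max(np.abs(x))
    return np.rint(x / M * (n - 1)) * M / (n - 1)

def sgd_step(x, w, grad_y, lr=0.01):
    """One SGD step for y = fake_quantize(x) @ fake_quantize(w).

    Straight-through estimator: dL/dw is computed as if fake_quantize
    were the identity, but using the quantized tensor that actually
    appeared in the forward pass.
    """
    qx = fake_quantize(x)
    grad_w = qx.T @ grad_y      # gradient passes straight through rounding
    return w - lr * grad_w      # update the full-precision weights
```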