Created by: Jie-Fang
In the original AMP, all gradients are kept in FP32. In multi-GPU training this means FP32 gradients are transferred across GPUs, which can cause noticeable communication overhead. With this AMP, FP16 gradients are obtained automatically for some ops, and those FP16 gradients can be transferred instead to reduce the overhead.
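
As a point of comparison, the sketch below shows one existing way to get a similar communication saving in PyTorch: DDP's `fp16_compress_hook` casts each gradient bucket to FP16 before the all-reduce and back to FP32 afterwards, so only FP16 tensors cross GPUs. This is not the same as producing FP16 gradients natively for some ops as described above; it only illustrates the communication-side benefit. The model, batch, and process-group setup are placeholders, not part of this proposal.

```python
# Minimal sketch, assuming PyTorch with NCCL. Only the comm-hook line is the
# point of the example; everything else is placeholder boilerplate.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

dist.init_process_group("nccl")
device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).to(device)      # placeholder model
ddp_model = DDP(model, device_ids=[device.index])

# Cast gradient buckets to FP16 for the all-reduce, then back to FP32,
# so the bytes sent over the interconnect are halved.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

scaler = torch.cuda.amp.GradScaler()
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

x = torch.randn(32, 1024, device=device)            # placeholder batch
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = ddp_model(x).sum()
scaler.scale(loss).backward()                       # gradients communicated in FP16
scaler.step(optimizer)
scaler.update()
```

By contrast, the approach described above would skip the cast-down/cast-up round trip for ops whose gradients are already FP16, communicating them directly.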