Created by: wangxicoding
Compress fp32 gradients to fp16 before communication, which halves the size of the data transferred and reduces bandwidth usage. Suitable for GPUs that do not support Tensor Cores, such as the P4.
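
The idea can be illustrated with a small, framework-independent sketch. It is not the code in this PR: the function names (`compress_fp16`, `compress_fp32`, `decompress_fp16`, `simulated_allreduce_fp16`) are hypothetical, the "workers" are plain Python lists, and the allreduce is simulated in a single process using the standard-library `struct` module, whose `e` format is IEEE 754 half precision.

```python
import struct


def compress_fp16(grad):
    """Pack a list of floats as little-endian fp16 (2 bytes per element)."""
    return struct.pack(f"<{len(grad)}e", *grad)


def compress_fp32(grad):
    """Pack a list of floats as little-endian fp32 (4 bytes per element)."""
    return struct.pack(f"<{len(grad)}f", *grad)


def decompress_fp16(payload):
    """Unpack fp16 bytes back into Python floats."""
    count = len(payload) // 2
    return list(struct.unpack(f"<{count}e", payload))


def simulated_allreduce_fp16(worker_grads):
    """Sum gradients across simulated workers, exchanging only fp16 payloads.

    Assumption: a real allreduce runs across devices; here every "worker"
    is just a list in the same process, so only the data flow is modeled.
    Returns the summed gradient and the total number of bytes "sent".
    """
    # Each worker compresses its gradient before it leaves the device.
    payloads = [compress_fp16(g) for g in worker_grads]
    decoded = [decompress_fp16(p) for p in payloads]
    # Element-wise sum, rounded back through fp16 to mimic an fp16 reduction.
    summed = [sum(values) for values in zip(*decoded)]
    summed = decompress_fp16(compress_fp16(summed))
    return summed, sum(len(p) for p in payloads)


if __name__ == "__main__":
    grads = [[0.5, -1.25, 2.0], [0.25, 0.75, -1.0]]
    total, sent = simulated_allreduce_fp16(grads)
    fp32_bytes = sum(len(compress_fp32(g)) for g in grads)
    print(total, sent, fp32_bytes)
```

The example values are chosen to be exactly representable in fp16, so the sum is exact. In general, casting to fp16 loses precision and narrows the representable range, which is the trade-off for the smaller payload.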