Created by: wangxicoding
Compresses fp32 gradients to fp16 for communication, reducing communication volume and bandwidth usage. Suitable for cards that do not support Tensor Cores, such as the P4.
This PR is the second version of fp16 compression. For the previous version, see https://github.com/PaddlePaddle/Paddle/pull/22434 . Version 2 may be better than version 1.
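
To illustrate the idea of compressing fp32 gradients to fp16 for communication, here is a minimal NumPy sketch. It is not the code in this PR and does not use any Paddle API: the function names (`compress_fp16`, `decompress_fp32`, `allreduce_with_fp16_compression`) are hypothetical, and the allreduce is simulated by summing arrays in-process rather than calling a real collective.

```python
import numpy as np


def compress_fp16(grad):
    """Cast an fp32 gradient to fp16 before it is sent (hypothetical helper)."""
    return grad.astype(np.float16)


def decompress_fp32(grad):
    """Cast a received fp16 gradient back to fp32 for the optimizer update."""
    return grad.astype(np.float32)


def allreduce_with_fp16_compression(worker_grads):
    """Simulate a sum-allreduce over per-worker fp32 gradients.

    Assumption: the collective reduces the fp16 payloads directly, so the
    sum here is accumulated in fp16. A real implementation would call a
    communication library instead of np.sum.
    """
    compressed = [compress_fp16(g) for g in worker_grads]
    reduced = np.sum(np.stack(compressed), axis=0, dtype=np.float16)
    return decompress_fp32(reduced)


if __name__ == "__main__":
    grads = [
        np.array([0.5, 1.25, -2.0, 3.0], dtype=np.float32),
        np.array([1.5, 0.25, 4.0, -1.0], dtype=np.float32),
    ]
    summed = allreduce_with_fp16_compression(grads)
    # The fp16 payload is half the size of the fp32 gradient.
    print(grads[0].nbytes, compress_fp16(grads[0]).nbytes)
    print(summed.dtype, summed)
```

The payload per gradient is halved, at the cost of fp16's reduced range and precision: values outside roughly ±65504 overflow to infinity, and very small gradients may underflow to zero.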