Created by: typhoonzero
For reference: https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
- Update the cuDNN conv backward pass for fp16
- More accurate mode for fp16 softmax (log_softmax support is still needed)
- The optimizer change supports the loss and gradient scaling in: https://github.com/PaddlePaddle/models/pull/1533
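
For readers unfamiliar with loss scaling, here is a minimal sketch of why it helps. It is not the code from this PR or from the models PR. It emulates fp16 storage with the standard-library `struct` module, and the names `to_fp16`, `backward_fp16`, and the scale value 1024 are assumptions made for illustration only.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def backward_fp16(grad, loss_scale=1.0):
    """Hypothetical helper: store a (scaled) gradient in fp16, then unscale.

    Scaling the loss by `loss_scale` scales every gradient by the same
    factor, so the multiplication here stands in for backprop on a scaled loss.
    """
    stored = to_fp16(grad * loss_scale)  # what an fp16 gradient buffer would hold
    return stored / loss_scale           # unscale in higher precision before the update


true_grad = 1e-8  # smaller than the smallest positive fp16 value (about 6e-8)

unscaled = backward_fp16(true_grad)                  # underflows to zero
scaled = backward_fp16(true_grad, loss_scale=1024.0)  # survives the fp16 round trip

print(unscaled, scaled)
```

Without scaling, the gradient is flushed to zero in fp16. With scaling, it is recovered to within fp16 rounding error after the division.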
TODO: support log_softmax and nll_loss (full training still needs to be tested)
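
One common reason to want a fused log_softmax in fp16 is that taking the log of an fp16 softmax output can underflow. The sketch below illustrates that general point and is not a statement of how this PR or Paddle implements it. fp16 is emulated with `struct`, and `to_fp16`, `softmax_then_log`, and `log_softmax` are hypothetical helper names.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def softmax_then_log(logits):
    """Store softmax probabilities in fp16, then take the log afterwards."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [to_fp16(e / total) for e in exps]  # tiny probabilities round to 0
    return [math.log(p) if p > 0.0 else float("-inf") for p in probs]


def log_softmax(logits):
    """Compute x - max - log(sum(exp(x - max))) and only then round to fp16."""
    m = max(logits)
    log_total = math.log(sum(math.exp(v - m) for v in logits))
    return [to_fp16(v - m - log_total) for v in logits]


logits = [0.0, -20.0]
naive = softmax_then_log(logits)
fused = log_softmax(logits)

# An nll_loss on class 1 would be -naive[1] (infinite) versus -fused[1] (finite).
print(naive, fused)
```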
Previous work: https://github.com/PaddlePaddle/Paddle/pull/14992