Created by: kexinzhao
Currently, the default example code uses the non-cuDNN softmax op to generate the prediction result. We therefore need to add float16 support to the non-cuDNN softmax op on GPU, in addition to the cuDNN one.
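
For reference, here is a minimal NumPy sketch of what float16 support for a softmax could look like numerically. It is not the Paddle kernel: the function name `softmax_fp16` is hypothetical, and the choice to do the intermediate math in float32 is an assumption (a common way to avoid overflow in `exp` and precision loss in the sum), not something this issue specifies.

```python
import numpy as np


def softmax_fp16(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax taking and returning float16 (hypothetical helper).

    Assumption: intermediate values are computed in float32, then the
    result is cast back to float16.
    """
    # Upcast so that exp() and the row sum are computed in float32.
    x = logits.astype(np.float32)
    # Subtract the row max for numerical stability (largest exponent is 0).
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    probs = e / e.sum(axis=-1, keepdims=True)
    # Cast back to the float16 storage type.
    return probs.astype(np.float16)


if __name__ == "__main__":
    batch = np.array([[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]], dtype=np.float16)
    print(softmax_fp16(batch))
```

A GPU kernel would presumably follow the same pattern per row (load half values, accumulate in float, store half), but that mapping is an assumption here.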