Created by: kexinzhao
Fixes #9625, fixes #9626.
cublasHgemm performs true FP16 computation, which is slow on non-Volta GPUs. We therefore use cublasGemmEx instead, which performs pseudo-FP16 computation: inputs and outputs are in FP16, but the accumulation is done in FP32. On Volta GPUs this path can also be accelerated by Tensor Cores.
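Roughly, the call looks like the sketch below (not the exact code in this PR; the matrix dimensions, leading dimensions, no-transpose ops, and the wrapper function name are illustrative). Note that `alpha`/`beta` are `float` because the compute type is FP32:

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: pseudo-FP16 GEMM via cublasGemmEx.
// Computes C = A * B (column-major), with C: m x n, A: m x k, B: k x n.
void GemmFp16(cublasHandle_t handle,
              const __half* A, const __half* B, __half* C,
              int m, int n, int k) {
  // Scalars are FP32 because computeType below is CUDA_R_32F.
  float alpha = 1.0f, beta = 0.0f;
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
               m, n, k,
               &alpha,
               A, CUDA_R_16F, m,   // FP16 input, lda = m
               B, CUDA_R_16F, k,   // FP16 input, ldb = k
               &beta,
               C, CUDA_R_16F, m,   // FP16 output, ldc = m
               CUDA_R_32F,         // accumulate in FP32 (pseudo FP16)
               CUBLAS_GEMM_DFALT); // let cuBLAS pick the algorithm
}
```

On Volta, Tensor Core use can additionally be requested via `cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH)` before the call.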
Benchmarking shows that GemmEx gives a significant speedup over Hgemm on both the Titan Xp and the V100.
VGG16 on ImageNet, batch size = 1; total time spent in the float16 mul op over 1000 iterations:

| GPU      | Hgemm   | GemmEx |
|----------|---------|--------|
| V100     | 1501 ms | 451 ms |
| Titan Xp | 3259 ms | 703 ms |
Tensor Core example: https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/