Created by: kexinzhao
Fixes #9625, fixes #9626.
cublasHgemm performs true FP16 computation, which is slow on non-Volta GPUs. We therefore use cublasGemmEx instead, which performs pseudo-FP16 computation: inputs and outputs are in FP16, but the accumulation is done in FP32. On Volta GPUs this path can also be accelerated by Tensor Cores.
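Roughly, the call looks like the sketch below (not the exact code in this PR; the matrix dimensions, leading dimensions, no-transpose ops, and the wrapper function name are illustrative). Note that `alpha`/`beta` are `float` because the compute type is FP32:

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: pseudo-FP16 GEMM via cublasGemmEx.
// Computes C = A * B (column-major), with C: m x n, A: m x k, B: k x n.
void GemmFp16(cublasHandle_t handle,
              const __half* A, const __half* B, __half* C,
              int m, int n, int k) {
  // Scalars are FP32 because computeType below is CUDA_R_32F.
  float alpha = 1.0f, beta = 0.0f;
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
               m, n, k,
               &alpha,
               A, CUDA_R_16F, m,   // FP16 input, lda = m
               B, CUDA_R_16F, k,   // FP16 input, ldb = k
               &beta,
               C, CUDA_R_16F, m,   // FP16 output, ldc = m
               CUDA_R_32F,         // accumulate in FP32 (pseudo FP16)
               CUBLAS_GEMM_DFALT); // let cuBLAS pick the algorithm
}
```

On Volta, Tensor Core use can additionally be requested via `cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH)` before the call.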
Benchmarking shows that GemmEx gives a significant speedup over Hgemm on both the Titan Xp and the V100.
VGG16 on ImageNet, batch size = 1; total time spent in the float16 mul op over 1000 iterations:

| GPU      | Hgemm   | GemmEx |
|----------|---------|--------|
| V100     | 1501 ms | 451 ms |
| Titan Xp | 3259 ms | 703 ms |
Tensor Core example: https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/