Created by: gongweibao
PR Description:
Previously, conv2d computed only in NCHW, even when the data format was NHWC: NHWC input was transformed to NCHW internally. On Tensor Cores, however, fp16 conv2d is faster in NHWC, so computing directly in NHWC improves performance when AMP is used.
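To make the layout difference concrete, here is a minimal pure-Python sketch of the NHWC to NCHW transformation described above. The helper names (`nhwc_offset`, `nchw_offset`, `nhwc_to_nchw`) are hypothetical and only illustrate the index arithmetic; they are not the code in this PR.

```python
def nhwc_offset(n, h, w, c, H, W, C):
    # Flat index of element (n, h, w, c) when channels are the innermost dimension.
    return ((n * H + h) * W + w) * C + c


def nchw_offset(n, c, h, w, C, H, W):
    # Flat index of element (n, c, h, w) when width is the innermost dimension.
    return ((n * C + c) * H + h) * W + w


def nhwc_to_nchw(buf, N, H, W, C):
    # Copy every element of a flat NHWC buffer into its NCHW position.
    out = [None] * len(buf)
    for n in range(N):
        for h in range(H):
            for w in range(W):
                for c in range(C):
                    src = nhwc_offset(n, h, w, c, H, W, C)
                    dst = nchw_offset(n, c, h, w, C, H, W)
                    out[dst] = buf[src]
    return out


# One 2x2 image with 3 channels, stored as NHWC.
nhwc = list(range(12))
nchw = nhwc_to_nchw(nhwc, N=1, H=2, W=2, C=3)
print(nchw)
```

Every element is read and written once, so the transformation is an extra pass over the whole tensor. Skipping it is part of why computing directly in NHWC can help.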
This PR implements NHWC computation when the data is fp16 and the device has Tensor Cores. The performance results are listed below (values in parentheses are speedups relative to FP32):
Model | FP32 | FP16 NCHW | FP16 NHWC |
---|---|---|---|
ResNet50 | 0.40 | 0.195(2x) | 0.16(2.2x) |
VGG16 | 0.77 | 0.55(1.4x) | 0.32(2.4x) |
Inception V4 | 0.55 | 0.30(1.83x) | 0.27(2x) |
GoogleNet | 0.12 | 0.1(1.2x) | 0.075(1.6x) |
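
The condition under which NHWC computation is chosen can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual dispatch code in this PR: the function name `choose_compute_format` and its parameters are hypothetical, it assumes the NHWC path applies only when the input is already NHWC, and it treats a CUDA compute capability of 7.0 or higher (Volta and later) as having Tensor Cores.

```python
def choose_compute_format(dtype, data_format, compute_capability):
    """Return the layout conv2d would compute in (hypothetical helper).

    dtype: "float16" or "float32"
    data_format: layout of the input tensor, "NCHW" or "NHWC"
    compute_capability: (major, minor) tuple of the CUDA device
    """
    # Tensor Cores were introduced with compute capability 7.0 (Volta).
    has_tensor_cores = compute_capability >= (7, 0)

    # Assumption: NHWC computation is used only for fp16 NHWC input on
    # Tensor Core devices; every other case falls back to NCHW.
    if data_format == "NHWC" and dtype == "float16" and has_tensor_cores:
        return "NHWC"
    return "NCHW"


print(choose_compute_format("float16", "NHWC", (7, 0)))
print(choose_compute_format("float32", "NHWC", (7, 0)))
print(choose_compute_format("float16", "NHWC", (6, 1)))
```

Under these assumptions only the first call selects NHWC; the fp32 case and the device without Tensor Cores both keep the NCHW path.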