Speed up the ResNet model's throughput on FP16
Created by: gongweibao
Target:
ResNet50, batch size 128 or 256: 1,400 images/s per V100 GPU card and 10,000+ images/s per V100 machine.
TODO
- 1. NHWC. According to https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/?linkId=100000002357715, "the tensors operated on by Tensor Cores should be in a channel-interleaved data layout in memory (Number-Height-Width-Channel, often called NHWC) in order to get the best performance". Ops that need NHWC support (a layout-selection sketch follows this list):
- batchnorm
- conv
- pool
- other (to be confirmed):
  - pool_with_index
  - conv_transpose
  - prior_box
  - shuffle_channel
  - affine_channel
  - MaxOutFunctor
- 2. elementwise. According to the nvprof results below, the elementwise (add, mul) op kernels should be sped up (a fusion sketch follows this list):
```
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   14.04%  5.32175s  45200  117.74us  2.2390us  598.30us  void nchwToNhwcKernel<__half, __half, float, bool=1, bool=0>(int, int, int, int, __half const *, __half*, float, float)
                   12.15%  4.60367s   6800  677.01us  62.463us  3.0237ms  void cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1>(float, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1>, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1>, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1>, cudnnTensorStruct, __half const *, float, __half const , float, cudnnTensorStruct*, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1> const *, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1>*, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1> const *, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1> const , cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1> const , cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1>)
                    7.39%  2.80034s   6400  437.55us  114.91us  1.1186ms  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16, int=1, int=1, long>, int=0, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<paddle::platform::float16 const , paddle::platform::float16 const >, Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(paddle::platform::float16, int=1)
                    6.35%  2.40750s   9800  245.66us  30.688us  1.1385ms  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16, int=1, int=1, long>, int=0, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<paddle::platform::float16 const , paddle::platform::float16 const >, Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const , Eigen::TensorConversionOp<paddle::platform::float16, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_cmp_op<paddle::platform::float16 const , paddle::platform::float16 const , Eigen::internal::ComparisonName>, Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const , Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<paddle::platform::float16 const >, Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const > const > const > const > const > const , Eigen::GpuDevice>, long>(paddle::platform::float16, int=1)
                    5.68%  2.15245s   5000  430.49us  123.65us  1.3419ms  void cudnn::detail::bn_fw_tr_1C11_kernel_NCHW<__half, float, int=512, bool=1, int=1>(cudnnTensorStruct, __half const *, cudnn::detail::bn_fw_tr_1C11_kernel_NCHW<__half, float, int=512, bool=1, int=1>, cudnnTensorStruct*, float const *, float const , cudnnTensorStruct*, cudnnTensorStruct*, cudnnTensorStruct**, float const *, float const *, float const *, cudnnTensorStruct*, cudnnTensorStruct*)
                    5.34%  2.02383s   6200  326.42us  204.51us  646.55us  volta_fp16_s884cudnn_fp16_128x128_ldg8_dgrad_f2f_exp_interior_nhwc2nchw_tt_v1
                    4.65%  1.76357s   6600  267.21us  179.52us  392.35us  volta_s884cudnn_fp16_128x128_ldg8_wgrad_idx_exp_interior_nhwc_nt
                    4.46%  1.69164s   9800  172.62us  19.872us  785.18us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16, int=1, int=1, long>, int=0, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_max_op<paddle::platform::float16 const , paddle::platform::float16 const >, Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const , Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<paddle::platform::float16 const >, Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const > const > const > const , Eigen::GpuDevice>, long>(paddle::platform::float16, int=1)
                    3.83%  1.45276s   3800  382.30us  258.43us  592.63us  volta_s884cudnn_fp16_64x64_sliced1x4_ldg8_wgrad_idx_exp_interior_nhwc_nt
                    3.47%  1.31410s   3200  410.66us  218.59us  576.73us  volta_fp16_s884cudnn_fp16_256x128_ldg8_relu_filter1x1_stg8_interior_nchw_nn_v1
```
- 3. Move cudnn cache to global scope
- 4. Add DALI support for data preprocessing, exposed as a Paddle iterator (a pipeline sketch follows this list).
- 5. Verify the FP16 optimizer's (AMP) functionality (a usage sketch follows this list).
- 6. Check cuDNN exhaustive search on FP16 (a flag sketch follows this list).
- 7. Assert that all ops use the same layout (such as NHWC) at compile time (a checker sketch follows this list).
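
For item 1, a minimal sketch of selecting NHWC from the Python side. The `data_format`/`data_layout` argument names are assumptions about the fluid API version in use, and the C++ kernels must still honor them:

```python
import paddle.fluid as fluid

def conv_bn_pool_nhwc(image):
    # Assumption: data_format/data_layout pick the channel-last layout
    # that Tensor Cores prefer for FP16 convolutions.
    conv = fluid.layers.conv2d(
        input=image, num_filters=64, filter_size=7, stride=2, padding=3,
        data_format='NHWC')
    bn = fluid.layers.batch_norm(input=conv, act='relu', data_layout='NHWC')
    return fluid.layers.pool2d(input=bn, pool_size=3, pool_stride=2,
                               pool_type='max', data_format='NHWC')
```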
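
For item 2, note first that the largest single entry in the profile, nchwToNhwcKernel, is an NCHW-to-NHWC transpose that the NHWC work in item 1 should remove. Beyond that, one way to cut elementwise kernel launches is to fuse elementwise_add with the following activation (the residual add + relu in every ResNet block). A sketch using fluid's BuildStrategy, assuming the fuse_elewise_add_act_ops option is available in the build under test; `train_program` and `avg_loss` are placeholders from the surrounding training script:

```python
import paddle.fluid as fluid

build_strategy = fluid.BuildStrategy()
# Fuse elementwise_add + activation pairs into a single kernel launch.
build_strategy.fuse_elewise_add_act_ops = True

# `train_program` and `avg_loss` come from the surrounding training script.
compiled_program = fluid.CompiledProgram(train_program).with_data_parallel(
    loss_name=avg_loss.name, build_strategy=build_strategy)
```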
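
For item 4, a hedged sketch of a DALI pipeline feeding Paddle. The operator names (FileReader, ImageDecoder, CropMirrorNormalize) and the nvidia.dali.plugin.paddle iterator are assumptions about the installed DALI version, and the data path is a placeholder:

```python
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.paddle import DALIGenericIterator

class ResNetTrainPipeline(Pipeline):
    def __init__(self, data_dir, batch_size=128, num_threads=4, device_id=0):
        super(ResNetTrainPipeline, self).__init__(batch_size, num_threads,
                                                  device_id)
        self.reader = ops.FileReader(file_root=data_dir, random_shuffle=True)
        # Decode JPEGs with GPU assistance, then crop and normalize straight
        # into FP16 NHWC so no extra cast/transpose runs in the main graph.
        self.decode = ops.ImageDecoder(device='mixed', output_type=types.RGB)
        self.crop_norm = ops.CropMirrorNormalize(
            device='gpu', crop=(224, 224),
            output_dtype=types.FLOAT16, output_layout=types.NHWC,
            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
            std=[0.229 * 255, 0.224 * 255, 0.225 * 255])

    def define_graph(self):
        jpegs, labels = self.reader(name='Reader')
        return self.crop_norm(self.decode(jpegs)), labels

pipe = ResNetTrainPipeline('/path/to/train')  # placeholder path
pipe.build()
train_iter = DALIGenericIterator(pipe, ['image', 'label'],
                                 size=pipe.epoch_size('Reader'))
```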
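
For item 5, a minimal sketch of exercising the FP16 optimizer path through fluid's AMP decorator, assuming paddle.fluid.contrib.mixed_precision is present in the build; `avg_loss` is the loss variable of the surrounding ResNet program:

```python
import paddle.fluid as fluid
from paddle.fluid.contrib.mixed_precision import decorate

optimizer = fluid.optimizer.Momentum(learning_rate=0.1, momentum=0.9)
# Run forward/backward in FP16 where safe; dynamic loss scaling guards
# against underflow in the FP16 gradients.
mp_optimizer = decorate(optimizer, init_loss_scaling=128.0,
                        use_dynamic_loss_scaling=True)
mp_optimizer.minimize(avg_loss)  # avg_loss: placeholder loss variable
```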
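
For item 6, exhaustive algorithm search is toggled by a process-level gflag; a sketch, assuming the FLAGS_cudnn_exhaustive_search flag exists in the build:

```python
import os

# Must be set before the CUDA device context initializes, so that cuDNN
# benchmarks every conv algorithm and caches the fastest one per shape.
os.environ['FLAGS_cudnn_exhaustive_search'] = '1'

import paddle.fluid as fluid  # noqa: E402
```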
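
For item 7, a hypothetical compile-time check that walks the program desc and asserts every layout-carrying op agrees. The attribute names data_format/data_layout and the has_attr/attr accessors are assumptions about the fluid framework API:

```python
def assert_uniform_layout(program, expected='NHWC'):
    """Fail fast if any op in `program` declares a layout != `expected`."""
    for block in program.blocks:
        for op in block.ops:
            # Assumed attribute names; different ops may use either one.
            for attr_name in ('data_format', 'data_layout'):
                if op.has_attr(attr_name):
                    layout = op.attr(attr_name)
                    assert layout == expected, (
                        'op %s uses layout %s, expected %s'
                        % (op.type, layout, expected))

# Usage: assert_uniform_layout(fluid.default_main_program())
```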