# Sync NCCL AllReduce or async
Created by: Yancey1989
## Async NCCL AllReduce may lead to worse performance
In the current implementation of PaddlePaddle multi-device training, computation and NCCL communication are placed on two separate CUDA streams, and `cudaStreamWaitEvent` is used to guarantee the correct execution order between them. However, the `cudaStreamWaitEvent` API may cost much more time than expected.
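As a reference, here is a minimal sketch of this event-based cross-stream ordering for one device. It is illustrative only and not PaddlePaddle's actual code: the function, stream, and buffer names are made up, and the cuBLAS handle, NCCL communicator, streams, and device buffers are assumed to be created elsewhere.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <nccl.h>

// Async mode: the GEMM runs on `compute_stream`, the allreduce on `nccl_stream`,
// and cudaStreamWaitEvent (not a host-side sync) orders the two streams.
void GemmThenAllReduceAsync(cublasHandle_t cublas_handle, ncclComm_t nccl_comm,
                            cudaStream_t compute_stream, cudaStream_t nccl_stream,
                            const float* d_a, const float* d_b, float* d_c, int n) {
  const float alpha = 1.0f, beta = 0.0f;

  cudaEvent_t compute_done;
  cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming);

  // 1. Launch the GEMM on the compute stream.
  cublasSetStream(cublas_handle, compute_stream);
  cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, d_a, n, d_b, n, &beta, d_c, n);

  // 2. Record an event on the compute stream and make the NCCL stream wait on
  //    it, so the allreduce only starts after the GEMM result is ready.
  cudaEventRecord(compute_done, compute_stream);
  cudaStreamWaitEvent(nccl_stream, compute_done, 0);

  // 3. Launch the allreduce on the NCCL stream; the host never blocks here.
  ncclAllReduce(d_c, d_c, static_cast<size_t>(n) * n, ncclFloat, ncclSum,
                nccl_comm, nccl_stream);

  cudaEventDestroy(compute_done);
}
```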
## Conclusions
- conclusion 1: with async NCCL, some CUDA APIs such as `cudaStreamWaitEvent` cost much more time:

  | item | async nccl | sync nccl |
  | --- | --- | --- |
  | `cudaStreamWaitEvent` (total time of all calls) | 28.468 ms | 2.3190 ms |
- conclusion 2: with async NCCL, 200 iterations of (cublasGemm + ncclAllReduce) also take more end-to-end time (see the sketch after this list):

  | item | async nccl | sync nccl |
  | --- | --- | --- |
  | (cublasGemm + ncclAllReduce) * 200 iterations | 367 ms | 342 ms |
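To make the second table concrete, here is a rough per-device sketch of the kind of loop being timed, in its sync-mode form; the wall time is taken with `std::chrono`, matching the "200 iterations spent" lines in the logs below. This is an assumption about the shape of the benchmark rather than the actual test code (which is in the gist linked in the next section); in particular it assumes that "sync" means a host-side `cudaStreamSynchronize` between the GEMM and the allreduce instead of the cross-stream event wait shown above.

```cpp
#include <chrono>
#include <iostream>

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical sync-mode benchmark loop for one device: the host blocks on the
// compute stream before issuing the allreduce. Handles, streams, and device
// buffers are assumed to be initialized elsewhere.
void BenchmarkSyncMode(cublasHandle_t cublas_handle, ncclComm_t nccl_comm,
                       cudaStream_t compute_stream, cudaStream_t nccl_stream,
                       const float* d_a, const float* d_b, float* d_c, int n) {
  const float alpha = 1.0f, beta = 0.0f;
  cublasSetStream(cublas_handle, compute_stream);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < 200; ++i) {
    cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_a, n, d_b, n, &beta, d_c, n);
    // Host-side synchronization replaces the cross-stream event wait.
    cudaStreamSynchronize(compute_stream);
    ncclAllReduce(d_c, d_c, static_cast<size_t>(n) * n, ncclFloat, ncclSum,
                  nccl_comm, nccl_stream);
    cudaStreamSynchronize(nccl_stream);
  }
  auto end = std::chrono::steady_clock::now();
  std::cout << "200 iterations spent: ["
            << std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
                   .count()
            << " ms]" << std::endl;
}
```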
## Reproducing the experiment
- test code: https://gist.github.com/Yancey1989/4fa7b6bee1b9697ee651e25fc2fc5ab0
- test env:
  - devices: 2 * V100
  - develop docker image: `paddlepaddle/paddle_manylinux_devel:cuda9.0_cudnn7`
  - test tools: nvprof (GPU), `std::chrono` (CPU)
- slices of the nvprof logs:

async nccl:
```
device: [0], 200 iterations spent: [367 ms]
device: [1], 200 iterations spent: [367 ms]
nccl mode: [0], spent: [372 ms]
==106784== Profiling application: ./a.out
==106784== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   73.11%  515.54ms       400  1.2889ms  1.2077ms  1.4020ms  volta_sgemm_128x128_nn
                   24.16%  170.34ms       400  425.86us  387.36us  1.2963ms  ncclAllReduceKernel_sum_f32(ncclColl)
                    2.61%  18.433ms        18  1.0241ms  1.6960us  3.7585ms  [CUDA memcpy HtoD]
                    0.11%  791.89us       426  1.8580us  1.5360us  4.0640us  [CUDA memset]
      API calls:   39.30%  1.22022s        18  67.790ms  11.755us  603.66ms  cudaMemcpy
                   24.27%  753.44ms         4  188.36ms  60.561us  453.07ms  cudaStreamCreate
                   20.29%  629.87ms         2  314.94ms  314.93ms  314.94ms  cudaStreamSynchronize
                   11.89%  369.18ms        16  23.074ms  3.7290us  181.27ms  cudaFree
                    0.98%  30.376ms        38  799.36us  26.065us  1.7913ms  cudaMalloc
                    0.92%  28.468ms      1598  17.814us  1.2410us  92.921us  cudaStreamWaitEvent
                    0.61%  19.092ms       200  95.460us  25.119us  179.81us  cudaLaunchCooperativeKernelMultiDevice
                    0.48%  15.034ms       400  37.584us  12.078us  65.239us  cudaMemsetAsync
                    0.32%  9.8792ms         8  1.2349ms  24.633us  5.0830ms  cudaHostAlloc
                    0.23%  7.0656ms       552  12.800us     225ns  1.4237ms  cuDeviceGetAttribute
                    0.20%  6.2121ms      2000  3.1060us     778ns  32.559us  cudaEventRecord
                    0.12%  3.8755ms       400  9.6880us  7.7120us  47.937us  cudaLaunch
                    0.09%  2.9464ms        16  184.15us  4.0510us  1.4329ms  cudaDeviceEnablePeerAccess
                    0.09%  2.7151ms      9600     282ns     200ns  10.445us  cudaSetupArgument
                    0.05%  1.4069ms         6  234.49us  228.19us  238.98us  cuDeviceTotalMem
                    0.05%  1.4013ms       834  1.6800us     758ns  248.82us  cudaEventCreateWithFlags
```
sync nccl:
```
device: [1], 200 iterations spent: [342 ms]
device: [0], 200 iterations spent: [342 ms]
nccl mode: [1], spent: [346 ms]
==106738== Profiling application: ./a.out
==106738== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   73.05%  498.74ms       400  1.2469ms  1.1381ms  1.4009ms  volta_sgemm_128x128_nn
                   23.52%  160.60ms       400  401.51us  385.34us  520.86us  ncclAllReduceKernel_sum_f32(ncclColl)
                    3.32%  22.656ms        18  1.2586ms  1.6960us  3.8746ms  [CUDA memcpy HtoD]
                    0.11%  718.69us       426  1.6870us  1.5360us  2.2720us  [CUDA memset]
      API calls:   36.41%  1.36380s         4  340.95ms  73.252us  729.21ms  cudaStreamCreate
                   34.25%  1.28301s        18  71.278ms  12.831us  642.06ms  cudaMemcpy
                   17.26%  646.56ms       402  1.6084ms  4.1910us  1.7638ms  cudaStreamSynchronize
                   10.05%  376.60ms        16  23.537ms  4.0230us  190.75ms  cudaFree
                    0.75%  27.921ms        38  734.76us  47.780us  1.9109ms  cudaMalloc
                    0.22%  8.2958ms         8  1.0370ms  57.985us  4.0407ms  cudaHostAlloc
                    0.16%  5.8558ms       552  10.608us     227ns  408.19us  cuDeviceGetAttribute
                    0.14%  5.1147ms       400  12.786us  9.8410us  46.193us  cudaMemsetAsync
                    0.12%  4.6512ms      9600     484ns     201ns  692.95us  cudaSetupArgument
                    0.11%  4.1579ms       200  20.789us  17.218us  58.022us  cudaLaunchCooperativeKernelMultiDevice
                    0.10%  3.7761ms       400  9.4400us  7.2630us  42.761us  cudaLaunch
                    0.08%  2.8901ms      2000  1.4450us     858ns  8.8220us  cudaEventRecord
                    0.07%  2.7301ms        16  170.63us  3.0050us  1.7013ms  cudaDeviceEnablePeerAccess
                    0.06%  2.3190ms      1598  1.4510us     931ns  6.9540us  cudaStreamWaitEvent
```