# Sync NCCL AllReduce or async
Created by: Yancey1989
## Async NCCL AllReduce may lead to worse performance
In the current implementation of PaddlePaddle multi-device training, computation and NCCL communication are placed on two separate CUDA streams, and `cudaStreamWaitEvent` is used to guarantee the correct execution order between them. However, the `cudaStreamWaitEvent` API may cost much more time than expected.
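As a reference, here is a minimal sketch of this event-based cross-stream ordering for one device. It is illustrative only and not PaddlePaddle's actual code: the function, stream, and buffer names are made up, and the cuBLAS handle, NCCL communicator, streams, and device buffers are assumed to be created elsewhere.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <nccl.h>

// Async mode: the GEMM runs on `compute_stream`, the allreduce on `nccl_stream`,
// and cudaStreamWaitEvent (not a host-side sync) orders the two streams.
void GemmThenAllReduceAsync(cublasHandle_t cublas_handle, ncclComm_t nccl_comm,
                            cudaStream_t compute_stream, cudaStream_t nccl_stream,
                            const float* d_a, const float* d_b, float* d_c, int n) {
  const float alpha = 1.0f, beta = 0.0f;

  cudaEvent_t compute_done;
  cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming);

  // 1. Launch the GEMM on the compute stream.
  cublasSetStream(cublas_handle, compute_stream);
  cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, d_a, n, d_b, n, &beta, d_c, n);

  // 2. Record an event on the compute stream and make the NCCL stream wait on
  //    it, so the allreduce only starts after the GEMM result is ready.
  cudaEventRecord(compute_done, compute_stream);
  cudaStreamWaitEvent(nccl_stream, compute_done, 0);

  // 3. Launch the allreduce on the NCCL stream; the host never blocks here.
  ncclAllReduce(d_c, d_c, static_cast<size_t>(n) * n, ncclFloat, ncclSum,
                nccl_comm, nccl_stream);

  cudaEventDestroy(compute_done);
}
```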
## Conclusions
- conclusion 1: with async NCCL, some CUDA APIs such as `cudaStreamWaitEvent` cost much more time:

  | item | async nccl | sync nccl |
  | --- | --- | --- |
  | `cudaStreamWaitEvent` (total time of all calls) | 28.468 ms | 2.3190 ms |
- conclusion 2: with async NCCL, 200 iterations of (cublasGemm + ncclAllReduce) also take more end-to-end time (see the sketch after this list):

  | item | async nccl | sync nccl |
  | --- | --- | --- |
  | (cublasGemm + ncclAllReduce) * 200 iterations | 367 ms | 342 ms |
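To make the second table concrete, here is a rough per-device sketch of the kind of loop being timed, in its sync-mode form; the wall time is taken with `std::chrono`, matching the "200 iterations spent" lines in the logs below. This is an assumption about the shape of the benchmark rather than the actual test code (which is in the gist linked in the next section); in particular it assumes that "sync" means a host-side `cudaStreamSynchronize` between the GEMM and the allreduce instead of the cross-stream event wait shown above.

```cpp
#include <chrono>
#include <iostream>

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical sync-mode benchmark loop for one device: the host blocks on the
// compute stream before issuing the allreduce. Handles, streams, and device
// buffers are assumed to be initialized elsewhere.
void BenchmarkSyncMode(cublasHandle_t cublas_handle, ncclComm_t nccl_comm,
                       cudaStream_t compute_stream, cudaStream_t nccl_stream,
                       const float* d_a, const float* d_b, float* d_c, int n) {
  const float alpha = 1.0f, beta = 0.0f;
  cublasSetStream(cublas_handle, compute_stream);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < 200; ++i) {
    cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_a, n, d_b, n, &beta, d_c, n);
    // Host-side synchronization replaces the cross-stream event wait.
    cudaStreamSynchronize(compute_stream);
    ncclAllReduce(d_c, d_c, static_cast<size_t>(n) * n, ncclFloat, ncclSum,
                  nccl_comm, nccl_stream);
    cudaStreamSynchronize(nccl_stream);
  }
  auto end = std::chrono::steady_clock::now();
  std::cout << "200 iterations spent: ["
            << std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
                   .count()
            << " ms]" << std::endl;
}
```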
## Reproducing the experiment
- test code: https://gist.github.com/Yancey1989/4fa7b6bee1b9697ee651e25fc2fc5ab0
- test env:
  - devices: 2 * V100
  - develop docker image: `paddlepaddle/paddle_manylinux_devel:cuda9.0_cudnn7`
  - test tools: nvprof (GPU), `std::chrono` (CPU)
- slices of the nvprof logs:

async nccl:
```
device: [0], 200 iterations spent: [367 ms]
device: [1], 200 iterations spent: [367 ms]
nccl mode: [0], spent: [372 ms]
==106784== Profiling application: ./a.out
==106784== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   73.11%  515.54ms       400  1.2889ms  1.2077ms  1.4020ms  volta_sgemm_128x128_nn
                   24.16%  170.34ms       400  425.86us  387.36us  1.2963ms  ncclAllReduceKernel_sum_f32(ncclColl)
                    2.61%  18.433ms        18  1.0241ms  1.6960us  3.7585ms  [CUDA memcpy HtoD]
                    0.11%  791.89us       426  1.8580us  1.5360us  4.0640us  [CUDA memset]
      API calls:   39.30%  1.22022s        18  67.790ms  11.755us  603.66ms  cudaMemcpy
                   24.27%  753.44ms         4  188.36ms  60.561us  453.07ms  cudaStreamCreate
                   20.29%  629.87ms         2  314.94ms  314.93ms  314.94ms  cudaStreamSynchronize
                   11.89%  369.18ms        16  23.074ms  3.7290us  181.27ms  cudaFree
                    0.98%  30.376ms        38  799.36us  26.065us  1.7913ms  cudaMalloc
                    0.92%  28.468ms      1598  17.814us  1.2410us  92.921us  cudaStreamWaitEvent
                    0.61%  19.092ms       200  95.460us  25.119us  179.81us  cudaLaunchCooperativeKernelMultiDevice
                    0.48%  15.034ms       400  37.584us  12.078us  65.239us  cudaMemsetAsync
                    0.32%  9.8792ms         8  1.2349ms  24.633us  5.0830ms  cudaHostAlloc
                    0.23%  7.0656ms       552  12.800us     225ns  1.4237ms  cuDeviceGetAttribute
                    0.20%  6.2121ms      2000  3.1060us     778ns  32.559us  cudaEventRecord
                    0.12%  3.8755ms       400  9.6880us  7.7120us  47.937us  cudaLaunch
                    0.09%  2.9464ms        16  184.15us  4.0510us  1.4329ms  cudaDeviceEnablePeerAccess
                    0.09%  2.7151ms      9600     282ns     200ns  10.445us  cudaSetupArgument
                    0.05%  1.4069ms         6  234.49us  228.19us  238.98us  cuDeviceTotalMem
                    0.05%  1.4013ms       834  1.6800us     758ns  248.82us  cudaEventCreateWithFlags
```
sync nccl:
```
device: [1], 200 iterations spent: [342 ms]
device: [0], 200 iterations spent: [342 ms]
nccl mode: [1], spent: [346 ms]
==106738== Profiling application: ./a.out
==106738== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   73.05%  498.74ms       400  1.2469ms  1.1381ms  1.4009ms  volta_sgemm_128x128_nn
                   23.52%  160.60ms       400  401.51us  385.34us  520.86us  ncclAllReduceKernel_sum_f32(ncclColl)
                    3.32%  22.656ms        18  1.2586ms  1.6960us  3.8746ms  [CUDA memcpy HtoD]
                    0.11%  718.69us       426  1.6870us  1.5360us  2.2720us  [CUDA memset]
      API calls:   36.41%  1.36380s         4  340.95ms  73.252us  729.21ms  cudaStreamCreate
                   34.25%  1.28301s        18  71.278ms  12.831us  642.06ms  cudaMemcpy
                   17.26%  646.56ms       402  1.6084ms  4.1910us  1.7638ms  cudaStreamSynchronize
                   10.05%  376.60ms        16  23.537ms  4.0230us  190.75ms  cudaFree
                    0.75%  27.921ms        38  734.76us  47.780us  1.9109ms  cudaMalloc
                    0.22%  8.2958ms         8  1.0370ms  57.985us  4.0407ms  cudaHostAlloc
                    0.16%  5.8558ms       552  10.608us     227ns  408.19us  cuDeviceGetAttribute
                    0.14%  5.1147ms       400  12.786us  9.8410us  46.193us  cudaMemsetAsync
                    0.12%  4.6512ms      9600     484ns     201ns  692.95us  cudaSetupArgument
                    0.11%  4.1579ms       200  20.789us  17.218us  58.022us  cudaLaunchCooperativeKernelMultiDevice
                    0.10%  3.7761ms       400  9.4400us  7.2630us  42.761us  cudaLaunch
                    0.08%  2.8901ms      2000  1.4450us     858ns  8.8220us  cudaEventRecord
                    0.07%  2.7301ms        16  170.63us  3.0050us  1.7013ms  cudaDeviceEnablePeerAccess
                    0.06%  2.3190ms      1598  1.4510us     931ns  6.9540us  cudaStreamWaitEvent
```