PaddlePaddle / Paddle · Issue #15049
Opened December 25, 2018 by saxon_zh (Guest)

sync nccl allreduce or async

Created by: Yancey1989

Async NCCL AllReduce may lead to worse performance

In the current implementation of PaddlePaddle multi-device training, computation and NCCL communication are placed on two separate CUDA streams, with cudaStreamWaitEvent used to ensure the correct execution order. However, the cudaStreamWaitEvent API may cost much more time than expected:
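The two-stream ordering described above can be sketched roughly as follows. This is a minimal illustration, not Paddle's actual code; the handle, stream, event, and buffer names are assumptions, and declarations/setup are omitted:

```cuda
// Assumed already created: cublasHandle_t handle; cudaStream_t compute_stream,
// comm_stream; ncclComm_t comm; float *d_A, *d_B, *d_C; int n; float alpha, beta.
cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

// 1. Launch the compute kernel (e.g. a GEMM) on the compute stream.
cublasSetStream(handle, compute_stream);
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
            &alpha, d_A, n, d_B, n, &beta, d_C, n);

// 2. Record an event on the compute stream after the kernel.
cudaEventRecord(done, compute_stream);

// 3. Make the communication stream wait on that event before the AllReduce.
//    This cudaStreamWaitEvent call is the API whose host-side cost blows up
//    in the async mode in the measurements below.
cudaStreamWaitEvent(comm_stream, done, 0);
ncclAllReduce(d_C, d_C, (size_t)n * n, ncclFloat, ncclSum, comm, comm_stream);
```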

Conclusions

  • conclusion 1: the async NCCL mode spends far more time in some API calls, such as cudaStreamWaitEvent:

    | item | async nccl | sync nccl |
    | --- | --- | --- |
    | cudaStreamWaitEvent, total call time | 28.468 ms | 2.3190 ms |
  • conclusion 2: the async NCCL mode also costs more end-to-end for the same sequence of kernel launches:

    | item | async nccl | sync nccl |
    | --- | --- | --- |
    | (cublasGemm + ncclAllreduce) × 200 iterations | 367 ms | 342 ms |
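For contrast, the "sync" mode can be sketched as issuing the AllReduce on the compute stream itself, so ordering is implicit and the per-iteration cudaStreamWaitEvent disappears. Again a rough sketch under the same assumed declarations, not the actual test code (that is in the gist linked below):

```cuda
// Sync mode: communication shares the compute stream, so in-stream ordering
// already guarantees the GEMM output is ready before the AllReduce.
cublasSetStream(handle, stream);
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
            &alpha, d_A, n, d_B, n, &beta, d_C, n);
ncclAllReduce(d_C, d_C, (size_t)n * n, ncclFloat, ncclSum, comm, stream);
// Trade-off: compute and communication can no longer overlap on the device,
// but the event record/wait overhead on the host is gone.
```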

Reproducing the experiment

  • test code: https://gist.github.com/Yancey1989/4fa7b6bee1b9697ee651e25fc2fc5ab0

  • test env:

    • devices: 2*V100
    • develop docker image: paddlepaddle/paddle_manylinux_devel:cuda9.0_cudnn7
    • test tool: nvprof (GPU), std::chrono (CPU)
  • slice of the nvprof logs

    • async nccl

    ```
    device: [0], 200 iterations spent: [367 ms]
    device: [1], 200 iterations spent: [367 ms]
    nccl mode: [0], spent: [372 ms]
    ==106784== Profiling application: ./a.out
    ==106784== Profiling result:
                Type  Time(%)      Time     Calls       Avg       Min       Max  Name
    GPU activities:   73.11%  515.54ms       400  1.2889ms  1.2077ms  1.4020ms  volta_sgemm_128x128_nn
                      24.16%  170.34ms       400  425.86us  387.36us  1.2963ms  ncclAllReduceKernel_sum_f32(ncclColl)
                       2.61%  18.433ms        18  1.0241ms  1.6960us  3.7585ms  [CUDA memcpy HtoD]
                       0.11%  791.89us       426  1.8580us  1.5360us  4.0640us  [CUDA memset]
          API calls:   39.30%  1.22022s        18  67.790ms  11.755us  603.66ms  cudaMemcpy
                      24.27%  753.44ms         4  188.36ms  60.561us  453.07ms  cudaStreamCreate
                      20.29%  629.87ms         2  314.94ms  314.93ms  314.94ms  cudaStreamSynchronize
                      11.89%  369.18ms        16  23.074ms  3.7290us  181.27ms  cudaFree
                       0.98%  30.376ms        38  799.36us  26.065us  1.7913ms  cudaMalloc
                       0.92%  28.468ms      1598  17.814us  1.2410us  92.921us  cudaStreamWaitEvent
                       0.61%  19.092ms       200  95.460us  25.119us  179.81us  cudaLaunchCooperativeKernelMultiDevice
                       0.48%  15.034ms       400  37.584us  12.078us  65.239us  cudaMemsetAsync
                       0.32%  9.8792ms         8  1.2349ms  24.633us  5.0830ms  cudaHostAlloc
                       0.23%  7.0656ms       552  12.800us     225ns  1.4237ms  cuDeviceGetAttribute
                       0.20%  6.2121ms      2000  3.1060us     778ns  32.559us  cudaEventRecord
                       0.12%  3.8755ms       400  9.6880us  7.7120us  47.937us  cudaLaunch
                       0.09%  2.9464ms        16  184.15us  4.0510us  1.4329ms  cudaDeviceEnablePeerAccess
                       0.09%  2.7151ms      9600     282ns     200ns  10.445us  cudaSetupArgument
                       0.05%  1.4069ms         6  234.49us  228.19us  238.98us  cuDeviceTotalMem
                       0.05%  1.4013ms       834  1.6800us     758ns  248.82us  cudaEventCreateWithFlags
    ```
    • sync nccl

    ```
    device: [1], 200 iterations spent: [342 ms]
    device: [0], 200 iterations spent: [342 ms]
    nccl mode: [1], spent: [346 ms]
    ==106738== Profiling application: ./a.out
    ==106738== Profiling result:
                Type  Time(%)      Time     Calls       Avg       Min       Max  Name
    GPU activities:   73.05%  498.74ms       400  1.2469ms  1.1381ms  1.4009ms  volta_sgemm_128x128_nn
                      23.52%  160.60ms       400  401.51us  385.34us  520.86us  ncclAllReduceKernel_sum_f32(ncclColl)
                       3.32%  22.656ms        18  1.2586ms  1.6960us  3.8746ms  [CUDA memcpy HtoD]
                       0.11%  718.69us       426  1.6870us  1.5360us  2.2720us  [CUDA memset]
          API calls:   36.41%  1.36380s         4  340.95ms  73.252us  729.21ms  cudaStreamCreate
                      34.25%  1.28301s        18  71.278ms  12.831us  642.06ms  cudaMemcpy
                      17.26%  646.56ms       402  1.6084ms  4.1910us  1.7638ms  cudaStreamSynchronize
                      10.05%  376.60ms        16  23.537ms  4.0230us  190.75ms  cudaFree
                       0.75%  27.921ms        38  734.76us  47.780us  1.9109ms  cudaMalloc
                       0.22%  8.2958ms         8  1.0370ms  57.985us  4.0407ms  cudaHostAlloc
                       0.16%  5.8558ms       552  10.608us     227ns  408.19us  cuDeviceGetAttribute
                       0.14%  5.1147ms       400  12.786us  9.8410us  46.193us  cudaMemsetAsync
                       0.12%  4.6512ms      9600     484ns     201ns  692.95us  cudaSetupArgument
                       0.11%  4.1579ms       200  20.789us  17.218us  58.022us  cudaLaunchCooperativeKernelMultiDevice
                       0.10%  3.7761ms       400  9.4400us  7.2630us  42.761us  cudaLaunch
                       0.08%  2.8901ms      2000  1.4450us     858ns  8.8220us  cudaEventRecord
                       0.07%  2.7301ms        16  170.63us  3.0050us  1.7013ms  cudaDeviceEnablePeerAccess
                       0.06%  2.3190ms      1598  1.4510us     931ns  6.9540us  cudaStreamWaitEvent
    ```
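The "200 iterations spent" numbers in the logs above come from CPU-side timing with std::chrono. A sketch of the measurement loop, assuming the same declarations as before (the real harness is in the gist linked above):

```cuda
#include <chrono>
#include <cstdio>

// Assumed: handle, stream, comm, d_A/d_B/d_C, n, alpha, beta, dev set up per device.
auto t0 = std::chrono::steady_clock::now();
for (int iter = 0; iter < 200; ++iter) {
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, d_A, n, d_B, n, &beta, d_C, n);
  ncclAllReduce(d_C, d_C, (size_t)n * n, ncclFloat, ncclSum, comm, stream);
}
cudaStreamSynchronize(stream);  // drain all queued work before stopping the clock
auto t1 = std::chrono::steady_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
printf("device: [%d], 200 iterations spent: [%ld ms]\n", dev, (long)ms);
```

Note that without the final cudaStreamSynchronize, the loop would only measure host-side launch cost, since all the calls above are asynchronous with respect to the host.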