use 32 bit index to improve expand op !23899

!23899 Merged Apr 15, 2020, created by saxon_zh@saxon_zh

Created by: zhangting2020

background

  • Related to: benchmark #335. The default index type of Eigen tensors is long; this PR uses a 32-bit (int) index to speed up the expand op (a minimal sketch of the idea follows the test case below).
  • test case:
config = ExpandConfig(x_shape=[32, 807, 1],
                      expand_times=[4, 1, 807])
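
The diff itself is not shown in this description, so here is a minimal, hypothetical C++ sketch of what switching the Eigen index type means, assuming the expand kernel is an Eigen broadcast over a TensorMap. The function names and shapes below are illustrative only (shapes taken from the test case above), not Paddle's actual operator code:

#include <unsupported/Eigen/CXX11/Tensor>
#include <vector>

// The same broadcast written with Eigen's default 64-bit index (long) and
// with a 32-bit index (int). Only the IndexType template argument changes;
// it must still be able to address every element of the output (numel < 2^31).
void RunExpand64(const float* x, float* out) {
  Eigen::TensorMap<Eigen::Tensor<const float, 3, Eigen::RowMajor, long>>
      x_map(x, 32, 807, 1);
  Eigen::TensorMap<Eigen::Tensor<float, 3, Eigen::RowMajor, long>>
      out_map(out, 128, 807, 807);
  Eigen::DSizes<long, 3> bcast(4, 1, 807);
  out_map = x_map.broadcast(bcast);  // index arithmetic done in 64-bit
}

void RunExpand32(const float* x, float* out) {
  Eigen::TensorMap<Eigen::Tensor<const float, 3, Eigen::RowMajor, int>>
      x_map(x, 32, 807, 1);
  Eigen::TensorMap<Eigen::Tensor<float, 3, Eigen::RowMajor, int>>
      out_map(out, 128, 807, 807);
  Eigen::DSizes<int, 3> bcast(4, 1, 807);
  out_map = x_map.broadcast(bcast);  // index arithmetic done in 32-bit
}

int main() {
  // x_shape=[32, 807, 1], expand_times=[4, 1, 807] -> output shape [128, 807, 807]
  std::vector<float> x(32 * 807, 1.0f), out(128 * 807 * 807);
  RunExpand32(x.data(), out.data());
  return 0;
}

Nothing else about the kernel changes; the gain presumably comes from the 32-bit divide/modulo in the broadcast index computation being cheaper than 64-bit, especially on GPU.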

performance

GPU: CUDA 10, V100, profiled with nvprof

before       after        speed up
1.2544 ms    0.7556 ms    39.8%

CPU: 6148, measured with the Paddle profiler

before       after        speed up
3765.22 ms   1136.7 ms    69.8%
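
The "speed up" column reads as the relative reduction in expand op time: (1.2544 - 0.7556) / 1.2544 ≈ 39.8% on GPU and (3765.22 - 1136.7) / 3765.22 ≈ 69.8% on CPU.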

details

GPU

  • before:

W0325 06:53:01.038626 29504 device_context.cc:235] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0325 06:53:01.044080 29504 device_context.cc:243] device: 0, cuDNN Version: 7.5.
feed random data
{
  framework: "paddle",
  version: "0.0.0",
  name: "expand",
  device: "GPU",
  speed: { repeat: 100, start: 10, end: 90, total: 777.73128, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
}
==29504== Profiling application: python expand.py --dtype float32 --run_with_executor True --check_output False --profiler none --backward False --use_gpu True --repeat 100 --log_level 0
==29504== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   97.98%  6.19575s       100  61.958ms  54.682ms  339.99ms  [CUDA memcpy DtoH]
                    1.98%  125.44ms       100  1.2544ms  1.2529ms  1.2604ms  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=3, int=1, long>, int=0, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::DSizes<int, int=3> const , Eigen::TensorMap<Eigen::Tensor<float const , int=3, int=1, long>, int=0, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=3)
                    0.03%  2.0082ms       103  19.497us  2.4960us  29.215us  [CUDA memcpy HtoD]
                    0.00%  7.5190us         4  1.8790us  1.5990us  2.7200us  [CUDA memset]
      API calls:   47.07%  6.35675s       203  31.314ms  19.971us  343.26ms  cudaMemcpy
                   22.75%  3.07266s         8  384.08ms  1.6010us  3.07231s  cudaStreamCreateWithFlags
                    9.88%  1.33484s         1  1.33484s  1.33484s  1.33484s  cudaStreamCreate
                    9.74%  1.31526s         5  263.05ms  23.420us  1.31509s  cudaMemGetInfo
                    6.07%  819.79ms       406  2.0192ms  5.6880us  66.099ms  cuModuleUnload
                    4.28%  578.66ms         4  144.67ms     750ns  578.66ms  cudaFree
                    0.10%  12.846ms        14  917.58us  5.2290us  12.290ms  cudaMalloc
                    0.04%  5.9010ms       100  59.010us  53.821us  73.255us  cudaLaunchKernel
  • after (the Eigen kernel is now instantiated with an int index instead of long, and its average time per call drops from 1.2544 ms to 755.55 us):
W0416 04:00:06.941998 19678 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0416 04:00:07.149387 19678 device_context.cc:245] device: 0, cuDNN Version: 7.5.
{
  framework: "paddle",
  version: "0.0.0",
  name: "expand",
  device: "GPU",
  speed: { repeat: 100, start: 10, end: 90, total: 711.64674, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
}
==19678== Profiling application: python expand.py --dtype float32 --run_with_executor True --check_output False --profiler none --backward False --use_gpu True --repeat 100 --log_level 0
==19678== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   98.84%  6.58699s       100  65.870ms  61.893ms  306.97ms  [CUDA memcpy DtoH]
                    1.13%  75.555ms       100  755.55us  755.13us  757.56us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=3, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::DSizes<int, int=3> const , Eigen::TensorMap<Eigen::Tensor<float const , int=3, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=3)
                    0.03%  2.0074ms       103  19.489us  2.2720us  29.664us  [CUDA memcpy HtoD]
                    0.00%  7.9670us         4  1.9910us  1.6000us  2.3990us  [CUDA memset]
      API calls:   40.03%  6.70104s       203  33.010ms  22.238us  309.42ms  cudaMemcpy
                   37.08%  6.20759s         8  775.95ms  1.8530us  6.20726s  cudaStreamCreateWithFlags
                    9.26%  1.55004s         1  1.55004s  1.55004s  1.55004s  cudaStreamCreate
                    8.71%  1.45876s         4  364.69ms  1.0130us  1.45875s  cudaFree
                    4.81%  806.02ms       406  1.9853ms  5.2450us  68.096ms  cuModuleUnload
                    0.04%  6.1261ms       100  61.260us  45.299us  103.78us  cudaLaunchKernel
                    0.01%  2.3611ms         6  393.52us  379.07us  438.43us  cuDeviceTotalMem
                    0.01%  1.8451ms       564  3.2710us     131ns  128.21us  cuDeviceGetAttribute
                    0.01%  1.6857ms         1  1.6857ms  1.6857ms  1.6857ms  cudaHostAlloc
                    0.01%  1.4976ms       101  14.828us  11.093us  24.552us  cudaStreamSynchronize

CPU

  • before:
-------------------------       Event Summary       -------------------------

Event                               Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::expand                     100         376524      3755.3      4245.9      3765.24     0.973765
  thread0::expand/compute           100         376522      3755.28     4245.85     3765.22     0.973758
  thread0::expand/infer_shape       100         0.840627    0.006981    0.018734    0.00840627  2.17403e-06
  thread0::expand/prepare_data      100         0.189063    0.0014      0.010331    0.00189063  4.88954e-07
thread0::fetch                      100         10143.2     89.0853     324.859     101.432     0.0262323
thread0::feed                       100         1.0212      0.00856     0.016463    0.010212    2.64102e-06
  • after:
-------------------------       Event Summary       -------------------------

Event                               Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::expand                     100         113673      1130.89     1409.54     1136.73     0.922633
  thread0::expand/compute           100         113670      1130.86     1409.49     1136.7      0.922614
  thread0::expand/infer_shape       100         0.81159     0.006871    0.012413    0.0081159   6.58732e-06
  thread0::expand/prepare_data      100         0.221432    0.00154     0.008571    0.00221432  1.79727e-06
thread0::fetch                      100         9530.76     88.7755     359.891     95.3076     0.077357
thread0::feed                       100         1.21227     0.010558    0.019541   
Reference: paddlepaddle/Paddle!23899
Source branch: github/fork/zhangting2020/expand_perf