[RFC] Opt/hash op !15898

!15898 · Closed · Created Feb 24, 2019 by saxon_zh@saxon_zh

  • Overview 0
  • Commits 2
  • Changes 4

Created by: jianhang-liu

The Hash OP has two main computations: (1) a call to XXH64() to compute the hash, and (2) "hash % mod_by", where mod_by is the size of the lookup table.

We tried the following optimizations:

  1. Replaced "hash % mod_by" with "hash - (hash / mod_by) * mod_by", and sped up the "hash / mod_by" division with the external library libdivide.
  2. Compiled XXH64() inline and added the compile flag XXH_FORCE_MEMORY_ACCESS.

Item 1 contributes much more than item 2.

With the above changes, we got a 20%~35% performance gain at the OP level. However, the gain at the model level is only <5%, due to a "strange" performance drop in another OP (i.e. fused_embedding_seq_pool). We need to check further why that drop happens. @tensor-tang Any idea?

See below for benchmarks:

1. On 2699v3

**Before** — batch_size: 1, repeat: 10, threads: 1, thread id: 0, latency: 10021.3ms, fps: 0.0997878; sample number: 48685, average latency of each sample: 0.205839ms

| Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
|---|---|---|---|---|---|---|
| thread0::fused_embedding_seq_pool | 7789600 | 25657 | 0.001896 | 21.0386 | 0.00329375 | 0.279999 |
| thread0::hash | 5842200 | 24603.8 | 0.002024 | 21.2326 | 0.00421139 | 0.268505 |
| thread0::fc | 973700 | 14539.6 | 0.013026 | 21.344 | 0.0149324 | 0.158673 |
| thread0::sequence_enumerate | 5842200 | 12935.1 | 0.001709 | 20.8716 | 0.00221408 | 0.141163 |
| thread0::sum | 973700 | 6370.22 | 0.004313 | 20.5508 | 0.00654229 | 0.0695192 |
| thread0::softsign | 973700 | 2327.97 | 0.001672 | 0.024116 | 0.00239085 | 0.0254055 |
| thread0::cos_sim | 486850 | 2241.46 | 0.003985 | 8.81306 | 0.00460401 | 0.0244614 |
| thread0::feed | 1947400 | 1748.23 | 0.000584 | 0.030094 | 0.000897723 | 0.0190787 |
| thread0::fetch | 486850 | 1209.13 | 0.00206 | 0.023377 | 0.00248357 | 0.0131954 |

**After** — batch_size: 1, repeat: 10, threads: 1, thread id: 0, latency: 9750.24ms, fps: 0.102562; sample number: 48685, average latency of each sample: 0.200272ms

| Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
|---|---|---|---|---|---|---|
| thread0::fused_embedding_seq_pool | 7789600 | 27892.1 | 0.001933 | 21.3465 | 0.00358068 | 0.314239 |
| thread0::hash | 5842200 | 18945 | 0.001756 | 22.4899 | 0.00324279 | 0.213439 |
| thread0::fc | 973700 | 14914.7 | 0.013256 | 22.2217 | 0.0153175 | 0.168032 |
| thread0::sequence_enumerate | 5842200 | 12822.3 | 0.001682 | 20.3096 | 0.00219477 | 0.144459 |
| thread0::sum | 973700 | 6496.36 | 0.004225 | 20.9529 | 0.00667183 | 0.0731895 |
| thread0::cos_sim | 486850 | 2360.41 | 0.004238 | 21.2927 | 0.00484834 | 0.026593 |
| thread0::softsign | 973700 | 2291.19 | 0.001608 | 19.7146 | 0.00235308 | 0.0258132 |
| thread0::feed | 1947400 | 1796.12 | 0.000586 | 21.3117 | 0.000922318 | 0.0202355 |
| thread0::fetch | 486850 | 1242.6 | 0.002142 | 0.023137 | 0.00255233 | 0.0139994 |

2. On SKX6140

**Before** — batch_size: 1, repeat: 10, threads: 1, thread id: 0, latency: 10666.7ms, fps: 0.0937501; sample number: 48685, average latency of each sample: 0.219095ms

| Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
|---|---|---|---|---|---|---|
| thread0::hash | 5842200 | 24818.2 | 0.002372 | 1.32739 | 0.0042481 | 0.282479 |
| thread0::fused_embedding_seq_pool | 7789600 | 24470.3 | 0.002441 | 1.67723 | 0.00314141 | 0.27852 |
| thread0::sequence_enumerate | 5842200 | 15027.8 | 0.002197 | 0.908093 | 0.00257229 | 0.171045 |
| thread0::fc | 973700 | 9299.96 | 0.007387 | 1.60737 | 0.00955115 | 0.105851 |
| thread0::sum | 973700 | 5202.84 | 0.004106 | 0.425563 | 0.00534337 | 0.0592183 |
| thread0::softsign | 973700 | 2895.76 | 0.002396 | 0.410873 | 0.00297397 | 0.0329593 |
| thread0::feed | 1947400 | 2652.1 | 0.001111 | 1.30264 | 0.00136187 | 0.030186 |
| thread0::cos_sim | 486850 | 2392.37 | 0.004267 | 1.35109 | 0.00491398 | 0.0272298 |
| thread0::fetch | 486850 | 1099.23 | 0.001871 | 0.404445 | 0.00225783 | 0.0125113 |

**After** — batch_size: 1, repeat: 10, threads: 1, thread id: 0, latency: 10493.8ms, fps: 0.095294; sample number: 48685, average latency of each sample: 0.215546ms

| Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
|---|---|---|---|---|---|---|
| thread0::fused_embedding_seq_pool | 7789600 | 27182.2 | 0.002518 | 2.44027 | 0.00348955 | 0.314938 |
| thread0::hash | 5842200 | 20382.6 | 0.002411 | 2.05364 | 0.00348885 | 0.236156 |
| thread0::sequence_enumerate | 5842200 | 15313.4 | 0.002242 | 2.27128 | 0.00262117 | 0.177424 |
| thread0::fc | 973700 | 9241.42 | 0.00752 | 2.52018 | 0.00949103 | 0.107073 |
| thread0::sum | 973700 | 5307.15 | 0.004207 | 2.44368 | 0.0054505 | 0.0614896 |
| thread0::softsign | 973700 | 2947.92 | 0.002332 | 0.077244 | 0.00302754 | 0.0341551 |
| thread0::feed | 1947400 | 2537.44 | 0.00107 | 0.027547 | 0.00130299 | 0.0293993 |
| thread0::cos_sim | 486850 | 2297.62 | 0.003981 | 0.444106 | 0.00471936 | 0.0266206 |
| thread0::fetch | 486850 | 1100.08 | 0.001824 | 0.02604 | 0.00225959 | 0.0127457 |

Reference: paddlepaddle/Paddle!15898
Source branch: github/fork/jianhang-liu/opt/hash_op