PaddlePaddle / Paddle · Issue #25990

Opened August 5, 2020 by saxon_zh (Guest)

oneDNN bidirectional fusion_gru pass

Created by: grygielski

GRU models optimization

Since we are developing the oneDNN version of the fusion_gru operator, we came up with an idea for further improvements while waiting for the int8 kernel. We have prepared a Proof of Concept pass that merges two fusion_gru operators followed by a concat into a single bidirectional fusion_gru operator.
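The pattern the PoC pass looks for can be sketched roughly as follows. This is an illustrative toy model only: the graph representation and function names here are invented, and Paddle's real pass framework (the C++ GraphPatternDetector machinery) works differently.

```python
# Toy sketch of the fusion pattern: two fusion_gru ops (one forward,
# one reverse) whose outputs feed a single concat op. The graph is a
# plain dict {node_id: {"op", "inputs", "is_reverse"}} for illustration.

def match_bidirectional_gru(graph):
    """Return (fwd_id, rev_id, concat_id) triples that could be fused."""
    matches = []
    for cid, node in graph.items():
        if node["op"] != "concat" or len(node["inputs"]) != 2:
            continue
        a_id, b_id = node["inputs"]
        a, b = graph[a_id], graph[b_id]
        # Both producers must be fusion_gru, running in opposite directions.
        if a["op"] == b["op"] == "fusion_gru" and a["is_reverse"] != b["is_reverse"]:
            fwd_id, rev_id = (a_id, b_id) if not a["is_reverse"] else (b_id, a_id)
            matches.append((fwd_id, rev_id, cid))
    return matches

# Example: a forward and a reverse GRU over the same input, then concat.
example = {
    "x":     {"op": "data",       "inputs": []},
    "gru_f": {"op": "fusion_gru", "inputs": ["x"], "is_reverse": False},
    "gru_r": {"op": "fusion_gru", "inputs": ["x"], "is_reverse": True},
    "cat":   {"op": "concat",     "inputs": ["gru_f", "gru_r"]},
}
print(match_bidirectional_gru(example))  # → [('gru_f', 'gru_r', 'cat')]
```

Each matched triple would then be replaced by one bidirectional fusion_gru node, which is the rewrite step the actual pass performs.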

Problems

However, there is one problem with this approach. Because of the way the oneDNN GRU kernel is implemented, we only get correct numerical results when every sentence in a batch has the same length. Thus we can only apply it with BatchSize==1 for now. It works somewhat faster than the native PaddlePaddle solution, and also faster than oneDNN fusion_gru without the bidirectional pass. We could perhaps speed it up further by omitting reorders between bidirectional fusion_gru operators, but that still has to be implemented and tested; for now all we have is a simple, working PoC.
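The reason for the constraint is that the oneDNN GRU primitive assumes a dense [T, N, C] input, so a batch of variable-length sentences would mix padding into the recurrence. The resulting applicability check is trivial; a minimal sketch (the function name is hypothetical, not an actual Paddle API):

```python
# Gate for applying the bidirectional pass: the batch qualifies only if
# it holds a single sequence (BS==1) or all sequences share one length.

def can_use_bidirectional_pass(seq_lens):
    """seq_lens: per-sequence lengths of the batch (e.g. from LoD info)."""
    return len(seq_lens) == 1 or len(set(seq_lens)) == 1

print(can_use_bidirectional_pass([7]))        # → True  (BS == 1)
print(can_use_bidirectional_pass([5, 5, 5]))  # → True  (equal lengths)
print(can_use_bidirectional_pass([5, 3, 5]))  # → False (ragged batch)
```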

Question

The question here is whether we should continue developing this pass. Will it be useful in real applications given such requirements (BS==1, or equal length of every sentence in a batch)? We just don't want to invest our time in it if it won't find any use case.

Data types

This pass would speed up all oneDNN fusion_gru kernels (fp32, bf16, int8). However, the restrictions on batch size or sequence length would be the same for all of them. I've gathered fp32 kernel results for every option in the table below.

fp32 performance comparison (on my local machine):

| | BS=1, CPU_THREADS=1 | BS=50, CPU_THREADS=4 |
|---|---|---|
| Native PP | 1627 FPS | 3485 FPS |
| oneDNN fusion_gru | 1590 FPS | 5368 FPS |
| oneDNN bidirectional fusion_gru | 1749 FPS | --- |

These tests were performed on the CAPI test model (https://github.com/PaddlePaddle/Paddle/pull/25534).
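For a quick sense of the gains, the BS=1, single-thread column of the table implies the following relative speedups:

```python
# Relative speedups implied by the fp32 numbers above (BS=1, 1 thread).
fps = {
    "native_pp": 1627,       # Native PaddlePaddle
    "onednn_gru": 1590,      # oneDNN fusion_gru, no bidirectional pass
    "onednn_bidir": 1749,    # oneDNN bidirectional fusion_gru (this PoC)
}

gain_vs_native = (fps["onednn_bidir"] / fps["native_pp"] - 1) * 100
gain_vs_onednn = (fps["onednn_bidir"] / fps["onednn_gru"] - 1) * 100

print(f"{gain_vs_native:.1f}% over Native PP")          # → 7.5% over Native PP
print(f"{gain_vs_onednn:.1f}% over oneDNN fusion_gru")  # → 10.0% over oneDNN fusion_gru
```

So at BS=1 the bidirectional pass is roughly 7.5% faster than the native solution and 10% faster than plain oneDNN fusion_gru on this machine.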
