oneDNN bidirectional fusion_gru pass
Created by: grygielski
GRU models optimization
Since we are developing the oneDNN version of the `fusion_gru` operator, we came up with an idea to introduce more possible improvements while waiting for the int8 kernel. We have prepared a Proof of Concept pass that merges 2 `fusion_gru` operators followed by a `concat` into a single bidirectional `fusion_gru` operator.
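For reference, what makes this single-operator replacement possible is that oneDNN exposes bidirectional execution directly in its RNN primitives via the `bidirectional_concat` direction, so the kernel itself produces the concatenated output of both directions. Below is a minimal sketch of constructing such a primitive with the oneDNN v1.x/v2.x C++ API; it is not the actual Paddle pass code, and all sizes are illustrative:

```cpp
#include <dnnl.hpp>

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Illustrative sizes: L layers, D directions, T time steps,
    // N batch, C input channels, H hidden channels.
    const memory::dim L = 1, D = 2, T = 10, N = 1, C = 64, H = 64;

    // oneDNN RNN layouts: tnc for layer data, ldigo/ldgo for weights/bias.
    memory::desc src_layer_md({T, N, C}, memory::data_type::f32,
                              memory::format_tag::tnc);
    memory::desc wei_layer_md({L, D, C, 3, H}, memory::data_type::f32,
                              memory::format_tag::ldigo);
    memory::desc wei_iter_md({L, D, H, 3, H}, memory::data_type::f32,
                             memory::format_tag::ldigo);
    memory::desc bias_md({L, D, 3, H}, memory::data_type::f32,
                         memory::format_tag::ldgo);
    // With bidirectional_concat the destination channel dim is 2 * H:
    // the concat of both directions is done by the kernel itself.
    memory::desc dst_layer_md({T, N, 2 * H}, memory::data_type::f32,
                              memory::format_tag::tnc);

    // Empty descs tell oneDNN to use zero initial/final hidden states.
    gru_forward::desc desc(prop_kind::forward_inference,
                           rnn_direction::bidirectional_concat,
                           src_layer_md, memory::desc(),
                           wei_layer_md, wei_iter_md, bias_md,
                           dst_layer_md, memory::desc());
    gru_forward::primitive_desc pd(desc, eng);
    gru_forward gru(pd);
    (void)gru;
    return 0;
}
```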
Problems
However, there is one problem with this approach. Because of the way the oneDNN GRU kernel is implemented, we can only get correct numerical results when every sentence in a batch has the same length. Thus we can only apply it with `BatchSize==1` for now. It works a bit faster than the Native PP solution and also than oneDNN `fusion_gru` without the bidirectional pass. We could possibly speed it up even more by omitting reorders between bidirectional `fusion_gru` operators, but that has to be implemented and tested; for now, all we have is a simple, working PoC.
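To make the length restriction concrete, here is a toy sketch (plain C++, no oneDNN) of why padded batches break the reverse direction: the dense `{T, N, C}` layout carries no per-sentence length, so the backward pass consumes padding before any real token for every sentence shorter than the longest one. The tokens and sizes here are made up:

```cpp
#include <iostream>
#include <string>
#include <vector>

int main() {
    const std::string PAD = "<pad>";
    // Batch of 2 sentences padded to T = 4 time steps.
    std::vector<std::vector<std::string>> batch = {
        {"the", "cat", "sat", "here"},  // real length 4
        {"hi", "there", PAD, PAD},      // real length 2
    };

    // The reverse direction as a dense kernel sees it: t = T-1 .. 0.
    for (const auto& sentence : batch) {
        std::cout << "reverse pass consumes:";
        for (auto it = sentence.rbegin(); it != sentence.rend(); ++it)
            std::cout << " " << *it;  // padding comes first for sentence 2
        std::cout << "\n";
    }
    return 0;
}
```

With `BatchSize==1` (or all sentences of equal length) no padding is needed, which is why the pass is numerically safe only under those conditions.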
Question
The question here is whether we should continue developing this pass. Will it be useful in real applications with such requirements (`BS==1`, or equal length of every sentence in a batch)? We just don't want to invest our time in it if it won't find any use case.
Data types
This pass would speed up all oneDNN `fusion_gru` kernels (fp32, bf16, int8); however, the restrictions on batch size or sentence length would be the same for all of them. I've gathered results for the fp32 kernel of every option in the table below.
fp32 performance comparison (on my local machine):
| | BS=1, CPU_THREADS=1 | BS=50, CPU_THREADS=4 |
|---|---|---|
| Native PP | 1627 FPS | 3485 FPS |
| oneDNN fusion_gru | 1590 FPS | 5368 FPS |
| oneDNN bidirectional fusion_gru | 1749 FPS | --- |
These tests were performed on the CAPI test model (https://github.com/PaddlePaddle/Paddle/pull/25534).