oneDNN bidirectional fusion_gru pass
Created by: grygielski
GRU models optimization
Since we are developing the oneDNN version of the `fusion_gru` operator, we came up with an idea to introduce more possible improvements while waiting for the int8 kernel. We have prepared a Proof of Concept pass that merges 2 `fusion_gru` operators followed by a `concat` into a single bidirectional `fusion_gru` operator.
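For reference, what makes this single-operator replacement possible is that oneDNN exposes bidirectional execution directly in its RNN primitives via the `bidirectional_concat` direction, so the kernel itself produces the concatenated output of both directions. Below is a minimal sketch of constructing such a primitive with the oneDNN v1.x/v2.x C++ API; it is not the actual Paddle pass code, and all sizes are illustrative:

```cpp
#include <dnnl.hpp>

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Illustrative sizes: L layers, D directions, T time steps,
    // N batch, C input channels, H hidden channels.
    const memory::dim L = 1, D = 2, T = 10, N = 1, C = 64, H = 64;

    // oneDNN RNN layouts: tnc for layer data, ldigo/ldgo for weights/bias.
    memory::desc src_layer_md({T, N, C}, memory::data_type::f32,
                              memory::format_tag::tnc);
    memory::desc wei_layer_md({L, D, C, 3, H}, memory::data_type::f32,
                              memory::format_tag::ldigo);
    memory::desc wei_iter_md({L, D, H, 3, H}, memory::data_type::f32,
                             memory::format_tag::ldigo);
    memory::desc bias_md({L, D, 3, H}, memory::data_type::f32,
                         memory::format_tag::ldgo);
    // With bidirectional_concat the destination channel dim is 2 * H:
    // the concat of both directions is done by the kernel itself.
    memory::desc dst_layer_md({T, N, 2 * H}, memory::data_type::f32,
                              memory::format_tag::tnc);

    // Empty descs tell oneDNN to use zero initial/final hidden states.
    gru_forward::desc desc(prop_kind::forward_inference,
                           rnn_direction::bidirectional_concat,
                           src_layer_md, memory::desc(),
                           wei_layer_md, wei_iter_md, bias_md,
                           dst_layer_md, memory::desc());
    gru_forward::primitive_desc pd(desc, eng);
    gru_forward gru(pd);
    (void)gru;
    return 0;
}
```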
Problems
However, there is one problem with this approach. Because of the way the oneDNN GRU kernel is implemented, we can only get correct numerical results when every sentence in a batch has the same length. Thus we can only apply it with `BatchSize==1` for now. It works a bit faster than the Native PP solution and also than oneDNN `fusion_gru` without the bidirectional pass. We could possibly speed it up even more by omitting reorders between bidirectional `fusion_gru` operators, but that has to be implemented and tested; for now, all we have is a simple, working PoC.
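To make the length restriction concrete, here is a toy sketch (plain C++, no oneDNN) of why padded batches break the reverse direction: the dense `{T, N, C}` layout carries no per-sentence length, so the backward pass consumes padding before any real token for every sentence shorter than the longest one. The tokens and sizes here are made up:

```cpp
#include <iostream>
#include <string>
#include <vector>

int main() {
    const std::string PAD = "<pad>";
    // Batch of 2 sentences padded to T = 4 time steps.
    std::vector<std::vector<std::string>> batch = {
        {"the", "cat", "sat", "here"},  // real length 4
        {"hi", "there", PAD, PAD},      // real length 2
    };

    // The reverse direction as a dense kernel sees it: t = T-1 .. 0.
    for (const auto& sentence : batch) {
        std::cout << "reverse pass consumes:";
        for (auto it = sentence.rbegin(); it != sentence.rend(); ++it)
            std::cout << " " << *it;  // padding comes first for sentence 2
        std::cout << "\n";
    }
    return 0;
}
```

With `BatchSize==1` (or all sentences of equal length) no padding is needed, which is why the pass is numerically safe only under those conditions.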
Question
The question here is whether we should continue developing this pass. Will it be useful in real applications with such requirements (`BS==1`, or equal length of every sentence in a batch)? We just don't want to invest our time in it if it won't find any use case.
Data types
This pass would speed up all oneDNN `fusion_gru` kernels (fp32, bf16, int8); however, the restrictions on batch size or sentence length would be the same for all of them. I've gathered results for the fp32 kernel of every option in the table below.
fp32 performance comparison (on my local machine):
| | BS=1, CPU_THREADS=1 | BS=50, CPU_THREADS=4 |
|---|---|---|
| Native PP | 1627 FPS | 3485 FPS |
| oneDNN fusion_gru | 1590 FPS | 5368 FPS |
| oneDNN bidirectional fusion_gru | 1749 FPS | --- |
These tests were performed on the CAPI test model (https://github.com/PaddlePaddle/Paddle/pull/25534).