Created by: yihuaxu
While investigating ERNIE, we found that layernorm lacks a multi-threaded JIT implementation. This PR adds multi-threaded layernorm optimization using OpenMP. However, in an initial benchmark with ERNIE, single-threaded layernorm performance appears to be worse than before.
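For reviewers unfamiliar with the approach, here is a minimal sketch of row-parallel layernorm with OpenMP. It is not the JIT kernel from this PR: the function name `layer_norm_rows`, its signature, and the default `epsilon` are assumptions made for illustration, and the body is plain scalar C++ rather than vectorized JIT code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the PR's JIT kernel): normalizes each row of a
// row-major [rows x cols] matrix to zero mean and unit variance, then applies
// the per-column scale (gamma) and bias (beta).
void layer_norm_rows(const std::vector<float>& x, std::vector<float>& y,
                     const std::vector<float>& gamma,
                     const std::vector<float>& beta,
                     long rows, long cols, float epsilon = 1e-5f) {
  assert(rows >= 0 && cols > 0);
  assert(x.size() == static_cast<std::size_t>(rows * cols));
  assert(y.size() == x.size());
  assert(gamma.size() == static_cast<std::size_t>(cols));
  assert(beta.size() == static_cast<std::size_t>(cols));

  // Rows are independent, so they can be divided among OpenMP threads.
  // Without -fopenmp the pragma is ignored and the loop runs serially.
#pragma omp parallel for schedule(static)
  for (long r = 0; r < rows; ++r) {
    const float* in = x.data() + r * cols;
    float* out = y.data() + r * cols;

    // Mean of the row.
    float mean = 0.f;
    for (long c = 0; c < cols; ++c) mean += in[c];
    mean /= static_cast<float>(cols);

    // Biased variance of the row.
    float var = 0.f;
    for (long c = 0; c < cols; ++c) {
      const float d = in[c] - mean;
      var += d * d;
    }
    var /= static_cast<float>(cols);

    // Normalize, then scale and shift.
    const float inv_std = 1.f / std::sqrt(var + epsilon);
    for (long c = 0; c < cols; ++c) {
      out[c] = (in[c] - mean) * inv_std * gamma[c] + beta[c];
    }
  }
}
```

Each thread writes only to its own rows of `y`, so no synchronization is needed inside the loop. The thread count can be controlled with the standard `OMP_NUM_THREADS` environment variable when the code is built with OpenMP enabled.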
CPU Model: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Thread Number | Baseline | Optimized |
---|---|---|
1 thread | 2341.13 ms | 2502.5 ms |
20 threads | 4009.16 ms | 847.298 ms |