Created by: zhaoyuchen2018
Elementwise performance is poor when the computation falls into CommonGradBroadcastCUDA. Add new kernels for different broadcast data patterns.
elementwise shapes (x, y) | before opt (ms) | after opt (ms) |
---|---|---|
x=2048X100X1X32,y=1X128X32 | 156 | 46 |
x=2048X100X32,y=2048X100X1 | 11 | 0.48 |
x=2048X100X32,y=1X100X32 | 10 | 0.27 |
x=2048X100X32,y=2048X1X32 | 11 | 0.22 |
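
For reference, a minimal CPU sketch of the reduction these broadcast patterns require, assuming an elementwise add, where the gradient of the broadcast operand is the output gradient summed over the broadcast dimension. This is not the CUDA kernel added here; `ReduceBroadcastMid` and its `pre`/`n`/`post` parameters are hypothetical names used only for illustration. Viewing the output gradient as `[pre, n, post]` covers the three 3-D rows above: `y=2048X1X32` is `pre=2048, n=100, post=32`; `y=1X100X32` is `pre=1, n=2048, post=3200`; `y=2048X100X1` is `pre=204800, n=32, post=1`.

```cpp
#include <cstddef>
#include <vector>

// CPU reference only (hypothetical helper, not Paddle's implementation).
// dout is a row-major tensor of shape [pre, n, post]; y was broadcast
// along the middle dimension, so dy has shape [pre, 1, post] and
// dy[i, 0, k] = sum over j of dout[i, j, k].
std::vector<float> ReduceBroadcastMid(const std::vector<float>& dout,
                                      std::size_t pre, std::size_t n,
                                      std::size_t post) {
  std::vector<float> dy(pre * post, 0.0f);
  for (std::size_t i = 0; i < pre; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      for (std::size_t k = 0; k < post; ++k) {
        dy[i * post + k] += dout[(i * n + j) * post + k];
      }
    }
  }
  return dy;
}
```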

Test case: https://github.com/PaddlePaddle/Paddle/pull/23209