Created by: dzhwinter
Fixes https://github.com/PaddlePaddle/Paddle/issues/8594. This PR optimizes four operators: sequence_softmax, sequence_softmax_grad, softmax, and softmax_grad. Taking the sequence_softmax op as an example and comparing with the previous implementation, the sequence_softmax operator's time cost is now lower than the mul operator's: the average time per minibatch drops from 1.94211 to 0.981581 (roughly a 2x speedup).
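For context, here is a minimal NumPy sketch of the op's semantics (softmax applied independently to each sequence of a LoD tensor). This is an illustration only, not the optimized kernel this PR introduces; the function and argument names are hypothetical.

```python
import numpy as np

def sequence_softmax(x, lod):
    """x: 1-D array of scores; lod: offsets delimiting each sequence,
    e.g. lod=[0, 3, 7] means two sequences x[0:3] and x[3:7]."""
    out = np.empty_like(x)
    for start, end in zip(lod[:-1], lod[1:]):
        seq = x[start:end]
        e = np.exp(seq - seq.max())   # subtract max for numerical stability
        out[start:end] = e / e.sum()  # normalize within this sequence only
    return out

# Each sequence sums to 1 independently:
print(sequence_softmax(np.array([1.0, 2.0, 3.0, 4.0, 5.0]), [0, 2, 5]))
```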
Before optimization:

| Event | Calls | Total (ms) | Min (ms) | Max (ms) | Ave (ms) |
|---|---|---|---|---|---|
| thread0::sum | 2080 | 208.066 | 0.031584 | 1.06134 | 0.100032 |
| thread0::sequence_softmax | 67 | 130.121 | 0.078336 | 9.44381 | 1.94211 |
| thread0::mul_grad | 741 | 121.969 | 0.062656 | 1.37158 | 0.164601 |
| thread0::lod_tensor_to_array | 2 | 85.2417 | 30.0434 | 55.1984 | 42.6209 |
After optimization:

| Event | Calls | Total (ms) | Min (ms) | Max (ms) | Ave (ms) |
|---|---|---|---|---|---|
| thread0::sum | 204128 | 17150.6 | 0.011904 | 8.13488 | 0.0840188 |
| thread0::mul_grad | 72729 | 11242.9 | 0.042464 | 6.78758 | 0.154587 |
| thread0::sequence_softmax_grad | 6575 | 6767.53 | 0.039008 | 7.94013 | 1.02928 |
| thread0::sequence_softmax | 6575 | 6453.9 | 0.044608 | 8.09661 | 0.981581 |
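For reference, a hedged sketch of how a profiler table like the above can be produced with the fluid profiler of that era; argument names and the exact fluid API may differ across Paddle versions.

```python
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.profiler as profiler

# Build a tiny graph containing sequence_softmax.
x = fluid.layers.data(name='x', shape=[1], dtype='float32', lod_level=1)
y = fluid.layers.sequence_softmax(input=x)

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

# Two sequences of length 3 packed into one LoD tensor (illustrative data).
data = fluid.create_lod_tensor(
    np.random.rand(6, 1).astype('float32'), [[3, 3]], place)

# On exit the profiler prints a per-op table (Event/Calls/Total/Min/Max/Ave).
with profiler.profiler('GPU', 'ave'):
    for _ in range(10):
        exe.run(fluid.default_main_program(),
                feed={'x': data}, fetch_list=[y])
```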