Created by: dzhwinter
Fixes https://github.com/PaddlePaddle/Paddle/issues/9220. For a single batch, the average time of the sequence_expand op drops from 0.595545 ms to 0.331348 ms, roughly a 1.8x speedup. Note that the CUDA kernel still performs many global memory accesses; staging data in shared memory should give a further improvement (a sketch of the current scheme follows, and a shared-memory variant comes after the profiles).
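For context, the op copies each source sequence into the output as many times as the reference LoD dictates. Below is a minimal sketch of that parallelization scheme, not the exact Paddle kernel: the names `seq_expand_kernel`, `x_lod`, and `ref_lod` are illustrative assumptions, with one thread block per source sequence and threads striding over the elements of all repeated copies.

```cuda
#include <cuda_runtime.h>

__global__ void seq_expand_kernel(const float* x,        // input rows
                                  float* out,            // expanded output
                                  const size_t* x_lod,   // source offsets
                                  const size_t* ref_lod, // expanded offsets
                                  int lod_size,          // #sequences + 1
                                  int width) {           // elements per row
  // Grid-stride over source sequences: one block handles one sequence.
  for (int seq = blockIdx.x; seq < lod_size - 1; seq += gridDim.x) {
    int x_start = static_cast<int>(x_lod[seq]);
    int x_len = static_cast<int>(x_lod[seq + 1]) - x_start;
    int out_start = static_cast<int>(ref_lod[seq]);
    int out_len = static_cast<int>(ref_lod[seq + 1]) - out_start;
    int seq_elems = x_len * width;
    // Every output element re-reads its source element from global
    // memory; this repeated global traffic is the cost noted above.
    for (int i = threadIdx.x; i < out_len * width; i += blockDim.x) {
      out[out_start * width + i] = x[x_start * width + i % seq_elems];
    }
  }
}
```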
**Before optimization:**

| Event | Calls | Total (ms) | Min. (ms) | Max. (ms) | Ave. (ms) |
| --- | --- | --- | --- | --- | --- |
| thread0::sum | 40856 | 3485.82 | 0.012192 | 89.6786 | 0.0853197 |
| thread0::mul_grad | 14556 | 2180.88 | 0.047808 | 3.56227 | 0.149827 |
| thread0::sequence_softmax_grad | 1316 | 1359.13 | 0.052192 | 39.9963 | 1.03278 |
| thread0::mul | 14556 | 1185.32 | 0.037536 | 2.49907 | 0.0814315 |
| thread0::sequence_softmax | 1316 | 1111.14 | 0.046752 | 7.49488 | 0.844328 |
| thread0::sequence_pool | 1336 | 927.845 | 0.043168 | 89.3362 | 0.694495 |
| thread0::sequence_expand_grad | 1316 | 783.738 | 0.046336 | 4.13078 | 0.595545 |
| thread0::sequence_pool_grad | 1336 | 566.56 | 0.03648 | 1.31149 | 0.424072 |
| thread0::sequence_expand | 1316 | 538.142 | 0.029536 | 18.5873 | 0.408922 |
| thread0::elementwise_add_grad | 6580 | 482.06 | 0.0304 | 0.765728 | 0.0732613 |
**After optimization:**

| Event | Calls | Total (ms) | Min. (ms) | Max. (ms) | Ave. (ms) |
| --- | --- | --- | --- | --- | --- |
| thread0::sum | 40112 | 3388.82 | 0.012928 | 121.016 | 0.084484 |
| thread0::mul_grad | 14292 | 2073.97 | 0.042816 | 1.68877 | 0.145114 |
| thread0::sequence_softmax_grad | 1292 | 1400.66 | 0.050528 | 5.28992 | 1.0841 |
| thread0::sequence_softmax | 1292 | 1177.63 | 0.045984 | 2.67725 | 0.91148 |
| thread0::mul | 14292 | 1105.29 | 0.033536 | 2.02675 | 0.0773365 |
| thread0::sequence_pool | 1312 | 856.938 | 0.039168 | 1.75539 | 0.653154 |
| thread0::sequence_pool_grad | 1312 | 617.013 | 0.03408 | 2.27245 | 0.470284 |
| thread0::elementwise_add_grad | 6460 | 452.989 | 0.025408 | 0.492832 | 0.0701221 |
| thread0::sequence_expand_grad | 1292 | 428.101 | 0.140192 | 12.6613 | 0.331348 |
| thread0::sequence_expand | 1292 | 354.059 | 0.139392 | 0.813536 | 0.274039 |
| thread0::elementwise_add | 6460 | 307.216 | 0.027424 | 29.2352 | 0.0475566 |
| thread0::lstm_grad | 40 | 304.187 | 6.11402 | 8.52278 | 7.60468 |
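The shared-memory enhancement mentioned above could look like the following sketch, under the same assumed names as the kernel before the profiles: each block stages its source sequence in shared memory once, so all repeated copies are served from on-chip memory instead of re-reading global memory.

```cuda
#include <cuda_runtime.h>

__global__ void seq_expand_smem_kernel(const float* x, float* out,
                                       const size_t* x_lod,
                                       const size_t* ref_lod,
                                       int lod_size, int width) {
  extern __shared__ float tile[];  // sized to hold one source sequence
  for (int seq = blockIdx.x; seq < lod_size - 1; seq += gridDim.x) {
    int x_start = static_cast<int>(x_lod[seq]);
    int x_len = static_cast<int>(x_lod[seq + 1]) - x_start;
    int out_start = static_cast<int>(ref_lod[seq]);
    int out_len = static_cast<int>(ref_lod[seq + 1]) - out_start;
    int seq_elems = x_len * width;
    // One global read per source element.
    for (int i = threadIdx.x; i < seq_elems; i += blockDim.x) {
      tile[i] = x[x_start * width + i];
    }
    __syncthreads();
    // All repeated copies come from shared memory.
    for (int i = threadIdx.x; i < out_len * width; i += blockDim.x) {
      out[out_start * width + i] = tile[i % seq_elems];
    }
    __syncthreads();  // tile is reused on the next grid-stride iteration
  }
}
```

A launch would need the dynamic shared-memory size set to the longest source sequence, e.g. `seq_expand_smem_kernel<<<grid, threads, max_seq_elems * sizeof(float)>>>(...)`, which also caps the sequence length this sketch can handle.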