Created by: dzhwinter
Fixes https://github.com/PaddlePaddle/Paddle/issues/9220. For a single batch, the average time of the sequence_expand op drops from 0.595545 ms to 0.331348 ms, roughly a 1.8x speedup. Note that the CUDA kernel still performs many global memory accesses; staging data in shared memory should give a further improvement (a sketch of the current scheme follows, and a shared-memory variant comes after the profiles).
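For context, the op copies each source sequence into the output as many times as the reference LoD dictates. Below is a minimal sketch of that parallelization scheme, not the exact Paddle kernel: the names `seq_expand_kernel`, `x_lod`, and `ref_lod` are illustrative assumptions, with one thread block per source sequence and threads striding over the elements of all repeated copies.

```cuda
#include <cuda_runtime.h>

__global__ void seq_expand_kernel(const float* x,        // input rows
                                  float* out,            // expanded output
                                  const size_t* x_lod,   // source offsets
                                  const size_t* ref_lod, // expanded offsets
                                  int lod_size,          // #sequences + 1
                                  int width) {           // elements per row
  // Grid-stride over source sequences: one block handles one sequence.
  for (int seq = blockIdx.x; seq < lod_size - 1; seq += gridDim.x) {
    int x_start = static_cast<int>(x_lod[seq]);
    int x_len = static_cast<int>(x_lod[seq + 1]) - x_start;
    int out_start = static_cast<int>(ref_lod[seq]);
    int out_len = static_cast<int>(ref_lod[seq + 1]) - out_start;
    int seq_elems = x_len * width;
    // Every output element re-reads its source element from global
    // memory; this repeated global traffic is the cost noted above.
    for (int i = threadIdx.x; i < out_len * width; i += blockDim.x) {
      out[out_start * width + i] = x[x_start * width + i % seq_elems];
    }
  }
}
```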
**Before optimization:**

| Event | Calls | Total (ms) | Min. (ms) | Max. (ms) | Ave. (ms) |
| --- | --- | --- | --- | --- | --- |
| thread0::sum | 40856 | 3485.82 | 0.012192 | 89.6786 | 0.0853197 |
| thread0::mul_grad | 14556 | 2180.88 | 0.047808 | 3.56227 | 0.149827 |
| thread0::sequence_softmax_grad | 1316 | 1359.13 | 0.052192 | 39.9963 | 1.03278 |
| thread0::mul | 14556 | 1185.32 | 0.037536 | 2.49907 | 0.0814315 |
| thread0::sequence_softmax | 1316 | 1111.14 | 0.046752 | 7.49488 | 0.844328 |
| thread0::sequence_pool | 1336 | 927.845 | 0.043168 | 89.3362 | 0.694495 |
| thread0::sequence_expand_grad | 1316 | 783.738 | 0.046336 | 4.13078 | 0.595545 |
| thread0::sequence_pool_grad | 1336 | 566.56 | 0.03648 | 1.31149 | 0.424072 |
| thread0::sequence_expand | 1316 | 538.142 | 0.029536 | 18.5873 | 0.408922 |
| thread0::elementwise_add_grad | 6580 | 482.06 | 0.0304 | 0.765728 | 0.0732613 |
**After optimization:**

| Event | Calls | Total (ms) | Min. (ms) | Max. (ms) | Ave. (ms) |
| --- | --- | --- | --- | --- | --- |
| thread0::sum | 40112 | 3388.82 | 0.012928 | 121.016 | 0.084484 |
| thread0::mul_grad | 14292 | 2073.97 | 0.042816 | 1.68877 | 0.145114 |
| thread0::sequence_softmax_grad | 1292 | 1400.66 | 0.050528 | 5.28992 | 1.0841 |
| thread0::sequence_softmax | 1292 | 1177.63 | 0.045984 | 2.67725 | 0.91148 |
| thread0::mul | 14292 | 1105.29 | 0.033536 | 2.02675 | 0.0773365 |
| thread0::sequence_pool | 1312 | 856.938 | 0.039168 | 1.75539 | 0.653154 |
| thread0::sequence_pool_grad | 1312 | 617.013 | 0.03408 | 2.27245 | 0.470284 |
| thread0::elementwise_add_grad | 6460 | 452.989 | 0.025408 | 0.492832 | 0.0701221 |
| thread0::sequence_expand_grad | 1292 | 428.101 | 0.140192 | 12.6613 | 0.331348 |
| thread0::sequence_expand | 1292 | 354.059 | 0.139392 | 0.813536 | 0.274039 |
| thread0::elementwise_add | 6460 | 307.216 | 0.027424 | 29.2352 | 0.0475566 |
| thread0::lstm_grad | 40 | 304.187 | 6.11402 | 8.52278 | 7.60468 |
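The shared-memory enhancement mentioned above could look like the following sketch, under the same assumed names as the kernel before the profiles: each block stages its source sequence in shared memory once, so all repeated copies are served from on-chip memory instead of re-reading global memory.

```cuda
#include <cuda_runtime.h>

__global__ void seq_expand_smem_kernel(const float* x, float* out,
                                       const size_t* x_lod,
                                       const size_t* ref_lod,
                                       int lod_size, int width) {
  extern __shared__ float tile[];  // sized to hold one source sequence
  for (int seq = blockIdx.x; seq < lod_size - 1; seq += gridDim.x) {
    int x_start = static_cast<int>(x_lod[seq]);
    int x_len = static_cast<int>(x_lod[seq + 1]) - x_start;
    int out_start = static_cast<int>(ref_lod[seq]);
    int out_len = static_cast<int>(ref_lod[seq + 1]) - out_start;
    int seq_elems = x_len * width;
    // One global read per source element.
    for (int i = threadIdx.x; i < seq_elems; i += blockDim.x) {
      tile[i] = x[x_start * width + i];
    }
    __syncthreads();
    // All repeated copies come from shared memory.
    for (int i = threadIdx.x; i < out_len * width; i += blockDim.x) {
      out[out_start * width + i] = tile[i % seq_elems];
    }
    __syncthreads();  // tile is reused on the next grid-stride iteration
  }
}
```

A launch would need the dynamic shared-memory size set to the longest source sequence, e.g. `seq_expand_smem_kernel<<<grid, threads, max_seq_elems * sizeof(float)>>>(...)`, which also caps the sequence length this sketch can handle.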