Created by: dzhwinter
Fixes https://github.com/PaddlePaddle/Paddle/issues/9099.
Per minibatch, the sequence_pool and sequence_pool_grad operators each run about 7x faster. For example, the average time per call of sequence_pool dropped from 0.815583 to 0.119373, and that of sequence_pool_grad dropped from 0.579614 to 0.0830757.
Before optimization:
Event                            Calls   Total     Min.       Max.      Ave.
thread0::sum                     72772   6579.74   0.013088   3.43046   0.0904158
thread0::mul_grad                25928   4135.29   0.049952   4.4888    0.159491
thread0::sequence_softmax_grad   2344    3067.88   0.05872    95.0493   1.30882
thread0::sequence_softmax        2344    2617.72   0.04976    17.122    1.11677
thread0::mul                     25928   2260.75   0.038624   8.36944   0.0871933
thread0::sequence_pool           2380    1941.09   0.045984   89.9217   0.815583
thread0::sequence_expand_grad    2344    1730.34   0.05296    8.10054   0.738201
thread0::sequence_pool_grad      2380    1379.48   0.03824    137.793   0.579614
After optimization:
Event                            Calls   Total     Min.       Max.       Ave.
thread0::sigmoid_grad            7035    304.461   0.024032   89.5161    0.0432781
thread0::sequence_pool           2381    284.226   0.053984   56.5243    0.119373
thread0::sigmoid                 7035    214.732   0.02448    3.57146    0.0305233
thread0::tanh                    7071    214.441   0.023712   1.59734    0.0303269
thread0::tanh_grad               7071    206.762   0.023328   0.1432     0.0292408
thread0::sequence_pool_grad      2381    197.803   0.057408   0.934464   0.0830757
thread0::adam                    936     187.986   0.024384   1.28653    0.20084
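
As a quick sanity check on the headline figure, the speedup can be recomputed from the Ave. values in the two profiler listings above. The Python sketch below is not part of this change. The dictionary and function names are illustrative, and the numbers are copied verbatim from the listings.

```python
# Average time per call (the "Ave." column) copied from the profiler output.
# "before" and "after" are illustrative names, not Paddle APIs.
before = {"sequence_pool": 0.815583, "sequence_pool_grad": 0.579614}
after = {"sequence_pool": 0.119373, "sequence_pool_grad": 0.0830757}


def speedup(op):
    """Return the ratio of average per-call time before vs. after the change."""
    return before[op] / after[op]


for op in before:
    print(f"{op}: {speedup(op):.2f}x")
# sequence_pool: 6.83x
# sequence_pool_grad: 6.98x
```

The call counts are nearly identical in the two runs (2380 vs. 2381), so the ratio of the Total column gives essentially the same result.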