Created by: dzhwinter
Partly fixes https://github.com/PaddlePaddle/Paddle/issues/8567.
Speed (words/s): 2238.753745 -> 2927.114484, roughly a 31% improvement.
-------------------------> Profiling Report <-------------------------
Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave.
thread0::while 47 20700.4 287.48 638.011 440.435
thread0::concat 5989 9223.49 0.048768 13.0808 1.54007
thread0::sequence_softmax 2971 2737.77 0.041792 8.53923 0.921498
thread0::mul 32869 2614.54 0.028864 74.3539 0.0795443
thread0::sequence_pool 3018 1135.52 0.031136 5.23907 0.37625
thread0::array_to_lod_tensor 47 988.262 11.4716 26.9624 21.0268
thread0::lod_tensor_to_array 47 980.838 11.2301 27.4485 20.8689
thread0::sequence_expand 2971 672.922 0.028352 4.32195 0.226497
thread0::softmax 2971 593.552 0.042048 4.48938 0.199782
thread0::sum 14855 554.084 0.027488 4.52547 0.0372995
thread0::elementwise_add 14855 520.216 0.02112 4.62784 0.0350196
thread0::shrink_rnn_memory 11884 504.603 0.004416 4.59072 0.0424607
thread0::elementwise_mul 11884 475.482 0.020384 4.70397 0.0400103
thread0::lstm 94 406.33 3.02909 4.8807 4.32266
thread0::write_to_array 9007 246.557 0.003616 5.08422 0.0273739
thread0::tanh 8960 215.664 0.01648 5.01373 0.0240696
thread0::sigmoid 8913 198.42 0.016128 4.36742 0.0222619
thread0::reorder_lod_tensor_by_rank 141 188.17 0.465824 1.82832 1.33454
thread0::read_from_array 8913 169.311 0.01168 4.77021 0.018996
thread0::reshape 2971 105.058 0.026304 4.25789 0.035361
thread0::less_than 3018 37.8448 0.00176 4.4079 0.0125397
thread0::lookup_table 94 11.661 0.06768 0.157824 0.124053
thread0::increment 2971 10.7855 0.00176 0.027904 0.00363027
thread0::mean 47 4.93552 0.063264 0.125184 0.105011
thread0::lod_rank_table 47 2.82765 0.036128 0.07024 0.0601627
thread0::cross_entropy 47 1.72854 0.031968 0.049632 0.0367775
thread0::fill_constant_batch_size_like 47 1.16784 0.021408 0.031424 0.0248477
thread0::feed 141 1.10317 0.005056 0.015488 0.00782389
thread0::fetch 47 1.01466 0.019008 0.0312 0.0215884
thread0::fill_constant 94 0.72048 0.005568 0.013216 0.00766468
thread0::max_sequence_len 47 0.326048 0.005216 0.012416 0.00693719
pass_id=0, test_loss: 6.738341, words/s: 2927.114484, sec/pass: 237.334072
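
To sanity-check the numbers above, here is a minimal sketch that recomputes the speedup from the two words/s figures and parses a few rows of the report to see how much of the `while` time each op accounts for. It assumes that the `while` total includes the ops run in its sub-block, which the report itself does not state; `parse_row` is a hypothetical helper written for this example, not part of the Paddle profiler.

```python
# Throughput figures quoted in this PR description (words/s).
before, after = 2238.753745, 2927.114484
speedup = after / before
print(f"speedup: {speedup:.3f}x ({(speedup - 1) * 100:.1f}% more words/s)")

# A few rows copied verbatim from the report: name, calls, total, min, max, ave (ms).
rows = """\
thread0::while 47 20700.4 287.48 638.011 440.435
thread0::concat 5989 9223.49 0.048768 13.0808 1.54007
thread0::sequence_softmax 2971 2737.77 0.041792 8.53923 0.921498
thread0::mul 32869 2614.54 0.028864 74.3539 0.0795443
"""


def parse_row(line):
    """Hypothetical helper: split one whitespace-separated report row."""
    name, calls, total, min_ms, max_ms, ave_ms = line.split()
    return {
        "name": name,
        "calls": int(calls),
        "total": float(total),
        "min": float(min_ms),
        "max": float(max_ms),
        "ave": float(ave_ms),
    }


events = [parse_row(line) for line in rows.splitlines()]

# Assumption: ops listed after `while` ran inside it, so their totals
# can be expressed as a fraction of the `while` total.
while_total = events[0]["total"]
shares = {e["name"]: e["total"] / while_total for e in events[1:]}
for name, share in shares.items():
    print(f"{name}: {share:.1%} of while time")
```

Under that assumption, `concat` alone accounts for a bit under half of the time spent in `while`, which makes it the most obvious remaining target.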