Created by: dzhwinter
Fixes https://github.com/PaddlePaddle/Paddle/issues/8594. This PR optimizes four operators: sequence_softmax, sequence_softmax_grad, softmax, and softmax_grad. Taking the sequence_softmax op as an example and comparing with the previous implementation, the sequence_softmax operator's time cost is now lower than the mul operator's: the average time per minibatch drops from 1.94211 to 0.981581 (roughly a 2x speedup).
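For context, here is a minimal NumPy sketch of the op's semantics (softmax applied independently to each sequence of a LoD tensor). This is an illustration only, not the optimized kernel this PR introduces; the function and argument names are hypothetical.

```python
import numpy as np

def sequence_softmax(x, lod):
    """x: 1-D array of scores; lod: offsets delimiting each sequence,
    e.g. lod=[0, 3, 7] means two sequences x[0:3] and x[3:7]."""
    out = np.empty_like(x)
    for start, end in zip(lod[:-1], lod[1:]):
        seq = x[start:end]
        e = np.exp(seq - seq.max())   # subtract max for numerical stability
        out[start:end] = e / e.sum()  # normalize within this sequence only
    return out

# Each sequence sums to 1 independently:
print(sequence_softmax(np.array([1.0, 2.0, 3.0, 4.0, 5.0]), [0, 2, 5]))
```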
Before optimization:

| Event | Calls | Total (ms) | Min (ms) | Max (ms) | Ave (ms) |
|---|---|---|---|---|---|
| thread0::sum | 2080 | 208.066 | 0.031584 | 1.06134 | 0.100032 |
| thread0::sequence_softmax | 67 | 130.121 | 0.078336 | 9.44381 | 1.94211 |
| thread0::mul_grad | 741 | 121.969 | 0.062656 | 1.37158 | 0.164601 |
| thread0::lod_tensor_to_array | 2 | 85.2417 | 30.0434 | 55.1984 | 42.6209 |
After optimization:

| Event | Calls | Total (ms) | Min (ms) | Max (ms) | Ave (ms) |
|---|---|---|---|---|---|
| thread0::sum | 204128 | 17150.6 | 0.011904 | 8.13488 | 0.0840188 |
| thread0::mul_grad | 72729 | 11242.9 | 0.042464 | 6.78758 | 0.154587 |
| thread0::sequence_softmax_grad | 6575 | 6767.53 | 0.039008 | 7.94013 | 1.02928 |
| thread0::sequence_softmax | 6575 | 6453.9 | 0.044608 | 8.09661 | 0.981581 |
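For reference, a hedged sketch of how a profiler table like the above can be produced with the fluid profiler of that era; argument names and the exact fluid API may differ across Paddle versions.

```python
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.profiler as profiler

# Build a tiny graph containing sequence_softmax.
x = fluid.layers.data(name='x', shape=[1], dtype='float32', lod_level=1)
y = fluid.layers.sequence_softmax(input=x)

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

# Two sequences of length 3 packed into one LoD tensor (illustrative data).
data = fluid.create_lod_tensor(
    np.random.rand(6, 1).astype('float32'), [[3, 3]], place)

# On exit the profiler prints a per-op table (Event/Calls/Total/Min/Max/Ave).
with profiler.profiler('GPU', 'ave'):
    for _ in range(10):
        exe.run(fluid.default_main_program(),
                feed={'x': data}, fetch_list=[y])
```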