Created by: dzhwinter
Follow-up work on https://github.com/PaddlePaddle/Paddle/issues/8594. In our current implementation, a LoD describes the Tensor as several ranges, and each range in the LoD launches a CUDA kernel once. However, this implementation is not well optimized, because the CUDA kernel launch overhead is far greater than the CUDA kernel execution time. So I merge these operations into one CUDA kernel to accelerate the sequence_softmax kernel.
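Below is a minimal sketch of the fused approach, not the actual Paddle kernel: a single launch covers all LoD ranges, assigning one thread block per sequence and reading the begin/end offsets from the LoD array on the device. The kernel name and data layout here are hypothetical, chosen only to illustrate the single-launch idea.

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical fused kernel: block b handles x[lod[b] .. lod[b+1]).
// Shared-memory reductions compute the max and the exp-sum per sequence,
// so all sequences are processed in one kernel launch.
__global__ void SequenceSoftmaxKernel(const float* x, float* out,
                                      const int* lod, int num_seqs) {
  extern __shared__ float shm[];
  int seq = blockIdx.x;
  if (seq >= num_seqs) return;
  int begin = lod[seq], end = lod[seq + 1];

  // 1) Per-thread partial max over the sequence, then tree reduction.
  float local_max = -INFINITY;
  for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
    local_max = fmaxf(local_max, x[i]);
  shm[threadIdx.x] = local_max;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
      shm[threadIdx.x] = fmaxf(shm[threadIdx.x], shm[threadIdx.x + s]);
    __syncthreads();
  }
  float seq_max = shm[0];
  __syncthreads();

  // 2) Per-thread partial sum of exp(x - max), then tree reduction.
  float local_sum = 0.f;
  for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
    local_sum += expf(x[i] - seq_max);
  shm[threadIdx.x] = local_sum;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) shm[threadIdx.x] += shm[threadIdx.x + s];
    __syncthreads();
  }
  float seq_sum = shm[0];

  // 3) Normalize each element of the sequence.
  for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
    out[i] = expf(x[i] - seq_max) / seq_sum;
}

int main() {
  // Toy LoD with two sequences: [0, 3) and [3, 7).
  const int lod_h[] = {0, 3, 7};
  const int n = 7, num_seqs = 2;
  float x_h[n] = {1, 2, 3, 0, 1, 2, 3}, out_h[n];

  float *x_d, *out_d; int *lod_d;
  cudaMalloc(&x_d, n * sizeof(float));
  cudaMalloc(&out_d, n * sizeof(float));
  cudaMalloc(&lod_d, sizeof(lod_h));
  cudaMemcpy(x_d, x_h, n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(lod_d, lod_h, sizeof(lod_h), cudaMemcpyHostToDevice);

  const int threads = 128;  // power of two, required by the reductions
  // One launch for all sequences, instead of one launch per LoD range.
  SequenceSoftmaxKernel<<<num_seqs, threads, threads * sizeof(float)>>>(
      x_d, out_d, lod_d, num_seqs);
  cudaMemcpy(out_h, out_d, n * sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; ++i) printf("%f\n", out_h[i]);
  return 0;
}
```

With this shape, launch cost is paid once per batch rather than once per sequence, which is the point of the optimization when sequences are short and launch latency dominates.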