[Paddle-TRT] Fixes #24731, opt for SoftmaxKernelWithEltadd kernel, test=develop !24834
Created by: zlsh80826
PR types
Function optimization
PR changes
OPs
Describe
A few changes that give the SoftmaxKernelWithEltadd kernel a 1.25x speedup:
- Change the blockReduce behavior as described in issue #24731 (closed). Every thread that calls blockReduceXXX now obtains the reduced value directly, so the shared memory broadcast is no longer needed.
- The number of threads launched for SoftmaxKernelWithEltadd is aligned to 32 instead of to a power of 2.
- Vectorization when available.
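
To illustrate the thread-count change, here is a minimal host-side sketch of the two rounding strategies. The names `kWarpSize`, `RoundUpToPowerOfTwo`, and `RoundUpToWarp` are hypothetical and are not taken from the Paddle source. The sketch also assumes one thread per row element, which this PR does not state (the vectorized path would presumably need fewer threads).

```cpp
// Hypothetical helpers for illustration only; not the Paddle implementation.
constexpr int kWarpSize = 32;  // threads per warp on current NVIDIA GPUs

// Round n up to the next power of two (the sizing this PR moves away from).
constexpr int RoundUpToPowerOfTwo(int n) {
  int p = 1;
  while (p < n) p <<= 1;
  return p;
}

// Round n up to the next multiple of the warp size (the sizing this PR describes).
constexpr int RoundUpToWarp(int n) {
  return (n + kWarpSize - 1) / kWarpSize * kWarpSize;
}

// Example: a row of 384 elements (assumed length, one thread per element).
static_assert(RoundUpToPowerOfTwo(384) == 512, "power-of-two sizing");
static_assert(RoundUpToWarp(384) == 384, "warp-aligned sizing");
```

Under these assumptions, a row of 384 elements launches 384 threads instead of 512, so 128 threads that would do no useful work are never launched. When the row length is already a power of 2, both strategies give the same block size.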