Created by: cryoco
PR types
Function optimization
PR changes
Others
Describe
This commit addresses issue PaddlePaddle#25014 and builds on PR PaddlePaddle#25003. It also restores fp16 support in emb_eltwise_layernorm_plugin.
The patch for issue PaddlePaddle#25014 removes unnecessary data allocation and memory copies in enqueue. An experiment shows that this improves the end-to-end performance of ERNIE on an NVIDIA Tesla T4 GPU by 8%.
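
For reviewers unfamiliar with the pattern, here is a minimal host-only sketch of the general idea of moving allocation and copies out of enqueue. It is not the code in this patch: the class names, `AllocStats`, and `Dot` are hypothetical, `std::vector` stands in for device buffers, and a dot product stands in for the kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical counters, used only to make the difference observable.
struct AllocStats {
  int allocations = 0;
  int copies = 0;
};

// Dot product standing in for the real kernel launch (assumption).
inline float Dot(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  float sum = 0.f;
  for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
  return sum;
}

// Before: every enqueue call allocates a staging buffer and copies into it.
class PerCallStagingPlugin {
 public:
  explicit PerCallStagingPlugin(std::vector<float> weights)
      : weights_(std::move(weights)) {}

  float enqueue(const std::vector<float>& input, AllocStats* stats) const {
    std::vector<float> staged(weights_);  // allocation + copy on each call
    ++stats->allocations;
    ++stats->copies;
    return Dot(staged, input);
  }

 private:
  std::vector<float> weights_;
};

// After: the staging buffer is prepared once in initialize and reused.
class CachedStagingPlugin {
 public:
  explicit CachedStagingPlugin(std::vector<float> weights)
      : weights_(std::move(weights)) {}

  void initialize(AllocStats* stats) {
    staged_ = weights_;  // single allocation + copy
    ++stats->allocations;
    ++stats->copies;
    initialized_ = true;
  }

  float enqueue(const std::vector<float>& input) const {
    assert(initialized_);  // enqueue only reads the cached buffer
    return Dot(staged_, input);
  }

 private:
  std::vector<float> weights_;
  std::vector<float> staged_;
  bool initialized_ = false;
};
```

In this sketch the per-call variant performs one allocation and one copy per enqueue, while the cached variant performs exactly one of each regardless of how many times enqueue runs. The 8% figure above comes from the real plugin on a T4, not from this illustration.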