Created by: jeng1220
PR types
Performance optimization and Bug fixes
PR changes
OPs
Describe
This commit addresses issue PaddlePaddle#25014 and builds on PR PaddlePaddle#25003. It also restores fp16 support in emb_eltwise_layernorm_plugin.
The fix for issue PaddlePaddle#25014 removes unnecessary data allocation and memory copies from enqueue. Experiments show that this improves the end-to-end performance of ERNIE on an NVIDIA Tesla T4 GPU by 8%.
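
For reviewers, the general idea of moving allocation and copies out of enqueue can be sketched as below. This is only an illustration, not the code in this PR: `NaivePlugin`, `CachedPlugin`, `dot`, and `copies()` are hypothetical names, and `std::vector` stands in for device memory so the sketch builds without CUDA or TensorRT.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: dot product standing in for the real kernel work.
inline float dot(const std::vector<float>& a, const std::vector<float>& b) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) sum += a[i] * b[i];
  return sum;
}

// Hypothetical plugin that allocates and copies its weights on every call.
class NaivePlugin {
 public:
  explicit NaivePlugin(std::vector<float> host_weights)
      : host_weights_(std::move(host_weights)) {}

  float enqueue(const std::vector<float>& input) {
    // Per-call allocation and copy: the kind of overhead removed from enqueue.
    std::vector<float> device_weights(host_weights_);
    ++copies_;
    return dot(input, device_weights);
  }

  int copies() const { return copies_; }

 private:
  std::vector<float> host_weights_;
  int copies_ = 0;
};

// Hypothetical plugin that prepares its weights once, outside the hot path.
class CachedPlugin {
 public:
  explicit CachedPlugin(std::vector<float> host_weights)
      : host_weights_(std::move(host_weights)) {}

  // One-time setup: allocate and copy the weights a single time.
  void initialize() {
    device_weights_ = host_weights_;
    ++copies_;
    initialized_ = true;
  }

  float enqueue(const std::vector<float>& input) const {
    // No allocation or copy here; the cached buffer is reused.
    assert(initialized_);
    return dot(input, device_weights_);
  }

  int copies() const { return copies_; }

 private:
  std::vector<float> host_weights_;
  std::vector<float> device_weights_;
  int copies_ = 0;
  bool initialized_ = false;
};
```

In this sketch, calling `enqueue` N times costs N copies for `NaivePlugin` but a single copy for `CachedPlugin`, while both return the same result.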