Refine stack op to improve xlnet performance, test=develop (#22142)
stack's wait cost a lot of cpu time, use cuda kernel to do memory copy
will reduce cpu time.
Signed-off-by: Nzhaoyuchen <zhaoyuchen01@baidu.com>
Showing
想要评论请 注册 或 登录