Unify the gpu implementation of stack and unstack to reuse the optimization. (#49748)
* Unify the GPU implementation of stack and unstack to reuse the optimization.
* Optimize the CUDA implementation of unstack.
* Use GpuMemcpyAsync instead of memory::Copy.
* Fix an error in calculating the index.
* Use FastDivMod to further improve the performance of unstack.
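As an illustration of the last item, here is a minimal host-side C++ sketch of the general fast division-and-modulo technique: an integer division by a fixed divisor is replaced with a multiply-high and a shift, which is typically cheaper than a hardware divide inside a GPU kernel. The name `FastDivModSketch` and its members are hypothetical and chosen for this example only; the FastDivMod helper used in this commit may differ in interface and detail.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of division by a fixed divisor using a
// precomputed multiplier and shift (Granlund-Montgomery style).
struct FastDivModSketch {
  uint32_t divisor;
  uint32_t multiplier;
  uint32_t shift;

  explicit FastDivModSketch(uint32_t d) : divisor(d), multiplier(0), shift(0) {
    assert(d != 0);
    // shift = ceil(log2(d)): smallest s such that 2^s >= d.
    while ((uint64_t{1} << shift) < d) ++shift;
    // multiplier = floor(2^32 * (2^shift - d) / d) + 1
    uint64_t one = 1;
    multiplier =
        static_cast<uint32_t>(((one << 32) * ((one << shift) - d)) / d + 1);
  }

  // Quotient n / divisor without a division instruction.
  uint32_t Div(uint32_t n) const {
    // High 32 bits of the 64-bit product (a multiply-high on the GPU).
    uint64_t hi = (static_cast<uint64_t>(n) * multiplier) >> 32;
    // The sum is kept in 64 bits so it cannot overflow for large n.
    return static_cast<uint32_t>((hi + n) >> shift);
  }

  // Remainder n % divisor, derived from the quotient.
  uint32_t Mod(uint32_t n) const { return n - Div(n) * divisor; }
};

// Example use: split a flat element index into (row, col) for a
// row-major layout with `cols` columns, as an index computation might.
inline void SplitIndex(const FastDivModSketch& cols_divmod, uint32_t flat,
                       uint32_t* row, uint32_t* col) {
  *row = cols_divmod.Div(flat);
  *col = flat - *row * cols_divmod.divisor;
}
```

The multiplier and shift are computed once on the host for each divisor and then reused for every element, so the setup cost is amortized across the whole kernel launch.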