Lars op optimization with cudaLaunchCooperativeKernel method (#35652)
* A first attempt at using cudaLaunchCooperativeKernel
* Fix bugs
* Completely replace the lars CUDA kernel
* Fix bugs
* Fix code according to comments
* Fix code according to review comments
* Add some function overloads
* Relocate the power operation