Fix NCCLBcast hang bug in Parallel Executor (#11377)
* 1. Create the buddy allocator in each place before NcclBcast of the variables
  2. Check the memory usage of ALL GPUs rather than only the first one
* 1. Make NCCLGroupGuard guard only the ncclBcast part, which avoids ncclGroupEnd blocking the exception from being thrown
  2. Add a NOTE about the usage of NCCLGroupGuard
* Remove the memory usage check of GPUs
* Fix code style
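
The NCCLGroupGuard change is easier to follow with a small illustration of guard scope. The sketch below does not call NCCL and is not the code from this commit: `MockGroupGuard`, `Prepare`, `BroadcastWideGuard`, and `BroadcastNarrowGuard` are hypothetical names, and the guard is assumed to begin a group in its constructor and end it in its destructor. It only records the order of events, to show that when a failing setup step runs inside the guard, the group-end step executes during stack unwinding before the error is reported, whereas with a narrowed guard the error is reported without touching the group at all.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a group guard. It logs events instead of calling
// ncclGroupStart()/ncclGroupEnd(), so the sketch runs without NCCL or GPUs.
class MockGroupGuard {
 public:
  explicit MockGroupGuard(std::vector<std::string>* log) : log_(log) {
    log_->push_back("group_start");
  }
  // In the real setting this is where the group-end call would sit; if that
  // call blocks, an exception in flight cannot reach its handler.
  ~MockGroupGuard() { log_->push_back("group_end"); }

  MockGroupGuard(const MockGroupGuard&) = delete;
  MockGroupGuard& operator=(const MockGroupGuard&) = delete;

 private:
  std::vector<std::string>* log_;
};

// Hypothetical setup step (for example, per-device allocation) that may throw.
void Prepare(bool fail, std::vector<std::string>* log) {
  log->push_back("prepare");
  if (fail) throw std::runtime_error("prepare failed");
}

// Wide scope: the guard also covers the setup step, so unwinding from a
// failed Prepare() must run the guard's destructor first.
std::vector<std::string> BroadcastWideGuard(bool fail) {
  std::vector<std::string> log;
  try {
    MockGroupGuard guard(&log);
    Prepare(fail, &log);
    log.push_back("bcast");
  } catch (const std::runtime_error&) {
    log.push_back("error_reported");
  }
  return log;
}

// Narrow scope: the guard covers only the broadcast step, so a failed
// Prepare() is reported without any group-end call on the way out.
std::vector<std::string> BroadcastNarrowGuard(bool fail) {
  std::vector<std::string> log;
  try {
    Prepare(fail, &log);
    {
      MockGroupGuard guard(&log);
      log.push_back("bcast");
    }
  } catch (const std::runtime_error&) {
    log.push_back("error_reported");
  }
  return log;
}
```

Calling `BroadcastWideGuard(true)` records a `group_end` event between `prepare` and `error_reported`, while `BroadcastNarrowGuard(true)` records only `prepare` and `error_reported`. On the success path both variants still bracket the broadcast with group start and end.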