    Fix NCCLBcast hang bug in Parallel Executor (#11377) · 046bb5c8
    Qiyang Min committed
    * 1. Create a buddy allocator in each place before broadcasting the variables with ncclBcast
    2. Check the memory usage of all GPUs rather than only the first one
    
    * 1. Make NCCLGroupGuard guard only the ncclBcast part, which prevents ncclGroupEnd from blocking exception throwing
    2. Add a NOTE on the usage of NCCLGroupGuard
    
    * Remove the memory usage check on GPUs
    
    * Fix code style
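
The guard-scoping change described above can be illustrated with a minimal sketch. It is not the code from parallel_executor.cc: `FakeGroupStart`, `FakeGroupEnd`, `FakeBcast`, `PrepareDevice`, `GroupGuard`, and `BroadcastAll` are hypothetical stand-ins that only record calls in a log, so the example runs without NCCL or a GPU. The assumption is that the real guard calls ncclGroupStart on construction and ncclGroupEnd on destruction. The point is that per-device preparation that may throw runs before the guard exists, so an exception never has to unwind through the group-end call.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Call log used in place of real NCCL side effects (hypothetical).
static std::vector<std::string> g_log;

// Hypothetical stand-ins for ncclGroupStart / ncclGroupEnd / ncclBcast.
void FakeGroupStart() { g_log.push_back("group_start"); }
void FakeGroupEnd() { g_log.push_back("group_end"); }
void FakeBcast(int /*dev*/) { g_log.push_back("bcast"); }

// RAII guard: opens the group on construction, closes it on destruction.
class GroupGuard {
 public:
  GroupGuard() { FakeGroupStart(); }
  ~GroupGuard() { FakeGroupEnd(); }
  GroupGuard(const GroupGuard&) = delete;
  GroupGuard& operator=(const GroupGuard&) = delete;
};

// Per-device preparation (e.g. creating an allocator) that may throw.
// failing_dev == -1 means no device fails.
void PrepareDevice(int dev, int failing_dev) {
  if (dev == failing_dev) throw std::runtime_error("prepare failed");
  g_log.push_back("prepare");
}

// Prepares every device first, then guards only the broadcast calls.
std::vector<std::string> BroadcastAll(int num_devices, int failing_dev) {
  g_log.clear();
  try {
    // Throwing work happens outside the guard's scope.
    for (int dev = 0; dev < num_devices; ++dev) {
      PrepareDevice(dev, failing_dev);
    }
    {
      GroupGuard guard;  // covers only the broadcast part
      for (int dev = 0; dev < num_devices; ++dev) {
        FakeBcast(dev);
      }
    }  // group end runs here, after all broadcasts are queued
  } catch (const std::runtime_error&) {
    g_log.push_back("error");
  }
  return g_log;
}
```

With two devices and no failure, the log is prepare, prepare, group_start, bcast, bcast, group_end. If the second device fails during preparation, the log is prepare, error, and the group is never opened.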
parallel_executor.cc 8.4 KB