Zero3: Fix allreduce optimization for extra large tensors (#3832)
Grad tensors that don't fit in the bucket flat buffer are not added to it, but are still added to params_in_ipg_bucket.
If such tensors exist, use reduce_scatter on params_in_ipg_bucket instead of allreduce, since allreduce assumes all grads are in ipg_bucket_flat_buffer.
Add test for reduce_scatter=false
Fix padding to use zeros instead of undefined values
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
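
The fallback described above can be illustrated with a minimal, framework-free sketch. It is not DeepSpeed's implementation: the `IpgBucket` class, the `capacity` field, the `has_unbuffered_grad` flag, and the `pad_with_zeros` helper are hypothetical names used only for illustration, and plain Python lists stand in for tensors. The "does not fit" check shown here (remaining space in the flat buffer) is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IpgBucket:
    """Toy stand-in for the gradient bucket (hypothetical, not DeepSpeed's class)."""
    capacity: int  # number of elements the flat buffer can hold (assumed)
    ipg_bucket_flat_buffer: List[float] = field(default_factory=list)
    params_in_ipg_bucket: List[List[float]] = field(default_factory=list)
    has_unbuffered_grad: bool = False  # hypothetical flag

    def add_grad(self, grad: List[float]) -> None:
        # Every grad is tracked in params_in_ipg_bucket.
        self.params_in_ipg_bucket.append(grad)
        if len(self.ipg_bucket_flat_buffer) + len(grad) <= self.capacity:
            # The grad fits, so it is also copied into the flat buffer.
            self.ipg_bucket_flat_buffer.extend(grad)
        else:
            # The grad does not fit: the flat buffer is now incomplete.
            self.has_unbuffered_grad = True

    def reduction_path(self) -> str:
        # allreduce on the flat buffer is only valid if it holds all grads;
        # otherwise fall back to reduce_scatter over params_in_ipg_bucket.
        return "reduce_scatter" if self.has_unbuffered_grad else "allreduce"


def pad_with_zeros(values: List[float], length: int) -> List[float]:
    """Pad to `length` with explicit zeros rather than uninitialized values."""
    return list(values) + [0.0] * max(0, length - len(values))


small = IpgBucket(capacity=4)
small.add_grad([1.0, 2.0])

large = IpgBucket(capacity=4)
large.add_grad([1.0, 2.0])
large.add_grad([0.5] * 10)  # extra large grad, skipped by the flat buffer
```

In this sketch `small.reduction_path()` selects the allreduce path, while `large.reduction_path()` selects reduce_scatter because one grad is tracked in `params_in_ipg_bucket` but absent from the flat buffer.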