Fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2 (#2999)
* try to fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2
* fix format error
* fix format issue
* add TODO for integrated testing of TP and ZeRO 1/2/3
* fix default pg error
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>