zero3 performance optimizations (#3622)
* Remove dead code
params_already_reduced is not used
* Prevent evaluation of debug strings
Debug strings are evaluated even when logging is disabled
* Use contiguous gradients tensor reduce scatter between ranks
Use allreduce instead of reduce scatter. This lowers CPU overhead.
* Move overflow tracker to optimizer.step
Don't check gradients for overflow in every bucket.
Do the overflow check once on the flat gradient buffer just before the optimizer step.
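
For illustration, the debug-string item above can be sketched with the standard library logger. This is a minimal sketch, not the code changed in this commit: the logger name, the helper expensive_debug_string, and the params dict are all hypothetical. It shows that an eagerly built argument is evaluated even when DEBUG is disabled, while a guarded call skips the work.

```python
import logging

logger = logging.getLogger("zero3_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)               # DEBUG records are disabled

format_calls = 0  # counts how often the debug string is actually built


def expensive_debug_string(params):
    """Hypothetical stand-in for a costly debug summary."""
    global format_calls
    format_calls += 1
    return ", ".join(f"{name}={value}" for name, value in params.items())


def log_eager(params):
    # The argument is evaluated before logger.debug() drops the record.
    logger.debug(expensive_debug_string(params))


def log_guarded(params):
    # The string is built only if DEBUG records would be handled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(expensive_debug_string(params))


params = {"numel": 1024, "rank": 0}  # hypothetical values

log_eager(params)
eager_calls = format_calls

log_guarded(params)
guarded_calls = format_calls - eager_calls
```

With DEBUG disabled, the eager path still builds the string once and the guarded path never does.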
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>