Committed by hablb
* Remove dead code

  params_already_reduced is not used.

* Prevent evaluation of debug strings

  Debug strings are evaluated even when logging is disabled.

* Use contiguous gradients tensor reduce scatter between ranks

  Use allreduce instead of reduce scatter. This lowers CPU overhead.

* Move overflow tracker to optimizer.step

  Don't check for overflow in gradients for every bucket. Do the overflow check once on the grad flat buffer just before the optimizer step.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
0977106a
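
The "Prevent evaluation of debug strings" item can be illustrated with a minimal sketch using Python's standard `logging` module. This is not the code from this commit; the logger name `zero_sketch`, the `ExpensiveSummary` class, and the three `log_*` helpers are hypothetical and exist only to show why an eagerly built message costs time even when DEBUG is disabled.

```python
import logging

# Hypothetical logger; DEBUG records are disabled at INFO level.
logger = logging.getLogger("zero_sketch")
logger.setLevel(logging.INFO)


class ExpensiveSummary:
    """Hypothetical stand-in for a costly debug string (e.g. tensor stats)."""

    def __init__(self):
        self.calls = 0

    def __str__(self):
        # Count how many times the string is actually built.
        self.calls += 1
        return "summary"


def log_eager(summary):
    # The f-string is formatted before logger.debug() is called,
    # so the cost is paid even though the record is dropped.
    logger.debug(f"bucket state: {summary}")


def log_lazy(summary):
    # %-style arguments are formatted only if the record is emitted.
    logger.debug("bucket state: %s", summary)


def log_guarded(summary):
    # An explicit level check skips building the message entirely.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(f"bucket state: {summary}")
```

With DEBUG disabled, only `log_eager` invokes `__str__`; the lazy and guarded variants leave the counter untouched.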