    zero3 performance optimizations (#3622) · 0977106a
    Committed by hablb
    * Remove dead code
    
    params_already_reduced is not used
    
    * Prevent evaluation of debug strings
    
    Debug strings are evaluated even when logging is disabled
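
A minimal sketch of the problem described above, using Python's standard `logging` module rather than DeepSpeed's own logging helpers (which this commit message does not show). `ExpensiveSummary` and the logger name are hypothetical stand-ins for a costly debug description. An f-string argument is built before the logger checks its level, while `%s`-style arguments or an explicit `isEnabledFor` guard skip the formatting when debug output is off.

```python
import logging

logger = logging.getLogger("zero3_sketch")  # hypothetical logger name
logger.setLevel(logging.WARNING)            # debug logging is disabled


class ExpensiveSummary:
    """Hypothetical stand-in for a debug description that is costly to build."""

    def __init__(self):
        self.render_count = 0

    def __str__(self):
        self.render_count += 1  # count how often the string is actually built
        return "summary"


summary = ExpensiveSummary()

# Eager: the f-string is evaluated before logger.debug() can check the level.
logger.debug(f"state: {summary}")
eager_renders = summary.render_count

# Lazy: the argument is only formatted if the record is actually emitted.
logger.debug("state: %s", summary)
lazy_renders = summary.render_count - eager_renders

# Guarded: skip building the message entirely when debug is disabled.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug(f"state: {summary}")
guarded_renders = summary.render_count - eager_renders - lazy_renders
```

Only the eager call pays for building the string here; the lazy and guarded forms leave `render_count` unchanged while debug logging is off.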
    
    * Use contiguous gradients tensor reduce scatter between ranks
    
    Use allreduce instead of reduce scatter to lower CPU overhead.
    
    * Move overflow tracker to optimizer.step
    
    Don't check overflow in gradients for every bucket.
    Do the overflow check once on the grad flat buffer just before the optimizer step
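
The idea of checking once per step instead of once per bucket can be sketched in plain Python. This is an illustration under stated assumptions, not the DeepSpeed implementation: the real code operates on tensors, and `copy_buckets_to_flat`, `has_overflow`, and `step` are hypothetical names chosen for this example.

```python
import math


def copy_buckets_to_flat(buckets):
    """Concatenate gradient buckets into one flat buffer, with no per-bucket check."""
    flat = []
    for bucket in buckets:
        flat.extend(bucket)
    return flat


def has_overflow(flat_grads):
    """Return True if any gradient value is inf or NaN."""
    return not all(math.isfinite(g) for g in flat_grads)


def step(flat_grads, params, lr=0.1):
    """Hypothetical optimizer step: check overflow once, skip the update if found."""
    if has_overflow(flat_grads):
        return False  # step skipped, parameters left untouched
    for i, g in enumerate(flat_grads):
        params[i] -= lr * g
    return True


params = [1.0, 1.0, 1.0, 1.0]

# One bucket contains inf, so the single check before the step rejects the update.
bad_flat = copy_buckets_to_flat([[0.5, 1.0], [2.0, float("inf")]])
bad_step_applied = step(bad_flat, params)
params_after_bad = list(params)

# With finite gradients the same single check passes and the update is applied.
good_flat = copy_buckets_to_flat([[0.5, 1.0], [2.0, 3.0]])
good_step_applied = step(good_flat, params)
```

The overflow scan runs once over the flat buffer per step, regardless of how many buckets fed it.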
    
    ---------
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
test_zero.py 51.6 KB