    [Auto Parallel-Performance] Sharding Comm Optimization (#48604) · 5592f8ad
    JZ-LIANG committed
    * remove deps and prior comm
    
    * grad comm fuse
    
    * add deps for amp&global norm
    
    * stage2 broadcast prior deps
    
    * stage2 grad overlap
    
    * stream_analyzer bugfix
    
    * overlap enable
    
    * dep op namescope
    
    * depend support multiple inputs
    
    * check finite deps
    
    * stage2 param comm overlap
    
    * Set kD2HStream
    
    * grad comm hierarchical
    
    * new unittest
    Co-authored-by: chenruibiao <chenruibiao@baidu.com>
utils.py 79.3 KB