[Auto Parallel-Performance] Sharding Comm Optimization (#48604)
* remove deps and prior comm
* grad comm fuse
* add deps for amp&global norm
* stage2 broadcast prior deps
* stage2 grad overlap
* stream_analyzer bugfix
* overlap enable
* dep op namescope
* depend support multiple inputs
* check finite deps
* stage2 param comm overlap
* Set kD2HStream
* grad comm hierarchical
* grad comm hierarchical
* new unittest
Co-authored-by: Nchenruibiao <chenruibiao@baidu.com>