Optimizing Network Performance for Distributed DNN Training on GPU Clusters
Created by: gongweibao
-
AllReduce for SelectedRows
- without CSC
- with CSC
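A minimal sketch of what an allreduce over sparse gradients has to compute, assuming SelectedRows means a gradient stored as a list of row indices plus one value vector per row. The function name allreduce_selected_rows and the sample data are hypothetical, and the loop over workers only stands in for a real allgather; CSC is not modeled here.

```python
def allreduce_selected_rows(worker_grads):
    """Sum sparse row gradients coming from all workers.

    worker_grads: one (rows, values) pair per worker, where rows is a list
    of row indices and values is the matching list of row vectors.
    Returns (rows, values) with duplicate rows summed and rows sorted.
    """
    merged = {}
    for rows, values in worker_grads:  # stands in for an allgather
        for row, vec in zip(rows, values):
            if row in merged:
                # The same row may appear on several workers, or twice on one.
                merged[row] = [a + b for a, b in zip(merged[row], vec)]
            else:
                merged[row] = list(vec)
    out_rows = sorted(merged)
    return out_rows, [merged[r] for r in out_rows]


# Hypothetical gradients from two workers (row width 2).
w0 = ([0, 3, 3], [[1.0, 1.0], [2.0, 2.0], [0.5, 0.5]])
w1 = ([3, 7], [[1.0, 0.0], [4.0, 4.0]])
rows, values = allreduce_selected_rows([w0, w1])
print(rows, values)
```

Every worker ends up with the same merged rows, so the result is equivalent to a dense allreduce restricted to the rows that were actually touched.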
-
Optimizing Network Performance for Distributed DNN Training on GPU Clusters
- Document the system architecture and measure its performance.
- Analyze the operator time and communication time.
-
Mixed precision.
- On BERT.
- On ResNet-50 with the ImageNet dataset.
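One reason mixed precision needs care is that small gradients underflow in half precision. The sketch below illustrates this with loss scaling, which is assumed here as the remedy and is not stated in the plan above. The helper to_fp16 and the LOSS_SCALE value are hypothetical; half-precision rounding is emulated with the standard struct module.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


LOSS_SCALE = 1024.0  # hypothetical scale factor, chosen for illustration

grad = 1e-8  # a gradient value too small for half precision

# Without scaling, the gradient is rounded to zero in fp16.
naive = to_fp16(grad)

# With scaling, the gradient survives the fp16 round trip and is
# unscaled afterwards in higher precision.
scaled = to_fp16(grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE

print(naive, recovered)
```

The same check can be run on gradient values sampled from the BERT and ResNet-50 runs to see how much of the distribution would be lost without scaling.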
-
Dynamic (static) LA (lazy allreduce) overlap
- Fuse allreduce tensors and analyze the performance.
- Implement hierarchical allreduce.
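The two items above can be sketched together in a single-process simulation: gradients are fused into one flat buffer per GPU, reduced inside each node, reduced across nodes, and broadcast back. This is only an illustration under assumed names (fuse, unfuse, hierarchical_allreduce) and a made-up 2-node, 2-GPU layout; a real implementation would use a communication library instead of Python lists.

```python
def fuse(tensors):
    """Concatenate flat tensors into one buffer and remember their sizes."""
    sizes = [len(t) for t in tensors]
    flat = [x for t in tensors for x in t]
    return flat, sizes


def unfuse(flat, sizes):
    """Split a fused buffer back into tensors of the recorded sizes."""
    out, pos = [], 0
    for n in sizes:
        out.append(flat[pos:pos + n])
        pos += n
    return out


def elementwise_sum(buffers):
    """Sum equally sized buffers element by element."""
    return [sum(col) for col in zip(*buffers)]


def hierarchical_allreduce(nodes):
    """nodes[i][j] is the fused buffer held by GPU j on node i."""
    # Step 1: reduce inside each node (fast local links).
    node_sums = [elementwise_sum(gpus) for gpus in nodes]
    # Step 2: allreduce across nodes, one participant per node.
    total = elementwise_sum(node_sums)
    # Step 3: broadcast the result back to every GPU in every node.
    return [[list(total) for _ in gpus] for gpus in nodes]


# Hypothetical layout: 2 nodes x 2 GPUs, two gradient tensors per GPU.
grads = [
    [[[1.0, 2.0], [3.0]], [[1.0, 1.0], [1.0]]],
    [[[0.5, 0.5], [0.5]], [[2.0, 0.0], [4.0]]],
]
sizes = fuse(grads[0][0])[1]
fused = [[fuse(g)[0] for g in node] for node in grads]
reduced = hierarchical_allreduce(fused)
result = unfuse(reduced[0][0], sizes)
print(result)
```

Fusing first means one collective call per buffer instead of one per tensor, and the hierarchy keeps the slower inter-node step down to one participant per node.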
-
CSC communication
- ResNet
- BERT
- Pserver sync from step to var