Created by: cjld
This PR adds topological sorting to the all-reduce pass.
Previously, the run order was determined by a dictionary-order (lexicographic) sort in the optimizer, which produced the wrong order and resulted in poor performance.
This PR improves 8-GPU performance by 5%.
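
To illustrate the difference between the two orderings, here is a minimal sketch in Python. It is not the code from this PR: the function `topo_sort`, the gradient names, and the dependency map are all hypothetical, and it uses Kahn's algorithm as one common way to do a topological sort. It only shows how a dictionary-order sort can disagree with the dependency order.

```python
from collections import defaultdict
import heapq


def topo_sort(nodes, deps):
    """Order nodes so that each one comes after everything it depends on.

    `deps` maps a node to the nodes that must run before it.
    Ties are broken by name so the result is deterministic.
    (Hypothetical helper for illustration, not the PR's implementation.)
    """
    indegree = {n: 0 for n in nodes}
    users = defaultdict(list)
    for node, preds in deps.items():
        for pred in preds:
            users[pred].append(node)
            indegree[node] += 1

    # Start with nodes that have no unmet dependencies.
    ready = [n for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for user in users[node]:
            indegree[user] -= 1
            if indegree[user] == 0:
                heapq.heappush(ready, user)

    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical gradients: in the backward pass, later layers finish first.
grads = ["fc_2.w@GRAD", "fc_9.w@GRAD", "fc_10.w@GRAD"]
deps = {
    "fc_9.w@GRAD": ["fc_10.w@GRAD"],  # fc_9's gradient is ready after fc_10's
    "fc_2.w@GRAD": ["fc_9.w@GRAD"],   # fc_2's gradient is ready after fc_9's
}

dictionary_order = sorted(grads)
topological_order = topo_sort(grads, deps)
print(dictionary_order)
print(topological_order)
```

In this made-up example the dictionary-order sort places `fc_2.w@GRAD` before `fc_9.w@GRAD`, even though `fc_2.w@GRAD` is the last gradient to become available, while the topological order follows the dependencies.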