The problem of improving the performance of Parallel_Do
Created by: tonyyang-svail
The current design uses parallel_do to describe multi-GPU training.
To achieve good performance, we need to maintain a copy of the parameters on each GPU. This includes the parameter itself, the gradient aggregated across all devices, and the momentum/velocity if the optimizer needs them.
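For context, a multi-GPU training program under this design looks roughly like the sketch below. This is a minimal sketch written against the fluid API of that period; the exact names (`fluid.layers.get_places`, `fluid.layers.ParallelDo`, `pd.read_input`, `pd.write_output`) are recalled from memory and should be treated as assumptions rather than the definitive interface.

```python
import paddle.fluid as fluid

x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')

# The parallel_do block runs once per place (GPU); parameters created
# inside it (the fc weights below) get one copy per device.
places = fluid.layers.get_places(device_count=4)
pd = fluid.layers.ParallelDo(places)
with pd.do():
    x_ = pd.read_input(x)
    y_ = pd.read_input(y)
    y_predict = fluid.layers.fc(input=x_, size=1)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y_)
    pd.write_output(cost)

avg_cost = fluid.layers.mean(pd())
fluid.optimizer.SGD(learning_rate=0.01).minimize(avg_cost)
```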
This requires compatibility support from the initializer, the backward pass, and the optimizer:
- LayerHelper (@reyoung) needs to be aware of which parameters are created inside a parallel_do, and put their initializers in a new parallel_do block.
- Backward (@JiayiFeng) needs to support inserting AllReduce ops into the ProgramDesc when a parameter is created inside a parallel_do.
- Optimizer (@jacquesqiao) needs to create a new parallel_do whose block contains the sgd op if the parameter is created inside a parallel_do.
Moreover, support for memory optimization (@QiJune) is also needed. The sketch below illustrates the program structure these components would jointly have to produce.
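To make the scope of these changes concrete, here is a schematic (pseudocode, not runnable code) of the rough ProgramDesc layout that LayerHelper, Backward, and Optimizer would have to generate together; the op names are illustrative only.

```
# startup program -- LayerHelper wraps initializers in a parallel_do
parallel_do(places):
    fill_constant -> W@GPU_i, b@GPU_i       # one parameter copy per device

# main program, forward + backward -- Backward inserts gradient aggregation
parallel_do(places):
    forward ops ...
    backward ops ...                        # W@GRAD computed on each device
    nccl_all_reduce(W@GRAD)                 # sum gradients across devices

# main program, optimize -- Optimizer emits updates inside another parallel_do
parallel_do(places):
    sgd(W, W@GRAD)                          # plus per-device momentum/velocity
```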
This large number of dependencies doesn't look right to me.