The problem of improving the performance of Parallel_Do
Created by: tonyyang-svail
The current design uses parallel_do to describe multi-GPU training.
To achieve good performance, we need to maintain a copy of the parameters on each GPU. This includes the parameter itself, the gradient aggregated across all devices, and the momentum/velocity if the optimizer needs them.
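For context, a multi-GPU training program under this design looks roughly like the sketch below. This is a minimal sketch written against the fluid API of that period; the exact names (`fluid.layers.get_places`, `fluid.layers.ParallelDo`, `pd.read_input`, `pd.write_output`) are recalled from memory and should be treated as assumptions rather than the definitive interface.

```python
import paddle.fluid as fluid

x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')

# The parallel_do block runs once per place (GPU); parameters created
# inside it (the fc weights below) get one copy per device.
places = fluid.layers.get_places(device_count=4)
pd = fluid.layers.ParallelDo(places)
with pd.do():
    x_ = pd.read_input(x)
    y_ = pd.read_input(y)
    y_predict = fluid.layers.fc(input=x_, size=1)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y_)
    pd.write_output(cost)

avg_cost = fluid.layers.mean(pd())
fluid.optimizer.SGD(learning_rate=0.01).minimize(avg_cost)
```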
This requires compatibility support from the initializer, the backward pass, and the optimizer:
- LayerHelper (@reyoung) needs to be aware of which parameters are created inside a parallel_do, and put their initializers in a new parallel_do block.
- Backward (@JiayiFeng) needs to support inserting AllReduce ops into the ProgramDesc when a parameter is created inside a parallel_do.
- Optimizer (@jacquesqiao) needs to create a new parallel_do whose block contains the sgd op if the parameter is created inside a parallel_do.
Moreover, support for memory optimization (@QiJune) is also needed. The sketch below illustrates the program structure these components would jointly have to produce.
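To make the scope of these changes concrete, here is a schematic (pseudocode, not runnable code) of the rough ProgramDesc layout that LayerHelper, Backward, and Optimizer would have to generate together; the op names are illustrative only.

```
# startup program -- LayerHelper wraps initializers in a parallel_do
parallel_do(places):
    fill_constant -> W@GPU_i, b@GPU_i       # one parameter copy per device

# main program, forward + backward -- Backward inserts gradient aggregation
parallel_do(places):
    forward ops ...
    backward ops ...                        # W@GRAD computed on each device
    nccl_all_reduce(W@GRAD)                 # sum gradients across devices

# main program, optimize -- Optimizer emits updates inside another parallel_do
parallel_do(places):
    sgd(W, W@GRAD)                          # plus per-device momentum/velocity
```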
This large number of dependencies doesn't look right to me.