Created by: mapingshuo
PR types
New features
PR changes
APIs
Describe
In distributed training we often run out of GPU or host memory. There are usually three reasons:
- The activations (intermediate layer outputs) exceed the available device memory
- The parameters are too large (e.g. a CPU-side embedding table)
- The input Vars are too large (e.g. video input)

The GradientMerge (gradient accumulation) strategy splits a large batch into several small batches, runs "forward + backward" on each small batch to compute its gradients, accumulates those gradients, and finally applies a single parameter update with the chosen optimizer.
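A minimal, framework-agnostic NumPy sketch of the idea follows. It is not the API added by this PR, just an illustration of splitting one large batch into `k_steps` micro-batches, summing the per-micro-batch gradients, and applying one averaged update:

```python
import numpy as np

np.random.seed(0)
w = np.zeros(4)                      # model parameter (simple linear model)
lr, k_steps = 0.1, 4                 # learning rate, number of micro-batches

x_big = np.random.randn(64, 4)       # one "large batch" of inputs
y_big = x_big @ np.array([1.0, -2.0, 0.5, 3.0])   # targets from true weights

for step in range(200):
    grad_acc = np.zeros_like(w)      # gradient accumulator, reset per large batch
    for x, y in zip(np.array_split(x_big, k_steps),
                    np.array_split(y_big, k_steps)):
        pred = x @ w                              # forward on a small batch
        grad = 2.0 * x.T @ (pred - y) / len(x)    # backward (MSE gradient)
        grad_acc += grad                          # merge (accumulate) the gradient
    w -= lr * grad_acc / k_steps     # one optimizer update with the averaged gradient

print(w)   # converges toward [1.0, -2.0, 0.5, 3.0]
```

Because each micro-batch only holds its own activations in memory, the peak memory cost is that of a small batch while the update behaves like a large-batch step.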