Created by: mapingshuo
PR types
New features
PR changes
APIs
Describe
In distributed training we often run out of GPU or host memory. There are usually three reasons:
- The activations (intermediate layer outputs) exceed the available device memory
- The parameters are too large (e.g. a CPU-side embedding table)
- The input Vars are too large (e.g. video input)

The GradientMerge (gradient accumulation) strategy splits a large batch into several small batches, runs "forward + backward" on each small batch to compute its gradients, accumulates those gradients, and finally applies a single parameter update with the chosen optimizer.
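A minimal, framework-agnostic NumPy sketch of the idea follows. It is not the API added by this PR, just an illustration of splitting one large batch into `k_steps` micro-batches, summing the per-micro-batch gradients, and applying one averaged update:

```python
import numpy as np

np.random.seed(0)
w = np.zeros(4)                      # model parameter (simple linear model)
lr, k_steps = 0.1, 4                 # learning rate, number of micro-batches

x_big = np.random.randn(64, 4)       # one "large batch" of inputs
y_big = x_big @ np.array([1.0, -2.0, 0.5, 3.0])   # targets from true weights

for step in range(200):
    grad_acc = np.zeros_like(w)      # gradient accumulator, reset per large batch
    for x, y in zip(np.array_split(x_big, k_steps),
                    np.array_split(y_big, k_steps)):
        pred = x @ w                              # forward on a small batch
        grad = 2.0 * x.T @ (pred - y) / len(x)    # backward (MSE gradient)
        grad_acc += grad                          # merge (accumulate) the gradient
    w -= lr * grad_acc / k_steps     # one optimizer update with the averaged gradient

print(w)   # converges toward [1.0, -2.0, 0.5, 3.0]
```

Because each micro-batch only holds its own activations in memory, the peak memory cost is that of a small batch while the update behaves like a large-batch step.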