Implement Zero Redundancy Optimizer
Created by: mapingshuo
Zero Redundancy Optimizer (ZeRO) is a training strategy that reduces the memory usage of parameters, gradients, and optimizer states. The key results of ZeRO-DP are:
i) Using P_os and P_g, ZeRO-DP incurs no additional communication while enabling up to an 8x memory reduction.
ii) Using P_p in addition to P_os and P_g, ZeRO-DP incurs at most 1.5x communication while further reducing the memory footprint by N_d times, where N_d is the data-parallel degree (see the memory-accounting sketch below).
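These numbers follow from the paper's accounting of model states under mixed-precision Adam: 2Ψ bytes of fp16 parameters, 2Ψ bytes of fp16 gradients, and KΨ bytes of optimizer states (K = 12 for fp32 master parameters, momentum, and variance). The Python sketch below only illustrates that accounting; it is not part of the Fleet API, and the function name and example figures (7.5B parameters, N_d = 64, taken from the paper's example) are for illustration only.

```python
def zero_dp_memory_bytes(num_params, dp_degree, stage, k=12):
    """Estimate per-device model-state memory for a given ZeRO-DP stage.

    Assumes mixed-precision Adam (K = 12), as in the ZeRO paper's analysis.
    stage 0: baseline data parallelism (full replication)
    stage 1: P_os              (partition optimizer states)
    stage 2: P_os + P_g        (also partition gradients)
    stage 3: P_os + P_g + P_p  (also partition parameters)
    """
    params = 2 * num_params       # fp16 parameters
    grads = 2 * num_params        # fp16 gradients
    opt_states = k * num_params   # fp32 master params + Adam moments

    if stage >= 1:
        opt_states /= dp_degree
    if stage >= 2:
        grads /= dp_degree
    if stage >= 3:
        params /= dp_degree
    return params + grads + opt_states


if __name__ == "__main__":
    psi = 7.5e9   # 7.5B-parameter model (example from the paper)
    nd = 64       # data-parallel degree
    for stage in range(4):
        gb = zero_dp_memory_bytes(psi, nd, stage) / 1e9
        print("stage %d: %.1f GB per device" % (stage, gb))
```

With these assumptions the script prints roughly 120 GB for the baseline, 31.4 GB for P_os, 16.6 GB for P_os + P_g, and 1.9 GB for P_os + P_g + P_p, which matches the up-to-8x and 1/N_d reductions described above.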
We will add this strategy to the Fleet API soon.
Reference
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. arXiv:1910.02054