Implementing DC-ASGD
Created by: yingfeng
DC-ASGD (Delay-Compensated ASGD) is remarkable work from Microsoft that addresses the convergence problem of ASGD by compensating for gradient delay. According to the authors' claims, DC-ASGD runs about as fast as ASGD while the accuracy of the model is not affected.
The major steps of DC-ASGD are as follows:

Worker Side:

- Pull the current weights W_t from the PS server
- Compute the gradient G_m on a local mini-batch
- Push G_m to the PS server

Server Side (the paper's Algorithm 2):

- On a pull request from worker m, back up W_bak(m) = W_t before sending W_t
- On receiving G_m from worker m, apply the delay-compensated update W_t+1 = W_t - η * (G_m + λ * G_m ⊙ G_m ⊙ (W_t - W_bak(m))), where ⊙ is element-wise multiplication
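For concreteness, below is a minimal Python/NumPy sketch of the server side under my reading of the paper. The class and parameter names (`DCASGDServer`, `eta`, `lam`) and the default values are illustrative, not Paddle APIs.

```python
import numpy as np

class DCASGDServer:
    """Minimal sketch of the DC-ASGD parameter server (the paper's Algorithm 2)."""

    def __init__(self, init_weights, eta=0.01, lam=0.04):
        self.weights = np.array(init_weights, dtype=float)  # global W_t
        self.backups = {}   # W_bak(m): one backup copy per worker m
        self.eta = eta      # learning rate
        self.lam = lam      # delay-compensation coefficient lambda

    def pull(self, worker_id):
        # Remember which weights this worker starts from, then hand them out.
        self.backups[worker_id] = self.weights.copy()
        return self.weights.copy()

    def push(self, worker_id, grad):
        # Delay-compensated update:
        #   W_t+1 = W_t - eta * (G_m + lam * G_m (.) G_m (.) (W_t - W_bak(m)))
        w_bak = self.backups[worker_id]
        compensated = grad + self.lam * grad * grad * (self.weights - w_bak)
        self.weights = self.weights - self.eta * compensated
```

Note that `self.backups` holds one full weight copy per worker, which is exactly the memory cost criticized below.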
However, the algorithm above has some shortcomings:
- DC-ASGD requires the PS server to store M backup copies of the weights, where M equals the number of workers. This wastes a large amount of memory.
- The alternative mentioned in the paper, which avoids storing M copies of the weights, is to let each worker push both its gradients and the weights they were computed on to the PS server; however, this wastes bandwidth instead.
I propose implementing an improved version of DC-ASGD without the above shortcomings:
- Pull W_t to the worker
- Compute G_m at the worker side
- Compute W_t+1 at the worker side, using the same update rule as Algorithm 2 above. This is the major difference (a sketch follows this list)
- Push W_t+1 to the PS server
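Below is a minimal sketch of one possible reading of this worker loop. `grad_fn`, `server.pull`, and `server.push_weights` are hypothetical helpers; the choice to reuse the just-pushed W_t+1 as the next backup copy is my assumption, made so that each iteration still needs only one pull and one push, keeping traffic the same as plain ASGD.

```python
def worker_loop(server, worker_id, batches, grad_fn, eta=0.01, lam=0.04):
    """One possible reading of the proposed worker: the compensated
    update from Algorithm 2 runs here instead of on the PS server."""
    w_bak = server.pull(worker_id)      # weights the first gradient uses
    for batch in batches:
        grad = grad_fn(w_bak, batch)    # G_m; other workers may push meanwhile
        w_t = server.pull(worker_id)    # latest global weights at update time
        # Same delay-compensated rule as Algorithm 2, executed on the worker;
        # (w_t - w_bak) measures how far the global weights drifted while
        # this worker was computing its gradient.
        w_next = w_t - eta * (grad + lam * grad * grad * (w_t - w_bak))
        server.push_weights(worker_id, w_next)  # overwrite the global weights
        w_bak = w_next    # assumption: W_t+1 serves as the next backup copy
```

Because the compensation happens on the worker, the PS server no longer needs the per-worker `backups` dict from the earlier sketch.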
This proposal avoids the extra memory cost on the PS server, and the bandwidth usage stays the same. The major issue is that the PS server must hold a lock: each worker pushes its own W_t+1, which overwrites the existing weight matrix.
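As a minimal sketch of that lock, assuming the PS server handles pushes on multiple threads (the class and method names below are illustrative, not Paddle APIs):

```python
import threading

class LockedWeightStore:
    """Sketch of the PS-side weight store for the proposed variant."""

    def __init__(self, init_weights):
        self._weights = init_weights.copy()   # expects a NumPy array
        self._lock = threading.Lock()

    def pull(self, worker_id):
        with self._lock:
            return self._weights.copy()

    def push_weights(self, worker_id, new_weights):
        # Serialize pushes: without the lock, two workers writing W_t+1
        # concurrently could interleave and corrupt the weight matrix.
        # The lock only makes the overwrite atomic; a stale W_t+1 can
        # still clobber a newer one, which is the open question below.
        with self._lock:
            self._weights = new_weights.copy()
```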
Implementing this variant of DC-ASGD in Paddle is easy; however, I am not sure about its correctness and eventual convergence, nor about the effect of the lock on the PS side.