Implementing DC-ASGD
Created by: yingfeng
DC-ASGD (Delay-Compensated ASGD) is remarkable work from Microsoft that addresses the convergence problem of ASGD by compensating for gradient delay. According to the authors' claims, DC-ASGD runs about as fast as ASGD while the accuracy of the model is not affected.
The major steps of DC-ASGD are as follows:

Worker Side:

- Pull the current weights W_t from the PS server
- Compute the gradient G_m on a local mini-batch
- Push G_m to the PS server

Server Side (the paper's Algorithm 2):

- On a pull request from worker m, back up W_bak(m) = W_t before sending W_t
- On receiving G_m from worker m, apply the delay-compensated update W_t+1 = W_t - η * (G_m + λ * G_m ⊙ G_m ⊙ (W_t - W_bak(m))), where ⊙ is element-wise multiplication
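For concreteness, below is a minimal Python/NumPy sketch of the server side under my reading of the paper. The class and parameter names (`DCASGDServer`, `eta`, `lam`) and the default values are illustrative, not Paddle APIs.

```python
import numpy as np

class DCASGDServer:
    """Minimal sketch of the DC-ASGD parameter server (the paper's Algorithm 2)."""

    def __init__(self, init_weights, eta=0.01, lam=0.04):
        self.weights = np.array(init_weights, dtype=float)  # global W_t
        self.backups = {}   # W_bak(m): one backup copy per worker m
        self.eta = eta      # learning rate
        self.lam = lam      # delay-compensation coefficient lambda

    def pull(self, worker_id):
        # Remember which weights this worker starts from, then hand them out.
        self.backups[worker_id] = self.weights.copy()
        return self.weights.copy()

    def push(self, worker_id, grad):
        # Delay-compensated update:
        #   W_t+1 = W_t - eta * (G_m + lam * G_m (.) G_m (.) (W_t - W_bak(m)))
        w_bak = self.backups[worker_id]
        compensated = grad + self.lam * grad * grad * (self.weights - w_bak)
        self.weights = self.weights - self.eta * compensated
```

Note that `self.backups` holds one full weight copy per worker, which is exactly the memory cost criticized below.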
However, the algorithm above has some shortcomings:
- DC-ASGD requires the PS server to store M backup copies of the weights, where M equals the number of workers. This wastes a large amount of memory.
- The alternative mentioned in the paper, which avoids storing M copies of the weights, is to let each worker push both its gradients and the weights they were computed on to the PS server; however, this wastes bandwidth instead.
I propose implementing an improved version of DC-ASGD without the above shortcomings:
- Pull W_t to the worker
- Compute G_m at the worker side
- Compute W_t+1 at the worker side, using the same update rule as Algorithm 2 above. This is the major difference (a sketch follows this list)
- Push W_t+1 to the PS server
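Below is a minimal sketch of one possible reading of this worker loop. `grad_fn`, `server.pull`, and `server.push_weights` are hypothetical helpers; the choice to reuse the just-pushed W_t+1 as the next backup copy is my assumption, made so that each iteration still needs only one pull and one push, keeping traffic the same as plain ASGD.

```python
def worker_loop(server, worker_id, batches, grad_fn, eta=0.01, lam=0.04):
    """One possible reading of the proposed worker: the compensated
    update from Algorithm 2 runs here instead of on the PS server."""
    w_bak = server.pull(worker_id)      # weights the first gradient uses
    for batch in batches:
        grad = grad_fn(w_bak, batch)    # G_m; other workers may push meanwhile
        w_t = server.pull(worker_id)    # latest global weights at update time
        # Same delay-compensated rule as Algorithm 2, executed on the worker;
        # (w_t - w_bak) measures how far the global weights drifted while
        # this worker was computing its gradient.
        w_next = w_t - eta * (grad + lam * grad * grad * (w_t - w_bak))
        server.push_weights(worker_id, w_next)  # overwrite the global weights
        w_bak = w_next    # assumption: W_t+1 serves as the next backup copy
```

Because the compensation happens on the worker, the PS server no longer needs the per-worker `backups` dict from the earlier sketch.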
This proposal avoids the extra memory cost on the PS server, and the bandwidth usage stays the same. The major issue is that the PS server must hold a lock: each worker pushes its own W_t+1, which overwrites the existing weight matrix.
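As a minimal sketch of that lock, assuming the PS server handles pushes on multiple threads (the class and method names below are illustrative, not Paddle APIs):

```python
import threading

class LockedWeightStore:
    """Sketch of the PS-side weight store for the proposed variant."""

    def __init__(self, init_weights):
        self._weights = init_weights.copy()   # expects a NumPy array
        self._lock = threading.Lock()

    def pull(self, worker_id):
        with self._lock:
            return self._weights.copy()

    def push_weights(self, worker_id, new_weights):
        # Serialize pushes: without the lock, two workers writing W_t+1
        # concurrently could interleave and corrupt the weight matrix.
        # The lock only makes the overwrite atomic; a stale W_t+1 can
        # still clobber a newer one, which is the open question below.
        with self._lock:
            self._weights = new_weights.copy()
```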
Implementing this variant of DC-ASGD in Paddle is easy; however, I am not sure about its correctness and eventual convergence, nor about the effect of the lock on the PS side.