Created by: chengduoZH
Computing p - lr.broadcast(grad_dsize) * g in the SGD operator is time-consuming.
The expression p - lr.broadcast(grad_dsize) * g appears in sgd_op.h at the lines linked here:
https://github.com/PaddlePaddle/Paddle/blob/1a3d4b0d3d037aed9cd2999bbedfcbcd7a98c58c/paddle/operators/sgd_op.h#L52-L53
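One possible direction, sketched below, is to read the learning rate once as a scalar and apply it element-wise, so that no tensor of the gradient's size has to be produced by a broadcast. This is only an illustration in plain C++ without Eigen: it assumes the learning rate holds a single value, and `SgdUpdateScalarLr` is a hypothetical helper name, not a function in Paddle. Whether this is actually faster than the Eigen expression would need to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Paddle): computes
//   out[i] = param[i] - lr * grad[i]
// with the learning rate passed as a plain scalar instead of
// being broadcast to the shape of the gradient.
// Assumption: the learning rate is a single value.
std::vector<float> SgdUpdateScalarLr(const std::vector<float>& param,
                                     const std::vector<float>& grad,
                                     float lr) {
  // The parameter and its gradient must have the same number of elements.
  assert(param.size() == grad.size());

  std::vector<float> out(param.size());
  for (std::size_t i = 0; i < param.size(); ++i) {
    out[i] = param[i] - lr * grad[i];
  }
  return out;
}
```

For example, calling `SgdUpdateScalarLr({1.0f, 2.0f, 3.0f}, {0.5f, 0.5f, 1.0f}, 0.5f)` applies one SGD step to three parameters with a learning rate of 0.5.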