Created by: wangxicoding
- Regularization is part of the loss, but its derivative has a fixed closed form, so Paddle normally adds the weight decay term to the gradient after allreduce and before the optimizer step. Because weight decay is part of the gradient, it has to be moved so that it is applied to the local gradient before DGC's local gradient accumulation. Otherwise the model is very likely to overfit, which harms test accuracy: on ResNet50 with the ImageNet dataset, DGC lost 2% test accuracy before this change because of overfitting (a sketch of the corrected local step is given after this list).
- In Paddle, the loss has already been divided by nranks. Because dgc_op runs before allreduce, the local regular_coeff would also need to be divided by nranks. Instead, we now multiply the gradient by nranks inside dgc_op, so regular_coeff no longer has to be divided by nranks, which avoids precision loss: the coefficient is typically 1e-4, and with nranks=32, coeff/nranks becomes 3.125e-6, whose numerical accuracy is too low. On ResNet50 with the ImageNet dataset, DGC lost 0.6% test accuracy before the gradient was multiplied by nranks, because of this precision loss (see the second sketch below).
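The first sketch below illustrates where weight decay has to be applied on a single rank. It is a minimal NumPy stand-in for the DGC local step, not Paddle's actual dgc_op; the function and variable names (`dgc_local_step`, `accum`, `u`, `regular_coeff`) are illustrative, and the top-k selection and allreduce are omitted.

```python
import numpy as np

def dgc_local_step(param, local_grad, accum, u, regular_coeff, momentum=0.9):
    """One illustrative local DGC step on a single rank (not Paddle's dgc_op)."""
    # Fold weight decay into the *local* gradient before accumulation, so that
    # entries skipped by top-k sparsification still carry their weight-decay
    # contribution inside the local accumulator.
    grad = local_grad + regular_coeff * param

    # DGC momentum correction followed by local gradient accumulation.
    u = momentum * u + grad
    accum = accum + u

    # Only the top-k entries of `accum` would be sent to allreduce; the rest
    # stay in the local accumulator (selection omitted here).
    return accum, u

# Tiny usage example with random tensors.
n = 8
param = np.random.randn(n).astype(np.float32)
grad = np.random.randn(n).astype(np.float32)
accum = np.zeros(n, dtype=np.float32)
u = np.zeros(n, dtype=np.float32)
accum, u = dgc_local_step(param, grad, accum, u, regular_coeff=1e-4)
```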
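The second sketch shows the algebraic rework for the nranks scaling. It assumes, as described above, that the local loss (and hence the local gradient) has already been divided by nranks; the final division by nranks stands in for averaging the allreduced result and is an illustrative simplification, not the actual kernel code.

```python
import numpy as np

nranks, coeff = 32, 1e-4
param = np.random.randn(1000).astype(np.float32)
# Local gradient is already scaled down because the loss was divided by nranks.
local_grad = np.random.randn(1000).astype(np.float32) / nranks

# Original form: divide the coefficient by nranks, giving a very small
# intermediate value of 3.125e-6.
g_old = local_grad + (coeff / nranks) * param

# Reworked form: multiply the gradient by nranks inside dgc_op and keep the
# full-size coefficient; dividing the aggregated result by nranks afterwards
# recovers the same update.
g_new = (local_grad * nranks + coeff * param) / nranks

print(np.max(np.abs(g_old - g_new)))  # the two forms agree up to rounding
```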
After the above fixes, DGC accuracy fully matches the baseline on ResNet50 with the ImageNet dataset.