Fix the bug where regularization does not take effect in Adam
Created by: lcy-seso
Adam works well in practice and compares favorably to other adaptive learning-rate methods. It has become a popular optimizer (almost the default for many tasks) for deep neural networks.
But the current implementation of Adam in the V2 API ignores the weight_decay_rate parameter. This means L2 regularization does not work for Adam and Adamax at all, even if the user sets it.
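For illustration, here is a minimal NumPy sketch of what a user expects to happen when they set weight_decay_rate: the L2 term is folded into the gradient before Adam's moment updates. The function name and signature are hypothetical and simplified, not Paddle's actual kernel or API.

```python
import numpy as np

def adam_step_with_l2(param, grad, m, v, t,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                      weight_decay_rate=0.0):
    """One Adam step with classic (coupled) L2 regularization (illustrative only)."""
    # L2 regularization: add the gradient of 0.5 * weight_decay_rate * ||param||^2
    grad = grad + weight_decay_rate * param

    m = beta1 * m + (1 - beta1) * grad        # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)

    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```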
A recent paper, Fixing Weight Decay Regularization in Adam, points out that decoupling weight decay from the optimization step achieves better learning performance.
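As a contrast to the sketch above, the decoupled variant (often called AdamW) applies the decay directly to the parameters after the adaptive update, so it never passes through the moment estimates. Again, this is only a simplified sketch; the function name and signature are hypothetical, not Paddle's API.

```python
import numpy as np

def adamw_step(param, grad, m, v, t,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
               weight_decay_rate=0.0):
    """One Adam step with decoupled weight decay (illustrative only)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)

    # Adaptive Adam update plus a separate, decoupled decay of the weights.
    param = (param
             - lr * m_hat / (np.sqrt(v_hat) + eps)
             - lr * weight_decay_rate * param)
    return param, m, v
```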
Correctly implementing regularization is important for a learning task.
We have this PR https://github.com/PaddlePaddle/Paddle/pull/2097, but it does not correctly implement L2 regularization in Adam and Adamax.
A related issue reported by one of our users: https://github.com/PaddlePaddle/Paddle/issues/4162
I will fix this.