How to apply L2 regularization in Adam and Adamax
Created by: lcy-seso
- The current implementations of Adam and Adamax do not take the weight decay rate into account (https://github.com/PaddlePaddle/Paddle/issues/5836).
- PaddlePaddle implements L2 regularization in a different way, which may cause a problem in Adam and Adamax.
- A related issue is: https://github.com/PaddlePaddle/Paddle/issues/5899
As a result, it is not clear (to me) how L2 regularization should be implemented for Adam and Adamax.
The PR https://github.com/PaddlePaddle/Paddle/pull/5900 adds L2 weight decay following AdamW from the paper "Fixing Weight Decay Regularization in Adam", although the implementation is still not exactly the same as the one in the paper.
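
To make the distinction concrete, here is a minimal sketch of one Adam step on a single scalar parameter. It contrasts L2 regularization folded into the gradient with AdamW-style decoupled weight decay. This is not PaddlePaddle code and not the code in PR 5900. The function `adam_step`, its parameter names, and the default hyperparameters are hypothetical and chosen only for illustration. The decay term is scaled by the learning rate here, which is one common convention; the paper and the PR may differ in this detail.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam update for a scalar parameter (illustrative sketch only).

    decoupled=False: L2 regularization, the penalty gradient is added to grad.
    decoupled=True:  AdamW-style decay, applied outside the adaptive scaling.
    """
    if not decoupled:
        # L2 regularization: weight_decay * w becomes part of the gradient,
        # so it is later divided by sqrt(v_hat) like the rest of the gradient.
        grad = grad + weight_decay * w

    # Standard Adam moment estimates with bias correction (t starts at 1).
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    update = m_hat / (math.sqrt(v_hat) + eps)

    if decoupled:
        # Decoupled weight decay: shrink the weight directly, without
        # passing the decay term through the adaptive denominator.
        update = update + weight_decay * w

    w = w - lr * update
    return w, m, v


# First step with a zero loss gradient, so only the decay term acts.
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.1)
w_decoupled, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1,
                              weight_decay=0.1, decoupled=True)
print(w_l2, w_decoupled)
```

In this example the L2 variant moves the weight by roughly `lr` no matter how large `weight_decay` is, because the adaptive normalization rescales the penalty gradient. The decoupled variant moves it by `lr * weight_decay * w`, so the decay rate keeps its intended effect.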