Error/Gradient clipping survey and plan
Created by: reyoung
Gradient Clipping
Exploding gradients can be handled by gradient clipping. Before optimizing a parameter, we can clip its gradient to stabilize the training process.
The simplest clipping is `clip_by_value`: limit the values of a tensor to the range [clip_min, clip_max]. Every value larger than clip_max becomes clip_max, and every value smaller than clip_min becomes clip_min.
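A minimal sketch of what `clip_by_value` does, using numpy only for illustration (the real operator would run on framework tensors):

```python
import numpy as np

def clip_by_value(grad, clip_min, clip_max):
    """Element-wise clipping: values above clip_max become clip_max,
    values below clip_min become clip_min."""
    return np.clip(grad, clip_min, clip_max)

g = np.array([-3.0, 0.5, 4.0])
print(clip_by_value(g, clip_min=-1.0, clip_max=1.0))  # [-1.   0.5  1. ]
```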
Clipping each value independently is not ideal because it changes the direction of the gradient. If we do not want to change the direction of a parameter's gradient, we can instead scale the gradient so that its l2-norm stays below a limit.
If we want the direction of the gradients as a whole to be unchanged, we can scale all gradients together so that their joint l2-norm stays below a limit.
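A numpy sketch of the norm-based scaling described above; every gradient is multiplied by the same factor, so directions are preserved (again, only an illustration of the arithmetic, not the actual operator):

```python
import numpy as np

def clip_by_l2_norm(grads, max_norm):
    """Scale a list of gradients so that their joint l2-norm does not
    exceed max_norm."""
    norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

g1, g2 = np.array([3.0, 4.0]), np.array([0.0, 12.0])
print(clip_by_l2_norm([g1, g2], max_norm=1.0))  # joint norm is 13, so both scale by 1/13
```

Passing a single gradient gives per-parameter clipping; passing all gradients gives global clipping, which is exactly the distinction drawn in the two paragraphs above.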
So, two methods will be implemented:
- `clip_by_value`
- `clip_by_l2_norm`, which takes a list of gradients. On top of it there could be two higher-level APIs, `clip_by_local_l2_norm` and `clip_by_global_l2_norm`, which pass the current gradient or all gradients to `clip_by_l2_norm`, respectively (see the sketch after this list).
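Assuming the `clip_by_l2_norm` sketch above, the two higher-level APIs could be thin wrappers whose only difference is which gradients they forward:

```python
def clip_by_local_l2_norm(grad, max_norm):
    # Pass only the current gradient: each parameter is clipped independently.
    return clip_by_l2_norm([grad], max_norm)[0]

def clip_by_global_l2_norm(grads, max_norm):
    # Pass all gradients: they share one scaling factor, so the overall
    # direction of the update is preserved.
    return clip_by_l2_norm(grads, max_norm)
```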
Error clipping
Clipping the gradients only after the backward pass cannot handle explosion during the backward pass itself: gradients may already have exploded while being computed in the backward stage.
There is a trick in the previous Paddle called error clipping: it clips the gradients of hidden layers while the backward pass is running. TensorFlow does not provide this feature by default, but a user can implement it by hacking the backward method.
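A toy illustration of error clipping: the gradient is clipped after every hidden layer while the backward pass is still running, so it cannot blow up on the way down. The `Linear` class and its `backward` method are made up for this sketch only:

```python
import numpy as np

class Linear:
    """Toy layer: y = x @ w, so the input gradient is grad_out @ w.T."""
    def __init__(self, w):
        self.w = np.asarray(w)

    def backward(self, grad_out):
        return grad_out @ self.w.T

def backward_with_error_clipping(layers, grad_out, clip_min=-10.0, clip_max=10.0):
    grad = np.asarray(grad_out)
    for layer in reversed(layers):
        grad = layer.backward(grad)
        grad = np.clip(grad, clip_min, clip_max)  # error clipping per hidden layer
    return grad

# Large weights would make the gradient explode without the clipping step.
layers = [Linear(np.full((2, 2), 8.0)), Linear(np.full((2, 2), 8.0))]
print(backward_with_error_clipping(layers, np.ones((1, 2))))  # stays within [-10, 10]
```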
We should make our `backward` customizable in Python to support error clipping or other manipulations.
Maybe we can add a `backward` in Python that takes a Python callback. If the user does not provide a callback, it simply generates the backward operators as usual. If the user supplies a callback, they can implement error clipping by themselves.
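A hypothetical sketch of how such a Python `backward` with a callback could look. Operators are represented as plain dicts, and all names (`append_backward`, `error_clip_callback`) are made up to illustrate the control flow, not an actual API:

```python
def append_backward(forward_ops, callback=None):
    """Walk the forward ops in reverse and create one backward op for each.
    If a callback is given, it is invoked after every backward op so the
    user can append extra ops, e.g. error clipping."""
    backward_ops = []
    for op in reversed(forward_ops):
        bwd = {"type": op["type"] + "_grad", "inputs": op["outputs"]}
        backward_ops.append(bwd)
        if callback is not None:
            callback(bwd, backward_ops)
    return backward_ops

def error_clip_callback(bwd_op, backward_ops):
    # User-defined callback: insert a clip op right after each backward op.
    backward_ops.append({"type": "clip", "inputs": bwd_op["inputs"],
                         "min": -10.0, "max": 10.0})

forward_ops = [{"type": "fc", "outputs": ["h1"]}, {"type": "fc", "outputs": ["h2"]}]
print(append_backward(forward_ops, callback=error_clip_callback))
```

With no callback, the function just emits the normal backward operators; with a callback, the user decides what to insert after each of them.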