Require sparse gradient clipping
Created by: gavin1332
Sparse gradient clipping is helpful in many sequential NLP tasks, such as machine translation, language modeling, and semantic role labeling. It is applied to the embedding matrix to shrink gradients accumulated along the sequence and avoid gradient explosion. We found a gradient clipping operator for dense matrices, but it is not suitable for sparsely updated embeddings, so we need a sparse version.
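A minimal sketch of what the sparse version could look like, assuming the sparse gradient is stored as (row indices, row values) covering only the embedding rows touched in the batch; the function name `clip_sparse_by_norm` is illustrative, not an existing operator:

```python
import numpy as np

def clip_sparse_by_norm(indices, values, max_norm):
    """Clip a sparse (row-indexed) gradient so its global L2 norm
    does not exceed max_norm. Only the touched rows are stored,
    so the norm is computed over those rows alone."""
    norm = np.sqrt((values ** 2).sum())
    if norm > max_norm:
        values = values * (max_norm / norm)
    return indices, values

# Gradient for rows 2 and 7 of the embedding table
idx = np.array([2, 7])
vals = np.array([[3.0, 4.0],
                 [0.0, 0.0]])  # global norm = 5.0
idx, vals = clip_sparse_by_norm(idx, vals, max_norm=1.0)
# vals is now scaled by 1/5, and untouched rows never enter the computation
```

The key difference from the dense operator is that the norm and the rescaling act only on the selected rows, so the cost is proportional to the number of touched rows, not the vocabulary size.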
Besides, L1/L2 regularization and the Ada-series optimization algorithms should also be adapted for sparse updates.
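As one way to sketch this adaptation, the snippet below applies an Adagrad step with L2 regularization lazily, only to the rows that received gradient in the current step; the helper `sparse_adagrad_step` and its parameters are assumptions for illustration, not an existing API:

```python
import numpy as np

def sparse_adagrad_step(param, accum, indices, grad_rows,
                        lr=0.1, l2=1e-4, eps=1e-8):
    """Adagrad update applied only to the embedding rows that
    received gradient. L2 regularization is added lazily to those
    same rows; all other rows (and their accumulators) are untouched."""
    g = grad_rows + l2 * param[indices]        # lazy L2 on touched rows only
    accum[indices] += g ** 2                   # per-row squared-gradient accumulator
    param[indices] -= lr * g / (np.sqrt(accum[indices]) + eps)
    return param, accum

emb = np.ones((5, 3))           # toy embedding table
acc = np.zeros_like(emb)        # Adagrad accumulator
idx = np.array([1, 4])          # rows touched in this batch
grads = np.full((2, 3), 0.5)
emb, acc = sparse_adagrad_step(emb, acc, idx, grads)
```

Updating only the touched rows keeps the step cost independent of vocabulary size, which is the same property the clipping operator above needs.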