A temporary guide to stabilizing model training
Created by: kuke
Currently, the convergence of many configurations in the v2 APIs has not been validated. One major challenge is that training is often interrupted at the very beginning by a fatal error related to a floating-point exception. We have concluded that this type of error results from overflow caused by gradient explosion during backpropagation.
A threshold for clipping the gradients of parameters can be enabled to suppress gradient explosion, but gradient clipping alone does not seem to be enough to prevent the floating-point exception. A necessary and more effective approach is to clip the error at the crucial positions with a proper threshold. Here, error is the gradient of the cost function with respect to the output of each layer, which propagates backward following the chain rule. In PaddlePaddle, the threshold for error clipping can be set layer by layer via the layer_attr argument:
layer_attr=paddle.attr.ExtraAttr(error_clipping_threshold=10.0)
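To make the idea concrete, the following is a minimal pure-Python sketch of what clipping the error with a threshold means. It assumes element-wise clamping of the output gradient to the range [-threshold, threshold]; the helper names clip_error and error_stats are hypothetical and are not PaddlePaddle functions, and the sample values are made up for illustration.

```python
def clip_error(error, threshold):
    """Clamp each element of a layer's output gradient to [-threshold, threshold]."""
    if threshold <= 0:
        # Assumption for this sketch: a non-positive threshold disables clipping.
        return list(error)
    return [max(-threshold, min(threshold, e)) for e in error]


def error_stats(error):
    """Return (max |e|, mean |e|), in the spirit of 'max error' and 'avg error'."""
    magnitudes = [abs(e) for e in error]
    return max(magnitudes), sum(magnitudes) / len(magnitudes)


# Hypothetical error values arriving at one layer during backpropagation.
error = [0.5, -3.0, 150.0, -1453.0]
clipped = clip_error(error, threshold=10.0)
print(clipped)  # [0.5, -3.0, 10.0, -10.0]
```

Small errors pass through unchanged, while exploding ones are bounded before they propagate further back through the chain rule.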
As long as the error is clipped after the layers that are sensitive to numerical instability, the floating-point exception can be expected to be avoided. Take the seq2seq demo adapted from the machine translation chapter in PaddleBook as an example:
- Enable the logging of error clipping by setting log_error_clipping=True and log_clipping=True in paddle.init();
- Set the threshold for error clipping in the input mixed layer of the decoder, after simple_attention(), which is sensitive because of the softmax computation.
Then begin training the model. At first, a lot of error clipping messages appear:
Pass 0, Batch 50, Cost 237.458789, {'classification_error_evaluator': 0.9572649598121643}
I0525 00:46:05.145277 2043461632 Layer.cpp:363] layer=input_recurrent@decoder_group need clipping, max error=150.321 avg error=6.00274
I0525 00:46:05.186667 2043461632 Layer.cpp:363] layer=input_recurrent@decoder_group need clipping, max error=1453.23 avg error=57.54
I0525 00:46:05.237247 2043461632 Layer.cpp:363] layer=input_recurrent@decoder_group need clipping, max error=3381.55 avg error=143.592
I0525 00:46:05.304813 2043461632 Layer.cpp:363] layer=input_recurrent@decoder_group need clipping, max error=5546.9 avg error=235.026
I0525 00:46:05.364843 2043461632 Layer.cpp:363] layer=input_recurrent@decoder_group need clipping, max error=7751.46 avg error=330.331
I0525 00:46:05.423358 2043461632 Layer.cpp:363] layer=input_recurrent@decoder_group need clipping, max error=9669.27 avg error=351.612
I0525 00:46:05.490279 2043461632 Layer.cpp:363] layer=input_recurrent@decoder_group need clipping, max error=11335.1 avg error=474.481
I0525 00:46:05.529522 2043461632 Layer.cpp:363] layer=input_recurrent@decoder_group need clipping, max error=12848.3 avg error=594.606
I0525 00:46:05.577287 2043461632 Layer.cpp:363] layer=input_recurrent@decoder_group need clipping, max error=14418.5 avg error=709.603
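To track how the clipped error evolves over time, log lines in the format shown above can be parsed with a small script. This is only a sketch: the regular expression is derived from the sample lines in this guide, and the function name parse_clipping_line is hypothetical rather than part of PaddlePaddle.

```python
import re

# Pattern based on the sample log lines in this guide; other versions may differ.
CLIP_LINE = re.compile(
    r"layer=(?P<layer>\S+) need clipping, "
    r"max error=(?P<max>[-+0-9.eE]+) avg error=(?P<avg>[-+0-9.eE]+)"
)


def parse_clipping_line(line):
    """Return (layer, max_error, avg_error), or None if the line is not a clipping message."""
    match = CLIP_LINE.search(line)
    if match is None:
        return None
    return match.group("layer"), float(match.group("max")), float(match.group("avg"))


sample = (
    "I0525 00:46:05.145277 2043461632 Layer.cpp:363] "
    "layer=input_recurrent@decoder_group need clipping, "
    "max error=150.321 avg error=6.00274"
)
print(parse_clipping_line(sample))
```

Collecting these tuples per batch makes it easy to see whether the max error keeps growing or starts to fall below the threshold.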
Take it easy. Usually, after several hundred batches, training becomes stable. As the error becomes smaller, the threshold is rarely triggered again.
Pass 0, Batch 340, Cost 163.044409, {'classification_error_evaluator': 0.9482758641242981}
.........
Pass 0, Batch 350, Cost 146.991077, {'classification_error_evaluator': 0.9910714030265808}
.........
Pass 0, Batch 360, Cost 167.668896, {'classification_error_evaluator': 0.93388432264328}
.........
Pass 0, Batch 370, Cost 180.562292, {'classification_error_evaluator': 0.9770992398262024}
.........
Pass 0, Batch 380, Cost 211.419653, {'classification_error_evaluator': 0.9756097793579102}
.........
Pass 0, Batch 390, Cost 155.465637, {'classification_error_evaluator': 0.9576271176338196}
.........
Pass 0, Batch 400, Cost 157.087720, {'classification_error_evaluator': 0.9473684430122375}
If training does not stabilize, a larger threshold or better-chosen values for other hyperparameters may be needed.
Enjoy!