BERT training: validation loss becomes inf
Created by: ccmeteorljh
Paddle version: 1.4
With fp16 enabled and loss_scaling=8, training does not converge. Training parameters:
batch_size: 4096
bert_config_path: bert_config.json
checkpoints: output/bert_normal_task
data_dir: train_data
epoch: 100
for_test: False
generate_neg_sample: True
init_checkpoint: None
is_distributed: False
learning_rate: 0.0001
loss_scaling: 8.0
lr_scheduler: noam_decay
max_seq_len: 512
num_train_steps: 1000000
reduce_master_grad: False
save_steps: 10000
skip_steps: 20
test_set_dir:
use_cuda: True
use_fast_executor: False
use_fp16: True
validation_set_dir: test_data
validation_steps: 1000
vocab_path: vocab.txt
warmup_steps: 4000
weight_decay: 0.01
weight_sharing: True
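As a sanity check on the schedule settings above, the sketch below computes the learning rate that a Noam-style schedule would give for learning_rate=0.0001 and warmup_steps=4000. It assumes the usual Noam form (linear warmup, then inverse-square-root decay) rescaled so that the peak equals learning_rate; that rescaling is an assumption about how the training script calls noam_decay, and the helper name `noam_lr` is hypothetical. Under that assumption the values come out at about 2.5e-5 for step 1000 and 3.7e-5 for step 29360, which is consistent with the `current learning_rate` lines in the logs, so the schedule itself appears to behave as configured.

```python
import math


def noam_lr(step, learning_rate=1e-4, warmup_steps=4000):
    """Hypothetical helper: linear warmup to learning_rate, then 1/sqrt(step) decay."""
    if step < warmup_steps:
        # Warmup phase: grow linearly from 0 to learning_rate.
        return learning_rate * step / float(warmup_steps)
    # Decay phase: shrink proportionally to 1/sqrt(step).
    return learning_rate * math.sqrt(float(warmup_steps) / step)


for step in (1000, 29360):
    print(step, "%f" % noam_lr(step))
```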
Training results:
epoch: 1, progress: 1/100, step: 29360, loss: 60.687500, ppl: 1009.000000, next_sent_acc: 0.578125, speed: 2.013807 steps/s, file: part-00097.gz
('feed_queue size', 70L)
current learning_rate:0.000037
epoch: 1, progress: 1/100, step: 29380, loss: 59.750000, ppl: 880.000000, next_sent_acc: 0.493056, speed: 2.013392 steps/s, file: part-00097.gz
('feed_queue size', 70L)
current learning_rate:0.000037
epoch: 1, progress: 1/100, step: 29400, loss: 60.281250, ppl: 955.500000, next_sent_acc: 0.524306, speed: 2.010673 steps/s, file: part-00097.gz
('feed_queue size', 70L)
current learning_rate:0.000037
epoch: 1, progress: 1/100, step: 29420, loss: 60.218750, ppl: 943.000000, next_sent_acc: 0.563194, speed: 2.014425 steps/s, file: part-00097.gz
('feed_queue size', 70L)
current learning_rate:0.000037
epoch: 1, progress: 1/100, step: 29440, loss: 60.468750, ppl: 970.000000, next_sent_acc: 0.562847, speed: 2.021429 steps/s, file: part-00097.gz
('feed_queue size', 70L)
current learning_rate:0.000037
epoch: 1, progress: 1/100, step: 29460, loss: 59.531250, ppl: 867.000000, next_sent_acc: 0.578125, speed: 2.014519 steps/s, file: part-00097.gz
With fp16 enabled and loss_scaling=32 or 128, the validation loss becomes inf:
epoch: 1, progress: 1/100, step: 980, loss: 232.250000, ppl: 697.000000, next_sent_acc: 0.531250, speed: 2.083186 steps/s, file: part-00054.gz
('feed_queue size', 70L)
current learning_rate:0.000025
epoch: 1, progress: 1/100, step: 1000, loss: 229.500000, ppl: 708.500000, next_sent_acc: 0.609375, speed: 2.078022 steps/s, file: part-00054.gz
/root/paddlejob/workspace/env_run/predict.py:71: RuntimeWarning: overflow encountered in add
  cost += each_total_cost
[validation_set] epoch: 1, step: 816, loss: inf, global ppl: 649.037231, batch-averged ppl: 649.037231, next_sent_acc: 0.491685, speed: 0.245354 steps/s
('feed_queue size', 66L)
current learning_rate:0.000025
epoch: 1, progress: 1/100, step: 1020, loss: 228.000000, ppl: 663.500000, next_sent_acc: 0.567708, speed: 0.219097 steps/s, file: part-00054.gz
('feed_queue size', 70L)
current learning_rate:0.000026
epoch: 1, progress: 1/100, step: 1040, loss: 227.750000, ppl: 664.000000, next_sent_acc: 0.621875, speed: 2.050249 steps/s, file: part-00054.gz
('feed_queue size', 70L)
current learning_rate:0.000026
epoch: 1, progress: 1/100, step: 1060, loss: 225.750000, ppl: 646.000000, next_sent_acc: 0.687500, speed: 2.059907 steps/s, file: part-00054.gz
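One possible explanation for the inf, offered as a hypothesis rather than a confirmed diagnosis: the RuntimeWarning points at the accumulation `cost += each_total_cost` in predict.py, and if the fetched per-batch cost is a float16 value, a running sum over 816 validation batches of roughly 230 each (about 187,680) exceeds the float16 maximum of 65504. With loss_scaling=8 the per-batch value is roughly 60, so the same sum (about 48,960) would stay in range, which would match the observation that only the larger scaling values produce inf. The sketch below reproduces that behaviour with NumPy. The per-batch values 230 and 60 are taken from the training logs as stand-ins for the validation costs, and the helper `accumulate` is hypothetical; it is not the code in predict.py.

```python
import numpy as np

# Largest finite float16 value (65504).
FP16_MAX = float(np.finfo(np.float16).max)


def accumulate(batch_costs, acc_dtype):
    """Hypothetical helper: sum per-batch costs one at a time in acc_dtype."""
    cost = acc_dtype(0)
    with np.errstate(over="ignore"):  # silence the overflow RuntimeWarning
        for each_total_cost in batch_costs:
            cost += acc_dtype(each_total_cost)
    return cost


num_batches = 816  # validation step count reported in the log
# Assumed per-batch cost, borrowed from the training loss at loss_scaling=32.
costs_scale_32 = np.full(num_batches, 230.0, dtype=np.float16)

fp16_total = accumulate(costs_scale_32, np.float16)  # overflows float16
fp64_total = accumulate(costs_scale_32, np.float64)  # stays finite

print("float16 max:", FP16_MAX)
print("float16 running sum:", fp16_total)
print("float64 running sum:", fp64_total)
print("float64 mean per batch:", fp64_total / num_batches)
```

If this is what happens, casting each fetched cost to float32 or float64 before accumulating (or dividing out the loss scaling first) should give a finite validation loss without changing the training itself.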