Opened Apr 16, 2019 by saxon_zh (@saxon_zh, Guest)

BERT training: validation loss becomes inf

Created by: ccmeteorljh

Paddle version: 1.4. With fp16 enabled and loss_scaling=8, training does not converge. Training parameters:

batch_size: 4096                                                                                                                                         
bert_config_path: bert_config.json                                                                                                                       
checkpoints: output/bert_normal_task                                                                                                                     
data_dir: train_data                                                                                                                                     
epoch: 100                                                                                                                                               
for_test: False                                                                                                                                          
generate_neg_sample: True                                                                                                                                
init_checkpoint: None                                                                                                                                    
is_distributed: False                                                                                                                                    
learning_rate: 0.0001                                                                                                                                    
loss_scaling: 8.0                                                                                                                                        
lr_scheduler: noam_decay                                                                                                                                 
max_seq_len: 512                                                                                                                                         
num_train_steps: 1000000                                                                                                                                 
reduce_master_grad: False                                                                                                                                
save_steps: 10000                                                                                                                                        
skip_steps: 20                                                                                                                                           
test_set_dir:                                                                                                                                            
use_cuda: True                                                                                                                                           
use_fast_executor: False                                                                                                                                 
use_fp16: True                                                                                                                                           
validation_set_dir: test_data                                                                                                                            
validation_steps: 1000                                                                                                                                   
vocab_path: vocab.txt                                                                                                                                    
warmup_steps: 4000                                                                                                                                       
weight_decay: 0.01                                                                                                                                       
weight_sharing: True
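
For reference when reading the `current learning_rate` lines in the logs, here is a minimal sketch of a Noam schedule with linear warmup. It assumes the schedule is normalized so that the rate peaks at `learning_rate` when the step equals `warmup_steps`; that normalization is an assumption, not something stated in this issue, and `noam_lr` is a hypothetical helper rather than a Paddle API.

```python
import math


def noam_lr(step: int, learning_rate: float = 0.0001, warmup_steps: int = 4000) -> float:
    """Noam schedule, assumed to peak at `learning_rate` when step == warmup_steps."""
    if step < 1:
        raise ValueError("step must be >= 1")
    warmup_factor = step / warmup_steps            # linear ramp during warmup
    decay_factor = math.sqrt(warmup_steps / step)  # inverse square-root decay afterwards
    return learning_rate * min(warmup_factor, decay_factor)


# Steps that appear in the logs of this issue, plus the end of warmup.
for step in (1000, 4000, 29360):
    print(step, f"{noam_lr(step):.6f}")
```

With the parameters above this gives about 0.000025 at step 1000 and about 0.000037 at step 29360, which agrees with the logged learning rates.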

Training results:

epoch: 1, progress: 1/100, step: 29360, loss: 60.687500, ppl: 1009.000000, next_sent_acc: 0.578125, speed: 2.013807 steps/s, file: part-00097.gz
('feed_queue size', 70L)
current learning_rate:0.000037
epoch: 1, progress: 1/100, step: 29380, loss: 59.750000, ppl: 880.000000, next_sent_acc: 0.493056, speed: 2.013392 steps/s, file: part-00097.gz
('feed_queue size', 70L)
current learning_rate:0.000037
epoch: 1, progress: 1/100, step: 29400, loss: 60.281250, ppl: 955.500000, next_sent_acc: 0.524306, speed: 2.010673 steps/s, file: part-00097.gz
('feed_queue size', 70L)
current learning_rate:0.000037
epoch: 1, progress: 1/100, step: 29420, loss: 60.218750, ppl: 943.000000, next_sent_acc: 0.563194, speed: 2.014425 steps/s, file: part-00097.gz
('feed_queue size', 70L)
current learning_rate:0.000037
epoch: 1, progress: 1/100, step: 29440, loss: 60.468750, ppl: 970.000000, next_sent_acc: 0.562847, speed: 2.021429 steps/s, file: part-00097.gz
('feed_queue size', 70L)
current learning_rate:0.000037
epoch: 1, progress: 1/100, step: 29460, loss: 59.531250, ppl: 867.000000, next_sent_acc: 0.578125, speed: 2.014519 steps/s, file: part-00097.gz
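
To compare runs that use different loss_scaling values, it may help to divide the logged loss by the scale factor. The sketch below assumes the printed training loss is the scaled value, which the issue does not confirm; `unscale_loss` is a hypothetical helper.

```python
def unscale_loss(logged_loss: float, loss_scaling: float) -> float:
    """Undo loss scaling, assuming the logged value was multiplied by loss_scaling."""
    if loss_scaling <= 0:
        raise ValueError("loss_scaling must be positive")
    return logged_loss / loss_scaling


# Loss values copied from the loss_scaling=8 log above (steps 29360 to 29460).
logged = [60.687500, 59.750000, 60.281250, 60.218750, 60.468750, 59.531250]
unscaled = [unscale_loss(value, 8.0) for value in logged]
print(unscaled)
```

Under that assumption the unscaled loss stays between roughly 7.44 and 7.59 across these steps, which is consistent with the report that training is not converging.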

With fp16 enabled and scale_loss=32 or 128, the validation loss becomes inf:

epoch: 1, progress: 1/100, step: 980, loss: 232.250000, ppl: 697.000000, next_sent_acc: 0.531250, speed: 2.083186 steps/s, file: part-00054.gz
('feed_queue size', 70L)
current learning_rate:0.000025
epoch: 1, progress: 1/100, step: 1000, loss: 229.500000, ppl: 708.500000, next_sent_acc: 0.609375, speed: 2.078022 steps/s, file: part-00054.gz
/root/paddlejob/workspace/env_run/predict.py:71: RuntimeWarning: overflow encountered in add
  cost += each_total_cost
[validation_set] epoch: 1, step: 816, loss: inf, global ppl: 649.037231, batch-averged ppl: 649.037231, next_sent_acc: 0.491685, speed: 0.245354 steps/s
('feed_queue size', 66L)
current learning_rate:0.000025
epoch: 1, progress: 1/100, step: 1020, loss: 228.000000, ppl: 663.500000, next_sent_acc: 0.567708, speed: 0.219097 steps/s, file: part-00054.gz
('feed_queue size', 70L)
current learning_rate:0.000026
epoch: 1, progress: 1/100, step: 1040, loss: 227.750000, ppl: 664.000000, next_sent_acc: 0.621875, speed: 2.050249 steps/s, file: part-00054.gz
('feed_queue size', 70L)
current learning_rate:0.000026
epoch: 1, progress: 1/100, step: 1060, loss: 225.750000, ppl: 646.000000, next_sent_acc: 0.687500, speed: 2.059907 steps/s, file: part-00054.gz
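
The `overflow encountered in add` warning on `cost += each_total_cost` would be consistent with a half-precision running sum exceeding the largest finite fp16 value (65504). The following is only a sketch of that failure mode, not the code in predict.py: the per-batch cost of 229.5 is a hypothetical value chosen to match the magnitude of the logged training loss, and `to_fp16`, `accumulate_fp16`, and `accumulate_fp64` are hypothetical helpers that emulate fp16 rounding with the standard library.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value; overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range; treat that as inf.
        return math.copysign(math.inf, x)


def accumulate_fp16(costs):
    """Running sum where every intermediate result is rounded to fp16."""
    total = 0.0
    for cost in costs:
        total = to_fp16(total + to_fp16(cost))
    return total


def accumulate_fp64(costs):
    """Running sum kept in double precision."""
    return math.fsum(costs)


# Hypothetical inputs: 816 validation batches, each with a scaled cost of 229.5.
costs = [229.5] * 816
print(accumulate_fp16(costs))  # inf
print(accumulate_fp64(costs))  # 187272.0
```

If this is what happens in the validation loop, converting each per-batch cost to a wider type (float32 or float64) before adding it to the accumulator should keep the reported validation loss finite. This is a suggestion to try, not a confirmed fix.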
Reference: paddlepaddle/models#2065