Model training randomly crashes with a core dump.
Created by: HugoLian
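Hi all, I have a Paddle training job whose network structure is similar to the recommender system in the book tutorial. The job crashes at random after a varying number of passes. The log at the time of the crash is: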
I0728 00:45:40.022342 1868 Stat.cpp:132] Stat=BackwardTimer TID=1875 total=101.164 avg=0.046 max=0.461 min=0 count=2160
Stat=BackwardTimer TID=1876 total=105.213 avg=0.048 max=0.739 min=0 count=2160
Stat=BackwardTimer TID=1871 total=110.272 avg=0.051 max=0.553 min=0 count=2160
Stat=BackwardTimer TID=1877 total=110.925 avg=0.051 max=0.546 min=0 count=2160
Stat=BackwardTimer TID=1872 total=105.79 avg=0.048 max=0.673 min=0 count=2160
Stat=BackwardTimer TID=1874 total=107.7 avg=0.049 max=0.536 min=0 count=2160
Stat=BackwardTimer TID=1873 total=104.855 avg=0.048 max=0.727 min=0 count=2160
I0728 00:45:40.022374 1868 Stat.cpp:140] ======= BarrierStatSet status ======
I0728 00:45:40.022379 1868 Stat.cpp:153] --------------------------------------------------
I0728 00:46:22.938666 1868 Tester.cpp:127] Test samples=886436 cost=0.619395 Eval:
I0728 00:46:22.953542 1868 GradientMachine.cpp:112] Saving parameters to ./output2/pass-00030
I0728 00:46:23.030747 1868 Util.cpp:230] copy trainer_config_feed.py to ./output2/pass-00030
*** Aborted at 1501173992 (unix time) try "date -d @1501173992" if you are using GNU date ***
PC: @ 0x7fe550fdac98 (unknown)
*** SIGSEGV (@0x0) received by PID 1868 (TID 0x7fe507fff700) from PID 0; stack trace: ***
@ 0x7fe55123e160 (unknown)
@ 0x7fe550fdac98 (unknown)
/home/iknow/lianjie/paddle/paddle_internal_release_tools/idl/paddle/output/bin/paddle_local: line 109: 1868 Segmentation fault (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}
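To line the abort up with the log lines around it, the Unix timestamp in the "Aborted at" line can be converted to a readable time, as the message itself suggests. This is a minimal sketch that assumes GNU date is available; the variable names are only illustrative.

```shell
# Unix timestamp copied from the "*** Aborted at ..." line of the log above.
abort_ts=1501173992

# Render it in UTC using GNU date syntax, as the abort message suggests.
crash_utc=$(date -u -d "@${abort_ts}" '+%Y-%m-%d %H:%M:%S')
echo "abort time (UTC): ${crash_utc}"
```

If the log timestamps are in UTC+8 (an assumption about this machine), that corresponds to 00:46:32 on 07-28, about nine seconds after the last log line at 00:46:23, so the crash seems to come shortly after the pass-00030 parameters were saved.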
What could cause training to exit at random like this? Could it be related to the parameter settings in this train.sh?
cfg=trainer_config_feed.py
paddle train \
--config=$cfg \
--save_dir=./output2 \
--trainer_count=7 \
--log_period=2000 \
--num_passes=150 \
--use_gpu=false \
--show_parameter_stats_period=500 \
--test_all_data_in_one_period=1 \
--dot_period=30 \
--saving_period=1 \
--num_gradient_servers=1 \
2>&1 | tee 'train.log.2'
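Also, I remember seeing an option for specifying an initial pass, so that training resumes from the parameters saved at that pass, but I can't find it now. Paddle does support this, right?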