getNextBatch fails after the program has run for a few hundred batches
Created by: hiahiahu
paddle_trainer.INFO
I0116 01:21:18.930677 22650 TrainerInternal.cpp:206] _layer_output.bias avg_abs_val=0.00357206 max_val=0.00560724 avg_abs_grad=210.49 max_grad=210.49
I0116 01:21:18.930753 22650 TrainerInternal.cpp:160] Batch=400 samples=1600000 AvgCost=0.238351 CurrentCost=0.193124 Eval: classification_error=0.0886062 auc=0.88588 CurrentEval: classification_error=0.077955 auc=0.921465
I0116 01:21:20.684520 22650 Tester.cpp:101] Test samples=9000 cost=1.20793 Eval: classification_error=0.347556 auc=0.59058
train.log
Mon Jan 16 01:22:25 2017[1,34]<stderr>:F0116 01:22:25.547391 17089 PythonUtil.cpp:108] Call function getNextBatch failed.
Mon Jan 16 01:22:25 2017[1,34]<stderr>:*** Check failure stack trace: ***
Mon Jan 16 01:22:25 2017[1,34]<stderr>: @ 0x915068 google::LogMessage::Fail()
Mon Jan 16 01:22:25 2017[1,34]<stderr>: @ 0x914fc0 google::LogMessage::SendToLog()
Mon Jan 16 01:22:25 2017[1,34]<stderr>: @ 0x914a55 google::LogMessage::Flush()
Mon Jan 16 01:22:25 2017[1,34]<stderr>: @ 0x917816 google::LogMessageFatal::~LogMessageFatal()
Mon Jan 16 01:22:25 2017[1,34]<stderr>: @ 0x857031 paddle::checkPythonError()
Mon Jan 16 01:22:25 2017[1,34]<stderr>: @ 0x6cadaf paddle::PyDataProvider::getNextBatchInternal()
Mon Jan 16 01:22:25 2017[1,34]<stderr>: @ 0x60575e paddle::DataProvider::getNextBatch()
Mon Jan 16 01:22:25 2017[1,34]<stderr>: @ 0x71c6d9 paddle::Trainer::trainOnePass()
Mon Jan 16 01:22:25 2017[1,34]<stderr>: @ 0x71cc29 paddle::Trainer::train()
Mon Jan 16 01:22:25 2017[1,34]<stderr>: @ 0x597655 main
Mon Jan 16 01:22:25 2017[1,34]<stderr>: @ 0x7f869b00abd5 __libc_start_main
Mon Jan 16 01:22:25 2017[1,34]<stderr>: @ 0x5a57c5 (unknown)
Mon Jan 16 01:22:25 2017[1,34]<stderr>:./train.sh: line 195: 17089 Aborted PYTHONPATH=./paddle:$PYTHONPATH GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Mon Jan 16 01:22:25 2017[1,34]<stderr>:+ '[' 134 -ne 0 ']'
Mon Jan 16 01:22:25 2017[1,34]<stderr>:+ kill_pserver2_exit
Mon Jan 16 01:22:25 2017[1,34]<stderr>:+ ps aux
The corresponding logic in dataprovider.py is:
@provider(use_seq=False, init_hook=initHook, should_shuffle=True, pool_size=PoolSize(50))
def processData(obj, file_name):
    with open(file_name, 'r') as fdata:
        for line in fdata:
            ## query bid negbid show click common_show qbfea qnfea
            it = line.strip().split('\t')
            if int(it[4]) > 1:
                qbfea = map(float, it[-2].split())
                yield qbfea, 1
            if int(it[5]) > 2:
                qnfea = map(float, it[-1].split())
                yield qnfea, 0
Could you tell me what might cause this error, and how to debug it? Thanks!
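One way to narrow this down might be to run the same parsing logic outside of Paddle and see whether any input line makes it raise. The sketch below is only an illustration under some assumptions: `check_line` and `find_bad_lines` are hypothetical helper names, the two `if` blocks are assumed to be independent (not nested), and a malformed line is assumed to surface as an `IndexError` (too few tab-separated fields) or a `ValueError` (a non-numeric field). It does not reproduce anything the `@provider` decorator itself does.

```python
def check_line(line):
    """Parse one line the same way processData does; raises on malformed input."""
    # Expected columns: query bid negbid show click common_show qbfea qnfea
    it = line.strip().split('\t')
    samples = []
    if int(it[4]) > 1:
        samples.append(([float(x) for x in it[-2].split()], 1))
    if int(it[5]) > 2:
        samples.append(([float(x) for x in it[-1].split()], 0))
    return samples


def find_bad_lines(lines):
    """Return (line_number, line, error) for every line that fails to parse."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        try:
            check_line(line)
        except (IndexError, ValueError) as exc:
            bad.append((line_no, line.rstrip('\n'), repr(exc)))
    return bad
```

For example, calling `find_bad_lines(open(file_name))` on each training file in turn would list the offending line numbers, if there are any. If every file parses cleanly, the problem probably lies somewhere other than the line format.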