v2 API cluster training core dump
Created by: typhoonzero
v2 distributed training fails with the following error:
Mon Jul 3 20:01:59 2017[1,0]<stdout>:Pass 0, Batch 0, Cost 1.157251, {'__auc_evaluator_0__': 0.5791015625, 'classification_error_evaluator': 0.4375}
Mon Jul 3 20:01:59 2017[1,0]<stderr>:*** Aborted at 1499083319 (unix time) try "date -d @1499083319" if you are using GNU date ***
Mon Jul 3 20:01:59 2017[1,0]<stderr>:PC: @ 0x0 (unknown)
Mon Jul 3 20:01:59 2017[1,0]<stderr>:*** SIGSEGV (@0x8) received by PID 1227 (TID 0x7f492b5fe700) from PID 8; stack trace: ***
Mon Jul 3 20:01:59 2017[1,0]<stderr>: @ 0x7f495e753160 (unknown)
Mon Jul 3 20:01:59 2017[1,0]<stderr>: @ 0x7f49586f9972 paddle::ProtoClient::recv()
Mon Jul 3 20:01:59 2017[1,0]<stderr>: @ 0x7f4958f16126 paddle::ParameterClient2::sendParallel()
Mon Jul 3 20:01:59 2017[1,0]<stderr>: @ 0x7f4958801a5c _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Mon Jul 3 20:01:59 2017[1,0]<stderr>: @ 0x7f4957c8a8a0 execute_native_thread_routine
Mon Jul 3 20:01:59 2017[1,0]<stderr>: @ 0x7f495e74b1c3 start_thread
Mon Jul 3 20:01:59 2017[1,0]<stderr>: @ 0x7f495dd7312d __clone
Mon Jul 3 20:01:59 2017[1,0]<stderr>: @ 0x0 (unknown)
100 406k 0 406k 0 0 3454k 0 --:--:-- --:--:-- --:--:-- 3476k
Mon Jul 3 20:01:59 2017[1,0]<stderr>:./train.sh: line 239: 1227 Segmentation fault python27-gcc482/bin/python conf/trainer_config.conf