pslib is unstable at runtime and crashes
Created by: ccmeteorljh
The following error is reported while the job is running:
Wed Jan 16 14:58:51 2019[1,69]<stderr>:ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Wed Jan 16 14:58:51 2019[1,69]<stderr>:*** Aborted at 1547621931 (unix time) try "date -d @1547621931" if you are using GNU date ***
Wed Jan 16 14:58:51 2019[1,69]<stderr>:PC: @ 0x0 (unknown)
Wed Jan 16 14:58:51 2019[1,69]<stderr>:*** SIGABRT (@0x1f50000c856) received by PID 51286 (TID 0x7fcdd6bfd700) from PID 51286; stack trace: ***
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x7fcf437ea160 (unknown)
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x7fcf42d583f7 __GI_raise
Wed Jan 16 14:58:51 2019[1,33]<stderr>:I0116 14:58:51.059838 46094 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>:I0116 14:58:51.046589 3433 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:16, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x7fcf42d597d8 __GI_abort
Wed Jan 16 14:58:51 2019[1,33]<stderr>:I0116 14:58:51.062235 46107 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x7fcf42d96554 __libc_message
Wed Jan 16 14:58:51 2019[1,33]<stderr>:I0116 14:58:51.064942 46094 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x7fcf42d9bdbe malloc_printerr
Wed Jan 16 14:58:51 2019[1,89]<stderr>:I0116 14:58:51.068962 38124 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x7fcf42d9ca97 _int_free
Wed Jan 16 14:58:51 2019[1,37]<stderr>:I0116 14:58:51.075512 43317 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,89]<stderr>:I0116 14:58:51.074071 38124 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x7fcf083537f4 paddle::ps::DownpourBrpcPsClient::push_sparse()
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x7fced1ae75eb paddle::framework::AsyncExecutorThreadWorker::PushSparse()
Wed Jan 16 14:58:51 2019[1,67]<stderr>:I0116 14:58:51.076078 63050 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x7fced1ae8c9b paddle::framework::AsyncExecutorThreadWorker::TrainFilesWithTimer()
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x7fce905d28a0 execute_native_thread_routine
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x7fcf437e21c3 start_thread
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x7fcf42e0a12d __clone
Wed Jan 16 14:58:51 2019[1,69]<stderr>: @ 0x0 (unknown)
Wed Jan 16 14:58:51 2019[1,11]<stderr>:I0116 14:58:51.103016 27729 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,11]<stderr>:I0116 14:58:51.105809 27711 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,25]<stderr>:I0116 14:58:51.105068 5095 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,71]<stderr>:I0116 14:58:51.117740 42291 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,39]<stderr>:I0116 14:58:51.104988 48156 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,49]<stderr>:I0116 14:58:51.125352 13529 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,49]<stderr>:I0116 14:58:51.125396 13552 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,49]<stderr>:I0116 14:58:51.126184 13534 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,49]<stderr>:I0116 14:58:51.127045 13548 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,19]<stderr>:I0116 14:58:51.115668 61190 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,33]<stderr>:I0116 14:58:51.128327 46090 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:15, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,57]<stderr>:I0116 14:58:51.150101 51396 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,53]<stderr>:I0116 14:58:51.173548 44909 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,77]<stderr>:I0116 14:58:51.168828 43745 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,53]<stderr>:I0116 14:58:51.178635 44909 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,53]<stderr>:I0116 14:58:51.180130 44932 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,25]<stderr>:I0116 14:58:51.192992 5117 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,85]<stderr>:E0116 14:58:51.224426 58277 data_feed.cc:67] pick file:./data/part-00342
Wed Jan 16 14:58:51 2019[1,0]<stderr>:W0116 14:58:51.223918 28251 src/brpc/input_messenger.cpp:213] Fail to read from fd=240 SocketId=25769815336@10.182.164.150:38870@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.241304 38592 src/brpc/input_messenger.cpp:213] Fail to read from fd=91 SocketId=19982@10.182.164.150:40548@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,10]<stderr>:W0116 14:58:51.229271 10723 src/brpc/input_messenger.cpp:213] Fail to read from fd=330 SocketId=11659@10.182.164.150:46651@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,30]<stderr>:W0116 14:58:51.237868 46766 src/brpc/input_messenger.cpp:213] Fail to read from fd=68 SocketId=17179879939@10.182.164.150:53683@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,4]<stderr>:W0116 14:58:51.239457 40157 src/brpc/input_messenger.cpp:213] Fail to read from fd=44 SocketId=8589974274@10.182.164.150:12094@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,24]<stderr>:W0116 14:58:51.228867 53537 src/brpc/input_messenger.cpp:213] Fail to read from fd=28 SocketId=25769821340@10.182.164.150:51311@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,28]<stderr>:W0116 14:58:51.229332 24371 src/brpc/input_messenger.cpp:213] Fail to read from fd=42 SocketId=34359756416@10.182.164.150:27667@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,16]<stderr>:W0116 14:58:51.246500 6318 src/brpc/input_messenger.cpp:213] Fail to read from fd=118 SocketId=8589958924@10.182.164.150:17028@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,10]<stderr>:W0116 14:58:51.229865 10982 src/brpc/input_messenger.cpp:213] Fail to read from fd=257 SocketId=18689@10.182.164.150:46117@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.242110 38274 src/brpc/input_messenger.cpp:213] Fail to read from fd=760 SocketId=8589970433@10.182.164.150:39585@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,18]<stderr>:W0116 14:58:51.229128 44628 src/brpc/input_messenger.cpp:213] Fail to read from fd=136 SocketId=42949688960@10.182.164.150:36645@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,22]<stderr>:W0116 14:58:51.229554 463 src/brpc/input_messenger.cpp:213] Fail to read from fd=255 SocketId=25769828226@10.182.164.150:23723@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,26]<stderr>:W0116 14:58:51.227822 46900 src/brpc/input_messenger.cpp:213] Fail to read from fd=30 SocketId=34359760513@10.182.164.150:51662@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,20]<stderr>:W0116 14:58:51.232104 17858 src/brpc/input_messenger.cpp:213] Fail to read from fd=179 SocketId=34359754627@10.182.164.150:10298@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.242452 38243 src/brpc/input_messenger.cpp:213] Fail to read from fd=370 SocketId=25769806851@10.182.164.150:41399@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,4]<stderr>:W0116 14:58:51.239852 40175 src/brpc/input_messenger.cpp:213] Fail to read from fd=301 SocketId=17179907841@10.182.164.150:12234@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,6]<stderr>:W0116 14:58:51.230291 50143 src/brpc/input_messenger.cpp:213] Fail to read from fd=92 SocketId=8589935249@10.182.164.150:28313@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,20]<stderr>:W0116 14:58:51.232159 17857 src/brpc/input_messenger.cpp:213] Fail to read from fd=187 SocketId=17179876867@10.182.164.150:60134@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.242687 38444 src/brpc/input_messenger.cpp:213] Fail to read from fd=429 SocketId=8589935497@10.182.164.150:41400@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.242799 38444 src/brpc/input_messenger.cpp:213] Fail to read from fd=208 SocketId=25769806980@10.182.164.150:40692@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,0]<stderr>:W0116 14:58:51.225190 28474 src/brpc/input_messenger.cpp:213] Fail to read from fd=246 SocketId=34359748608@10.182.164.150:38295@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.242858 38444 src/brpc/input_messenger.cpp:213] Fail to read from fd=795 SocketId=21772@10.182.164.150:40768@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,2]<stderr>:W0116 14:58:51.232579 6931 src/brpc/input_messenger.cpp:213] Fail to read from fd=177 SocketId=17179894912@10.182.164.150:57423@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,12]<stderr>:W0116 14:58:51.241056 39209 src/brpc/input_messenger.cpp:213] Fail to read from fd=273 SocketId=8589949466@10.182.164.150:52555@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,12]<stderr>:W0116 14:58:51.241058 38989 src/brpc/input_messenger.cpp:213] Fail to read from fd=212 SocketId=25769820803@10.182.164.150:53620@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,2]<stderr>:W0116 14:58:51.232733 6637 src/brpc/input_messenger.cpp:213] Fail to read from fd=181 SocketId=25769818240@10.182.164.150:57426@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,18]<stderr>:W0116 14:58:51.230209 44510 src/brpc/input_messenger.cpp:213] Fail to read from fd=221 SocketId=60129545987@10.182.164.150:36943@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.243179 38496 src/brpc/input_messenger.cpp:213] Fail to read from fd=694 SocketId=8589939984@10.182.164.150:41181@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,6]<stderr>:W0116 14:58:51.231586 50103 src/brpc/input_messenger.cpp:213] Fail to read from fd=232 SocketId=17179873922@10.182.164.150:28639@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,17]<stderr>:I0116 14:58:51.248335 22604 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
--------------------------------------------------------------------------
mpirun noticed that process rank 69 with PID 51286 on node 10.182.164.150 exited on signal 6 (Aborted).
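
The log above interleaves output from many ranks, so the stack trace of the aborted process (rank 69) is hard to read in place. Below is a minimal sketch of one way to separate the lines per rank and pull out the symbolized frames. It assumes every line carries the `[job,rank]<stream>:` prefix seen in this log and that frames look like ` @ 0x... symbol`. The function names `split_by_rank` and `stack_frames` are hypothetical helpers written for this illustration; they are not part of pslib or Paddle.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the log above:
#   "<timestamp>[<job>,<rank>]<stdout|stderr>:<message>"
TAG = re.compile(r"\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")

# Assumed stack-frame shape: " @ 0x7fcf42d583f7 __GI_raise"
FRAME = re.compile(r"^\s*@\s+0x[0-9a-f]+\s+(.*)$")


def split_by_rank(lines):
    """Group message bodies by MPI rank, keeping their original order."""
    by_rank = defaultdict(list)
    for line in lines:
        match = TAG.search(line)
        if match:  # lines without a rank tag (e.g. the mpirun summary) are skipped
            by_rank[int(match.group(2))].append(match.group(4))
    return dict(by_rank)


def stack_frames(messages):
    """Return the symbol part of every stack-frame line, top frame first."""
    frames = []
    for message in messages:
        match = FRAME.match(message)
        if match:
            frames.append(match.group(1))
    return frames
```

For example, `stack_frames(split_by_rank(open(path))[69])`, where `path` is a saved copy of the job log, would list only the frames printed by rank 69, without the `Waiting for async_call_num` and brpc warnings from the other ranks.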