PaddlePaddle / Paddle
Opened Jan 17, 2019 by saxon_zh (@saxon_zh, Guest)

pslib is unstable at runtime and crashes

Created by: ccmeteorljh

The following error is reported during the run:

Wed Jan 16 14:58:51 2019[1,69]<stderr>:ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Wed Jan 16 14:58:51 2019[1,69]<stderr>:*** Aborted at 1547621931 (unix time) try "date -d @1547621931" if you are using GNU date ***
Wed Jan 16 14:58:51 2019[1,69]<stderr>:PC: @                0x0 (unknown)
Wed Jan 16 14:58:51 2019[1,69]<stderr>:*** SIGABRT (@0x1f50000c856) received by PID 51286 (TID 0x7fcdd6bfd700) from PID 51286; stack trace: ***
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @     0x7fcf437ea160 (unknown)
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @     0x7fcf42d583f7 __GI_raise
Wed Jan 16 14:58:51 2019[1,33]<stderr>:I0116 14:58:51.059838 46094 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>:I0116 14:58:51.046589  3433 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:16, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @     0x7fcf42d597d8 __GI_abort
Wed Jan 16 14:58:51 2019[1,33]<stderr>:I0116 14:58:51.062235 46107 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @     0x7fcf42d96554 __libc_message
Wed Jan 16 14:58:51 2019[1,33]<stderr>:I0116 14:58:51.064942 46094 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @     0x7fcf42d9bdbe malloc_printerr
Wed Jan 16 14:58:51 2019[1,89]<stderr>:I0116 14:58:51.068962 38124 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @     0x7fcf42d9ca97 _int_free
Wed Jan 16 14:58:51 2019[1,37]<stderr>:I0116 14:58:51.075512 43317 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,89]<stderr>:I0116 14:58:51.074071 38124 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @     0x7fcf083537f4 paddle::ps::DownpourBrpcPsClient::push_sparse()
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @     0x7fced1ae75eb paddle::framework::AsyncExecutorThreadWorker::PushSparse()
Wed Jan 16 14:58:51 2019[1,67]<stderr>:I0116 14:58:51.076078 63050 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @     0x7fced1ae8c9b paddle::framework::AsyncExecutorThreadWorker::TrainFilesWithTimer()
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @     0x7fce905d28a0 execute_native_thread_routine
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @     0x7fcf437e21c3 start_thread
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @     0x7fcf42e0a12d __clone
Wed Jan 16 14:58:51 2019[1,69]<stderr>:    @                0x0 (unknown)
Wed Jan 16 14:58:51 2019[1,11]<stderr>:I0116 14:58:51.103016 27729 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,11]<stderr>:I0116 14:58:51.105809 27711 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,25]<stderr>:I0116 14:58:51.105068  5095 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,71]<stderr>:I0116 14:58:51.117740 42291 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,39]<stderr>:I0116 14:58:51.104988 48156 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,49]<stderr>:I0116 14:58:51.125352 13529 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,49]<stderr>:I0116 14:58:51.125396 13552 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,49]<stderr>:I0116 14:58:51.126184 13534 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,49]<stderr>:I0116 14:58:51.127045 13548 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,19]<stderr>:I0116 14:58:51.115668 61190 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,33]<stderr>:I0116 14:58:51.128327 46090 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:15, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,57]<stderr>:I0116 14:58:51.150101 51396 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,53]<stderr>:I0116 14:58:51.173548 44909 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,77]<stderr>:I0116 14:58:51.168828 43745 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,53]<stderr>:I0116 14:58:51.178635 44909 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,53]<stderr>:I0116 14:58:51.180130 44932 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,25]<stderr>:I0116 14:58:51.192992  5117 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
Wed Jan 16 14:58:51 2019[1,85]<stderr>:E0116 14:58:51.224426 58277 data_feed.cc:67] pick file:./data/part-00342
Wed Jan 16 14:58:51 2019[1,0]<stderr>:W0116 14:58:51.223918 28251 src/brpc/input_messenger.cpp:213] Fail to read from fd=240 SocketId=25769815336@10.182.164.150:38870@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.241304 38592 src/brpc/input_messenger.cpp:213] Fail to read from fd=91 SocketId=19982@10.182.164.150:40548@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,10]<stderr>:W0116 14:58:51.229271 10723 src/brpc/input_messenger.cpp:213] Fail to read from fd=330 SocketId=11659@10.182.164.150:46651@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,30]<stderr>:W0116 14:58:51.237868 46766 src/brpc/input_messenger.cpp:213] Fail to read from fd=68 SocketId=17179879939@10.182.164.150:53683@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,4]<stderr>:W0116 14:58:51.239457 40157 src/brpc/input_messenger.cpp:213] Fail to read from fd=44 SocketId=8589974274@10.182.164.150:12094@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,24]<stderr>:W0116 14:58:51.228867 53537 src/brpc/input_messenger.cpp:213] Fail to read from fd=28 SocketId=25769821340@10.182.164.150:51311@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,28]<stderr>:W0116 14:58:51.229332 24371 src/brpc/input_messenger.cpp:213] Fail to read from fd=42 SocketId=34359756416@10.182.164.150:27667@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,16]<stderr>:W0116 14:58:51.246500  6318 src/brpc/input_messenger.cpp:213] Fail to read from fd=118 SocketId=8589958924@10.182.164.150:17028@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,10]<stderr>:W0116 14:58:51.229865 10982 src/brpc/input_messenger.cpp:213] Fail to read from fd=257 SocketId=18689@10.182.164.150:46117@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.242110 38274 src/brpc/input_messenger.cpp:213] Fail to read from fd=760 SocketId=8589970433@10.182.164.150:39585@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,18]<stderr>:W0116 14:58:51.229128 44628 src/brpc/input_messenger.cpp:213] Fail to read from fd=136 SocketId=42949688960@10.182.164.150:36645@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,22]<stderr>:W0116 14:58:51.229554   463 src/brpc/input_messenger.cpp:213] Fail to read from fd=255 SocketId=25769828226@10.182.164.150:23723@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,26]<stderr>:W0116 14:58:51.227822 46900 src/brpc/input_messenger.cpp:213] Fail to read from fd=30 SocketId=34359760513@10.182.164.150:51662@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,20]<stderr>:W0116 14:58:51.232104 17858 src/brpc/input_messenger.cpp:213] Fail to read from fd=179 SocketId=34359754627@10.182.164.150:10298@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.242452 38243 src/brpc/input_messenger.cpp:213] Fail to read from fd=370 SocketId=25769806851@10.182.164.150:41399@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,4]<stderr>:W0116 14:58:51.239852 40175 src/brpc/input_messenger.cpp:213] Fail to read from fd=301 SocketId=17179907841@10.182.164.150:12234@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,6]<stderr>:W0116 14:58:51.230291 50143 src/brpc/input_messenger.cpp:213] Fail to read from fd=92 SocketId=8589935249@10.182.164.150:28313@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,20]<stderr>:W0116 14:58:51.232159 17857 src/brpc/input_messenger.cpp:213] Fail to read from fd=187 SocketId=17179876867@10.182.164.150:60134@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.242687 38444 src/brpc/input_messenger.cpp:213] Fail to read from fd=429 SocketId=8589935497@10.182.164.150:41400@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.242799 38444 src/brpc/input_messenger.cpp:213] Fail to read from fd=208 SocketId=25769806980@10.182.164.150:40692@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,0]<stderr>:W0116 14:58:51.225190 28474 src/brpc/input_messenger.cpp:213] Fail to read from fd=246 SocketId=34359748608@10.182.164.150:38295@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.242858 38444 src/brpc/input_messenger.cpp:213] Fail to read from fd=795 SocketId=21772@10.182.164.150:40768@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,2]<stderr>:W0116 14:58:51.232579  6931 src/brpc/input_messenger.cpp:213] Fail to read from fd=177 SocketId=17179894912@10.182.164.150:57423@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,12]<stderr>:W0116 14:58:51.241056 39209 src/brpc/input_messenger.cpp:213] Fail to read from fd=273 SocketId=8589949466@10.182.164.150:52555@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,12]<stderr>:W0116 14:58:51.241058 38989 src/brpc/input_messenger.cpp:213] Fail to read from fd=212 SocketId=25769820803@10.182.164.150:53620@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,2]<stderr>:W0116 14:58:51.232733  6637 src/brpc/input_messenger.cpp:213] Fail to read from fd=181 SocketId=25769818240@10.182.164.150:57426@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,18]<stderr>:W0116 14:58:51.230209 44510 src/brpc/input_messenger.cpp:213] Fail to read from fd=221 SocketId=60129545987@10.182.164.150:36943@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,82]<stderr>:W0116 14:58:51.243179 38496 src/brpc/input_messenger.cpp:213] Fail to read from fd=694 SocketId=8589939984@10.182.164.150:41181@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,6]<stderr>:W0116 14:58:51.231586 50103 src/brpc/input_messenger.cpp:213] Fail to read from fd=232 SocketId=17179873922@10.182.164.150:28639@8000: Connection reset by peer
Wed Jan 16 14:58:51 2019[1,17]<stderr>:I0116 14:58:51.248335 22604 baidu/paddlepaddle/pslib/src/communicate/downpour_ps_client.cc:720] Waiting for async_call_num comsume, task_num:14, max_task_limit:13
--------------------------------------------------------------------------
mpirun noticed that process rank 69 with PID 51286 on node 10.182.164.150 exited on signal 6 (Aborted).
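
The output above interleaves stderr from many MPI ranks, which makes the stack trace of the aborted process (rank 69, per the mpirun message) hard to read. A minimal sketch for pulling out only that rank's stack frames is shown below. It assumes every line carries the `[job,rank]<stderr>:` prefix seen in this log and that frames start with `@` followed by an address; the function name `stack_frames` and the file name `job.log` are hypothetical.

```python
import re

# Prefix written in front of each line in this log, e.g.
# "Wed Jan 16 14:58:51 2019[1,69]<stderr>:<message>"
PREFIX = re.compile(r"^.*?\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")
# Stack frame as printed in the trace above: "    @     0x7fcf42d9ca97 _int_free"
FRAME = re.compile(r"^\s*@\s+(0x[0-9a-fA-F]+)\s+(.*)$")


def stack_frames(lines, rank):
    """Return (address, symbol) pairs of stack frames printed by one rank."""
    frames = []
    for line in lines:
        m = PREFIX.match(line)
        if m is None or int(m.group(2)) != rank:
            continue  # line without a rank prefix, or from another rank
        f = FRAME.match(m.group(4))
        if f:
            frames.append((f.group(1), f.group(2).strip()))
    return frames


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python frames.py job.log 69
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for addr, symbol in stack_frames(fh, int(sys.argv[2])):
            print(addr, symbol)
```

Applied to rank 69, this leaves the chain `__GI_raise`, `__GI_abort`, `__libc_message`, `malloc_printerr`, `_int_free`, `paddle::ps::DownpourBrpcPsClient::push_sparse()`, `AsyncExecutorThreadWorker::PushSparse()`, `TrainFilesWithTimer()`. An abort raised through `malloc_printerr` from `_int_free` usually means glibc detected a corrupted heap or an invalid or repeated `free`, so the memory released inside `push_sparse()` is a likely place to start looking. The "Connection reset by peer" warnings on the other ranks all name 10.182.164.150, the node on which rank 69 was running, so they are probably a consequence of that process dying rather than a separate fault.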
Reference: paddlepaddle/Paddle#15388