PaddlePaddle / Paddle
Opened Nov 02, 2018 by saxon_zh (@saxon_zh, Guest)

Cluster training crashed suddenly after a few epochs

Created by: 333caowei

MPI training ran until the fourth epoch and then suddenly crashed with the errors below.

train.log:

Fri Nov  2 12:41:00 2018[1,1]<stdout>:Epoch_id: 4, Batch_id: 370, Cost: 0.004106
Fri Nov  2 12:41:00 2018[1,7]<stdout>:Epoch_id: 4, Batch_id: 530, Cost: 0.004209
Fri Nov  2 12:41:00 2018[1,11]<stdout>:Epoch_id: 4, Batch_id: 700, Cost: 0.003778
Fri Nov  2 12:41:00 2018[1,13]<stdout>:*** Aborted at 1541133660 (unix time) try "date -d @1541133660" if you are using GNU date ***
Fri Nov  2 12:41:01 2018[1,2]<stdout>:*** Aborted at 1541133661 (unix time) try "date -d @1541133661" if you are using GNU date ***
Fri Nov  2 12:41:01 2018[1,15]<stdout>:F1102 12:41:01.557739 19970 grpc_client.cc:295] Send name:[source_fc_1_128.w@GRAD.block1], ep:[10.109.92.33:8000] meets grpc error:OS Error
Fri Nov  2 12:41:01 2018[1,3]<stdout>:F1102 12:41:01.562139   423 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:Socket closed
Fri Nov  2 12:41:01 2018[1,15]<stdout>:*** Check failure stack trace: ***
Fri Nov  2 12:41:01 2018[1,12]<stdout>:F1102 12:41:01.555218 15430 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:OS Error
Fri Nov  2 12:41:01 2018[1,18]<stdout>:F1102 12:41:01.556782 35944 grpc_client.cc:295] Send name:[author_emb.w@GRAD.block13], ep:[10.109.92.33:8000] meets grpc error:Socket closed
Fri Nov  2 12:41:01 2018[1,3]<stdout>:*** Check failure stack trace: ***
Fri Nov  2 12:41:01 2018[1,18]<stdout>:*** Check failure stack trace: ***
Fri Nov  2 12:41:01 2018[1,9]<stdout>:F1102 12:41:01.569141 19539 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:OS Error
Fri Nov  2 12:41:01 2018[1,3]<stdout>:    @     0x7f60ba0e7ebd  google::LogMessage::Fail()
Fri Nov  2 12:41:01 2018[1,15]<stdout>:    @     0x7f80ca0f0ebd  google::LogMessage::Fail()
Fri Nov  2 12:41:01 2018[1,18]<stdout>:    @     0x7fb2b6a21ebd  google::LogMessage::Fail()
Fri Nov  2 12:41:01 2018[1,3]<stdout>:    @     0x7f60ba0eb96c  google::LogMessage::SendToLog()
Fri Nov  2 12:41:01 2018[1,15]<stdout>:    @     0x7f80ca0f496c  google::LogMessage::SendToLog()
Fri Nov  2 12:41:01 2018[1,18]<stdout>:    @     0x7fb2b6a2596c  google::LogMessage::SendToLog()
Fri Nov  2 12:41:01 2018[1,3]<stdout>:    @     0x7f60ba0e79e3  google::LogMessage::Flush()
Fri Nov  2 12:41:01 2018[1,15]<stdout>:    @     0x7f80ca0f09e3  google::LogMessage::Flush()
Fri Nov  2 12:41:01 2018[1,18]<stdout>:    @     0x7fb2b6a219e3  google::LogMessage::Flush()
Fri Nov  2 12:41:01 2018[1,3]<stdout>:    @     0x7f60ba0ece7e  google::LogMessageFatal::~LogMessageFatal()
Fri Nov  2 12:41:01 2018[1,15]<stdout>:    @     0x7f80ca0f5e7e  google::LogMessageFatal::~LogMessageFatal()
Fri Nov  2 12:41:01 2018[1,18]<stdout>:    @     0x7fb2b6a26e7e  google::LogMessageFatal::~LogMessageFatal()
Fri Nov  2 12:41:01 2018[1,3]<stdout>:    @     0x7f60ba9491d0  paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov  2 12:41:01 2018[1,15]<stdout>:    @     0x7f80ca9521d0  paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov  2 12:41:01 2018[1,18]<stdout>:    @     0x7fb2b72831d0  paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov  2 12:41:01 2018[1,15]<stdout>:    @     0x7f815f0098a0  execute_native_thread_routine
Fri Nov  2 12:41:01 2018[1,15]<stdout>:    @     0x7f816944d1c3  start_thread
Fri Nov  2 12:41:01 2018[1,15]<stdout>:    @     0x7f8168a7512d  __clone
Fri Nov  2 12:41:01 2018[1,7]<stdout>:F1102 12:41:01.554478 46925 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:Socket closed
Fri Nov  2 12:41:01 2018[1,7]<stdout>:*** Check failure stack trace: ***
Fri Nov  2 12:41:01 2018[1,15]<stdout>:    @              (nil)  (unknown)
Fri Nov  2 12:41:01 2018[1,7]<stdout>:    @     0x7f2e3609aebd  google::LogMessage::Fail()
Fri Nov  2 12:41:01 2018[1,7]<stdout>:    @     0x7f2e3609e96c  google::LogMessage::SendToLog()
Fri Nov  2 12:41:01 2018[1,10]<stdout>:*** Aborted at 1541133661 (unix time) try "date -d @1541133661" if you are using GNU date ***
Fri Nov  2 12:41:01 2018[1,7]<stdout>:    @     0x7f2e3609a9e3  google::LogMessage::Flush()
Fri Nov  2 12:41:01 2018[1,7]<stdout>:    @     0x7f2e3609fe7e  google::LogMessageFatal::~LogMessageFatal()
Fri Nov  2 12:41:01 2018[1,1]<stdout>:F1102 12:41:01.556509 40030 grpc_client.cc:295] Send name:[author_emb.w@GRAD.block13], ep:[10.109.92.33:8000] meets grpc error:Socket closed
Fri Nov  2 12:41:01 2018[1,7]<stdout>:    @     0x7f2e368fc1d0  paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov  2 12:41:01 2018[1,7]<stdout>:    @     0x7f2ecafb38a0  execute_native_thread_routine
Fri Nov  2 12:41:01 2018[1,7]<stdout>:    @     0x7f2ed53f71c3  start_thread
Fri Nov  2 12:41:01 2018[1,7]<stdout>:    @     0x7f2ed4a1f12d  __clone
Fri Nov  2 12:41:01 2018[1,8]<stdout>:F1102 12:41:01.570008 31496 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:OS Error
Fri Nov  2 12:41:01 2018[1,7]<stdout>:    @              (nil)  (unknown)
Fri Nov  2 12:41:01 2018[1,14]<stdout>:F1102 12:41:01.555438 15781 grpc_client.cc:295] Send name:[source_fc_0_256.w@GRAD.block2], ep:[10.109.92.33:8000] meets grpc error:Socket closed
Fri Nov  2 12:41:01 2018[1,3]<stdout>:    @     0x7f61510de8a0  execute_native_thread_routine
Fri Nov  2 12:41:01 2018[1,12]<stdout>:*** Check failure stack trace: ***
Fri Nov  2 12:41:01 2018[1,9]<stdout>:*** Check failure stack trace: ***
Fri Nov  2 12:41:01 2018[1,18]<stdout>:    @     0x7fb34da188a0  execute_native_thread_routine
Fri Nov  2 12:41:01 2018[1,3]<stdout>:    @     0x7f61594441c3  start_thread
Fri Nov  2 12:41:01 2018[1,12]<stdout>:    @     0x7fa95888febd  google::LogMessage::Fail()
Fri Nov  2 12:41:01 2018[1,9]<stdout>:    @     0x7f5322f6febd  google::LogMessage::Fail()
Fri Nov  2 12:41:01 2018[1,12]<stdout>:    @     0x7fa95889396c  google::LogMessage::SendToLog()
Fri Nov  2 12:41:01 2018[1,18]<stdout>:    @     0x7fb355d7e1c3  start_thread
Fri Nov  2 12:41:01 2018[1,3]<stdout>:    @     0x7f6158a6c12d  __clone
Fri Nov  2 12:41:01 2018[1,9]<stdout>:    @     0x7f5322f7396c  google::LogMessage::SendToLog()
Fri Nov  2 12:41:01 2018[1,12]<stdout>:    @     0x7fa95888f9e3  google::LogMessage::Flush()
Fri Nov  2 12:41:01 2018[1,18]<stdout>:    @     0x7fb3553a612d  __clone
Fri Nov  2 12:41:01 2018[1,3]<stdout>:    @              (nil)  (unknown)
Fri Nov  2 12:41:01 2018[1,9]<stdout>:    @     0x7f5322f6f9e3  google::LogMessage::Flush()
Fri Nov  2 12:41:01 2018[1,11]<stdout>:F1102 12:41:01.557358 53527 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:Socket closed
Fri Nov  2 12:41:01 2018[1,12]<stdout>:    @     0x7fa958894e7e  google::LogMessageFatal::~LogMessageFatal()
Fri Nov  2 12:41:01 2018[1,9]<stdout>:    @     0x7f5322f74e7e  google::LogMessageFatal::~LogMessageFatal()
Fri Nov  2 12:41:01 2018[1,18]<stdout>:    @              (nil)  (unknown)
Fri Nov  2 12:41:01 2018[1,12]<stdout>:    @     0x7fa9590f11d0  paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov  2 12:41:01 2018[1,9]<stdout>:    @     0x7f53237d11d0  paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov  2 12:41:01 2018[1,8]<stdout>:*** Check failure stack trace: ***
Fri Nov  2 12:41:01 2018[1,8]<stdout>:    @     0x7f71d51e6ebd  google::LogMessage::Fail()
Fri Nov  2 12:41:01 2018[1,1]<stdout>:*** Check failure stack trace: ***
Fri Nov  2 12:41:01 2018[1,8]<stdout>:    @     0x7f71d51ea96c  google::LogMessage::SendToLog()
Fri Nov  2 12:41:01 2018[1,1]<stdout>:    @     0x7f08f3533ebd  google::LogMessage::Fail()
Fri Nov  2 12:41:01 2018[1,1]<stdout>:    @     0x7f08f353796c  google::LogMessage::SendToLog()
Fri Nov  2 12:41:01 2018[1,8]<stdout>:    @     0x7f71d51e69e3  google::LogMessage::Flush()
Fri Nov  2 12:41:01 2018[1,14]<stdout>:*** Check failure stack trace: ***
Fri Nov  2 12:41:01 2018[1,9]<stdout>:    @     0x7f53b9f668a0  execute_native_thread_routine
Fri Nov  2 12:41:01 2018[1,8]<stdout>:    @     0x7f71d51ebe7e  google::LogMessageFatal::~LogMessageFatal()
Fri Nov  2 12:41:01 2018[1,1]<stdout>:    @     0x7f08f35339e3  google::LogMessage::Flush()
Fri Nov  2 12:41:01 2018[1,14]<stdout>:    @     0x7fe068c40ebd  google::LogMessage::Fail()
Fri Nov  2 12:41:01 2018[1,9]<stdout>:    @     0x7f53c22cc1c3  start_thread
Fri Nov  2 12:41:01 2018[1,8]<stdout>:    @     0x7f71d5a481d0  paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov  2 12:41:01 2018[1,1]<stdout>:    @     0x7f08f3538e7e  google::LogMessageFatal::~LogMessageFatal()
Fri Nov  2 12:41:01 2018[1,14]<stdout>:    @     0x7fe068c4496c  google::LogMessage::SendToLog()
Fri Nov  2 12:41:01 2018[1,1]<stdout>:    @     0x7f08f3d951d0  paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov  2 12:41:01 2018[1,12]<stdout>:    @     0x7fa9ec9178a0  execute_native_thread_routine
Fri Nov  2 12:41:01 2018[1,14]<stdout>:    @     0x7fe068c409e3  google::LogMessage::Flush()
Fri Nov  2 12:41:01 2018[1,9]<stdout>:    @     0x7f53c18f412d  __clone
Fri Nov  2 12:41:01 2018[1,14]<stdout>:    @     0x7fe068c45e7e  google::LogMessageFatal::~LogMessageFatal()
Fri Nov  2 12:41:01 2018[1,12]<stdout>:    @     0x7fa9f6d5b1c3  start_thread
Fri Nov  2 12:41:01 2018[1,9]<stdout>:    @              (nil)  (unknown)
Fri Nov  2 12:41:01 2018[1,14]<stdout>:    @     0x7fe0694a21d0  paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov  2 12:41:01 2018[1,11]<stdout>:*** Check failure stack trace: ***
Fri Nov  2 12:41:01 2018[1,12]<stdout>:    @     0x7fa9f638312d  __clone
Fri Nov  2 12:41:01 2018[1,8]<stdout>:    @     0x7f726926e8a0  execute_native_thread_routine
Fri Nov  2 12:41:01 2018[1,1]<stdout>:    @     0x7f09875bb8a0  execute_native_thread_routine
Fri Nov  2 12:41:01 2018[1,11]<stdout>:    @     0x7f184e7b7ebd  google::LogMessage::Fail()
Fri Nov  2 12:41:01 2018[1,12]<stdout>:    @              (nil)  (unknown)
Fri Nov  2 12:41:01 2018[1,8]<stdout>:    @     0x7f72736b21c3  start_thread
Fri Nov  2 12:41:01 2018[1,11]<stdout>:    @     0x7f184e7bb96c  google::LogMessage::SendToLog()
Fri Nov  2 12:41:01 2018[1,1]<stdout>:    @     0x7f09919ff1c3  start_thread
Fri Nov  2 12:41:01 2018[1,11]<stdout>:    @     0x7f184e7b79e3  google::LogMessage::Flush()
Fri Nov  2 12:41:01 2018[1,8]<stdout>:    @     0x7f7272cda12d  __clone
Fri Nov  2 12:41:01 2018[1,1]<stdout>:    @     0x7f099102712d  __clone
Fri Nov  2 12:41:01 2018[1,11]<stdout>:    @     0x7f184e7bce7e  google::LogMessageFatal::~LogMessageFatal()
Fri Nov  2 12:41:01 2018[1,8]<stdout>:    @              (nil)  (unknown)
Fri Nov  2 12:41:01 2018[1,11]<stdout>:    @     0x7f184f0191d0  paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov  2 12:41:01 2018[1,1]<stdout>:    @              (nil)  (unknown)
Fri Nov  2 12:41:01 2018[1,11]<stdout>:    @     0x7f18e491d8a0  execute_native_thread_routine
Fri Nov  2 12:41:01 2018[1,11]<stdout>:    @     0x7f18ecc831c3  start_thread
Fri Nov  2 12:41:01 2018[1,14]<stdout>:    @     0x7fe0fdb598a0  execute_native_thread_routine
Fri Nov  2 12:41:01 2018[1,11]<stdout>:    @     0x7f18ec2ab12d  __clone
Fri Nov  2 12:41:01 2018[1,14]<stdout>:    @     0x7fe107f9d1c3  start_thread
Fri Nov  2 12:41:01 2018[1,11]<stdout>:    @              (nil)  (unknown)
Fri Nov  2 12:41:01 2018[1,14]<stdout>:    @     0x7fe1075c512d  __clone
Fri Nov  2 12:41:01 2018[1,14]<stdout>:    @              (nil)  (unknown)
Fri Nov  2 12:41:01 2018[1,0]<stdout>:*** Aborted at 1541133661 (unix time) try "date -d @1541133661" if you are using GNU date ***
mpirun: killing job...
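
Every fatal line in the train.log excerpt above names the same endpoint, 10.109.92.33:8000, with either "OS Error" or "Socket closed". For a longer log, that grouping can be done mechanically. The following is a minimal sketch, not part of the original report: the names `FATAL_RE`, `summarize_grpc_failures`, and `SAMPLE` are hypothetical, and the regular expression assumes the exact line layout seen above (mpirun's `[job,rank]<stdout>:` tag followed by a glog `F` line from grpc_client.cc).

```python
import re
from collections import defaultdict

# Assumed layout, copied from the log above:
#   ...[1,15]<stdout>:F1102 12:41:01.557739 19970 grpc_client.cc:295] Send name:[VAR], ep:[HOST:PORT] meets grpc error:MSG
FATAL_RE = re.compile(
    r"\[\d+,(?P<rank>\d+)\]<stdout>:F\d{4} \S+\s+\d+ grpc_client\.cc:\d+\] "
    r"Send name:\[(?P<var>[^\]]+)\], ep:\[(?P<ep>[^\]]+)\] "
    r"meets grpc error:(?P<err>.+)$"
)


def summarize_grpc_failures(lines):
    """Group fatal gRPC Send failures by the endpoint they were sent to."""
    by_endpoint = defaultdict(
        lambda: {"ranks": set(), "errors": set(), "vars": set()}
    )
    for line in lines:
        match = FATAL_RE.search(line.rstrip("\n"))
        if match is None:
            continue  # not a grpc_client.cc fatal line
        entry = by_endpoint[match.group("ep")]
        entry["ranks"].add(int(match.group("rank")))
        entry["errors"].add(match.group("err").strip())
        entry["vars"].add(match.group("var"))
    return dict(by_endpoint)


# Three lines taken verbatim from train.log above.
SAMPLE = [
    "Fri Nov  2 12:41:00 2018[1,1]<stdout>:Epoch_id: 4, Batch_id: 370, Cost: 0.004106",
    "Fri Nov  2 12:41:01 2018[1,15]<stdout>:F1102 12:41:01.557739 19970 grpc_client.cc:295] "
    "Send name:[source_fc_1_128.w@GRAD.block1], ep:[10.109.92.33:8000] meets grpc error:OS Error",
    "Fri Nov  2 12:41:01 2018[1,3]<stdout>:F1102 12:41:01.562139   423 grpc_client.cc:295] "
    "Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:Socket closed",
]

if __name__ == "__main__":
    for endpoint, info in summarize_grpc_failures(SAMPLE).items():
        print(endpoint, sorted(info["ranks"]), sorted(info["errors"]))
```

To run it over the full file, pass an open file handle instead of `SAMPLE`, for example `summarize_grpc_failures(open("train.log"))`. If a single endpoint accounts for all failures, the process listening on it is the first place to look.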

server.log:

Fri Nov  2 10:44:01 2018[1,16]<stdout>:get_startup_program() is deprecated, call            get_pserver_programs() to get pserver main and startup            in a single call.passing pserver_program to get_startup_program()                is deprecated, you can use new API get_pserver_programs() to                get both pserver main program and startup program.
Fri Nov  2 10:44:01 2018[1,3]<stdout>:get_startup_program() is deprecated, call            get_pserver_programs() to get pserver main and startup            in a single call.passing pserver_program to get_startup_program()                is deprecated, you can use new API get_pserver_programs() to                get both pserver main program and startup program.
Fri Nov  2 10:44:01 2018[1,7]<stdout>:get_startup_program() is deprecated, call            get_pserver_programs() to get pserver main and startup            in a single call.passing pserver_program to get_startup_program()                is deprecated, you can use new API get_pserver_programs() to                get both pserver main program and startup program.
Fri Nov  2 10:44:01 2018[1,13]<stdout>:get_startup_program() is deprecated, call            get_pserver_programs() to get pserver main and startup            in a single call.passing pserver_program to get_startup_program()                is deprecated, you can use new API get_pserver_programs() to                get both pserver main program and startup program.
Fri Nov  2 10:44:01 2018[1,10]<stdout>:get_startup_program() is deprecated, call            get_pserver_programs() to get pserver main and startup            in a single call.passing pserver_program to get_startup_program()                is deprecated, you can use new API get_pserver_programs() to                get both pserver main program and startup program.
Fri Nov  2 12:41:01 2018[1,5]<stdout>:*** Error in `/home/disk1/normandy/maybach/app-user-20181102103833-1358/workspace/python27-gcc482/bin/python': double free or corruption (!prev): 0x000000000346d8c0 ***
Fri Nov  2 12:41:01 2018[1,5]<stdout>:*** Aborted at 1541133661 (unix time) try "date -d @1541133661" if you are using GNU date ***
Fri Nov  2 12:41:01 2018[1,5]<stdout>:*** Error in `/home/disk1/normandy/maybach/app-user-20181102103833-1358/workspace/python27-gcc482/bin/python': free(): corrupted unsorted chunks: 0x000000000346b630 ***
Fri Nov  2 12:41:01 2018[1,5]<stdout>:*** Error in `/home/disk1/normandy/maybach/app-user-20181102103833-1358/workspace/python27-gcc482/bin/python': free(): corrupted unsorted chunks: 0x000000000346c0e0 ***
mpirun: killing job...

--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
---------------------------------------------------
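
Both logs contain glog banners of the form `*** Aborted at <unix time> ***`. Decoding them helps line up the heap-corruption errors in server.log with the trainer aborts in train.log. The sketch below is illustrative only: `ABORT_RE`, `abort_times`, and `SAMPLE` are hypothetical names, and the pattern assumes the banner format shown in the logs above.

```python
import re
from datetime import datetime, timezone

# Assumed banner layout, copied from the logs above:
#   ...[1,5]<stdout>:*** Aborted at 1541133661 (unix time) try "date -d @1541133661" ...
ABORT_RE = re.compile(
    r"\[\d+,(?P<rank>\d+)\]<stdout>:\*\*\* Aborted at (?P<ts>\d+) \(unix time\)"
)


def abort_times(lines):
    """Return (rank, UTC datetime) pairs for every abort banner, in input order."""
    found = []
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            when = datetime.fromtimestamp(int(match.group("ts")), tz=timezone.utc)
            found.append((int(match.group("rank")), when))
    return found


# One banner from train.log (rank 13) and one from server.log (rank 5), verbatim.
SAMPLE = [
    'Fri Nov  2 12:41:00 2018[1,13]<stdout>:*** Aborted at 1541133660 (unix time) '
    'try "date -d @1541133660" if you are using GNU date ***',
    'Fri Nov  2 12:41:01 2018[1,5]<stdout>:*** Aborted at 1541133661 (unix time) '
    'try "date -d @1541133661" if you are using GNU date ***',
]

if __name__ == "__main__":
    for rank, when in abort_times(SAMPLE):
        print(rank, when.isoformat())
```

The two sample banners decode to 04:41:00 and 04:41:01 UTC on 2018-11-02, one second apart. That agrees with the 12:41 wall-clock prefix on each line if the cluster clock is UTC+8, which is an assumption and not something the logs state.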
Reference: paddlepaddle/Paddle#14216