Cluster training ran for a few epochs, then suddenly crashed
Created by: 333caowei
MPI training reached the fourth epoch and then suddenly crashed. The errors are as follows:
train.log:
Fri Nov 2 12:41:00 2018[1,1]<stdout>:Epoch_id: 4, Batch_id: 370, Cost: 0.004106
Fri Nov 2 12:41:00 2018[1,7]<stdout>:Epoch_id: 4, Batch_id: 530, Cost: 0.004209
Fri Nov 2 12:41:00 2018[1,11]<stdout>:Epoch_id: 4, Batch_id: 700, Cost: 0.003778
Fri Nov 2 12:41:00 2018[1,13]<stdout>:*** Aborted at 1541133660 (unix time) try "date -d @1541133660" if you are using GNU date ***
Fri Nov 2 12:41:01 2018[1,2]<stdout>:*** Aborted at 1541133661 (unix time) try "date -d @1541133661" if you are using GNU date ***
Fri Nov 2 12:41:01 2018[1,15]<stdout>:F1102 12:41:01.557739 19970 grpc_client.cc:295] Send name:[source_fc_1_128.w@GRAD.block1], ep:[10.109.92.33:8000] meets grpc error:OS Error
Fri Nov 2 12:41:01 2018[1,3]<stdout>:F1102 12:41:01.562139 423 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:Socket closed
Fri Nov 2 12:41:01 2018[1,15]<stdout>:*** Check failure stack trace: ***
Fri Nov 2 12:41:01 2018[1,12]<stdout>:F1102 12:41:01.555218 15430 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:OS Error
Fri Nov 2 12:41:01 2018[1,18]<stdout>:F1102 12:41:01.556782 35944 grpc_client.cc:295] Send name:[author_emb.w@GRAD.block13], ep:[10.109.92.33:8000] meets grpc error:Socket closed
Fri Nov 2 12:41:01 2018[1,3]<stdout>:*** Check failure stack trace: ***
Fri Nov 2 12:41:01 2018[1,18]<stdout>:*** Check failure stack trace: ***
Fri Nov 2 12:41:01 2018[1,9]<stdout>:F1102 12:41:01.569141 19539 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:OS Error
Fri Nov 2 12:41:01 2018[1,3]<stdout>: @ 0x7f60ba0e7ebd google::LogMessage::Fail()
Fri Nov 2 12:41:01 2018[1,15]<stdout>: @ 0x7f80ca0f0ebd google::LogMessage::Fail()
Fri Nov 2 12:41:01 2018[1,18]<stdout>: @ 0x7fb2b6a21ebd google::LogMessage::Fail()
Fri Nov 2 12:41:01 2018[1,3]<stdout>: @ 0x7f60ba0eb96c google::LogMessage::SendToLog()
Fri Nov 2 12:41:01 2018[1,15]<stdout>: @ 0x7f80ca0f496c google::LogMessage::SendToLog()
Fri Nov 2 12:41:01 2018[1,18]<stdout>: @ 0x7fb2b6a2596c google::LogMessage::SendToLog()
Fri Nov 2 12:41:01 2018[1,3]<stdout>: @ 0x7f60ba0e79e3 google::LogMessage::Flush()
Fri Nov 2 12:41:01 2018[1,15]<stdout>: @ 0x7f80ca0f09e3 google::LogMessage::Flush()
Fri Nov 2 12:41:01 2018[1,18]<stdout>: @ 0x7fb2b6a219e3 google::LogMessage::Flush()
Fri Nov 2 12:41:01 2018[1,3]<stdout>: @ 0x7f60ba0ece7e google::LogMessageFatal::~LogMessageFatal()
Fri Nov 2 12:41:01 2018[1,15]<stdout>: @ 0x7f80ca0f5e7e google::LogMessageFatal::~LogMessageFatal()
Fri Nov 2 12:41:01 2018[1,18]<stdout>: @ 0x7fb2b6a26e7e google::LogMessageFatal::~LogMessageFatal()
Fri Nov 2 12:41:01 2018[1,3]<stdout>: @ 0x7f60ba9491d0 paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov 2 12:41:01 2018[1,15]<stdout>: @ 0x7f80ca9521d0 paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov 2 12:41:01 2018[1,18]<stdout>: @ 0x7fb2b72831d0 paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov 2 12:41:01 2018[1,15]<stdout>: @ 0x7f815f0098a0 execute_native_thread_routine
Fri Nov 2 12:41:01 2018[1,15]<stdout>: @ 0x7f816944d1c3 start_thread
Fri Nov 2 12:41:01 2018[1,15]<stdout>: @ 0x7f8168a7512d __clone
Fri Nov 2 12:41:01 2018[1,7]<stdout>:F1102 12:41:01.554478 46925 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:Socket closed
Fri Nov 2 12:41:01 2018[1,7]<stdout>:*** Check failure stack trace: ***
Fri Nov 2 12:41:01 2018[1,15]<stdout>: @ (nil) (unknown)
Fri Nov 2 12:41:01 2018[1,7]<stdout>: @ 0x7f2e3609aebd google::LogMessage::Fail()
Fri Nov 2 12:41:01 2018[1,7]<stdout>: @ 0x7f2e3609e96c google::LogMessage::SendToLog()
Fri Nov 2 12:41:01 2018[1,10]<stdout>:*** Aborted at 1541133661 (unix time) try "date -d @1541133661" if you are using GNU date ***
Fri Nov 2 12:41:01 2018[1,7]<stdout>: @ 0x7f2e3609a9e3 google::LogMessage::Flush()
Fri Nov 2 12:41:01 2018[1,7]<stdout>: @ 0x7f2e3609fe7e google::LogMessageFatal::~LogMessageFatal()
Fri Nov 2 12:41:01 2018[1,1]<stdout>:F1102 12:41:01.556509 40030 grpc_client.cc:295] Send name:[author_emb.w@GRAD.block13], ep:[10.109.92.33:8000] meets grpc error:Socket closed
Fri Nov 2 12:41:01 2018[1,7]<stdout>: @ 0x7f2e368fc1d0 paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov 2 12:41:01 2018[1,7]<stdout>: @ 0x7f2ecafb38a0 execute_native_thread_routine
Fri Nov 2 12:41:01 2018[1,7]<stdout>: @ 0x7f2ed53f71c3 start_thread
Fri Nov 2 12:41:01 2018[1,7]<stdout>: @ 0x7f2ed4a1f12d __clone
Fri Nov 2 12:41:01 2018[1,8]<stdout>:F1102 12:41:01.570008 31496 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:OS Error
Fri Nov 2 12:41:01 2018[1,7]<stdout>: @ (nil) (unknown)
Fri Nov 2 12:41:01 2018[1,14]<stdout>:F1102 12:41:01.555438 15781 grpc_client.cc:295] Send name:[source_fc_0_256.w@GRAD.block2], ep:[10.109.92.33:8000] meets grpc error:Socket closed
Fri Nov 2 12:41:01 2018[1,3]<stdout>: @ 0x7f61510de8a0 execute_native_thread_routine
Fri Nov 2 12:41:01 2018[1,12]<stdout>:*** Check failure stack trace: ***
Fri Nov 2 12:41:01 2018[1,9]<stdout>:*** Check failure stack trace: ***
Fri Nov 2 12:41:01 2018[1,18]<stdout>: @ 0x7fb34da188a0 execute_native_thread_routine
Fri Nov 2 12:41:01 2018[1,3]<stdout>: @ 0x7f61594441c3 start_thread
Fri Nov 2 12:41:01 2018[1,12]<stdout>: @ 0x7fa95888febd google::LogMessage::Fail()
Fri Nov 2 12:41:01 2018[1,9]<stdout>: @ 0x7f5322f6febd google::LogMessage::Fail()
Fri Nov 2 12:41:01 2018[1,12]<stdout>: @ 0x7fa95889396c google::LogMessage::SendToLog()
Fri Nov 2 12:41:01 2018[1,18]<stdout>: @ 0x7fb355d7e1c3 start_thread
Fri Nov 2 12:41:01 2018[1,3]<stdout>: @ 0x7f6158a6c12d __clone
Fri Nov 2 12:41:01 2018[1,9]<stdout>: @ 0x7f5322f7396c google::LogMessage::SendToLog()
Fri Nov 2 12:41:01 2018[1,12]<stdout>: @ 0x7fa95888f9e3 google::LogMessage::Flush()
Fri Nov 2 12:41:01 2018[1,18]<stdout>: @ 0x7fb3553a612d __clone
Fri Nov 2 12:41:01 2018[1,3]<stdout>: @ (nil) (unknown)
Fri Nov 2 12:41:01 2018[1,9]<stdout>: @ 0x7f5322f6f9e3 google::LogMessage::Flush()
Fri Nov 2 12:41:01 2018[1,11]<stdout>:F1102 12:41:01.557358 53527 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:Socket closed
Fri Nov 2 12:41:01 2018[1,12]<stdout>: @ 0x7fa958894e7e google::LogMessageFatal::~LogMessageFatal()
Fri Nov 2 12:41:01 2018[1,9]<stdout>: @ 0x7f5322f74e7e google::LogMessageFatal::~LogMessageFatal()
Fri Nov 2 12:41:01 2018[1,18]<stdout>: @ (nil) (unknown)
Fri Nov 2 12:41:01 2018[1,12]<stdout>: @ 0x7fa9590f11d0 paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov 2 12:41:01 2018[1,9]<stdout>: @ 0x7f53237d11d0 paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov 2 12:41:01 2018[1,8]<stdout>:*** Check failure stack trace: ***
Fri Nov 2 12:41:01 2018[1,8]<stdout>: @ 0x7f71d51e6ebd google::LogMessage::Fail()
Fri Nov 2 12:41:01 2018[1,1]<stdout>:*** Check failure stack trace: ***
Fri Nov 2 12:41:01 2018[1,8]<stdout>: @ 0x7f71d51ea96c google::LogMessage::SendToLog()
Fri Nov 2 12:41:01 2018[1,1]<stdout>: @ 0x7f08f3533ebd google::LogMessage::Fail()
Fri Nov 2 12:41:01 2018[1,1]<stdout>: @ 0x7f08f353796c google::LogMessage::SendToLog()
Fri Nov 2 12:41:01 2018[1,8]<stdout>: @ 0x7f71d51e69e3 google::LogMessage::Flush()
Fri Nov 2 12:41:01 2018[1,14]<stdout>:*** Check failure stack trace: ***
Fri Nov 2 12:41:01 2018[1,9]<stdout>: @ 0x7f53b9f668a0 execute_native_thread_routine
Fri Nov 2 12:41:01 2018[1,8]<stdout>: @ 0x7f71d51ebe7e google::LogMessageFatal::~LogMessageFatal()
Fri Nov 2 12:41:01 2018[1,1]<stdout>: @ 0x7f08f35339e3 google::LogMessage::Flush()
Fri Nov 2 12:41:01 2018[1,14]<stdout>: @ 0x7fe068c40ebd google::LogMessage::Fail()
Fri Nov 2 12:41:01 2018[1,9]<stdout>: @ 0x7f53c22cc1c3 start_thread
Fri Nov 2 12:41:01 2018[1,8]<stdout>: @ 0x7f71d5a481d0 paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov 2 12:41:01 2018[1,1]<stdout>: @ 0x7f08f3538e7e google::LogMessageFatal::~LogMessageFatal()
Fri Nov 2 12:41:01 2018[1,14]<stdout>: @ 0x7fe068c4496c google::LogMessage::SendToLog()
Fri Nov 2 12:41:01 2018[1,1]<stdout>: @ 0x7f08f3d951d0 paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov 2 12:41:01 2018[1,12]<stdout>: @ 0x7fa9ec9178a0 execute_native_thread_routine
Fri Nov 2 12:41:01 2018[1,14]<stdout>: @ 0x7fe068c409e3 google::LogMessage::Flush()
Fri Nov 2 12:41:01 2018[1,9]<stdout>: @ 0x7f53c18f412d __clone
Fri Nov 2 12:41:01 2018[1,14]<stdout>: @ 0x7fe068c45e7e google::LogMessageFatal::~LogMessageFatal()
Fri Nov 2 12:41:01 2018[1,12]<stdout>: @ 0x7fa9f6d5b1c3 start_thread
Fri Nov 2 12:41:01 2018[1,9]<stdout>: @ (nil) (unknown)
Fri Nov 2 12:41:01 2018[1,14]<stdout>: @ 0x7fe0694a21d0 paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov 2 12:41:01 2018[1,11]<stdout>:*** Check failure stack trace: ***
Fri Nov 2 12:41:01 2018[1,12]<stdout>: @ 0x7fa9f638312d __clone
Fri Nov 2 12:41:01 2018[1,8]<stdout>: @ 0x7f726926e8a0 execute_native_thread_routine
Fri Nov 2 12:41:01 2018[1,1]<stdout>: @ 0x7f09875bb8a0 execute_native_thread_routine
Fri Nov 2 12:41:01 2018[1,11]<stdout>: @ 0x7f184e7b7ebd google::LogMessage::Fail()
Fri Nov 2 12:41:01 2018[1,12]<stdout>: @ (nil) (unknown)
Fri Nov 2 12:41:01 2018[1,8]<stdout>: @ 0x7f72736b21c3 start_thread
Fri Nov 2 12:41:01 2018[1,11]<stdout>: @ 0x7f184e7bb96c google::LogMessage::SendToLog()
Fri Nov 2 12:41:01 2018[1,1]<stdout>: @ 0x7f09919ff1c3 start_thread
Fri Nov 2 12:41:01 2018[1,11]<stdout>: @ 0x7f184e7b79e3 google::LogMessage::Flush()
Fri Nov 2 12:41:01 2018[1,8]<stdout>: @ 0x7f7272cda12d __clone
Fri Nov 2 12:41:01 2018[1,1]<stdout>: @ 0x7f099102712d __clone
Fri Nov 2 12:41:01 2018[1,11]<stdout>: @ 0x7f184e7bce7e google::LogMessageFatal::~LogMessageFatal()
Fri Nov 2 12:41:01 2018[1,8]<stdout>: @ (nil) (unknown)
Fri Nov 2 12:41:01 2018[1,11]<stdout>: @ 0x7f184f0191d0 paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov 2 12:41:01 2018[1,1]<stdout>: @ (nil) (unknown)
Fri Nov 2 12:41:01 2018[1,11]<stdout>: @ 0x7f18e491d8a0 execute_native_thread_routine
Fri Nov 2 12:41:01 2018[1,11]<stdout>: @ 0x7f18ecc831c3 start_thread
Fri Nov 2 12:41:01 2018[1,14]<stdout>: @ 0x7fe0fdb598a0 execute_native_thread_routine
Fri Nov 2 12:41:01 2018[1,11]<stdout>: @ 0x7f18ec2ab12d __clone
Fri Nov 2 12:41:01 2018[1,14]<stdout>: @ 0x7fe107f9d1c3 start_thread
Fri Nov 2 12:41:01 2018[1,11]<stdout>: @ (nil) (unknown)
Fri Nov 2 12:41:01 2018[1,14]<stdout>: @ 0x7fe1075c512d __clone
Fri Nov 2 12:41:01 2018[1,14]<stdout>: @ (nil) (unknown)
Fri Nov 2 12:41:01 2018[1,0]<stdout>:*** Aborted at 1541133661 (unix time) try "date -d @1541133661" if you are using GNU date ***
mpirun: killing job...
server.log:
Fri Nov 2 10:44:01 2018[1,16]<stdout>:get_startup_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.passing pserver_program to get_startup_program() is deprecated, you can use new API get_pserver_programs() to get both pserver main program and startup program.
Fri Nov 2 10:44:01 2018[1,3]<stdout>:get_startup_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.passing pserver_program to get_startup_program() is deprecated, you can use new API get_pserver_programs() to get both pserver main program and startup program.
Fri Nov 2 10:44:01 2018[1,7]<stdout>:get_startup_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.passing pserver_program to get_startup_program() is deprecated, you can use new API get_pserver_programs() to get both pserver main program and startup program.
Fri Nov 2 10:44:01 2018[1,13]<stdout>:get_startup_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.passing pserver_program to get_startup_program() is deprecated, you can use new API get_pserver_programs() to get both pserver main program and startup program.
Fri Nov 2 10:44:01 2018[1,10]<stdout>:get_startup_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.passing pserver_program to get_startup_program() is deprecated, you can use new API get_pserver_programs() to get both pserver main program and startup program.
Fri Nov 2 12:41:01 2018[1,5]<stdout>:*** Error in `/home/disk1/normandy/maybach/app-user-20181102103833-1358/workspace/python27-gcc482/bin/python': double free or corruption (!prev): 0x000000000346d8c0 ***
Fri Nov 2 12:41:01 2018[1,5]<stdout>:*** Aborted at 1541133661 (unix time) try "date -d @1541133661" if you are using GNU date ***
Fri Nov 2 12:41:01 2018[1,5]<stdout>:*** Error in `/home/disk1/normandy/maybach/app-user-20181102103833-1358/workspace/python27-gcc482/bin/python': free(): corrupted unsorted chunks: 0x000000000346b630 ***
Fri Nov 2 12:41:01 2018[1,5]<stdout>:*** Error in `/home/disk1/normandy/maybach/app-user-20181102103833-1358/workspace/python27-gcc482/bin/python': free(): corrupted unsorted chunks: 0x000000000346c0e0 ***
mpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
---------------------------------------------------
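
Because mpirun interleaves output from every rank, the order of failures in the logs above is hard to read. Below is a minimal sketch (not part of the original report) that groups lines by the rank in the `[1,N]<stdout>:` tag and picks out the first glog fatal line (`F<mmdd> ...`) per rank. The function names `split_by_rank` and `first_fatal` are hypothetical helpers, and the sample lines are copied from train.log above.

```python
import re
from collections import defaultdict

# Matches the mpirun output tag, e.g. "[1,15]<stdout>:"; group 1 is the rank,
# group 2 is the payload that the process actually printed.
TAG = re.compile(r"\[\d+,(\d+)\]<std(?:out|err)>:(.*)$")

# glog fatal lines start with "F" followed by month and day, e.g. "F1102 ".
FATAL = re.compile(r"^F\d{4} ")


def split_by_rank(lines):
    """Group log payloads by MPI rank, preserving order within each rank."""
    by_rank = defaultdict(list)
    for line in lines:
        m = TAG.search(line)
        if m:  # lines without a rank tag (e.g. "mpirun: killing job...") are skipped
            by_rank[int(m.group(1))].append(m.group(2).strip())
    return dict(by_rank)


def first_fatal(by_rank):
    """Return {rank: first glog fatal line} for every rank that logged one."""
    result = {}
    for rank, payloads in by_rank.items():
        for payload in payloads:
            if FATAL.match(payload):
                result[rank] = payload
                break
    return result


# Sample lines copied from train.log above.
sample = [
    "Fri Nov 2 12:41:01 2018[1,15]<stdout>:F1102 12:41:01.557739 19970 grpc_client.cc:295] Send name:[source_fc_1_128.w@GRAD.block1], ep:[10.109.92.33:8000] meets grpc error:OS Error",
    "Fri Nov 2 12:41:01 2018[1,3]<stdout>:F1102 12:41:01.562139 423 grpc_client.cc:295] Send name:[nce_w@GRAD.block5], ep:[10.109.92.33:8000] meets grpc error:Socket closed",
    "Fri Nov 2 12:41:01 2018[1,15]<stdout>:*** Check failure stack trace: ***",
    "mpirun: killing job...",
]

by_rank = split_by_rank(sample)
for rank, line in sorted(first_fatal(by_rank).items()):
    print(rank, line)
```

In this run every trainer fails while sending gradients to the same endpoint, `10.109.92.33:8000`, so the per-rank view mainly helps confirm that the trainer errors are a consequence of that pserver going away rather than independent failures.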