Check failed: len >= 0
Created by: Bella-Zhao
MPI training fails with the following error:
Thu Nov 23 09:57:03 2017[1,8]<stderr>:*** Aborted at 1511402223 (unix time) try "date -d @1511402223" if you are using GNU date ***
Thu Nov 23 09:57:03 2017[1,1]<stderr>:*** Check failure stack trace: ***
Thu Nov 23 09:57:03 2017[1,1]<stderr>:PC: @ 0x0 (unknown)
Thu Nov 23 09:57:03 2017[1,1]<stderr>:*** SIGSEGV (@0x8) received by PID 18318 (TID 0x7f42197fb700) from PID 8; stack trace: ***
Thu Nov 23 09:57:03 2017[1,8]<stderr>:F1123 09:57:03.537968 11345 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.86.20.35: Connection reset by peer [104]
Thu Nov 23 09:57:03 2017[1,8]<stderr>:*** Check failure stack trace: ***
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x7f4e43f50160 (unknown)
Thu Nov 23 09:57:03 2017[1,8]<stderr>:PC: @ 0x0 (unknown)
Thu Nov 23 09:57:03 2017[1,8]<stderr>:*** SIGSEGV (@0x8) received by PID 5775 (TID 0x7f49fd782700) from PID 8; stack trace: ***
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55e10ce160 (unknown)
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55db2d627d google::LogMessage::Fail()
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55db12f012 paddle::ProtoClient::recv()
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55db2d9d2c google::LogMessage::SendToLog()
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55dbdba17a paddle::ParameterClient2::sendParallel()
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55db2d5da3 google::LogMessage::Flush()
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55db2d5fa9 google::LogMessage::~LogMessage()
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55db2360ec _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55db2d9257 google::ErrnoLogMessage::~ErrnoLogMessage()
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55da28c8a0 execute_native_thread_routine
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x7f4e3e15827d google::LogMessage::Fail()
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x7f4e3dfb1012 paddle::ProtoClient::recv()
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55db12dbd1 paddle::SocketChannel::read()
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55e10c61c3 start_thread
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x7f4e3ec3c17a paddle::ParameterClient2::sendParallel()
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x7f4e3e15bd2c google::LogMessage::SendToLog()
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55db12e170 paddle::SocketChannel::readMessage()
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55e06ee12d __clone
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x7f4e3e157da3 google::LogMessage::Flush()
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x7f55db12f006 paddle::ProtoClient::recv()
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x7f4e3e0b80ec _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Thu Nov 23 09:57:03 2017[1,8]<stderr>: @ 0x0 (unknown)
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x7f4e3d10e8a0 execute_native_thread_routine
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x7f4e43f481c3 start_thread
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x7f4e3e157fa9 google::LogMessage::~LogMessage()
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x7f4e4357012d __clone
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x7f4e3e15b257 google::ErrnoLogMessage::~ErrnoLogMessage()
Thu Nov 23 09:57:03 2017[1,1]<stderr>: @ 0x0 (unknown)
Thu Nov 23 09:57:09 2017[1,12]<stderr>:F1123 09:57:09.341764 27629 SocketChannel.cpp:101] Check failed: len > 0 peer=10.86.20.34 curIov=1024 iovCnt=12730 iovs[curIov].base=0x60223800 iovs[curIov].iov_len=256: Broken pipe [32]
Thu Nov 23 09:57:09 2017[1,12]<stderr>:*** Check failure stack trace: ***
Thu Nov 23 09:57:09 2017[1,12]<stderr>:F1123 09:57:09.337268 27602 SocketChannel.cpp:101] Check failed: len > 0 peer=10.86.20.35 curIov=1024 iovCnt=12723 iovs[curIov].base=0x6020d100 iovs[curIov].iov_len=256: Broken pipe [32]
Thu Nov 23 09:57:09 2017[1,12]<stderr>:*** Check failure stack trace: ***
Thu Nov 23 09:57:09 2017[1,12]<stderr>:F1123 09:57:09.341764 27629 SocketChannel.cpp:101] Check failed: len > 0 peer=10.86.20.34 curIov=1024 iovCnt=12730 iovs[curIov].base=0x60223800 iovs[curIov].iov_len=256: Broken pipe [32]F1123 09:57:09.343297 27605 SocketChannel.cpp:101] Check failed: len > 0 peer=10.86.20.38 curIov=1024 iovCnt=12752 iovs[curIov].base=0x60212800 iovs[curIov].iov_len=256: Broken pipe [32]
Thu Nov 23 09:57:09 2017[1,12]<stderr>:*** Check failure stack trace: ***
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862c27d google::LogMessage::Fail()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862c27d google::LogMessage::Fail()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862c27d google::LogMessage::Fail()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862fd2c google::LogMessage::SendToLog()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862fd2c google::LogMessage::SendToLog()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862fd2c google::LogMessage::SendToLog()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862bda3 google::LogMessage::Flush()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862bda3 google::LogMessage::Flush()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862bda3 google::LogMessage::Flush()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862bfa9 google::LogMessage::~LogMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862f257 google::ErrnoLogMessage::~ErrnoLogMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862bfa9 google::LogMessage::~LogMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c84839bc paddle::readwritev<>()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862bfa9 google::LogMessage::~LogMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862f257 google::ErrnoLogMessage::~ErrnoLogMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c862f257 google::ErrnoLogMessage::~ErrnoLogMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c84839bc paddle::readwritev<>()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c84839bc paddle::readwritev<>()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c84849cd paddle::SocketChannel::writeMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c848581c paddle::ProtoClient::send()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c84849cd paddle::SocketChannel::writeMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c84849cd paddle::SocketChannel::writeMessage()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c848581c paddle::ProtoClient::send()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c911007a paddle::ParameterClient2::sendParallel()
Thu Nov 23 09:57:09 2017[1,12]<stderr>: @ 0x7f63c848581c paddle::ProtoClient::send()
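The interleaved output above is hard to scan, so here is a minimal sketch that extracts only the fatal `Check failed` lines and counts them per peer and errno. The regex and the `summarize_failures` helper are my own illustration, not part of PaddlePaddle, and the sample text is copied from the log above. On Linux, errno 104 is ECONNRESET and 32 is EPIPE; both indicate that the remote end closed the connection, so the listed peers are probably the first place to look.

```python
import re
from collections import Counter

# Hypothetical helper, not part of PaddlePaddle.
# Matches glog fatal lines of the form:
#   F<mmdd> <time> <tid> <file>:<line>] Check failed: len >= 0 peer=<ip> ...: <message> [<errno>]
FATAL_RE = re.compile(
    r"F\d{4} [\d:.]+\s+\d+ (?P<src>[\w.]+:\d+)\] "
    r"Check failed: (?P<check>len >=? 0) "
    r"peer=(?P<peer>[\d.]+).*?: (?P<msg>[^:\[\]]+) \[(?P<errno>\d+)\]"
)


def summarize_failures(log_text):
    """Count (peer, errno, message) triples among the fatal check failures."""
    counts = Counter()
    # finditer also handles two messages fused onto one physical line.
    for m in FATAL_RE.finditer(log_text):
        counts[(m.group("peer"), int(m.group("errno")), m.group("msg"))] += 1
    return counts


# Two lines copied from the log above; the second one contains two fused messages.
sample = """Thu Nov 23 09:57:03 2017[1,8]<stderr>:F1123 09:57:03.537968 11345 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.86.20.35: Connection reset by peer [104]
Thu Nov 23 09:57:09 2017[1,12]<stderr>:F1123 09:57:09.341764 27629 SocketChannel.cpp:101] Check failed: len > 0 peer=10.86.20.34 curIov=1024 iovCnt=12730 iovs[curIov].base=0x60223800 iovs[curIov].iov_len=256: Broken pipe [32]F1123 09:57:09.343297 27605 SocketChannel.cpp:101] Check failed: len > 0 peer=10.86.20.38 curIov=1024 iovCnt=12752 iovs[curIov].base=0x60212800 iovs[curIov].iov_len=256: Broken pipe [32]
"""

summary = summarize_failures(sample)
for (peer, err, msg), n in sorted(summary.items()):
    print(f"peer={peer} errno={err} ({msg}) x{n}")
```

Running the same function over the full job log would show which peers dropped connections first, assuming every fatal line follows the format seen here.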
The job configuration is as follows:
paddle.init(use_gpu=False,
            trainer_count=int(os.getenv("PADDLE_TRAINER_COUNT", "1")),
            port=int(os.getenv("PADDLE_PORT", "7164")),
            ports_num=int(os.getenv("PADDLE_PORTS_NUM", "1")),
            num_gradient_servers=int(os.getenv("PADDLE_NUM_GRADIENT_SERVERS", "1")),
            trainer_id=int(os.getenv("PADDLE_TRAINER_ID", "0")),
            pservers=os.getenv("PADDLE_PSERVERS", "127.0.0.1"),
            ports_num_for_sparse=int(os.getenv("PADDLE_PORTS_NUM_FOR_SPARSE", "1")))
paddle cluster_train \
    --config=train_mpi.py \
    --time_limit=50:00:00 \
    --submitter=zhaoyijin \
    --num_nodes=15 \
    --job_priority=normal \
    --fs_name=hdfs://...(omitted) \
    --fs_ugi=weigou,123abc \
    --num_passes=14 \
    --train_data_path=/....(omitted)/train \
    --test_data_path=/....(omitted)/test \
    --output_path=/....(omitted) \
    --thirdparty=./my_thirdparty \
    --where=...(omitted) \
    --job_name=paddle_dssm_zhaoyijin \
    --ports_num_for_sparse=1 \
    --use_remote_sparse=1