Training fails after configuring dropout in the network
Created by: copytang
The network is a two-tower architecture. Each tower is structured as follows: [256 relu + dropout=0.2] -> [128 relu + dropout=0.5] -> [64 relu + dropout=0.5]
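As a rough illustration of the tower described above, the layer stack can be written down as plain data, with dropout applied after each ReLU layer. This is a framework-independent sketch, not the actual trainer_config.conf: `LayerSpec`, `TOWER`, and `dropout` are hypothetical names, and the "inverted dropout" scaling used here is an assumption that may differ from how the framework scales activations.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    size: int         # number of hidden units
    activation: str   # activation name ("relu" throughout the post)
    drop_rate: float  # probability of zeroing a unit during training


# One tower as described in the post: 256 -> 128 -> 64, all ReLU.
TOWER = [
    LayerSpec(256, "relu", 0.2),
    LayerSpec(128, "relu", 0.5),
    LayerSpec(64, "relu", 0.5),
]


def dropout(values, drop_rate, rng, training=True):
    """Zero each value with probability drop_rate; scale survivors by 1/(1 - drop_rate).

    The rescaling (inverted dropout) is an assumption of this sketch.
    At inference time (training=False) the values pass through unchanged.
    """
    if not training or drop_rate == 0.0:
        return list(values)
    keep = 1.0 - drop_rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


# Usage: apply the last layer's dropout rate to a dummy activation vector.
rng = random.Random(0)
activations = [1.0] * TOWER[-1].size
dropped = dropout(activations, TOWER[-1].drop_rate, rng)
```

With a rate of 0.5, each surviving unit is doubled and the rest are zeroed, so the expected sum of the activations is preserved.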
The single-machine version runs normally, but the cluster version fails. The error output is as follows:

Mon Jul 31 18:03:13 2017[1,9]:101 1098 SocketChannel.cpp:: ck failed: len > 0 pe] Cheken piped: len > 0 peer=70.87.87.2731 018:vCnt=4 iovs[curIov].base=0x7ffa73ffec40:13.844094 1070 SocketChannel.cpp:: 101Broken pipe [Check failed: len > 0 peer=10.87.138.24 curIov=0 iovCnt=4
Mon Jul 31 18:03:13 2017[1,9]:iovs[curIov].base=0x7ffae3ffe iovs[curIov].iov_len= curIov=0 iovCnt=4 iovs[curIov].base=0x7ffbc2bfcc40 iovs[curIov].iov_len=16: Broken pipe [32]: Broken pipe [32]
F0731 18:03:13.844498 1067 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.138.25 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffafd7fac40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.844630 1065 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.138.26 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffafebfcc40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.844714 1076 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.138.19 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffac7ffec40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.844947 1061 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.138.28 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffb197fac40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.842777 1028 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.87.13 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffba6bfcc40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.845197 1019 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.87.17 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffbc3ffec40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.845266 1063 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.138.27 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffaffffec40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.845346 1030 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.87.14 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffba57fac40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.845851 1102 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.87.11 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffa717fac40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13 Mon Jul 31 18:03:13 2017[1,9]:.845903 1057 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.138.20 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffb1bffec40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.846002 1072 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.138.23 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffae2bfcc40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.843533 1034 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.87.45 curIov=0 iovCnt=5 iovs[curIov].base=0x7ffb82bfcc40 iovs[curIov].iov_len=16: Broken pipe [32]
Mon Jul 31 18:03:13 2017[1,9]: [32]
F0731 18:03:13.848150 1055 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.138.21 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffb357fac40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.848290 1087 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.87.25 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffaa97fac40 iovs[curIov].iov_len=16F: 31 18:03 [32.848299 1081 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.138.32 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffac57fac40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.848317 1094 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.87.24 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffa8d7fac40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.848337 1089 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.87.26 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffa8fffec40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.848343 1100 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.87.28 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffa72bfcc40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.848388 1083 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.87.21 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffaabffec40 iovs[curIov].iov_len=16: Broken pipe [32]
F0731 18:03:13.848433 1078 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.138.30 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffac6bfcc40 iovs[curIov].iov_len=16F0731 18:03:: Broken pipe84843832 1085 SocketChannel.cpp Mon Jul 31 18:03:13 2017[1,9]:101] Check failed: len > 0 peer=10.87.87.20 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffaaabfcc40 iovs[curIov].iov_len=16: Broken pipe [32]
Mon Jul 31 18:03:13 2017[1,9]:*** Check failure stack trace: ***
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912c93 google::LogMessage::Flush()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x76b112 paddle::ProtoClient::recv()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912c93 google::LogMessage::Flush()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912e99 google::LogMessage::~LogMessage()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912c93 google::LogMessage::Flush()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912c93 google::LogMessage::Flush()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0xf0b668 paddle::ParameterClient2::recv()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x91316d google::LogMessage::Fail()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912c93 google::LogMessage::Flush()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912e99 google::LogMessage::~LogMessage()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916147 google::ErrnoLogMessage::~ErrnoLogMessage()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x769b9c paddle::readwritev<>()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912c93 google::LogMessage::Flush()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912c93 google::LogMessage::Flush()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912e99 google::LogMessage::~LogMessage()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916147 google::ErrnoLogMessage::~ErrnoLogMessage()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x769b9c paddle::readwritev<>()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x7ffe0aa7b8a0 execute_native_thread_routine
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912e99 google::LogMessage::~LogMessage()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x7ffe0aeda1c3 start_thread
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912c93 google::LogMessage::Flush()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x7ffe0a1ec12d __clone
Mon Jul 31 18:03:13 2017[1,9]: @ 0x912e99 google::LogMessage::~LogMessage()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x0 (unknown)
Mon Jul 31 18:03:13 2017[1,9]: @ 0x76abad paddle::SocketChannel::writeMessage()
Mon Jul 31 18:03:13 2017[1,9]: @ 0x916c1c google::LogMessage::SendToLog()
Mon Jul 31 18:03:14 2017[1,29]:./train.sh: line 207: 9019 Segmentation fault PYTHONPATH=./paddle:$PYTHONPATH GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Mon Jul 31 18:03:14 2017[1,29]:+ '[' 139 -ne 0 ']'

Job link: http://10.87.87.21:8920/fileview.html?path=/home/disk1/normandy/maybach/41383/
train.log link: http://10.87.87.21:8920/filetree?action=cat&path=/home/disk1/normandy/maybach/41383/workspace/log/train.log
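To see which peers the trainer lost contact with, the `Check failed` entries in the log can be tallied by peer address. The sketch below assumes the glog line shape seen in the output above. `PEER_RE` and `failed_peers` are hypothetical names, and entries whose text was interleaved beyond recognition are simply skipped. Note that `Broken pipe [32]` is errno 32 (EPIPE) on Linux, which is raised when writing to a connection the remote end has already closed.

```python
import re
from collections import Counter

# Matches the intact form "Check failed: len > 0 peer=<IPv4>" from the log.
# The trailing \b rejects garbled addresses with extra digits glued on.
PEER_RE = re.compile(r"Check failed: len > 0 peer=(\d{1,3}(?:\.\d{1,3}){3})\b")


def failed_peers(log_text):
    """Count how many 'Check failed' entries mention each peer address."""
    return Counter(PEER_RE.findall(log_text))


# Usage with two entries copied from the log in this post.
sample = (
    "F0731 18:03:13.844498 1067 SocketChannel.cpp:101] Check failed: len > 0 "
    "peer=10.87.138.25 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffafd7fac40 "
    "iovs[curIov].iov_len=16: Broken pipe [32]"
    "F0731 18:03:13.844630 1065 SocketChannel.cpp:101] Check failed: len > 0 "
    "peer=10.87.138.26 curIov=0 iovCnt=4 iovs[curIov].base=0x7ffafebfcc40 "
    "iovs[curIov].iov_len=16: Broken pipe [32]"
)
for peer, count in sorted(failed_peers(sample).items()):
    print(peer, count)
```

If every parameter server address shows up in the tally at nearly the same timestamp, that suggests the connections were dropped from the remote side, so the parameter server logs are a reasonable next place to look.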