Check failed: request.ParseFromString(str)
Created by: cszhou
Problem description
I am running an MPI job with the v1 version. As I understand it, the key error message is: Check failed: request.ParseFromString(str). Could you advise what causes this and how it should be fixed?
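For context, the failing check is the receive-and-parse step in the pserver's RPC layer (ProtoServer.h:224 in the paddle_pserver2.INFO log below): after reading a message block off the SocketChannel, the server parses it into the expected protobuf request and aborts if parsing fails. The sketch below is illustrative only, not PaddlePaddle's actual code; `decodeRequestOrDie` and its arguments are hypothetical names, assuming glog and protobuf. It only shows what the check means: ParseFromString() returns false when the received bytes are truncated, corrupted, or were serialized with an incompatible message definition.

```cpp
#include <string>

#include <glog/logging.h>
#include <google/protobuf/message.h>

// Illustrative sketch only (hypothetical helper, not PaddlePaddle code):
// decode one serialized request block that the socket layer handed to the
// RPC dispatcher. `raw` holds the bytes read off the wire; `request` is the
// protobuf message the registered handler expects (one of the parameter
// service request types in the real pserver).
void decodeRequestOrDie(const std::string& raw,
                        google::protobuf::Message* request) {
  // If the bytes are truncated, corrupted, or were produced by an
  // incompatible .proto definition, ParseFromString() returns false and
  // this CHECK aborts the process -- producing the same
  // "Check failed: request.ParseFromString(str)" line seen in
  // paddle_pserver2.INFO.
  CHECK(request->ParseFromString(raw))
      << "could not decode " << raw.size() << " bytes as "
      << request->GetTypeName();
}
```

In other words, the fatal line in paddle_pserver2.INFO says the pserver received a request block it could not decode into a valid protobuf request, which lines up with the trainer-side "Connection reset by peer" and SIGSEGV output in train.log below.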
Contents of the train.log file:
Tue Nov 14 11:18:38 2017[1,1]<stderr>:*** Aborted at 1510629518 (unix time) try "date -d @1510629518" if you are using GNU date ***
Tue Nov 14 11:18:38 2017[1,1]<stderr>:PC: @ 0x0 (unknown)
Tue Nov 14 11:18:38 2017[1,0]<stderr>:*** Aborted at 1510629518 (unix time) try "date -d @1510629518" if you are using GNU date ***
Tue Nov 14 11:18:38 2017[1,3]<stderr>:*** Aborted at 1510629518 (unix time) try "date -d @1510629518" if you are using GNU date ***
Tue Nov 14 11:18:38 2017[1,0]<stderr>:F1114 11:18:38.644645 26017 SocketChannel.cpp:101] Check failed: len > 0 peer=10.87.143.36 curIov=530659 iovCnt=9031075 iovs[curIov].base=0x7fc8c042bd8c iovs[curIov].iov_len=16: Connection reset by peer [104]
Tue Nov 14 11:18:38 2017[1,0]<stderr>:*** Check failure stack trace: ***
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x91316d google::LogMessage::Fail()
Tue Nov 14 11:18:38 2017[1,0]<stderr>:PC: @ 0x0 (unknown)
Tue Nov 14 11:18:38 2017[1,0]<stderr>:*** SIGSEGV (@0x8) received by PID 20795 (TID 0x7fc61cdfa700) from PID 8; stack trace: ***
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x916c1c google::LogMessage::SendToLog()
Tue Nov 14 11:18:38 2017[1,3]<stderr>:PC: @ 0x0 (unknown)
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x912c93 google::LogMessage::Flush()
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x7fc909125160 (unknown)
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x912e99 google::LogMessage::~LogMessage()
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x76b112 paddle::ProtoClient::recv()
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x916147 google::ErrnoLogMessage::~ErrnoLogMessage()
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x769b9c paddle::readwritev<>()
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0xf0c1aa paddle::ParameterClient2::waitPassStart()
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x76abad paddle::SocketChannel::writeMessage()
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x760890 paddle::RemoteParameterUpdater::controller()
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x76b9ec paddle::ProtoClient::send()
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0xf0ab5b paddle::ParameterClient2::sendParallel()
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x7fc908cbe8a0 execute_native_thread_routine
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x734534 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x7fc90911d1c3 start_thread
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x7fc908cbe8a0 execute_native_thread_routine
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x7fc90842f12d __clone
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x7fc90911d1c3 start_thread
Tue Nov 14 11:18:38 2017[1,0]<stderr>: @ 0x0 (unknown)
Tue Nov 14 11:18:38 2017[1,3]<stderr>:*** SIGSEGV (@0x8) received by PID 20686 (TID 0x7fd58af4a700) from PID 8; stack trace: ***
Tue Nov 14 11:18:38 2017[1,3]<stderr>: @ 0x7fd592f1c160 (unknown)
Tue Nov 14 11:18:38 2017[1,3]<stderr>: @ 0x76b112 paddle::ProtoClient::recv()
Tue Nov 14 11:18:38 2017[1,3]<stderr>: @ 0xf0c1aa paddle::ParameterClient2::waitPassStart()
Tue Nov 14 11:18:38 2017[1,3]<stderr>: @ 0x7614d9 paddle::SparseRemoteParameterUpdater::startPass()
Tue Nov 14 11:18:38 2017[1,3]<stderr>: @ 0x734534 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Tue Nov 14 11:18:38 2017[1,1]<stderr>:*** SIGSEGV (@0x8) received by PID 20343 (TID 0x7f72c3b5e780) from PID 8; stack trace: ***
Tue Nov 14 11:18:38 2017[1,3]<stderr>: @ 0x7fd592ab58a0 execute_native_thread_routine
Tue Nov 14 11:18:38 2017[1,3]<stderr>: @ 0x7fd592f141c3 start_thread
Tue Nov 14 11:18:38 2017[1,3]<stderr>: @ 0x7fd59222612d __clone
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0x7f72c3319160 (unknown)
Tue Nov 14 11:18:38 2017[1,3]<stderr>: @ 0x0 (unknown)
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0x76b112 paddle::ProtoClient::recv()
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0xf0c1aa paddle::ParameterClient2::waitPassStart()
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0x75bdf3 paddle::RemoteParameterUpdater::startPass()
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0x74ebd6 paddle::SyncThreadPool::execPlusOwner()
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0x74edb7 paddle::ParameterUpdaterComposite::startPass()
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0x7463b4 paddle::Trainer::startTrainPass()
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0x7499a0 paddle::Trainer::trainOnePass()
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0x74a375 paddle::Trainer::train()
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0x5a3d70 main
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0x7f72c255dbd5 __libc_start_main
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0x5b2169 (unknown)
Tue Nov 14 11:18:38 2017[1,1]<stderr>: @ 0x0 (unknown)
Tue Nov 14 11:18:39 2017[1,3]<stderr>:train.sh: line 207: 20686 Segmentation fault PYTHONPATH=./paddle:$PYTHONPATH GLOG_v=10000 GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Tue Nov 14 11:18:39 2017[1,3]<stderr>:+ '[' 139 -ne 0 ']'
Tue Nov 14 11:18:39 2017[1,3]<stderr>:+ kill_pserver2_exit
Contents of the paddle_pserver2.INFO file:
Log file created at: 2017/11/14 11:16:24
Running on machine: ********************
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1114 11:16:24.435446 13258 Util.cpp:166] commandline: ./paddle_pserver2 --num_gradient_servers=5 --nics=xgbe0 --port=30735 --ports_num=8 --ports_num_for_sparse=1 --rdma_tcp=tcp --comment=paddle_cluster_job
I1114 11:16:24.435662 13258 Util.cpp:134] Calling runInitFunctions
I1114 11:16:24.436022 13258 Util.cpp:148] Call runInitFunctions done.
I1114 11:16:24.436322 13258 ParameterServerController.cpp:83] number of parameterServer instances: 9
I1114 11:16:24.436332 13258 ParameterServerController.cpp:87] Starting parameterServer[0]
I1114 11:16:24.436380 13258 ParameterServerController.cpp:87] Starting parameterServer[1]
I1114 11:16:24.436410 13258 ParameterServerController.cpp:87] Starting parameterServer[2]
I1114 11:16:24.436503 13258 ParameterServerController.cpp:87] Starting parameterServer[3]
I1114 11:16:24.436501 13260 LightNetwork.cpp:269] tcp server start
I1114 11:16:24.436506 13259 LightNetwork.cpp:269] tcp server start
I1114 11:16:24.436542 13261 LightNetwork.cpp:269] tcp server start
I1114 11:16:24.436559 13258 ParameterServerController.cpp:87] Starting parameterServer[4]
I1114 11:16:24.436645 13258 ParameterServerController.cpp:87] Starting parameterServer[5]
I1114 11:16:24.436647 13262 LightNetwork.cpp:269] tcp server start
I1114 11:16:24.436727 13263 LightNetwork.cpp:269] tcp server start
I1114 11:16:24.436731 13258 ParameterServerController.cpp:87] Starting parameterServer[6]
I1114 11:16:24.436810 13264 LightNetwork.cpp:269] tcp server start
I1114 11:16:24.436815 13258 ParameterServerController.cpp:87] Starting parameterServer[7]
I1114 11:16:24.436872 13265 LightNetwork.cpp:269] tcp server start
I1114 11:16:24.436897 13258 ParameterServerController.cpp:87] Starting parameterServer[8]
I1114 11:16:24.436949 13266 LightNetwork.cpp:269] tcp server start
I1114 11:16:24.436975 13258 ParameterServerController.cpp:96] Waiting parameterServer[0]
I1114 11:16:24.437043 13267 LightNetwork.cpp:269] tcp server start
I1114 11:18:21.855499 25554 LightNetwork.cpp:322] worker started, peer = 10.87.143.37
I1114 11:18:22.070394 25585 LightNetwork.cpp:322] worker started, peer = 10.87.143.37
---------------------------------
I1114 11:18:25.643290 25806 LightNetwork.cpp:322] worker started, peer = 10.87.143.33
I1114 11:18:25.643299 25807 LightNetwork.cpp:322] worker started, peer = 10.87.143.33
I1114 11:18:27.344852 25796 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.344966 25796 ParameterServer2.cpp:306] pserver: new cpuvector: size=2424832
I1114 11:18:27.375301 25792 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.375375 25792 ParameterServer2.cpp:306] pserver: new cpuvector: size=2588672
I1114 11:18:27.377532 25794 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.377636 25794 ParameterServer2.cpp:306] pserver: new cpuvector: size=2457600
I1114 11:18:27.380144 25790 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.380233 25790 ParameterServer2.cpp:306] pserver: new cpuvector: size=2621440
I1114 11:18:27.386515 25795 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.386530 25793 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.386603 25793 ParameterServer2.cpp:306] pserver: new cpuvector: size=2457600
I1114 11:18:27.386617 25795 ParameterServer2.cpp:306] pserver: new cpuvector: size=2686976
I1114 11:18:27.388229 25791 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.388288 25791 ParameterServer2.cpp:306] pserver: new cpuvector: size=2588672
I1114 11:18:27.433893 25789 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.433990 25789 ParameterServer2.cpp:306] pserver: new cpuvector: size=2588672
I1114 11:18:27.502650 25949 LightNetwork.cpp:322] worker started, peer = 10.87.143.36
I1114 11:18:27.502769 25950 LightNetwork.cpp:322] worker started, peer = 10.87.143.36
I1114 11:18:27.503219 25955 LightNetwork.cpp:322] worker started, peer = 10.87.143.36
I1114 11:18:27.503235 25956 LightNetwork.cpp:322] worker started, peer = 10.87.143.36
I1114 11:18:27.647248 25805 ParameterServer2.cpp:707] pserver: getParameter
I1114 11:18:27.647248 25803 ParameterServer2.cpp:707] pserver: getParameter
I1114 11:18:28.077597 25589 ParameterServer2.cpp:707] pserver: getParameter
I1114 11:18:28.077668 25588 ParameterServer2.cpp:707] pserver: getParameter
F1114 11:18:38.827989 25754 ProtoServer.h:224] Check failed: request.ParseFromString(str)
Contents of the paddle_trainer.INFO file:
Log file created at: 2017/11/14 11:17:17
Running on machine: ************
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1114 11:17:17.111040 20686 Util.cpp:166] commandline: ./paddle_trainer --num_gradient_servers=5 --trainer_id=3 --pservers=10.87.143.36,10.87.143.37,10.87.143.34,10.87.143.33,10.87.143.35 --rdma_tcp=tcp --nics=xgbe0 --port=30735 --ports_num=8 --dot_period=1 --load_missing_parameter_strategy=rand --test_all_data_in_one_period=1 --config_args=is_local=0 --log_period=1 --trainer_count=8 --num_passes=1 --saving_period=1 --ports_num_for_sparse=1 --local=0 --config=conf/trainer_config.conf --save_dir=./output --use_gpu=0
I1114 11:17:17.111369 20686 Util.cpp:134] Calling runInitFunctions
I1114 11:17:17.111743 20686 Util.cpp:148] Call runInitFunctions done.
I1114 11:17:17.126471 20686 TrainerConfigHelper.cpp:52] Parsing trainer config conf/trainer_config.conf
I1114 11:17:18.872766 20686 Trainer.cpp:162] trainer mode: SgdSparseCpuTraining
I1114 11:17:18.872820 20686 TrainerInternal.cpp:239] Sgd sparse training can not work with ConcurrentRemoteParameterUpdater, automatically reset --use_old_updater=true
I1114 11:17:18.873054 20770 Thread.h:271] SyncThreadPool worker thread 0
I1114 11:17:19.500422 20872 Thread.h:271] SyncThreadPool worker thread 0
I1114 11:17:19.500449 20873 Thread.h:271] SyncThreadPool worker thread 1
I1114 11:17:19.500493 20874 Thread.h:271] SyncThreadPool worker thread 2
I1114 11:17:19.500550 20875 Thread.h:271] SyncThreadPool worker thread 3
I1114 11:17:19.500596 20876 Thread.h:271] SyncThreadPool worker thread 4
I1114 11:17:19.500658 20877 Thread.h:271] SyncThreadPool worker thread 5
I1114 11:17:19.500699 20878 Thread.h:271] SyncThreadPool worker thread 6
I1114 11:17:19.500779 20879 Thread.h:271] SyncThreadPool worker thread 7
I1114 11:17:23.032294 20986 MultiGradientMachine.cpp:495] gradComputeThread 0
I1114 11:17:23.032326 20987 MultiGradientMachine.cpp:495] gradComputeThread 1
I1114 11:17:23.032397 20988 MultiGradientMachine.cpp:495] gradComputeThread 2
I1114 11:17:23.032454 20989 MultiGradientMachine.cpp:495] gradComputeThread 3
I1114 11:17:23.032506 20990 MultiGradientMachine.cpp:495] gradComputeThread 4
I1114 11:17:23.032559 20991 MultiGradientMachine.cpp:495] gradComputeThread 5
I1114 11:17:23.032613 20992 MultiGradientMachine.cpp:495] gradComputeThread 6
I1114 11:17:23.032645 20993 MultiGradientMachine.cpp:495] gradComputeThread 7
I1114 11:17:23.381920 20686 PyDataProvider2.cpp:243] loading dataprovider v1_data_provider3::processMultiViewTrainingData
I1114 11:17:47.634187 20686 PyDataProvider2.cpp:226] Instance 0x7fd5880e4b50 loaded.
I1114 11:17:47.634263 20686 PyDataProvider2.cpp:278] Provider Skip Shuffle 0
I1114 11:17:47.634366 20686 PyDataProvider2.cpp:317] Data header size 55
I1114 11:17:47.634376 20686 PyDataProvider2.cpp:319] Dim = 30493720 Type = 1 SeqType = 0
I1114 11:17:47.634382 20686 PyDataProvider2.cpp:319] Dim = 34 Type = 1 SeqType = 0
I1114 11:17:47.634724 20686 PyDataProvider2.cpp:319] Dim = 2511246 Type = 1 SeqType = 0
I1114 11:17:47.634732 20686 PyDataProvider2.cpp:319] Dim = 79717 Type = 1 SeqType = 0
I1114 11:17:47.634738 20686 PyDataProvider2.cpp:319] Dim = 625 Type = 1 SeqType = 0
I1114 11:17:47.634745 20686 PyDataProvider2.cpp:319] Dim = 85 Type = 1 SeqType = 0
I1114 11:17:47.634752 20686 PyDataProvider2.cpp:319] Dim = 1149072 Type = 1 SeqType = 0
I1114 11:17:47.634758 20686 PyDataProvider2.cpp:319] Dim = 3930 Type = 1 SeqType = 0
I1114 11:17:47.634765 20686 PyDataProvider2.cpp:319] Dim = 136 Type = 1 SeqType = 0
I1114 11:17:47.634773 20686 PyDataProvider2.cpp:319] Dim = 9864 Type = 1 SeqType = 0
I1114 11:17:47.634779 20686 PyDataProvider2.cpp:319] Dim = 1080 Type = 1 SeqType = 0
I1114 11:17:47.634785 20686 PyDataProvider2.cpp:319] Dim = 186030 Type = 1 SeqType = 0
I1114 11:17:47.634793 20686 PyDataProvider2.cpp:319] Dim = 32 Type = 0 SeqType = 0
I1114 11:17:47.634799 20686 PyDataProvider2.cpp:319] Dim = 2 Type = 3 SeqType = 0
I1114 11:17:47.634814 20686 PyDataProvider2.cpp:228] Py Field Done
I1114 11:17:47.635741 20686 GradientMachine.cpp:86] Initing parameters..
I1114 11:18:08.373350 20686 Parameter.cpp:103] ___fc_layer_0__.w0: initial_mean=0, initial_std=0.00018109
I1114 11:18:08.373468 20686 Parameter.cpp:103] ___fc_layer_0__.wbias: initial_mean=0, initial_std=0
I1114 11:18:08.373507 20686 Parameter.cpp:103] ___fc_layer_1__.w0: initial_mean=0, initial_std=0.171499
-------------------------
I1114 11:18:25.403275 20686 Parameter.cpp:103] ___fc_layer_70__.w0: initial_mean=0, initial_std=1
I1114 11:18:25.403285 20686 Parameter.cpp:103] ___fc_layer_70__.wbias: initial_mean=0, initial_std=0
I1114 11:18:25.403293 20686 GradientMachine.cpp:93] Init parameters done.
I1114 11:18:25.403503 20770 ParameterClient2.cpp:114] pserver 0 10.87.143.36:30743
I1114 11:18:25.403770 20770 ParameterClient2.cpp:114] pserver 1 10.87.143.37:30743
-----------------------------
I1114 11:18:25.643389 20686 ParameterClient2.cpp:114] pserver 38 10.87.143.35:30741
I1114 11:18:25.643450 20686 ParameterClient2.cpp:114] pserver 39 10.87.143.35:30742
I1114 11:18:27.408072 25944 Thread.h:271] SyncThreadPool worker thread 1
I1114 11:18:27.408226 25947 Thread.h:271] SyncThreadPool worker thread 4
I1114 11:18:27.408140 25945 Thread.h:271] SyncThreadPool worker thread 2
I1114 11:18:27.408181 25946 Thread.h:271] SyncThreadPool worker thread 3
I1114 11:18:27.408072 25943 Thread.h:271] SyncThreadPool worker thread 0
I1114 11:18:27.643769 25958 Thread.h:271] SyncThreadPool worker thread 1
I1114 11:18:27.643833 25959 Thread.h:271] SyncThreadPool worker thread 2
------------------------------------
I1114 11:18:27.645926 25996 Thread.h:271] SyncThreadPool worker thread 39
I1114 11:18:28.264272 20686 PyDataProvider2.cpp:409] Reseting 1
I1114 11:18:28.264405 20686 PyDataProvider2.cpp:429] Start new thread.
I1114 11:18:28.264582 26031 PyDataProvider2.cpp:335] Creating context
I1114 11:18:28.264701 26031 PyDataProvider2.cpp:346] Create context done
I1114 11:18:28.264725 20686 PyDataProvider2.cpp:436] Reset done