Opened Nov 14, 2017 by saxon_zh (@saxon_zh, Guest)

Check failed: request.ParseFromString(str)

Created by: cszhou

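Problem description

I am running an MPI job with the v1 version of Paddle. As far as I can tell, the most important error message is: Check failed: request.ParseFromString(str). What causes this, and how should I fix it?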

train.log

Tue Nov 14 11:18:38 2017[1,1]<stderr>:*** Aborted at 1510629518 (unix time) try "date -d @1510629518" if you are using GNU date ***
Tue Nov 14 11:18:38 2017[1,1]<stderr>:PC: @                0x0 (unknown)
Tue Nov 14 11:18:38 2017[1,0]<stderr>:*** Aborted at 1510629518 (unix time) try "date -d @1510629518" if you are using GNU date ***
Tue Nov 14 11:18:38 2017[1,3]<stderr>:*** Aborted at 1510629518 (unix time) try "date -d @1510629518" if you are using GNU date ***
Tue Nov 14 11:18:38 2017[1,0]<stderr>:F1114 11:18:38.644645 26017 SocketChannel.cpp:101] Check failed: len > 0  peer=10.87.143.36 curIov=530659 iovCnt=9031075 iovs[curIov].base=0x7fc8c042bd8c iovs[curIov].iov_len=16: Connection reset by peer [104]
Tue Nov 14 11:18:38 2017[1,0]<stderr>:*** Check failure stack trace: ***
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0x91316d  google::LogMessage::Fail()
Tue Nov 14 11:18:38 2017[1,0]<stderr>:PC: @                0x0 (unknown)
Tue Nov 14 11:18:38 2017[1,0]<stderr>:*** SIGSEGV (@0x8) received by PID 20795 (TID 0x7fc61cdfa700) from PID 8; stack trace: ***
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0x916c1c  google::LogMessage::SendToLog()
Tue Nov 14 11:18:38 2017[1,3]<stderr>:PC: @                0x0 (unknown)
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0x912c93  google::LogMessage::Flush()
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @     0x7fc909125160 (unknown)
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0x912e99  google::LogMessage::~LogMessage()
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0x76b112 paddle::ProtoClient::recv()
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0x916147  google::ErrnoLogMessage::~ErrnoLogMessage()
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0x769b9c  paddle::readwritev<>()
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0xf0c1aa paddle::ParameterClient2::waitPassStart()
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0x76abad  paddle::SocketChannel::writeMessage()
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0x760890 paddle::RemoteParameterUpdater::controller()
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0x76b9ec  paddle::ProtoClient::send()
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0xf0ab5b  paddle::ParameterClient2::sendParallel()
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @     0x7fc908cbe8a0 execute_native_thread_routine
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0x734534  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @     0x7fc90911d1c3 start_thread
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @     0x7fc908cbe8a0  execute_native_thread_routine
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @     0x7fc90842f12d __clone
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @     0x7fc90911d1c3  start_thread
Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @                0x0 (unknown)
Tue Nov 14 11:18:38 2017[1,3]<stderr>:*** SIGSEGV (@0x8) received by PID 20686 (TID 0x7fd58af4a700) from PID 8; stack trace: ***
Tue Nov 14 11:18:38 2017[1,3]<stderr>:    @     0x7fd592f1c160 (unknown)
Tue Nov 14 11:18:38 2017[1,3]<stderr>:    @           0x76b112 paddle::ProtoClient::recv()
Tue Nov 14 11:18:38 2017[1,3]<stderr>:    @           0xf0c1aa paddle::ParameterClient2::waitPassStart()
Tue Nov 14 11:18:38 2017[1,3]<stderr>:    @           0x7614d9 paddle::SparseRemoteParameterUpdater::startPass()
Tue Nov 14 11:18:38 2017[1,3]<stderr>:    @           0x734534 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
Tue Nov 14 11:18:38 2017[1,1]<stderr>:*** SIGSEGV (@0x8) received by PID 20343 (TID 0x7f72c3b5e780) from PID 8; stack trace: ***
Tue Nov 14 11:18:38 2017[1,3]<stderr>:    @     0x7fd592ab58a0 execute_native_thread_routine
Tue Nov 14 11:18:38 2017[1,3]<stderr>:    @     0x7fd592f141c3 start_thread
Tue Nov 14 11:18:38 2017[1,3]<stderr>:    @     0x7fd59222612d __clone
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @     0x7f72c3319160 (unknown)
Tue Nov 14 11:18:38 2017[1,3]<stderr>:    @                0x0 (unknown)
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @           0x76b112 paddle::ProtoClient::recv()
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @           0xf0c1aa paddle::ParameterClient2::waitPassStart()
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @           0x75bdf3 paddle::RemoteParameterUpdater::startPass()
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @           0x74ebd6 paddle::SyncThreadPool::execPlusOwner()
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @           0x74edb7 paddle::ParameterUpdaterComposite::startPass()
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @           0x7463b4 paddle::Trainer::startTrainPass()
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @           0x7499a0 paddle::Trainer::trainOnePass()
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @           0x74a375 paddle::Trainer::train()
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @           0x5a3d70 main
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @     0x7f72c255dbd5 __libc_start_main
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @           0x5b2169 (unknown)
Tue Nov 14 11:18:38 2017[1,1]<stderr>:    @                0x0 (unknown)
Tue Nov 14 11:18:39 2017[1,3]<stderr>:train.sh: line 207: 20686 Segmentation fault      PYTHONPATH=./paddle:$PYTHONPATH GLOG_v=10000 GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Tue Nov 14 11:18:39 2017[1,3]<stderr>:+ '[' 139 -ne 0 ']'
Tue Nov 14 11:18:39 2017[1,3]<stderr>:+ kill_pserver2_exit
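
The stack traces above are hard to read because output from ranks 0, 1 and 3 is interleaved line by line. A minimal sketch for regrouping the lines by rank follows, assuming the `[1,N]<stderr>:` prefix is the job/rank tag added by Open MPI's `mpirun` (the `OMPI_COMM_WORLD_*` variables in train.sh suggest Open MPI, but this is an assumption). The names `TAG` and `group_by_rank` are hypothetical helpers, not part of Paddle.

```python
import re
from collections import defaultdict

# Assumed prefix shape, taken from the log above:
#   "Tue Nov 14 11:18:38 2017[1,3]<stderr>:message"
TAG = re.compile(
    r"^(?P<stamp>.*?)\[(?P<job>\d+),(?P<rank>\d+)\]"
    r"<(?P<stream>stdout|stderr)>:(?P<msg>.*)$"
)

def group_by_rank(lines):
    """Return {rank: [message, ...]}, keeping each rank's original line order."""
    groups = defaultdict(list)
    for line in lines:
        m = TAG.match(line.rstrip("\n"))
        if m:  # lines without the tag are ignored
            groups[int(m.group("rank"))].append(m.group("msg"))
    return dict(groups)

# A few lines copied from train.log above.
sample = [
    "Tue Nov 14 11:18:38 2017[1,1]<stderr>:PC: @                0x0 (unknown)",
    "Tue Nov 14 11:18:38 2017[1,0]<stderr>:*** Check failure stack trace: ***",
    "Tue Nov 14 11:18:38 2017[1,3]<stderr>:PC: @                0x0 (unknown)",
    "Tue Nov 14 11:18:38 2017[1,0]<stderr>:    @           0x91316d  google::LogMessage::Fail()",
]

by_rank = group_by_rank(sample)
for rank in sorted(by_rank):
    print(f"--- rank {rank} ---")
    print("\n".join(by_rank[rank]))
```

Note that rank 0's output still mixes two threads (the glog check failure and the SIGSEGV handler), so grouping by rank separates processes but not threads within a process.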

Contents of paddle_pserver2.INFO

Log file created at: 2017/11/14 11:16:24
Running on machine: ********************
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1114 11:16:24.435446 13258 Util.cpp:166] commandline: ./paddle_pserver2 --num_gradient_servers=5 --nics=xgbe0 --port=30735 --ports_num=8 --ports_num_for_sparse=1 --rdma_tcp=tcp --comment=paddle_cluster_job 
I1114 11:16:24.435662 13258 Util.cpp:134] Calling runInitFunctions
I1114 11:16:24.436022 13258 Util.cpp:148] Call runInitFunctions done.
I1114 11:16:24.436322 13258 ParameterServerController.cpp:83] number of parameterServer instances: 9
I1114 11:16:24.436332 13258 ParameterServerController.cpp:87] Starting parameterServer[0]
I1114 11:16:24.436380 13258 ParameterServerController.cpp:87] Starting parameterServer[1]
I1114 11:16:24.436410 13258 ParameterServerController.cpp:87] Starting parameterServer[2]
I1114 11:16:24.436503 13258 ParameterServerController.cpp:87] Starting parameterServer[3]
I1114 11:16:24.436501 13260 LightNetwork.cpp:269] tcp server start 
I1114 11:16:24.436506 13259 LightNetwork.cpp:269] tcp server start 
I1114 11:16:24.436542 13261 LightNetwork.cpp:269] tcp server start 
I1114 11:16:24.436559 13258 ParameterServerController.cpp:87] Starting parameterServer[4]
I1114 11:16:24.436645 13258 ParameterServerController.cpp:87] Starting parameterServer[5]
I1114 11:16:24.436647 13262 LightNetwork.cpp:269] tcp server start 
I1114 11:16:24.436727 13263 LightNetwork.cpp:269] tcp server start 
I1114 11:16:24.436731 13258 ParameterServerController.cpp:87] Starting parameterServer[6]
I1114 11:16:24.436810 13264 LightNetwork.cpp:269] tcp server start 
I1114 11:16:24.436815 13258 ParameterServerController.cpp:87] Starting parameterServer[7]
I1114 11:16:24.436872 13265 LightNetwork.cpp:269] tcp server start 
I1114 11:16:24.436897 13258 ParameterServerController.cpp:87] Starting parameterServer[8]
I1114 11:16:24.436949 13266 LightNetwork.cpp:269] tcp server start 
I1114 11:16:24.436975 13258 ParameterServerController.cpp:96] Waiting parameterServer[0]
I1114 11:16:24.437043 13267 LightNetwork.cpp:269] tcp server start 
I1114 11:18:21.855499 25554 LightNetwork.cpp:322] worker started, peer = 10.87.143.37
I1114 11:18:22.070394 25585 LightNetwork.cpp:322] worker started, peer = 10.87.143.37
---------------------------------
I1114 11:18:25.643290 25806 LightNetwork.cpp:322] worker started, peer = 10.87.143.33
I1114 11:18:25.643299 25807 LightNetwork.cpp:322] worker started, peer = 10.87.143.33
I1114 11:18:27.344852 25796 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.344966 25796 ParameterServer2.cpp:306] pserver: new cpuvector: size=2424832
I1114 11:18:27.375301 25792 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.375375 25792 ParameterServer2.cpp:306] pserver: new cpuvector: size=2588672
I1114 11:18:27.377532 25794 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.377636 25794 ParameterServer2.cpp:306] pserver: new cpuvector: size=2457600
I1114 11:18:27.380144 25790 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.380233 25790 ParameterServer2.cpp:306] pserver: new cpuvector: size=2621440
I1114 11:18:27.386515 25795 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.386530 25793 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.386603 25793 ParameterServer2.cpp:306] pserver: new cpuvector: size=2457600
I1114 11:18:27.386617 25795 ParameterServer2.cpp:306] pserver: new cpuvector: size=2686976
I1114 11:18:27.388229 25791 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.388288 25791 ParameterServer2.cpp:306] pserver: new cpuvector: size=2588672
I1114 11:18:27.433893 25789 ParameterServer2.cpp:260] pserver: setParameter
I1114 11:18:27.433990 25789 ParameterServer2.cpp:306] pserver: new cpuvector: size=2588672
I1114 11:18:27.502650 25949 LightNetwork.cpp:322] worker started, peer = 10.87.143.36
I1114 11:18:27.502769 25950 LightNetwork.cpp:322] worker started, peer = 10.87.143.36
I1114 11:18:27.503219 25955 LightNetwork.cpp:322] worker started, peer = 10.87.143.36
I1114 11:18:27.503235 25956 LightNetwork.cpp:322] worker started, peer = 10.87.143.36
I1114 11:18:27.647248 25805 ParameterServer2.cpp:707] pserver: getParameter
I1114 11:18:27.647248 25803 ParameterServer2.cpp:707] pserver: getParameter
I1114 11:18:28.077597 25589 ParameterServer2.cpp:707] pserver: getParameter
I1114 11:18:28.077668 25588 ParameterServer2.cpp:707] pserver: getParameter
F1114 11:18:38.827989 25754 ProtoServer.h:224] Check failed: request.ParseFromString(str) 
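
To locate the fatal entry and what the server was doing just before it, the log can be parsed using the line format stated in its own header (`[IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg`). The following is a minimal sketch under that assumption; `GLOG`, `parse` and `fatal_with_context` are hypothetical names introduced here for illustration.

```python
import re

# Pattern derived from the "Log line format" header of the file above.
GLOG = re.compile(
    r"^(?P<level>[IWEF])(?P<mmdd>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] ?(?P<msg>.*)$"
)

def parse(line):
    """Parse one glog line into a dict, or return None if it does not match."""
    m = GLOG.match(line.strip())
    return m.groupdict() if m else None

def fatal_with_context(lines, before=2):
    """Yield (fatal_record, preceding_records) for every F-level line."""
    parsed = [rec for rec in map(parse, lines) if rec]
    for i, rec in enumerate(parsed):
        if rec["level"] == "F":
            yield rec, parsed[max(0, i - before):i]

# The last three lines of paddle_pserver2.INFO shown above.
sample = [
    "I1114 11:18:28.077597 25589 ParameterServer2.cpp:707] pserver: getParameter",
    "I1114 11:18:28.077668 25588 ParameterServer2.cpp:707] pserver: getParameter",
    "F1114 11:18:38.827989 25754 ProtoServer.h:224] Check failed: request.ParseFromString(str) ",
]

results = list(fatal_with_context(sample))
for fatal, context in results:
    for rec in context:
        print("before:", rec["time"], rec["file"], rec["msg"])
    print("FATAL: ", fatal["time"], f'{fatal["file"]}:{fatal["line"]}', fatal["msg"])
```

Running the same function over the trainer logs makes it easier to compare fatal timestamps across machines, with the caveat that the hosts' clocks may not be synchronized.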

Contents of paddle_trainer.INFO

Log file created at: 2017/11/14 11:17:17
Running on machine: ************
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1114 11:17:17.111040 20686 Util.cpp:166] commandline: ./paddle_trainer --num_gradient_servers=5 --trainer_id=3 --pservers=10.87.143.36,10.87.143.37,10.87.143.34,10.87.143.33,10.87.143.35 --rdma_tcp=tcp --nics=xgbe0 --port=30735 --ports_num=8 --dot_period=1 --load_missing_parameter_strategy=rand --test_all_data_in_one_period=1 --config_args=is_local=0 --log_period=1 --trainer_count=8 --num_passes=1 --saving_period=1 --ports_num_for_sparse=1 --local=0 --config=conf/trainer_config.conf --save_dir=./output --use_gpu=0 
I1114 11:17:17.111369 20686 Util.cpp:134] Calling runInitFunctions
I1114 11:17:17.111743 20686 Util.cpp:148] Call runInitFunctions done.
I1114 11:17:17.126471 20686 TrainerConfigHelper.cpp:52] Parsing trainer config conf/trainer_config.conf
I1114 11:17:18.872766 20686 Trainer.cpp:162] trainer mode: SgdSparseCpuTraining
I1114 11:17:18.872820 20686 TrainerInternal.cpp:239] Sgd sparse training can not work with ConcurrentRemoteParameterUpdater, automatically reset --use_old_updater=true
I1114 11:17:18.873054 20770 Thread.h:271] SyncThreadPool worker thread 0
I1114 11:17:19.500422 20872 Thread.h:271] SyncThreadPool worker thread 0
I1114 11:17:19.500449 20873 Thread.h:271] SyncThreadPool worker thread 1
I1114 11:17:19.500493 20874 Thread.h:271] SyncThreadPool worker thread 2
I1114 11:17:19.500550 20875 Thread.h:271] SyncThreadPool worker thread 3
I1114 11:17:19.500596 20876 Thread.h:271] SyncThreadPool worker thread 4
I1114 11:17:19.500658 20877 Thread.h:271] SyncThreadPool worker thread 5
I1114 11:17:19.500699 20878 Thread.h:271] SyncThreadPool worker thread 6
I1114 11:17:19.500779 20879 Thread.h:271] SyncThreadPool worker thread 7
I1114 11:17:23.032294 20986 MultiGradientMachine.cpp:495] gradComputeThread 0
I1114 11:17:23.032326 20987 MultiGradientMachine.cpp:495] gradComputeThread 1
I1114 11:17:23.032397 20988 MultiGradientMachine.cpp:495] gradComputeThread 2
I1114 11:17:23.032454 20989 MultiGradientMachine.cpp:495] gradComputeThread 3
I1114 11:17:23.032506 20990 MultiGradientMachine.cpp:495] gradComputeThread 4
I1114 11:17:23.032559 20991 MultiGradientMachine.cpp:495] gradComputeThread 5
I1114 11:17:23.032613 20992 MultiGradientMachine.cpp:495] gradComputeThread 6
I1114 11:17:23.032645 20993 MultiGradientMachine.cpp:495] gradComputeThread 7
I1114 11:17:23.381920 20686 PyDataProvider2.cpp:243] loading dataprovider v1_data_provider3::processMultiViewTrainingData
I1114 11:17:47.634187 20686 PyDataProvider2.cpp:226] Instance 0x7fd5880e4b50 loaded.
I1114 11:17:47.634263 20686 PyDataProvider2.cpp:278] Provider Skip Shuffle 0
I1114 11:17:47.634366 20686 PyDataProvider2.cpp:317] Data header size 55
I1114 11:17:47.634376 20686 PyDataProvider2.cpp:319] Dim = 30493720 Type = 1 SeqType = 0
I1114 11:17:47.634382 20686 PyDataProvider2.cpp:319] Dim = 34 Type = 1 SeqType = 0
I1114 11:17:47.634724 20686 PyDataProvider2.cpp:319] Dim = 2511246 Type = 1 SeqType = 0
I1114 11:17:47.634732 20686 PyDataProvider2.cpp:319] Dim = 79717 Type = 1 SeqType = 0
I1114 11:17:47.634738 20686 PyDataProvider2.cpp:319] Dim = 625 Type = 1 SeqType = 0
I1114 11:17:47.634745 20686 PyDataProvider2.cpp:319] Dim = 85 Type = 1 SeqType = 0
I1114 11:17:47.634752 20686 PyDataProvider2.cpp:319] Dim = 1149072 Type = 1 SeqType = 0
I1114 11:17:47.634758 20686 PyDataProvider2.cpp:319] Dim = 3930 Type = 1 SeqType = 0
I1114 11:17:47.634765 20686 PyDataProvider2.cpp:319] Dim = 136 Type = 1 SeqType = 0
I1114 11:17:47.634773 20686 PyDataProvider2.cpp:319] Dim = 9864 Type = 1 SeqType = 0
I1114 11:17:47.634779 20686 PyDataProvider2.cpp:319] Dim = 1080 Type = 1 SeqType = 0
I1114 11:17:47.634785 20686 PyDataProvider2.cpp:319] Dim = 186030 Type = 1 SeqType = 0
I1114 11:17:47.634793 20686 PyDataProvider2.cpp:319] Dim = 32 Type = 0 SeqType = 0
I1114 11:17:47.634799 20686 PyDataProvider2.cpp:319] Dim = 2 Type = 3 SeqType = 0
I1114 11:17:47.634814 20686 PyDataProvider2.cpp:228] Py Field Done
I1114 11:17:47.635741 20686 GradientMachine.cpp:86] Initing parameters..
I1114 11:18:08.373350 20686 Parameter.cpp:103] ___fc_layer_0__.w0: initial_mean=0, initial_std=0.00018109
I1114 11:18:08.373468 20686 Parameter.cpp:103] ___fc_layer_0__.wbias: initial_mean=0, initial_std=0
I1114 11:18:08.373507 20686 Parameter.cpp:103] ___fc_layer_1__.w0: initial_mean=0, initial_std=0.171499
-------------------------
I1114 11:18:25.403275 20686 Parameter.cpp:103] ___fc_layer_70__.w0: initial_mean=0, initial_std=1
I1114 11:18:25.403285 20686 Parameter.cpp:103] ___fc_layer_70__.wbias: initial_mean=0, initial_std=0
I1114 11:18:25.403293 20686 GradientMachine.cpp:93] Init parameters done.
I1114 11:18:25.403503 20770 ParameterClient2.cpp:114] pserver 0 10.87.143.36:30743
I1114 11:18:25.403770 20770 ParameterClient2.cpp:114] pserver 1 10.87.143.37:30743
-----------------------------
I1114 11:18:25.643389 20686 ParameterClient2.cpp:114] pserver 38 10.87.143.35:30741
I1114 11:18:25.643450 20686 ParameterClient2.cpp:114] pserver 39 10.87.143.35:30742
I1114 11:18:27.408072 25944 Thread.h:271] SyncThreadPool worker thread 1
I1114 11:18:27.408226 25947 Thread.h:271] SyncThreadPool worker thread 4
I1114 11:18:27.408140 25945 Thread.h:271] SyncThreadPool worker thread 2
I1114 11:18:27.408181 25946 Thread.h:271] SyncThreadPool worker thread 3
I1114 11:18:27.408072 25943 Thread.h:271] SyncThreadPool worker thread 0
I1114 11:18:27.643769 25958 Thread.h:271] SyncThreadPool worker thread 1
I1114 11:18:27.643833 25959 Thread.h:271] SyncThreadPool worker thread 2
------------------------------------
I1114 11:18:27.645926 25996 Thread.h:271] SyncThreadPool worker thread 39
I1114 11:18:28.264272 20686 PyDataProvider2.cpp:409] Reseting 1
I1114 11:18:28.264405 20686 PyDataProvider2.cpp:429] Start new thread.
I1114 11:18:28.264582 26031 PyDataProvider2.cpp:335] Creating context
I1114 11:18:28.264701 26031 PyDataProvider2.cpp:346] Create context done
I1114 11:18:28.264725 20686 PyDataProvider2.cpp:436] Reset done
Reference: paddlepaddle/Paddle#5618