Model training on the cluster exits with an error after several passes
Created by: duanyuzhuo
I submitted a model training job to the MPI cluster. Partway through training it suddenly aborted and printed the following log:
Wed Jan 16 12:49:48 2019[1,0]<stderr>:[INFO 2019-01-16 12:49:48,651 train.py:110] Pass 11, Batch 140, Cost 1.168355, {'classification_error_evaluator': 0.267578125}
Wed Jan 16 12:50:13 2019[1,17]<stderr>:*** Aborted at 1547614213 (unix time) try "date -d @1547614213" if you are using GNU date ***
Wed Jan 16 12:50:13 2019[1,17]<stderr>:PC: @ 0x0 (unknown)
Wed Jan 16 12:50:13 2019[1,0]<stderr>:*** Aborted at 1547614213 (unix time) try "date -d @1547614213" if you are using GNU date ***
Wed Jan 16 12:50:13 2019[1,84]<stderr>:*** Aborted at 1547614213 (unix time) try "date -d @1547614213" if you are using GNU date ***
Wed Jan 16 12:50:13 2019[1,42]<stderr>:*** Aborted at 1547614213 (unix time) try "date -d @1547614213" if you are using GNU date ***
Wed Jan 16 12:50:13 2019[1,84]<stderr>:PC: @ 0x0 (unknown)
Wed Jan 16 12:50:13 2019[1,17]<stderr>:*** SIGSEGV (@0x8) received by PID 51898 (TID 0x7faae0dfa700) from PID 8; stack trace: ***
Wed Jan 16 12:50:13 2019[1,0]<stderr>:PC: @ 0x0 (unknown)
Wed Jan 16 12:50:13 2019[1,42]<stderr>:PC: @ 0x0 (unknown)
Wed Jan 16 12:50:13 2019[1,17]<stderr>: @ 0x7faeb2cf9160 (unknown)
Wed Jan 16 12:50:13 2019[1,3]<stderr>:*** Aborted at 1547614213 (unix time) try "date -d @1547614213" if you are using GNU date ***
Reading through the log, the first substantive error appears to be a network error like the following:
Wed Jan 16 12:50:13 2019[1,70]<stderr>:F0116 12:50:13.364187 34906 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.182.76.157: Connection reset by peer [104]
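Since stderr from every rank is interleaved, one way to find the earliest fatal record is to filter for glog lines with severity `F` and sort by their own timestamps. The following is a minimal sketch, assuming each line has the `[job,rank]<stderr>:` prefix seen above followed by a standard glog header; the function name `first_fatal` and the returned field names are hypothetical, and the comparison ignores the year.

```python
import re

# MPI-tagged stderr line wrapping a glog record, e.g.
# "...[1,70]<stderr>:F0116 12:50:13.364187 34906 SocketChannel.cpp:54] msg"
GLOG_LINE = re.compile(
    r"\[\d+,(?P<rank>\d+)\]<stderr>:"
    r"(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<source>[^\s\]]+:\d+)\] (?P<msg>.*)"
)


def first_fatal(lines):
    """Return the earliest glog FATAL record found in `lines`, or None."""
    fatals = []
    for line in lines:
        m = GLOG_LINE.search(line)
        if m and m.group("sev") == "F":
            fatals.append({
                "rank": int(m.group("rank")),
                "mmdd": m.group("mmdd"),
                "time": m.group("time"),
                "source": m.group("source"),
                "msg": m.group("msg"),
            })
    if not fatals:
        return None
    # Fixed-width date and time strings sort correctly as plain text.
    return min(fatals, key=lambda r: (r["mmdd"], r["time"]))
```

Called on the lines of the job's stderr file, this returns the rank, source location, and message of the first `F` record, which helps separate the original failure from the SIGSEGV traces that other ranks print afterwards.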
The final errors are as follows:
Wed Jan 16 12:50:14 2019[1,84]<stderr>:./train.sh: line 239: 40228 Segmentation fault python27-gcc482/bin/python conf/trainer_config.conf
Wed Jan 16 12:50:14 2019[1,84]<stderr>:+ '[' 139 -ne 0 ']'
Wed Jan 16 12:50:14 2019[1,84]<stderr>:+ kill_pserver2_exit
Wed Jan 16 12:50:14 2019[1,84]<stderr>:+ ps aux
Wed Jan 16 12:50:14 2019[1,84]<stderr>:+ grep paddle_pserver2
Wed Jan 16 12:50:14 2019[1,84]<stderr>:+ grep paddle_cluster_job
Wed Jan 16 12:50:14 2019[1,84]<stderr>:+ grep -v grep
Wed Jan 16 12:50:14 2019[1,84]<stderr>:+ cut -c10-14
Wed Jan 16 12:50:14 2019[1,84]<stderr>:+ xargs kill -9
Wed Jan 16 12:50:14 2019[1,84]<stderr>:usage: kill [ -s signal | -p ] [ -a ] pid ...
Wed Jan 16 12:50:14 2019[1,84]<stderr>: kill -l [ signal ]
Wed Jan 16 12:50:14 2019[1,84]<stderr>:+ log_fatal 'paddle_trainer failed kill paddle_pserver2 and exit'
Wed Jan 16 12:50:14 2019[1,84]<stderr>:+ echo '[./common.sh : 399] [kill_pserver2_exit]'
Wed Jan 16 12:50:14 2019[1,84]<stderr>:[./common.sh : 399] [kill_pserver2_exit]
Wed Jan 16 12:50:14 2019[1,84]<stderr>:+ echo '[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit'
Wed Jan 16 12:50:14 2019[1,84]<stderr>:[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit
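The `'[' 139 -ne 0 ']'` check in the trace above is testing the trainer's exit status. By shell convention, a status above 128 means the process was killed by signal `status - 128`, so 139 corresponds to signal 11, which matches the "Segmentation fault" line. A small sketch of that arithmetic (the helper name `describe_exit_status` is hypothetical):

```python
import signal


def describe_exit_status(status):
    """Interpret a shell exit status using the 128 + signal convention."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "unknown"
        return f"killed by signal {signum} ({name})"
    return f"exited with code {status}"


print(describe_exit_status(139))
```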
The model is a routine production job. I have checked the data and the code and found no problems.