mpirun train.sh failed
Created by: lxk1990727
http://nmg01-hpc-hlan-mon.dmop.baidu.com:8090/job/i-58508/

[11-13 10:24:40] [0] + check_return 'mpirun setup.sh failed'
[11-13 10:24:40] [0] + '[' 0 -ne 0 ']'
[11-13 10:24:40] [0] + '[' -n '' -o -n '' ']'
[11-13 10:24:40] [0] + '[' -d ./thirdparty/thirdparty -a -f ./thirdparty/thirdparty/before_hook.sh ']'
[11-13 10:24:40] [0] + check_return 'mpirun start_server.sh failed'
[11-13 10:24:40] [0] + mpirun ./start_server.sh cpu
[11-13 10:24:40] [0] + '[' 0 -ne 0 ']'
[11-13 10:24:40] [0] + '[' -n '' ']'
[11-13 10:24:40] [0] + '[' -n '' ']'
[11-13 10:24:40] [0] + sleep 20
[11-13 10:25:00] [0] + '[' -n /app/ecom/fcr/lixiaokang04/paddle/data/text_cls/test ']'
[11-13 10:25:00] [0] + check_return 'sh ./test.sh failed'
[11-13 10:25:00] [0] + mpirun ./test.sh
[11-13 10:25:00] [0] + python ./checkslownode.py job.58508.instances.nmg01-hpc-mix-hl0654.nmg01.baidu.com wanghuaidong@baidu.com
[11-13 10:25:00] [0] + check_return 'run check slow node daemon failed'
[11-13 10:25:00] [0] + '[' 0 -ne 0 ']'
[11-13 10:25:02] [0] + check_return 'mpirun train.sh failed'
[11-13 10:25:02] [0] + '[' 1 -ne 0 ']'
[11-13 10:25:02] [0] + echo '[job.sh : 130] [main]'
[11-13 10:25:02] [0] [job.sh : 130] [main]
[11-13 10:25:02] [0] + echo '[FATAL]: mpirun train.sh failed'
[11-13 10:25:02] [0] [FATAL]: mpirun train.sh failed
[11-13 10:25:02] [0] + get_stack
[11-13 10:25:02] [0] + set +x
[11-13 10:25:02] [0]
[11-13 10:25:02] [0] *Shell Script Stack Trace
[11-13 10:25:02] [0] @: [./log.sh: 61] check_return
[11-13 10:25:02] [0] @: [job.sh: 130] main
[11-13 10:25:02] [0]
[11-13 10:25:02] [0] + exit 1
[11-13 10:25:02] [9] [nmg01-hpc-mix-hl0655.nmg01.baidu.com:03810] [[23213,0],9] routed:binomial: Connection to lifeline [[23213,0],0] lost
[11-13 10:25:02] [5] [nmg01-hpc-mix-hl0409.nmg01.baidu.com:02486] [[23213,0],5] routed:binomial: Connection to lifeline [[23213,0],0] lost
[11-13 10:25:02] [2] [nmg01-taihang-d10643.nmg01.baidu.com:14227] [[23213,0],2] routed:binomial: Connection to lifeline [[23213,0],0] lost
[11-13 10:25:02] [7] [nmg01-hpc-mix-hl0651.nmg01.baidu.com:18379] [[23213,0],7] routed:binomial: Connection to lifeline [[23213,0],0] lost
[11-13 10:25:02] [4] [nmg01-taihang-d10646.nmg01.baidu.com:09900] [[23213,0],4] routed:binomial: Connection to lifeline [[23213,0],0] lost
[11-13 10:25:02] [1] [nmg01-taihang-d10600.nmg01.baidu.com:08310] [[23213,0],1] routed:binomial: Connection to lifeline [[23213,0],0] lost
[11-13 10:25:02] [8] [nmg01-hpc-mix-hlan2100.nmg01.baidu.com:07147] [[23213,0],8] routed:binomial: Connection to lifeline [[23213,0],0] lost
[11-13 10:25:02] [3] [nmg01-taihang-d10645.nmg01.baidu.com:22179] [[23213,0],3] routed:binomial: Connection to lifeline [[23213,0],0] lost
[11-13 10:25:02] [6] [nmg01-hpc-mix-hl0650.nmg01.baidu.com:12024] [[23213,0],6] routed:binomial: Connection to lifeline [[23213,0],0] lost
[11-13 10:25:02] [0] [ERROR] failed to run user_main 'sh -x job.sh', task status is Status:kTaskStatusExited, task return 1