Cluster train on GPU at Baidu , paddle_pserver2 exited with SIGSEGV when starting.
Created by: xdsoul
GPU集群模式,采用单机2卡进行测试。显示pserver启动时core,未进入真正的training,因为没有调用 dataprovider的init hook function。 部分日志如下,详细请帮忙加hi联系。
=============================================
- ./paddle_pserver2 --num_gradient_servers=1 --nics=eth2 --port=7164 --ports_num=8 --rdma_tcp=tcp --comment=paddle_cluster_job *** Aborted at 1513913615 (unix time) try "date -d @1513913615" if you are using GNU date *** PC: @ 0x0 (unknown) *** SIGSEGV (@0x1170) received by PID 13236 (TID 0x7f973a6f4700) from PID 4464; stack trace: *** @ 0x7f97443ea160 (unknown) @ 0x7f974209f74f (unknown) @ 0x7f97421f8ee4 (unknown) @ 0x7f9742192588 (unknown) @ 0x7f97443e21c3 start_thread @ 0x7f97430cd12d __clone @ 0x0 (unknown) ./start_server.sh: line 33: 13236 Segmentation fault GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_pserver2 --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --nics=${nics} ${server_arg} --rdma_tcp=${rdma_tcp} --comment=$comment
- check_return 'paddle_pserver2 failed'
- '[' 139 -ne 0 ']'
- echo '[./start_server.sh : 34] [main]' [./start_server.sh : 34] [main]
- echo '[FATAL]: paddle_pserver2 failed' [FATAL]: paddle_pserver2 failed
- get_stack
- set +x
*Shell Script Stack Trace @: [./log.sh: 61] check_return @: [./start_server.sh: 34] main
- exit 1
Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[58121,1],0] Exit code: 1