Python training config script runs locally but fails with SIGFPE on the MPI cluster
Created by: youan1
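As the title says: the paddle script runs successfully on my local machine but fails on the MPI cluster. The job link is http://10.86.102.41:8900/fileview.html?path=/home/disk1/normandy/maybach/329760/ and the script is attached.
The error is as follows: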
r>:+ python27-gcc482/bin/python conf/trainer_config.conf
Fri Aug 18 12:44:11 2017[1,93]:Thread [140666704901888] Forwarding fc_layer_4,
Fri Aug 18 12:44:11 2017[1,93]:*** Aborted at 1503031451 (unix time) try "date -d @1503031451" if you are using GNU date ***
Fri Aug 18 12:44:11 2017[1,93]:PC: @ 0x0 (unknown)
Fri Aug 18 12:44:11 2017[1,93]:*** SIGFPE (@0x7fef7e79ba90) received by PID 2986 (TID 0x7fef84fa3700) from PID 2121906832; stack trace: ***
Fri Aug 18 12:44:11 2017[1,93]: @ 0x7fef84b7a160 (unknown)
Fri Aug 18 12:44:11 2017[1,93]: @ 0x7fef7e79ba90 paddle::CpuMatrix::mul<>()
Fri Aug 18 12:44:11 2017[1,93]: @ 0x7fef7e797ed3 paddle::CpuMatrix::mul()
Fri Aug 18 12:44:11 2017[1,93]: @ 0x7fef7e5ae59a paddle::FullyConnectedLayer::forward()
Fri Aug 18 12:44:11 2017[1,93]: @ 0x7fef7e61c562 paddle::NeuralNetwork::forward()
Fri Aug 18 12:44:11 2017[1,93]: @ 0x7fef7e6104c3 paddle::GradientMachine::forwardBackward()
Fri Aug 18 12:44:12 2017[1,93]: @ 0x7fef7e8381c4 GradientMachine::forwardBackward()
Fri Aug 18 12:44:12 2017[1,93]: @ 0x7fef7e4e51d9 _wrap_GradientMachine_forwardBackward
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4b4cb9 PyEval_EvalFrameEx
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4b6b28 PyEval_EvalCodeEx
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4b5d10 PyEval_EvalFrameEx
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4b6b28 PyEval_EvalCodeEx
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4b5d10 PyEval_EvalFrameEx
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4b6b28 PyEval_EvalCodeEx
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4b5d10 PyEval_EvalFrameEx
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4b6b28 PyEval_EvalCodeEx
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4b5d10 PyEval_EvalFrameEx
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4b6b28 PyEval_EvalCodeEx
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4b6c52 PyEval_EvalCode
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4e1c7d PyRun_FileExFlags
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4e3501 PyRun_SimpleFileExFlags
Fri Aug 18 12:44:12 2017[1,93]: @ 0x4159dd Py_Main
Fri Aug 18 12:44:12 2017[1,93]: @ 0x7fef840d4bd5 __libc_start_main
Fri Aug 18 12:44:12 2017[1,93]: @ 0x414b71 (unknown)
Fri Aug 18 12:44:12 2017[1,93]: @ 0x0 (unknown)
Fri Aug 18 12:44:13 2017[1,93]:./train.sh: line 239: 2986 Floating point exception python27-gcc482/bin/python conf/trainer_config.conf
Fri Aug 18 12:44:13 2017[1,93]:+ '[' 136 -ne 0 ']'
Fri Aug 18 12:44:13 2017[1,93]:+ kill_pserver2_exit
Fri Aug 18 12:44:13 2017[1,93]:+ grep -v grep
Fri Aug 18 12:44:13 2017[1,93]:+ xargs kill -9
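
For reference, the `'[' 136 -ne 0 ']'` check in the log is consistent with the SIGFPE message: shells report a process killed by signal N as exit status 128 + N, and SIGFPE is signal 8 on Linux. The snippet below is only a minimal sketch for decoding such a status; the helper name `signal_from_exit_status` is hypothetical and not part of paddle or `train.sh`, and it uses Python 3's `signal.Signals`, which is not available in the python27 build used by the job.

```python
import signal


def signal_from_exit_status(status):
    """Return the signal name encoded in a shell exit status, or None.

    Hypothetical helper: assumes the usual shell convention that a
    process terminated by signal N is reported as exit status 128 + N.
    """
    if status <= 128:
        return None  # ordinary exit code, not a signal
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


if __name__ == "__main__":
    # 136 is the status tested by train.sh in the log above.
    print(signal_from_exit_status(136))
```

Note that SIGFPE is also raised for integer division by zero, not only for floating-point faults, so the signal by itself does not say which kind of arithmetic error occurred inside `paddle::CpuMatrix::mul`.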