SSD distributed training: Floating point exception
Created by: ellinyang
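1) Paddle version: Paddle v2
2) GPU: Tesla P4
3) System environment: Linux version 3.10.0_3-1-0-3 (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) )
Training info: multi-GPU; image size 300; initial learning rate 0.00025
SSD trains without problems on a single machine, but with distributed training the following error appears after a few batches: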
................................Thread [140458398054144] Forwarding detection_output,
*** Aborted at 1558425098 (unix time) try "date -d @1558425098" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGFPE (@0x7fbf03bc1a18) received by PID 52 (TID 0x7fbf04ebf700) from PID 62659096; stack trace: ***
@ 0x7fbf04897390 (unknown)
@ 0x7fbf03bc1a18 __expf_finite
@ 0x7fbf03bc942f expf
@ 0x7fbeb51d4899 paddle::decodeBBoxWithVar()
@ 0x7fbeb51debd1 paddle::DetectionOutputLayer::forward()
@ 0x7fbeb523ffcd paddle::NeuralNetwork::forward()
@ 0x7fbeb5240ce3 paddle::GradientMachine::forwardBackward()
@ 0x7fbeb5516044 GradientMachine::forwardBackward()
@ 0x7fbeb50b1c79 _wrap_GradientMachine_forwardBackward
@ 0x4cb755 PyEval_EvalFrameEx
@ 0x4c2705 PyEval_EvalCodeEx
@ 0x4ca7df PyEval_EvalFrameEx
@ 0x4c2705 PyEval_EvalCodeEx
@ 0x4ca088 PyEval_EvalFrameEx
@ 0x4c2705 PyEval_EvalCodeEx
@ 0x4ca088 PyEval_EvalFrameEx
@ 0x4c9d7f PyEval_EvalFrameEx
@ 0x4c9d7f PyEval_EvalFrameEx
@ 0x4c9d7f PyEval_EvalFrameEx
@ 0x4c2705 PyEval_EvalCodeEx
@ 0x4c24a9 PyEval_EvalCode
@ 0x4f19ef (unknown)
@ 0x4ec372 PyRun_FileExFlags
@ 0x4eaaf1 PyRun_SimpleFileExFlags
@ 0x49e208 Py_Main
@ 0x7fbf044dc830 __libc_start_main
@ 0x49da59 _start
@ 0x0 (unknown)
Floating point exception
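
One possible reading of the trace: the SIGFPE is raised inside expf, called from paddle::decodeBBoxWithVar(), so the exponent may have been non-finite or too large for float32. Below is a minimal diagnostic sketch, not Paddle code. It assumes SSD-style decoding, where the box width and height are computed as exp(variance * loc) times the prior size. The function name find_unsafe_exp_args and the default variances (0.1, 0.1, 0.2, 0.2) are assumptions made for illustration; substitute the values from the actual priorbox config.

```python
import math

# Largest x for which exp(x) still fits in float32 (ln of float32 max, about 88.72).
FLOAT32_EXP_LIMIT = math.log(3.4028234663852886e38)


def find_unsafe_exp_args(loc_preds, variances=(0.1, 0.1, 0.2, 0.2)):
    """Return indices of location predictions whose width/height exponent
    is NaN, Inf, or large enough to overflow float32 exp.

    loc_preds: iterable of (dx, dy, dw, dh) tuples (hypothetical layout).
    variances: prior box variances (assumed defaults, not read from Paddle).
    """
    bad = []
    for i, (dx, dy, dw, dh) in enumerate(loc_preds):
        # Only the width and height terms go through exp() in SSD-style decoding.
        for var, delta in ((variances[2], dw), (variances[3], dh)):
            arg = var * delta
            if math.isnan(arg) or math.isinf(arg) or arg > FLOAT32_EXP_LIMIT:
                bad.append(i)
                break
    return bad


if __name__ == "__main__":
    preds = [
        (0.0, 0.0, 0.5, 0.5),           # normal prediction
        (0.0, 0.0, float("nan"), 0.0),  # NaN, e.g. after the loss diverged
        (0.0, 0.0, 1000.0, 0.0),        # 0.2 * 1000 = 200, overflows float32 exp
    ]
    print(find_unsafe_exp_args(preds))
```

If a check like this flags values on the distributed run but not on the single-machine run, the parameters may be diverging during distributed training (for example, because of the effective learning rate), rather than the detection_output layer itself being at fault. This is only a hypothesis to test.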