Using PrecisionRecallEvaluator in MPI cluster mode
Created by: MonkandMonkey
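Version: paddle.v2
A simple CNN text classification script that uses PrecisionRecallEvaluator.
Symptom: it runs fine locally but fails in MPI cluster mode. workspace/train.log reports the following error: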
Tue Jun 12 14:45:53 2018[1,0]:2018-06-12 14:45:53:Pass 0, Batch 0, Cost 0.8217, Precision 0.6929, recall: 0.5063, F1: 0.5851
Tue Jun 12 14:45:53 2018[1,5]:
Tue Jun 12 14:45:53 2018[1,5]:2018-06-12 14:45:53:Pass 0, Batch 0, Cost 0.7998, Precision 0.7063, recall: 0.5132, F1: 0.5945
Tue Jun 12 14:45:53 2018[1,2]:
Tue Jun 12 14:45:53 2018[1,2]:2018-06-12 14:45:53:Pass 0, Batch 0, Cost 0.8085, Precision 0.7087, recall: 0.5067, F1: 0.5909
Tue Jun 12 14:45:53 2018[1,3]:
Tue Jun 12 14:45:53 2018[1,3]:2018-06-12 14:45:53:Pass 0, Batch 0, Cost 0.7902, Precision 0.7047, recall: 0.5066, F1: 0.5894
Tue Jun 12 14:45:53 2018[1,4]:
Tue Jun 12 14:45:53 2018[1,4]:2018-06-12 14:45:53:Pass 0, Batch 0, Cost 0.7944, Precision 0.7126, recall: 0.5068, F1: 0.5923
Tue Jun 12 14:45:53 2018[1,6]:
Tue Jun 12 14:45:53 2018[1,6]:2018-06-12 14:45:53:Pass 0, Batch 0, Cost 0.8456, Precision 0.6797, recall: 0.5000, F1: 0.5762
Tue Jun 12 14:45:53 2018[1,7]:
Tue Jun 12 14:45:53 2018[1,7]:2018-06-12 14:45:53:Pass 0, Batch 0, Cost 0.8093, Precision 0.6953, recall: 0.5000, F1: 0.5817
Tue Jun 12 14:45:53 2018[1,8]:
Tue Jun 12 14:45:53 2018[1,8]:2018-06-12 14:45:53:Pass 0, Batch 0, Cost 0.8057, Precision 0.6992, recall: 0.5000, F1: 0.5831
Tue Jun 12 14:45:53 2018[1,9]:
Tue Jun 12 14:45:53 2018[1,9]:2018-06-12 14:45:53:Pass 0, Batch 0, Cost 0.7696, Precision 0.7205, recall: 0.5069, F1: 0.5951
Tue Jun 12 14:45:53 2018[1,1]:
Tue Jun 12 14:45:53 2018[1,1]:2018-06-12 14:45:53:Pass 0, Batch 0, Cost 0.8540, Precision 0.6758, recall: 0.5000, F1: 0.5748
Tue Jun 12 14:46:43 2018[1,1]:*** Aborted at 1528786003 (unix time) try "date -d @1528786003" if you are using GNU date ***
Tue Jun 12 14:46:43 2018[1,1]:PC: @ 0x0 (unknown)
Tue Jun 12 14:46:43 2018[1,1]:*** SIGFPE (@0x7fc8a7795aa4) received by PID 3354 (TID 0x7fc8cdf48700) from PID 18446744072224332452; stack trace: ***
Tue Jun 12 14:46:43 2018[1,1]: @ 0x7fc8cdb1f160 (unknown)
Tue Jun 12 14:46:43 2018[1,1]: @ 0x7fc8a7795aa4 paddle::PrecisionRecallEvaluator::storeLocalValues()
Tue Jun 12 14:46:43 2018[1,1]: @ 0x7fc8a7795f90 paddle::PrecisionRecallEvaluator::getNames()
Tue Jun 12 14:46:43 2018[1,1]: @ 0x7fc8a7772298 paddle::CombinedEvaluator::getNames()
Tue Jun 12 14:46:43 2018[1,1]: @ 0x7fc8a798c8f0 Evaluator::getNames()
Tue Jun 12 14:46:43 2018[1,1]: @ 0x7fc8a76466f3 _wrap_Evaluator_getNames
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4b6033 PyEval_EvalFrameEx
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4b5fb8 PyEval_EvalFrameEx
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4b6b28 PyEval_EvalCodeEx
Tue Jun 12 14:46:43 2018[1,1]: @ 0x529340 function_call
Tue Jun 12 14:46:43 2018[1,1]: @ 0x422e38 PyObject_CallFunction
Tue Jun 12 14:46:43 2018[1,1]: @ 0x45ffda PyObject_GenericGetAttr
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4b12a9 PyEval_EvalFrameEx
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4b6b28 PyEval_EvalCodeEx
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4b5d10 PyEval_EvalFrameEx
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4b6b28 PyEval_EvalCodeEx
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4b5d10 PyEval_EvalFrameEx
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4b6b28 PyEval_EvalCodeEx
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4b5d10 PyEval_EvalFrameEx
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4b6b28 PyEval_EvalCodeEx
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4b6c52 PyEval_EvalCode
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4e1c7d PyRun_FileExFlags
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4e3501 PyRun_SimpleFileExFlags
Tue Jun 12 14:46:43 2018[1,1]: @ 0x4159dd Py_Main
Tue Jun 12 14:46:43 2018[1,1]: @ 0x7fc8cd079bd5 __libc_start_main
Tue Jun 12 14:46:43 2018[1,1]: @ 0x414b71 (unknown)
Tue Jun 12 14:46:43 2018[1,1]: @ 0x0 (unknown)
Tue Jun 12 14:46:43 2018[1,1]:./train.sh: line 239: 3354 Floating point exception python27-gcc482/bin/python conf/trainer_config.conf
Tue Jun 12 14:46:43 2018[1,1]:+ '[' 136 -ne 0 ']'
Tue Jun 12 14:46:43 2018[1,1]:+ kill_pserver2_exit
Tue Jun 12 14:46:43 2018[1,1]:+ ps aux
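Cause: this looks like a problem with PrecisionRecallEvaluator in cluster mode, because the job runs normally when I switch to AUCEvaluator. I have already tried reordering the definitions of the optimizer and cnn_net() as suggested in #2563, but the problem persists.
Could someone from the Paddle team please take a look? Thanks!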
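For context, the stack trace shows the SIGFPE being raised inside PrecisionRecallEvaluator::storeLocalValues(). One possible explanation, which I have not confirmed against the Paddle source, is an unguarded division when a count used as a denominator is zero. The sketch below is not Paddle code. It is a plain-Python illustration of how precision, recall, and F1 are usually derived from counts and where a zero denominator can appear. The function name precision_recall_f1 and its arguments are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute (precision, recall, F1) from raw counts.

    Hypothetical helper for illustration only; it does not mirror
    Paddle's internal implementation. Each denominator is checked
    before dividing, so zero counts yield 0.0 instead of an error.
    """
    predicted_positive = tp + fp  # denominator of precision
    actual_positive = tp + fn     # denominator of recall

    precision = tp / float(predicted_positive) if predicted_positive else 0.0
    recall = tp / float(actual_positive) if actual_positive else 0.0

    pr_sum = precision + recall   # denominator of F1
    f1 = 2.0 * precision * recall / pr_sum if pr_sum else 0.0
    return precision, recall, f1


# An evaluator that has accumulated no samples has all counts at zero.
empty = precision_recall_f1(0, 0, 0)

# A small non-degenerate case: 2 true positives, 2 false positives.
normal = precision_recall_f1(2, 2, 0)
```

Without the guards, the all-zero case divides by zero. In Python that raises ZeroDivisionError; in native code an integer division by zero typically surfaces as SIGFPE, which would be consistent with the trace above, although that remains an assumption here.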