Segmentation fault when training with RecordIO and ParallelExecutor
Created by: zzhzz
While using RecordIO together with ParallelExecutor to speed up training, the process crashed with a segmentation fault. The error message is as follows:
*** Aborted at 1539160971 (unix time) try "date -d @1539160971" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x7f3000000002) received by PID 51269 (TID 0x7f305c3ac700) from PID 2; stack trace: ***
@ 0x7f305bb7e7e0 (unknown)
@ 0x7f3000000002 (unknown)
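The abort banner reports the crash time as a Unix timestamp, which can be decoded to line it up with the timestamps in the Paddle log. The snippet below is a minimal sketch of that conversion; the helper name `decode_abort_time` is hypothetical, and interpreting the value as UTC is an assumption about how the logging host is configured.

```python
from datetime import datetime, timezone


def decode_abort_time(unix_seconds: int) -> str:
    """Convert the Unix time from the abort banner to a UTC string.

    Hypothetical helper, equivalent in spirit to `date -u -d @<seconds>`.
    """
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# Timestamp copied from the "Aborted at ..." line of the report.
print(decode_abort_time(1539160971))  # 2018-10-10 08:42:51
```

The decoded time, 08:42:51 on October 10, falls in the same second as the last `I1010 08:42:51` entries in the log, which suggests the crash happened right after the final `TensorCopy` lines were written.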
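The network is a word embedding model. I enabled Paddle's logging by setting an environment variable; the portion of the log just before the crash is as follows: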
I1010 08:42:51.656551 51287 operator.cc:130] CUDAPlace(0) Op(adam), inputs:{Beta1Pow[beta1_pow_acc_3:float1], Beta2Pow[beta2_pow_acc_3:float1], Grad[fc_1.b_0@GRAD:float173], LearningRate[learning_rate_0:float1], Moment1[moment1_3:float173], Moment2[moment2_3:float173], Param[fc_1.b_0:float173]}, outputs:{Moment1Out[moment1_3173], Moment2Out[moment2_3173], ParamOut[fc_1.b_0173]}.
I1010 08:42:51.656599 51287 operator.cc:663] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1010 08:42:51.656657 51287 operator.cc:142] CUDAPlace(0) Op(adam), inputs:{Beta1Pow[beta1_pow_acc_3:float1], Beta2Pow[beta2_pow_acc_3:float1], Grad[fc_1.b_0@GRAD:float173], LearningRate[learning_rate_0:float1], Moment1[moment1_3:float173], Moment2[moment2_3:float173], Param[fc_1.b_0:float173]}, outputs:{Moment1Out[moment1_3173], Moment2Out[moment2_3173], ParamOut[fc_1.b_0173]}.
I1010 08:42:51.660423 51287 operator.cc:130] CUDAPlace(0) Op(scale), inputs:{X[beta2_pow_acc_3:float1]}, outputs:{Out[beta2_pow_acc_31]}.
I1010 08:42:51.660465 51287 operator.cc:663] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1010 08:42:51.660521 51287 operator.cc:142] CUDAPlace(0) Op(scale), inputs:{X[beta2_pow_acc_3:float1]}, outputs:{Out[beta2_pow_acc_31]}.
I1010 08:42:51.660552 51287 operator.cc:130] CUDAPlace(0) Op(scale), inputs:{X[beta1_pow_acc_3:float1]}, outputs:{Out[beta1_pow_acc_31]}.
I1010 08:42:51.660575 51287 operator.cc:663] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1010 08:42:51.660604 51287 operator.cc:142] CUDAPlace(0) Op(scale), inputs:{X[beta1_pow_acc_3:float1]}, outputs:{Out[beta1_pow_acc_31]}.
I1010 08:42:51.663774 51288 tensor_util.cu:107] TensorCopySync 1 from CUDAPlace(0) to CPUPlace
I1010 08:42:51.700305 51288 tensor_util.cu:25] TensorCopy 1 from CPUPlace to CPUPlace
I1010 08:42:51.700296 51286 tensor_util.cu:107] TensorCopySync 21639, 200 from CUDAPlace(0) to CPUPlace
I1010 08:42:51.703213 51286 tensor_util.cu:25] TensorCopy 21639, 200 from CPUPlace to CPUPlace
*** Aborted at 1539160971 (unix time) try "date -d @1539160971" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x7f3000000002) received by PID 51269 (TID 0x7f305c3ac700) from PID 2; stack trace: ***
@ 0x7f305bb7e7e0 (unknown)
@ 0x7f3000000002 (unknown)
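For anyone trying to reproduce a log like the one above: the report does not say which environment variable was used, so the following is only a sketch under assumptions. It assumes the standard glog variables `GLOG_v` and `GLOG_logtostderr`, and a verbosity level of 3, which is a guess rather than something the report confirms. The helper name `enable_verbose_glog` is hypothetical.

```python
import os


def enable_verbose_glog(level: int = 3) -> None:
    """Set glog environment variables for verbose logging.

    Assumption: Paddle's C++ core reads the standard glog variables.
    The level actually used in the report is not stated.
    """
    os.environ["GLOG_v"] = str(level)     # verbosity threshold for VLOG output
    os.environ["GLOG_logtostderr"] = "1"  # write to stderr instead of log files


enable_verbose_glog(3)
# Import Paddle only after the variables are set, so the native library
# sees them when it is loaded, e.g.:
# import paddle.fluid as fluid
```

Setting the variables in the shell before launching the training script would have the same effect; doing it in Python only works if it runs before the first Paddle import.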