Sequence labeling job fails on the internal cluster version
Created by: giraffa126
Job link: http://nmg01-hpc-off-mon.dmop.baidu.com:8090/job/i-605317/

The main error is as follows:

Tue Jul 25 13:39:09 2017[1,22]:Thread [140349248341888] Forwarding recurrent_layer_1,
Tue Jul 25 13:39:09 2017[1,22]:mixed_1, recurrent_layer_0, mixed_0, embedding_0, chunk, word,
Tue Jul 25 13:39:09 2017[1,22]:*** Aborted at 1500961149 (unix time) try "date -d @1500961149" if you are using GNU date ***
Tue Jul 25 13:39:09 2017[1,22]:PC: @ 0x0 (unknown)
Tue Jul 25 13:39:10 2017[1,22]:*** SIGFPE (@0xa9076c) received by PID 9062 (TID 0x7fa59b178780) from PID 11077484; stack trace: ***
Tue Jul 25 13:39:10 2017[1,22]: @ 0x7fa59a933160 (unknown)
Tue Jul 25 13:39:10 2017[1,22]: @ 0xa9076c mkl_blas_avx_sgemm_pst
Tue Jul 25 13:39:10 2017[1,22]: @ 0xa8a331 mkl_blas_avx_xsgemm
Tue Jul 25 13:39:10 2017[1,22]: @ 0x92bbbc mkl_blas_xsgemm
Tue Jul 25 13:39:10 2017[1,22]: @ 0x92b78f mkl_blas_sgemm
Tue Jul 25 13:39:10 2017[1,22]: @ 0x92aa79 SGEMM
Tue Jul 25 13:39:10 2017[1,22]: @ 0x92a292 cblas_sgemm
Tue Jul 25 13:39:10 2017[1,22]: @ 0x810a09 paddle::gemm<>()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x81ec62 paddle::CpuMatrix::mul()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x82b479 paddle::CpuMatrix::mul()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x5e3c9e paddle::RecurrentLayer::forwardOneSequence()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x5e4c9f paddle::RecurrentLayer::forwardSequence()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x5e5997 paddle::RecurrentLayer::forward()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x6a19c0 paddle::NeuralNetwork::forward()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x6929c3 paddle::GradientMachine::forwardBackward()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x74b213 paddle::TrainerInternal::forwardBackwardBatch()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x74bc21 paddle::TrainerInternal::trainOneBatch()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x749616 paddle::Trainer::trainOneDataBatch()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x749abd paddle::Trainer::trainOnePass()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x74a375 paddle::Trainer::train()
Tue Jul 25 13:39:10 2017[1,22]: @ 0x5a3d70 main
Tue Jul 25 13:39:10 2017[1,22]: @ 0x7fa599b77bd5 __libc_start_main
Tue Jul 25 13:39:10 2017[1,22]: @ 0x5b2169 (unknown)
Tue Jul 25 13:39:10 2017[1,22]: @ 0x0 (unknown)
Tue Jul 25 13:39:10 2017[1,22]:./train.sh: line 207: 9062 Floating point exception PYTHONPATH=./paddle:$PYTHONPATH GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Tue Jul 25 13:39:10 2017[1,22]:+ '[' 136 -ne 0 ']'
Tue Jul 25 13:39:10 2017[1,22]:+ kill_pserver2_exit
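
For anyone reading the end of the log: the exit status 136 that train.sh tests appears to match the SIGFPE reported in the stack trace, since shells conventionally report a process killed by signal N as 128 + N. The snippet below is only a small sketch of that arithmetic. The helper name `describe_exit_status` is made up for illustration and is not part of Paddle or train.sh, and the signal numbering assumes a Linux host.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell exit status (hypothetical helper, not part of Paddle).

    Assumes the common shell convention that a status above 128
    means the process was terminated by signal (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            # Look up the symbolic name for this signal number on the host OS.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {status}"


# 136 is the status tested by train.sh in the log above.
print(describe_exit_status(136))
```

On Linux, signal 8 is SIGFPE, so this reading agrees with the "Floating point exception" message printed for PID 9062. It says nothing about why the exception was raised inside `cblas_sgemm`.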