paddle.fluid.core.EnforceNotMet: CUBLAS: execution failed
Created by: liushanshan07
Training runs normally on a local single machine with a single GPU. With multiple GPUs on the same local machine, the following error occurs.

  File "./ctc_train.py", line 125, in main
    train(args, data_reader=ctc_reader)
  File "./ctc_train.py", line 87, in train
    feed=get_feeder_data(data, place))
  File "/home/users/liushanshan/common_env/venv/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 238, in run
    @     0x7f8e11fe8e1d google::LogMessage::Fail()
    self.executor.run(fetch_list, fetch_var_name)
paddle.fluid.core.EnforceNotMet: CUBLAS: execution failed, at [/home/users/shiwenguo/fluid_paddle/Paddle/paddle/fluid/operators/math/b
PaddlePaddle Call Stacks:
0   0x7f8e112e3486p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 486
1   0x7f8e115a6532p void paddle::operators::math::Blaspaddle::platform::CUDADeviceContext::GEMM(CBLAS_TRANSPOSE, CBLAS_TRA
2   0x7f8e11b5d58bp void paddle::operators::math::Blaspaddle::platform::CUDADeviceContext::MatMul(paddle::framework::Tenso
3   0x7f8e11b5dd6fp paddle::operators::MulGradKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::Execut
4   0x7f8e11e72ec6p paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform
5   0x7f8e11da4697p
6   0x7f8e11dc55e0p
7   0x7f8e11dc4f75p paddle::framework::details::OpHandleBase::RunAndRecordEvent(std::function<void ()> const&) + 789
8   0x7f8e11da41ffp paddle::framework::details::ComputationOpHandle::RunImpl() + 95
9   0x7f8e11dc59b7p paddle::framework::details::OpHandleBase::Run(bool) + 343
10  0x7f8e11dbc5eep
11  0x7f8e11c73913p std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Del
12  0x7f8e11c740a7p std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std:
13  0x7f8e4b2ff973p pthread_once + 83
14  0x7f8e11dbb502p
15  0x7f8e11dbf916p std::thread::_Impl<std::_Bind_simple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1} ()> >::_M_run() + 406
16  0x7f8e1ff6d8a0p
17  0x7f8e4b2fa1c3p
18  0x7f8e4a92212dp clone + 109

    @     0x7f8e11fec8cc google::LogMessage::SendToLog()
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x296604) received by PID 86988 (TID 0x7f8dcce56700) from PID 2713092; stack trace: ***
    @     0x7f8e4b302160 (unknown)
    @     0x7f8e11fe8943 google::LogMessage::Flush()
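
To narrow down the single-GPU versus multi-GPU difference described above, a small launcher like the one below might help. It is only a sketch, not part of the original report: `build_env` and `run_training` are hypothetical helper names, and it assumes `./ctc_train.py` can be started without extra arguments. It relies on two standard CUDA environment variables. `CUDA_VISIBLE_DEVICES` limits which GPUs the process can see, and `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so an error tends to surface closer to the call that caused it.

```python
import os
import subprocess
import sys


def build_env(gpu_ids, blocking=True, base_env=None):
    """Return a copy of the environment restricted to the given GPU ids.

    Hypothetical helper for illustration; not part of PaddlePaddle.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Only the listed devices are visible to the child process.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    if blocking:
        # Synchronous kernel launches: errors are reported nearer their source.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_training(gpu_ids, script="./ctc_train.py", extra_args=()):
    """Run the training script on the chosen GPUs and return its exit code.

    Assumes the script path from the traceback; adjust as needed.
    """
    cmd = [sys.executable, script] + list(extra_args)
    return subprocess.call(cmd, env=build_env(gpu_ids))
```

Calling `run_training([0])` and then `run_training([0, 1])` would reproduce the two cases with otherwise identical settings, which may show whether the failure depends on a particular device or only appears once more than one GPU is used.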