Cublas error, CUBLAS_STATUS_EXECUTION_FAILED
Created by: MrChengmo
To get your question resolved quickly, before opening an issue please search for similar problems via: [search issue keywords] [filter by labels] [official documentation]
If you do not find a similar issue, please provide the following details when opening one so it can be resolved quickly:
- Title: a concise, precise summary of your problem, e.g. "Insufficient Memory xxx"
- Version & environment info: 1) PaddlePaddle version: 1.8.1 2) CPU: / 3) GPU: V100, Driver Version: 418.39, CUDA Version: 10.1
- Training info: 1) single machine, single GPU 3) Operator info: operator < mul > error
fluid.install_check() reports no problem, but a core dump occurs during training:
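The C++ stack below shows the failure inside the `mul` kernel called by `fluid.layers.fc` (with `x_num_col_dims=1`), so one thing worth ruling out before blaming cuBLAS is an ill-formed GEMM: `mul` flattens its input to 2-D and requires the flattened column count to match the weight's row count. A minimal host-side sketch of that shape check (the shapes and the helper name are hypothetical, not taken from the failing model):

```python
import numpy as np

# The `mul` operator behind fluid.layers.fc flattens its input to 2-D
# (x_num_col_dims=1 in the trace below) and then calls cuBLAS GEMM.
# Checking the shapes on the host first can rule out an ill-formed GEMM
# before it ever reaches the GPU. Shapes here are placeholders.

def check_mul_shapes(x_shape, w_shape, x_num_col_dims=1):
    """Return the output shape of mul(x, w), or raise if incompatible."""
    rows = int(np.prod(x_shape[:x_num_col_dims]))  # flattened batch dims
    cols = int(np.prod(x_shape[x_num_col_dims:]))  # flattened feature dims
    if cols != w_shape[0]:
        raise ValueError(
            f"mul: flattened input cols {cols} != weight rows {w_shape[0]}")
    return (rows, w_shape[1])

print(check_mul_shapes((32, 128), (128, 64)))  # (32, 64)
```

If the shapes are consistent, the failure is more likely environmental (driver/CUDA mismatch, exhausted device memory, or corrupted inputs) than a logic error in the network definition.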
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what():
--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 void paddle::operators::math::Blas<paddle::platform::CUDADeviceContext>::MatMul<float>(paddle::framework::Tensor const&, bool, paddle::framework::Tensor const&, bool, float, paddle::framework::Tensor*, float) const
3 paddle::operators::MulKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
4 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::MulKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::MulKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::MulKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
5 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
6 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
7 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
8 paddle::framework::HogwildWorker::TrainFilesWithProfiler()
------------------------------------------
Python Call Stacks (More useful to users):
------------------------------------------
File "/home/users/wangjiawei04/paddle_release_home/python/lib64/python2.7/site-packages/paddle/fluid/framework.py", line 2610, in append_op
attrs=kwargs.get("attrs", None))
File "/home/users/wangjiawei04/paddle_release_home/python/lib64/python2.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
return self.main_program.current_block().append_op(*args, **kwargs)
File "/home/users/wangjiawei04/paddle_release_home/python/lib64/python2.7/site-packages/paddle/fluid/layers/nn.py", line 1719, in fc
"y_num_col_dims": 1})
File "/home/users/wangjiawei04/chengmo/paddle_attention/paddle/train_net.py", line 639, in fusion_semantic_word
bias_attr=fluid.ParamAttr(name="tdm.cls_fc.bias"))
File "/home/users/wangjiawei04/chengmo/paddle_attention/paddle/train_net.py", line 237, in train_net
semantic_states, word_states)
File "/home/users/wangjiawei04/chengmo/paddle_attention/paddle/local_train.py", line 96, in run_train
avg_cost, auc = tdm_model.train_net(inputs)
File "/home/users/wangjiawei04/chengmo/paddle_attention/paddle/local_train.py", line 209, in main
run_train(args)
File "/home/users/wangjiawei04/chengmo/paddle_attention/paddle/local_train.py", line 216, in <module>
main(args)
----------------------
Error Message Summary:
----------------------
ExternalError: Cublas error, CUBLAS_STATUS_EXECUTION_FAILED at (/paddle/paddle/fluid/operators/math/blas_impl.cu.h:34)
[operator < mul > error]
W0528 13:54:15.679564 213705 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0528 13:54:15.679577 213705 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0528 13:54:15.679581 213705 init.cc:221] The detail failure signal is:
W0528 13:54:15.679586 213705 init.cc:224] *** Aborted at 1590645255 (unix time) try "date -d @1590645255" if you are using GNU date ***
W0528 13:54:15.684528 213705 init.cc:224] PC: @ 0x0 (unknown)
W0528 13:54:15.684717 213705 init.cc:224] *** SIGABRT (@0x520520002dd8f) received by PID 187791 (TID 0x7f609d7cc700) from PID 187791; stack trace: ***
W0528 13:54:15.685890 213705 init.cc:224] @ 0x7f60ca1c8160 (unknown)
W0528 13:54:15.687609 213705 init.cc:224] @ 0x7f60c97363f7 __GI_raise
W0528 13:54:15.688812 213705 init.cc:224] @ 0x7f60c97377d8 __GI_abort
W0528 13:54:15.690255 213705 init.cc:224] @ 0x7f5feba34c65 __gnu_cxx::__verbose_terminate_handler()
W0528 13:54:15.690800 213705 init.cc:224] @ 0x7f5feba32e06 __cxxabiv1::__terminate()
W0528 13:54:15.691521 213705 init.cc:224] @ 0x7f5feba32e33 std::terminate()
W0528 13:54:15.692054 213705 init.cc:224] @ 0x7f5feba85935 execute_native_thread_routine
W0528 13:54:15.693186 213705 init.cc:224] @ 0x7f60ca1c01c3 start_thread
W0528 13:54:15.694511 213705 init.cc:224] @ 0x7f60c97e812d __clone
W0528 13:54:15.695627 213705 init.cc:224] @ 0x0 (unknown)
I0528 13:54:15.863662 213708 mmap_allocator.cc:124] PID: 213708, MemoryMapFdSet: set size - 0
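CUBLAS_STATUS_EXECUTION_FAILED is often a downstream symptom (bad feed data, device memory pressure, or a driver/CUDA mismatch) rather than a bug in `mul` itself. One cheap way to narrow it down is to validate each batch on the host before feeding it to the executor. This is a hypothetical helper, not part of the Paddle API:

```python
import numpy as np

def sanity_check_feed(feed_dict):
    """Hypothetical pre-feed check: reject NaN/inf batches
    before they reach the GPU kernels."""
    for name, arr in feed_dict.items():
        arr = np.asarray(arr)
        if not np.isfinite(arr).all():
            raise ValueError(f"feed '{name}' contains NaN/inf")
    return feed_dict

# Example with a well-formed batch; names and shapes are placeholders.
batch = {"semantic_states": np.ones((4, 8), dtype=np.float32)}
sanity_check_feed(batch)  # passes silently
```

If every batch passes and the same program runs cleanly on CPU, that points toward the GPU environment (e.g. the driver/CUDA pairing above) rather than the model code.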