cudnnGetConvolutionForwardAlgorithm_v7 raises an error during inference
Created by: flishwang
To help get your problem resolved quickly, please search for similar issues before creating an Issue: none found
-
Version and environment information:
Paddle version: 1.8.3
Paddle With CUDA: True
OS: debian stretch/sid
Python version: 3.7.7
CUDA version: 10.1.243
cuDNN version: None.None.None # Note: libcudnn.so.7.6.5 is installed on the system, and its path has been added to $LD_LIBRARY_PATH
Nvidia driver version: 418.74
-
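Problem description:
I run inference with tools/eval.py from PaddleDetection. The environment is a single machine with one or multiple GPUs (Tesla V100 16G or Titan RTX). The model is Cascade RCNN (backbone R101vd or R200vd) with multi-scale test.
The following error occurs roughly 30% of the time (running the same script does not reproduce it reliably, and I do not know what triggers it):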
2020-08-05 12:50:51,695-INFO: start loading proposals
2020-08-05 12:50:52,457-INFO: loading roidb 2012_test
100%|████████████████████████████████████████| 970/970 [00:01<00:00, 601.75it/s]
2020-08-05 12:50:54,377-INFO: finish loading roidb from scope 2012_test
2020-08-05 12:50:54,378-INFO: finish loading roidbs, total num = 970
2020-08-05 12:50:54,379-INFO: set max batches to 0
2020-08-05 12:50:54,380-INFO: places would be ommited when DataLoader is not iterable
W0805 12:50:54.530522 4141844 device_context.cc:252] Please NOTE: device: 5, CUDA Capability: 75, Driver API Version: 10.1, Runtime API Version: 10.0
W0805 12:50:55.613425 4141844 device_context.cc:260] device: 5, cuDNN Version: 7.6.
W0805 12:51:24.223932 4141881 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0805 12:51:24.223980 4141881 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0805 12:51:24.223989 4141881 init.cc:221] The detail failure signal is:
W0805 12:51:24.224001 4141881 init.cc:224] *** Aborted at 1596603084 (unix time) try "date -d @1596603084" if you are using GNU date ***
W0805 12:51:24.228863 4141881 init.cc:224] PC: @ 0x0 (unknown)
W0805 12:51:24.346484 4141881 init.cc:224] *** SIGSEGV (@0x8) received by PID 4141844 (TID 0x7f012db3d700) from PID 8; stack trace: ***
W0805 12:51:24.351244 4141881 init.cc:224] @ 0x7f01e3671390 (unknown)
W0805 12:51:24.353901 4141881 init.cc:224] @ 0x7f012eda2747 (unknown)
W0805 12:51:24.356168 4141881 init.cc:224] @ 0x7f012ec98d4c (unknown)
W0805 12:51:24.358356 4141881 init.cc:224] @ 0x7f012e41b5fc (unknown)
W0805 12:51:24.360416 4141881 init.cc:224] @ 0x7f012e42b938 (unknown)
W0805 12:51:24.362363 4141881 init.cc:224] @ 0x7f012e41859a cudnnGetConvolutionForwardAlgorithm_v7
W0805 12:51:24.447378 4141881 init.cc:224] @ 0x7f019853ff45 paddle::operators::SearchAlgorithm<>::Find<>()
W0805 12:51:24.469980 4141881 init.cc:224] @ 0x7f01985e1889 paddle::operators::CUDNNConvOpKernel<>::Compute()
W0805 12:51:24.481895 4141881 init.cc:224] @ 0x7f01985e2b33 ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators17CUDNNConvOpKernelIfEENSA_IdEENSA_INS7_7float16EEEEEclEPKcSH_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4
W0805 12:51:24.504448 4141881 init.cc:224] @ 0x7f019a561ac0 paddle::framework::OperatorWithKernel::RunImpl()
W0805 12:51:24.565385 4141881 init.cc:224] @ 0x7f019a5622b1 paddle::framework::OperatorWithKernel::RunImpl()
W0805 12:51:24.604465 4141881 init.cc:224] @ 0x7f019a55b261 paddle::framework::OperatorBase::Run()
W0805 12:51:24.635419 4141881 init.cc:224] @ 0x7f019a268f16 paddle::framework::details::ComputationOpHandle::RunImpl()
W0805 12:51:24.657658 4141881 init.cc:224] @ 0x7f019a210551 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0805 12:51:24.673673 4141881 init.cc:224] @ 0x7f019a20e04f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0805 12:51:24.687579 4141881 init.cc:224] @ 0x7f019a20e314 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0805 12:51:24.724630 4141881 init.cc:224] @ 0x7f0197001fb3 std::_Function_handler<>::_M_invoke()
W0805 12:51:24.769093 4141881 init.cc:224] @ 0x7f0196dfd647 std::__future_base::_State_base::_M_do_set()
W0805 12:51:24.773929 4141881 init.cc:224] @ 0x7f01e366ea99 __pthread_once_slow
W0805 12:51:24.780242 4141881 init.cc:224] @ 0x7f019a20a4e2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W0805 12:51:24.817785 4141881 init.cc:224] @ 0x7f0196dffaa4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W0805 12:51:24.850741 4141881 init.cc:224] @ 0x7f01d4120421 execute_native_thread_routine_compat
W0805 12:51:24.857818 4141881 init.cc:224] @ 0x7f01e36676ba start_thread
W0805 12:51:24.862519 4141881 init.cc:224] @ 0x7f01e339d41d clone
W0805 12:51:24.870891 4141881 init.cc:224] @ 0x0 (unknown)
Segmentation fault (core dumped)