Error during model training: cudaGetLastError invalid configuration argument errno:9
Created by: dingsiyu
- Version / environment: 1) PaddlePaddle version: 1.5  2) GPU: CUDA 10.0, cuDNN 7.4  3) System: CentOS 6, Python 3
- Training setup: single machine, multi-GPU, distributed training
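The log below shows an NCCL2-style multi-process run (8 workers on 10.255.122.21, this process being trainer_id 1). For reference, here is a minimal sketch of how such a setup is typically wired in Paddle 1.5 with DistributeTranspiler in "nccl2" mode; the environment-variable names and defaults are illustrative assumptions, not taken from the actual training script:

```python
import os
import paddle.fluid as fluid

# Minimal sketch (assumption: the script uses DistributeTranspiler in
# "nccl2" mode, matching worker_endpoints/trainer_id in the log below).
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "1"))
endpoints = os.getenv("PADDLE_TRAINER_ENDPOINTS",          # comma-separated
                      "10.255.122.21:6170,10.255.122.21:6171")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT", "10.255.122.21:6171")

config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
t.transpile(trainer_id,
            trainers=endpoints,                # all worker endpoints
            current_endpoint=current_endpoint,
            startup_program=fluid.default_startup_program())
```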
The n_token for dataset wt103 is 267735
pretraining start
Device count 1, gpu_id:1
theoretical memory usage:
(16902.301077747346, 17707.17255764008, 'MB')
FLASG.is_distributed: True
worker_endpoints:['10.255.122.21:6170', '10.255.122.21:6171', '10.255.122.21:6172', '10.255.122.21:6173', '10.255.122.21:6174', '10.255.122.21:6175', '10.255.122.21:6176', '10.255.122.21:6177'] trainers_num:8 current_endpoint:10.255.122.21:6171 trainer_id:1
W0809 11:28:02.020586 185322 device_context.cc:259] Please NOTE: device: 1, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 10.0
W0809 11:28:02.024772 185322 device_context.cc:267] device: 1, cuDNN Version: 7.4.
I0809 11:28:02.135246 187888 grpc_server.cc:435] Server listening on 10.255.122.21:6171 selected port: 6171
I0809 11:28:04.976687 185322 rpc_server.cc:28] RPCServer ShutDown
W0809 11:28:04.977440 187901 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
W0809 11:28:04.977444 187902 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
W0809 11:28:04.977454 187903 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
W0809 11:28:04.977445 187899 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
W0809 11:28:04.977448 187900 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
5 5
The value of build_strategy.num_trainers[1] is overwritten by the passed num_trainers[8].
The value of build_strategy.trainer_id[0] is overwritten by the passed trainer_id[1].
WARNING: Logging before flag parsing goes to stderr.
W0809 11:28:05.070413 139980018198272 compiler.py:239]
You can try our memory optimize feature to save your memory usage:
# create a build_strategy variable to set memory optimize option
build_strategy = compiler.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True
# pass the build_strategy to with_data_parallel API
compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
loss_name=loss.name, build_strategy=build_strategy)
!!! Memory optimize is our experimental feature !!!
some variables may be removed/reused internal to save memory usage,
in order to fetch the right value of the fetch_list, please set the
persistable property to true for each variable in fetch_list
# Sample
conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None)
# if you need to fetch conv1, then:
conv1.persistable = True
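(For completeness, a self-contained version of what that warning suggests; this is a sketch only, where `main_program` and `loss` stand in for whatever the training script actually builds, and memory optimize is flagged experimental in this release:)

```python
import paddle.fluid as fluid
from paddle.fluid import compiler

# Sketch of the memory-optimize setup suggested by the warning above.
# `main_program` and `loss` are placeholders for the script's own graph.
build_strategy = compiler.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True

# Anything in fetch_list must stay persistable, otherwise memory
# optimize may reuse its buffer before the value is fetched.
loss.persistable = True

compiled_prog = compiler.CompiledProgram(main_program).with_data_parallel(
    loss_name=loss.name, build_strategy=build_strategy)
```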
I0809 11:28:05.129082 185322 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0809 11:28:06.748011 185322 build_strategy.cc:340] SeqOnlyAllReduceOps:1, num_trainers:8
the current epoch is :1
start training ....
processing batch 0
F0809 11:29:53.013250 188921 device_context.cc:339] cudaGetLastError invalid configuration argument errno:9
*** Check failure stack trace: ***
@ 0x7f4f67cf616d google::LogMessage::Fail()
@ 0x7f4f67cf9c1c google::LogMessage::SendToLog()
@ 0x7f4f67cf5c93 google::LogMessage::Flush()
@ 0x7f4f67cfb12e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f4f69cb16fd _ZNSt17_Function_handlerIFvvEZNK6paddle8platform17CUDADeviceContext4WaitEvEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7f4f69cbf574 paddle::platform::TemporaryAllocator::Release()
@ 0x7f4f69cb46b1 paddle::platform::CUDADeviceContext::Wait()
@ 0x7f4f69c3bf91 paddle::framework::TransDataDevice()
@ 0x7f4f69c3b02e paddle::framework::TransformData()
@ 0x7f4f69c3246d paddle::framework::OperatorWithKernel::PrepareData()
@ 0x7f4f69c3359d paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7f4f69c33a41 paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7f4f69c3103c paddle::framework::OperatorBase::Run()
@ 0x7f4f680c107c paddle::operators::Squeeze2Op::RunImpl()
@ 0x7f4f69c3103c paddle::framework::OperatorBase::Run()
@ 0x7f4f69a2da79 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details19ComputationOpHandle7RunImplEvEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7f4f69a1ee5d _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7f4f69a1fb94 paddle::framework::details::OpHandleBase::RunAndRecordEvent()
@ 0x7f4f69a2d70c paddle::framework::details::ComputationOpHandle::RunImpl()
@ 0x7f4f69a20130 paddle::framework::details::OpHandleBase::Run()
@ 0x7f4f69a014a6 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
@ 0x7f4f69a0010f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@ 0x7f4f69a004cf _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
@ 0x7f4f67de2a73 std::_Function_handler<>::_M_invoke()
@ 0x7f4f67c7a1e7 std::__future_base::_State_base::_M_do_set()
@ 0x7f4fa3015620 __pthread_once_internal
@ 0x7f4f699fbb52 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
@ 0x7f4f67c7b764 _ZZN10ThreadPoolC1EmENKUlvE_clEv
@ 0x7f4f905cf421 execute_native_thread_routine_compat
@ 0x7f4fa3010893 start_thread
@ 0x7f4fa2d41bfd clone
@ (nil) (unknown)
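For context, errno 9 is the CUDA runtime's cudaErrorInvalidConfiguration ("invalid configuration argument"): a kernel was launched with an invalid grid/block configuration. In frameworks this often traces back to an empty or zero-sized tensor reaching an operator (the trace above fails while preparing data for Squeeze2Op). A minimal, purely hypothetical sanity check on the feed data before each exe.run() call:

```python
import numpy as np

def check_feed(feed_dict):
    # Hypothetical helper (not part of the training script): reject feeds
    # that commonly provoke cudaErrorInvalidConfiguration, i.e. tensors
    # with a zero-sized dimension that lead to a kernel launch of 0 blocks.
    for name, value in feed_dict.items():
        arr = np.asarray(value)
        if arr.size == 0 or 0 in arr.shape:
            raise ValueError("feed '%s' has an empty shape %s" % (name, arr.shape))

# usage sketch: call check_feed(feed) right before exe.run(prog, feed=feed, ...)
```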