Error during model training: cudaGetLastError invalid configuration argument errno:9
Created by: dingsiyu
- Version / environment: 1) PaddlePaddle version: 1.5  2) GPU: CUDA 10.0, cuDNN 7.4  3) System: CentOS 6, Python 3
- Training setup: single machine, multi-GPU, distributed training
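The log below shows an NCCL2-style multi-process run (8 workers on 10.255.122.21, this process being trainer_id 1). For reference, here is a minimal sketch of how such a setup is typically wired in Paddle 1.5 with DistributeTranspiler in "nccl2" mode; the environment-variable names and defaults are illustrative assumptions, not taken from the actual training script:

```python
import os
import paddle.fluid as fluid

# Minimal sketch (assumption: the script uses DistributeTranspiler in
# "nccl2" mode, matching worker_endpoints/trainer_id in the log below).
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "1"))
endpoints = os.getenv("PADDLE_TRAINER_ENDPOINTS",          # comma-separated
                      "10.255.122.21:6170,10.255.122.21:6171")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT", "10.255.122.21:6171")

config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
t.transpile(trainer_id,
            trainers=endpoints,                # all worker endpoints
            current_endpoint=current_endpoint,
            startup_program=fluid.default_startup_program())
```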
The n_token for dataset wt103 is 267735
pretraining start
Device count 1, gpu_id:1
theoretical memory usage:
(16902.301077747346, 17707.17255764008, 'MB')
FLASG.is_distributed: True
worker_endpoints:['10.255.122.21:6170', '10.255.122.21:6171', '10.255.122.21:6172', '10.255.122.21:6173', '10.255.122.21:6174', '10.255.122.21:6175', '10.255.122.21:6176', '10.255.122.21:6177'] trainers_num:8 current_endpoint:10.255.122.21:6171 trainer_id:1
W0809 11:28:02.020586 185322 device_context.cc:259] Please NOTE: device: 1, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 10.0
W0809 11:28:02.024772 185322 device_context.cc:267] device: 1, cuDNN Version: 7.4.
I0809 11:28:02.135246 187888 grpc_server.cc:435] Server listening on 10.255.122.21:6171 selected port: 6171
I0809 11:28:04.976687 185322 rpc_server.cc:28] RPCServer ShutDown
W0809 11:28:04.977440 187901 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
W0809 11:28:04.977444 187902 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
W0809 11:28:04.977454 187903 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
W0809 11:28:04.977445 187899 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
W0809 11:28:04.977448 187900 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
5 5
The value of build_strategy.num_trainers[1] is overwritten by the passed num_trainers[8].
The value of build_strategy.trainer_id[0] is overwritten by the passed trainer_id[1].
WARNING: Logging before flag parsing goes to stderr.
W0809 11:28:05.070413 139980018198272 compiler.py:239]
You can try our memory optimize feature to save your memory usage:
# create a build_strategy variable to set memory optimize option
build_strategy = compiler.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True
# pass the build_strategy to with_data_parallel API
compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
loss_name=loss.name, build_strategy=build_strategy)
!!! Memory optimize is our experimental feature !!!
some variables may be removed/reused internal to save memory usage,
in order to fetch the right value of the fetch_list, please set the
persistable property to true for each variable in fetch_list
# Sample
conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None)
# if you need to fetch conv1, then:
conv1.persistable = True
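(For completeness, a self-contained version of what that warning suggests; this is a sketch only, where `main_program` and `loss` stand in for whatever the training script actually builds, and memory optimize is flagged experimental in this release:)

```python
import paddle.fluid as fluid
from paddle.fluid import compiler

# Sketch of the memory-optimize setup suggested by the warning above.
# `main_program` and `loss` are placeholders for the script's own graph.
build_strategy = compiler.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True

# Anything in fetch_list must stay persistable, otherwise memory
# optimize may reuse its buffer before the value is fetched.
loss.persistable = True

compiled_prog = compiler.CompiledProgram(main_program).with_data_parallel(
    loss_name=loss.name, build_strategy=build_strategy)
```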
I0809 11:28:05.129082 185322 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0809 11:28:06.748011 185322 build_strategy.cc:340] SeqOnlyAllReduceOps:1, num_trainers:8
the current epoch is :1
start training ....
processing batch 0
F0809 11:29:53.013250 188921 device_context.cc:339] cudaGetLastError invalid configuration argument errno:9
*** Check failure stack trace: ***
@ 0x7f4f67cf616d google::LogMessage::Fail()
@ 0x7f4f67cf9c1c google::LogMessage::SendToLog()
@ 0x7f4f67cf5c93 google::LogMessage::Flush()
@ 0x7f4f67cfb12e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f4f69cb16fd _ZNSt17_Function_handlerIFvvEZNK6paddle8platform17CUDADeviceContext4WaitEvEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7f4f69cbf574 paddle::platform::TemporaryAllocator::Release()
@ 0x7f4f69cb46b1 paddle::platform::CUDADeviceContext::Wait()
@ 0x7f4f69c3bf91 paddle::framework::TransDataDevice()
@ 0x7f4f69c3b02e paddle::framework::TransformData()
@ 0x7f4f69c3246d paddle::framework::OperatorWithKernel::PrepareData()
@ 0x7f4f69c3359d paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7f4f69c33a41 paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7f4f69c3103c paddle::framework::OperatorBase::Run()
@ 0x7f4f680c107c paddle::operators::Squeeze2Op::RunImpl()
@ 0x7f4f69c3103c paddle::framework::OperatorBase::Run()
@ 0x7f4f69a2da79 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details19ComputationOpHandle7RunImplEvEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7f4f69a1ee5d _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7f4f69a1fb94 paddle::framework::details::OpHandleBase::RunAndRecordEvent()
@ 0x7f4f69a2d70c paddle::framework::details::ComputationOpHandle::RunImpl()
@ 0x7f4f69a20130 paddle::framework::details::OpHandleBase::Run()
@ 0x7f4f69a014a6 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
@ 0x7f4f69a0010f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@ 0x7f4f69a004cf _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
@ 0x7f4f67de2a73 std::_Function_handler<>::_M_invoke()
@ 0x7f4f67c7a1e7 std::__future_base::_State_base::_M_do_set()
@ 0x7f4fa3015620 __pthread_once_internal
@ 0x7f4f699fbb52 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
@ 0x7f4f67c7b764 _ZZN10ThreadPoolC1EmENKUlvE_clEv
@ 0x7f4f905cf421 execute_native_thread_routine_compat
@ 0x7f4fa3010893 start_thread
@ 0x7f4fa2d41bfd clone
@ (nil) (unknown)
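For context, errno 9 is the CUDA runtime's cudaErrorInvalidConfiguration ("invalid configuration argument"): a kernel was launched with an invalid grid/block configuration. In frameworks this often traces back to an empty or zero-sized tensor reaching an operator (the trace above fails while preparing data for Squeeze2Op). A minimal, purely hypothetical sanity check on the feed data before each exe.run() call:

```python
import numpy as np

def check_feed(feed_dict):
    # Hypothetical helper (not part of the training script): reject feeds
    # that commonly provoke cudaErrorInvalidConfiguration, i.e. tensors
    # with a zero-sized dimension that lead to a kernel launch of 0 blocks.
    for name, value in feed_dict.items():
        arr = np.asarray(value)
        if arr.size == 0 or 0 in arr.shape:
            raise ValueError("feed '%s' has an empty shape %s" % (name, arr.shape))

# usage sketch: call check_feed(feed) right before exe.run(prog, feed=feed, ...)
```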