Opened Aug 09, 2019 by saxon_zh (Guest)

Error reported during model training: cudaGetLastError invalid configuration argument errno:9

Created by: dingsiyu

To get your issue resolved quickly, before opening a new issue please check whether a similar problem has already been reported by searching issue keywords, filtering with labels, and consulting the official documentation.

If you did not find a similar issue, please provide the following details so your question can be resolved quickly:

  • Title: summarize your problem concisely and precisely, e.g. "Insufficient Memory xxx"
  • Version and environment info:    1) PaddlePaddle version: 1.5    3) GPU: CUDA 10.0, cuDNN 7.4    4) System environment: CentOS 6, Python 3
  • Training info:    1) single machine, multiple GPUs, distributed training. The training log follows:
   The n_token for dataset wt103 is 267735
pretraining start
Device count 1, gpu_id:1
theoretical memory usage: 
(16902.301077747346, 17707.17255764008, 'MB')
FLASG.is_distributed: True
worker_endpoints:['10.255.122.21:6170', '10.255.122.21:6171', '10.255.122.21:6172', '10.255.122.21:6173', '10.255.122.21:6174', '10.255.122.21:6175', '10.255.122.21:6176', '10.255.122.21:6177'] trainers_num:8 current_endpoint:10.255.122.21:6171                   trainer_id:1
W0809 11:28:02.020586 185322 device_context.cc:259] Please NOTE: device: 1, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 10.0
W0809 11:28:02.024772 185322 device_context.cc:267] device: 1, cuDNN Version: 7.4.
I0809 11:28:02.135246 187888 grpc_server.cc:435] Server listening on 10.255.122.21:6171 selected port: 6171
I0809 11:28:04.976687 185322 rpc_server.cc:28] RPCServer ShutDown 
W0809 11:28:04.977440 187901 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
W0809 11:28:04.977444 187902 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
W0809 11:28:04.977454 187903 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
W0809 11:28:04.977445 187899 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
W0809 11:28:04.977448 187900 grpc_server.cc:547] CompletionQueue RequestSend shutdown!
5 5
The value of build_strategy.num_trainers[1] is overwritten by the passed num_trainers[8].
The value of build_strategy.trainer_id[0] is overwritten by the passed trainer_id[1].
WARNING: Logging before flag parsing goes to stderr.
W0809 11:28:05.070413 139980018198272 compiler.py:239] 
     You can try our memory optimize feature to save your memory usage:
         # create a build_strategy variable to set memory optimize option
         build_strategy = compiler.BuildStrategy()
         build_strategy.enable_inplace = True
         build_strategy.memory_optimize = True
         
         # pass the build_strategy to with_data_parallel API
         compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
             loss_name=loss.name, build_strategy=build_strategy)
      
     !!! Memory optimize is our experimental feature !!!
         some variables may be removed/reused internal to save memory usage, 
         in order to fetch the right value of the fetch_list, please set the 
         persistable property to true for each variable in fetch_list

         # Sample
         conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None) 
         # if you need to fetch conv1, then:
         conv1.persistable = True

                 
I0809 11:28:05.129082 185322 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0809 11:28:06.748011 185322 build_strategy.cc:340] SeqOnlyAllReduceOps:1, num_trainers:8
the current epoch is :1
start training ....
dgsgdgggcsdggcgdg
  processing batch 0
F0809 11:29:53.013250 188921 device_context.cc:339] cudaGetLastError  invalid configuration argument errno:9
*** Check failure stack trace: ***
    @     0x7f4f67cf616d  google::LogMessage::Fail()
    @     0x7f4f67cf9c1c  google::LogMessage::SendToLog()
    @     0x7f4f67cf5c93  google::LogMessage::Flush()
    @     0x7f4f67cfb12e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f4f69cb16fd  _ZNSt17_Function_handlerIFvvEZNK6paddle8platform17CUDADeviceContext4WaitEvEUlvE_E9_M_invokeERKSt9_Any_data
    @     0x7f4f69cbf574  paddle::platform::TemporaryAllocator::Release()
    @     0x7f4f69cb46b1  paddle::platform::CUDADeviceContext::Wait()
    @     0x7f4f69c3bf91  paddle::framework::TransDataDevice()
    @     0x7f4f69c3b02e  paddle::framework::TransformData()
    @     0x7f4f69c3246d  paddle::framework::OperatorWithKernel::PrepareData()
    @     0x7f4f69c3359d  paddle::framework::OperatorWithKernel::RunImpl()
    @     0x7f4f69c33a41  paddle::framework::OperatorWithKernel::RunImpl()
    @     0x7f4f69c3103c  paddle::framework::OperatorBase::Run()
    @     0x7f4f680c107c  paddle::operators::Squeeze2Op::RunImpl()
    @     0x7f4f69c3103c  paddle::framework::OperatorBase::Run()
    @     0x7f4f69a2da79  _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details19ComputationOpHandle7RunImplEvEUlvE_E9_M_invokeERKSt9_Any_data
    @     0x7f4f69a1ee5d  _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
    @     0x7f4f69a1fb94  paddle::framework::details::OpHandleBase::RunAndRecordEvent()
    @     0x7f4f69a2d70c  paddle::framework::details::ComputationOpHandle::RunImpl()
    @     0x7f4f69a20130  paddle::framework::details::OpHandleBase::Run()
    @     0x7f4f69a014a6  paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
    @     0x7f4f69a0010f  paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
    @     0x7f4f69a004cf  _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
    @     0x7f4f67de2a73  std::_Function_handler<>::_M_invoke()
    @     0x7f4f67c7a1e7  std::__future_base::_State_base::_M_do_set()
    @     0x7f4fa3015620  __pthread_once_internal
    @     0x7f4f699fbb52  _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
    @     0x7f4f67c7b764  _ZZN10ThreadPoolC1EmENKUlvE_clEv
    @     0x7f4f905cf421  execute_native_thread_routine_compat
    @     0x7f4fa3010893  start_thread
    @     0x7f4fa2d41bfd  clone
    @              (nil)  (unknown)
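For context, the compiler warning in the log above describes how to enable the experimental memory-optimize feature and pass the build_strategy to with_data_parallel. Below is a minimal sketch of how those suggestions would typically be wired into a Fluid 1.5 script; the toy network, variable names, batch size, and CUDAPlace(0) are placeholders for illustration and are not taken from the reporter's actual training code:

    import numpy as np
    import paddle.fluid as fluid
    from paddle.fluid import compiler

    # Placeholder toy network -- the reporter's actual model is not shown in the issue.
    data = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
    conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None)
    conv1.persistable = True  # needed to fetch conv1 once memory optimize is enabled
    loss = fluid.layers.reduce_mean(conv1)
    fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

    # Experimental memory-optimize options suggested by the compiler warning.
    build_strategy = compiler.BuildStrategy()
    build_strategy.enable_inplace = True
    build_strategy.memory_optimize = True

    # Pass the build_strategy to with_data_parallel, as the warning describes.
    compiled_prog = compiler.CompiledProgram(fluid.default_main_program()).with_data_parallel(
        loss_name=loss.name, build_strategy=build_strategy)

    place = fluid.CUDAPlace(0)  # assumes at least one visible GPU
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    out, = exe.run(compiled_prog,
                   feed={'img': np.random.rand(8, 1, 28, 28).astype('float32')},
                   fetch_list=[conv1.name])

As the warning notes, once memory optimize is on, intermediate variables may be removed or reused, so any variable in fetch_list must have persistable set to True to be fetched correctly.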
Reference: paddlepaddle/Paddle#19085