PaddlePaddle / PaddleDetection · Issue #1128
Closed
Opened July 30, 2020 by saxon_zh (Guest)

GPU selection problem during model training

Created by: weiaoliu

When training obj365/cascade_rcnn_cls_aware_r200_vd_fpn_dcnv2_nonlocal_softnms.yml, specifying GPUs with export CUDA_VISIBLE_DEVICES=3,5 has no effect, and adding os.environ['CUDA_VISIBLE_DEVICES'] = '3,5' in train.py to select the GPUs does not work either: training still ends up on other GPUs. I am using version 0.3.
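One detail worth checking (an assumption on my part, not confirmed in this thread): CUDA_VISIBLE_DEVICES is read when the CUDA runtime initializes, so assigning os.environ['CUDA_VISIBLE_DEVICES'] inside train.py after paddle.fluid has already been imported has no effect. A minimal sketch of the required ordering, using Paddle's fluid (1.x-style) API; this is not the repository's actual train.py, and the checks at the end are only illustrative:

```python
import os

# Must be set before any module that initializes CUDA is imported;
# otherwise the process has already enumerated every GPU on the machine.
os.environ['CUDA_VISIBLE_DEVICES'] = '3,5'

import paddle.fluid as fluid  # import deliberately placed after the env var

# With the mask applied, Paddle should see exactly two devices,
# exposed to the process as logical GPUs 0 and 1.
print(fluid.core.get_cuda_device_count())  # expected: 2
print(fluid.cuda_places())                 # expected: [CUDAPlace(0), CUDAPlace(1)]
```

The same ordering applies on the shell side: export CUDA_VISIBLE_DEVICES=3,5 must be issued in the same shell (or prefixed to the command) that launches python -u tools/train.py. The full log of the failing run follows.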

export CUDA_VISIBLE_DEVICES=3,5
(wei) wangrunqi@irecog:/data/wei/paddle0.3/PaddleDetection$ python -u tools/train.py -c configs/obj365/cascade_rcnn_cls_aware_r200_vd_fpn_dcnv2_nonlocal_softnms.yml
2020-07-30 09:42:57,638-INFO: If regularizer of a Parameter has been set by 'fluid.ParamAttr' or 'fluid.WeightNormParamAttr' already. The Regularization[L2Decay, regularization_coeff=0.000100] in Optimizer will not take effect, and it will only be applied to other Parameters!
W0730 09:42:58.374402 48135 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.0, Runtime API Version: 10.0
W0730 09:42:58.379374 48135 device_context.cc:260] device: 0, cuDNN Version: 7.6.
2020-07-30 09:43:01,022-WARNING: /home/wangrunqi/.cache/paddle/weights/ResNet200_vd_pretrained.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
/opt/Anaconda3/envs/wei/lib/python3.7/site-packages/paddle/fluid/io.py:1998: UserWarning: This list is not set, Because of Paramerter not found in program. There are: fc_0.b_0 fc_0.w_0
  format(" ".join(unused_para_list)))
2020-07-30 09:43:13,167-INFO: places would be ommited when DataLoader is not iterable
W0730 09:43:16.844976 48135 fuse_all_reduce_op_pass.cc:74] Find all_reduce operators: 380. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 317.
W0730 09:43:31.182279 48566 operator.cc:187] concat raises an exception paddle::memory::allocation::BadAlloc,


C++ Call Stacks (More useful to developers):

0   std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
1   paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
2   paddle::memory::allocation::AlignedAllocator::AllocateImpl(unsigned long)
3   paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl(unsigned long)
4   paddle::memory::allocation::Allocator::Allocate(unsigned long)
5   paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
6   paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
7   paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
8   paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
9   paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
10  paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
11  std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 1ul, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16>, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, long>, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, int> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
12  paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
13  paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
14  paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
15  paddle::framework::details::ComputationOpHandle::RunImpl()
16  paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
17  paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue > const&, unsigned long*)
18  std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&)
19  std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
20  ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const


Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 98.000244MB memory on GPU 0, available memory is only 43.625000MB.

Please check whether there is any other process using GPU 0.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.

at (/paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:69)

F0730 09:43:31.183924 48566 exception_holder.h:37] std::exception caught,


C++ Call Stacks (More useful to developers):

0   std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
1   paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
2   paddle::memory::allocation::AlignedAllocator::AllocateImpl(unsigned long)
3   paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl(unsigned long)
4   paddle::memory::allocation::Allocator::Allocate(unsigned long)
5   paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
6   paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
7   paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
8   paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
9   paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
10  paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
11  std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 1ul, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16>, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, long>, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, int> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
12  paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
13  paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
14  paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
15  paddle::framework::details::ComputationOpHandle::RunImpl()
16  paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
17  paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue > const&, unsigned long*)
18  std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&)
19  std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
20  ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const


Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 98.000244MB memory on GPU 0, available memory is only 43.625000MB.

Please check whether there is any other process using GPU 0.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.

at (/paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:69)

* Check failure stack trace: *

@     0x7fea2a580add  google::LogMessage::Fail()
@     0x7fea2a58458c  google::LogMessage::SendToLog()
@     0x7fea2a580603  google::LogMessage::Flush()
@     0x7fea2a585a9e  google::LogMessageFatal::~LogMessageFatal()
@     0x7fea2d7515b8  paddle::framework::details::ExceptionHolder::Catch()
@     0x7fea2d7ec6ae  paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
@     0x7fea2d7ea04f  paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@     0x7fea2d7ea314  _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
@     0x7fea2a5ddfb3  std::_Function_handler<>::_M_invoke()
@     0x7fea2a3d9647  std::__future_base::_State_base::_M_do_set()
@     0x7feb851dca99  __pthread_once_slow
@     0x7fea2d7e64e2  _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
@     0x7fea2a3dbaa4  _ZZN10ThreadPoolC1EmENKUlvE_clEv
@     0x7fea76505421  execute_native_thread_routine_compat
@     0x7feb851d56ba  start_thread
@     0x7feb84f0b41d  clone
@              (nil)  (unknown)
Aborted (core dumped)
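The error summary in the log suggests two things: first check whether another process already occupies GPU 0, and only then reduce the batch size. A small helper for the first check (a sketch, not part of PaddleDetection; it simply shells out to nvidia-smi):

```python
import subprocess

# Query per-card memory usage; cards with little free memory are already
# occupied by other processes and should be excluded via CUDA_VISIBLE_DEVICES.
out = subprocess.check_output(
    ['nvidia-smi',
     '--query-gpu=index,memory.used,memory.free',
     '--format=csv,noheader,nounits'],
    text=True)

for line in out.strip().splitlines():
    idx, used, free = (field.strip() for field in line.split(','))
    print(f'GPU {idx}: {used} MiB used, {free} MiB free')
```

If no other process sits on the chosen cards and memory is still insufficient, the remaining option is the one the error message itself names: lowering the batch size in the reader section of the training config.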