Model training GPU problem (#1128) · Issue · PaddlePaddle / PaddleDetection

Model training GPU problem

Created by: weiaoliu

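While training obj365/cascade_rcnn_cls_aware_r200_vd_fpn_dcnv2_nonlocal_softnms.yml, I tried to select the GPUs with export CUDA_VISIBLE_DEVICES=3,5, but it had no effect. Adding os.environ['CUDA_VISIBLE_DEVICES'] = '3,5' to train.py did not work either: the job still runs on other GPUs. I am using the 0.3 release of Paddle.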
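For anyone reproducing this, two general CUDA behaviours are worth ruling out first. The variable is only read when CUDA is initialised in the process, so it has to be set before any GPU library is imported. The visible devices are also renumbered from 0, so "GPU 0" in a log means the first entry of the list, not physical card 0. The sketch below illustrates both points. The helper names (`parse_visible_devices`, `logical_to_physical`, `set_visible_devices`) are hypothetical and not part of PaddlePaddle or PaddleDetection, and the parser only handles integer indices, not GPU UUIDs.

```python
import os


def parse_visible_devices(value):
    """Parse a CUDA_VISIBLE_DEVICES-style string of integer indices.

    Hypothetical helper: accepts only ASCII digits separated by ASCII
    commas, so typos such as a full-width comma are rejected early.
    """
    ids = []
    for part in value.split(","):
        part = part.strip()
        if not (part.isascii() and part.isdigit()):
            raise ValueError("invalid device entry: %r" % part)
        ids.append(int(part))
    return ids


def logical_to_physical(value, logical_index):
    """Map the index a framework reports (0, 1, ...) back to the
    physical index listed in CUDA_VISIBLE_DEVICES."""
    return parse_visible_devices(value)[logical_index]


def set_visible_devices(value):
    """Validate the list, then export it for the current process.

    This must run before the first import of any CUDA-using library;
    setting the variable after CUDA is initialised has no effect.
    """
    parse_visible_devices(value)
    os.environ["CUDA_VISIBLE_DEVICES"] = value


# Intended use, at the very top of the entry script:
#     set_visible_devices("3,5")
#     import paddle  # only after the variable is set
```

With the value `3,5`, `logical_to_physical("3,5", 0)` gives 3, so an error that mentions GPU 0 would refer to physical card 3, provided the variable was actually honoured.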

export CUDA_VISIBLE_DEVICES=3,5
(wei) wangrunqi@irecog:/data/wei/paddle0.3/PaddleDetection$ python -u tools/train.py -c configs/obj365/cascade_rcnn_cls_aware_r200_vd_fpn_dcnv2_nonlocal_softnms.yml
2020-07-30 09:42:57,638-INFO: If regularizer of a Parameter has been set by 'fluid.ParamAttr' or 'fluid.WeightNormParamAttr' already. The Regularization[L2Decay, regularization_coeff=0.000100] in Optimizer will not take effect, and it will only be applied to other Parameters!
W0730 09:42:58.374402 48135 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.0, Runtime API Version: 10.0
W0730 09:42:58.379374 48135 device_context.cc:260] device: 0, cuDNN Version: 7.6.
2020-07-30 09:43:01,022-WARNING: /home/wangrunqi/.cache/paddle/weights/ResNet200_vd_pretrained.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
/opt/Anaconda3/envs/wei/lib/python3.7/site-packages/paddle/fluid/io.py:1998: UserWarning: This list is not set, Because of Paramerter not found in program. There are: fc_0.b_0 fc_0.w_0
  format(" ".join(unused_para_list)))
2020-07-30 09:43:13,167-INFO: places would be ommited when DataLoader is not iterable
W0730 09:43:16.844976 48135 fuse_all_reduce_op_pass.cc:74] Find all_reduce operators: 380. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 317.
W0730 09:43:31.182279 48566 operator.cc:187] concat raises an exception paddle::memory::allocation::BadAlloc,

C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
1 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
2 paddle::memory::allocation::AlignedAllocator::AllocateImpl(unsigned long)
3 paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl(unsigned long)
4 paddle::memory::allocation::Allocator::Allocate(unsigned long)
5 paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
6 paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
7 paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
8 paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
9 paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
10 paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
11 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 1ul, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16>, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, long>, paddle::operators::ConcatKernel<paddle::platform::CUDADeviceContext, int> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
12 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
13 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
14 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
15 paddle::framework::details::ComputationOpHandle::RunImpl()
16 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
17 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue > const&, unsigned long*)
18 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&)
19 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
20 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const

Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 98.000244MB memory on GPU 0, available memory is only 43.625000MB.

Please check whether there is any other process using GPU 0.

If yes, please stop them, or start PaddlePaddle on another GPU.
If no, please decrease the batch size of your model.

at (/paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:69)
F0730 09:43:31.183924 48566 exception_holder.h:37] std::exception caught,

C++ Call Stacks (More useful to developers):

Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 98.000244MB memory on GPU 0, available memory is only 43.625000MB.

Please check whether there is any other process using GPU 0.

If yes, please stop them, or start PaddlePaddle on another GPU.
If no, please decrease the batch size of your model.

at (/paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:69)

* Check failure stack trace: *

@ 0x7fea2a580add google::LogMessage::Fail()
@ 0x7fea2a58458c google::LogMessage::SendToLog()
@ 0x7fea2a580603 google::LogMessage::Flush()
@ 0x7fea2a585a9e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fea2d7515b8 paddle::framework::details::ExceptionHolder::Catch()
@ 0x7fea2d7ec6ae paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
@ 0x7fea2d7ea04f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@ 0x7fea2d7ea314 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
@ 0x7fea2a5ddfb3 std::_Function_handler<>::_M_invoke()
@ 0x7fea2a3d9647 std::__future_base::_State_base::_M_do_set()
@ 0x7feb851dca99 __pthread_once_slow
@ 0x7fea2d7e64e2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
@ 0x7fea2a3dbaa4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
@ 0x7fea76505421 execute_native_thread_routine_compat
@ 0x7feb851d56ba start_thread
@ 0x7feb84f0b41d clone
@ (nil) (unknown)
Aborted (core dumped)
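The error summary asks whether another process is already using the GPU. One way to check is to compare per-card memory use as reported by nvidia-smi, roughly as sketched below. The function names are hypothetical, and the query assumes nvidia-smi is on the PATH. Note also that nvidia-smi lists cards in PCI bus order, while CUDA's default enumeration order can differ unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set, so index 3 in nvidia-smi is not guaranteed to be the same card as index 3 in CUDA_VISIBLE_DEVICES.

```python
import subprocess

# Query flags supported by nvidia-smi; memory values are reported in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn 'index, used, total' CSV lines into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {"index": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return rows


def free_mib(row):
    """Memory still available on one card, in MiB."""
    return row["total_mib"] - row["used_mib"]


def query_gpu_memory():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_gpu_memory(result.stdout)


# Intended use on the training machine:
#     for row in query_gpu_memory():
#         print(row["index"], free_mib(row), "MiB free")
```

If the card that reports almost no free memory is not one of the cards listed in CUDA_VISIBLE_DEVICES, that would suggest the variable is not being applied, or that the two tools are numbering the cards differently.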
