Single machine with multiple GPUs: GPU 1 is specified (and idle), yet the error is reported on GPU 0.
Created by: lalala805
A question about single-machine multi-GPU training: I set export CUDA_VISIBLE_DEVICES=1 before launching, and GPU 1 is completely idle. Training still fails, and from the error message it looks like GPU 0 is running out of memory (GPU 0 is occupied by another model). Since I explicitly pointed the job at GPU 1, what am I getting wrong?
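My understanding (please correct me if this is wrong) is that with CUDA_VISIBLE_DEVICES=1 only physical card 1 should be visible to the process, and CUDA renumbers that card as device 0 inside the process, so "device: 0" in the log may not mean physical GPU 0. As a sanity check I can run something like the sketch below (just my own quick check, nothing framework-specific):

```bash
# Sanity-check sketch: confirm the mask is set in the environment the training
# script inherits, and watch which physical card actually allocates memory.
export CUDA_VISIBLE_DEVICES=1
python -c "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES'))"
nvidia-smi   # run in another terminal while training to see per-card usage
```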
The error output is attached below.
- export FLAGS_sync_nccl_allreduce=1
- FLAGS_sync_nccl_allreduce=1
- export CUDA_VISIBLE_DEVICES=1
- CUDA_VISIBLE_DEVICES=1
- python -u run_classifier.py --use_cuda true --verbose true --do_train true --do_val false --do_test false --batch_size 16 --init_pretraining_params ../model/params --train_set ../task_data/dev/a.tsv --dev_set ../task_data/dev/b.tsv --test_set ../task_data/dev/c.tsv --vocab_path ../model/vocab.txt --checkpoints ./checkpoints --save_steps 10 --weight_decay 0.0 --warmup_proportion 0.0 --validation_steps 5 --epoch 10 --max_seq_len 128 --ernie_config_path ../model/ernie_config.json --learning_rate 2e-5 --skip_steps 10 --num_iteration_per_drop_scope 1 --num_labels 2 --random_seed 1

----------- Configuration Arguments -----------
batch_size: 16
checkpoints: ./checkpoints
dev_set: ../task_data/dev/b.tsv
do_lower_case: True
do_test: False
do_train: True
do_val: False
enable_ce: False
epoch: 10
ernie_config_path: ../model/ernie_config.json
in_tokens: False
init_checkpoint: None
init_pretraining_params: ../model/params
label_map_config: None
learning_rate: 2e-05
loss_scaling: 1.0
lr_scheduler: linear_warmup_decay
max_seq_len: 128
metrics: True
num_iteration_per_drop_scope: 1
num_labels: 2
random_seed: 1
save_steps: 10
shuffle: True
skip_steps: 10
test_set: ../task_data/dev/c.tsv
train_set: ../task_data/dev/a.tsv
use_cuda: True
use_fast_executor: False
use_fp16: False
validation_steps: 5
verbose: True
vocab_path: ../model/vocab.txt
warmup_proportion: 0.0
weight_decay: 0.0
attention_probs_dropout_prob: 0.1
hidden_act: relu
hidden_dropout_prob: 0.1
hidden_size: 768
initializer_range: 0.02
max_position_embeddings: 513
num_attention_heads: 12
num_hidden_layers: 12
type_vocab_size: 2
vocab_size: 18000
Device count: 1
Num train examples: 7976
Max train steps: 4985
Num warmup steps: 0
memory_optimize is deprecated. Use CompiledProgram and Executor
Theoretical memory usage in training: 4144.448 - 4341.803 MB
W0812 11:06:08.425791 92102 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 9.2, Runtime API Version: 9.0
W0812 11:06:08.428957 92102 device_context.cc:269] device: 0, cuDNN Version: 7.0.
W0812 11:06:08.428998 92102 device_context.cc:293] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in your machine is 7.0, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
Load pretraining parameters from ../model/params.
ParallelExecutor is deprecated. Please use CompiledProgram and Executor. CompiledProgram is a central place for optimization and Executor is the unified executor. Example can be found in compiler.py.
W0812 11:06:09.157330 92102 graph.h:204] WARN: After a series of passes, the current graph can be quite different from OriginProgram. So, please avoid using the OriginProgram() method!
I0812 11:06:09.588352 92102 build_strategy.cc:285] SeqOnlyAllReduceOps:0, num_trainers:1
W0812 11:06:11.642266 92102 system_allocator.cc:125] Cannot malloc 6652.83 MB GPU memory. Please shrink FLAGS_fraction_of_gpu_memory_to_use or FLAGS_initial_gpu_memory_in_mb or FLAGS_reallocate_gpu_memory_in_mb environment variable to a lower value. Current FLAGS_fraction_of_gpu_memory_to_use value is 0.92. Current FLAGS_initial_gpu_memory_in_mb value is 0. Current FLAGS_reallocate_gpu_memory_in_mb value is 0
F0812 11:06:11.642613 92102 legacy_allocator.cc:200] Cannot allocate 2.953125MB in GPU 0, available 436.937500MB, total 7981694976, GpuMinChunkSize 256.000000B, GpuMaxChunkSize 6.496908GB, GPU memory used: 6.495386GB
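For what it's worth, the system_allocator warning above suggests lowering the allocator flags it names; below is a minimal sketch of that workaround (the 0.5 and 2048 values are placeholders I picked, not values taken from the log), though I'm not sure it addresses the device-selection question itself:

```bash
# Sketch of the workaround hinted at by the warning above; values are placeholders.
export FLAGS_fraction_of_gpu_memory_to_use=0.5
# or pin the initial pool size in MB instead of using a fraction:
# export FLAGS_initial_gpu_memory_in_mb=2048
```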