GPU Memory error while running script/zh_task/ernie_base/run_drcd.sh
Created by: cibinjohn
I was trying to train a custom machine comprehension model on SQuAD v1.1 data by running script/zh_task/ernie_base/run_drcd.sh, and hit the GPU out-of-memory error below. Any help would be appreciated.
- export FLAGS_eager_delete_tensor_gb=0
- FLAGS_eager_delete_tensor_gb=0
- export FLAGS_sync_nccl_allreduce=1
- FLAGS_sync_nccl_allreduce=1
- export CUDA_VISIBLE_DEVICES=0,1,2,3
- CUDA_VISIBLE_DEVICES=0,1,2,3
- python -u run_mrc.py --use_cuda true --train_set /home/ubuntu/cibin/squad_v1_1__data/train.json --batch_size 16 --in_tokens false --use_fast_executor true --checkpoints ./checkpoints --vocab_path /home/ubuntu/cibin/ERNIE/pretrained_model/vocab.txt --ernie_config_path /home/ubuntu/cibin/ERNIE/pretrained_model/ernie_config.json --do_train true --do_val true --do_test true --verbose true --save_steps 1000 --validation_steps 100 --warmup_proportion 0.0 --weight_decay 0.01 --epoch 2 --max_seq_len 512 --do_lower_case true --doc_stride 128 --dev_set /home/ubuntu/cibin/squad_v1_1__data/dev.json --test_set /home/ubuntu/cibin/squad_v1_1__data/test.json --learning_rate 5e-5 --num_iteration_per_drop_scope 1 --init_pretraining_params /home/ubuntu/cibin/ERNIE/pretrained_model/params --skip_steps 10
attention_probs_dropout_prob: 0.1
hidden_act: gelu
hidden_dropout_prob: 0.1
hidden_size: 768
initializer_range: 0.02
max_position_embeddings: 512
num_attention_heads: 12
num_hidden_layers: 12
sent_type_vocab_size: 4
task_type_vocab_size: 16
vocab_size: 30522
Device count: 4
Num train examples: 1483
Max train steps: 46
Num warmup steps: 0
memory_optimize is deprecated. Use CompiledProgram and Executor
Theoretical memory usage in training: 13971.085 - 14636.375 MB
W0819 05:09:46.604622 511 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 10.0
W0819 05:09:46.606772 511 device_context.cc:267] device: 0, cuDNN Version: 7.6.
Load pretraining parameters from /home/ubuntu/cibin/libor/github/ERNIE/pretrained_model/params.
I0819 05:09:49.962049 511 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 4. And the Program will be copied 4 copies
I0819 05:09:51.959648 511 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
W0819 05:09:55.706979 569 system_allocator.cc:121] Cannot malloc 9770.01 MB GPU memory. Please shrink FLAGS_fraction_of_gpu_memory_to_use or FLAGS_initial_gpu_memory_in_mb or FLAGS_reallocate_gpu_memory_in_mb environment variable to a lower value. Current FLAGS_fraction_of_gpu_memory_to_use value is 0.92. Current FLAGS_initial_gpu_memory_in_mb value is 0. Current FLAGS_reallocate_gpu_memory_in_mb value is 0
F0819 05:09:55.707295 569 legacy_allocator.cc:201] Cannot allocate 139.869873 MB in GPU 1, available 648.500000 MB, total 11721506816, GpuMinChunkSize 256.000000 B, GpuMaxChunkSize 9.541025 GB, GPU memory used: 9.507799 GB
*** Check failure stack trace: ***
@ 0x7f4d748e639d google::LogMessage::Fail()
@ 0x7f4d748e9e4c google::LogMessage::SendToLog()
@ 0x7f4d748e5ec3 google::LogMessage::Flush()
@ 0x7f4d748eb35e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f4d768c77d4 paddle::memory::legacy::Alloc<>()
@ 0x7f4d768c7ab5 paddle::memory::allocation::LegacyAllocator::AllocateImpl()
@ 0x7f4d768bbbd5 paddle::memory::allocation::AllocatorFacade::Alloc()
@ 0x7f4d768bbd5a paddle::memory::allocation::AllocatorFacade::AllocShared()
@ 0x7f4d764b489c paddle::memory::AllocShared()
@ 0x7f4d7688d924 paddle::framework::Tensor::mutable_data()
@ 0x7f4d74b90ba5 paddle::operators::MatMulGradKernel<>::MatMul()
@ 0x7f4d74b90e1f paddle::operators::MatMulGradKernel<>::CalcInputGrad()
@ 0x7f4d74b912e7 paddle::operators::MatMulGradKernel<>::Compute()
@ 0x7f4d74b916f3 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EINS0_9operators16MatMulGradKernelINS7_17CUDADeviceContextEfEENSA_ISB_dEENSA_ISB_NS7_7float16EEEEEclEPKcSI_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4
@ 0x7f4d7682f187 paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7f4d7682f561 paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7f4d7682cb5c paddle::framework::OperatorBase::Run()
@ 0x7f4d7662805a paddle::framework::details::ComputationOpHandle::RunImpl()
@ 0x7f4d7661aa00 paddle::framework::details::OpHandleBase::Run()
@ 0x7f4d765fbd76 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
@ 0x7f4d765fa9df paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@ 0x7f4d765fad9f _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
@ 0x7f4d749d3b53 std::_Function_handler<>::_M_invoke()
@ 0x7f4d74869c47 std::__future_base::_State_base::_M_do_set()
@ 0x7f4dc650da99 __pthread_once_slow
@ 0x7f4d765f6422 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
@ 0x7f4d7486b1c4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
@ 0x7f4da839cc80 (unknown)
@ 0x7f4dc65066ba start_thread
@ 0x7f4dc623c41d clone
@ (nil) (unknown)
script/zh_task/ernie_base/run_drcd.sh: line 50: 511 Aborted (core dumped) python -u run_mrc.py --use_cuda true --train_set ${TASK_DATA_PATH1}/train.json --batch_size 16 --in_tokens false --use_fast_executor true --checkpoints ./checkpoints --vocab_path ${MODEL_PATH}/vocab.txt --ernie_config_path ${MODEL_PATH}/ernie_config.json --do_train true --do_val true --do_test true --verbose true --save_steps 1000 --validation_steps 100 --warmup_proportion 0.0 --weight_decay 0.01 --epoch 2 --max_seq_len 512 --do_lower_case true --doc_stride 128 --dev_set ${TASK_DATA_PATH}/dev.json --test_set ${TASK_DATA_PATH}/test.json --learning_rate 5e-5 --num_iteration_per_drop_scope 1 --init_pretraining_params ${MODEL_PATH}/params --skip_steps 10
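For reference, the allocator warning above names the knobs to try first. A minimal sketch of mitigations, assuming the flag names from the error message and PaddlePaddle's standard GPU memory flags; the specific batch size and sequence length values below are illustrative guesses, not tested settings:

```shell
# Assumption: values are illustrative starting points, not verified settings.

# Let the allocator grow on demand instead of reserving 92% of each GPU
# up front (the log shows FLAGS_fraction_of_gpu_memory_to_use=0.92).
export FLAGS_fraction_of_gpu_memory_to_use=0.5

# Keep eager deletion of intermediate tensors on, as run_drcd.sh already does.
export FLAGS_eager_delete_tensor_gb=0

# Shrink the per-GPU working set: halve the batch size and/or trim the
# sequence length, then pass these to run_mrc.py in place of the
# original --batch_size 16 --max_seq_len 512.
BATCH_SIZE=8        # was 16
MAX_SEQ_LEN=384     # was 512; doc_stride 128 still covers long contexts

echo "retrying with batch_size=${BATCH_SIZE}, max_seq_len=${MAX_SEQ_LEN}"
```

With the theoretical usage estimated at roughly 14 GB per device and only about 11 GB per card available, some combination of a lower batch size and shorter sequences is likely needed regardless of the allocator flags.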