PaddlePaddle / ERNIE
Opened Aug 19, 2019 by saxon_zh (@saxon_zh, Guest)

GPU Memory error while running script/zh_task/ernie_base/run_drcd.sh

Created by: cibinjohn

I was trying to train a custom machine reading comprehension model on SQuAD v1 data by running script/zh_task/ernie_base/run_drcd.sh, and I encountered the following error. Any help would be appreciated.

  • export FLAGS_eager_delete_tensor_gb=0
  • FLAGS_eager_delete_tensor_gb=0
  • export FLAGS_sync_nccl_allreduce=1
  • FLAGS_sync_nccl_allreduce=1
  • export CUDA_VISIBLE_DEVICES=0,1,2,3
  • CUDA_VISIBLE_DEVICES=0,1,2,3
  • python -u run_mrc.py --use_cuda true --train_set /home/ubuntu/cibin/squad_v1_1__data/train.json --batch_size 16 --in_tokens false --use_fast_executor true --checkpoints ./checkpoints --vocab_path /home/ubuntu/cibin/ERNIE/pretrained_model/vocab.txt --ernie_config_path /home/ubuntu/cibin/ERNIE/pretrained_model/ernie_config.json --do_train true --do_val true --do_test true --verbose true --save_steps 1000 --validation_steps 100 --warmup_proportion 0.0 --weight_decay 0.01 --epoch 2 --max_seq_len 512 --do_lower_case true --doc_stride 128 --dev_set /home/ubuntu/cibin/squad_v1_1__data/dev.json --test_set /home/ubuntu/cibin/squad_v1_1__data/test.json --learning_rate 5e-5 --num_iteration_per_drop_scope 1 --init_pretraining_params /home/ubuntu/cibin/ERNIE/pretrained_model/params --skip_steps 10

attention_probs_dropout_prob: 0.1
hidden_act: gelu
hidden_dropout_prob: 0.1
hidden_size: 768
initializer_range: 0.02
max_position_embeddings: 512
num_attention_heads: 12
num_hidden_layers: 12
sent_type_vocab_size: 4
task_type_vocab_size: 16
vocab_size: 30522

Device count: 4
Num train examples: 1483
Max train steps: 46
Num warmup steps: 0
memory_optimize is deprecated. Use CompiledProgram and Executor
Theoretical memory usage in training: 13971.085 - 14636.375 MB
W0819 05:09:46.604622 511 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 10.0
W0819 05:09:46.606772 511 device_context.cc:267] device: 0, cuDNN Version: 7.6.
Load pretraining parameters from /home/ubuntu/cibin/libor/github/ERNIE/pretrained_model/params.
I0819 05:09:49.962049 511 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 4. And the Program will be copied 4 copies
I0819 05:09:51.959648 511 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
W0819 05:09:55.706979 569 system_allocator.cc:121] Cannot malloc 9770.01 MB GPU memory. Please shrink FLAGS_fraction_of_gpu_memory_to_use or FLAGS_initial_gpu_memory_in_mb or FLAGS_reallocate_gpu_memory_in_mbenvironment variable to a lower value. Current FLAGS_fraction_of_gpu_memory_to_use value is 0.92. Current FLAGS_initial_gpu_memory_in_mb value is 0. Current FLAGS_reallocate_gpu_memory_in_mb value is 0
F0819 05:09:55.707295 569 legacy_allocator.cc:201] Cannot allocate 139.869873MB in GPU 1, available 648.500000MB
total 11721506816
GpuMinChunkSize 256.000000B
GpuMaxChunkSize 9.541025GB
GPU memory used: 9.507799GB
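The figures in the log above can be cross-checked with a little shell arithmetic. The sketch below is illustrative only: the variable names are made up, and the step formula (epochs times examples, divided by batch size and then by device count, using integer division) is an assumption that happens to reproduce the logged `Max train steps: 46`. It also assumes that the "Theoretical memory usage" estimate applies per device, which the log does not state.

```shell
# Values copied from the command line and the log above.
epoch=2
num_train_examples=1483
batch_size=16
device_count=4
total_bytes=11721506816   # "total" reported for GPU 1, in bytes
estimate_low_mb=13971     # lower bound of the estimate, fraction dropped

# Assumed step formula; integer division at every stage.
max_train_steps=$(( epoch * num_train_examples / batch_size / device_count ))

# Device capacity in MiB (integer division drops the trailing .5).
total_mb=$(( total_bytes / 1024 / 1024 ))

echo "max_train_steps=${max_train_steps}"
echo "total_mb=${total_mb}"

# Compare the estimated footprint with what one device can hold.
if [ "${estimate_low_mb}" -gt "${total_mb}" ]; then
  verdict="estimate exceeds device memory"
else
  verdict="estimate fits in device memory"
fi
echo "${verdict}"
```

If the per-device assumption holds, even the lower bound of the estimate is larger than the memory of a single card, which would be consistent with the allocation failure reported in the log.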

*** Check failure stack trace: ***

@ 0x7f4d748e639d google::LogMessage::Fail()
@ 0x7f4d748e9e4c google::LogMessage::SendToLog()
@ 0x7f4d748e5ec3 google::LogMessage::Flush()
@ 0x7f4d748eb35e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f4d768c77d4 paddle::memory::legacy::Alloc<>()
@ 0x7f4d768c7ab5 paddle::memory::allocation::LegacyAllocator::AllocateImpl()
@ 0x7f4d768bbbd5 paddle::memory::allocation::AllocatorFacade::Alloc()
@ 0x7f4d768bbd5a paddle::memory::allocation::AllocatorFacade::AllocShared()
@ 0x7f4d764b489c paddle::memory::AllocShared()
@ 0x7f4d7688d924 paddle::framework::Tensor::mutable_data()
@ 0x7f4d74b90ba5 paddle::operators::MatMulGradKernel<>::MatMul()
@ 0x7f4d74b90e1f paddle::operators::MatMulGradKernel<>::CalcInputGrad()
@ 0x7f4d74b912e7 paddle::operators::MatMulGradKernel<>::Compute()
@ 0x7f4d74b916f3 ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EINS0_9operators16MatMulGradKernelINS7_17CUDADeviceContextEfEENSA_ISB_dEENSA_ISB_NS7_7float16EEEEEclEPKcSI_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4
@ 0x7f4d7682f187 paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7f4d7682f561 paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7f4d7682cb5c paddle::framework::OperatorBase::Run()
@ 0x7f4d7662805a paddle::framework::details::ComputationOpHandle::RunImpl()
@ 0x7f4d7661aa00 paddle::framework::details::OpHandleBase::Run()
@ 0x7f4d765fbd76 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
@ 0x7f4d765fa9df paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@ 0x7f4d765fad9f _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
@ 0x7f4d749d3b53 std::_Function_handler<>::_M_invoke()
@ 0x7f4d74869c47 std::__future_base::_State_base::_M_do_set()
@ 0x7f4dc650da99 __pthread_once_slow
@ 0x7f4d765f6422 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
@ 0x7f4d7486b1c4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
@ 0x7f4da839cc80 (unknown)
@ 0x7f4dc65066ba start_thread
@ 0x7f4dc623c41d clone
@ (nil) (unknown)
script/zh_task/ernie_base/run_drcd.sh: line 50: 511 Aborted (core dumped) python -u run_mrc.py --use_cuda true --train_set ${TASK_DATA_PATH1}/train.json --batch_size 16 --in_tokens false --use_fast_executor true --checkpoints ./checkpoints --vocab_path ${MODEL_PATH}/vocab.txt --ernie_config_path ${MODEL_PATH}/ernie_config.json --do_train true --do_val true --do_test true --verbose true --save_steps 1000 --validation_steps 100 --warmup_proportion 0.0 --weight_decay 0.01 --epoch 2 --max_seq_len 512 --do_lower_case true --doc_stride 128 --dev_set ${TASK_DATA_PATH}/dev.json --test_set ${TASK_DATA_PATH}/test.json --learning_rate 5e-5 --num_iteration_per_drop_scope 1 --init_pretraining_params ${MODEL_PATH}/params --skip_steps 10
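
The allocator warning in the log suggests lowering `FLAGS_fraction_of_gpu_memory_to_use`, and the per-step memory footprint plausibly also depends on `--batch_size` and `--max_seq_len`. A minimal dry-run sketch of a retry with smaller settings might look like the following. The specific values (0.5, 4, 384) are arbitrary assumptions, not recommendations from the ERNIE project, and the remaining arguments of the original command are omitted for brevity. The script only prints the command; it does not launch training.

```shell
# Hypothetical reduced settings; tune them for the GPUs at hand.
export FLAGS_fraction_of_gpu_memory_to_use=0.5   # the log showed 0.92
batch_size=4                                     # the original run used 16
max_seq_len=384                                  # the original run used 512

# Only the changed flags are listed; the others stay as in the original call.
cmd="python -u run_mrc.py --use_cuda true --batch_size ${batch_size} --max_seq_len ${max_seq_len}"

# Dry run: print the command instead of executing it.
echo "${cmd}"
```

Whether any of these changes is enough on an 11 GB card has to be verified by actually running the job; the sketch only shows where the knobs are.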
Reference: paddlepaddle/ERNIE#286