GPU memory allocation failure during training
Created by: lxwzafu
- Version and environment info:
  1) PaddlePaddle version: paddlepaddle-1.5.2
  2) GPU: training GPU model GeForce GTX 1080 Ti, driver version 25.21.14.1896, two such cards in total
  3) CUDA 10.0.130_411.31, cuDNN v7.5.0.56 for CUDA 10.0
  4) System environment: Windows 10 64-bit, VS2017, Python36_64
- Training info:
  1) Single machine, multiple GPUs (but training currently appears to use only one card)
  2) Memory per card: 11 GB GDDR5X dedicated GPU memory, 31.8 GB shared GPU memory, 42.8 GB GPU memory in total
  3) SSD model training, link: https://aistudio.baidu.com/aistudio/projectdetail/78972

The console shows the following:

```
start ssd, train params: {'input_size': [3, 300, 300], 'class_dim': 2, 'label_dict': {'background': 0, 'person': 1}, 'image_count': 7104, 'log_feed_image': False, 'pretrained': True, 'pretrained_model_dir': './pretrained-model', 'continue_train': True, 'save_model_dir': './ssd-model', 'model_prefix': 'mobilenet-ssd', 'data_dir': 'Y:/pascalvoc', 'mean_rgb': [127.5, 127.5, 127.5], 'file_list': 'train_person.txt', 'eval_file_list': 'eval_person.txt', 'label_list': 'label_list_person', 'mode': 'train', 'num_epochs': 400, 'train_batch_size': 64, 'use_gpu': True, 'apply_distort': True, 'apply_expand': True, 'apply_corp': True, 'image_distort_strategy': {'expand_prob': 0.5, 'expand_max_ratio': 4, 'hue_prob': 0.5, 'hue_delta': 18, 'contrast_prob': 0.5, 'contrast_delta': 0.5, 'saturation_prob': 0.5, 'saturation_delta': 0.5, 'brightness_prob': 0.5, 'brightness_delta': 0.125}, 'rsm_strategy': {'learning_rate': 0.001, 'lr_epochs': [40, 60, 80, 100], 'lr_decay': [1, 0.5, 0.25, 0.1, 0.01]}, 'momentum_strategy': {'learning_rate': 0.1, 'decay_steps': 128, 'decay_rate': 0.8}, 'early_stop': {'sample_frequency': 50, 'successive_limit': 3, 'min_loss': 1.28, 'min_curr_map': 0.86}}
W0929 14:21:07.261186 35732 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 10.0
W0929 14:21:07.275125 35732 device_context.cc:267] device: 0, cuDNN Version: 7.5.
load param from retrain model
current round: 0, start read image
W0929 14:21:15.004384 35732 system_allocator.cc:121] Cannot malloc 45.1252 MB GPU memory. Please shrink FLAGS_fraction_of_gpu_memory_to_use or FLAGS_initial_gpu_memory_in_mb or FLAGS_reallocate_gpu_memory_in_mb environment variable to a lower value. Current FLAGS_fraction_of_gpu_memory_to_use value is 0.5. Current FLAGS_initial_gpu_memory_in_mb value is 0. Current FLAGS_reallocate_gpu_memory_in_mb value is 0
F0929 14:21:15.005416 35732 legacy_allocator.cc:201] Cannot allocate 45.125000MB in GPU 0, available 6.637500MB total 11811160064 GpuMinChunkSize 256.000000B GpuMaxChunkSize 4.278862GB GPU memory used: 8.760785GB
*** Check failure stack trace: ***
```
Then a dialog pops up: "The debug adapter exited unexpectedly", and the program stops running.
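For context, the warning above names PaddlePaddle's GPU allocator flags. A minimal sketch of setting them as environment variables before `paddle.fluid` is first imported; the specific values are only illustrative assumptions, not a verified fix:

```python
import os

# Illustrative values only (assumption): the flags must be in the environment
# before paddle.fluid is imported, because they are read when the framework
# initializes its GPU allocator.
os.environ['FLAGS_fraction_of_gpu_memory_to_use'] = '0.3'   # pre-allocate a smaller fraction per card
os.environ['FLAGS_initial_gpu_memory_in_mb'] = '512'        # or: start with a fixed 512 MB chunk
os.environ['FLAGS_reallocate_gpu_memory_in_mb'] = '256'     # and grow in 256 MB increments

import paddle.fluid as fluid  # flags take effect from this point on
```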
Reducing train_batch_size from 64 to 32 lets training run normally. But with this much GPU memory, why does allocating such a small amount still fail? And how should I configure multi-GPU training and the GPU memory-management strategy to improve training efficiency?
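As a rough sketch of what single-machine multi-GPU data parallelism looks like in PaddlePaddle 1.5: the network below is a toy stand-in for the SSD model from the linked project (an assumption for illustration, not the project's actual code), and the point is the `fluid.CompiledProgram(...).with_data_parallel(...)` step, which replicates the program across all visible GPUs and splits each batch among them.

```python
import os
# Expose both 1080 Ti cards before PaddlePaddle initializes CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

import numpy as np
import paddle.fluid as fluid

# Toy network standing in for the SSD model (illustrative only).
image = fluid.layers.data(name='image', shape=[3, 300, 300], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
pred = fluid.layers.fc(input=image, size=2, act='softmax')
loss = fluid.layers.mean(fluid.layers.cross_entropy(input=pred, label=label))
fluid.optimizer.SGD(learning_rate=0.001).minimize(loss)

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

# Replicate the program across all visible GPUs; each card processes its
# share of the fed batch, so per-card memory use roughly halves with 2 cards.
compiled = fluid.CompiledProgram(fluid.default_main_program()).with_data_parallel(
    loss_name=loss.name)

# One dummy step with a batch of 32 (16 samples per card on 2 GPUs).
batch = {
    'image': np.random.rand(32, 3, 300, 300).astype('float32'),
    'label': np.random.randint(0, 2, size=(32, 1)).astype('int64'),
}
loss_val, = exe.run(compiled, feed=batch, fetch_list=[loss.name])
print('loss:', loss_val)
```

If this is roughly right, keeping the global batch at 64 while splitting it across two cards should put per-card memory pressure close to the batch-size-32 case that already works.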