Opened March 15, 2018 by saxon_zh (Guest)

CUDA out of memory during training?

Created by: bolt163

Let me describe my setup. The machine has only 4 GPUs, each with 12 GB of memory, so I modified the run_train.sh script accordingly; its contents are as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u train.py \
--batch_size=64 \
--trainer_count=4 \
--num_passes=50 \
--num_proc_data=16 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=1024 \
--num_iter_print=100 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
--test_off=False \
--use_sortagrad=True \
--use_gru=True \
--use_gpu=True \
--is_local=True \
--share_rnn_weights=False \
--train_manifest='data/aishell/manifest.train' \
--dev_manifest='data/aishell/manifest.dev' \
--mean_std_path='data/aishell/mean_std.npz' \
--vocab_path='data/aishell/vocab.txt' \
--output_model_dir='./checkpoints/aishell' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'

if [ $? -ne 0 ]; then
    echo "Failed in training!"
    exit 1
fi

exit 0
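For context, here is a rough back-of-the-envelope check of the per-GPU workload these flags imply. This is my own sketch, not part of the original report; it assumes the global batch is split evenly across the trainer_count GPUs, which matches the numThreads=4 numDevices=4 line in the training log below.

# Sketch only: per-GPU workload implied by the run_train.sh flags above.
# Assumes the global batch of 64 is split evenly across the 4 trainer threads.
batch_size = 64        # --batch_size
trainer_count = 4      # --trainer_count, one trainer thread per K40m
max_duration = 27.0    # --max_duration, longest admitted utterance in seconds

per_gpu_batch = batch_size // trainer_count
worst_case_audio_s = per_gpu_batch * max_duration

print("utterances per GPU per step:", per_gpu_batch)                  # 16
print("worst-case audio per GPU per step:", worst_case_audio_s, "s")  # 432.0 s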

-------------------------------------------------------------------------------

~/DeepSpeech/examples/aishell> sh run_train.sh
----------- Configuration Arguments -----------
augment_conf_path: conf/augmentation.config
batch_size: 64
dev_manifest: data/aishell/manifest.dev
init_model_path: None
is_local: 1
learning_rate: 0.0005
max_duration: 27.0
mean_std_path: data/aishell/mean_std.npz
min_duration: 0.0
num_conv_layers: 2
num_iter_print: 100
num_passes: 50
num_proc_data: 16
num_rnn_layers: 3
output_model_dir: ./checkpoints/aishell
rnn_layer_size: 1024
share_rnn_weights: 0
shuffle_method: batch_shuffle_clipped
specgram_type: linear
test_off: 0
train_manifest: data/aishell/manifest.train
trainer_count: 4
use_gpu: 1
use_gru: 1
use_sortagrad: 1
vocab_path: data/aishell/vocab.txt

I0315 16:47:44.366181 11850 Util.cpp:166] commandline: --use_gpu=1 --rnn_use_batch=True --log_clipping=True --trainer_count=4
[INFO 2018-03-15 16:47:46,743 layers.py:2714] output for conv_0: c = 32, h = 81, w = 54, size = 139968
[INFO 2018-03-15 16:47:46,744 layers.py:3282] output for batch_norm_0: c = 32, h = 81, w = 54, size = 139968
[INFO 2018-03-15 16:47:46,744 layers.py:7454] output for scale_sub_region_0: c = 32, h = 81, w = 54, size = 139968
[INFO 2018-03-15 16:47:46,745 layers.py:2714] output for conv_1: c = 32, h = 41, w = 54, size = 70848
[INFO 2018-03-15 16:47:46,746 layers.py:3282] output for batch_norm_1: c = 32, h = 41, w = 54, size = 70848
[INFO 2018-03-15 16:47:46,746 layers.py:7454] output for scale_sub_region_1: c = 32, h = 41, w = 54, size = 70848
I0315 16:47:46.766090 11850 MultiGradientMachine.cpp:99] numLogicalDevices=1 numThreads=4 numDevices=4
I0315 16:47:46.877454 11850 GradientMachine.cpp:94] Initing parameters..
I0315 16:47:50.904826 11850 GradientMachine.cpp:101] Init parameters done.
...................................................................................................
Pass: 0, Batch: 100, TrainCost: 63.639388
...................................................................................................
Pass: 0, Batch: 200, TrainCost: 63.148882
...................................................................................................
Pass: 0, Batch: 300, TrainCost: 64.747351
...................................................................................................
Pass: 0, Batch: 400, TrainCost: 54.954166
...................................................................................................
Pass: 0, Batch: 500, TrainCost: 38.613670
...................................................................................................
Pass: 0, Batch: 600, TrainCost: 30.979173
...................................................................................................
Pass: 0, Batch: 700, TrainCost: 26.576287
...................................................................................................
Pass: 0, Batch: 800, TrainCost: 24.339529
...................................................................................................
Pass: 0, Batch: 900, TrainCost: 22.288584
...............................F0315 17:24:56.116675 11921 hl_cuda_device.cc:273] Check failed: cudaSuccess == cudaStat (0 vs. 2) Cuda Error: out of memory

*** Check failure stack trace: ***

    @     0x7fa000f5adad  google::LogMessage::Fail()
    @     0x7fa000f5ef6c  google::LogMessage::SendToLog()
    @     0x7fa000f5a8d3  google::LogMessage::Flush()
    @     0x7fa000f5f9be  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fa000f1bf84  hl_malloc_device()
    @     0x7fa000dd4a66  paddle::GpuAllocator::alloc()
    @     0x7fa000dc85ff  paddle::PoolAllocator::alloc()
    @     0x7fa000dc8004  paddle::GpuMemoryHandle::GpuMemoryHandle()
    @     0x7fa000da05c4  paddle::GpuMatrix::resize()
    @     0x7fa000db4839  paddle::Matrix::resizeOrCreate()
    @     0x7fa000c15cb2  paddle::Layer::resetSpecifyOutput()
    @     0x7fa000c15f64  paddle::Layer::resetOutput()
    @     0x7fa000c5e384  paddle::CudnnBatchNormLayer::forward()
    @     0x7fa000cc80fd  paddle::NeuralNetwork::forward()
    @     0x7fa000cd3334  paddle::TrainerThread::forward()
    @     0x7fa000cd4625  paddle::TrainerThread::computeThread()
    @     0x7fa04a58b870  (unknown)
    @     0x7fa055d02dc5  start_thread
    @     0x7fa05532729d  __clone
    @              (nil)  (unknown)
run_train.sh: line 35: 11850 Aborted                 CUDA_VISIBLE_DEVICES=0,1,2,3 python -u train.py --batch_size=64 --trainer_count=4 --num_passes=50 --num_proc_data=16 --num_conv_layers=2 --num_rnn_layers=3 --rnn_layer_size=1024 --num_iter_print=100 --learning_rate=5e-4 --max_duration=27.0 --min_duration=0.0 --test_off=False --use_sortagrad=True --use_gru=True --use_gpu=True --is_local=True --share_rnn_weights=False --train_manifest='data/aishell/manifest.train' --dev_manifest='data/aishell/manifest.dev' --mean_std_path='data/aishell/mean_std.npz' --vocab_path='data/aishell/vocab.txt' --output_model_dir='./checkpoints/aishell' --augment_conf_path='conf/augmentation.config' --specgram_type='linear' --shuffle_method='batch_shuffle_clipped'
Failed in training!

------------------------------------------------------- The run exited abnormally partway through... GPU memory leak... Thu Mar 15 17:26:56 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.20                 Driver Version: 375.20                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:0D:00.0     Off |                    0 |
| N/A   37C    P0    62W / 235W |   3022MiB / 11471MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 0000:0E:00.0     Off |                    0 |
| N/A   36C    P0    62W / 235W |   2967MiB / 11471MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40m          Off  | 0000:30:00.0     Off |                    0 |
| N/A   39C    P0    62W / 235W |   3032MiB / 11471MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K40m          Off  | 0000:33:00.0     Off |                    0 |
| N/A   35C    P0    62W / 235W |   2966MiB / 11471MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
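To tell a slow leak apart from a one-off spike on a long batch, a minimal monitoring sketch like the one below could be run alongside run_train.sh. This is my own addition, a standalone Python 3 script; it only assumes nvidia-smi is on PATH and uses its standard --query-gpu flags.

# Sketch only: poll nvidia-smi and log per-GPU memory while training runs,
# to see whether usage climbs steadily (leak-like) or spikes on long batches.
import subprocess
import time

def gpu_memory_mib():
    # Returns a list of (used_MiB, total_MiB) tuples, one per GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return [tuple(int(v) for v in line.split(",")) for line in out.strip().splitlines()]

if __name__ == "__main__":
    while True:
        stamp = time.strftime("%H:%M:%S")
        usage = "  ".join("GPU%d %d/%d MiB" % (i, used, total)
                          for i, (used, total) in enumerate(gpu_memory_mib()))
        print(stamp, usage)
        time.sleep(30)

If memory only jumps when the longest utterances (up to --max_duration=27.0 s) land in a batch rather than climbing steadily from batch to batch, the failure above would look more like an activation-memory spike than a true leak.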

Identifier: paddlepaddle/DeepSpeech#178