Training runs out of GPU memory after roughly one epoch every time; reducing the batch size does not help (#384) · Issue · PaddlePaddle / DeepSpeech


Created by: yeyupiaoling

My training parameters are:

export FLAGS_sync_nccl_allreduce=0
CUDA_VISIBLE_DEVICES=0 \
python -u train.py \
--batch_size=8 \
--num_epoch=50 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--save_epoch=2 \
--num_samples=120000 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
--test_off=False \
--use_sortagrad=True \
--use_gru=True \
--use_gpu=True \
--is_local=True \
--share_rnn_weights=False \
--train_manifest='./dataset/manifest.train' \
--dev_manifest='./dataset/manifest.dev' \
--mean_std_path='./dataset/mean_std.npz' \
--vocab_path='./dataset/zh_vocab.txt' \
--output_model_dir='./models/checkpoints/' \
--augment_conf_path='./conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'

I reduced the batch size from the original 32 to 8, but the same error is still reported. The first 30,000-plus batches trained normally.
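One thing that may be worth checking is how utterance length is distributed across batches. With `--use_sortagrad=True`, the SortaGrad scheme described for Deep Speech 2 walks through the first epoch in order of increasing utterance length, so the longest (and most memory-hungry) batches would arrive at the end of epoch 0. The log in this report is at least consistent with that: the time per 100 batches grows from about 6 minutes to about 7 minutes. The following is a minimal sketch, not part of DeepSpeech. It assumes the manifest is a JSON-lines file whose entries carry a `duration` field in seconds, and the helper names `load_durations` and `padded_seconds_per_batch` are hypothetical.

```python
import json


def load_durations(manifest_path):
    """Read the `duration` field (seconds) from each JSON line of a manifest.

    Assumption: one JSON object per line with a numeric "duration" key.
    """
    durations = []
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if line:  # tolerate blank lines
                durations.append(float(json.loads(line)["duration"]))
    return durations


def padded_seconds_per_batch(durations, batch_size, sort_first=True):
    """Rough per-batch cost: batch length times its longest utterance.

    Every utterance in a batch is padded to the longest one, so this
    product is a crude proxy for how much audio each batch holds.
    sort_first=True mimics a duration-sorted first epoch.
    """
    ordered = sorted(durations) if sort_first else list(durations)
    costs = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        costs.append(len(batch) * max(batch))
    return costs
```

Calling `padded_seconds_per_batch(load_durations('./dataset/manifest.train'), batch_size=8)` and comparing the first and last entries gives a rough idea of how much heavier the final batches of a sorted epoch are. Running the same check on the dev manifest may also help, since the failure appears around the end of the epoch.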



Train [2019-11-06 04:35:06.863259] epoch: 0, batch: 35100, train loss: 13.707798

Train [2019-11-06 04:41:04.340795] epoch: 0, batch: 35200, train loss: 17.407038

Train [2019-11-06 04:47:09.349953] epoch: 0, batch: 35300, train loss: 12.541659

Train [2019-11-06 04:53:21.462351] epoch: 0, batch: 35400, train loss: 20.683529

Train [2019-11-06 04:59:41.506114] epoch: 0, batch: 35500, train loss: 21.854664

Train [2019-11-06 05:06:09.245898] epoch: 0, batch: 35600, train loss: 11.866036

Train [2019-11-06 05:12:48.686871] epoch: 0, batch: 35700, train loss: 17.497536

Train [2019-11-06 05:19:38.417454] epoch: 0, batch: 35800, train loss: 13.972732

..........................................

Out of memory error on GPU 0. Cannot allocate 68.156494MB memory on GPU 0, available memory is only 10.187500MB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please try one of the following suggestions:
   1) Decrease the batch size of your model.
   2) FLAGS_fraction_of_gpu_memory_to_use is 0.92 now, please set it to a higher value but less than 1.0.
      The command is `export FLAGS_fraction_of_gpu_memory_to_use=xxx`.
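The first suggestion in the message, checking whether another process is using GPU 0, can be scripted. The sketch below assumes `nvidia-smi` is installed and on the `PATH`; the function names `parse_compute_apps` and `compute_apps_on_gpu` are hypothetical helpers, not part of PaddlePaddle or DeepSpeech.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse `pid, used_memory` CSV rows into (pid, used_mib) tuples.

    used_mib is None when nvidia-smi reports a non-numeric value
    such as [N/A].
    """
    apps = []
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip blank or unexpected lines
        used_mib = int(fields[1]) if fields[1].isdigit() else None
        apps.append((int(fields[0]), used_mib))
    return apps


def compute_apps_on_gpu(gpu_index=0):
    """Ask nvidia-smi which compute processes hold memory on one GPU."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "-i", str(gpu_index),
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_compute_apps(output)
```

If `compute_apps_on_gpu(0)` lists only the training process itself, the second branch of the message applies, and the memory pressure most likely comes from the training job rather than from a neighbouring process.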
