GPU memory runs out after roughly one epoch on every training run; lowering the batch size does not help
Created by: yeyupiaoling
export FLAGS_sync_nccl_allreduce=0
python -u train.py \
--batch_size=8 \
--num_epoch=50 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--save_epoch=2 \
--num_samples=120000 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
--test_off=False \
--use_sortagrad=True \
--use_gru=True \
--use_gpu=True \
--is_local=True \
--share_rnn_weights=False \
--train_manifest='./dataset/manifest.train' \
--dev_manifest='./dataset/manifest.dev' \
--mean_std_path='./dataset/mean_std.npz' \
--vocab_path='./dataset/zh_vocab.txt' \
--output_model_dir='./models/checkpoints/' \
--augment_conf_path='./conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
I lowered the batch size from the original 32 to 8 and still get the same error. The first 30,000-plus batches trained normally.
Train [2019-11-06 04:35:06.863259] epoch: 0, batch: 35100, train loss: 13.707798
Train [2019-11-06 04:41:04.340795] epoch: 0, batch: 35200, train loss: 17.407038
Train [2019-11-06 04:47:09.349953] epoch: 0, batch: 35300, train loss: 12.541659
Train [2019-11-06 04:53:21.462351] epoch: 0, batch: 35400, train loss: 20.683529
Train [2019-11-06 04:59:41.506114] epoch: 0, batch: 35500, train loss: 21.854664
Train [2019-11-06 05:06:09.245898] epoch: 0, batch: 35600, train loss: 11.866036
Train [2019-11-06 05:12:48.686871] epoch: 0, batch: 35700, train loss: 17.497536
Train [2019-11-06 05:19:38.417454] epoch: 0, batch: 35800, train loss: 13.972732
Out of memory error on GPU 0. Cannot allocate 68.156494MB memory on GPU 0, available memory is only 10.187500MB.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please try one of the following suggestions:
1) Decrease the batch size of your model.
2) FLAGS_fraction_of_gpu_memory_to_use is 0.92 now, please set it to a higher value but less than 1.0.
The command is `export FLAGS_fraction_of_gpu_memory_to_use=xxx`.