Using PaddlePaddle to train DeepSpeech shows: Cannot allocate memory
Created by: songxia928
============================================================
*** Printed output:
============================================================
Pass: 25, Batch: 3100, TrainCost: 2.208693
------- Time: 7017 sec, Pass: 25, ValidationCost: 52.2345577601
Pass: 26, Batch: 100, TrainCost: 1.603295
Pass: 26, Batch: 200, TrainCost: 1.676432
Pass: 26, Batch: 300, TrainCost: 1.612143
Pass: 26, Batch: 400, TrainCost: 1.353869
Pass: 26, Batch: 500, TrainCost: 1.482893
Pass: 26, Batch: 600, TrainCost: 1.687072
Pass: 26, Batch: 700, TrainCost: 1.966297
Pass: 26, Batch: 800, TrainCost: 1.420656
Pass: 26, Batch: 900, TrainCost: 1.590458
Pass: 26, Batch: 1000, TrainCost: 2.041432
Pass: 26, Batch: 1100, TrainCost: 1.333077
Pass: 26, Batch: 1200, TrainCost: 1.839570
Pass: 26, Batch: 1300, TrainCost: 1.614656
Pass: 26, Batch: 1400, TrainCost: 1.606205
Pass: 26, Batch: 1500, TrainCost: 2.410910
Pass: 26, Batch: 1600, TrainCost: 1.942313
Pass: 26, Batch: 1700, TrainCost: 2.099164
Pass: 26, Batch: 1800, TrainCost: 1.974127
Pass: 26, Batch: 1900, TrainCost: 2.180515
Pass: 26, Batch: 2000, TrainCost: 2.120815
Pass: 26, Batch: 2100, TrainCost: 2.503351
Pass: 26, Batch: 2200, TrainCost: 2.302993
Pass: 26, Batch: 2300, TrainCost: 2.194658
Pass: 26, Batch: 2400, TrainCost: 2.364214
Pass: 26, Batch: 2500, TrainCost: 2.714264
Pass: 26, Batch: 2600, TrainCost: 2.205300
Pass: 26, Batch: 2700, TrainCost: 2.671229
Pass: 26, Batch: 2800, TrainCost: 2.073756
Pass: 26, Batch: 2900, TrainCost: 2.087103
Pass: 26, Batch: 3000, TrainCost: 2.158531
Pass: 26, Batch: 3100, TrainCost: 2.519216
------- Time: 6982 sec, Pass: 26, ValidationCost: 53.4873382121
Traceback (most recent call last):
  File "train.py", line 129, in <module>
    main()
  File "train.py", line 125, in main
    train()
  File "train.py", line 116, in train
    test_off=args.test_off)
  File "/data1/chensong/code/20180523_deepspeech/DeepSpeech/model_utils/model.py", line 155, in train
    feeding=adapted_feeding_dict)
  File "/usr/local/lib/python2.7/dist-packages/paddle/v2/trainer.py", line 162, in train
    for batch_id, data_batch in enumerate(reader()):
  File "/data1/chensong/code/20180523_deepspeech/DeepSpeech/model_utils/model.py", line 391, in adapted_reader
    for instance in data():
  File "/data1/chensong/code/20180523_deepspeech/DeepSpeech/data_utils/data.py", line 199, in batch_reader
    for instance in instance_reader():
  File "/data1/chensong/code/20180523_deepspeech/DeepSpeech/data_utils/utility.py", line 198, in xreader
    w.start()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
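From the traceback, the failure is not inside the network itself: xreader in data_utils/utility.py is starting its data-reader worker processes, and the os.fork() call is what fails. My understanding is that forking an already-large parent process can itself raise [Errno 12] once free RAM and swap are nearly exhausted (how strictly the kernel checks depends on vm.overcommit_memory). A minimal Python sketch of that situation, where the allocation size and worker count are made up for illustration and not taken from the repo:

import multiprocessing

def worker():
    pass  # stand-in for a data-reading worker

if __name__ == '__main__':
    big = bytearray(2 * 1024 ** 3)  # hypothetical ~2 GiB parent footprint
    workers = []
    try:
        for _ in range(16):  # mirrors --num_proc_data=16
            w = multiprocessing.Process(target=worker)
            w.start()        # fork happens here; can raise OSError [Errno 12] on a memory-starved host
            workers.append(w)
    except OSError as e:
        print('fork failed: %s' % e)
    finally:
        for w in workers:
            w.join()

With --num_proc_data=16, xreader starts 16 such workers, so every one of those forks has to succeed while the trainer process is already large.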
============================================================
*** my machine state:
============================================================
root@2f9aebc64ff:/data1/code/20180523_deepspeech/DeepSpeech# nvidia-smi
Sun Jun 10 15:10:15 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:05:00.0 Off | 0 |
| N/A 23C P8 27W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:06:00.0 Off | 0 |
| N/A 25C P8 27W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:09:00.0 Off | 0 |
| N/A 26C P8 27W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:0A:00.0 Off | 0 |
| N/A 27C P8 30W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:84:00.0 Off | 0 |
| N/A 32C P0 54W / 149W | 6569MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:85:00.0 Off | 0 |
| N/A 37C P0 70W / 149W | 6351MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:88:00.0 Off | 0 |
| N/A 33C P0 55W / 149W | 6830MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:89:00.0 Off | 0 |
| N/A 37C P0 69W / 149W | 6542MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@2f9aebc64ff:/data1/code/20180523_deepspeech/DeepSpeech# free -w
              total        used        free      shared     buffers       cache   available
Mem:      264057556   241080888     3589964    14154876     2435984    16950720     6184440
Swap:      16777212    16452236      324976
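Reading the free -w output (values are in kB): of the roughly 252 GiB of RAM only about 6 GiB is reported as available, and the 16 GiB of swap is almost completely used, which matches the fork failure above. To see how quickly memory disappears during a pass, a small logger could be run next to the training job (a sketch; the 60-second interval and MiB units are arbitrary choices):

import time

def meminfo():
    # /proc/meminfo reports values in kB
    info = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, rest = line.split(':', 1)
            info[key] = int(rest.split()[0])
    return info

if __name__ == '__main__':
    while True:
        m = meminfo()
        print('MemAvailable: %d MiB, SwapFree: %d MiB'
              % (m['MemAvailable'] // 1024, m['SwapFree'] // 1024))
        time.sleep(60)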
============================================================
*** the training parameters I set:
============================================================
CUDA_VISIBLE_DEVICES=4,5,6,7 \
python -u train.py \
--batch_size=32 \
--trainer_count=4 \
--num_passes=50 \
--num_proc_data=16 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=1024 \
--num_iter_print=100 \
--learning_rate=5e-4 \
--max_duration=20.0 \
--min_duration=1.0 \
--test_off=False \
--use_sortagrad=True \
--use_gru=True \
--use_gpu=True \
--is_local=True \
--share_rnn_weights=False \
--train_manifest='data_mangguo_2/aishell/manifest.train' \
--dev_manifest='data_mangguo_2/aishell/manifest.dev' \
--mean_std_path='data_mangguo_2/aishell/mean_std.npz' \
--vocab_path='data_mangguo_2/aishell/vocab.txt' \
--output_model_dir='./checkpoints/mangguo_2' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
Hi, I am using PaddlePaddle to train a Chinese model based on DeepSpeech. I can finish training a model with the aishell data alone, but when I add more audio data for training, the error above appears. From the machine state, I guess the host (CPU) memory is not enough, but I don't know how to solve it. Could anyone tell me how? Thanks!
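To check the guess that the trainer process itself is what fills the memory, one thing I can do is print its peak resident size from inside train.py shortly before the readers are started (a tiny sketch using the standard library; on Linux ru_maxrss is reported in kB):

import resource

usage = resource.getrusage(resource.RUSAGE_SELF)
print('peak RSS so far: %d MiB' % (usage.ru_maxrss // 1024))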