Always OOM problem, even with batch_size == 1
Created by: myrainbowandsky
```shell
python3 -u ./DeepSpeech.py \
    --train_files /home/sky-ai/xwt/DeepSpeech/data/train/train.csv \
    --dev_files /home/sky-ai/xwt/DeepSpeech/data/cv/cv.csv \
    --test_files /home/sky/sky-ai/xwt/DeepSpeech/data/test/test.csv \
    --train_batch_size 24 \
    --dev_batch_size 15 \
    --test_batch_size 20 \
    --epoch 20 \
    --display_step 1 \
    --validation_step 1 \
    --dropout_rate 0.30 \
    --default_stddev 0.046875 \
    --learning_rate 0.0001 \
    --log_level 0
```
```
2019-03-08 16:25:17.113383: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-03-08 16:25:17.213002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:17:00.0
totalMemory: 10.92GiB freeMemory: 10.77GiB
2019-03-08 16:25:17.274068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:65:00.0
totalMemory: 10.92GiB freeMemory: 10.57GiB
2019-03-08 16:25:17.274882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-03-08 16:25:17.936644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-08 16:25:17.936675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1
2019-03-08 16:25:17.936680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y
2019-03-08 16:25:17.936683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N
2019-03-08 16:25:17.937213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 10419 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
2019-03-08 16:25:17.937478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 10226 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
D Starting coordinator...
D Coordinator started.
Thread id 140459615708928 Preprocessing ['/home/sky-ai/xwt/DeepSpeech/data/train/train.csv']
Preprocessing done
Preprocessing ['/home/sky-ai/xwt/DeepSpeech/data/cv/cv.csv']
Preprocessing done
W Parameter --validation_step needs to be >0 for early stopping to work
2019-03-08 16:26:26.961329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-03-08 16:26:26.961413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-08 16:26:26.961419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1
2019-03-08 16:26:26.961424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y
2019-03-08 16:26:26.961428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N
2019-03-08 16:26:26.961956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10419 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
2019-03-08 16:26:26.962068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10226 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
2019-03-08 16:26:28.980208: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980297: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
2019-03-08 16:26:28.980344: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 7730940928 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980353: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 7730940928
2019-03-08 16:26:28.980392: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 6957846528 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980402: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 6957846528
2019-03-08 16:26:28.980433: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 6262061568 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980440: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 6262061568
2019-03-08 16:26:28.980464: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 5635855360 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980471: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 5635855360
2019-03-08 16:26:28.980494: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 5072269824 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980501: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 5072269824
2019-03-08 16:26:28.980526: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 4565042688 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980532: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 4565042688
2019-03-08 16:26:28.980551: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 4108538368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980556: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 4108538368
2019-03-08 16:26:28.980572: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 3697684480 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980577: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 3697684480
2019-03-08 16:26:28.980602: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:28.980607: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
2019-03-08 16:26:38.980783: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:38.980838: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
2019-03-08 16:26:38.980875: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-03-08 16:26:38.980886: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
2019-03-08 16:26:38.980901: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (cuda_host_bfc) ran out of memory trying to allocate 3.33GiB. Current allocation summary follows.
2019-03-08 16:26:38.980918: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256): Total Chunks: 3, Chunks in use: 3. 768B allocated for chunks. 768B in use in bin. 12B client-requested in use in bin.
2019-03-08 16:26:38.980931: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.980942: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1024): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.980954: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.980964: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.980979: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8192): Total Chunks: 12, Chunks in use: 12. 96.0KiB allocated for chunks. 96.0KiB in use in bin. 96.0KiB client-requested in use in bin.
2019-03-08 16:26:38.980990: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981003: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (32768): Total Chunks: 3, Chunks in use: 3. 96.0KiB allocated for chunks. 96.0KiB in use in bin. 96.0KiB client-requested in use in bin.
2019-03-08 16:26:38.981014: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981025: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (131072): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981036: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981049: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (524288): Total Chunks: 1, Chunks in use: 0. 831.2KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981061: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1048576): Total Chunks: 3, Chunks in use: 3. 5.00MiB allocated for chunks. 5.00MiB in use in bin. 5.00MiB client-requested in use in bin.
2019-03-08 16:26:38.981075: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2097152): Total Chunks: 3, Chunks in use: 3. 11.58MiB allocated for chunks. 11.58MiB in use in bin. 11.58MiB client-requested in use in bin.
2019-03-08 16:26:38.981086: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4194304): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981097: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8388608): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981110: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16777216): Total Chunks: 9, Chunks in use: 9. 144.00MiB allocated for chunks. 144.00MiB in use in bin. 144.00MiB client-requested in use in bin.
2019-03-08 16:26:38.981122: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981132: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-03-08 16:26:38.981146: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (134217728): Total Chunks: 4, Chunks in use: 3. 524.24MiB allocated for chunks. 384.00MiB in use in bin. 384.00MiB client-requested in use in bin.
2019-03-08 16:26:38.981158: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (268435456): Total Chunks: 3, Chunks in use: 2. 7.33GiB allocated for chunks. 6.66GiB in use in bin. 6.66GiB client-requested in use in bin.
2019-03-08 16:26:38.981171: I tensorflow/core/common_runtime/bfc_allocator.cc:613] Bin for 3.33GiB was 256.00MiB, Chunk State:
2019-03-08 16:26:38.981186: I tensorflow/core/common_runtime/bfc_allocator.cc:619] Size: 684.82MiB | Requested Size: 0B | in_use: 0, prev: Size: 3.33GiB | Requested Size: 3.33GiB | in_use: 1
2019-03-08 16:26:38.981198: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fb956000000 of size 3576881152
2019-03-08 16:26:38.981207: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free at 0x7fba2b32e000 of size 718086144
2019-03-08 16:26:38.981215: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fba74000000 of size 3576881152
2019-03-08 16:26:38.981224: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb4932e000 of size 134217728
2019-03-08 16:26:38.981232: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb5132e000 of size 134217728
2019-03-08 16:26:38.981240: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb5932e000 of size 134217728
2019-03-08 16:26:38.981248: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb6132e000 of size 1746688
2019-03-08 16:26:38.981256: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb614d8700 of size 1746688
2019-03-08 16:26:38.981264: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb61682e00 of size 1746688
2019-03-08 16:26:38.981273: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb6182d500 of size 4046848
2019-03-08 16:26:38.981281: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb61c09500 of size 4046848
2019-03-08 16:26:38.981289: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb61fe5500 of size 4046848
2019-03-08 16:26:38.981297: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb623c1500 of size 16777216
2019-03-08 16:26:38.981305: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb633c1500 of size 16777216
2019-03-08 16:26:38.981313: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb643c1500 of size 16777216
2019-03-08 16:26:38.981321: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb653c1500 of size 16777216
2019-03-08 16:26:38.981329: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb663c1500 of size 16777216
2019-03-08 16:26:38.981337: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb673c1500 of size 16777216
2019-03-08 16:26:38.981345: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb683c1500 of size 16777216
2019-03-08 16:26:38.981353: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb693c1500 of size 16777216
2019-03-08 16:26:38.981361: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbb6a3c1500 of size 16777216
2019-03-08 16:26:38.981369: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free at 0x7fbb6b3c1500 of size 147057408
2019-03-08 16:26:38.981378: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14800000 of size 8192
2019-03-08 16:26:38.981386: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14802000 of size 256
2019-03-08 16:26:38.981394: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14802100 of size 8192
2019-03-08 16:26:38.981402: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14804100 of size 8192
2019-03-08 16:26:38.981410: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14806100 of size 8192
2019-03-08 16:26:38.981418: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14808100 of size 8192
2019-03-08 16:26:38.981426: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf1480a100 of size 8192
2019-03-08 16:26:38.981434: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf1480c100 of size 8192
2019-03-08 16:26:38.981442: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf1480e100 of size 8192
2019-03-08 16:26:38.981450: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14810100 of size 8192
2019-03-08 16:26:38.981458: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14812100 of size 8192
2019-03-08 16:26:38.981466: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14814100 of size 8192
2019-03-08 16:26:38.981474: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14816100 of size 8192
2019-03-08 16:26:38.981485: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14818100 of size 256
2019-03-08 16:26:38.981493: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14818200 of size 256
2019-03-08 16:26:38.981501: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14818300 of size 32768
2019-03-08 16:26:38.981509: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14820300 of size 32768
2019-03-08 16:26:38.981517: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x7fbf14828300 of size 32768
2019-03-08 16:26:38.981525: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Free at 0x7fbf14830300 of size 851200
2019-03-08 16:26:38.981533: I tensorflow/core/common_runtime/bfc_allocator.cc:638] Summary of in-use Chunks by size:
2019-03-08 16:26:38.981543: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 256 totalling 768B
2019-03-08 16:26:38.981553: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 12 Chunks of size 8192 totalling 96.0KiB
2019-03-08 16:26:38.981562: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 32768 totalling 96.0KiB
2019-03-08 16:26:38.981571: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 1746688 totalling 5.00MiB
2019-03-08 16:26:38.981581: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 4046848 totalling 11.58MiB
2019-03-08 16:26:38.981590: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 9 Chunks of size 16777216 totalling 144.00MiB
2019-03-08 16:26:38.981599: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 134217728 totalling 384.00MiB
2019-03-08 16:26:38.981608: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 2 Chunks of size 3576881152 totalling 6.66GiB
2019-03-08 16:26:38.981617: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 7.19GiB
2019-03-08 16:26:38.981629: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats: Limit: 68719476736 InUse: 7724988416 MaxInUse: 7724988416 NumAllocs: 38 MaxAllocSize: 3576881152
```
```
2019-03-08 16:26:38.981646: W tensorflow/core/common_runtime/bfc_allocator.cc:271] _______*********
2019-03-08 16:26:38.982719: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Resource exhausted: OOM when allocating tensor with shape[2048,436631] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
Traceback (most recent call last):
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2048,436631] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
	 [[{{node save_1/RestoreV2_1}} = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
	 [[{{node save_1/RestoreV2_1/_43}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_48_save_1/RestoreV2_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./DeepSpeech.py", line 964, in <module>
    tf.app.run(main)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./DeepSpeech.py", line 916, in main
    train()
  File "./DeepSpeech.py", line 549, in train
    config=Config.session_config) as session:
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
    init_fn=self._scaffold.init_fn)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/session_manager.py", line 288, in prepare_session
    config=config)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/session_manager.py", line 218, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1546, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2048,436631] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
	 [[node save_1/RestoreV2_1 (defined at ./DeepSpeech.py:549) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
	 [[{{node save_1/RestoreV2_1/_43}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_48_save_1/RestoreV2_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'save_1/RestoreV2_1', defined at:
  File "./DeepSpeech.py", line 964, in <module>
    tf.app.run(main)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./DeepSpeech.py", line 916, in main
    train()
  File "./DeepSpeech.py", line 549, in train
    config=Config.session_config) as session:
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
    self._scaffold.finalize()
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 213, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 886, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1102, in __init__
    self.build()
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 789, in _build_internal
    restore_sequentially, reshape)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 459, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 406, in _AddRestoreOps
    restore_sequentially)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 862, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/sky-ai/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2048,436631] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
	 [[node save_1/RestoreV2_1 (defined at ./DeepSpeech.py:549) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
	 [[{{node save_1/RestoreV2_1/_43}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_48_save_1/RestoreV2_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
```
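A quick sanity check on the tensor that fails to restore: a float32 tensor of shape [2048, 436631] is exactly the 3,576,881,152-byte chunk size that appears in the allocator dump above (about 3.33 GiB), and two such chunks are already in use when the third allocation fails. The 436631 dimension looks like the width of an output layer, which in DeepSpeech tracks the alphabet size; whether that value is actually wrong for this checkpoint is an inference from the log, not something confirmed here.

```python
# Size of the tensor named in the OOM message: shape [2048, 436631], dtype float32.
rows, cols = 2048, 436631
bytes_needed = rows * cols * 4  # 4 bytes per float32 element

# This matches the 3576881152-byte chunks in the allocator dump above.
print(bytes_needed)          # 3576881152
print(bytes_needed / 2**30)  # ~3.33 GiB, the allocation that fails
```

So even a single copy of this variable needs ~3.33 GiB of pinned host memory during checkpoint restore, independent of batch size.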
No matter how much I reduced the size of the train and CV data, it always reported this OOM problem. I've tried batch_size 1, but the result is the same. My machine has 16 GB of RAM, 2× GTX 1080 Ti, and an i7 CPU. No other programs were running.