There is no next data
Created by: TonyKutuya
Paddle 1.6.1, PaddleCloud, single-node single-GPU job, GPU P40
The log is:
2020-03-05 10:30:15,029-WARNING: consumer[30832] exit abnormally with exitcode[-9]
2020-03-05 10:30:15,029-WARNING: 1 consumers have exited abnormally!!!
2020-03-05 10:30:15,029-WARNING: consumer[30832] exit abnormally with exitcode[-9]
2020-03-05 10:30:15,030-WARNING: 1 consumers have exited abnormally!!!
2020-03-05 10:30:15,030-WARNING: Your reader has raised an exception!
Exception in thread Thread-17:
Traceback (most recent call last):
File "/home/slurm/job/tmp/job-72883/python27-gcc482/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/home/slurm/job/tmp/job-72883/python27-gcc482/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/slurm/job/tmp/job-72883/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/reader.py", line 488, in thread_main
six.reraise(*sys.exc_info())
File "/home/slurm/job/tmp/job-72883/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/reader.py", line 468, in thread_main
for tensors in self._tensor_reader():
File "/home/slurm/job/tmp/job-72883/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/reader.py", line 542, in tensor_reader_impl
for slots in paddle_reader():
File "/home/slurm/job/tmp/job-72883/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/data_feeder.py", line 454, in reader_creator
for item in reader():
File "/home/slurm/job/tmp/job-72883/env_run/ppdet/data/reader.py", line 417, in _reader
reader.reset()
File "/home/slurm/job/tmp/job-72883/env_run/ppdet/data/parallel_map.py", line 261, in reset
assert self._consumer_healthy(), "cannot start another pass of data"
AssertionError: cannot start another pass of data for some consumers exited abnormally before!!!
2020-03-05 10:30:25,885-INFO: iter: 32720, lr: 0.001000, 'loss': '42.669018', time: 0.650, eta: 1 day, 12:56:17
2020-03-05 10:30:38,755-INFO: iter: 32740, lr: 0.001000, 'loss': '40.872700', time: 0.644, eta: 1 day, 12:36:15
Traceback (most recent call last):
File "slim/prune/prune.py", line 395, in <module>
main()
File "slim/prune/prune.py", line 282, in main
outs = exe.run(compiled_train_prog, fetch_list=train_values)
File "/home/slurm/job/tmp/job-72883/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/executor.py", line 775, in run
six.reraise(*sys.exc_info())
File "/home/slurm/job/tmp/job-72883/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/executor.py", line 770, in run
use_program_cache=use_program_cache)
File "/home/slurm/job/tmp/job-72883/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/executor.py", line 829, in _run_impl
return_numpy=return_numpy)
File "/home/slurm/job/tmp/job-72883/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/executor.py", line 669, in _run_parallel
tensors = exe.run(fetch_var_names)._move_to_list()
paddle.fluid.core_avx.EOFException: There is no next data. at [/paddle/paddle/fluid/operators/reader/read_op.cc:90]
At the same time, the cluster's Slurm log shows:
[INFO]: exit 1
slurmstepd: Exceeded step memory limit at some point.
slurmstepd: Exceeded job memory limit at some point.
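A note on reading the exit code in the log: if the reader consumers are Python `multiprocessing` processes (an assumption here), a negative exit code -N means the process was terminated by signal N, so exitcode[-9] would correspond to SIGKILL. That would be consistent with the Slurm memory-limit messages. A minimal sketch for decoding such codes follows; the helper name `describe_exitcode` is hypothetical and not part of Paddle or ppdet, and it assumes Python 3 (`signal.Signals` does not exist in Python 2.7).

```python
import signal


def describe_exitcode(exitcode):
    """Turn a multiprocessing-style exit code into a readable string.

    Hypothetical helper for illustration only; not part of Paddle or ppdet.
    """
    if exitcode is None:
        # multiprocessing reports None while the process is still running.
        return "still running"
    if exitcode < 0:
        # Negative values mean "terminated by signal -exitcode".
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal %d (%s)" % (-exitcode, name)
    if exitcode == 0:
        return "exited normally"
    return "exited with status %d" % exitcode


if __name__ == "__main__":
    print(describe_exitcode(-9))
```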
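Training had already run for several epochs when the error occurred, so the data itself should be fine. Could you please check whether there is a memory leak?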
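One way to narrow this down might be to log the training process's peak resident memory alongside the iteration log and see whether it keeps climbing across epochs. The sketch below uses only the standard-library `resource` module (Unix only). The names `peak_rss_mb` and `log_memory` are hypothetical, and it measures only the calling process, so memory held by separate reader worker processes would need to be checked separately.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of the current process, in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


def log_memory(iteration, every=20):
    """Print peak memory every `every` iterations; return the message or None."""
    if iteration % every != 0:
        return None
    message = "iter: %d, peak_rss_mb: %.1f" % (iteration, peak_rss_mb())
    print(message)
    return message


if __name__ == "__main__":
    # Stand-in for the training loop; a real loop would call exe.run(...) here.
    for it in range(0, 60):
        log_memory(it)
```

If the logged value grows steadily from epoch to epoch instead of levelling off, that would point toward a leak (or an unbounded queue) rather than a problem with the data.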