Training the DBNET model exits with an error; the log is below, please take a look
Created by: wolfryu
Environment: Ubuntu 18.04, 8 GB RAM, GTX1060 8G. Running python3 tools/train.py -c configs/det/det_mv3_db.yml exits with an error. The log is as follows:
2020-06-12 10:49:21,978-INFO: {'Global': {'algorithm': 'DB', 'use_gpu': True, 'epoch_num': 1200, 'log_smooth_window': 20, 'print_batch_step': 2, 'save_model_dir': './output/det_db/', 'save_epoch_step': 200, 'eval_batch_step': 5000, 'train_batch_size_per_card': 16, 'test_batch_size_per_card': 16, 'image_shape': [3, 640, 640], 'reader_yml': './configs/det/det_db_icdar15_reader.yml', 'pretrain_weights': './pretrain_models/MobileNetV3_large_x0_5_pretrained/', 'checkpoints': None, 'save_res_path': './output/det_db/predicts_db.txt', 'save_inference_dir': None}, 'Architecture': {'function': 'ppocr.modeling.architectures.det_model,DetModel'}, 'Backbone': {'function': 'ppocr.modeling.backbones.det_mobilenet_v3,MobileNetV3', 'scale': 0.5, 'model_name': 'large'}, 'Head': {'function': 'ppocr.modeling.heads.det_db_head,DBHead', 'model_name': 'large', 'k': 50, 'inner_channels': 96, 'out_channels': 2}, 'Loss': {'function': 'ppocr.modeling.losses.det_db_loss,DBLoss', 'balance_loss': True, 'main_loss_type': 'DiceLoss', 'alpha': 5, 'beta': 10, 'ohem_ratio': 3}, 'Optimizer': {'function': 'ppocr.optimizer,AdamDecay', 'base_lr': 0.001, 'beta1': 0.9, 'beta2': 0.999}, 'PostProcess': {'function': 'ppocr.postprocess.db_postprocess,DBPostProcess', 'thresh': 0.3, 'box_thresh': 0.7, 'max_candidates': 1000, 'unclip_ratio': 2.0}, 'TrainReader': {'reader_function': 'ppocr.data.det.dataset_traversal,TrainReader', 'process_function': 'ppocr.data.det.db_process,DBProcessTrain', 'num_workers': 8, 'img_set_dir': './train_data/icdar2015/text_localization/', 'label_file_path': './train_data/icdar2015/text_localization/train_icdar2015_label.txt'}, 'EvalReader': {'reader_function': 'ppocr.data.det.dataset_traversal,EvalTestReader', 'process_function': 'ppocr.data.det.db_process,DBProcessTest', 'img_set_dir': './train_data/icdar2015/text_localization/', 'label_file_path': './train_data/icdar2015/text_localization/test_icdar2015_label.txt', 'test_image_shape': [736, 1280]}, 'TestReader': 
{'reader_function': 'ppocr.data.det.dataset_traversal,EvalTestReader', 'process_function': 'ppocr.data.det.db_process,DBProcessTest', 'infer_img': None, 'img_set_dir': './train_data/icdar2015/text_localization/', 'label_file_path': './train_data/icdar2015/text_localization/test_icdar2015_label.txt', 'test_image_shape': [736, 1280], 'do_eval': True}}
3 640 640
3 640 640
2020-06-12 10:49:24,705-INFO: places would be ommited when DataLoader is not iterable
W0612 10:49:26.896463 3987 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.2, Runtime API Version: 10.0
W0612 10:49:27.025501 3987 device_context.cc:245] device: 0, cuDNN Version: 7.6.
2020-06-12 10:49:28,773-INFO: Loading parameters from ./pretrain_models/MobileNetV3_large_x0_5_pretrained/...
2020-06-12 10:49:28,774-WARNING: ./pretrain_models/MobileNetV3_large_x0_5_pretrained/.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
2020-06-12 10:49:28,774-WARNING: ./pretrain_models/MobileNetV3_large_x0_5_pretrained/.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
2020-06-12 10:49:29,496-INFO: Finish initing model from ./pretrain_models/MobileNetV3_large_x0_5_pretrained/
I0612 10:49:29.525483 3987 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0612 10:49:29.560851 3987 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0612 10:49:29.638608 3987 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0612 10:49:29.662742 3987 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
Killed
After starting the run, memory usage is at 100%, CPU usage is at 100%, and the GPU is also at 100%. After the process exits with the error, memory usage stays at 100% and the GPU remains occupied.
gpu:
| 36% 49C P2 29W / 120W | 5588MiB / 6077MiB | 1% Default |
Memory:
4038 5.5 17.1 13284624 1391080 pts/0 Sl 10:49 0:14 python3 tools/train.py -c configs/det/det_mv3_db.yml
4039 5.5 16.5 13284624 1342820 pts/0 Sl 10:49 0:14 python3 tools/train.py -c configs/det/det_mv3_db.yml
4040 5.4 16.9 13352592 1375164 pts/0 Sl 10:49 0:14 python3 tools/train.py -c configs/det/det_mv3_db.yml
4043 5.4 16.8 13284624 1364808 pts/0 Sl 10:49 0:14 python3 tools/train.py -c configs/det/det_mv3_db.yml
4055 5.4 19.1 13772568 1553304 pts/0 Sl 10:49 0:14 python3 tools/train.py -c configs/det/det_mv3_db.yml
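To put a number on the memory figures, the five process rows listed above can be totalled with a short script. This is only a rough sketch: it assumes the columns are in the usual ps order (PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND) with RSS reported in KiB, which the listing itself does not state. The name `summarize` is a helper made up for this illustration.

```python
# Rows copied from the process listing in this report.
# Assumed column order (not stated in the report):
# PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND, with RSS in KiB.
PS_ROWS = """\
4038 5.5 17.1 13284624 1391080 pts/0 Sl 10:49 0:14 python3 tools/train.py -c configs/det/det_mv3_db.yml
4039 5.5 16.5 13284624 1342820 pts/0 Sl 10:49 0:14 python3 tools/train.py -c configs/det/det_mv3_db.yml
4040 5.4 16.9 13352592 1375164 pts/0 Sl 10:49 0:14 python3 tools/train.py -c configs/det/det_mv3_db.yml
4043 5.4 16.8 13284624 1364808 pts/0 Sl 10:49 0:14 python3 tools/train.py -c configs/det/det_mv3_db.yml
4055 5.4 19.1 13772568 1553304 pts/0 Sl 10:49 0:14 python3 tools/train.py -c configs/det/det_mv3_db.yml
"""


def summarize(ps_text):
    """Return (total RSS in KiB, total %MEM) over all rows of ps-style text."""
    total_rss_kib = 0
    total_mem_pct = 0.0
    for line in ps_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated rows
        total_mem_pct += float(fields[2])  # %MEM column
        total_rss_kib += int(fields[4])    # RSS column
    return total_rss_kib, total_mem_pct


if __name__ == "__main__":
    rss_kib, mem_pct = summarize(PS_ROWS)
    print(f"total RSS:  {rss_kib} KiB ({rss_kib / 1024**2:.2f} GiB)")
    print(f"total %MEM: {mem_pct:.1f}")
```

Under those assumptions the five listed processes together hold roughly 6.7 GiB resident, about 86% of RAM. That would be consistent with the machine running out of memory before the bare "Killed" message, although the log by itself does not confirm the cause.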
Please help me figure out what went wrong. Many thanks!