YOLOv3 training with multi-process reader gets stuck during training
Created by: Haijunlv
I use YOLOv3 with a multi-process train reader. Training runs fine at first, but then gets stuck at iteration 15433, after almost 2 epochs. I did not make many changes except modifying the reader to read XML annotations (a rough sketch of that kind of change is shown after the config). My model config is:
gpu_num=2, batch_size=8, pretrain=init_weight/yolov3, use_multiprocess=True
class_num=7, snapshot_iter=1000, max_iter=500000, no_mixup_iter=40000, input_size=608, learning_rate=0.001, num_worker=8
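For reference, this is roughly the kind of change I mean: parsing a Pascal-VOC style XML annotation inside the reader. The tag names (object, bndbox, xmin/ymin/xmax/ymax) and the name_to_id mapping are assumptions about the annotation format, not my exact code.

# Hedged sketch: VOC-style XML annotation parsing (field names are assumptions)
import xml.etree.ElementTree as ET

def parse_xml_annotation(xml_path, name_to_id):
    tree = ET.parse(xml_path)
    root = tree.getroot()
    boxes, labels = [], []
    for obj in root.findall('object'):
        name = obj.find('name').text
        if name not in name_to_id:
            continue  # skip categories outside the 7 configured classes
        bbox = obj.find('bndbox')
        x1 = float(bbox.find('xmin').text)
        y1 = float(bbox.find('ymin').text)
        x2 = float(bbox.find('xmax').text)
        y2 = float(bbox.find('ymax').text)
        boxes.append([x1, y1, x2, y2])
        labels.append(name_to_id[name])
    return boxes, labels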
My training set includes 7088 images across 7 classes. My code uses the yolov3 model from the develop branch, and the Paddle version is the released paddle:1.4.1-gpu-cuda9.0-cudnn7 docker image, running on 2 V100 GPUs. Something seems to be wrong with the multiprocess reader; please help me figure out the problem.
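For context, below is a minimal sketch of how a multi-process reader like this is commonly wired up (not the exact models-repo code; the helper names, queue size, and per-worker sharding are assumptions, and it relies on fork-based process start). One failure mode consistent with the symptom: if a worker process is killed (for example by the OOM killer) before it puts its end-of-data sentinel on the queue, the consumer blocks forever in queue.get() and training appears to hang mid-epoch.

# Hedged sketch of a generic multi-process reader, not PaddlePaddle's implementation
import multiprocessing

def multiprocess_reader(sample_reader, num_workers=8, queue_size=64):
    # sample_reader: a no-argument callable yielding samples; in practice each
    # worker would read a different shard of the dataset.
    def _worker(queue):
        try:
            for sample in sample_reader():
                queue.put(sample)
        finally:
            queue.put(None)  # end-of-data sentinel for this worker

    def _reader():
        queue = multiprocessing.Queue(queue_size)
        workers = [multiprocessing.Process(target=_worker, args=(queue,))
                   for _ in range(num_workers)]
        for w in workers:
            w.daemon = True
            w.start()
        finished = 0
        while finished < num_workers:
            # Blocks forever if a worker was killed before sending its sentinel.
            sample = queue.get()
            if sample is None:
                finished += 1
            else:
                yield sample

    return _reader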
The log at the point where training hangs is:
W0627 04:02:17.841068 127 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.0, Runtime API Version: 9.0
W0627 04:02:17.844712 127 device_context.cc:269] device: 0, cuDNN Version: 7.0.
W0627 04:02:17.844794 127 device_context.cc:293] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in your machine is 7.1, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
W0627 04:02:18.923068 127 graph.h:204] WARN: After a series of passes, the current graph can be quite different from OriginProgram. So, please avoid using the `OriginProgram()` method!
2019-06-27 04:02:18,925-WARNING:
You can try our memory optimize feature to save your memory usage:
# create a build_strategy variable to set memory optimize option
build_strategy = compiler.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True
# pass the build_strategy to with_data_parallel API
compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
loss_name=loss.name, build_strategy=build_strategy)
!!! Memory optimize is our experimental feature !!!
some variables may be removed/reused internal to save memory usage,
in order to fetch the right value of the fetch_list, please set the
persistable property to true for each variable in fetch_list
# Sample
conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None)
# if you need to fetch conv1, then:
conv1.persistable = True
----------- Configuration Arguments -----------
batch_size: 8
class_num: 7
data_dir: dataset/coco
dataset: coco2017
debug: False
draw_thresh: 0.5
enable_ce: False
image_name: None
image_path: image
input_size: 608
label_smooth: True
learning_rate: 0.001
max_iter: 500000
model_save_dir: work_dirs/uav_img608_lr0.001/
nms_posk: 100
nms_thresh: 0.45
nms_topk: 400
no_mixup_iter: 40000
pretrain: init_weight/yolov3_2
random_shape: True
snapshot_iter: 1000
start_iter: 0
syncbn: True
use_gpu: True
use_multiprocess: 1
valid_thresh: 0.005
weights: weights/yolov3
------------------------------------------------
cfg.class_num:7
out:(-1L, 36L, 19L, 19L)
out:(-1L, 36L, 38L, 38L)
out:(-1L, 36L, 76L, 76L)
Found 2 CUDA devices.
total_iter:1000000;mixup_iter:920000, img nums:7088
Load in 7 categories.
categories:[{u'name': 'sedan', u'id': 0}, {u'name': 'truck', u'id': 1}, {u'name': 'bus', u'id': 2}, {u'name': 'motor', u'id': 3}, {u'name': 'tricycle', u'id': 4}, {u'name': 'person', u'id': 5}, {u'name': 'bicycle', u'id': 6}]
I0627 04:02:19.906926 127 build_strategy.cc:282] set enable_sequential_execution:1
I0627 04:02:24.266566 127 build_strategy.cc:285] SeqOnlyAllReduceOps:0, num_trainers:1
Iter 0, lr 0.000000, loss 8507.810547, time 0.01503
Iter 1, lr 0.000000, loss 9591.240234, time 46.43099
.....
Iter 15425, lr 0.001000, loss 155.478741, time 0.92534
Iter 15426, lr 0.001000, loss 155.480361, time 0.45774
Iter 15427, lr 0.001000, loss 155.479925, time 1.11275
Iter 15428, lr 0.001000, loss 155.480860, time 0.69338
Iter 15429, lr 0.001000, loss 155.484091, time 0.77737
Iter 15430, lr 0.001000, loss 155.484229, time 0.59263
Iter 15431, lr 0.001000, loss 155.486114, time 1.31359
Iter 15432, lr 0.001000, loss 155.484561, time 0.86391
Iter 15433, lr 0.001000, loss 155.482944, time 0.96678