PaddlePaddle / models
Opened Jun 27, 2019 by saxon_zh (@saxon_zh, Guest)

yolov3 training with multi-process reader gets stuck during training

Created by: Haijunlv

I am training yolov3 with the multi-process train reader. Training starts successfully but gets stuck at iter 15433, after almost 2 epochs. I made few changes other than modifying the reader to read XML annotations. My model config is:

gpu_num=2, batch_size=8, pretrain=init_weight/yolov3, use_multiprocess=True,
class_num=7, snapshot_iter=1000, max_iter=500000, no_mixup_iter=40000, input_size=608, learning_rate=0.001, num_worker=8

My training set contains 7088 images in 7 classes. The code is yolov3 from the develop branch. The Paddle version is the released paddle:1.4.1-gpu-cuda9.0-cudnn7 Docker image, and the environment is 2 V100 GPUs. Something seems to be wrong with the multi-process reader. Please help me find the cause.
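One way to tell a stuck reader from a slow one is to read from the worker queue with a timeout and check which workers are still alive when the timeout fires. The sketch below is not the reader used by the yolov3 code. The function name `read_with_watchdog`, the `None` end-of-data marker, and the 5-second default are assumptions made for illustration. It is written for Python 3; on Python 2 the `queue` module is named `Queue`.

```python
import queue


def read_with_watchdog(q, workers, timeout=5.0):
    """Yield samples from q, raising instead of blocking forever.

    q       -- a queue with get(timeout=...), e.g. multiprocessing.Queue
    workers -- objects with .name and .is_alive(), e.g. multiprocessing.Process
    Each worker is assumed (hypothetically) to put None once when it is done.
    """
    finished = 0
    while finished < len(workers):
        try:
            sample = q.get(timeout=timeout)
        except queue.Empty:
            # Nothing arrived in time: report which workers have died.
            dead = [w.name for w in workers if not w.is_alive()]
            raise RuntimeError(
                "no data for %.1fs; dead workers: %s" % (timeout, dead)
            )
        if sample is None:
            # One worker signalled that it has no more data.
            finished += 1
            continue
        yield sample
```

Wrapping the training loop's data source this way turns a silent hang into an error that names the dead workers, which narrows the search to the reader side or the training side.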

The log up to the point where training gets stuck is:

W0627 04:02:17.841068   127 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.0, Runtime API Version: 9.0
W0627 04:02:17.844712   127 device_context.cc:269] device: 0, cuDNN Version: 7.0.
W0627 04:02:17.844794   127 device_context.cc:293] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in your machine is 7.1, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
W0627 04:02:18.923068   127 graph.h:204] WARN: After a series of passes, the current graph can be quite different from OriginProgram. So, please avoid using the `OriginProgram()` method!
2019-06-27 04:02:18,925-WARNING: 
     You can try our memory optimize feature to save your memory usage:
         # create a build_strategy variable to set memory optimize option
         build_strategy = compiler.BuildStrategy()
         build_strategy.enable_inplace = True
         build_strategy.memory_optimize = True
         
         # pass the build_strategy to with_data_parallel API
         compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
             loss_name=loss.name, build_strategy=build_strategy)
      
     !!! Memory optimize is our experimental feature !!!
         some variables may be removed/reused internal to save memory usage, 
         in order to fetch the right value of the fetch_list, please set the 
         persistable property to true for each variable in fetch_list

         # Sample
         conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None) 
         # if you need to fetch conv1, then:
         conv1.persistable = True

                 
-----------  Configuration Arguments -----------
batch_size: 8
class_num: 7
data_dir: dataset/coco
dataset: coco2017
debug: False
draw_thresh: 0.5
enable_ce: False
image_name: None
image_path: image
input_size: 608
label_smooth: True
learning_rate: 0.001
max_iter: 500000
model_save_dir: work_dirs/uav_img608_lr0.001/
nms_posk: 100
nms_thresh: 0.45
nms_topk: 400
no_mixup_iter: 40000
pretrain: init_weight/yolov3_2
random_shape: True
snapshot_iter: 1000
start_iter: 0
syncbn: True
use_gpu: True
use_multiprocess: 1
valid_thresh: 0.005
weights: weights/yolov3
------------------------------------------------
cfg.class_num:7
out:(-1L, 36L, 19L, 19L)
out:(-1L, 36L, 38L, 38L)
out:(-1L, 36L, 76L, 76L)
Found 2 CUDA devices.
total_iter:1000000;mixup_iter:920000, img nums:7088
Load in 7 categories.
categories:[{u'name': 'sedan', u'id': 0}, {u'name': 'truck', u'id': 1}, {u'name': 'bus', u'id': 2}, {u'name': 'motor', u'id': 3}, {u'name': 'tricycle', u'id': 4}, {u'name': 'person', u'id': 5}, {u'name': 'bicycle', u'id': 6}]
I0627 04:02:19.906926   127 build_strategy.cc:282] set enable_sequential_execution:1
I0627 04:02:24.266566   127 build_strategy.cc:285] SeqOnlyAllReduceOps:0, num_trainers:1
Iter 0, lr 0.000000, loss 8507.810547, time 0.01503
Iter 1, lr 0.000000, loss 9591.240234, time 46.43099

.....

Iter 15425, lr 0.001000, loss 155.478741, time 0.92534
Iter 15426, lr 0.001000, loss 155.480361, time 0.45774
Iter 15427, lr 0.001000, loss 155.479925, time 1.11275
Iter 15428, lr 0.001000, loss 155.480860, time 0.69338
Iter 15429, lr 0.001000, loss 155.484091, time 0.77737
Iter 15430, lr 0.001000, loss 155.484229, time 0.59263
Iter 15431, lr 0.001000, loss 155.486114, time 1.31359
Iter 15432, lr 0.001000, loss 155.484561, time 0.86391
Iter 15433, lr 0.001000, loss 155.482944, time 0.96678
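When the iteration log stops like this, a stack dump of the stuck process shows whether it is waiting on the reader or inside the executor. A minimal sketch using Python's `faulthandler` module follows. The helper names `arm_hang_dump` and `disarm_hang_dump` are hypothetical, and the sketch assumes Python 3.3 or later, whereas the log above looks like Python 2 output, where this module is not in the standard library.

```python
import faulthandler
import sys


def arm_hang_dump(timeout_s, stream=sys.stderr):
    """Arm (or re-arm) a watchdog timer.

    If the timer is not re-armed or cancelled within timeout_s seconds,
    the tracebacks of all threads are written to stream. The stream must
    be a real file with a file descriptor.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=stream)


def disarm_hang_dump():
    """Cancel the pending watchdog timer, if any."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_dump` at the top of every training iteration, with a timeout well above the normal iteration time, leaves the timer silent during healthy training and prints the blocking call stack once an iteration stalls.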
Reference: paddlepaddle/models#2580