PaddlePaddle / Paddle · Issue #15201
Opened Jan 08, 2019 by saxon_zh (Guest)

Multi-GPU training randomly hangs mid-run; the same configuration trains normally on a single GPU

Created by: JamesLearning

  • Version / environment information:
    • 1) PaddlePaddle version: 1.2
    • 3) GPU: TITAN X (Pascal), CUDA 8.0, cuDNN 6.0 on the host (cuDNN 7.0 inside Docker)
    • 4) System environment: CentOS 7.3.1611, using the official Docker image hub.baidubce.com/paddlepaddle/paddle:latest-gpu-cuda8.0-cudnn7

  • Training information:
    • 1) Single machine, multiple GPUs
    • 2) Model: PyramidBox (https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/face_detection)

  • Reproduction: train PyramidBox inside Docker, using the default wider_face dataset and the pretrained model, on 3 GPUs with batch_size=6; all other parameters are kept at their defaults.
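(Not from the report; context only.) With parallel=True, the Fluid 1.x multi-GPU path normally goes through fluid.ParallelExecutor, which matches the ParallelExecutor::Run frame in the stack trace further down. A minimal sketch of that generic setup, with a toy network standing in for PyramidBox and every non-API name being a placeholder:

    # Sketch only: generic Fluid 1.x data-parallel setup, not PyramidBox's train.py.
    import os
    import paddle.fluid as fluid

    # One common way to restrict the run to 3 cards, as in the report.
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2")

    # Toy network standing in for PyramidBox; only the multi-GPU wiring matters here.
    x = fluid.layers.data(name="x", shape=[3, 640, 640], dtype="float32")
    y = fluid.layers.data(name="y", shape=[1], dtype="float32")
    pred = fluid.layers.fc(input=x, size=1)
    loss = fluid.layers.mean(fluid.layers.square_error_cost(input=pred, label=y))
    fluid.optimizer.SGD(learning_rate=0.001).minimize(loss)

    exe = fluid.Executor(fluid.CUDAPlace(0))
    exe.run(fluid.default_startup_program())

    # ParallelExecutor replicates the program across all visible GPUs;
    # its Run() is where the main process sits in the stack trace below.
    train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=loss.name)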

  • Problem description: During training, the run randomly stops making progress: no output is printed for a long time, no model checkpoints are saved, and no error is reported, while CPU and GPU utilization stay above 0. This usually happens somewhere between a few dozen and a few hundred iterations, and sometimes after a full epoch. At that point the process cannot be exited with Ctrl+C and has to be terminated with kill; its child processes do not exit when the parent is killed and have to be killed one by one. The same failure also occurs with batch_size=3, but with a single GPU and everything else unchanged, training works normally. By first sending kill -2 (SIGINT) to the child processes and then killing the parent, I obtained the following traceback:

nohup: ignoring input
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0107 17:57:42.333487  4954 init.cc:53] Init commandline: dummy train.py --tryfromenv=check_nan_inf,benchmark,eager_delete_scope,use_mkldnn,use_ngraph,initial_cpu_memory_in_mb,init_allocated_mem,free_idle_memory,paddle_num_threads,dist_threadpool_size,eager_delete_tensor_gb,fast_eager_deletion_mode,allocator_strategy,reader_queue_speed_test_mode,print_sub_graph_dir,pe_profile_fname,warpctc_dir,use_pinned_memory,cpu_deterministic,rpc_deadline,rpc_server_profile_path,enable_rpc_profiler,rpc_send_thread_num,rpc_get_thread_num,rpc_prefetch_thread_num,rpc_disable_reuse_port,fraction_of_gpu_memory_to_use,cudnn_deterministic,enable_cublas_tensor_op_math,conv_workspace_size_limit,cudnn_exhaustive_search,memory_optimize_debug,selected_gpus 
-----------  Configuration Arguments -----------
batch_size: 3
data_dir: data
epoc_num: 160
learning_rate: 0.001
mean_BGR: 104., 117., 123.
model_save_dir: output
parallel: True
pretrained_model: ./vgg_ilsvrc_16_fc_reduced/
resize_h: 640
resize_w: 640
use_gpu: True
use_pyramidbox: True
with_mem_opt: True
------------------------------------------------
W0107 17:58:03.730578  4954 device_context.cc:257] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 8.0, Runtime API Version: 8.0
W0107 17:58:03.730641  4954 device_context.cc:265] device: 0, cuDNN Version: 7.0.
I0107 17:58:06.165645  4954 build_strategy.cc:75] CollectiveContext:endpoints_:trainer_id_:0
Pass 0, batch 0, face loss 16.589201, head loss 10.628244, time 0.00017
Pass 0, batch 10, face loss 8.830922, head loss 9.297692, time 0.55010
Pass 0, batch 20, face loss 7.086583, head loss 6.267890, time 0.83622
Pass 0, batch 30, face loss 7.730341, head loss 7.461819, time 0.56961
Pass 0, batch 40, face loss 7.005524, head loss 7.267424, time 0.55874
Pass 0, batch 50, face loss 6.310503, head loss 6.517485, time 0.52383
Pass 0, batch 60, face loss 6.826693, head loss 6.402285, time 0.53796
Pass 0, batch 70, face loss 6.992862, head loss 5.704878, time 0.56570
Pass 0, batch 80, face loss 7.183727, head loss 5.596687, time 0.59800
Process Process-6:
Process Process-8:
Process Process-4:
Process Process-5:
Process Process-2:
Process Process-9:
Traceback (most recent call last):
Process Process-3:
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Traceback (most recent call last):
Process Process-7:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/paddle/data_util.py", line 81, in data_generator_task
    self.run()
    task()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self.run()
  File "/paddle/data_util.py", line 65, in task
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self.queue.put((generator_output))
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self.run()
  File "<string>", line 2, in put
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
  File "/paddle/data_util.py", line 81, in data_generator_task
    self._target(*self._args, **self._kwargs)
    self.run()
  File "/paddle/data_util.py", line 81, in data_generator_task
    self._target(*self._args, **self._kwargs)
    self.run()
    self._target(*self._args, **self._kwargs)
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
  File "/paddle/data_util.py", line 81, in data_generator_task
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
  File "/paddle/data_util.py", line 81, in data_generator_task
  File "/paddle/data_util.py", line 81, in data_generator_task
    self._target(*self._args, **self._kwargs)
  File "/paddle/data_util.py", line 81, in data_generator_task
    self._target(*self._args, **self._kwargs)
    task()
  File "/paddle/data_util.py", line 81, in data_generator_task
    task()
  File "/paddle/data_util.py", line 65, in task
  File "/paddle/data_util.py", line 65, in task
    self.queue.put((generator_output))
    task()
    self.queue.put((generator_output))
  File "<string>", line 2, in put
  File "<string>", line 2, in put
  File "/paddle/data_util.py", line 67, in task
    task()
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
    task()
    task()
  File "/paddle/data_util.py", line 67, in task
    time.sleep(self.wait_time)
    task()
  File "/paddle/data_util.py", line 65, in task
    time.sleep(self.wait_time)
  File "/paddle/data_util.py", line 67, in task
  File "/paddle/data_util.py", line 67, in task
    self.queue.put((generator_output))
    time.sleep(self.wait_time)
KeyboardInterrupt
KeyboardInterrupt
  File "<string>", line 2, in put
    time.sleep(self.wait_time)
KeyboardInterrupt
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
KeyboardInterrupt
    kind, result = conn.recv()
KeyboardInterrupt
    kind, result = conn.recv()
KeyboardInterrupt
    kind, result = conn.recv()
KeyboardInterrupt
    kind, result = conn.recv()
KeyboardInterrupt
*** Aborted at 1546884091 (unix time) try "date -d @1546884091" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGTERM (@0x1) received by PID 4954 (TID 0x7f181b682700) from PID 1; stack trace: ***
    @     0x7f181b260390 (unknown)
    @     0x7f181b25c709 __pthread_cond_timedwait
    @     0x7f17a2807019 paddle::framework::details::ThreadedSSAGraphExecutor::Run()
    @     0x7f17a280b9e3 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run()
    @     0x7f17a1128662 paddle::framework::ParallelExecutor::Run()
    @     0x7f17a10200be _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL18pybind11_init_coreERNS_6moduleEEUlRNS2_9framework16ParallelExecutorERKSt6vectorISsSaISsEERKSsE129_vIS8_SD_SF_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESX_
    @     0x7f17a10555de pybind11::cpp_function::dispatcher()
    @           0x4c5326 PyEval_EvalFrameEx
    @           0x4b9b66 PyEval_EvalCodeEx
    @           0x4c17c6 PyEval_EvalFrameEx
    @           0x4b9b66 PyEval_EvalCodeEx
    @           0x4c1f56 PyEval_EvalFrameEx
    @           0x4b9b66 PyEval_EvalCodeEx
    @           0x4eb69f (unknown)
    @           0x4e58f2 PyRun_FileExFlags
    @           0x4e41a6 PyRun_SimpleFileExFlags
    @           0x4938ce Py_Main
    @     0x7f181aea5830 __libc_start_main
    @           0x493299 _start
    @                0x0 (unknown)
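(Observation on the dump above, plus a hedged sketch.) Every child process is inside data_util.py's data_generator_task/task loop when SIGINT arrives: either in time.sleep(self.wait_time) or in self.queue.put(...), which goes through a multiprocessing manager proxy and therefore ends in managers.py _callmethod / conn.recv(). The main process (PID 4954) is blocked in ParallelExecutor / ThreadedSSAGraphExecutor::Run. The dump only shows where each process was at that instant; it does not by itself say whether the reader or the executor stalled first. Below is a minimal sketch of that reader pattern, assuming data_util.py follows the usual GeneratorEnqueuer-style loop; everything except the frame names taken from the traceback is illustrative:

    # Sketch only; not the original /paddle/data_util.py, just the same shape.
    import multiprocessing
    import time

    def data_generator_task(queue, make_generator, wait_time=0.05, max_queue_size=16):
        # Each reader process pulls samples from a generator and pushes them into a
        # Manager-backed queue shared with the trainer. queue.put() goes through the
        # manager proxy, which is why the dumped frames end in _callmethod -> conn.recv().
        for generator_output in make_generator():
            while True:
                if queue.qsize() < max_queue_size:
                    queue.put(generator_output)
                    break
                # Queue full: the consumer is not draining it, so the worker
                # keeps sleeping here (the time.sleep(self.wait_time) frames).
                time.sleep(wait_time)

    def counting_generator():
        i = 0
        while True:
            yield i
            i += 1

    if __name__ == "__main__":
        manager = multiprocessing.Manager()
        shared_queue = manager.Queue()
        workers = [
            multiprocessing.Process(target=data_generator_task,
                                    args=(shared_queue, counting_generator))
            for _ in range(8)
        ]
        for w in workers:
            # Non-daemon children keep running after the parent is killed,
            # matching the report that they had to be killed one by one.
            w.start()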