PaddlePaddle / PaddleDetection
Opened Sep 09, 2020 by saxon_zh (@saxon_zh, Guest)

cudaStreamSynchronize misaligned address error while training PPYOLO

Created by: yeyupiaoling

  • Ubuntu 18.04

  • Local CUDA 10.0; Anaconda virtual environment with cudatoolkit=10.0

  • NCCL 2.4.8 for CUDA 10.0

  • Python 3.7

  • PaddlePaddle 1.8.4.post107

  • Two GPUs, as shown below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 51%   53C    P2    60W / 250W |   9859MiB / 11016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:05:00.0 Off |                  N/A |
| 54%   54C    P2    93W / 250W |   9802MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Error log:

is_profiler: 0
loss_scale: 8.0
opt: {}
output_eval: save_models/eval
profiler_path: save_models/detection.profiler
resume_checkpoint: None
use_vdl: True
vdl_log_dir: logs/scalar
------------------------------------------------
2020-09-09 17:40:55,396-INFO: If regularizer of a Parameter has been set by 'fluid.ParamAttr' or 'fluid.WeightNormParamAttr' already. The Regularization[L2Decay, regularization_coeff=0.000500] in Optimizer will not take effect, and it will only be applied to other Parameters!
2020-09-09 17:40:57,569-INFO: places would be ommited when DataLoader is not iterable
W0909 17:40:57.609563  1560 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 11.0, Runtime API Version: 10.0
W0909 17:40:57.648514  1560 device_context.cc:260] device: 0, cuDNN Version: 7.6.
2020-09-09 17:40:59,413-WARNING: /home/test/.cache/paddle/weights/ResNet50_vd_ssld_pretrained.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
/home/test/anaconda3/envs/PaddlePaddle/lib/python3.7/site-packages/paddle/fluid/io.py:1998: UserWarning: This list is not set, Because of Paramerter not found in program. There are: fc_0.b_0 fc_0.w_0
  format(" ".join(unused_para_list)))
2020-09-09 17:41:05,690-INFO: places would be ommited when DataLoader is not iterable
I0909 17:41:06.986269  1560 build_strategy.cc:361] set enable_sequential_execution:1
W0909 17:41:07.356726  1560 fuse_all_reduce_op_pass.cc:74] Find all_reduce operators: 240. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 171.
2020-09-09 17:41:13,339-INFO: iter: 0, lr: 0.000000, 'loss_xy': '1.000268', 'loss_wh': '3.644889', 'loss_obj': '4445.346191', 'loss_cls': '7.542771', 'loss_iou': '3.998751', 'loss_iou_aware': '2.708143', 'loss': '4464.240723', time: 0.005, eta: 0:43:14
2020-09-09 17:42:07,639-INFO: iter: 100, lr: 0.000025, 'loss_xy': '1.032357', 'loss_wh': '2.916426', 'loss_obj': '10.692862', 'loss_cls': '7.065003', 'loss_iou': '4.001646', 'loss_iou_aware': '4.178246', 'loss': '30.555357', time: 0.615, eta: 3 days, 13:23:23
2020-09-09 17:43:02,254-INFO: iter: 200, lr: 0.000050, 'loss_xy': '0.994666', 'loss_wh': '2.304071', 'loss_obj': '9.268065', 'loss_cls': '6.182310', 'loss_iou': '3.792430', 'loss_iou_aware': '2.720503', 'loss': '25.305017', time: 0.544, eta: 3 days, 3:33:07
F0909 17:43:09.341343  1592 all_reduce_op_handle.cc:192] cudaStreamSynchronize misaligned address
*** Check failure stack trace: ***
    @     0x7f78ccc0783d  google::LogMessage::Fail()
    @     0x7f78ccc0b2ec  google::LogMessage::SendToLog()
    @     0x7f78ccc07363  google::LogMessage::Flush()
    @     0x7f78ccc0c7fe  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f78cfed4cbc  paddle::framework::details::AllReduceOpHandle::SyncNCCLAllReduce()
    @     0x7f78cfed4d54  paddle::framework::details::AllReduceOpHandle::NCCLAllReduceFunc()
    @     0x7f78cfed5d16  paddle::framework::details::AllReduceOpHandle::AllReduceFunc()
    @     0x7f78cfe3367c  paddle::framework::details::FusedAllReduceOpHandle::FusedAllReduceFunc()
    @     0x7f78cfe34c5e  paddle::framework::details::FusedAllReduceOpHandle::RunImpl()
    @     0x7f78cfe70aa1  paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
    @     0x7f78cfe6e59f  paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
    @     0x7f78cfe6e864  _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
    @     0x7f78ccc65413  std::_Function_handler<>::_M_invoke()
    @     0x7f78cca5f107  std::__future_base::_State_base::_M_do_set()
    @     0x7f7918f17827  __pthread_once_slow
    @     0x7f78cfe6aa32  _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
    @     0x7f78cca61564  _ZZN10ThreadPoolC1EmENKUlvE_clEv
    @     0x7f790a2e4421  execute_native_thread_routine_compat
    @     0x7f7918f0f6db  start_thread
    @     0x7f7918c38a3f  clone
    @              (nil)  (unknown)
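
When triaging a crash like this, it can help to pull the last completed training step and the fatal line out of the log. The following is a minimal sketch, assuming the log lines follow the format shown in this report. `summarize_log`, `ITER_RE`, and `FATAL_RE` are illustrative names and are not part of PaddleDetection or PaddlePaddle.

```python
import re

# Training progress lines, e.g.
# "... iter: 200, lr: 0.000050, ..., 'loss': '25.305017', time: 0.544, ..."
ITER_RE = re.compile(r"iter: (\d+), lr: ([\d.]+).*'loss': '([\d.]+)'")

# glog fatal lines start with "F", then MMDD, a timestamp, a thread id,
# and "file.cc:line]" before the message.
FATAL_RE = re.compile(r"^F\d{4} [\d:.]+\s+\d+ (\S+:\d+)\] (.*)$")


def summarize_log(text):
    """Return (last training step, first fatal line); either may be None."""
    last_step = None
    fatal = None
    for line in text.splitlines():
        line = line.strip()
        m = ITER_RE.search(line)
        if m:
            last_step = {
                "iter": int(m.group(1)),
                "lr": float(m.group(2)),
                "loss": float(m.group(3)),
            }
            continue
        m = FATAL_RE.match(line)
        if m and fatal is None:
            fatal = {"location": m.group(1), "message": m.group(2)}
    return last_step, fatal


# Two lines copied from the log in this report.
SAMPLE = """\
2020-09-09 17:43:02,254-INFO: iter: 200, lr: 0.000050, 'loss_xy': '0.994666', 'loss_wh': '2.304071', 'loss_obj': '9.268065', 'loss_cls': '6.182310', 'loss_iou': '3.792430', 'loss_iou_aware': '2.720503', 'loss': '25.305017', time: 0.544, eta: 3 days, 3:33:07
F0909 17:43:09.341343  1592 all_reduce_op_handle.cc:192] cudaStreamSynchronize misaligned address
"""

if __name__ == "__main__":
    step, fatal = summarize_log(SAMPLE)
    print(step)
    print(fatal)
```

On the two sample lines, the last logged step is iteration 200 and the fatal line points at `all_reduce_op_handle.cc:192`, which matches the `AllReduceOpHandle::SyncNCCLAllReduce()` frame in the stack trace.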
Reference: paddlepaddle/PaddleDetection#1379