Paper reproduction: NCCL error, unhandled system error
Created by: jordan2013
Version / environment information: 1) PaddlePaddle version: 1.8 2) CPU: (not filled in) 3) GPU: AI Studio four-card script environment 4) System environment: (not filled in; the logs below show Python 3.7.0). Note: this information can be obtained by running summary_env.py.
- Training information: 1) single-machine/multi-machine, single-card/multi-card 2) GPU memory information 3) Operator information
- Reproduction information: script job number: 23007
- Problem description (the same traceback is printed by each of the failing trainer processes; the interleaved copies are merged into one below):

Traceback (most recent call last):
  File "main.py", line 732, in
    main()
  File "main.py", line 57, in main
    strategy = fluid.dygraph.parallel.prepare_context()
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py", line 50, in prepare_context
    parallel_helper._init_parallel_ctx()
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel_helper.py", line 37, in _init_parallel_ctx
    __parallel_ctx__clz__.init()
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::NCCLCommContext::CreateNCCLComm(ncclUniqueId*, int, int, int, int)
3   paddle::imperative::NCCLParallelContext::Init()
Error Message Summary:
ExternalError: Nccl error, unhandled system error at (/paddle/paddle/fluid/platform/collective_helper.cc:69)
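The `collective_helper.cc:69` message only reports that `ncclCommInitRank` failed; it does not say why. NCCL's own debug logging usually does. A minimal sketch of enabling it before launching the trainers (`NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables; the launch command in the comment is only illustrative):

```python
import os

def nccl_debug_env(base_env=None):
    """Return a copy of the environment with NCCL debug logging enabled.

    With NCCL_DEBUG=INFO, every rank prints the underlying cause of an
    "unhandled system error" / "unhandled cuda error" to stderr
    (e.g. a socket bind failure or a CUDA driver mismatch).
    """
    env = dict(base_env if base_env is not None else os.environ)
    env["NCCL_DEBUG"] = "INFO"        # verbose NCCL logging
    env["NCCL_DEBUG_SUBSYS"] = "ALL"  # include INIT/NET/GRAPH subsystems
    return env

# Illustrative use: pass the env to the trainer process, e.g.
# subprocess.Popen(["python", "-u", "main.py"], env=nccl_debug_env())
```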
Other info / logs:

[INFO]: current net device: eth0, ip: 172.28.0.81
[INFO]: paddle job envs: POD_IP=172.28.0.81 PADDLE_PORT=12345 PADDLE_TRAINER_ID=0 PADDLE_TRAINERS_NUM=1 PADDLE_USE_CUDA=1 NCCL_SOCKET_IFNAME=eth0 PADDLE_IS_LOCAL=1 OUTPUT_PATH=/root/paddlejob/workspace/output LOCAL_LOG_PATH=/root/paddlejob/workspace/log LOCAL_MOUNT_PATH=/mnt/code_20200905112320,/mnt/datasets_20200905112320 JOB_ID=job-d7944bcff2dc7eac7eafc9e79a9a0c5f TRAINING_ROLE=TRAINER
[INFO]: user command: python -u run.py
[INFO]: start trainer
~/paddlejob/workspace/code /mnt
mkdir: cannot create directory `/root/paddlejob/workspace/output/': File exists
unzip success. /root/paddlejob/workspace/output/UCF-101-Img
video split success.
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:17: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Mapping, defaultdict
Parse frames under folder /root/paddlejob/workspace/output/UCF-101-Img
Writing list files for training/testing
List files successfully saved to "data/" folder!
(each of the above messages is printed once per trainer process)
Environment Versions:
- Python: 3.7.0 (default, Nov 24 2018, 08:51:28) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
ECOfull Configurations:
- dataset: ucf101
- modality: RGB
- train_list: /root/paddlejob/workspace/output/ucf101_rgb_train_split_1.txt
- val_list: /root/paddlejob/workspace/output/ucf101_rgb_val_split_1.txt
- net_model: None
- net_model2D: None
- net_modelECO: /root/paddlejob/workspace/train_data/datasets/data51969/eco-pp
- net_model3D: None
- arch: ECOfull
- num_segments: 24
- consensus_type: identity
- pretrained_parts: finetune
- k: 3
- dropout: 0
- loss_type: nll
- epochs: 60
- batch_size: 16
- iter_size: 1
- lr: 0.001
- lr_steps: [20, 40]
- momentum: 0.9
- weight_decay: 0.0005
- clip_gradient: 50
- no_partialbn: False
- nesterov: True
- num_saturate: 5
- print_freq: 20
- eval_freq: 5
- workers: 4
- resume:
- evaluate: False
- snapshot_pref: net_runs
- start_epoch: 0
- gpus: None
- flow_prefix:
- rgb_prefix: img_
- use_gpu: True
- num_gpus: 4
- use_data_parallel: True
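The configuration block above reads like the usual argparse dump from an ECO training script. For orientation, a hypothetical reconstruction of a few of the flags involved in the failure (flag names mirror the printout above; the defaults are only the values shown there, not verified against the actual `main.py`):

```python
import argparse

def build_parser():
    # Hypothetical sketch of a subset of the flags from the config dump.
    p = argparse.ArgumentParser(description="ECOfull training (sketch)")
    p.add_argument("--dataset", default="ucf101")
    p.add_argument("--num_segments", type=int, default=24)
    p.add_argument("--batch_size", type=int, default=16)
    p.add_argument("--lr", type=float, default=0.001)
    p.add_argument("--lr_steps", type=int, nargs="+", default=[20, 40])
    p.add_argument("--num_gpus", type=int, default=4)
    # use_data_parallel=True is what routes main.py into
    # fluid.dygraph.parallel.prepare_context(), where the NCCL init fails.
    p.add_argument("--use_data_parallel", action="store_true", default=True)
    return p

args = build_parser().parse_args([])
```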
(the same "Environment Versions" and "ECOfull Configurations" block is printed by each of the four trainer processes)
I0905 12:50:03.092523 18839 nccl_context.cc:127] init nccl context nranks: 4 local rank: 1 gpu id: 1
I0905 12:50:03.092571 18838 nccl_context.cc:127] init nccl context nranks: 4 local rank: 0 gpu id: 0
I0905 12:50:03.092572 18840 nccl_context.cc:127] init nccl context nranks: 4 local rank: 2 gpu id: 2
I0905 12:50:03.092618 18841 nccl_context.cc:127] init nccl context nranks: 4 local rank: 3 gpu id: 3
Traceback (most recent call last):
  File "main.py", line 732, in
    main()
  File "main.py", line 57, in main
    strategy = fluid.dygraph.parallel.prepare_context()
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py", line 50, in prepare_context
    parallel_helper._init_parallel_ctx()
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel_helper.py", line 37, in _init_parallel_ctx
    __parallel_ctx__clz__.init()
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::NCCLCommContext::CreateNCCLComm(ncclUniqueId*, int, int, int, int)
3   paddle::imperative::NCCLParallelContext::Init()
Error Message Summary:
ExternalError: Nccl error, unhandled cuda error at (/paddle/paddle/fluid/platform/collective_helper.cc:69)
The remaining trainer processes fail with the same traceback and C++ call stack, each reporting:

Error Message Summary:
ExternalError: Nccl error, unhandled system error at (/paddle/paddle/fluid/platform/collective_helper.cc:69)
ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 3] was aborted. Please check its log.
Traceback (most recent call last):
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/utils.py", line 406, in watch_local_trainers
    terminate_local_procs(procs)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/utils.py", line 257, in terminate_local_procs
    p.proc.join(timeout=1)
AttributeError: 'Popen' object has no attribute 'join'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "run.py", line 190, in
    launch.launch(args)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/launch.py", line 220, in launch
    alive = watch_local_trainers(procs, cluster.trainers_nranks())
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/utils.py", line 423, in watch_local_trainers
    terminate_local_procs(procs)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/utils.py", line 257, in terminate_local_procs
    p.proc.join(timeout=1)
AttributeError: 'Popen' object has no attribute 'join'
/mnt
[INFO]: train job failed! train_ret: 1
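The final `AttributeError` is a separate, secondary bug in the launcher's cleanup path, not the cause of the NCCL failure: `terminate_local_procs` calls `p.proc.join(timeout=1)`, but `subprocess.Popen` objects (unlike `multiprocessing.Process`) have no `join` method; the equivalent Popen call is `wait(timeout=...)`. A minimal, standard-library-only demonstration:

```python
import subprocess
import sys

# Start a trivial child process, much as paddle.distributed.launch
# spawns one Popen per local trainer.
proc = subprocess.Popen([sys.executable, "-c", "pass"])

# subprocess.Popen has no join(); calling it raises exactly the
# AttributeError seen at the end of the log above.
assert not hasattr(proc, "join")

# wait(timeout=...) is the Popen equivalent of multiprocessing's join().
ret = proc.wait(timeout=10)
print(ret)  # 0 for a clean exit
```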