Paper reproduction: NCCL error, unhandled system error
Created by: jordan2013
Version / environment information: 1) PaddlePaddle version: 1.8 2) CPU: (not filled in) 3) GPU: AI Studio four-card script environment 4) System environment: (not filled in; the logs below show Python 3.7.0). Note: this information can be obtained by running summary_env.py.
- Training information: 1) single-machine/multi-machine, single-card/multi-card 2) GPU memory information 3) Operator information
- Reproduction information: script job number: 23007
- Problem description (the same traceback is printed by each of the failing trainer processes; the interleaved copies are merged into one below):

Traceback (most recent call last):
  File "main.py", line 732, in
    main()
  File "main.py", line 57, in main
    strategy = fluid.dygraph.parallel.prepare_context()
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py", line 50, in prepare_context
    parallel_helper._init_parallel_ctx()
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel_helper.py", line 37, in _init_parallel_ctx
    __parallel_ctx__clz__.init()
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::NCCLCommContext::CreateNCCLComm(ncclUniqueId*, int, int, int, int)
3   paddle::imperative::NCCLParallelContext::Init()
Error Message Summary:
ExternalError: Nccl error, unhandled system error at (/paddle/paddle/fluid/platform/collective_helper.cc:69)
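The `collective_helper.cc:69` message only reports that `ncclCommInitRank` failed; it does not say why. NCCL's own debug logging usually does. A minimal sketch of enabling it before launching the trainers (`NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables; the launch command in the comment is only illustrative):

```python
import os

def nccl_debug_env(base_env=None):
    """Return a copy of the environment with NCCL debug logging enabled.

    With NCCL_DEBUG=INFO, every rank prints the underlying cause of an
    "unhandled system error" / "unhandled cuda error" to stderr
    (e.g. a socket bind failure or a CUDA driver mismatch).
    """
    env = dict(base_env if base_env is not None else os.environ)
    env["NCCL_DEBUG"] = "INFO"        # verbose NCCL logging
    env["NCCL_DEBUG_SUBSYS"] = "ALL"  # include INIT/NET/GRAPH subsystems
    return env

# Illustrative use: pass the env to the trainer process, e.g.
# subprocess.Popen(["python", "-u", "main.py"], env=nccl_debug_env())
```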
Other info / logs:

[INFO]: current net device: eth0, ip: 172.28.0.81
[INFO]: paddle job envs: POD_IP=172.28.0.81 PADDLE_PORT=12345 PADDLE_TRAINER_ID=0 PADDLE_TRAINERS_NUM=1 PADDLE_USE_CUDA=1 NCCL_SOCKET_IFNAME=eth0 PADDLE_IS_LOCAL=1 OUTPUT_PATH=/root/paddlejob/workspace/output LOCAL_LOG_PATH=/root/paddlejob/workspace/log LOCAL_MOUNT_PATH=/mnt/code_20200905112320,/mnt/datasets_20200905112320 JOB_ID=job-d7944bcff2dc7eac7eafc9e79a9a0c5f TRAINING_ROLE=TRAINER
[INFO]: user command: python -u run.py
[INFO]: start trainer
~/paddlejob/workspace/code /mnt
mkdir: cannot create directory `/root/paddlejob/workspace/output/': File exists
unzip success. /root/paddlejob/workspace/output/UCF-101-Img
video split success.
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:17: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Mapping, defaultdict
Parse frames under folder /root/paddlejob/workspace/output/UCF-101-Img
Writing list files for training/testing
List files successfully saved to "data/" folder!
(each of the above messages is printed once per trainer process)
Environment Versions:
- Python: 3.7.0 (default, Nov 24 2018, 08:51:28) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
ECOfull Configurations:
- dataset: ucf101
- modality: RGB
- train_list: /root/paddlejob/workspace/output/ucf101_rgb_train_split_1.txt
- val_list: /root/paddlejob/workspace/output/ucf101_rgb_val_split_1.txt
- net_model: None
- net_model2D: None
- net_modelECO: /root/paddlejob/workspace/train_data/datasets/data51969/eco-pp
- net_model3D: None
- arch: ECOfull
- num_segments: 24
- consensus_type: identity
- pretrained_parts: finetune
- k: 3
- dropout: 0
- loss_type: nll
- epochs: 60
- batch_size: 16
- iter_size: 1
- lr: 0.001
- lr_steps: [20, 40]
- momentum: 0.9
- weight_decay: 0.0005
- clip_gradient: 50
- no_partialbn: False
- nesterov: True
- num_saturate: 5
- print_freq: 20
- eval_freq: 5
- workers: 4
- resume:
- evaluate: False
- snapshot_pref: net_runs
- start_epoch: 0
- gpus: None
- flow_prefix:
- rgb_prefix: img_
- use_gpu: True
- num_gpus: 4
- use_data_parallel: True
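The configuration block above reads like the usual argparse dump from an ECO training script. For orientation, a hypothetical reconstruction of a few of the flags involved in the failure (flag names mirror the printout above; the defaults are only the values shown there, not verified against the actual `main.py`):

```python
import argparse

def build_parser():
    # Hypothetical sketch of a subset of the flags from the config dump.
    p = argparse.ArgumentParser(description="ECOfull training (sketch)")
    p.add_argument("--dataset", default="ucf101")
    p.add_argument("--num_segments", type=int, default=24)
    p.add_argument("--batch_size", type=int, default=16)
    p.add_argument("--lr", type=float, default=0.001)
    p.add_argument("--lr_steps", type=int, nargs="+", default=[20, 40])
    p.add_argument("--num_gpus", type=int, default=4)
    # use_data_parallel=True is what routes main.py into
    # fluid.dygraph.parallel.prepare_context(), where the NCCL init fails.
    p.add_argument("--use_data_parallel", action="store_true", default=True)
    return p

args = build_parser().parse_args([])
```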
(the same "Environment Versions" and "ECOfull Configurations" block is printed by each of the four trainer processes)
I0905 12:50:03.092523 18839 nccl_context.cc:127] init nccl context nranks: 4 local rank: 1 gpu id: 1
I0905 12:50:03.092571 18838 nccl_context.cc:127] init nccl context nranks: 4 local rank: 0 gpu id: 0
I0905 12:50:03.092572 18840 nccl_context.cc:127] init nccl context nranks: 4 local rank: 2 gpu id: 2
I0905 12:50:03.092618 18841 nccl_context.cc:127] init nccl context nranks: 4 local rank: 3 gpu id: 3
Traceback (most recent call last):
  File "main.py", line 732, in
    main()
  File "main.py", line 57, in main
    strategy = fluid.dygraph.parallel.prepare_context()
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py", line 50, in prepare_context
    parallel_helper._init_parallel_ctx()
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel_helper.py", line 37, in _init_parallel_ctx
    __parallel_ctx__clz__.init()
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::NCCLCommContext::CreateNCCLComm(ncclUniqueId*, int, int, int, int)
3   paddle::imperative::NCCLParallelContext::Init()
Error Message Summary:
ExternalError: Nccl error, unhandled cuda error at (/paddle/paddle/fluid/platform/collective_helper.cc:69)
The remaining trainer processes fail with the same traceback and C++ call stack, each reporting:

Error Message Summary:
ExternalError: Nccl error, unhandled system error at (/paddle/paddle/fluid/platform/collective_helper.cc:69)
ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 3] was aborted. Please check its log.
Traceback (most recent call last):
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/utils.py", line 406, in watch_local_trainers
    terminate_local_procs(procs)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/utils.py", line 257, in terminate_local_procs
    p.proc.join(timeout=1)
AttributeError: 'Popen' object has no attribute 'join'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "run.py", line 190, in
    launch.launch(args)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/launch.py", line 220, in launch
    alive = watch_local_trainers(procs, cluster.trainers_nranks())
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/utils.py", line 423, in watch_local_trainers
    terminate_local_procs(procs)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/utils.py", line 257, in terminate_local_procs
    p.proc.join(timeout=1)
AttributeError: 'Popen' object has no attribute 'join'
/mnt
[INFO]: train job failed! train_ret: 1
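The final `AttributeError` is a separate, secondary bug in the launcher's cleanup path, not the cause of the NCCL failure: `terminate_local_procs` calls `p.proc.join(timeout=1)`, but `subprocess.Popen` objects (unlike `multiprocessing.Process`) have no `join` method; the equivalent Popen call is `wait(timeout=...)`. A minimal, standard-library-only demonstration:

```python
import subprocess
import sys

# Start a trivial child process, much as paddle.distributed.launch
# spawns one Popen per local trainer.
proc = subprocess.Popen([sys.executable, "-c", "pass"])

# subprocess.Popen has no join(); calling it raises exactly the
# AttributeError seen at the end of the log above.
assert not hasattr(proc, "join")

# wait(timeout=...) is the Popen equivalent of multiprocessing's join().
ret = proc.wait(timeout=10)
print(ret)  # 0 for a clean exit
```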