Distributed training hits a distributed-communication error: SendRPC name:[NCCLID], ep:[10.255.66.20:9184], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details
Created by: bit-pku-zdf
- Version and environment info: 1) PaddlePaddle version: 1.5.1 2) GPU: training on the Kongming cluster, CUDA 9.0, cuDNN 7.0 3) System environment: Kongming cluster, GPU P40
- Training info: 1) Multi-node training using fluid.DistributeTranspiler in nccl mode (allreduce-style distributed training); a configuration sketch follows the log below. 2) Error message:
worker_endpoints:10.255.76.23:9184,10.255.66.20:9184 trainers_num:2 current_endpoint:10.255.76.23:9184 trainer_id:0
yq01-sys-hic-p40-0030:19866:19866 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
yq01-sys-hic-p40-0030:19866:19866 [0] NCCL INFO Using internal Network Socket
yq01-sys-hic-p40-0030:19866:19866 [0] NCCL INFO NET : Using interface eth2:10.255.76.23<0>
yq01-sys-hic-p40-0030:19866:19866 [0] NCCL INFO NET/Socket : 1 interfaces found
I1015 12:24:49.366034 19866 rpc_client.h:101] init rpc client with trainer_id 0
F1015 12:24:49.371834 27437 grpc_client.cc:424] SendRPC name:[NCCLID], ep:[10.255.66.20:9184], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:
*** Check failure stack trace: ***
@ 0x7fc3573f355d google::LogMessage::Fail()
@ 0x7fc3573f700c google::LogMessage::SendToLog()
@ 0x7fc3573f3083 google::LogMessage::Flush()
@ 0x7fc3573f851e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fc358d2a61a paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7fc377bb88a0 execute_native_thread_routine
@ 0x7fc3ff6841c3 start_thread
@ 0x7fc3fecac12d __clone
@ (nil) (unknown)
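For reference, this is roughly how the transpiler is set up on each trainer in nccl2 (allreduce) mode. It is a minimal sketch: the endpoint values are copied from the log above, and in the real job they come from the cluster launcher rather than being hard-coded.

```python
import paddle.fluid as fluid

# Values copied from the log above; in the real job they come from the launcher.
worker_endpoints = "10.255.76.23:9184,10.255.66.20:9184"
current_endpoint = "10.255.76.23:9184"
trainer_id = 0

config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"  # allreduce-style multi-node training, no parameter server

t = fluid.DistributeTranspiler(config=config)
# In nccl2 mode, `trainers` is the comma-separated list of all trainer endpoints
# and `current_endpoint` is this trainer's own endpoint. As far as I understand,
# the transpiler adds a step that exchanges the NCCL unique ID between trainers
# over gRPC, which is the SendRPC name:[NCCLID] call that fails in the log above.
t.transpile(trainer_id,
            trainers=worker_endpoints,
            current_endpoint=current_endpoint,
            startup_program=fluid.default_startup_program())
```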
NCCL-related parameters are set as follows (exported from the job's shell script); a small check that these actually reach the training process is sketched after the list:
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
#export NCCL_IB_GDR_LEVEL=4
export NCCL_IB_GID_INDEX=3
#export NCCL_SOCKET_IFNAME=eth2
export NCCL_IB_CUDA_SUPPORT=1
export NCCL_P2P_DISABLE=1
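To rule out the exports simply not reaching the training process, a check like the following can be dropped into the training script on each node (just a sketch; the variable list mirrors the exports above):

```python
import os

# Print the NCCL-related environment variables as seen by the Python process,
# to confirm the shell exports were inherited on every node.
nccl_vars = [
    "NCCL_DEBUG", "NCCL_IB_DISABLE", "NCCL_IB_GDR_LEVEL", "NCCL_IB_GID_INDEX",
    "NCCL_SOCKET_IFNAME", "NCCL_IB_CUDA_SUPPORT", "NCCL_P2P_DISABLE",
]
for name in nccl_vars:
    print("{}={}".format(name, os.environ.get(name, "<unset>")))
```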
Also, this error does not occur consistently: for example, I submitted the job five times in a row today, the first four runs hit this error, and the fifth did not and trained normally (all five submissions used the same machines, 10.255.76.23 and 10.255.66.20, on the Kongming cluster).
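For what it's worth, gRPC error_code 14 is UNAVAILABLE, so one possible explanation for the intermittent failures is a startup race: trainer 0 sends the NCCLID before the other trainer's endpoint (10.255.66.20:9184) is listening. A rough workaround sketch is to wait until the peer endpoint accepts TCP connections before calling transpile; note that wait_for_peers below is my own hypothetical helper, not a Paddle API:

```python
import socket
import time

def wait_for_peers(endpoints, timeout=300, interval=3):
    """Block until every "ip:port" in `endpoints` accepts a TCP connection."""
    deadline = time.time() + timeout
    pending = list(endpoints)
    while pending and time.time() < deadline:
        still_waiting = []
        for ep in pending:
            ip, port = ep.split(":")
            sock = None
            try:
                sock = socket.create_connection((ip, int(port)), timeout=2)
            except socket.error:
                still_waiting.append(ep)  # peer not listening yet
            finally:
                if sock is not None:
                    sock.close()
        pending = still_waiting
        if pending:
            time.sleep(interval)
    return not pending

# e.g. on trainer 0, before calling transpile():
# if not wait_for_peers(["10.255.66.20:9184"]):
#     raise RuntimeError("peer trainer endpoint never came up")
```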