Pserver: Port conflict has not been solved!
Created by: yangjing14
1) PaddlePaddle version: Fluid 1.3 2) CPU: 3) GPU: none 4) System environment: MPI cluster
-
Training info: 1) MPI cluster
-
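Problem description: When a job is launched from Paddle Cloud on the MPI cluster, the Pserver always detects a port conflict during startup. Even after every candidate port has been retried, the conflict is still not resolved, and the subsequent training then reports gRPC errors.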
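To narrow this down, it may help to check on the affected node whether the candidate ports are really occupied. The following is a minimal sketch, not the check that Paddle Cloud itself runs: the helper names `port_is_free` and `first_free_port` are hypothetical, and the candidate range 62000 to 62004 is taken only from the ports that appear in the log.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            # No SO_REUSEADDR here: a port held by another socket must fail.
            s.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(candidates, host="0.0.0.0"):
    """Return the first bindable port in candidates, or None if all are taken."""
    for port in candidates:
        if port_is_free(port, host):
            return port
    return None


if __name__ == "__main__":
    # Assumed candidate list, copied from the ports reported in job.err.log.
    candidates = range(62000, 62005)
    for port in candidates:
        print(port, "free" if port_is_free(port) else "in use")
    print("first free port:", first_free_port(candidates))
```

If this reports the ports as free while the job still logs conflicts, the conflict is more likely in how the launcher detects or assigns ports than in the ports actually being taken.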
-
Relevant logs:
job.err.log
++ echo '[pserver port] Port 62004 conflict!'
++ echo '[pserver port] Port 62000 conflict!'
++ echo '[pserver port] Port 62001 conflict!'
++ echo '[pserver port] Port 62002 conflict!'
++ echo '[pserver port] Port 62003 conflict!'
++ echo '[INFO] There is no port left can be bind in [sys_pserver_alter_ports_list]'
++ echo '[pserver port] All alternative ports have been used! Port conflict has not been solved!'
trainer.log
F0710 13:11:24.494246 34492 grpc_client.cc:408] GetRPC name:[emb_query.block0], ep:[10.90.104.41:62003], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:
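The trainer's failing endpoint uses port 62003, which is one of the ports reported as conflicting, so the Deadline Exceeded error (gRPC status code 4) is plausibly a consequence of the Pserver never binding successfully. A rough sketch for cross-checking the two logs is shown below; the function names and regular expressions are assumptions written against the log lines quoted here, not part of PaddlePaddle.

```python
import re

# Patterns written against the log lines quoted in this issue (assumed format).
CONFLICT_RE = re.compile(r"\[pserver port\] Port (\d+) conflict!")
ENDPOINT_RE = re.compile(r"ep:\[([\d.]+):(\d+)\]")


def conflicted_ports(err_log):
    """Return the sorted, de-duplicated ports reported as conflicting."""
    return sorted({int(p) for p in CONFLICT_RE.findall(err_log)})


def failing_endpoints(trainer_log):
    """Return (host, port) pairs of the endpoints named in gRPC error lines."""
    return [(host, int(port)) for host, port in ENDPOINT_RE.findall(trainer_log)]


def endpoints_on_conflicted_ports(err_log, trainer_log):
    """Return the failing endpoints whose port also appears as conflicting."""
    bad = set(conflicted_ports(err_log))
    return [ep for ep in failing_endpoints(trainer_log) if ep[1] in bad]


if __name__ == "__main__":
    err_log = (
        "++ echo '[pserver port] Port 62004 conflict!'\n"
        "++ echo '[pserver port] Port 62003 conflict!'\n"
    )
    trainer_log = (
        "grpc_client.cc:408] GetRPC name:[emb_query.block0], "
        "ep:[10.90.104.41:62003], status:[-1] meets grpc error"
    )
    print(endpoints_on_conflicted_ports(err_log, trainer_log))
```

In practice the two strings would be read from job.err.log and trainer.log on the node that ran the job.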