Kubernetes cluster distributed training error
Created by: TheodoreG
Hi, I am using Paddle distributed training in "NCCL2" mode within a Kubernetes cluster. I have two worker pods in the same namespace, and they can ping each other successfully:
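For context, the NCCL2-mode setup on each worker looks roughly like the sketch below. It uses the Fluid `DistributeTranspiler` API with `mode = "nccl2"`; the endpoints, trainer ID, and programs shown here are placeholders for illustration, not my exact script (the real values come from environment variables set in each pod's spec):

```python
import paddle.fluid as fluid

# Placeholder endpoints for the two worker pods; in the cluster these are
# filled in from environment variables defined in each pod's spec.
trainer_endpoints = "worker-0:6170,worker-1:6170"
current_endpoint = "worker-0:6170"
trainer_id = 0

# NCCL2 mode: no parameter server; the transpiler only inserts the NCCL
# communication setup between trainers.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"

t = fluid.DistributeTranspiler(config=config)
t.transpile(
    trainer_id,
    trainers=trainer_endpoints,
    current_endpoint=current_endpoint,
    program=fluid.default_main_program(),
    startup_program=fluid.default_startup_program(),
)
```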
But when I start training, it fails with errors:
This error only occurs when the two pods are scheduled on different nodes. When both pods land on the same node, which is effectively single-node multi-GPU training, there is no such error. How can I solve this?
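In case it helps narrow things down, I can turn on NCCL's own diagnostics before launching the trainer. These are NCCL's standard environment variables (not Paddle-specific); the interface name `eth0` below is just a guess for my pods' overlay network, not something I have confirmed:

```python
import os

# Must be set before the NCCL communicator is initialized, i.e. before the
# distributed training program starts running.
os.environ["NCCL_DEBUG"] = "INFO"          # print NCCL transport/connection logs
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # pin NCCL to the pod's network interface (guess)
os.environ["NCCL_IB_DISABLE"] = "1"        # rule out InfiniBand if the nodes have none
```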