PaddlePaddle distributed training API
Created by: wangkuiyi
Here is the API design of PyTorch.distributed: https://github.com/pytorch/pytorch/issues/241. It looks very similar to MPI in that it
- has the concept of a process group, which corresponds to the "world" in MPI.
- has the concept of rank, a 0-based id of each process.
- processes take actions according to their ranks.
- processes communicate and synchronize with each other by calling `send`, `recv`, and other MPI-style functions (see the sketch after this list).
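
To make the MPI flavor concrete, below is a minimal sketch of rank-based point-to-point communication in this style. It assumes the present-day `torch.distributed` Python API (`init_process_group`, `get_rank`, `send`, `recv`) and a launcher that sets the rendezvous environment variables; it is illustrative only, not necessarily the exact API proposed in the linked issue.

```python
# Minimal sketch of MPI-style communication with torch.distributed.
# Assumes the processes are started by a launcher (e.g. `torchrun --nproc_per_node=2 demo.py`)
# that sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE for the env:// rendezvous.
import torch
import torch.distributed as dist


def run():
    dist.init_process_group(backend="gloo")  # join the process group ("world" in MPI terms)
    rank = dist.get_rank()                   # 0-based id of this process

    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1.0
        dist.send(tensor, dst=1)             # rank 0 sends its tensor to rank 1
    elif rank == 1:
        dist.recv(tensor, src=0)             # rank 1 blocks until the tensor arrives
    print(f"rank {rank} has tensor {tensor.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    run()
```

Note how each process takes a different branch purely based on its rank; if any process dies, its peers block forever in `send`/`recv`, which is the fault-recovery weakness discussed below.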
I personally believe that MPI is very flexible and able to describe various concurrent algorithms, but it is not friendly to fault recovery. PyTorch.distributed seems well suited for researchers who work on relatively small datasets and/or on a small group of computers/GPUs. However, PaddlePaddle, as an industrial application platform, cannot follow this style.