PaddlePaddle distributed training API
Created by: wangkuiyi
Here is the API design of PyTorch.distributed: https://github.com/pytorch/pytorch/issues/241. It looks very similar to MPI in that it
- has the concept of a process group, which corresponds to the "world" in MPI.
- has the concept of rank, a 0-based id of each process.
- processes take actions according to their ranks.
- processes communicate and synchronize with each other by calling `send`, `recv`, and other MPI-style functions (see the sketch after this list).
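
To make the MPI flavor concrete, below is a minimal sketch of rank-based point-to-point communication in this style. It assumes the present-day `torch.distributed` Python API (`init_process_group`, `get_rank`, `send`, `recv`) and a launcher that sets the rendezvous environment variables; it is illustrative only, not necessarily the exact API proposed in the linked issue.

```python
# Minimal sketch of MPI-style communication with torch.distributed.
# Assumes the processes are started by a launcher (e.g. `torchrun --nproc_per_node=2 demo.py`)
# that sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE for the env:// rendezvous.
import torch
import torch.distributed as dist


def run():
    dist.init_process_group(backend="gloo")  # join the process group ("world" in MPI terms)
    rank = dist.get_rank()                   # 0-based id of this process

    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1.0
        dist.send(tensor, dst=1)             # rank 0 sends its tensor to rank 1
    elif rank == 1:
        dist.recv(tensor, src=0)             # rank 1 blocks until the tensor arrives
    print(f"rank {rank} has tensor {tensor.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    run()
```

Note how each process takes a different branch purely based on its rank; if any process dies, its peers block forever in `send`/`recv`, which is the fault-recovery weakness discussed below.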
I personally believe that MPI is very flexible and able to describe various concurrent algorithms, but it is not friendly to fault recovery. PyTorch.distributed seems well suited for researchers who work on relatively small datasets and/or on a small group of computers/GPUs. However, PaddlePaddle, as an industrial application platform, cannot follow this style.