Created by: sandyhouse
PR types
New features
PR changes
APIs
Describe
The API added in this PR initializes the distributed environment for distributed training.
How to start distributed training:
fleetrun --gpus="0,1" train.py
The above command launches two processes on a single machine using 2 GPUs; each process runs the train.py script.
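Each launched process needs its own rank and the total number of processes. The sketch below shows one way train.py might obtain them. It assumes the launcher exports the environment variables PADDLE_TRAINER_ID and PADDLE_TRAINERS_NUM; those names and the helper get_rank_info are assumptions for illustration and are not part of this PR.

```python
import os


def get_rank_info(env=None):
    """Return (rank, rank_num) for the current process.

    Hypothetical helper. The variable names below are an assumption about
    what the launcher exports; adjust them if your launcher differs.
    """
    env = os.environ if env is None else env
    # Fall back to a single-process setup when the variables are absent.
    rank = int(env.get("PADDLE_TRAINER_ID", "0"))
    rank_num = int(env.get("PADDLE_TRAINERS_NUM", "1"))
    if not 0 <= rank < rank_num:
        raise ValueError(f"rank {rank} is out of range for {rank_num} processes")
    return rank, rank_num


if __name__ == "__main__":
    rank, rank_num = get_rank_info()
    # These two values would then be passed to the initialization call.
    print(f"process {rank} of {rank_num}")
```

Accepting the environment as an argument keeps the helper easy to check without actually launching multiple processes.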
How to use this API in train.py:
init_distributed_context(rank, rank_num, path="/tmp/tmp0")
This initializes both the NCCL and Gloo environments. All processes must run on the same machine, because a local file system path (here, '/tmp/tmp0') is used to initialize Gloo.
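To illustrate why a local path ties all processes to one machine, here is a minimal sketch of a file-based rendezvous. It is not Gloo's actual implementation: the function file_rendezvous and the marker-file naming are hypothetical. Each rank writes a marker file under the shared path and waits until every other rank's marker is visible, which can only happen if all ranks see the same file system.

```python
import os
import tempfile
import threading
import time


def file_rendezvous(rank, rank_num, path, timeout=10.0):
    """Illustrative rendezvous: return True once all ranks have checked in."""
    os.makedirs(path, exist_ok=True)
    # Announce this rank by creating a marker file in the shared directory.
    with open(os.path.join(path, f"rank_{rank}"), "w") as f:
        f.write(str(os.getpid()))
    expected = {f"rank_{r}" for r in range(rank_num)}
    deadline = time.monotonic() + timeout
    # Poll until every rank's marker is visible or the timeout expires.
    while time.monotonic() < deadline:
        if expected.issubset(os.listdir(path)):
            return True
        time.sleep(0.01)
    return False


if __name__ == "__main__":
    shared = tempfile.mkdtemp()
    results = {}

    def worker(r):
        results[r] = file_rendezvous(r, 2, shared)

    # Threads stand in for the two launched processes.
    threads = [threading.Thread(target=worker, args=(r,)) for r in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)
```

A process on another machine would write its marker to its own local disk, so the ranks would never see each other and the rendezvous would time out.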