Fluid distributed training TODO
Created by: Yancey1989
Fluid Distributed Training Features
- code cleanup and polish
- implement LARS to improve training performance, #6811 (closed) (see the update-rule sketch after this list)
- fault tolerance
  - checkpointing and recovering parameters on the pserver
  - recover the reader offset (may need master and etcd)
  - trainer pre-fetches parameters from the pserver after a restart
- async training, #9941 (closed)
- distributed data reader (should be unified with the single-machine reader); see the sharding sketch after this list
- calculate global AUC with the distributed table
- initialize trainable parameters from saved parameters on a trainer
- ring-based architecture to improve training performance (see the ring all-reduce sketch after this list)
- distributed lookup table, https://github.com/PaddlePaddle/Paddle/projects/56
- full overlapping of communication and computation with ParallelExecutor in distributed training
- split send_op into multiple send_vars_op and fetch_vars_op, #9161 (closed)
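
For the LARS item above, the core of the change is a per-layer learning-rate scale. Below is a minimal sketch of the update rule in NumPy, assuming momentum SGD as the base optimizer; names such as `trust_coeff` and `lars_update` are illustrative, not Fluid's actual API.

```python
import numpy as np

def lars_update(param, grad, velocity, base_lr,
                trust_coeff=0.001, weight_decay=0.0005, momentum=0.9):
    """Apply one LARS step to a single layer's parameter in place."""
    w_norm = np.linalg.norm(param)
    g_norm = np.linalg.norm(grad)
    # Layer-wise learning rate: scale by the ratio of weight norm to gradient norm.
    local_lr = trust_coeff * w_norm / (g_norm + weight_decay * w_norm + 1e-12)
    velocity[:] = momentum * velocity + base_lr * local_lr * (grad + weight_decay * param)
    param -= velocity
    return param, velocity
```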
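For the distributed data reader item, one way to unify with the single-machine reader is to wrap an existing reader creator and shard samples by trainer id. A minimal sketch, assuming Fluid-style reader creators (a function returning a sample generator); the argument names are illustrative:

```python
def shard_reader(reader, trainer_id, trainer_count):
    """Wrap a single-machine reader so each trainer yields every trainer_count-th sample."""
    def sharded():
        for idx, sample in enumerate(reader()):
            if idx % trainer_count == trainer_id:
                yield sample
    return sharded
```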
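For the ring-based architecture item, the sketch below simulates chunked ring all-reduce in a single process to illustrate the scatter-reduce and all-gather phases; a real implementation (e.g. NCCL2 or an RPC-based ring) would run these transfers concurrently across nodes.

```python
import numpy as np

def ring_allreduce(grads):
    """grads: one equally-shaped gradient array per trainer; returns the element-wise
    sum as seen by every trainer after a chunked ring exchange."""
    n = len(grads)
    chunks = [np.array_split(g.ravel().copy(), n) for g in grads]
    # Scatter-reduce: after n-1 steps trainer i owns the fully reduced chunk (i+1) % n.
    for step in range(n - 1):
        # All trainers send in parallel, so snapshot the outgoing chunks first.
        sends = [((i - step) % n, chunks[i][(i - step) % n].copy()) for i in range(n)]
        for i in range(n):
            chunk_id, data = sends[(i - 1) % n]  # receive from the left neighbour
            chunks[i][chunk_id] += data
    # All-gather: circulate each fully reduced chunk around the ring.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy()) for i in range(n)]
        for i in range(n):
            chunk_id, data = sends[(i - 1) % n]
            chunks[i][chunk_id] = data
    return [np.concatenate(c).reshape(grads[0].shape) for c in chunks]

# Toy check: four trainers, identical summed gradients on each after the all-reduce.
grads = [np.random.rand(6, 3) for _ in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```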
EDL
- implement the master process to schedule tasks
- etcd operator
- implement a CRD (CustomResourceDefinition) to support Kubernetes v1.8 (a sketch follows this list)
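
A minimal sketch of the CustomResourceDefinition manifest the EDL controller could register on Kubernetes v1.8 (`apiextensions.k8s.io/v1beta1`), expressed as a Python dict; the group and kind names (`paddlepaddle.org`, `TrainingJob`) are assumptions for illustration.

```python
import yaml

training_job_crd = {
    "apiVersion": "apiextensions.k8s.io/v1beta1",  # CRD API group available since k8s v1.7/v1.8
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "trainingjobs.paddlepaddle.org"},
    "spec": {
        "group": "paddlepaddle.org",
        "version": "v1",
        "scope": "Namespaced",
        "names": {
            "kind": "TrainingJob",
            "plural": "trainingjobs",
            "singular": "trainingjob",
            "shortNames": ["tj"],
        },
    },
}

# Dump to YAML and feed the output to `kubectl create -f -`.
print(yaml.safe_dump(training_job_crd))
```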
Support different communication libraries
- gRPC performance enhancement
- OpenMPI with RDMA and GPUDirect
- NCCL2 with multiple nodes
- follow up on bRPC
Experiment
- influence of different distributed training strategies (sync, async, etc.) on model accuracy and throughput
CE
- automatically execute benchmark jobs on AWS and generate a report
Future
- differences between multi-machine single-device and multi-machine multi-device training
- better integration with single-machine training
- think about more flexible user-customized device placement for multi-machine training.
- discuss whether we need a remote executor