Fluid distributed training TODO
Created by: Yancey1989
Fluid Distributed Training Features
- code cleanup and polish
- implement LARS to improve training performance, #6811 (closed) (see the update-rule sketch after this list)
- fault tolerance
  - checkpointing and recovering parameters on the pserver
  - recover the reader offset (may need master and etcd)
  - trainer pre-fetches parameters from the pserver after a restart
- async training, #9941 (closed)
- distributed data reader (should be unified with the single-machine reader); see the sharding sketch after this list
- calculate global AUC with the distributed table
- initialize trainable parameters from saved parameters on a trainer
- ring-based architecture to improve training performance (see the ring all-reduce sketch after this list)
- distributed lookup table, https://github.com/PaddlePaddle/Paddle/projects/56
- full overlapping of communication and computation with ParallelExecutor in distributed training
- split send_op into multiple send_vars_op and fetch_vars_op, #9161 (closed)
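
For the LARS item above, the core of the change is a per-layer learning-rate scale. Below is a minimal sketch of the update rule in NumPy, assuming momentum SGD as the base optimizer; names such as `trust_coeff` and `lars_update` are illustrative, not Fluid's actual API.

```python
import numpy as np

def lars_update(param, grad, velocity, base_lr,
                trust_coeff=0.001, weight_decay=0.0005, momentum=0.9):
    """Apply one LARS step to a single layer's parameter in place."""
    w_norm = np.linalg.norm(param)
    g_norm = np.linalg.norm(grad)
    # Layer-wise learning rate: scale by the ratio of weight norm to gradient norm.
    local_lr = trust_coeff * w_norm / (g_norm + weight_decay * w_norm + 1e-12)
    velocity[:] = momentum * velocity + base_lr * local_lr * (grad + weight_decay * param)
    param -= velocity
    return param, velocity
```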
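For the distributed data reader item, one way to unify with the single-machine reader is to wrap an existing reader creator and shard samples by trainer id. A minimal sketch, assuming Fluid-style reader creators (a function returning a sample generator); the argument names are illustrative:

```python
def shard_reader(reader, trainer_id, trainer_count):
    """Wrap a single-machine reader so each trainer yields every trainer_count-th sample."""
    def sharded():
        for idx, sample in enumerate(reader()):
            if idx % trainer_count == trainer_id:
                yield sample
    return sharded
```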
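For the ring-based architecture item, the sketch below simulates chunked ring all-reduce in a single process to illustrate the scatter-reduce and all-gather phases; a real implementation (e.g. NCCL2 or an RPC-based ring) would run these transfers concurrently across nodes.

```python
import numpy as np

def ring_allreduce(grads):
    """grads: one equally-shaped gradient array per trainer; returns the element-wise
    sum as seen by every trainer after a chunked ring exchange."""
    n = len(grads)
    chunks = [np.array_split(g.ravel().copy(), n) for g in grads]
    # Scatter-reduce: after n-1 steps trainer i owns the fully reduced chunk (i+1) % n.
    for step in range(n - 1):
        # All trainers send in parallel, so snapshot the outgoing chunks first.
        sends = [((i - step) % n, chunks[i][(i - step) % n].copy()) for i in range(n)]
        for i in range(n):
            chunk_id, data = sends[(i - 1) % n]  # receive from the left neighbour
            chunks[i][chunk_id] += data
    # All-gather: circulate each fully reduced chunk around the ring.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy()) for i in range(n)]
        for i in range(n):
            chunk_id, data = sends[(i - 1) % n]
            chunks[i][chunk_id] = data
    return [np.concatenate(c).reshape(grads[0].shape) for c in chunks]

# Toy check: four trainers, identical summed gradients on each after the all-reduce.
grads = [np.random.rand(6, 3) for _ in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```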
EDL
- implement the master process to schedule tasks
- etcd operator
- implement a CRD (CustomResourceDefinition) to support Kubernetes v1.8 (a sketch follows this list)
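
A minimal sketch of the CustomResourceDefinition manifest the EDL controller could register on Kubernetes v1.8 (`apiextensions.k8s.io/v1beta1`), expressed as a Python dict; the group and kind names (`paddlepaddle.org`, `TrainingJob`) are assumptions for illustration.

```python
import yaml

training_job_crd = {
    "apiVersion": "apiextensions.k8s.io/v1beta1",  # CRD API group available since k8s v1.7/v1.8
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "trainingjobs.paddlepaddle.org"},
    "spec": {
        "group": "paddlepaddle.org",
        "version": "v1",
        "scope": "Namespaced",
        "names": {
            "kind": "TrainingJob",
            "plural": "trainingjobs",
            "singular": "trainingjob",
            "shortNames": ["tj"],
        },
    },
}

# Dump to YAML and feed the output to `kubectl create -f -`.
print(yaml.safe_dump(training_job_crd))
```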
Support different communication libraries
- gRPC performance enhancement
- OpenMPI with RDMA and GPUDirect
- NCCL2 with multiple nodes
- follow up on bRPC
Experiment
- influence of different distributed training strategies (sync, async, etc.) on model accuracy and throughput
CE
- automatically execute benchmark jobs on AWS and generate a report
Future
- differences between multi-machine single-device and multi-machine multi-device training
- better integration with single-machine training
- think about more flexible user-customized device placement for multi-machine training.
- discuss whether we need a remote executor