Multi-GPU, multi-node development milestones.
Created by: helinwang
- Single node, multiple CPU threads
  - Executor supports multiple threads: @helinwang proposes to drive this.
  - Transpiling: convert the user's input `ProgramDesc` to an `ExecutionPlan` that supports CPU-only sync-SGD data parallelism: @Yancey1989. (See the transpiling sketch after this list.)
- Single node, multiple GPUs
  - Transpiling: convert the user's input `ProgramDesc` to an `ExecutionPlan` that supports GPU sync-SGD data parallelism.
- Multiple nodes
  - Operators for feeding data.
  - Transpiling: convert the user's input `ProgramDesc` to an `ExecutionPlan` that runs on multiple nodes.
    - Send / Recv OP.
    - `ExecutionPlan` partition: partition the single `ExecutionPlan` into multiple `ExecutionPlan`s, one per node; a Send / Recv OP pair is added on every edge that crosses nodes. (See the partition sketch after this list.)
  - Fault tolerance: a single node failure stops the training job and causes a job restart.
    - Every executor should save its state automatically and load it upon restart (see the checkpoint sketch after this list).
  - Elastic ML: the number of nodes can change without interrupting training (the training job will not stop).
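
Below is a minimal sketch of the transpiling step referenced above, for sync-SGD data parallelism. `ProgramDesc`, `ExecutionPlan`, `OpDesc`, and the `sum` / `sgd` / `broadcast` op names are hypothetical stand-ins for illustration, not the actual Fluid API; the point is only that each device gets a full replica of forward/backward and the gradients are aggregated into one synchronous update.

```python
# Hypothetical containers for a data-parallel transpiling sketch.
class OpDesc(object):
    def __init__(self, op_type, inputs, outputs, attrs=None):
        self.op_type = op_type
        self.inputs = inputs        # names of variables read
        self.outputs = outputs      # names of variables written
        self.attrs = attrs or {}

class ProgramDesc(object):
    """The user's program: forward + backward ops in execution order."""
    def __init__(self, ops):
        self.ops = ops

class ExecutionPlan(object):
    """Ops annotated with the place (device or node) that runs them."""
    def __init__(self):
        self.ops = []

    def add_op(self, op, place):
        op.attrs["place"] = place
        self.ops.append(op)

def transpile_data_parallel(program, param_names, devices):
    plan = ExecutionPlan()
    # 1. Every device runs a full copy of forward + backward on its own
    #    shard of the mini-batch.
    for dev in devices:
        for op in program.ops:
            plan.add_op(OpDesc(op.op_type,
                               ["%s/%s" % (dev, v) for v in op.inputs],
                               ["%s/%s" % (dev, v) for v in op.outputs],
                               dict(op.attrs)), dev)
    # 2. Aggregate per-device gradients, update once, broadcast back, so
    #    all devices see identical parameters every step (sync-SGD).
    for p in param_names:
        grads = ["%s/%s@GRAD" % (dev, p) for dev in devices]
        plan.add_op(OpDesc("sum", grads, [p + "@GRAD"]), devices[0])
        plan.add_op(OpDesc("sgd", [p, p + "@GRAD"], [p]), devices[0])
        plan.add_op(OpDesc("broadcast", [p],
                           ["%s/%s" % (dev, p) for dev in devices]), devices[0])
    return plan
```

For the CPU milestone the `devices` would be CPU threads; for the GPU milestone they would be GPU ids, with the `sum` / `broadcast` pair typically replaced by an all-reduce.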
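
The partition step can be sketched in the same spirit: assign each op to a node, and whenever an op consumes a variable produced on a different node, add a Send OP to the producer's plan and a Recv OP to the consumer's plan. This reuses the hypothetical `ExecutionPlan` / `OpDesc` stand-ins above; the `placement` callback and the `peer` attribute are assumptions.

```python
def partition_plan(plan, placement, num_nodes):
    """Split one ExecutionPlan into num_nodes per-node ExecutionPlans,
    inserting Send / Recv OPs on edges that cross node boundaries.
    `placement(op)` returns the node index an op is assigned to."""
    parts = [ExecutionPlan() for _ in range(num_nodes)]
    producer = {}                  # variable name -> node that wrote it

    for op in plan.ops:
        node = placement(op)
        for var in op.inputs:
            src = producer.get(var, node)
            if src != node:
                # The edge crosses nodes: producer sends, consumer receives.
                # (A real implementation would de-duplicate repeated transfers.)
                parts[src].add_op(OpDesc("send", [var], [], {"peer": node}), src)
                parts[node].add_op(OpDesc("recv", [], [var], {"peer": src}), node)
        parts[node].add_op(op, node)
        for var in op.outputs:
            producer[var] = node

    return parts                   # one ExecutionPlan per node
```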
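
For the fault-tolerance item, "every executor should save state automatically" could be realized with periodic, atomic checkpoints on shared storage that are reloaded when the restarted job comes back. The checkpoint directory, file layout, and pickle-based serialization below are assumptions for illustration only, not the actual checkpoint design.

```python
import os
import pickle

CKPT_DIR = "/shared/checkpoints"   # assumed shared filesystem visible to all nodes

def save_state(trainer_id, step, params):
    """Checkpoint the executor state (here: a dict of parameter tensors).
    Write to a temp file and rename so a crash mid-write never corrupts
    the last good checkpoint (rename is atomic on POSIX filesystems)."""
    os.makedirs(CKPT_DIR, exist_ok=True)
    tmp = os.path.join(CKPT_DIR, "trainer_%d.tmp" % trainer_id)
    final = os.path.join(CKPT_DIR, "trainer_%d.ckpt" % trainer_id)
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "params": params}, f)
    os.rename(tmp, final)

def load_state(trainer_id):
    """Restore (step, params) after a job restart; start fresh if this
    trainer has no checkpoint yet."""
    final = os.path.join(CKPT_DIR, "trainer_%d.ckpt" % trainer_id)
    if not os.path.exists(final):
        return 0, {}
    with open(final, "rb") as f:
        state = pickle.load(f)
    return state["step"], state["params"]
```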
Please comment if you have questions or suggestions, thanks!