Created by: sandyhouse
PR types
New features
PR changes
Others
Describe
This PR implements pipeline_trainer and its device worker (section_worker) to support pipeline training. In pipeline training, a program is split into multiple sub-programs (sections), each of which runs on a device. The main purpose is to train large-scale models that cannot fit on a single device, or to take advantage of the different features of heterogeneous devices to make training more efficient.
Currently, you have to use device_guard to assign the device on which each sub-program (section) runs. Because the train_from_dataset interface is used, the whole dataset is consumed in each iteration.
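To illustrate the idea of sections running concurrently and handing data to each other, here is a minimal, framework-free sketch in plain Python. It is not the section_worker implementation from this PR: `run_section`, `run_pipeline`, and the `sections` list are hypothetical names, the device strings are only labels (nothing is actually placed on a device), and threads with queues stand in for whatever mechanism the real workers use.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the batch stream


def run_section(fn, inbox, outbox):
    """Apply one section's function to each item from inbox and forward it."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next section
            return
        outbox.put(fn(item))


def run_pipeline(sections, batches):
    """Run sections (a list of (device_label, fn) pairs) as a pipeline.

    Each section gets its own thread; section i reads from queue i and
    writes to queue i + 1, so different sections can work on different
    batches at the same time.
    """
    queues = [queue.Queue() for _ in range(len(sections) + 1)]
    workers = [
        threading.Thread(target=run_section, args=(fn, queues[i], queues[i + 1]))
        for i, (_device, fn) in enumerate(sections)
    ]
    for worker in workers:
        worker.start()

    # Feed the first section, then signal the end of the input.
    for batch in batches:
        queues[0].put(batch)
    queues[0].put(_STOP)

    # Drain the last section's output until the sentinel arrives.
    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)

    for worker in workers:
        worker.join()
    return results


# Hypothetical two-section program; the device labels are illustrative only.
sections = [
    ("cpu", lambda x: x + 1),
    ("gpu:0", lambda x: x * 2),
]
print(run_pipeline(sections, [1, 2, 3]))  # prints [4, 6, 8]
```

Because each section has a single worker and the queues are FIFO, results come out in input order. While the second section processes one batch, the first can already start on the next, which is where the throughput benefit of a pipeline comes from.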
Todo:
- use Executor.run instead of Executor.train_from_dataset.
- use auto pipeline with fleet instead of device_guard.