Support async checkpoint in Orbit trainer/controller.
This CL adds a field to the Orbit trainer/controller indicating whether async checkpointing is enabled for checkpoint saving. By default this value is set to False, which preserves the existing behavior. In addition, a sync barrier is added at the end of training (in the controller) to make sure user code won't prematurely access the checkpoint file/state while an async checkpoint save is still in progress.

PiperOrigin-RevId: 529300903
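
The interaction between an async save and the end-of-training sync barrier can be illustrated with a minimal, standard-library-only sketch. Everything here is hypothetical: `ToyAsyncCheckpointer`, `enable_async`, `save`, `sync`, and `train` are illustrative names, not Orbit's or TensorFlow's actual API, and the background thread merely simulates a slow checkpoint write.

```python
import threading
import time


class ToyAsyncCheckpointer:
    """Hypothetical stand-in for a checkpoint manager (not Orbit's API)."""

    def __init__(self, enable_async=False):
        # Mirrors the idea of a flag that defaults to False (sync saving).
        self.enable_async = enable_async
        self.saved_steps = []
        self._pending = None  # background writer thread, if any

    def _write(self, step):
        time.sleep(0.05)  # simulate slow checkpoint I/O
        self.saved_steps.append(step)

    def save(self, step):
        # Wait for any previous save so at most one write is in flight.
        self.sync()
        if self.enable_async:
            self._pending = threading.Thread(target=self._write, args=(step,))
            self._pending.start()  # returns immediately; training continues
        else:
            self._write(step)  # blocking save, the pre-existing behavior

    def sync(self):
        # Barrier: block until the in-flight save (if any) has finished.
        if self._pending is not None:
            self._pending.join()
            self._pending = None


def train(checkpointer, num_steps=3):
    """Toy training loop that checkpoints every step."""
    for step in range(1, num_steps + 1):
        checkpointer.save(step)
    # Sync barrier at the end of training, so callers never observe a
    # checkpoint that is still being written.
    checkpointer.sync()
    return checkpointer.saved_steps
```

Without the final `sync()` call, `train` could return in the async case while the last write is still running, which is the premature-access problem the barrier is meant to prevent.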