Some questions about trainer errors.
Created by: gongweibao
- Should a trainer exit when it encounters an error, for example a data read error?
- Should the whole job exit when a specific task repeatedly exits abnormally, more times than our limit allows? When this happens, it means the errors can't be fixed by starting new trainers.
- Do we need a `TaskFail` interface so that a trainer can report its failed task before exiting? Currently, the only way a trainer reports a task's error state is through a timeout, which is very slow.
- We should update the picture of the life cycle of a single task to add a failed task queue and so on.
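
To make the questions above concrete, here is a minimal sketch of how an explicit failure report and a failed task queue could fit together. It is only an illustration under assumptions: the class `TaskQueues`, the methods `task_failed` and `job_should_abort`, and the `failure_max` limit are hypothetical names, not the existing master API, and aborting the job once any task lands in the failed queue is just one possible policy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: int
    num_failures: int = 0


@dataclass
class TaskQueues:
    """Hypothetical master-side bookkeeping; every name here is illustrative."""
    failure_max: int = 3                          # assumed per-task failure limit
    todo: deque = field(default_factory=deque)
    pending: dict = field(default_factory=dict)   # task_id -> Task
    done: list = field(default_factory=list)
    failed: list = field(default_factory=list)    # the proposed failed task queue

    def add_task(self, task_id):
        self.todo.append(Task(task_id))

    def get_task(self):
        """Hand the next todo task to a trainer and mark it pending."""
        if not self.todo:
            return None
        task = self.todo.popleft()
        self.pending[task.task_id] = task
        return task

    def task_finished(self, task_id):
        self.done.append(self.pending.pop(task_id))

    def task_failed(self, task_id):
        """Explicit failure report, so the master need not wait for a timeout.

        Returns True if the task was re-queued for another attempt, False if
        it was moved to the failed queue or is not currently pending.
        """
        task = self.pending.pop(task_id, None)
        if task is None:
            return False
        task.num_failures += 1
        if task.num_failures >= self.failure_max:
            self.failed.append(task)
            return False
        self.todo.append(task)
        return True

    def job_should_abort(self):
        """One possible policy: stop the job once any task exceeds the limit."""
        return bool(self.failed)


# Usage: a task that fails twice with failure_max=2 ends up in the failed queue.
queues = TaskQueues(failure_max=2)
queues.add_task(0)
queues.task_failed(queues.get_task().task_id)   # first failure: re-queued
queues.task_failed(queues.get_task().task_id)   # second failure: moved to failed
```

With this shape, the timeout path could call the same `task_failed` logic, so a slow timeout and a fast explicit report would share one failure counter.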