Is checkpointing an urgent feature request?
Created by: reyoung
We cannot recover the training process from the model we save right now. The saved model can only be used for inference, because we save parameters only. In some optimization algorithms (such as AdaGrad, RMSProp, and Adam), the learning rate for each parameter is adaptive and is computed from the history of gradients, momentums, etc. To recover the training process, we should redefine our saving/loading methods and let what should be saved or loaded be defined at runtime. I think this functionality is critically needed when we add scalability and fault recovery to Paddle training.
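To illustrate the point, here is a minimal sketch (not Paddle's actual API; the `AdamState` class and method names are hypothetical) of an Adam-style optimizer: besides the parameters, it keeps per-parameter moment estimates and a step counter, and a checkpoint must persist all of them to resume training exactly.

```python
import pickle

class AdamState:
    """Toy Adam optimizer illustrating why parameters alone are not enough
    to resume training (hypothetical sketch, not Paddle's API)."""

    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params = list(params)
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * len(self.params)  # first-moment estimates
        self.v = [0.0] * len(self.params)  # second-moment estimates
        self.t = 0                          # global step counter

    def step(self, grads):
        self.t += 1
        for i, g in enumerate(grads):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            self.params[i] -= self.lr * m_hat / (v_hat ** 0.5 + self.eps)

    def save_checkpoint(self, path):
        # Saving only self.params (what we do today) would lose m, v and t;
        # a training checkpoint must persist the full optimizer state.
        with open(path, "wb") as f:
            pickle.dump({"params": self.params, "m": self.m,
                         "v": self.v, "t": self.t}, f)

    def load_checkpoint(self, path):
        with open(path, "rb") as f:
            state = pickle.load(f)
        self.params = state["params"]
        self.m, self.v, self.t = state["m"], state["v"], state["t"]
```

After `load_checkpoint`, continuing training produces exactly the same updates as if training had never stopped, which is impossible if only `params` were saved.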
Please vote here. A thumbs-up means this functionality should be added ASAP; a thumbs-down means it is not needed right now.
It would also be nice if you could leave a comment on why we should or should not do it.