Committed by Helin Wang

# Design Doc: Save Model

## Overview

The model is the output of the training process. There are two ways
for the user to obtain a model:

- Save model triggered by user code: user code asks PaddlePaddle to
  save a model.
- Convert model from the checkpoint: the model is converted from the
  pservers' periodic checkpoint. This way, the user can cancel a
  job at any time and still have a relatively fresh model (we
  checkpoint around every 5 minutes).

### Trainer Saving Model vs. Pservers Saving Model

Both trainers and pservers have access to the model, so the model can
be saved from either a trainer or the pservers. We need to decide where
the model is saved from.

#### Dense Update vs. Sparse Update

There are two types of model update methods: dense update and sparse
update (when the model parameter is configured to be sparse).

- Dense update

  Every trainer has its own full copy of the model. Every model
  update updates the entire model.

- Sparse update

  The training input is sparse, and the trainer does not have the
  entire model. It only downloads the sub-model related to the
  input. When updating the model, only the sub-model related to
  the training input is updated.
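
The two update modes can be contrasted with a toy sketch. It is only an illustration under simplifying assumptions: the dict-based model and the names `pull_submodel`, `dense_update`, and `sparse_update` are hypothetical and are not PaddlePaddle APIs.

```python
# Toy model: parameter id -> value. Hypothetical, not PaddlePaddle's layout.

def pull_submodel(full_model, input_ids):
    """Sparse case: fetch only the parameters that the input refers to."""
    return {i: full_model[i] for i in sorted(set(input_ids))}

def dense_update(model, grads, lr=0.5):
    """Dense case: every parameter of the model is updated."""
    return {k: v - lr * grads[k] for k, v in model.items()}

def sparse_update(full_model, submodel_grads, lr=0.5):
    """Sparse case: only parameters in the sub-model are updated."""
    updated = dict(full_model)
    for k, g in submodel_grads.items():
        updated[k] -= lr * g
    return updated

full_model = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}

# Dense: the trainer holds the full model and updates all of it.
dense_result = dense_update(full_model, {k: 1.0 for k in full_model})

# Sparse: the input only touches ids 1 and 3, so only those rows change.
submodel = pull_submodel(full_model, input_ids=[1, 3, 3])
sparse_result = sparse_update(full_model, {k: 1.0 for k in submodel})
```

In the sparse case the trainer never holds ids 0 and 2, which is why saving from a trainer later requires downloading the rest of the model.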


#### Pservers Saving Model

The benefit of letting the pservers save the model is that they have
the entire model at all times. However, since the pservers are on
different nodes, a merging process is required to merge the model
shards into a single model. This requires the pservers to write their
shards to a distributed filesystem, making the checkpoint shards
visible to the merge program.
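
A minimal sketch of such a merge step is shown below. It rests on assumptions not stated in this document: each pserver writes its shard as a JSON file, a local temporary directory stands in for the distributed filesystem, and `write_shard` and `merge_shards` are hypothetical helper names.

```python
import json
import os
import tempfile

def write_shard(directory, pserver_index, shard):
    """Hypothetical: one pserver writes its shard to the shared directory."""
    path = os.path.join(directory, "shard-%d.json" % pserver_index)
    with open(path, "w") as f:
        json.dump(shard, f)
    return path

def merge_shards(directory):
    """Merge all shard files in the directory into a single model dict."""
    merged = {}
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as f:
            shard = json.load(f)
        # Each parameter is assumed to live on exactly one pserver.
        duplicated = merged.keys() & shard.keys()
        if duplicated:
            raise ValueError("parameters in more than one shard: %s"
                             % sorted(duplicated))
        merged.update(shard)
    return merged

# A temporary directory stands in for the distributed filesystem.
shared_dir = tempfile.mkdtemp()
write_shard(shared_dir, 0, {"fc1.w": [0.1, 0.2]})
write_shard(shared_dir, 1, {"fc2.w": [0.3]})
model = merge_shards(shared_dir)
```

The merge program can only see every shard because all pservers write to the same shared location, which is the dependency the trainer-saving approach avoids.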

#### Trainer Saving Model

The benefit of letting one trainer save the model is that it does not
require a distributed filesystem, and it reuses the same save-model
logic as local training. The exception is sparse update, where the
trainer needs to download the entire model during the saving process.

#### Conclusion

Given that the trainer saving the model does not require a distributed
filesystem and is an intuitive extension of how the trainer saves the
model when training locally, we decide to let the trainer save the
model when doing distributed training.


### Convert Model from Checkpoint

TODO


## Timeline

We will first implement the trainer saving the model. Converting the
latest snapshot to a model is left as a future TODO.


## Trainer Save Model

### Trainer Election

One trainer will be elected as the one to save the model. When using
etcd, the trainer ID is a randomly generated UUID; the trainer
contacts the master server requesting to save the model and finds out
whether it is elected. When the master server is not used, unique
trainer IDs are given by the administrator, and the trainer whose ID
is "0" is elected to save the model.
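
The election rule can be sketched as follows. `MasterStub` is a hypothetical in-memory stand-in for the master server, and its first-come-first-elected policy is an assumption made for illustration; this document does not specify how the master picks the trainer.

```python
import uuid

class MasterStub:
    """Hypothetical stand-in for the master server (not a real API)."""

    def __init__(self):
        self._elected = None

    def request_save_model(self, trainer_id):
        # Assumed policy: the first trainer to ask is elected.
        if self._elected is None:
            self._elected = trainer_id
        return self._elected == trainer_id

def is_elected(trainer_id, master=None):
    """Return True if this trainer should save the model."""
    if master is not None:
        # With etcd and a master server: ask the master.
        return master.request_save_model(trainer_id)
    # Without a master server: the administrator-assigned ID "0" wins.
    return trainer_id == "0"

master = MasterStub()
trainer_a = str(uuid.uuid4())
trainer_b = str(uuid.uuid4())
a_elected = is_elected(trainer_a, master)
b_elected = is_elected(trainer_b, master)
```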

### Model Save Path

Each trainer will be given the directory to save the model. The
elected trainer will save the model to
`given-directory/trainerID`. Since the trainer ID is unique, this
prevents concurrent saves to the same file if multiple trainers
are elected to save the model when a split-brain problem happens.
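
As a small illustration of why the path scheme is safe under split-brain, the sketch below has two trainers that both believe they are elected. The helper name `model_save_path` is hypothetical, and treating `given-directory/trainerID` as a single file (rather than a subdirectory) is an assumption.

```python
import os
import tempfile
import uuid

def model_save_path(given_directory, trainer_id):
    """Build the save path `given-directory/trainerID`."""
    return os.path.join(given_directory, trainer_id)

given_directory = tempfile.mkdtemp()

# Split-brain: two trainers both think they were elected.
trainer_ids = [str(uuid.uuid4()), str(uuid.uuid4())]
paths = [model_save_path(given_directory, t) for t in trainer_ids]

for path in paths:
    with open(path, "w") as f:
        f.write("model bytes")  # placeholder for the serialized model
```

Both saves succeed and neither overwrites the other, at the cost of leaving two model files that a later step would need to choose between.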

### What Happens When Model Is Saving

Saving the model takes some time, so we need to define what happens
while the save is taking place.

When doing dense update, the trainer uses the local model. Pservers
do not need to pause model updates.

When doing sparse update, the trainer needs to download the entire
model while saving. To get the most accurate model, the model update
needs to be paused before the download starts and resumed after the
download finishes. Otherwise, the trainer gets a model that is
"polluted": some parts of the model are old and some parts are new.
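
The "polluted" case can be reproduced with a toy simulation. Here each parameter value is simply the number of updates applied to it, shards are downloaded one at a time, and an update lands between shard downloads unless updates are paused. All names are hypothetical and the model is deliberately simplified.

```python
def apply_update(shards):
    """One model update: bump the version of every parameter."""
    for shard in shards:
        for name in shard:
            shard[name] += 1

def download_model(shards, paused):
    """Download shards one by one; updates may land in between."""
    snapshot = {}
    for shard in shards:
        snapshot.update(shard)
        if not paused:
            apply_update(shards)  # a concurrent update arrives mid-download
    return snapshot

def make_shards():
    # Two pserver shards, every parameter starts at version 0.
    return [{"a": 0}, {"b": 0}]

polluted = download_model(make_shards(), paused=False)
consistent = download_model(make_shards(), paused=True)
```

Without pausing, `a` is captured before the update and `b` after it, so the snapshot mixes two versions of the model; with pausing, both come from the same version.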

It is unclear whether the "polluted" model will be inferior, due to
the stochastic nature of deep learning, and pausing the model update
will add more complexity to the system. Since supporting sparse update
is a TODO item, we defer the decision on whether to pause the model
update while saving the model to the future.