# Checkpoint User Guide

## Background
In many cases, both stand-alone training and distributed training can be aborted by software or hardware problems. More seriously, all the time and machine resources spent are wasted, which is frustrating, and training has to be restarted from the beginning.

## Purpose
The ```Checkpoint``` feature saves intermediate model variables, lookup table variables, and other needed data in the checkpoint directory. When an exception occurs, these variables can be loaded back from the checkpoint directory immediately.
## Introduction
### Currently Completed Features:
1. Trainer 0 saves the model variables during training.
2. Each Trainer saves the arguments it needs.
3. Each Parameter Server saves its ```Distribute Lookup Table``` variables during training.
### Fluid Checkpoint directory structure:

```
checkpoint_dir (the checkpoint directory the user defines)
├── checkpoint_0 (the first save directory)
│   ├── __lookup_table__ (Distribute Lookup Table directory)
│   │   ├── table_pserver_0 (Lookup table data saved by Pserver 0)
│   │   └── table_pserver_1
│   ├── __model__ (model directory)
│   │   └── var.w_1
│   └── trainer_0 (each trainer saves its own data)
│       ├── epoch_id
│       └── step_id
└── checkpoint_1 (the second save directory)
```
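
For reference, here is a minimal sketch of how the newest save directory can be located under this layout (```latest_checkpoint``` is a hypothetical helper for illustration, not part of Fluid's API):

```python
import os

def latest_checkpoint(checkpoint_dir):
    """Return the path of the newest checkpoint_<serial> directory, or None."""
    serials = []
    for name in os.listdir(checkpoint_dir):
        # save directories follow the checkpoint_<serial> naming shown above
        prefix, _, serial = name.partition("_")
        if prefix == "checkpoint" and serial.isdigit():
            serials.append(int(serial))
    if not serials:
        return None
    return os.path.join(checkpoint_dir, "checkpoint_%d" % max(serials))
```

For the layout above, this helper would return the path of ```checkpoint_1```.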

## Usage
### Construct Fluid.CheckpointConfig
To use the ```Checkpoint``` feature, the main thing the user has to do is declare a ```CheckpointConfig``` and construct it.

```CheckpointConfig``` has 4 member variables that need to be initialized:

| Member Variable | Type | Comment |
| - | :-: | - |
| checkpoint_dir | str | checkpoint directory path |
| max_num_checkpoints | int | maximum number of checkpoint copies to keep |
| epoch_interval | int | save a checkpoint every epoch_interval epochs |
| step_interval | int | save a checkpoint every step_interval steps |

### Add Fluid.CheckpointConfig's declaration in Fluid.Trainer
Because the initialization of ```Trainer``` needs an instance of ```CheckpointConfig```, we should declare the ```CheckpointConfig``` in ```Fluid``` first.

For example:
```python
import paddle.fluid as fluid

config = fluid.CheckpointConfig(
    checkpoint_dir="/tmp/ckpt",   # where the checkpoint copies are written
    max_num_checkpoints=2,        # keep at most 2 copies on disk
    epoch_interval=2,             # save every 2 epochs
    step_interval=10)             # save every 10 steps
trainer = fluid.Trainer(..., checkpoint_config=config)
```

Once this is done, training will save a checkpoint at the specified epoch and step intervals. If the training job is aborted, the user can simply restart it, and training will restore from the latest copy.
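
For instance, restarting is just re-running the same construction; because ```checkpoint_config``` points at the same ```checkpoint_dir```, the new ```Trainer``` picks up the latest checkpoint before continuing. A minimal sketch, assuming the config from the example above (```train_program```, ```optimizer_func```, ```event_handler```, and ```train_reader``` are hypothetical placeholders for the user's own code):

```python
# Restarting after an abort: nothing checkpoint-specific changes here.
# The Trainer finds the latest copy in checkpoint_dir and restores from it
# before training continues.
trainer = fluid.Trainer(
    train_func=train_program,       # hypothetical user-defined program builder
    optimizer_func=optimizer_func,  # hypothetical user-defined optimizer builder
    checkpoint_config=config)
trainer.train(
    num_epochs=100,
    event_handler=event_handler,    # hypothetical event handler
    reader=train_reader,            # hypothetical data reader
    feed_order=["img", "label"])    # hypothetical feed order
```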

## Related API
[Related Trainer API](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py)

## Attention
1. Make sure each ```checkpoint_dir``` is used by only one training job.
2. ```max_num_checkpoints``` needs to be adjusted according to the disk size and model size.
3. Saving checkpoints too frequently slows down training, so ```epoch_interval``` and ```step_interval``` values that are too small are not suitable.
4. **In distributed training**, each Trainer saves its arguments in its own ```checkpoint_dir``` (only Trainer 0 saves the model variables). A **distributed file system (HDFS, etc.)** is needed to merge all the ```checkpoint_dir``` contents to get the whole data, as sketched after this list.
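
A minimal merge sketch, assuming every node has already copied its ```checkpoint_dir``` to a shared location (the ```/shared/job_ckpt``` layout and per-node subdirectory names are assumptions for illustration, not part of Fluid):

```python
import os
import shutil

# Hypothetical illustration: each node has uploaded its checkpoint_dir to
# /shared/job_ckpt/<node_name>/ on a shared file system (e.g. an HDFS mount).
shared_root = "/shared/job_ckpt"
merged_dir = os.path.join(shared_root, "merged")

for node in sorted(os.listdir(shared_root)):
    node_dir = os.path.join(shared_root, node)
    if node == "merged" or not os.path.isdir(node_dir):
        continue
    # copy every file into one combined tree, keeping relative paths so the
    # directory structure shown earlier is preserved
    for dirpath, _, filenames in os.walk(node_dir):
        rel = os.path.relpath(dirpath, node_dir)
        dest = os.path.join(merged_dir, rel)
        os.makedirs(dest, exist_ok=True)
        for fname in filenames:
            shutil.copy2(os.path.join(dirpath, fname), os.path.join(dest, fname))
```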