Commit 0b4ab9c5, authored by tangwei12

Improve documentation format and content

Parent 238348ef
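# Checkpoint User Guide
## Background
Single-machine and multi-machine training can be interrupted by software or hardware failures. An interrupted job may produce no results or unusable results, wasting a great deal of time and machine resources.
......@@ -8,8 +9,10 @@ The Checkpoint feature can save intermediate training data partway through training,
## Description
### Parameter saving currently implemented:
1. Parameters are saved during training by Trainer 0
2. Parameters related to Distribute Lookup Table are saved by the PServer
2. Parameters related to ```Distribute Lookup Table``` are saved by the PServer
### Fluid Checkpoint directory structure for saved data: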
```
checkpoint_dir (user-defined checkpoint directory)
├── checkpoint_0 (first save)
│ ├── __lockup_table__ (Distribute Lookup Table directory)
......@@ -21,20 +24,23 @@ checkpoint_dir (user-defined checkpoint directory)
│ ├── epoch_id
│ └── step_id
└── checkpoint_1 (second save)
```
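The layout above suggests that the newest checkpoint is the `checkpoint_<N>` subdirectory with the largest serial number. A minimal sketch of locating it, assuming that naming scheme holds; `latest_checkpoint` is a hypothetical helper, not part of the Fluid API:

```python
import os
import re

# Matches directory names such as "checkpoint_0" (naming taken from the layout above).
_CHECKPOINT_NAME = re.compile(r"^checkpoint_(\d+)$")


def latest_checkpoint(checkpoint_dir):
    """Return the path of the highest-numbered checkpoint_<N> directory, or None."""
    best_serial, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = _CHECKPOINT_NAME.match(name)
        path = os.path.join(checkpoint_dir, name)
        # Compare serials numerically so that checkpoint_10 beats checkpoint_9.
        if match and os.path.isdir(path) and int(match.group(1)) > best_serial:
            best_serial, best_path = int(match.group(1)), path
    return best_path
```

Comparing the serial numbers as integers rather than sorting the names as strings avoids ranking `checkpoint_9` above `checkpoint_10`.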
## Usage
### Declare Fluid.CheckpointConfig
Users configure the checkpoint feature mainly through the Fluid.CheckpointConfig object.
CheckpointConfig takes 4 parameters:
```table
Parameter | Type | Description
checkpoint_dir | str | directory where checkpoints are stored
max_num_checkpoints | int | maximum number of checkpoint copies to keep
epoch_interval | int | save every epoch_interval epochs
step_interval | int | save every step_interval steps
```
Users configure the checkpoint feature mainly through the ```CheckpointConfig``` object in ```Fluid```.
```CheckpointConfig``` takes 4 parameters:
| Parameter | Type | Description |
| - | :-: | - |
| checkpoint_dir | str | directory where checkpoints are stored |
| max_num_checkpoints | int | maximum number of checkpoint copies to keep |
| epoch_interval | int | save every epoch_interval epochs |
| step_interval | int | save every step_interval steps |
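To make the four parameters concrete, the following is a minimal stand-in that mirrors them. It is a hypothetical illustration, not Paddle's actual `CheckpointConfig` class; the default values and validation rules shown are assumptions:

```python
from dataclasses import dataclass


@dataclass
class CheckpointConfigSketch:
    """Hypothetical stand-in mirroring the four documented parameters."""

    checkpoint_dir: str           # path of the directory where checkpoints are stored
    max_num_checkpoints: int = 3  # assumed default: keep at most 3 copies
    epoch_interval: int = 1       # assumed default: consider saving every epoch
    step_interval: int = 10       # assumed default: consider saving every 10 steps

    def __post_init__(self):
        # Reject values that would make the save schedule meaningless.
        if not self.checkpoint_dir:
            raise ValueError("checkpoint_dir must be a non-empty path")
        for name in ("max_num_checkpoints", "epoch_interval", "step_interval"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")


config = CheckpointConfigSketch(checkpoint_dir="/tmp/ckpt", epoch_interval=2)
```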
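### Pass Fluid.CheckpointConfig when declaring the Fluid.Trainer object
The __init__ method of Trainer accepts a CheckpointConfig argument; pass in a CheckpointConfig object declared before the Trainer.
The __init__ method of Trainer accepts a ```CheckpointConfig``` argument; pass in a ```CheckpointConfig``` object declared before the Trainer.
For example: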
```python
config = CheckpointConfig(
......@@ -45,12 +51,10 @@ trainer = Trainer(..., checkpoint_config=config)
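Once the configuration is defined and declared, training saves checkpoints at the specified step and epoch intervals. If an exception occurs, parameters are automatically restored from the latest checkpoint directory.
## Related API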
https://github.com/PaddlePaddle/Paddle/blob/3ff9ba0e6ba1eec282b6e89fb7bea2e2046f01c5/python/paddle/fluid/trainer.py#L97
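Trainer API reference: <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py>
## Notes
1. Make sure each training job's ```checkpoint_dir``` is separate from those of other training jobs.
2. Adjust the maximum number of copies, max_num_checkpoints, according to disk capacity and model size so that the disk stays usable.
3. epoch_interval and step_interval should not be too small; checkpointing too frequently slows down training.
4. In **distributed training**: each Trainer saves its own parameters under the checkpoint_dir directory (only Trainer 0 saves the model parameters). A **distributed file system (HDFS, etc.)** is needed to merge the data from the same checkpoint_dir directories into a complete data set, and training must be restored from that complete data.
## Future plans
1. Support saving parameters through etcd.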
\ No newline at end of file
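2. Adjust the maximum number of copies, ```max_num_checkpoints```, according to disk capacity and model size so that the disk stays usable.
3. ```epoch_interval``` and ```step_interval``` should not be too small; checkpointing too frequently slows down training.
4. In **distributed training**: each Trainer saves its own parameters under the ```checkpoint_dir``` directory (only Trainer 0 saves the model parameters). A **distributed file system (HDFS, etc.)** is needed to merge the data from the same ```checkpoint_dir``` directories into a complete data set, and training must be restored from that complete data.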
\ No newline at end of file
# Checkpoint User Guide
## Background
Stand-alone and distributed training jobs can be aborted by software or hardware problems. Worse, the time and machine resources already spent are wasted: the job produces nothing usable and has to be restarted from scratch.
## Purpose
The Checkpoint feature saves intermediate model variables, lookup table variables, and other required data in a checkpoint directory. When an exception occurs, these variables can be loaded from the checkpoint directory immediately.
The ```Checkpoint``` feature saves intermediate model variables, lookup table variables, and other required data in a checkpoint directory. When an exception occurs, these variables can be loaded from the checkpoint directory immediately.
## Introduction
### Features Completed So Far:
1. Trainer 0 saves the model variables during training.
2. Each Trainer saves the arguments it needs.
3. Each Parameter Server saves the Distribute Lookup Table variables during training.
3. Each Parameter Server saves the ```Distribute Lookup Table``` variables during training.
### Fluid Checkpoint directory structure:
```
checkpoint_dir (the user-defined checkpoint directory)
├── checkpoint_0 (the first save directory)
│ ├── __lockup_table__ (Distribute Lookup Table directory)
......@@ -21,21 +24,23 @@ checkpoint_dir (the user-defined checkpoint directory)
│ ├── epoch_id
│ └── step_id
└── checkpoint_1 (the second save directory)
```
## Usage
### Constructing Fluid.CheckpointConfig
To use the Checkpoint feature, the main thing a user has to do is declare and construct a Fluid.CheckpointConfig.
CheckpointConfig has 4 member variables that need to be initialized:
```table
Member Variable | Type | Comment
checkpoint_dir | str | checkpoint directory
max_num_checkpoints | int | maximum number of checkpoint copies to keep
epoch_interval | int | save every epoch_interval epochs
step_interval | int | save every step_interval steps
```
To use the ```Checkpoint``` feature, the main thing a user has to do is declare and construct a ```CheckpointConfig```.
```CheckpointConfig``` has 4 member variables that need to be initialized:
| Member Variable | Type | Comment |
| - | :-: | - |
| checkpoint_dir | str | checkpoint directory |
| max_num_checkpoints | int | maximum number of checkpoint copies to keep |
| epoch_interval | int | save every epoch_interval epochs |
| step_interval | int | save every step_interval steps |
### Add Fluid.CheckpointConfig's declaration in Fluid.Trainer
Because initializing Trainer requires an instance of CheckpointConfig, declare Fluid.CheckpointConfig first.
Because initializing Trainer requires an instance of ```CheckpointConfig```, declare the ```CheckpointConfig``` in ```Fluid``` first.
For example:
```python
......@@ -48,12 +53,10 @@ trainer = Trainer(..., checkpoint_config=config)
Once this is done, training saves a checkpoint at the specified epoch and step intervals. If training is aborted, the user can restart it and it will restore from the latest copy.
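The save behaviour described above can be pictured with a small sketch. It assumes a checkpoint is written when both the epoch index and the step index are multiples of their intervals, and that the oldest copies are dropped once more than `max_num_checkpoints` exist. Both rules are assumptions for illustration, and `should_save` and `serials_to_delete` are hypothetical helpers rather than Trainer methods:

```python
def should_save(epoch_id, step_id, epoch_interval, step_interval):
    """Return True when both counters fall on their configured intervals."""
    return epoch_id % epoch_interval == 0 and step_id % step_interval == 0


def serials_to_delete(existing_serials, max_num_checkpoints):
    """Return the oldest serial numbers that exceed the retention limit."""
    ordered = sorted(existing_serials)
    excess = len(ordered) - max_num_checkpoints
    return ordered[:excess] if excess > 0 else []


# With epoch_interval=1 and step_interval=10, steps 0, 10 and 20 of epoch 0 qualify.
saved_steps = [s for s in range(25) if should_save(0, s, 1, 10)]
```

Under these assumptions, a smaller `step_interval` increases the number of saves per epoch proportionally, which is why very small intervals slow training down.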
## Related API
https://github.com/PaddlePaddle/Paddle/blob/3ff9ba0e6ba1eec282b6e89fb7bea2e2046f01c5/python/paddle/fluid/trainer.py#L97
Related Trainer API: <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py>
## Attention
1. Make sure each ```checkpoint_dir``` is used by only one training job.
2. Adjust max_num_checkpoints according to the disk size and the model size.
3. Checkpointing too frequently slows down training, so epoch_interval and step_interval should not be too small.
2. Adjust ```max_num_checkpoints``` according to the disk size and the model size.
3. Checkpointing too frequently slows down training, so ```epoch_interval``` and ```step_interval``` should not be too small.
4. **In distributed training**, each Trainer saves its arguments in its ```checkpoint_dir``` (only Trainer 0 saves the model variables). A **distributed file system (HDFS, etc.)** is needed to merge all the ```checkpoint_dir``` contents into the complete data.
\ No newline at end of file
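For the note on `max_num_checkpoints`, a rough check is to divide the free disk space by the size of one checkpoint. A minimal sketch, assuming one extra copy of headroom is reserved for a checkpoint that is still being written; `max_copies_for_disk` is a hypothetical helper and the sizes are made-up example values:

```python
def max_copies_for_disk(free_bytes, checkpoint_bytes, headroom_copies=1):
    """Largest max_num_checkpoints that fits, leaving room for in-flight saves."""
    if checkpoint_bytes <= 0:
        raise ValueError("checkpoint_bytes must be positive")
    return max(free_bytes // checkpoint_bytes - headroom_copies, 0)


GiB = 1024 ** 3
# Example values only: 50 GiB free, 6 GiB per checkpoint.
copies = max_copies_for_disk(free_bytes=50 * GiB, checkpoint_bytes=6 * GiB)
```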
## Future Plans
1. Save and restore checkpoints through etcd.
\ No newline at end of file