未验证 提交 e83040dd 编写于 作者: Y yuyang18

Merge branch 'develop' of https://github.com/PaddlePaddle/FluidDoc into pr/13

# Checkpoint功能使用指南
## 背景
单机/多机在训练过程中会由于软件/硬件的问题出现异常,导致训练中断,进而导致训练无结果或结果不可用,浪费大量时间和机器性能。
## 目的
Checkpoint功能能够在训练中途对训练数据中间数据进行保存,出现异常恢复训练的时候能够加载中途保存的数据继续训练, 实现单机/多机的容错训练的功能。
## 说明
### 目前已实现的参数保存:
1. 基于Trainer 0 实现训练过程中的参数保存
2. 基于PServer 实现了```Distribute Lookup Table```相关参数保存
### Fluid Checkpoint 保存数据目录结构:
```
checkpoint_dir (用户定义的checkpoint目录)
├── checkpoint_0 (第一次保存)
│ ├── __lockup_table__ (Distribute Lookup Table 目录)
│ │ ├── table_pserver_0 (Pserver 0 号保存的lookup table 数据)
│ │ └── table_pserver_1
│ ├── __model__ (model 目录)
│ │ └── var.w_1
│ └── trainer_0 (trainer 自有数据保存)
│ ├── epoch_id
│ └── step_id
└── checkpoint_1 (第二次保存)
```
## 使用方法
### 声明Fluid.CheckpointConfig
用户对checkpoint功能的配置,主要是配置对象```Fluid```中的```CheckpointConfig```.
```CheckpointConfig``` 包括4个参数:
| 参数 | 类型 | 说明 |
| - | :-: | - |
| checkpoint_dir | int| checkpoint存储目录 |
| max_num_checkpoints | int | 最大保存的checkpoint副本数 |
| epoch_interval | int | 每隔epoch_interval轮epoch |
| step_interval | int | 每隔step_interval轮step |
### 在Fluid.Trainer对象的声明中加入Fluid.CheckpointConfig的声明
Trainer的__init__方法的参数中包含了对```CheckpointConfig```, 需要传入在声明Trainer前声明的```CheckpointConfig```对象。
如:
```python
config = CheckpointConfig(
checkpoint_dir = "/tmp/ckpt", max_num_checkpoints = 2,
epoch_interval = 2, step_interval = 10)
trainer = Trainer(..., checkpoint_config=config)
```
定义和声明完成后, 训练在运行过程中就会在指定的step和epoch处进行保存,出现异常时,就会自动从最新的checkpoint目录进行参数恢复啦!
## 相关API
[Trainer API 说明](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py)
## 注意
1. 保证每个训练的```checkpoint_dir``` 与其他训练独立。
2. 最大副本数量```max_num_checkpoints```需要根据磁盘容量以及模型的大小进行调整, 保证磁盘的可用性。
3. ```epoch_interval``` 和 ```step_interval``` 不宜过小, 频繁的进行checkpoint会拖慢训练速度。
4. **分布式训练**的过程中:每个Trainer都会在```checkpoint_dir```目录中保存当前Trainer的参数(只有Trainer 0会保存模型的参数),需要**分布式文件系统(HDFS等)**将同```checkpoint_dir```目录的数据进行合并才能得到完整的数据,恢复训练的时候需要用完整的数据进行恢复。
\ No newline at end of file
# Checkpoint User Guide
## Background
In many cases, Stand-alone training and Distributed training can be aborted by the software problem or hardware problem. More seriously, we waste so much time and the performance of the machine but get nothing, which makes us frustrating and we have to restart it again.
## Purpose
The feature of ```Checkpoint``` can save Intermediate model variables, lookup table variable, and other needs data in checkpoint directory. When the exception occurs, we can load these variables from the checkpoint directory immediately.
## Introduce
### Complete Features Currently:
1. The Trainer 0 will save model variables in training.
2. Each of the Trainer will save its own arguments needed.
3. Each of the Parameter Server will save ```Distribute Lookup Table``` variables in training.
### Fluid Checkpoint directory structure:
```
checkpoint_dir (the checkpoint directory user define)
├── checkpoint_0 (the first save directory)
│ ├── __lockup_table__ (Distribute Lookup Table directory)
│ │ ├── table_pserver_0 (Lookup table's data about Pserver 0)
│ │ └── table_pserver_1
│ ├── __model__ (model directory)
│ │ └── var.w_1
│ └── trainer_0 (each trainer will save its own data)
│ ├── epoch_id
│ └── step_id
└── checkpoint_1 (the second save directory)
```
## usage
### Fluid.CheckpointConfig construct
When the user wants to use ```Checkpoint``` feature, the main thing user have to do is declare ```CheckpointConfig``` and construct it.
```CheckpointConfig``` has 4 member variables need to be initialized:
| Member Variable | Type | Comment |
| - | :-: | - |
| checkpoint_dir | int| checkpoint directory |
| max_num_checkpoints | int | Maximum number of checkpoint copies |
| epoch_interval | int | epoch interval times |
| step_interval | int | step interval times |
### Add Fluid.CheckpointConfig's declaration in Fluid.Trainer
Because the initialization of Trainer needs an instance of ```CheckpointConfig```., we should declare ```CheckpointConfig``` in ```Fluid``` first.
For example:
```python
config = CheckpointConfig(
checkpoint_dir = "/tmp/ckpt", max_num_checkpoints = 2,
epoch_interval = 2, step_interval = 10)
trainer = Trainer(..., checkpoint_config=config)
```
After all the things done, the train will save checkpoint at the specified epoch and step, when the train is aborted, the user can restart it, the train will restore from the latest copy.
## Related API
[Related Trainer API](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py)
## Attention
1. Make the ```checkpoint_dir``` only be used by one train job.
2. The number of ```max_num_checkpoints``` need to be adjusted by the disk size and model size.
3. Too frequently to slow down the train speed, so too ```small epoch_interval``` and ```step_interval``` are not suitable.
4. **In distributed train**, each Trainer will save arguments in its ```checkpoint_dir``` (Only Trainer 0 will save model variables). We need **distributed file system (HDFS, etc)** to merge all the ```checkpoint_dir``` to get the whole data.
\ No newline at end of file
......@@ -127,10 +127,10 @@
- :code:`PADDLE_TRAINING_ROLE=PSERVER PADDLE_CURRENT_IP=ps1.paddlepaddle.com PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_PSERVER_PORT=6174 python fluid_dist.py`
- 启动 pserver 节点
* - trainer0.paddlepaddle.com
- :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_CURRENT_IP=ps0.paddlepaddle.com PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=0 PADDLE_PSERVER_PORT=6174 python fluid_dist.py`
- :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=0 PADDLE_PSERVER_PORT=6174 python fluid_dist.py`
- 启动第0号 trainer 节点
* - trainer1.paddlepaddle.com
- :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_CURRENT_IP=ps0.paddlepaddle.com PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=1 PADDLE_PSERVER_PORT=6174 python fluid_dist.py`
- :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=1 PADDLE_PSERVER_PORT=6174 python fluid_dist.py`
- 启动第1号 trainer 节点
**注意**
......
......@@ -9,3 +9,4 @@ PaddlePaddle Fluid支持单机训练,和多节点训练。每种训练模式
single_node
multi_node
save_load_variables
......@@ -162,3 +162,11 @@
完全一致,否则会导致变量加载错误或者未加载。另外,与 :code:`fluid.io.save_params` 类似,
运行 :code:`fluid.default_startup_program()` 也必须在 :code:`fluid.io.load_checkpoint`
之前进行。
多机checkpoint保存
##################
.. toctree::
:maxdepth: 2
checkpoint_doc_cn.md
\ No newline at end of file
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册