Commit db5cfcb6 authored by tangwei12

Optimize documentation format and content

Parent 0b4ab9c5
Once the definition and declaration are done, training will save checkpoints at the specified steps and epochs while it runs; when an exception occurs, the parameters are automatically restored from the latest checkpoint directory!
## Related API
[Trainer API documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py)
## Notes
1. Ensure that each training job's ```checkpoint_dir``` is independent of those of other jobs.
# Checkpoint User Guide
## Background
In many cases, stand-alone training and distributed training can be aborted by software or hardware problems. Worse, we waste a great deal of time and machine resources and get nothing, which is frustrating and forces us to start over.
## Purpose
The ```Checkpoint``` feature can save intermediate model variables, the lookup table variable, and other needed data into a checkpoint directory. When an exception occurs, we can immediately load these variables back from the checkpoint directory.
## Introduction
### Currently Completed Features:
1. Trainer 0 will save model variables during training.
2. Each Trainer will save its own needed arguments.
3. Each Parameter Server will save ```Distribute Lookup Table``` variables during training.
### Fluid Checkpoint directory structure:
```
checkpoint_dir (the checkpoint directory the user defines)
...
```
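Only the top of the directory tree survives above, but the guide's semantics (one directory per saved copy, at most ```max_num_checkpoints``` copies kept) suggest serially numbered subdirectories. A minimal sketch of locating the newest copy under that assumption; the ```checkpoint_<serial>``` naming here is hypothetical, for illustration only, not Fluid's actual scheme:

```python
import os
import re

def latest_checkpoint(checkpoint_dir):
    """Return the path of the newest checkpoint copy, or None if none exist.

    Assumes each saved copy lives in a subdirectory named like
    'checkpoint_<serial>' -- a hypothetical naming scheme used only
    for illustration.
    """
    pattern = re.compile(r"^checkpoint_(\d+)$")
    serials = []
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(checkpoint_dir, name)):
            serials.append(int(match.group(1)))
    if not serials:
        return None
    return os.path.join(checkpoint_dir, "checkpoint_%d" % max(serials))
```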
## Usage
### Fluid.CheckpointConfig construction
When the user wants to use the ```Checkpoint``` feature, the main thing to do is to declare a ```CheckpointConfig``` and construct it.
```CheckpointConfig``` has 4 member variables that need to be initialized:
| Member Variable | Type | Comment |
| - | - | - |
| ... | ... | ... |
| step_interval | int | step interval times |
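The two interval members gate how often a copy is written. The sketch below only illustrates that gating in plain Python under the assumption that a copy is written when both counters land on their intervals; the real Fluid scheduling may differ:

```python
def should_save(epoch_id, step_id, epoch_interval, step_interval):
    """Illustrative gating only: assume a checkpoint copy is written when
    both the epoch counter and the step counter hit their configured
    intervals. This mirrors the table's semantics, not Fluid's code."""
    return epoch_id % epoch_interval == 0 and step_id % step_interval == 0

# With epoch_interval=2 and step_interval=10:
print(should_save(0, 10, 2, 10))  # True: epoch 0 and step 10 both qualify
print(should_save(1, 10, 2, 10))  # False: epoch 1 is off the epoch interval
```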
### Add the Fluid.CheckpointConfig declaration to Fluid.Trainer
Because the initialization of ```Trainer``` needs an instance of ```CheckpointConfig```, we should declare ```CheckpointConfig``` in ```Fluid``` first.
For example:
```python
config = CheckpointConfig(...)  # constructor arguments elided
trainer = Trainer(..., checkpoint_config=config)
```
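The constructor arguments are elided above. A fuller sketch, assuming the constructor takes the four member variables from the table as keyword arguments; the import path, directory path, and interval values here are assumptions for illustration:

```python
# Import path may vary by Fluid version; Trainer and CheckpointConfig are
# defined in the trainer.py linked under "Related API" below.
from paddle.fluid import CheckpointConfig, Trainer

# Hypothetical example values; checkpoint_dir, max_num_checkpoints,
# epoch_interval, and step_interval are the four documented members.
config = CheckpointConfig(
    checkpoint_dir="/tmp/ckpt",  # made-up path
    max_num_checkpoints=2,       # keep at most 2 copies on disk
    epoch_interval=1,            # save every epoch
    step_interval=10)            # and every 10 steps

trainer = Trainer(..., checkpoint_config=config)  # other Trainer args elided, as in the guide
```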
After all of this is done, training will save a checkpoint at the specified epochs and steps; when training is aborted, the user can restart it, and training will restore from the latest checkpoint copy.
## Related API
[Related Trainer API](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py)
## Attention
1. Make sure the ```checkpoint_dir``` is used by only one training job.
2. The value of ```max_num_checkpoints``` needs to be adjusted according to the disk size and the model size.
3. Saving too frequently slows down training, so too small an ```epoch_interval``` or ```step_interval``` is not suitable.
4. **In distributed training**, each Trainer will save arguments in its own ```checkpoint_dir``` (only Trainer 0 will save model variables). We need a **distributed file system (HDFS, etc.)** to merge all the ```checkpoint_dir``` directories to get the whole data; see the sketch after this list.
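A minimal sketch of that merge step, assuming the per-trainer directories have already been pulled from the distributed file system onto one machine; the ```trainer_*``` paths and the destination are hypothetical, and it assumes file names do not collide across trainers:

```python
import os
import shutil

def merge_checkpoint_dirs(trainer_dirs, merged_dir):
    """Copy every trainer's checkpoint files into one directory.

    trainer_dirs: per-trainer checkpoint_dir paths, e.g.
    ["/data/trainer_0/ckpt", "/data/trainer_1/ckpt"] (hypothetical).
    Trainer 0's model variables and each trainer's own arguments end up
    side by side in merged_dir; assumes names do not collide.
    """
    os.makedirs(merged_dir, exist_ok=True)
    for src in trainer_dirs:
        for root, _, files in os.walk(src):
            rel = os.path.relpath(root, src)
            dst_root = os.path.join(merged_dir, rel)
            os.makedirs(dst_root, exist_ok=True)
            for name in files:
                shutil.copy2(os.path.join(root, name),
                             os.path.join(dst_root, name))
```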