diff --git a/source/user_guides/howto/index.rst b/source/user_guides/howto/index.rst
index ab74722d4b02e48a3612f1f4e34963df20ac35bd..c945054fd1c64709559603be90b490ce5888470e 100644
--- a/source/user_guides/howto/index.rst
+++ b/source/user_guides/howto/index.rst
@@ -3,7 +3,7 @@
 ####################
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 2
 
    prepare_data/index
diff --git a/source/user_guides/howto/training/checkpoint_doc_cn.md b/source/user_guides/howto/training/checkpoint_doc_cn.md
new file mode 100644
index 0000000000000000000000000000000000000000..51e07683f341059722f0d718d3fa4375ad551dfd
--- /dev/null
+++ b/source/user_guides/howto/training/checkpoint_doc_cn.md
@@ -0,0 +1,60 @@
+# Checkpoint功能使用指南
+
+## 背景
+单机/多机在训练过程中可能由于软件/硬件的问题出现异常,导致训练中断,进而导致训练无结果或结果不可用,浪费大量时间和机器性能。
+
+## 目的
+Checkpoint功能能够在训练中途对训练过程的中间数据进行保存,出现异常需要恢复训练时,能够加载中途保存的数据继续训练,从而实现单机/多机的容错训练。
+
+## 说明
+### 目前已实现的参数保存:
+1. 基于Trainer 0 实现训练过程中模型参数的保存
+2. 每个Trainer 实现其自有数据的保存
+3. 基于PServer 实现了```Distribute Lookup Table```相关参数的保存
+
+### Fluid Checkpoint 保存数据目录结构:
+
+```
+checkpoint_dir (用户定义的checkpoint目录)
+├── checkpoint_0 (第一次保存)
+│   ├── __lockup_table__ (Distribute Lookup Table 目录)
+│   │   ├── table_pserver_0 (Pserver 0 号保存的lookup table 数据)
+│   │   └── table_pserver_1
+│   ├── __model__ (model 目录)
+│   │   └── var.w_1
+│   └── trainer_0 (trainer 自有数据保存)
+│       ├── epoch_id
+│       └── step_id
+└── checkpoint_1 (第二次保存)
+```
+
+## 使用方法
+### 声明Fluid.CheckpointConfig
+用户对checkpoint功能的配置,主要是配置```Fluid```中的```CheckpointConfig```对象。
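The effect of the ```epoch_interval``` and ```step_interval``` settings described in the parameter table can be sketched in plain Python. This is a hypothetical illustration of the implied save schedule, not Fluid's actual implementation:

```python
# Hypothetical sketch: a checkpoint is written only at epochs that are a
# multiple of epoch_interval, and within them at steps that are a multiple
# of step_interval. Illustrative only; not Fluid's real code.
def should_save_checkpoint(epoch_id, step_id, epoch_interval=2, step_interval=10):
    return epoch_id % epoch_interval == 0 and step_id % step_interval == 0

# With epoch_interval=2 and step_interval=10, saves trigger at
# (epoch 0, step 0), (epoch 0, step 10), (epoch 2, step 0), and so on.
```

Larger intervals mean fewer saves and less I/O overhead, at the cost of losing more progress when recovering.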
+
+```CheckpointConfig``` 包括4个参数:
+
+| 参数 | 类型 | 说明 |
+| - | :-: | - |
+| checkpoint_dir | str | checkpoint存储目录 |
+| max_num_checkpoints | int | 最大保存的checkpoint副本数 |
+| epoch_interval | int | 每隔epoch_interval轮epoch保存一次 |
+| step_interval | int | 每隔step_interval个step保存一次 |
+
+### 在Fluid.Trainer对象的声明中加入Fluid.CheckpointConfig的声明
+Trainer的__init__方法的参数中包含了```CheckpointConfig```, 需要传入在声明Trainer前声明的```CheckpointConfig```对象。
+如:
+```python
+config = CheckpointConfig(
+    checkpoint_dir = "/tmp/ckpt", max_num_checkpoints = 2,
+    epoch_interval = 2, step_interval = 10)
+trainer = Trainer(..., checkpoint_config=config)
+```
+定义和声明完成后, 训练在运行过程中就会在指定的step和epoch处进行保存,出现异常时,会自动从最新的checkpoint目录进行参数恢复。
+
+## 相关API
+[Trainer API 说明](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py)
+
+## 注意
+1. 保证每个训练任务的```checkpoint_dir```与其他训练任务相互独立。
+2. 最大副本数量```max_num_checkpoints```需要根据磁盘容量以及模型的大小进行调整, 保证磁盘的可用性。
+3. ```epoch_interval``` 和 ```step_interval``` 不宜过小, 频繁地进行checkpoint会拖慢训练速度。
+4. **分布式训练**的过程中:每个Trainer都会在```checkpoint_dir```目录中保存当前Trainer的参数(只有Trainer 0会保存模型的参数),需要**分布式文件系统(HDFS等)**将同一```checkpoint_dir```目录的数据进行合并才能得到完整的数据,恢复训练的时候需要用完整的数据进行恢复。
\ No newline at end of file
diff --git a/source/user_guides/howto/training/checkpoint_doc_en.md b/source/user_guides/howto/training/checkpoint_doc_en.md
new file mode 100644
index 0000000000000000000000000000000000000000..60524f64016e910768d0febed034717534f153ed
--- /dev/null
+++ b/source/user_guides/howto/training/checkpoint_doc_en.md
@@ -0,0 +1,62 @@
+# Checkpoint User Guide
+
+## Background
+In many cases, stand-alone training and distributed training can be aborted by a software or hardware problem. Worse, we waste a great deal of time and machine performance yet get nothing, and have to start all over again.
+
+## Purpose
+The ```Checkpoint``` feature saves intermediate model variables, lookup table variables, and other needed data into the checkpoint directory.
+When an exception occurs, we can immediately load these variables back from the checkpoint directory.
+
+## Introduction
+### Currently completed features:
+1. Trainer 0 saves model variables during training.
+2. Each trainer saves its own needed arguments.
+3. Each parameter server saves ```Distribute Lookup Table``` variables during training.
+
+### Fluid Checkpoint directory structure:
+
+```
+checkpoint_dir (the checkpoint directory the user defines)
+├── checkpoint_0 (the first save directory)
+│   ├── __lockup_table__ (Distribute Lookup Table directory)
+│   │   ├── table_pserver_0 (lookup table data saved by Pserver 0)
+│   │   └── table_pserver_1
+│   ├── __model__ (model directory)
+│   │   └── var.w_1
+│   └── trainer_0 (each trainer saves its own data)
+│       ├── epoch_id
+│       └── step_id
+└── checkpoint_1 (the second save directory)
+```
+
+## Usage
+### Constructing Fluid.CheckpointConfig
+To use the ```Checkpoint``` feature, the main thing the user has to do is declare and construct a ```CheckpointConfig```.
+
+```CheckpointConfig``` has 4 member variables that need to be initialized:
+
+| Member Variable | Type | Comment |
+| - | :-: | - |
+| checkpoint_dir | str | checkpoint directory |
+| max_num_checkpoints | int | maximum number of checkpoint copies |
+| epoch_interval | int | save every epoch_interval epochs |
+| step_interval | int | save every step_interval steps |
+
+### Adding the Fluid.CheckpointConfig declaration to Fluid.Trainer
+Because the initialization of Trainer needs an instance of ```CheckpointConfig```, we should declare the ```CheckpointConfig``` object before constructing the Trainer and pass it in.
+
+For example:
+```python
+config = CheckpointConfig(
+    checkpoint_dir = "/tmp/ckpt", max_num_checkpoints = 2,
+    epoch_interval = 2, step_interval = 10)
+trainer = Trainer(..., checkpoint_config=config)
+```
+
+Once everything is set up, training will save checkpoints at the specified epochs and steps. If training is aborted, the user can simply restart it, and it will restore from the latest checkpoint copy.
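The rotation implied by ```max_num_checkpoints``` can also be sketched in plain Python: keep only the newest copies under ```checkpoint_dir``` and drop the rest. This is an illustrative sketch based on the directory layout shown above, not Fluid's actual pruning code:

```python
# Hypothetical sketch: given the checkpoint_<serial> sub-directories of
# checkpoint_dir, keep the max_num_checkpoints newest serials and report
# which directories should be deleted. Not Fluid's real implementation.
def checkpoints_to_prune(dir_names, max_num_checkpoints):
    serials = sorted(int(name.split("_")[-1]) for name in dir_names)
    stale = serials[:-max_num_checkpoints] if max_num_checkpoints > 0 else serials
    return ["checkpoint_%d" % s for s in stale]

# e.g. with max_num_checkpoints=2:
# checkpoints_to_prune(["checkpoint_0", "checkpoint_1", "checkpoint_2"], 2)
# -> ["checkpoint_0"]
```

Bounding the number of copies this way is what keeps disk usage under control when checkpoints are written frequently.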
+
+## Related API
+[Related Trainer API](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py)
+
+## Attention
+1. Make sure each ```checkpoint_dir``` is used by only one training job.
+2. ```max_num_checkpoints``` needs to be adjusted according to the disk size and the model size, to keep the disk usable.
+3. Too small an ```epoch_interval``` or ```step_interval``` is not suitable, since checkpointing too frequently slows down training.
+4. **In distributed training**, each trainer saves its arguments in ```checkpoint_dir``` (only Trainer 0 saves model variables). A **distributed file system (HDFS, etc.)** is needed to merge all the ```checkpoint_dir``` data into the whole data, and recovery must use the complete data.
\ No newline at end of file
diff --git a/source/user_guides/howto/training/cluster_quick_start.rst b/source/user_guides/howto/training/cluster_quick_start.rst
index 13e1de6c6d23f422da9e3a4b34f4a52552271405..6131c92d6f5386c7e91b2917d25dd7ae830ff182 100644
--- a/source/user_guides/howto/training/cluster_quick_start.rst
+++ b/source/user_guides/howto/training/cluster_quick_start.rst
@@ -127,10 +127,10 @@
       - :code:`PADDLE_TRAINING_ROLE=PSERVER PADDLE_CURRENT_IP=ps1.paddlepaddle.com PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_PSERVER_PORT=6174 python fluid_dist.py`
       - 启动 pserver 节点
     * - trainer0.paddlepaddle.com
-      - :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_CURRENT_IP=ps0.paddlepaddle.com PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=0 PADDLE_PSERVER_PORT=6174 python fluid_dist.py`
+      - :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=0 PADDLE_PSERVER_PORT=6174 python fluid_dist.py`
       - 启动第0号 trainer 节点
     * - trainer1.paddlepaddle.com
-      - :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_CURRENT_IP=ps0.paddlepaddle.com PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=1 PADDLE_PSERVER_PORT=6174 python fluid_dist.py`
+      - :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=1 PADDLE_PSERVER_PORT=6174 python fluid_dist.py`
       - 启动第1号 trainer 节点
 
 **注意**
diff --git a/source/user_guides/howto/training/index.rst b/source/user_guides/howto/training/index.rst
index a56dc986fb82eff9ec6f507d75aa0325150ff288..68475101e26b3f695c8003995cc1c6a95426ff27 100644
--- a/source/user_guides/howto/training/index.rst
+++ b/source/user_guides/howto/training/index.rst
@@ -9,3 +9,4 @@ PaddlePaddle Fluid支持单机训练,和多节点训练。每种训练模式
 
     single_node
     multi_node
+    save_load_variables
diff --git a/source/user_guides/howto/training/save_load_variables.rst b/source/user_guides/howto/training/save_load_variables.rst
index 2be0146a23f04df1f02b1bccffe5d7e29286cb5d..7d60231247357c5e7229f80e5653a1e7edbad8e3 100644
--- a/source/user_guides/howto/training/save_load_variables.rst
+++ b/source/user_guides/howto/training/save_load_variables.rst
@@ -162,3 +162,11 @@
 完全一致,否则会导致变量加载错误或者未加载。另外,与 :code:`fluid.io.save_params` 类似,
 运行 :code:`fluid.default_startup_program()` 也必须在
 :code:`fluid.io.load_checkpoint` 之前进行。
+
+多机checkpoint保存
+##################
+
+.. toctree::
+   :maxdepth: 2
+
+   checkpoint_doc_cn.md
\ No newline at end of file