Document format and content improvements

0b4ab9c5 · tangwei12 · 238348ef · 0b4ab9c5 · 0b4ab9c5
2 changed files
--- a/source/user_guides/howto/training/checkpoint_doc_cn.md
+++ b/source/user_guides/howto/training/checkpoint_doc_cn.md
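 # Checkpoint User Guide
 ## Background
 During single-machine or multi-machine training, software or hardware problems can cause failures that interrupt training, leaving it with no results or unusable results and wasting a great deal of time and machine resources.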
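@@ -8,8 +9,10 @@ The Checkpoint feature can save intermediate training data partway through training,
 ## Description
 ### Parameter saving implemented so far:
 1. Trainer 0 saves parameters during training.
-2. PServer saves the parameters related to the Distribute Lookup Table.
+2. PServer saves the parameters related to the ```Distribute Lookup Table```.
 ### Fluid Checkpoint directory structure for saved data: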
+```
 checkpoint_dir (user-defined checkpoint directory)
 ├── checkpoint_0 (first save)
 │   ├── __lockup_table__ (Distribute Lookup Table directory)
@@ -21,20 +24,23 @@ checkpoint_dir (user-defined checkpoint directory)
 │       ├── epoch_id
 │       └── step_id
 └── checkpoint_1 (second save)
+```
 ## Usage
 ### Declare Fluid.CheckpointConfig
-Users configure the checkpoint feature mainly through the Fluid.CheckpointConfig object.
+Users configure the checkpoint feature mainly through the ```CheckpointConfig``` object in ```Fluid```.
-CheckpointConfig has 4 parameters:
-```table
+```CheckpointConfig``` has 4 parameters:
-Parameter   | Type |  Description
-checkpoint_dir   |  int | checkpoint storage directory
+| Parameter | Type | Description |
-max_num_checkpoints  | int | maximum number of checkpoint copies to keep
+| - | :-: | - | 
-epoch_interval  | int |    save a checkpoint every epoch_interval epochs
+| checkpoint_dir | str | checkpoint storage directory |
-step_interval   | int |   save a checkpoint every step_interval steps
+| max_num_checkpoints | int | maximum number of checkpoint copies to keep |
-```
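+| epoch_interval | int | save a checkpoint every epoch_interval epochs |
+| step_interval | int | save a checkpoint every step_interval steps |
 ### Pass the Fluid.CheckpointConfig declaration when declaring the Fluid.Trainer object
-The parameters of Trainer's __init__ method include a CheckpointConfig, so a CheckpointConfig object declared before the Trainer must be passed in.
+The parameters of Trainer's ```__init__``` method include a ```CheckpointConfig```, so a ```CheckpointConfig``` object declared before the Trainer must be passed in.
 For example: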
 ```python
 config = CheckpointConfig(
@@ -45,12 +51,10 @@ trainer = Trainer(..., checkpoint_config=config)
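 Once the definition and declaration are complete, training saves a checkpoint at the specified steps and epochs, and when a failure occurs the parameters are automatically restored from the latest checkpoint directory.
 ## Related API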
-https://github.com/PaddlePaddle/Paddle/blob/3ff9ba0e6ba1eec282b6e89fb7bea2e2046f01c5/python/paddle/fluid/trainer.py#L97
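+Trainer API reference: <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py>
 ## Notes
 1. Make sure each training job's ```checkpoint_dir``` is independent of other training jobs.
-2. The maximum number of copies, max_num_checkpoints, should be adjusted according to disk capacity and model size so that the disk stays usable.
+2. The maximum number of copies, ```max_num_checkpoints```, should be adjusted according to disk capacity and model size so that the disk stays usable.
-3. epoch_interval and step_interval should not be too small; checkpointing too frequently slows down training.
+3. ```epoch_interval``` and ```step_interval``` should not be too small; checkpointing too frequently slows down training.
-4. **During distributed training**: every Trainer saves its own parameters in the checkpoint_dir directory (only Trainer 0 saves the model parameters). A **distributed file system (HDFS, etc.)** is needed to merge the data from the same checkpoint_dir directory into a complete data set, and the complete data must be used when resuming training.
+4. **During distributed training**: every Trainer saves its own parameters in the ```checkpoint_dir``` directory (only Trainer 0 saves the model parameters). A **distributed file system (HDFS, etc.)** is needed to merge the data from the same ```checkpoint_dir``` directory into a complete data set, and the complete data must be used when resuming training.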
\ No newline at end of file
-## Future plans
-1. Support saving parameters through etcd.
\ No newline at end of file
--- a/source/user_guides/howto/training/checkpoint_doc_en.md
+++ b/source/user_guides/howto/training/checkpoint_doc_en.md
 # Checkpoint User Guide
 ## Background
 In many cases, stand-alone and distributed training can be aborted by software or hardware problems. Worse, we waste a lot of time and machine resources but get nothing, which is frustrating, and we have to start the training over.
 ## Purpose
-The feature of Checkpoint can save Intermediate model variables, lookup table variable and other needs datas in checkpoint directory. When the exception occurs, we can load this variables from the checkpoint directory immediately.
+The ```Checkpoint``` feature can save intermediate model variables, lookup table variables and other required data in the checkpoint directory. When an exception occurs, we can load these variables from the checkpoint directory immediately.
 ## Introduction
 ### Currently Implemented Features:
 1. Trainer 0 saves model variables during training.
 2. Each Trainer saves the arguments it needs.
-3. Each of the Parameter Sever will save Distribute Lookup Table variables in training.
+3. Each Parameter Server saves ```Distribute Lookup Table``` variables during training.
 ### Fluid Checkpoint directory structure:
+```
 checkpoint_dir (the user-defined checkpoint directory)
 ├── checkpoint_0 (the first save directory)
 │   ├── __lockup_table__ (Distribute Lookup Table directory)
@@ -21,21 +24,23 @@ checkpoint_dir (the user-defined checkpoint directory)
 │       ├── epoch_id
 │       └── step_id
 └── checkpoint_1 (the second save directory)
+```
 ## Usage
 ### Constructing Fluid.CheckpointConfig
-When user want to use Checkpoint feature, the main thing user have to do is declare Fluid.CheckpointConfig and construct it.
+To use the ```Checkpoint``` feature, the main thing the user has to do is declare and construct a ```CheckpointConfig```.
-CheckpointConfig has 4 member variables need to be initialized：
+```CheckpointConfig``` has 4 member variables that need to be initialized:
-```table
-Member Variable   | Type |  Comment
+| Member Variable | Type | Comment | 
-checkpoint_dir   |  int | checkpoint directory
+| - | :-: | - | 
-max_num_checkpoints  | int | Maximum number of checkpoint copies
+| checkpoint_dir | str | checkpoint directory |
-epoch_interval  | int |    epoch interval times
+| max_num_checkpoints | int | Maximum number of checkpoint copies | 
-step_interval   | int |   step interval times
+| epoch_interval | int | save a checkpoint every epoch_interval epochs |
-```
+| step_interval | int | save a checkpoint every step_interval steps |
 ### Add Fluid.CheckpointConfig's declaration in Fluid.Trainer
-Because the initialization of Trianer needs an instance of CheckpointConfig., we should decare Fluid.CheckpointConfig first.
+Because initializing the Trainer requires an instance of ```CheckpointConfig```, we should declare the ```CheckpointConfig``` in ```Fluid``` first.
 For example:
 ```python
@@ -48,12 +53,10 @@ trainer = Trainer(..., checkpoint_config=config)
 Once everything is set up, training saves a checkpoint at the specified epochs and steps. If training is aborted, the user can restart it, and it will restore from the latest copy.
 ## Related API
-https://github.com/PaddlePaddle/Paddle/blob/3ff9ba0e6ba1eec282b6e89fb7bea2e2046f01c5/python/paddle/fluid/trainer.py#L97
+Related Trainer API: <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py>
 ## Attention
 1. Make sure each ```checkpoint_dir``` is used by only one training job.
-2. The number of max_num_checkpoints need to be adjust by the disk size and model size.
+2. ```max_num_checkpoints``` needs to be adjusted according to the disk size and the model size.
-3. Too frequently to slow down the trian speed, so too small epoch_interval  and step_interval are not suitable.
+3. Checkpointing too frequently slows down training, so ```epoch_interval``` and ```step_interval``` should not be too small.
 4. **In distributed training**, each Trainer saves its arguments in its ```checkpoint_dir``` (only Trainer 0 saves model variables). A **distributed file system (HDFS, etc.)** is needed to merge the contents of all the ```checkpoint_dir``` directories to get the complete data.
\ No newline at end of file
-## Next Plan
-1. Save and restore checkpoint by etcd.
\ No newline at end of file
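
Reviewer sketch (not part of the diff): the interval and retention behaviour described in the parameter table and notes can be illustrated with a small stand-alone example. It does not import Paddle and is not the Trainer's implementation. `CheckpointSettings`, `should_save`, `checkpoint_name`, and `prune` are hypothetical names, the default values are placeholders, and the rule that both intervals must match before a save happens is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckpointSettings:
    """Hypothetical stand-in that mirrors the four documented parameters."""
    checkpoint_dir: str            # directory path, so a string rather than an int
    max_num_checkpoints: int = 3   # placeholder default, not taken from Paddle
    epoch_interval: int = 1        # placeholder default
    step_interval: int = 10        # placeholder default


def should_save(cfg: CheckpointSettings, epoch_id: int, step_id: int) -> bool:
    """Assumed rule: save only when both the epoch and the step hit their interval."""
    return epoch_id % cfg.epoch_interval == 0 and step_id % cfg.step_interval == 0


def checkpoint_name(serial: int) -> str:
    """Directory name pattern shown in the documented tree (checkpoint_0, checkpoint_1)."""
    return f"checkpoint_{serial}"


def prune(serials: List[int], max_num_checkpoints: int) -> List[int]:
    """Keep only the newest max_num_checkpoints serial numbers."""
    return sorted(serials)[-max_num_checkpoints:]
```

With `step_interval=10`, a save would be triggered at steps 0, 10, 20 and so on. A smaller interval means more frequent disk writes, which is the trade-off the notes warn about, and `max_num_checkpoints` bounds how many copies stay on disk.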