Commit 7c066f6e authored by Helin Wang

fix according to comments

Parent 5157ba69

...

ways from which the user can obtain a model:
- Save model triggered by user code: user code asks PaddlePaddle to
save a model.
- Convert model from the checkpoint: the model is converted from the
  pservers' periodic checkpoint. In this way, the user can cancel a
  job at any time and still have a relatively fresh model (we
  checkpoint around every 5 minutes).

### Trainer Saving Model vs. Pservers Saving Model
Both trainers and pservers have access to the model, so the model can
be saved from a trainer or from the pservers. We need to decide where
the model should be saved from.

#### Dense Update vs. Sparse Update

There are two types of model update: dense update and sparse update
(used when the parameter is configured to be sparse). The sketch after
the list below contrasts the two.

- Dense update

  Every trainer has its own full copy of the model. Every model
  update updates the entire model.

- Sparse update

  The training input is sparse, and the trainer does not have the
  entire model. It only downloads the sub-model related to the
  input. When updating the model, only the sub-model related to the
  training input is updated.
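
To make the distinction concrete, here is a minimal sketch. The
pserver client and its methods (`get_all_params`, `get_rows`,
`put_rows`) are illustrative assumptions for this document, not the
actual PaddlePaddle API.

```python
import numpy as np

class FakePserverClient:
    """Stand-in for a pserver; methods are illustrative assumptions."""

    def __init__(self, num_rows, width):
        self.table = np.zeros((num_rows, width))

    def get_all_params(self):
        return self.table.copy()       # the entire model

    def get_rows(self, ids):
        return self.table[ids].copy()  # only the needed sub-model

    def put_rows(self, ids, rows):
        self.table[ids] = rows

def dense_update(client, grads, lr=0.01):
    # Dense update: the trainer holds a full copy of the model, and
    # every update touches the entire model.
    params = client.get_all_params()
    return params - lr * grads

def sparse_update(client, input_ids, grads, lr=0.01):
    # Sparse update: download and update only the rows related to the
    # sparse training input.
    rows = client.get_rows(input_ids)
    client.put_rows(input_ids, rows - lr * grads)
```
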
#### Pservers Saving Model

The benefit of letting pservers save the model is that they have the
entire model at all times. However, since pservers are on different
nodes, it requires a merging process to merge the model shards into a
single model. This requires the pservers to write the model to a
distributed filesystem, making the checkpoint shards visible to the
merge program.
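
As an illustration, a sketch of such a merge program follows. It
assumes each pserver writes its shard as a pickled dict of disjoint
parameter blocks to the distributed filesystem; the file layout and
format are assumptions, not the actual implementation.

```python
import pickle

def merge_shards(shard_paths, out_path):
    # Merge per-pserver checkpoint shards into one model file.
    model = {}
    for path in shard_paths:
        with open(path, "rb") as f:
            shard = pickle.load(f)  # {param_name: parameter block}
        model.update(shard)         # shards hold disjoint parameters
    with open(out_path, "wb") as f:
        pickle.dump(model, f)

# e.g. merge_shards(["/dfs/ckpt/pserver-0", "/dfs/ckpt/pserver-1"],
#                   "/dfs/model/merged")
```
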
#### Trainer Saving Model

The benefit of letting one trainer save the model is that it does not
require a distributed filesystem. It also reuses the same model-saving
logic used when the trainer trains locally - except when doing sparse
update, the trainer needs to download the entire model during the
saving process. A sketch of this save path follows.
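
A minimal sketch, assuming hypothetical trainer and pserver-client
methods (`use_sparse_update`, `get_all_params`, and `save_locally` are
illustrative names, not the real API):

```python
def save_model(trainer, path):
    # With sparse update the trainer only holds a fraction of the
    # model, so fetch the full model from the pservers first.
    if trainer.use_sparse_update:
        trainer.params = trainer.pserver_client.get_all_params()
    # From here on, reuse the exact same logic as local training.
    trainer.save_locally(path)
```
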
#### Conclusion

... and is an intuitive extension to training locally, we decide to let
the trainer save the model.
### Convert Model from Checkpoint
TODO
...

when save model is taking place.

When saving a dense model, the trainer uses its local model. Pservers
do not need to pause model updates.
When doing sparse update, the trainer needs to download the entire
model while saving. To get the most accurate model, the model update
needs to be paused before the download starts and resumed after the
download finishes. Otherwise, the trainer gets a model that is
"polluted": some part of the model is old, and some part of the model
is new. One possible pause/resume protocol is sketched below.
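
If pausing were implemented, the protocol could look like this sketch;
`pause_update` and `resume_update` are hypothetical pserver RPCs, not
an existing PaddlePaddle interface.

```python
def save_sparse_model(trainer, path):
    client = trainer.pserver_client
    client.pause_update()                 # stop applying gradients
    try:
        # Downloaded under pause, so the model is consistent rather
        # than "polluted".
        params = client.get_all_params()
    finally:
        client.resume_update()            # training resumes either way
    trainer.save_locally(path, params)
```
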
It is unclear whether the "polluted" model would be inferior, given
the stochastic nature of deep learning, and pausing the model update
would add more complexity to the system. Since supporting sparse
update is a TODO item, we defer the evaluation of whether to pause the
model update during model saving to the future.