PaddlePaddle / Paddle
Commit 7c066f6e
Authored on June 30, 2017 by Helin Wang
fix according to comments
Parent: 5157ba69
Showing 1 changed file with 31 additions and 21 deletions
doc/design/cluster_train/save_model.md
...
...
@@ -7,24 +7,34 @@ ways from which user can obtain a model:
 - Save model triggered by user code: user code asks PaddlePaddle to
   save a model.
-- Convert model from the snapshot: model being converted from
-  pservers' periodic snapshot. In this way, the user can cancel a job
-  at any time, and still have a relatively fresh model (we snapshot
-  around every 5 minutes).
+- Convert model from the checkpoint: model being converted from
+  pservers' periodic checkpoint. In this way, the user can cancel a
+  job at any time, and still have a relatively fresh model (we
+  checkpoint around every 5 minutes).

-### Save Model Triggered by User Code
+### Trainer Saving Model vs. Pservers Saving Model

 Both trainers and pservers have access to the model. So the model can
 be saved from a trainer or pservers. We need to decide on where the
 model is saved from.

-#### Dense Model vs. Sparse Model
+#### Dense Update vs. Sparse Update

+There are two types of model update methods: dense update and sparse
+update (when the parameter is configured to be sparse).
+
+- Dense update
+
+  Every trainer has it's own full copy of the model. Every model
+  update will update the entire model.
+
+- Sparse update
+
+  The training input is sparse, and the trainer does not have the
+  entire model. It will only download the sub-model necessary related
+  to the input. When updating the model, only the sub-model related to
+  the training input is updated.
+
-There are two types of model: dense and sparse model (when the
-parameter is configured to be sparse). Pservers always jointly have
-the entire model at any given time. Trainers only have the entire
-dense model, but only have a fraction of the sparse model at any given
-time.

 #### Pservers Saving Model
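
Note on the hunk above: the dense/sparse distinction it introduces can be sketched with a toy in-memory parameter server. This is only an illustration under assumed names (`ToyPserver`, `pull`, `push`, `dense_step`, and `sparse_step` are hypothetical, not PaddlePaddle APIs), and the gradients are placeholders. A dense trainer pulls and updates every row, while a sparse trainer pulls and updates only the rows its input touches.

```python
# Toy illustration only: ToyPserver, pull and push are hypothetical names,
# not part of PaddlePaddle.

class ToyPserver:
    """Holds the full model as a mapping from row id to a list of weights."""

    def __init__(self, rows):
        self.rows = {i: list(w) for i, w in rows.items()}

    def pull(self, ids=None):
        """Return copies of the requested rows (all rows when ids is None)."""
        ids = list(self.rows) if ids is None else ids
        return {i: list(self.rows[i]) for i in ids}

    def push(self, grads, lr):
        """Apply a gradient step to exactly the rows present in grads."""
        for i, g in grads.items():
            self.rows[i] = [w - lr * x for w, x in zip(self.rows[i], g)]


def dense_step(ps, lr=0.5):
    """Dense update: the trainer holds and updates the entire model."""
    local = ps.pull()
    grads = {i: [1.0] * len(w) for i, w in local.items()}  # placeholder gradients
    ps.push(grads, lr)
    return local


def sparse_step(ps, input_ids, lr=0.5):
    """Sparse update: only the sub-model related to the input is pulled and updated."""
    local = ps.pull(input_ids)
    grads = {i: [1.0] * len(w) for i, w in local.items()}  # placeholder gradients
    ps.push(grads, lr)
    return local
```

In this sketch a trainer doing sparse updates never holds more than the pulled rows, which is why saving from such a trainer would need an extra download of the full model.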
...
...
@@ -32,15 +42,15 @@ The benefit of letting pservers save model is they have the entire
 model all the time. However, since pservers are on different nodes, it
 requires a merging process to merge model shards into the same
 model. Thus requires the pservers to write models to a distributed
-filesystem, making the snapshot shards visible to the merge program.
+filesystem, making the checkpoint shards visible to the merge program.

 #### Trainer Saving Model

 The benefit of letting one trainer to save the model is it does not
 require a distributed filesystem. And it's reusing the same save model
-logic when the trainer is training locally - except when training
-sparse model, the trainer needs to download the entire sparse model
-during the saving process.
+logic when the trainer is training locally - except when doing sparse
+update, the trainer needs to download the entire model during the
+saving process.

 #### Conclusion
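
Note on the hunk above: the merge step that pserver-side saving would require might look roughly like this. The sketch assumes each pserver's shard is a mapping from parameter name to values and that no parameter is split across shards. Both are assumptions made for illustration, and `merge_shards` is a hypothetical helper, not an existing PaddlePaddle function.

```python
# Hypothetical helper for illustration; the shard layout is an assumption.

def merge_shards(shards):
    """Merge per-pserver shards (name -> values) into one model mapping."""
    model = {}
    for shard in shards:
        for name, values in shard.items():
            if name in model:
                # Assumed invariant: each parameter lives on exactly one pserver.
                raise ValueError(f"parameter {name!r} appears in more than one shard")
            model[name] = list(values)
    return model
```

A merge program like this can only run once all shards are readable from one place, which is where the distributed filesystem requirement comes from.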
...
...
@@ -49,7 +59,7 @@ and is an intuitive extension to training locally, we decide to let
 the trainer save the model.

-### Convert Model from Snapshot
+### Convert Model from Checkpoint

 TODO
...
...
@@ -86,15 +96,15 @@ when save model is taking place.
 When saving a dense model, the trainer uses the local model. Pservers
 does not need to pause model update.

-When saving a sparse model. The trainer needs to download the entire
-sparse model while saving. To get the most accurate model, the model
-update needs to be paused before the download starts and resumed after
-the download finishes. Otherwise, the trainer gets a model that is
+When doing sparse update. The trainer needs to download the entire
+model while saving. To get the most accurate model, the model update
+needs to be paused before the download starts and resumed after the
+download finishes. Otherwise, the trainer gets a model that is
 "polluted": some part of the model is old, some part of the model is
 new.

 It's unclear that the "polluted" model will be inferiod due to the
 stochastic nature of deep learning, and pausing the model update will
-add more complexity to the system. Since supporting sparse model is a
+add more complexity to the system. Since supporting sparse update is a
 TODO item. We defer the evaluation of pause the model update or not
 during saving model to the future.
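
Note on the hunk above: the "polluted" download can be shown with a small deterministic simulation. Everything here is hypothetical (`VersionedPserver` and `download` are illustration names, not PaddlePaddle APIs). It assumes one model update arrives after each row is downloaded, and that updates arriving during a pause are simply ignored, which a real system would more likely defer.

```python
# Toy simulation only: all names are hypothetical, not PaddlePaddle APIs.

class VersionedPserver:
    """Tracks, per row, how many updates have been applied to that row."""

    def __init__(self, n_rows):
        self.versions = [0] * n_rows
        self.paused = False

    def apply_update(self):
        """An update bumps every row's version; it is ignored while paused."""
        if not self.paused:
            self.versions = [v + 1 for v in self.versions]


def download(ps, pause):
    """Download row by row; assume one update arrives after each row."""
    ps.paused = pause
    try:
        snapshot = []
        for i in range(len(ps.versions)):
            snapshot.append(ps.versions[i])
            ps.apply_update()  # concurrent update from another trainer
        return snapshot
    finally:
        ps.paused = False  # always resume model updates
```

Without the pause, the downloaded rows come from different versions of the model. With it, every row comes from the same version, at the cost of stalling updates for the duration of the download.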