PaddlePaddle / PaddleDetection
Commit 62e582e8
Authored 7 years ago by Helin Wang
polish wording and grammar.
Parent 7c066f6e
Showing 1 changed file with 13 additions and 13 deletions
doc/design/cluster_train/save_model.md
@@ -15,13 +15,13 @@ ways from which user can obtain a model:
 ### Trainer Saving Model vs. Pservers Saving Model
 Both trainers and pservers have access to the model. So the model can
-be saved from a trainer or pservers. We need to decide on where the
-model is saved from.
+be saved from a trainer or pservers. We need to decide where the model
+is saved from.
 #### Dense Update vs. Sparse Update
 There are two types of model update methods: dense update and sparse
-update (when the parameter is configured to be sparse).
+update (when the model parameter is configured to be sparse).
 - Dense update
@@ -48,15 +48,15 @@ filesystem, making the checkpoint shards visible to the merge program.
 The benefit of letting one trainer to save the model is it does not
 require a distributed filesystem. And it's reusing the same save model
-logic when the trainer is training locally - except when doing sparse
-update, the trainer needs to download the entire model during the
-saving process.
+logic when training locally - except when doing sparse update, the
+trainer needs to download the entire model during the saving process.
 #### Conclusion
 Given trainer saving model does not require a distributed filesystem,
-and is an intuitive extension to training locally, we decide to let
-the trainer save the model.
+and is an intuitive extension to trainer saving model when training
+locally, we decide to let the trainer save the model when doing
+distributed training.
 ### Convert Model from Checkpoint
@@ -84,16 +84,16 @@ save the model.
 Each trainer will be given the directory to save the model. The
 elected trainer will save the model to
-`given-directory/trainerID`. Since the tainerID is unique, this would
-prevent concurrent save to the same file when multiple trainers are
-elected to save the model when split-brain problem happens.
+`given-directory/trainerID`. Since the trainer ID is unique, this
+would prevent concurrent save to the same file when multiple trainers
+are elected to save the model when split-brain problem happens.
 ### What Happens When Model Is Saving
 It takes some time to save model, we need to define what will happen
 when save model is taking place.
-When saving a dense model, the trainer uses the local model. Pservers
+When doing dense update, the trainer uses the local model. Pservers
 does not need to pause model update.
 When doing sparse update. The trainer needs to download the entire
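Editor's sketch: the `given-directory/trainerID` layout described in the hunk above can be illustrated with a few lines of Python. This is only a minimal sketch, not the project's implementation. The function names `model_save_path` and `save_model`, the file name `model.bin`, and the byte payload are all hypothetical and chosen for illustration. The point it shows is that two trainers elected at the same time (the split-brain case) write to different paths and so never touch the same file.

```python
import os
import tempfile


def model_save_path(given_directory, trainer_id):
    """Build the per-trainer save location: given-directory/trainerID."""
    return os.path.join(given_directory, str(trainer_id))


def save_model(given_directory, trainer_id, payload, filename="model.bin"):
    """Write `payload` (bytes) under the trainer's own subdirectory.

    `filename` and the raw-bytes payload are assumptions made for this
    sketch; the design doc does not specify a file name or format.
    """
    target_dir = model_save_path(given_directory, trainer_id)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, filename)
    with open(target, "wb") as f:
        f.write(payload)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Simulate a split-brain election: two trainers both believe
        # they were elected and both save the model.
        first = save_model(root, "trainer-0", b"weights-from-trainer-0")
        second = save_model(root, "trainer-1", b"weights-from-trainer-1")
        print(first)
        print(second)
```

Because the trainer ID is part of the path, neither save can overwrite the other. Choosing which of the saved copies to use afterwards is outside the scope of this sketch.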
@@ -103,7 +103,7 @@ download finishes. Otherwise, the trainer gets a model that is
 "polluted": some part of the model is old, some part of the model is
 new.
-It's unclear that the "polluted" model will be inferiod due to the
+It's unclear that the "polluted" model will be inferior due to the
 stochastic nature of deep learning, and pausing the model update will
 add more complexity to the system. Since supporting sparse update is a
 TODO item. We defer the evaluation of pause the model update or not
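Editor's sketch: the "polluted" model mentioned in the hunk above can be reproduced with a toy simulation. Everything here is an assumption for illustration: the `ParameterServer` class, the per-shard `version` counter, and the `between_shards` callback are hypothetical and not part of the project. The sketch downloads the shards one at a time and lets a model update land between two downloads, so the saved snapshot mixes an old shard with a new one.

```python
class ParameterServer:
    """Toy pserver holding one model shard plus an update counter."""

    def __init__(self, shard):
        self.shard = list(shard)
        self.version = 0

    def apply_update(self, delta):
        # Apply one optimization step to every weight in this shard.
        self.shard = [w + delta for w in self.shard]
        self.version += 1

    def download(self):
        # Return the shard together with the version it was read at.
        return self.version, list(self.shard)


def download_model(pservers, between_shards=None):
    """Download every shard in order and return the versions observed.

    `between_shards` is a hypothetical hook that runs after each shard
    except the last; it stands in for updates that arrive while the
    download is still in progress.
    """
    versions = []
    for i, ps in enumerate(pservers):
        version, _weights = ps.download()
        versions.append(version)
        if between_shards is not None and i < len(pservers) - 1:
            between_shards()
    return versions


def is_polluted(versions):
    """A snapshot is polluted if its shards come from different versions."""
    return len(set(versions)) > 1


if __name__ == "__main__":
    pservers = [ParameterServer([0.0, 0.0]), ParameterServer([0.0, 0.0])]

    def training_step():
        for ps in pservers:
            ps.apply_update(0.1)

    # Updates keep running during the download: shards disagree.
    print(is_polluted(download_model(pservers, between_shards=training_step)))
    # Updates paused during the download: shards agree.
    print(is_polluted(download_model(pservers)))
```

Whether such a mixed snapshot actually hurts model quality is the open question the text leaves for later evaluation. The sketch only shows how the mix can arise.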