Forked from PaddlePaddle / Paddle (in sync with the fork source)
Commit d66d8446
Authored May 15, 2018 by Yancey; committed by GitHub on May 15, 2018
Refine async update design doc (#10065)
* refine async update design doc
* update by comments
Parent: ded21532
Showing 1 changed file with 18 additions and 15 deletions
doc/fluid/design/dist_train/async_update.md
...
...
@@ -4,34 +4,37 @@
For the typical synchronous distributed training, some significant steps are as follows:
-1. A Trainer will compute the gradients and SEND them to the Parameter Server(PServer) nodes.
-1. After the PServer node received gradients came from all the Trainers, It will aggregate the
+1. A trainer process will compute the gradients and **send** them to the parameter server (PS) nodes.
+1. After the PS node has received the gradients from all the trainers, it will aggregate the
 gradient variables for the same parameter into one gradient variable and then apply the aggregated
 gradient to the respective parameter, finally using an optimization algorithm (SGD, Momentum, ...)
 to update the parameters.
-1. The Trainer would wait for the PServers finished the optimize stage, and GET the parameters from PServer,
+1. The trainer waits until the PS has finished the optimize stage, and then GETs the parameters from the PS,
 so all the trainers get the same parameters.
-In the synchronously distributed training, there should be a `Barrier` to synchronise the
-parameters after the optimizing stage. The performance of a distributed training job would
-depend on the slowest node if there were hundreds or thousands of training nodes in a
-Job, the performance of synchronously distributed training might be very poor because of
-the slow node. So this design doc would introduce an approach to implement
-*asynchronously* distributed training in PaddlePaddle Fluid.
+In Synchronous Distributed Training, there is a **barrier** on each PS that waits until all trainer processes
+have completed running the current mini-batch. After that, all trainers can continue to run the next
+mini-batch. So the overall performance of Synchronous Distributed Training depends
+on the slowest node.
+
+In Asynchronous Distributed Training, we don't need to wait for a global mini-batch; the optimizer on
+the PS runs immediately when a gradient is uploaded to the PS from one trainer. This mode
+can scale better and achieve higher throughput. In this design doc, we introduce how to
+implement Asynchronous Distributed Training based on PaddlePaddle Fluid.
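The difference between the two modes described above can be illustrated with a minimal sketch. It is not PaddlePaddle code: `ToyParameterServer`, `push_sync`, and `push_async` are hypothetical names, the update rule is plain SGD, and aggregation is assumed to be an element-wise sum.

```python
class ToyParameterServer:
    """Hypothetical in-process stand-in for a PS that owns one parameter."""

    def __init__(self, param, lr, num_trainers):
        self.param = list(param)          # current parameter values
        self.lr = lr                      # learning rate for plain SGD
        self.num_trainers = num_trainers  # trainers expected at the barrier
        self._pending = []                # gradients buffered in sync mode

    def _apply(self, grad):
        # Plain SGD step: param <- param - lr * grad
        self.param = [p - self.lr * g for p, g in zip(self.param, grad)]

    def push_sync(self, grad):
        """Synchronous mode: buffer until every trainer has reported (the barrier)."""
        self._pending.append(list(grad))
        if len(self._pending) < self.num_trainers:
            return False                  # still waiting for slower trainers
        # Assumption: aggregation is an element-wise sum over trainers.
        aggregated = [sum(column) for column in zip(*self._pending)]
        self._pending.clear()
        self._apply(aggregated)
        return True

    def push_async(self, grad):
        """Asynchronous mode: run the optimizer as soon as one gradient arrives."""
        self._apply(grad)
        return True


sync_ps = ToyParameterServer([1.0, 2.0], lr=0.5, num_trainers=2)
sync_ps.push_sync([1.0, 0.0])   # no update yet, one trainer is still missing
sync_ps.push_sync([1.0, 2.0])   # barrier reached, param becomes [0.0, 1.0]

async_ps = ToyParameterServer([1.0, 2.0], lr=0.5, num_trainers=2)
async_ps.push_async([1.0, 0.0])  # applied immediately, param becomes [0.5, 2.0]
```

In the synchronous path the fastest trainer's gradient sits in the buffer until the slowest trainer arrives, which is the dependency on the slowest node that the text describes.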
## Design
<img src="./src/async_update.png" width="600"/>
-As the figure above, we describe a global view of asynchronously update process and use
+As shown in the figure above, we describe a global view of the asynchronous update process and use
 the parameter `w1` as an example to introduce the steps:
 1. Each gradient variable may be distributed across different GPU cards; aggregate
 them once they are all calculated.
-1. Split the gradient variable into multiple blocks according to the number of PServer
+1. Split the gradient variable into multiple blocks according to the number of PS
 instances and then send them.
-1. PServer would run an `Optimize Block` using a specified optimize algorithm to update
+1. PS would run an `Optimize Block` using a specified optimize algorithm to update
 the specified parameter.
-1. The trainer will fetch latest parameter from PServer before running forward Op which depends
+1. The trainer will fetch the latest parameter from PS before running forward Op which depends
 on the specified parameter.
 1. Broadcast the received variable into multiple GPU cards and continue to run the next
 mini-batch.
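The five steps for `w1` can be walked through with a small sketch. This is an illustration under stated assumptions, not the Fluid implementation: all function names are hypothetical, aggregation across GPU cards is assumed to be an element-wise sum, blocks are assumed to be contiguous and near-equal in size, and the `Optimize Block` is replaced by plain SGD.

```python
def aggregate_device_grads(device_grads):
    """Step 1: combine per-GPU gradients (assumed: element-wise sum)."""
    return [sum(column) for column in zip(*device_grads)]


def split_into_blocks(values, num_ps):
    """Step 2: split a flat variable into num_ps contiguous, near-equal blocks."""
    base, extra = divmod(len(values), num_ps)
    blocks, start = [], 0
    for i in range(num_ps):
        size = base + (1 if i < extra else 0)
        blocks.append(values[start:start + size])
        start += size
    return blocks


def optimize_block(param_block, grad_block, lr):
    """Step 3: stand-in for the PS-side Optimize Block (assumed: plain SGD)."""
    return [p - lr * g for p, g in zip(param_block, grad_block)]


def async_update_step(param_shards, device_grads, lr, num_devices):
    grad = aggregate_device_grads(device_grads)                # step 1
    grad_blocks = split_into_blocks(grad, len(param_shards))   # step 2
    for i, grad_block in enumerate(grad_blocks):               # step 3, one PS per block
        param_shards[i] = optimize_block(param_shards[i], grad_block, lr)
    fetched = [v for shard in param_shards for v in shard]     # step 4: fetch latest w1
    return [list(fetched) for _ in range(num_devices)]         # step 5: broadcast to GPUs


w1 = [1.0, 2.0, 3.0, 4.0, 5.0]
shards = split_into_blocks(w1, 2)           # [[1.0, 2.0, 3.0], [4.0, 5.0]]
grads_per_gpu = [[1.0] * 5, [1.0] * 5]      # two GPU cards
copies = async_update_step(shards, grads_per_gpu, lr=0.5, num_devices=2)
# Each GPU copy now holds [0.0, 1.0, 2.0, 3.0, 4.0]
```

Because each block is updated by its own PS instance, no instance has to hold or optimize the full parameter.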
...
...
@@ -40,8 +43,8 @@ mini-batch.
 - For multi-device distributed training, we first need to aggregate the gradient
 variables placed on different devices and then schedule a `SendVars` operator to
-send the gradient variables to the multiple PServer instances.
-- Schedule `FetchVars` operator to fetch the latest parameter from PServer before running
+send the gradient variables to the multiple PS instances.
+- Schedule `FetchVars` operator to fetch the latest parameter from PS before running
 the forward ops.
 - There could be a large number of gradient variables to be sent, so we need to use another
 thread pool (IO Threadpool) whose number of schedulable threads is larger than the
...
...
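The idea of a dedicated IO thread pool for sending many gradient variables can be sketched with Python's standard `concurrent.futures`. Everything here is an assumption for illustration: `send_var` is a hypothetical stand-in for a `SendVars` RPC that only records what it would send, the round-robin assignment of blocks to endpoints is arbitrary, and the pool size of 8 is a placeholder, since the excerpt is cut off before it states how the size is chosen.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def send_var(name, block, endpoint, outbox, lock):
    """Hypothetical stand-in for a SendVars RPC: records what would be sent."""
    with lock:
        outbox.append((endpoint, name, len(block)))
    return endpoint


def send_all(grad_blocks, endpoints, io_threads=8):
    """Send every (name, block) pair on a separate IO pool and wait for completion."""
    outbox, lock = [], threading.Lock()
    with ThreadPoolExecutor(max_workers=io_threads) as io_pool:
        futures = [
            # Assumption: blocks are assigned to PS endpoints round-robin.
            io_pool.submit(send_var, name, block,
                           endpoints[i % len(endpoints)], outbox, lock)
            for i, (name, block) in enumerate(grad_blocks)
        ]
        for future in futures:
            future.result()  # re-raises any exception from a failed send
    return outbox


sent = send_all(
    [("w1@GRAD.block0", [0.1, 0.2]), ("w1@GRAD.block1", [0.3])],
    endpoints=["ps0", "ps1"],
)
```

Keeping the sends on their own pool means a slow network transfer does not occupy a thread that could otherwise be running compute operators.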