BaiXuePrincess / Paddle (forked from PaddlePaddle / Paddle)
Commit e67325cd
Authored Mar 14, 2018 by Yang Yang
update readme
Parent: 0621c327
Showing 1 changed file with 32 additions and 10 deletions
doc/design/parallel_executor.md (+32 -10)
@@ -30,23 +30,45 @@ operator run on each GPU, it will automatically sync with different streams when
   // if op's input is params' grad:
   //   sync with allreduce stream
   //   e.g. sgd should wait for allreduce to be finished
-  SyncMultipleStreams(op);
+  CallBack->BeforeOp(op);
   op->Run(*local_scope, place_);
   // if op's output is params' grad:
   //   sync with computation stream
   //   e.g. allreduce should wait for fc_grad to be finished.
-  SyncMultipleStreams(op);
+  CallBack->AfterOp(op);
```
+And the `Callback` object can be implemented as follows:
-## API
+```c++
+struct AllReduceCallBack {
+  void BeforeOp(framework::OperatorBase* op);
+  void AfterOp(framework::OperatorBase* op);
+  std::unordered_set<std::string> reduced_param_grad_names;
+  std::unordered_set<std::string> param_grad_names_;
+  platform::DeviceContext* computation_dev_ctx;    // computation device context
+  platform::DeviceContext* communication_dev_ctx;  // communication device context
-The `ParallelExecutor.run` has an interface similar to `Executor.run`, except:
-1. Scope: we don't expose `scope` in `ParallelExecutor.run` since `ParallelExecutor` has its
-own scope to maintain NCCL.
-1. Feed: we don't expose `feed` in the API either, because the whole point of implementing
-parallel_executor is speed. The input for the NN should be implemented in a reader op.
-1. Fetch: we return the fetched value on all GPUs as a list (e.g. `exe.run(..., fetch=loss)`
-returns `[loss_on_gpu0, loss_on_gpu1]`).
+  framework::Scope* scope;
+  platform::NCCL::Communicator* nccl_com;
+};
+void AllReduceCallBack::BeforeOp(framework::OperatorBase* op) {
+  if (op->Input() in reduced_param_grad_names) {
+    communication_dev_ctx->Wait();
+    reduced_param_grad_names.erase(op->Input());
+  }
+}
+void AllReduceCallBack::AfterOp(framework::OperatorBase* op) {
+  if (op->Output() in param_grad_names_) {
+    computation_dev_ctx->Wait();
+    reduced_param_grad_names.insert(op->Output());
+    ncclAllreduce(scope, op->Output(), communication_dev_ctx);
+  }
+}
```
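The callback in the diff is written as pseudocode (`in` is not a C++ operator, and `op->Input()` stands for a whole set of variable names). The sketch below is one possible way to make the same control flow compilable; it is not the Paddle implementation. `MockDeviceContext`, `MockOp`, `RunOps`, and the `launched` vector are hypothetical stand-ins introduced here for `platform::DeviceContext`, `framework::OperatorBase`, the executor loop, and the `ncclAllreduce` call, and the variable names in the usage note are illustrative only.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for platform::DeviceContext: Wait() only counts calls.
struct MockDeviceContext {
  int wait_calls = 0;
  void Wait() { ++wait_calls; }
};

// Hypothetical stand-in for framework::OperatorBase with explicit variable names.
struct MockOp {
  std::string type;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

struct AllReduceCallBack {
  std::unordered_set<std::string> param_grad_names;          // every parameter gradient
  std::unordered_set<std::string> reduced_param_grad_names;  // all-reduce launched, not yet awaited
  MockDeviceContext* computation_dev_ctx = nullptr;
  MockDeviceContext* communication_dev_ctx = nullptr;
  std::vector<std::string> launched;  // assumption: records names instead of calling NCCL

  // If the op reads a gradient with an all-reduce in flight, wait for the
  // communication stream first (e.g. sgd waits for allreduce).
  void BeforeOp(const MockOp& op) {
    for (const auto& name : op.inputs) {
      if (reduced_param_grad_names.count(name) > 0) {
        communication_dev_ctx->Wait();
        reduced_param_grad_names.erase(name);
      }
    }
  }

  // If the op produced a parameter gradient, wait for the computation stream
  // and then launch the all-reduce (e.g. allreduce waits for fc_grad).
  void AfterOp(const MockOp& op) {
    for (const auto& name : op.outputs) {
      if (param_grad_names.count(name) > 0) {
        computation_dev_ctx->Wait();
        reduced_param_grad_names.insert(name);
        launched.push_back(name);  // placeholder for the ncclAllreduce call
      }
    }
  }
};

// Mirrors the executor loop from the diff: hook, run, hook.
void RunOps(const std::vector<MockOp>& ops, AllReduceCallBack* callback) {
  for (const auto& op : ops) {
    callback->BeforeOp(op);
    // op->Run(*local_scope, place_) would execute here in the real executor.
    callback->AfterOp(op);
  }
}
```

Running `RunOps` over a gradient op that writes `fc_w@GRAD` followed by an `sgd` op that reads it should trigger one wait on each device context and leave `reduced_param_grad_names` empty, which matches the ordering the comments in the diff describe.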